Using Class Frequency for Improving Centroid-based Text Classification



Most previous work on text classification represents the
importance of terms by term occurrence frequency (tf) and
inverse document frequency (idf). This paper presents ways to
apply class frequency in centroid-based text categorization.
Three approaches are taken into account. The first explores the
effectiveness of inverse class frequency in the popular TFIDF
term weighting, both as a replacement for idf and as an addition
to TFIDF. The second evaluates some functions that adjust the
power of inverse class frequency. The third applies terms that
are found in only one class or in a few classes to improve
classification performance, using two-step classification. The
results show that class frequency is useful for text
classification, especially in the two-step classification.


ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012
© 2012 ACEEE. DOI: 01.IJIT.02.02.57

Verayuth Lertnattee and Chanisara Leuviphan
Department of Health-related Informatics, Faculty of Pharmacy, Silpakorn University, Maung, Nakorn Pathom, 73000 Thailand

Index Terms: text classification, term weighting, class frequency, linear classifier

I. INTRODUCTION

With the increasing availability of online text information, there is a pressing need to find and organize relevant information in text documents. Automated text categorization (also known as text classification) has become a significant tool for organizing text documents efficiently. A variety of classification methods have been developed and used in different schemes, such as probabilistic models [1], neural networks [2], example-based models (e.g., k-nearest neighbor) [3], linear models [4], [5], and support vector machines [6]. Among these methods, a linear model called the centroid-based method is attractive because it requires relatively little computation in both the learning and the classification stages. Despite its low computation time, the centroid-based method has been shown in several studies, including [4], [5], to achieve relatively high accuracy. In this method, an individual class is modeled by weighting the terms that appear in the training documents assigned to the class. Classification performance therefore depends strongly on the term weighting applied in the model. Most previous work on the centroid-based method focused on weighting factors related to frequency patterns of terms or documents in the class. The two most popular factors are term frequency (tf) and inverse document frequency (idf). Some previous works, such as [7], [8], applied another factor called inverse class frequency (icf). However, the impact of this factor needs more investigation.

This paper investigates the usefulness of inverse class frequency in a more systematic way. The traditional term frequency and inverse document frequency use information within classes and in the whole collection of training data. Besides these two sources, information among classes, so-called inter-class information, is expected to be useful. Inverse class frequency can exploit this source of information. The effectiveness of applying class frequency in term weighting and in the classification process is investigated. Three approaches are taken into account. The first explores the effectiveness of inverse class frequency in the popular term weighting $tf \times idf$, both as a replacement for $idf$ and as an addition to $tf \times idf$. The second evaluates some functions that adjust the power of inverse class frequency. The third applies terms that are found in only one class or in a few classes to improve classification performance, using two steps of classification.

In the rest of this paper, Section II presents centroid-based text categorization. Term weighting in centroid-based text classification by frequency-based patterns is given in Section III. Section IV describes three approaches for applying class frequency in text classification. The data sets and experimental settings are described in Section V. Section VI gives a number of experimental results. A conclusion is made in Section VII.

II. CENTROID-BASED TEXT CLASSIFICATION

In centroid-based text categorization, a document (or a class) is represented by a vector using a vector space model with a bag of words (BOW) [9]. The simplest and most popular term weight for representing a document combines term frequency (tf) and inverse document frequency (idf) in the form $tf \times idf$. In a vector space model, given a set of documents $D = \{d_1, d_2, \ldots, d_{|D|}\}$, a document $d_j$ is represented by a document vector

$$\vec{d}_j = \{w_{1j}, w_{2j}, \ldots, w_{|T|j}\} = \{tf_{1j}\, idf_1,\ tf_{2j}\, idf_2,\ \ldots,\ tf_{|T|j}\, idf_{|T|}\},$$

where $w_{ij}$ is the weight assigned to a term $t_i$ in the set of terms $T$ of the document. In this definition, $tf_{ij}$ is the term frequency of the term $t_i$ in the document $d_j$, and $idf_i$ is the inverse document frequency, defined as $\log(|D| / df_i)$. Here, $|D|$ is the total number of documents in the collection.
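The document vector defined in Section II can be illustrated with a minimal Python sketch. It assumes tokenized documents, raw counts for $tf$, and the natural logarithm for $idf = \log(|D| / df)$ (the base is not fixed at this point in the text); the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (idf, vectors) for a list of tokenized documents.

    Each vector is a sparse dict term -> tf * idf, where tf is the raw
    count in the document and idf = log(|D| / df). The natural log is
    an assumption of this sketch.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))          # count each term once per document
    idf = {t: math.log(n_docs / df[t]) for t in df}
    vectors = [{t: n * idf[t] for t, n in Counter(tokens).items()}
               for tokens in docs]
    return idf, vectors

# Toy collection (invented for illustration only).
docs = [["course", "exam", "exam"], ["course", "faculty"], ["course", "project"]]
idf, vectors = tfidf_vectors(docs)
```

In this toy collection the term "course" occurs in every document, so its idf and all of its weights are zero, while "exam" keeps a positive weight in the first document.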
The document frequency $df_i$ is the number of documents in the collection that contain the term $t_i$. Besides term weighting, normalization is another important factor in representing a document or a class. The class prototype $\vec{c}_k$ is obtained by summing all document vectors in $C_k$ and then normalizing the result by its size. Formally,

$$\vec{c}_k = \frac{\sum_{d_j \in C_k} \vec{d}_j}{\left\| \sum_{d_j \in C_k} \vec{d}_j \right\|},$$

where $C_k = \{ d_j \mid d_j \text{ is a document belonging to the class } c_k \}$. The simple term weighting is $\overline{tf} \times idf$, where $\overline{tf}$ is the average class term frequency of the term. Formally,

$$\overline{tf}_{ik} = \frac{\sum_{d_j \in C_k} tf_{ijk}}{|C_k|},$$

where $|C_k|$ is the number of documents in the class $c_k$. The term weighting described above can also be applied to a query or a test document. In general, the term weighting for a query is $tf \times idf$.

Once a class prototype vector and a query vector have been constructed, the similarity between the two vectors can be calculated. The most popular measure is cosine similarity [10]. For unit-length vectors, this similarity is the dot product of the two vectors. The test document is then assigned to the class whose prototype vector is most similar to the vector of the test document.

III. TERM WEIGHTING IN CENTROID-BASED TEXT CLASSIFICATION BY FREQUENCY-BASED PATTERNS

This section presents the concept of term weighting in centroid-based text classification using frequency-based patterns. Before the details are described, some general characteristics of terms that are significant for representing a certain class are listed below:

• A term that is representative of a certain class should appear with high occurrence frequency within the class.
• An important term tends to exist in relatively few documents, while a general term tends to appear in most documents in a collection.
• A crucial term tends to exist in only one class or in a few classes.

To realize these characteristics, frequency-based patterns in a training collection can be applied. The three main frequency factors are term frequency, document frequency and class frequency. The simplest and most popular weighting for representing a document combines term frequency (tf) and inverse document frequency (idf) in the form $tf \times idf$. In a centroid-based method, the term frequency of a class vector is the average of the term frequencies of the documents in the class. For this reason, we use the symbol $\overline{tf}$ instead of $tf$. It represents the average class term frequency in a class. The $\overline{tf}$ is considered an intra-class factor. A term that is important for a specific class should have a high $\overline{tf}_{ik}$. The $\overline{tf}$ deals with the first property of significant terms for classification.

Term frequency alone may not be enough to represent the contribution of a term in a document. To achieve better performance, the well-known inverse document frequency can be applied to eliminate the impact of frequent terms that exist in almost all documents. The $idf_i$ is the inverse of the ratio of the number of training documents that contain the term $t_i$ to the total number of training documents. It is usually applied as the logarithm of this value. It deals with the second property of significant terms. However, this tends to hold only if the documents that contain the term are in the same class. Conversely, if the term is distributed uniformly over all classes, idf becomes useless and does not help classification. The idf is considered a collection factor, i.e., its value for a particular term is the same regardless of how the classes are arranged in the collection.

The first and second properties can be handled by the conventional term frequency and inverse document frequency, respectively. For the third property, however, other frequency factors are needed, and class frequency is expected to be useful. Class frequency enables the classifier to use the information among classes. It is considered an inter-class factor. Some collections may have several possible organizations of documents (viewpoints). For example, a collection may be grouped into a set of classes based on content (e.g., course, faculty, project, student, ...) or grouped by university (e.g., Cornell, Texas, Washington, Wisconsin, ...). In these two cases, the collection factor (e.g., idf) of a term is identical, while the inter-class factor of that term varies. An important term of a specific category tends to exist in one class or in only a few classes (the third property of significant terms). Therefore, class frequency can be applied to represent the importance of that term. Some earlier works used class frequency in general forms, such as the logarithm of inverse class frequency (icf), which is analogous to idf. In some situations where idf cannot be calculated, such as in [8], a sentence can be used as the processing unit instead of a document, and icf then replaces idf. Although icf seems useful, it is not widely applied in term weighting.

IV. APPLYING CLASS FREQUENCY IN TEXT CLASSIFICATION

In this paper, the usefulness of class frequency is examined.
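The difference between the collection factor and the inter-class factor described in Section III can be sketched in Python. This is only an illustration under stated assumptions: the helper names, the four toy documents and the two labelings are hypothetical, and icf is taken as $\log_2(|C| / cf)$ by analogy with idf.

```python
import math

def document_frequency(docs):
    """Number of documents that contain each term."""
    df = {}
    for tokens in docs:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    return df

def class_frequency(docs, labels):
    """Number of distinct classes in which each term occurs."""
    classes_of = {}
    for tokens, label in zip(docs, labels):
        for t in set(tokens):
            classes_of.setdefault(t, set()).add(label)
    return {t: len(c) for t, c in classes_of.items()}

def inverse_class_frequency(cf, n_classes):
    """icf = log2(|C| / cf); base 2 is an assumption of this sketch."""
    return {t: math.log2(n_classes / c) for t, c in cf.items()}

# One toy collection viewed from two viewpoints (invented data).
docs = [["exam", "cornell"], ["exam", "texas"],
        ["grant", "cornell"], ["grant", "texas"]]
by_content = ["course", "course", "project", "project"]
by_university = ["cornell", "texas", "cornell", "texas"]

df = document_frequency(docs)                      # same for both viewpoints
cf_content = class_frequency(docs, by_content)
cf_university = class_frequency(docs, by_university)
icf_content = inverse_class_frequency(cf_content, len(set(by_content)))
icf_university = inverse_class_frequency(cf_university, len(set(by_university)))
```

The document frequency of "exam" is 2 under either labeling, but the term occurs in one class when documents are grouped by content and in both classes when they are grouped by university, so its icf changes with the viewpoint while its idf does not.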
The examination covers both the concept and the experiments. Three approaches are taken into account: (1) the effectiveness of icf in term weighting, (2) several functions applied to icf to enhance classification performance, and (3) an algorithm that applies terms found in only one class or in a few classes to improve classification performance, using two steps of classification. The details are described as follows.

A. The Effectiveness of ICF in Term Weighting

The popular way to use class frequency is the logarithm of inverse class frequency:

$$icf_i = \log \frac{|C|}{cf_i} \qquad (1)$$

Here $cf_i$ is the number of classes that contain the term $t_i$. The effect of icf is to promote a term that occurs in only a few classes (later called few-class terms) and to demote a term that appears in many classes (later called most-class terms). In the extreme case where a term occurs in only one class ($cf_i = 1$), the function attains its maximum value. These terms are called one-class terms. We believe that few-class terms are quite important for classifying documents, and inverse class frequency promotes the importance of these terms. Conversely, when a term occurs in all classes ($cf_i = |C|$), inverse class frequency attains its minimum value, i.e., 0. These terms are called all-class terms. They are considered less important and may cause the classifier to misclassify a document. However, their usefulness also depends on their term frequencies. The promotion of few-class terms is more effective than the demotion of most-class terms, since demotion may also suppress important terms that occur in all classes but are still useful for classifying documents.

The icf can be included in term weighting. Two simple patterns are applied. The first pattern is $tf \times icf$, i.e., idf is replaced by icf. The other pattern is $tf \times idf \times icf$, i.e., icf is added to $tf \times idf$. Normally, the number of documents is greater than the number of classes ($|D| > |C|$). This means that idf promotes the importance of terms more than icf does. When the logarithm of base 2 is used, the icf of a term equals 1 when the term is found in half of the classes in a data set ($cf_i = |C|/2$). Therefore, $icf > 1$ when $cf_i < |C|/2$ and $icf < 1$ when $cf_i > |C|/2$. If the value of icf is more than 1, it promotes $tf$ and $tf \times idf$. On the contrary, it demotes $tf$ and $tf \times idf$ when its value is less than 1.

B. Functions to Modify ICF

In this approach, several functions are applied to icf to adjust the level of promotion and demotion applied to $tf$ and $tf \times idf$. The first pattern adjusts the power of icf. When the term weighting is $tf \times idf \times icf$, the new term weighting is

$$tf \times idf \times icf^{\alpha} \qquad (2)$$

Here, $\alpha$ is the power of icf. The icf promotes $tf \times idf$ when its value is greater than 1. If the power of icf is greater than 1, it promotes or demotes a term more strongly than the normal icf. On the other hand, when the power of icf is less than 1, it promotes or demotes a term less strongly than the normal icf. The optimum value of $\alpha$ depends on the data set. In the first pattern, a term may be promoted or demoted. The second pattern considers only the promotion of a term. An example of such a term weighting formula is

$$tf \times idf \times (icf + 1) \qquad (3)$$

With this equation, terms that are found in all classes are given a term weight of $tf \times idf$. The other terms are given values higher than $tf \times idf$.

C. Two-step Classification

In this approach, the class frequency of a term is used to select the representations of the class prototype vectors and the test document vectors. The representative vectors are based on n-class terms. A representative vector of n-class terms is built from terms that occur in $1, 2, \ldots, n$ classes, where $1 \le n \le |C|$. In the first step of classification, $n$ is set, and a test document is classified to an appropriate class based on its n-class terms. The remaining test documents, which have no n-class terms in their representation, are classified using all terms in the second step. For example, when $n = 1$, the representation vectors of prototypes and test documents contain only terms that are found in exactly one class. A test document that has one-class terms of a class is automatically classified to that class in the first step. In the second step, the remaining test documents are classified using all terms.

V. EXPERIMENTAL SETTING

To evaluate our concept of class frequency, three collections called WebKB, 7Sectors and 20Newsgroups are used. WebKB is a collection of web pages from computer science departments.
The WebKB pages come from four universities, with some additional pages from other universities. This collection can be viewed from two-dimensional viewpoints. In our experiment, we use the four most popular classes: student, faculty, course and project. This includes 4,199 web pages. Within each class, five subclasses (the second dimension) are defined according to the university a web page belongs to: Cornell, Texas, Washington, Wisconsin and miscellaneous. We use this collection from two viewpoints: the first dimension (WebKB1) and the second dimension (WebKB2). This collection therefore yields two data sets, so the effect of inverse class frequency with different numbers of classes on the same collection can be evaluated. The second collection is 7Sectors, a data set from the CMU World Wide Knowledge Base Project of the CMU Text Learning Group. This data set has seven classes: basic material, energy, financial, healthcare, technology, transportation and utilities. The number of documents in this collection is 4,582. The third collection is 20Newsgroups (20News). The articles are grouped into 20 different UseNet discussion groups. It contains 19,997 documents, and some groups are very similar.

All experiments were performed as closed tests, i.e., the training set was also used as the test set. Performance was measured by classification accuracy, defined as the ratio of the number of documents assigned to their correct classes to the total number of test documents. As preprocessing, some stop words (e.g., a, an, the) were excluded from all data sets. For the HTML-based data sets, all HTML tags (e.g., <B>, </HTML>) were removed from the documents to eliminate the effect of these common and typographic words. All headers were removed from the Newsgroups documents. A unigram model is applied in all experiments.

Three experiments are performed. The first experiment investigates the effect of inverse class frequency on the four data sets of the three collections. The icf is multiplied with $tf$ and $tf \times idf$. The standard term weighting, $tf \times idf$, is used as a baseline for comparison. The default term weighting for a test document is $tf \times idf$ in all experiments. In the second experiment, we investigate the effect of the three functions on icf. In the last experiment, two-step classification is performed using one-class terms and two-class terms in the first step.

VI. EXPERIMENTAL RESULTS

A. Effect of ICF

In this first experiment, the effect of inverse class frequency on $tf$ and $tf \times idf$ is investigated. Three different term weightings, i.e., $tf \times idf$ (TFIDF), $tf \times icf$ (TFICF) and $tf \times idf \times icf$ (TFIDFICF), are applied to centroid-based classifiers. Four data sets are used for evaluation, i.e., WebKB1, WebKB2, 7Sectors and 20Newsgroups. Table I shows the results in terms of classification accuracy.

TABLE I. EFFECT OF ICF ON THE FOUR DATA SETS
[table body not available]

From the results, some observations can be made. The performance of a prototype vector with $tf \times idf$ is better than that of a prototype vector with $tf \times icf$ on three data sets, the only exception being 7Sectors. The performance of classifiers with $tf \times icf$ on the same collection but from different viewpoints, i.e., WebKB1 and WebKB2, may differ by a large gap. The most effective term weighting for all data sets is $tf \times idf \times icf$. It can be concluded that icf is useful when it is combined with $tf \times idf$. In general, icf should not be used to replace idf.

B. Functions to Modify ICF

In this experiment, three functions are applied to icf, i.e., $icf + 1$, $\sqrt{icf}$ and $icf^2$. These are denoted by (ICF+1), SqrtICF and (ICF)^2, respectively. The results are shown in Table II. Note that the result of TFIDFICF is presented again for comparison.

TABLE II. EFFECT OF DIFFERENT FUNCTIONS APPLIED TO ICF
[table body not available]

Some observations on the functions applied to ICF can be made. The (ICF)^2 is the best on average over the four data sets. It improves the performance of the classifier on three of the four data sets, the exception being WebKB2. On 20News, (ICF)^2 improves performance by a gap of 4.43%. Although (ICF+1) gives the best performance on WebKB2, its average performance is lower than that of ICF. The average performance of the classifier with SqrtICF is slightly lower than with ICF.

C. Two-step Classification

In the last experiment, two-step classification is used. We apply n-class terms in the first step. The values of n are 1 and 2, i.e., one-class terms and two-class terms are investigated. For n=1, only the set of one-class terms is used as the representation of the prototype and test document vectors. A test document that contains one-class terms can be automatically classified to the class of those terms. For n=2, the set of one-class terms and two-class terms is used as the representation.
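The two-step procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: prototypes use average class tf times idf times icf, test documents use tf times idf, both logarithms are base 2, vectors are re-normalized after restriction to n-class terms, and a document whose step-one similarity is zero for every class falls through to step two. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def unit(vec):
    """Scale a sparse vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def dot(a, b):
    """Dot product of two sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def train(docs, labels):
    """Build idf, class frequency and tf*idf*icf class prototypes."""
    n_docs, classes = len(docs), sorted(set(labels))
    df, term_classes = Counter(), defaultdict(set)
    tf_sum = {c: Counter() for c in classes}
    size = Counter(labels)
    for tokens, c in zip(docs, labels):
        tf_sum[c].update(tokens)
        for t in set(tokens):
            df[t] += 1
            term_classes[t].add(c)
    idf = {t: math.log2(n_docs / df[t]) for t in df}
    cf = {t: len(cs) for t, cs in term_classes.items()}
    icf = {t: math.log2(len(classes) / cf[t]) for t in cf}
    protos = {c: {t: (n / size[c]) * idf[t] * icf[t]
                  for t, n in tf_sum[c].items()} for c in classes}
    return {"idf": idf, "cf": cf, "protos": protos}

def best_class(query, protos, keep):
    """Most similar prototype with both vectors restricted to `keep` terms."""
    q = unit({t: w for t, w in query.items() if keep(t)})
    scores = {c: dot(q, unit({t: w for t, w in p.items() if keep(t)}))
              for c, p in protos.items()}
    c = max(scores, key=scores.get)
    return c if scores[c] > 0 else None   # None: no usable evidence

def classify_two_step(tokens, model, n=1):
    """Step 1 uses only n-class terms (cf <= n); step 2 uses all terms."""
    idf, cf = model["idf"], model["cf"]
    query = {t: k * idf[t] for t, k in Counter(tokens).items() if t in idf}
    first = best_class(query, model["protos"], lambda t: cf[t] <= n)
    if first is not None:
        return first, 1
    return best_class(query, model["protos"], lambda t: True), 2

# Toy training data (invented for illustration only).
docs = [["exam", "lecture", "student"], ["exam", "student", "student"],
        ["grant", "lab", "student"], ["grant", "lab", "team"],
        ["professor", "office", "team"], ["professor", "office", "office"]]
labels = ["course", "course", "project", "project", "faculty", "faculty"]
model = train(docs, labels)
```

With this toy model, `classify_two_step(["exam", "student"], model)` is decided in the first step because "exam" is a one-class term, while `classify_two_step(["student", "student"], model)` contains no one-class term and is decided in the second step using all terms.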
The term weighting of the prototype vector is $tf \times idf \times icf$. The remaining test documents are classified in the second step. Two term weightings are applied, i.e., $tf \times idf$ and $tf \times idf \times icf$. The results are shown in Table III.

TABLE III. TWO-STEP CLASSIFICATION WITH ONE-CLASS TERMS AND TWO-CLASS TERMS
[table body not available]

The results show a large improvement in classifier performance when two-step classification is applied. In the first step, the average classification accuracy on the four data sets is relatively high. Although performance in the first step with the one-class terms representation is lower than with the two-class terms representation, the number of remaining documents is larger for the one-class terms representation than for the two-class terms representation. As a consequence, the number of documents that are assigned the correct class in the second step is larger for the one-class terms representation than for the two-class terms representation. When the classification process is completed, the classification accuracy over all documents in a collection is higher when beginning with the one-class terms representation than when beginning with the two-class terms representation. The performance of classifiers with $tf \times idf$ and with $tf \times idf \times icf$ in the second step is quite competitive. The performance of two-step classification is superior to that of single-step classification.

VII. CONCLUSION

This paper showed that class frequency is useful in centroid-based classification. Three approaches to class frequency were investigated to exploit information among classes in a systematic way. The evaluation was conducted using various data sets. The first approach explored the effectiveness of inverse class frequency in the popular term weighting $tf \times idf$, both as a replacement for $idf$ and as an addition to $tf \times idf$. The experimental results showed that the classification accuracy of a classifier using the term weighting $tf \times idf \times icf$ was higher than with $tf \times idf$ and $tf \times icf$. The second approach evaluated some functions that adjust the power of inverse class frequency. The results showed that a function of class frequency that both promotes and demotes terms was more effective. Moreover, increasing the exponent of icf improved classifier performance on several data sets. The third approach applied terms that were found in only one class or in a few classes to improve classification performance, using two-step classification. The results clearly showed a large improvement in classifier performance over single-step classification. In conclusion, class frequency proved useful in text classification.

For future work, the effect of class frequency on other classification methods and the performance of classifiers with class frequency across data sets should be evaluated.

ACKNOWLEDGMENT

This work was funded by the Research and Development Institute, Silpakorn University, via research grant SURDI 53/01/12.

REFERENCES

[1] K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell, “Text classification from labeled and unlabeled documents using EM,” Machine Learning, vol. 39, no. 2/3, pp. 103–134, 2000.
[2] M. Ruiz and P. Srinivasan, “Hierarchical text classification using neural networks,” Information Retrieval, vol. 5, no. 1, pp. 87–118, 2002.
[3] M. Kubat and M. Cooperson, Jr., “Voting nearest-neighbor subclassifiers,” in Proceedings of 17th International Conference on Machine Learning, pp. 503–510, Morgan Kaufmann, San Francisco, CA, 2000.
[4] E.-H. Han and G. Karypis, “Centroid-based document classification: Analysis and experimental results,” in Proceedings of PKDD-00, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, (Lyon, FR), pp. 424–431, Springer-Verlag Publisher, 2000.
[5] V. Lertnattee and T. Theeramunkong, “Effect of term distributions on centroid-based text categorization,” Information Sciences, vol. 158, pp. 89–115, 2004.
[6] T. Joachims, Learning to Classify Text using Support Vector Machines. Dordrecht, NL: Kluwer Academic Publishers, 2002.
[7] K. Cho and J. Kim, “Automatic text categorization on hierarchical category structure by using icf (inverse category frequency) weighting,” in Proceedings of KISS-97, Conference of Korean Institute of Intelligent Systems, pp. 507–510, 1997.
[8] Y. Ko and J. Seo, “Automatic text categorization by unsupervised learning,” in Proceedings of COLING-00, the 18th International Conference on Computational Linguistics, pp. 453–459, Saarbrücken, DE, 2000.
[9] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.
[10] A. Singhal, G. Salton, and C. Buckley, “Length normalization in degraded text collections,” Tech. Rep. TR95-1507, 1995.