A signal processing approach to distributional clustering of terms in automatic text categorization


    Presentation Transcript

    1. A signal processing approach to distributional clustering of terms in automatic text categorization. Marta Capdevila Dalmau, Oscar W. Márquez Flórez, University of Vigo (Spain). martacap@gts.tsc.uvigo.es, http://www.gts.tsc.uvigo.es
    2. Automatic Text Categorization: the automated categorization of texts into predefined categories, given a training set of pre-categorized text documents. Problem: the high dimensionality of the indexing term space is problematic for most commonly used categorizers.
    3. Distributional Clustering? An effective and powerful approach to term extraction, aimed at reducing the dimensionality of the original term space for Automatic Text Categorization. Terms are characterized by their probability distribution functions over the document categories. Clustering follows a similarity measure between these functions.
    4. The method we propose is a Distributional Clustering method based not on information-theoretic measures, as previous authors proposed, but on a new Signal Processing interpretation. Step 1: elimination of noisy terms. Step 2: clustering of the remaining informative terms following a measure of signal interdependence, or correlation.
    5. The results we obtain re-confirm those obtained by other Distributional Clustering algorithms: a drastic improvement in categorization accuracy, especially at lower numbers of features. The 20 Newsgroups reference dataset can be indexed with only 20 clusters at a minimum loss of categorization accuracy.
    6. Methodology of our clustering: text documents are represented by the classic bag-of-words indexing. The weight of each word is the number of times the word occurs in the document. Each term is characterized by its probability distribution function over the discrete variable category (see the next slide).
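
The bag-of-words weighting described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `bag_of_words` and the lowercase, letters-only tokenization are assumptions made for the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a document to {word: number of occurrences in the document}.

    Assumption: words are lowercased runs of ASCII letters; the slides do
    not specify the tokenizer that was actually used.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = "The cat sat on the mat. The mat was red."
bow = bag_of_words(doc)
print(bow["the"], bow["mat"])
```

A `Counter` returns 0 for words that never occur, which is convenient when the same vocabulary is looked up across many documents.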
    7. Probability distribution function: $f_k : c_i \mapsto f_k(c_i) = P(c_i \mid t_k)$, with $\sum_{i=1}^{C} f_k(c_i) = 1$. The probabilities are calculated by dividing the number of occurrences of term $t_k$ in all documents belonging to each category $c_i$ by the total number of occurrences of $t_k$ in all documents of the dataset.
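
A possible way to compute these per-term distributions from bag-of-words counts is sketched below. The data layout (a list of `(category, counts)` pairs) and the name `category_distributions` are assumptions made for the example.

```python
from collections import defaultdict

def category_distributions(docs):
    """docs: list of (category, {term: count}) pairs.

    Returns (categories, dists), where dists[term][i] estimates
    P(c_i | t_k): occurrences of the term in category i divided by its
    occurrences in the whole dataset.
    """
    categories = sorted({cat for cat, _ in docs})
    index = {cat: i for i, cat in enumerate(categories)}
    counts = defaultdict(lambda: [0] * len(categories))
    for cat, bow in docs:
        for term, n in bow.items():
            counts[term][index[cat]] += n
    dists = {}
    for term, per_cat in counts.items():
        total = sum(per_cat)
        dists[term] = [n / total for n in per_cat]
    return categories, dists

# Toy data, invented for illustration only.
docs = [
    ("sport", {"ball": 3, "game": 1}),
    ("tech", {"chip": 4, "game": 1}),
    ("sport", {"ball": 1}),
]
categories, dists = category_distributions(docs)
print(categories, dists["game"])
```

Each resulting list sums to 1, matching the normalization condition stated on the slide.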
    8. Signal processing approach: $t_k$ is assumed to be a probabilistic signal. Step 1: elimination of noisy signals. Signals with a very flat distribution are not informative of the category variable. These signals have a low variance: $\sigma_{f_k}^2 = \frac{1}{C} \sum_{i=1}^{C} f_k^2(c_i) - \frac{1}{C^2}$.
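
The variance filter of Step 1 can be sketched directly from the formula above. The threshold value used in the example is hypothetical; the slides do not report the one used in the experiments.

```python
def distribution_variance(f):
    """Variance of f_k over the C categories: (1/C) * sum(f^2) - 1/C^2.

    The mean of f_k is always 1/C because the distribution sums to 1.
    """
    C = len(f)
    return sum(p * p for p in f) / C - 1.0 / (C * C)

def drop_noisy_terms(dists, threshold):
    """Keep only the terms whose variance reaches the threshold."""
    return {t: f for t, f in dists.items()
            if distribution_variance(f) >= threshold}

# Toy distributions over two categories, invented for illustration.
dists = {"ball": [1.0, 0.0], "game": [0.5, 0.5]}
informative = drop_noisy_terms(dists, threshold=0.01)  # hypothetical threshold
print(sorted(informative))
```

A perfectly flat distribution has variance 0 and is always discarded, while a term concentrated in one category reaches the maximum variance.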
    9. Signal processing approach. Step 2: clustering of similar signals. The degree of similarity between signals is measured by the correlation coefficient, which estimates their interdependence: $\rho_{jk} = \frac{1}{\sigma_{f_j} \sigma_{f_k}} \frac{1}{C} \sum_{i=1}^{C} \left( f_j(c_i) - \frac{1}{C} \right) \left( f_k(c_i) - \frac{1}{C} \right)$, with $\rho_{jk} \in [0,1]$ (in the hypothesis of equiprobable categories).
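
The correlation coefficient of Step 2 can be sketched as below, reusing the variance formula from Step 1. The function names are assumptions made for the example. The sketch requires both signals to have non-zero variance, which holds for terms that survive the noise filter. Note that, computed this way, two signals concentrated on different categories yield a negative value.

```python
import math

def variance(f):
    """Variance of a category distribution whose mean is 1/C."""
    C = len(f)
    return sum(p * p for p in f) / C - 1.0 / (C * C)

def correlation(fj, fk):
    """Correlation coefficient between two category distributions."""
    C = len(fj)
    mean = 1.0 / C
    cov = sum((a - mean) * (b - mean) for a, b in zip(fj, fk)) / C
    return cov / math.sqrt(variance(fj) * variance(fk))

# Toy distributions over three categories, invented for illustration.
similar = correlation([0.8, 0.1, 0.1], [0.7, 0.2, 0.1])
opposed = correlation([0.8, 0.1, 0.1], [0.1, 0.1, 0.8])
print(similar > opposed)
```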
    10. Hard clustering algorithm. The initial agglomerative algorithm we implemented: sorts the vocabulary in decreasing order of variance; eliminates terms with variance lower than a certain threshold (noisy signals); initializes the M clusters as singletons with the top M terms; loops until all terms have been put into one of the M clusters, merging the correlated clusters and creating new clusters from the next terms in the list.
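
One possible reading of these steps is sketched below. Several details are assumptions, since the slide does not specify them: exactly one pair of clusters (the most correlated one) is merged per iteration, a cluster is represented by the unweighted mean of its members' distributions, and a zero-variance cluster is given correlation 0. The sketch needs M of at least 2.

```python
import math
from itertools import combinations

def variance(f):
    C = len(f)
    return sum(p * p for p in f) / C - 1.0 / (C * C)

def correlation(fj, fk):
    C = len(fj)
    cov = sum((a - 1.0 / C) * (b - 1.0 / C) for a, b in zip(fj, fk)) / C
    denom = math.sqrt(max(variance(fj), 0.0) * max(variance(fk), 0.0))
    return cov / denom if denom > 0 else 0.0  # assumption: flat cluster -> 0

def centroid(terms, dists):
    """Assumption: a cluster is the plain average of its members."""
    C = len(dists[terms[0]])
    return [sum(dists[t][i] for t in terms) / len(terms) for i in range(C)]

def hard_cluster(dists, M, threshold):
    # Sort by decreasing variance, dropping noisy (low-variance) terms.
    ranked = sorted((t for t in dists if variance(dists[t]) >= threshold),
                    key=lambda t: variance(dists[t]), reverse=True)
    clusters = [[t] for t in ranked[:M]]  # M singleton clusters
    for term in ranked[M:]:
        # Merge the most correlated pair, freeing one slot in the window.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: correlation(centroid(clusters[p[0]], dists),
                                             centroid(clusters[p[1]], dists)))
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
        clusters.append([term])   # the next term becomes a new singleton
    return clusters

# Toy distributions over three categories, invented for illustration.
dists = {
    "ball": [0.9, 0.1, 0.0],
    "goal": [0.8, 0.1, 0.1],
    "chip": [0.0, 0.05, 0.95],
    "code": [0.1, 0.2, 0.7],
    "the": [0.34, 0.33, 0.33],
}
clusters = hard_cluster(dists, M=3, threshold=0.01)  # hypothetical values
print(clusters)
```

Recomputing every centroid at each step keeps the sketch short but is quadratic in M per iteration; caching cluster centroids would be the natural refinement for a real vocabulary.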
    11. Further clustering algorithms. To avoid merging poorly correlated terms into the same cluster, we implemented: dynamic window expansion/compression, in which the static window of dimension M is dynamically expanded and later compressed; and soft clustering, in which any term can belong to more than one cluster.
    12. Simulation scenario. The 20 Newsgroups reference dataset: approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups (categories). Pre-filtering of the dataset: removal of stop-words; removal of non-alphabetical words; removal of terms occurring in fewer than 4 documents, appearing fewer than 4 times in the whole dataset.
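
The pre-filtering steps could look roughly like the sketch below. The stop-word list is a tiny placeholder, whitespace tokenization is assumed, and the frequency rule is implemented under one reading of the slide: a term is kept only if it occurs in at least 4 documents and at least 4 times overall.

```python
from collections import Counter

# Placeholder list for illustration; the slides do not name the list used.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def prefilter(texts, min_docs=4, min_count=4):
    """Return one list of surviving tokens per input document."""
    tokenized = []
    for text in texts:
        # Drop stop-words and any token that is not purely alphabetical.
        tokens = [w for w in text.lower().split()
                  if w.isalpha() and w not in STOP_WORDS]
        tokenized.append(tokens)
    doc_freq = Counter()  # number of documents containing each term
    total = Counter()     # number of occurrences in the whole dataset
    for tokens in tokenized:
        total.update(tokens)
        doc_freq.update(set(tokens))
    keep = {w for w in total
            if doc_freq[w] >= min_docs and total[w] >= min_count}
    return [[w for w in tokens if w in keep] for tokens in tokenized]

# Toy corpus, invented for illustration.
texts = ["the signal is noisy"] * 4 + ["rare word x86"]
filtered = prefilter(texts)
print(filtered)
```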
    13. Categorization accuracy with the classic Naïve Bayes categorizer.
    14. Categorization accuracy with the classic Naïve Bayes categorizer. Distributional clustering accuracy results are notably better than those of the classic Information Gain and Chi-square term selection functions. The curves show an abrupt initial increase up to 20 clusters (accuracy 74%-76%). From there they asymptotically approach an accuracy of around 79%.
    15. Clusters produced. Clustering is good for 20 clusters or more (20 being the number of categories defined in the 20 Newsgroups collection). When 20 clusters are produced, each category is mainly identified by a single, different cluster, with probabilities ranging from 0.7552 to 0.9474 (mean 0.8337).
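
The observation that each category is identified by a single, different cluster can be checked with a sketch like the following. The helper names and the three toy cluster distributions are invented for the example and are not the values reported on the slide.

```python
def dominant_category(dist):
    """Return (index, probability) of the category where a cluster peaks."""
    i = max(range(len(dist)), key=lambda j: dist[j])
    return i, dist[i]

def one_cluster_per_category(cluster_dists):
    """True if no two clusters peak on the same category."""
    peaks = [dominant_category(d)[0] for d in cluster_dists]
    return len(set(peaks)) == len(peaks)

# Toy category distributions of three clusters, invented for illustration.
cluster_dists = [
    [0.90, 0.06, 0.04],
    [0.10, 0.80, 0.10],
    [0.05, 0.10, 0.85],
]
print(one_cluster_per_category(cluster_dists))
```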
    16. Clusters produced: category probability distributions of the 20 clusters obtained.
    17. Clusters produced: category probability distributions of two of the clusters.
    18. Conclusions. The results obtained with our Signal Processing clustering approach are very encouraging and re-confirm those obtained by other Distributional Clustering algorithms. The elimination of noisy terms (based on the variance of the category probability distribution function) has proven to be a correct procedure.
    19. Future work. A deep statistical study of the effects of the variance threshold. Testing with the Reuters-21578 dataset, which is an extremely non-uniformly distributed text collection. The design of a completely new categorizer based on our Signal Processing Distributional Clustering approach.
    20. Acknowledgements. We wish to thank the Weka Machine Learning Project for making their software open source under a GPL license. For this research, Marta Capdevila was supported in part by a predoctoral grant from the R&D General Department of the Xunta de Galicia regional government (Spain), awarded on July 19th, 2005.

    + inscit2006inscit2006, 3 years ago
