Context Driven Technique for Document Classification




ACEEE Int. J. on Network Security, Vol. 02, No. 02, Apr 2011
© 2011 ACEEE, DOI: 01.IJNS.02.02.196

Context Driven Technique for Document Classification

Upasana Pandey (upasana1978@gmail.com), S. Chakraverty (apmahs@rediffmail.com), Rahul Jain (rahul.jain@nsitonline.in)
NSIT, SEC 3, Dwarka, New Delhi 110078, India

Abstract: In this paper we present an innovative hybrid Text Classification (TC) system that bridges the gap between statistical and context-based techniques. Our algorithm harnesses contextual information at two stages. First, it extracts a cohesive set of keywords for each category by using lexical references, implicit context as derived from LSA, and word-vicinity driven semantics. Secondly, each document is represented by a set of context-rich features whose values are derived by considering both lexical cohesion and the extent of coverage of salient concepts via lexical chaining. After keywords are extracted, a subset of the input documents is apportioned as the training set. Its members are assigned categories based on their keyword representation. These labeled documents are used to train binary SVM classifiers, one for each category. The remaining documents are supplied to the trained classifiers in the form of their context-enhanced feature vectors. Each document is finally ascribed its appropriate category by an SVM classifier.

Keywords: Lexical references, Vicinity driven semantics, Lexical chaining.

I. INTRODUCTION

Text Classification (TC) is the task of inspecting input documents with the aim of assigning them categories from a predefined set. TC serves a wide range of applications such as classifying scientific and medical documents, organizing documents for query-based information dissemination, email folder management and spam filtering, topic-specific search engines, and library/web document indexing [1].

The world-wide web has witnessed a mind-boggling growth of both the volume and the variety of text content, driving the need for sophisticated TC techniques. Over the years, numerous statistical approaches have been successfully devised. But increasingly, the so-called bag-of-words approach has reached a point of saturation where it can no longer handle the increasing complexities posed by polysemous words, multi-word expressions and closely related categories that require semantic analysis. Recently, researchers have illustrated the potential of applying context-oriented approaches to improve the quality of TC. This has resulted in a fruitful convergence of several fields such as Information Retrieval (IR), Natural Language Processing (NLP) and Machine Learning (ML). The aim is to derive the semantics encapsulated in groups of words and utilize it to improve classification accuracy.

This paper showcases a proposal which bridges the gap between statistical and context-based approaches for TC. The contextual information is harnessed at two stages. First, it is used to extract a meaningful and cohesive set of keywords for each input category. Secondly, it is used to refine the feature set representing the documents to be classified.

For the rest of the paper, Section II presents a brief background of the field and the relevance of context-based TC. Section III brings into perspective prior work in the area. Section IV presents our proposed context-enhanced TC model. In Section V, we compare the proposed TC scheme with current approaches and overview its implications and advantages. We conclude in Section VI.

II. BACKGROUND AND MOTIVATION

A TC system accepts two primary inputs: a set of categories and a set of documents to be classified. Most TC systems use supervised learning methods that entail training a classifier. Support Vector Machines (SVMs) use kernel functions to transform the feature space into a higher-dimensional space and have been found to be especially suitable for TC applications [2]. Their training process may use prior labeled documents which guide the classifier in tuning its parameters. In another approach, as illustrated by the Sleeping Experts algorithm [3], a classifier is trained dynamically during the process of classification using a subset of the input documents, and its parameters are progressively refined. Another approach is to identify a set of training documents from among the input dataset and label them with keyword-matched categories.

A concern here is to derive a set of keywords for each category. While user-input keywords can be used, this is cumbersome and may be impractical. It is attractive to generate the keywords automatically. This motivates the application of contextual information to cull out meaningful keywords for each category.

The documents must be pre-processed and represented by a well-defined and cohesive set of features; this is the most time-consuming part of TC. The main aim of feature extraction is to reduce the large dimensionality of a document's feature space and represent it in a concise manner within a meaningfully derived context space. Once encoded as a feature set, a trained classifier can be invoked to assign a document a suitable category. A plethora of classification techniques are available and have been tapped by TC researchers. They include Bayesian classifiers [4], SVM [5], decision trees [6], AdaBoost [7], RIPPER [3], fuzzy classifiers [8] and
growing seeded clusters based on Euclidean distance between data points [9].

Predominantly, statistical approaches have been applied for feature extraction, with highly acceptable performance of about 98%. These methods employ statistical metrics for feature evaluation, the most popular being Term Frequency-Inverse Document Frequency (TF-IDF), Chi-square and Information Gain [10]. Statistical techniques have been widely applied to web applications and harnessed to capacity. Their applicability in further improving the quality of TC seems to have reached a saturation point. They have some inherent limitations, being unable to deal with synonyms, polysemous words and multi-word expressions.

On the other hand, semantic or context-based feature extraction offers a lot of scope to experiment on the relevancy among words in a document. Semantic features encapsulate the relationship between words and the concepts or the mental signs they represent. Context can be interpreted in a variety of intuitively appealing ways such as:
  • Implicit correlation between words as expressed through Latent Semantic Analysis (LSA)
  • Lexical cohesiveness as implied by synonyms, hypernyms, hyponyms, meronyms, or as sparse matrices of ordered words as described in [3]
  • Syntactic relation between words as reflected by Parts-Of-Speech (POS) phrases
  • Shallow semantic relationships as expressed by basic questions such as WHO did WHAT to WHOM, WHERE [2]
  • Enhancement of a word's importance based on the influence of other salient words in its vicinity [11]
  • Domain-based context features expressed by domain-specific ontology [7].

With such a wide spectrum, different context-oriented features can be flexibly combined together to target different applications. The issue of real-time performance can be handled through efficient learning algorithms and compact feature representation. In this paper, we propose a scheme for incorporating lexical, referential as well as latent semantics to produce context-enhanced text classification.

III. REVIEW OF CONTEXT BASED TC

Several authors have suggested the use of lexical cohesion in different schemes to aid the process of classification. L. Barak et al have used a lexical reference scheme to expand upon a given category name with the aim of automatically deriving a set of keywords for the category [12]. A reference vector comprises lexical references derived from WordNet and Wikipedia for each category, and a context vector is derived from the LSA space between categories and input documents. Each vector is used to calculate two cosine similarity scores between categories and documents. These scores are multiplied to assign a final similarity score. The final bootstrapping step uses the documents that are identified for each category to train its classifier. With their combined reference-plus-context approach, the authors have reported improvement in precision on the Reuters-10 corpus.

In [13], Diwakar et al have used lexical semantics such as synonyms, analogous words and metaphorical words/phrases to tag input documents before passing them to a statistical Bayesian classifier. After semantic pre-processing and normalization, the documents are fed to a Naïve Bayes classifier. The authors have reported improvement in classification accuracy with semantic tagging. However, processing is almost entirely manual.

In [7], the authors locate the longest multiword expression leading to a concept. They derive concepts, synonyms and generalized concepts by utilizing ontology. Since concepts are predominantly noun phrases, a Part-Of-Speech (POS) analyzer is used to reject unlikely concepts. The set of terms and their lexically derived concepts is used to train and test an AdaBoost classifier. Using the Reuters-21578 [14], OHSUMED [15] and FAODOC [16] corpora, the authors report that the combined feature set of terms and derived concepts yields noticeable improvements over the classic term stem representation.

In [17], the authors select a set of likely positive examples from an unlabeled dataset using co-training. They also improve the positive examples with semantic features. The expanded positive dataset as well as the unlabeled negative examples together train the SVM classifier, thus increasing its reliability. For the final classification, the TF-IDF values of all the semantic features used in the positive datasets are evaluated in test documents. The authors' experiments on the Reuters-21578 [14] and Usenet articles [18] corpora reveal that using positive and unlabeled negative datasets combined with semantic features gives improved results.

Ernandes et al [11] propose a term weighting algorithm that treats words as social networks and enhances the default score of a word by using a context function that relates it to neighboring high-scoring words. Their work has been suitably applied to single-word QA, as demonstrated for crossword solving.

IV. PROPOSED WORK

Our proposal taps the potential of lexical and contextual semantics at two tiers:

A. Keyword extraction:

Automatic keyword generation frees the user from the encumbrance of inputting keywords manually. Our scheme employs lexical references derived from WordNet [19], implicit context as derived from an LSA, and vicinity-based context as given by the terms surrounding already known keywords. The three driving forces together yield a cohesive set of keywords. The set of extracted keywords for each category serves as a template for identifying representative documents for that category. These documents are utilized as supervisors to train an SVM classifier.

B. Document representation:

Once training documents are identified, these and the remaining unlabeled test documents must be represented by a compact feature set. As during the keyword extraction phase, we use lexical relationships to identify generic
concepts, synonyms and specialized versions of terms. Taking a further step of refinement, we use anaphoric referential phrases and strong conjunctions to identify the length of cohesiveness, or lexical chain, of each salient concept. These parameters are utilized to fine-tune TF-IDF feature values. Only those features whose values cross a stipulated threshold finally represent the document.

The algorithm Context-driven Text Classification (CTC) is described in the pseudo-code in Figure 1.

Step 1: Removal of stop words and stemming:

The first step is pre-processing of documents. Stop words, i.e. those words that have negligible semantic significance such as 'the', 'an' and 'a', are removed. Weak conjunctions such as 'and' are removed, but strong conjunctions such as 'therefore' and 'so' are retained so that their contribution to lexical chaining can be utilized. Next, the system carries out stemming, viz. the process of reducing inflected words to a root word form. The entire set of documents is partitioned into a one-third training set and a two-thirds testing set.

Step 2: Keyword extraction:

Words that bear a strong semantic connection with category names are keywords. This concept can be extended to hypothesize that words strongly associated with identified keywords are themselves good candidates as keywords. We apply these principles to extract keywords from the following sources.

Keyword set from lexical references: Words that are lexically related to a category c are collected from the WordNet resource. Taking a cue from [12], we use only that sense which corresponds to the category's meaning. First, synonyms are extracted. Then hypernyms of each term are transitively located to allow generalization. Hypernyms are collected up to a pre-specified level k so that term meanings are not obfuscated by excessive generalization. Next, hyponyms at the immediate lower level are located. At the end of this process, the system is armed with an initial set of keywords that explicitly refer to category names.

Keyword set from implicit context: Latent Semantic Analysis (LSA) is a well-known technique to help identify the implicit contextual bonding between category names and input documents. LSA uses Singular Value Decomposition (SVD) to cull out groups of words which have high vector similarity in the LSA space [20]. This step generates a new set of contextually related keywords AKimc(c) for each category c.

Augmented keyword sets from vicinity context: Words in the vicinity of keywords weave a semantic thread that correlates with the category. LSA does club together words co-occurring anywhere in a document, but it does not highlight semantics inherently present in neighboring words. The idea is to locate those words surrounding keywords that have a high potential for becoming keywords themselves. Given a keyword set, the algorithm carries out a syntactic parsing of the sentences containing keywords. Since concepts are usually represented by noun phrases, they are separated out. The number of instances of each vicinity-related pair is counted. If this exceeds a pre-decided threshold τv, the keyword set is augmented with the new term. The two augmented keyword sets are now AKlex(c) and AKimc(c).

Step 3: Evaluate documents' representative index:

The cosine similarity, which gives the cosine of the angle between two vectors, has been used in [12,22] to match documents with categories. However, the keyword vectors being equally weighted, the cosine similarity will bias documents with an equal number of instances of each keyword against others. Also, a document with more instances of a keyword may receive the same similarity score as a document with fewer instances of the keyword. An appropriate metric should reflect the fact that any document containing more keywords from a category is more representative of it, and that a document with more instances of each keyword deals with the category in greater detail. Given a keyword list AK(c) and a document d, we multiply two factors to derive its representative index ρ(AK(c), d): (1) the fraction of the total number of category keywords present in d; (2) the ratio of the summed keyword weights in d to the sum total of these weights over all documents. Thus:

    ρ(AK(c), d) = (|AK(d)| / |AK(c)|) × (Σ_i TF_IDF(w_i ∈ AK(d)) / Σ_k Σ_i TF_IDF(w_i ∈ AK(d_k)))

Step 4: Assigning categories to documents:

The next step labels documents with categories. The document with the highest r-score for a category is assigned to it. Thereafter, all documents whose r-scores for that category are within a preset category assignment factor F of this maximum similarity score are also assigned to it.

Step 5: Document feature representation:

The TC system utilizes contextual information to derive the feature set values of documents in the following ways:

Generalization with lexical references: As a document is browsed, new terms encountered are annotated with their synonyms, hypernyms for k-up levels and hyponyms for k-down levels. A rescanning of the document replaces any of the lexical references encountered with their equivalent base words. Term counts are simultaneously updated. When all documents have been tagged, TF-IDF values are evaluated.

Creating lexical chains: Vertical search engines focus on a particular topic or text segment only. They need TC methods with greater depth of
search. This is in contrast with generic Web search engines. Towards this goal, we recognize that the extent of coverage of a concept is as important as its frequency of occurrence. In our framework, we have introduced an optional step that can be invoked for vertical categorization requirements.

Lexical chaining links consecutive sentences that are related through repetition of a concept term or its lexical references (iteration), anaphoric references and strong conjunctions [21]. Besides iteration, resolved anaphoric references also lead to the next link in the chain. Strong conjunctions such as 'therefore' or 'so' serve as final links. When a gap of a few sentences occurs, it indicates termination of a local chain.

Step 6: Classifier training:

The documents output by Step 4 are positively labeled documents. They are used to train a set of binary SVM classifiers, one per category. All documents labeled with a given category are employed to train its classifier.

Step 7: Testing:

The test documents are applied to the trained classifiers to be finally assigned their appropriate categories.

V. DISCUSSION

The TC system proposed above differs from reported schemes with the following innovations.

A. Application of context for TC: In this paper, context-driven information is utilized for keyword generation as well as for document representation.

B. Keyword extraction: We use a three-pronged approach to extracting keywords automatically, starting with only category names, using:
  i. an LSA space vector similarity between documents and category names,
  ii. lexical references derived from category names, and
  iii. the strong semantic connection between concepts in the vicinity of identified keywords.

While the first method has been tried out in [22] and the first two combined have been reported in [12], our algorithm augments both kinds of keywords by tapping distance-based context. The combined approach for keyword extraction is geared towards an expanded and more cohesive keyword set.

Document to category matching: In [12,22], the authors have used the cosine similarity metric to map documents with categories. This does not truly capture their faithfulness to a given category. The r-score used here encodes the extent to which a document represents a category. It encapsulates both parameters: the fraction of the keywords present in a document and the weighted average keyword frequency.

Weighting TF-IDF with length of lexical chain: To allow categorization for applications that require a vertical search based on a topic, we use lexical chaining, which allows us to measure the extent of coverage of a concept. Being compute-intensive, we allow this feature as a user-requested option.

Tunable guiding parameters: Parameters such as the maximum level of lexical referencing, the vicinity score threshold and the category assignment factor can be experimentally tuned. There is flexibility to adjust them for different applications and input categories.

VI. CONCLUSION

In conclusion, we have proposed a comprehensive scheme for context-based TC that starts its job with only category names. A cohesive set of keywords is generated by exploiting three kinds of semantic cohesion: lexical references, implicit context and vicinity-induced cohesion. Semantic cohesion is also utilized for representing documents as context-oriented feature sets. While lexical references bring in generalization, an optional lexical chaining scheme allows the depth of coverage of concepts to influence the classification decision. The application of context at various levels gives a framework for more meaning-driven classification, which cannot be expected from a purely bag-of-words approach.

REFERENCES

[1] F. Sebastiani, "Text categorization," Trans. of State-of-the-art in Science and Engineering, Procs. Text Mining and its Applications to Intelligence, CRM and Knowledge Management, vol. 17, WIT Press, 2005.
[2] S. Pradhan et al., "Support vector learning for semantic argument classification," Machine Learning, vol. 60, pp. 11-39, 2005.
[3] William W. Cohen and Y. Singer, "Context-Sensitive Learning Methods for Text Categorization," ACM Transactions on Information Systems, vol. 17, no. 2, pp. 141-173, 1999.
[4] I. Androutsopoulos et al., "Learning to filter spam mail: a Bayesian and a memory-based approach," Procs. of the workshop "Machine Learning and Textual Information Access," 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2000.
[5] Q. Wang et al., "SVM-Based Spam Filter with Active and Online Learning," Procs. of the TREC Conference, 2006.
[6] Jiang Su and Harry Zhang, "A Fast Decision Tree Learning Algorithm," Procs. of the 21st Conference on AI, vol. 1, Boston, pp. 500-505, 2006.
[7] Stephen Bloehdorn et al., "Boosting for text classification with semantic features," Procs. of the MSW 2004 workshop at the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 70-87, Aug 2004.
[8] El-Sayed M. El-Alfy and Fares S. Al-Qunaieer, "A Fuzzy Similarity Approach for Automated Spam Filtering," Procs. of the
2008 IEEE/ACS International Conference on Computer Systems and Applications, pp. 544-550, 2008.
[9] P.I. Nakov and P.M. Dobrikov, "Non-Parametric Spam Filtering Based on KNN and LSA," Procs. of the 33rd National Spring Conference, 2004.
[10] Yiming Yang et al., "A Comparative Study on Feature Selection in Text Categorization," Procs. of ICML-97, 14th International Conference on Machine Learning, pp. 412-420, Nashville, US, Morgan Kaufmann Publishers, San Francisco, US, 1997.
[11] Marco Ernandes et al., "An Adaptive Context-Based Algorithm for Term Weighting," Procs. of the 20th International Joint Conference on Artificial Intelligence, pp. 2748-2753, 2007.
[12] Libby Barak et al., "Text Categorization from Category Name via Lexical Reference," Proc. of NAACL HLT 2009: Short Papers, pp. 33-36, June 2009.
[13] Diwakar Padmaraju et al., "Applying Lexical Semantics to Improve Text Classification," http://web2py.iiit.ac.in/publications/default/download/inproceedings.pdf.9ecb6867-0fb0-48a5-8020-0310468d3275.pdf
[14] Reuters dataset: www.reuters.com
[15] OHSUMED test collection dataset: http://medir.ohsu.edu/~hersh/sigir-94-ohsumed.pdf
[16] FAODOC test collection dataset: www.tesisenxarxa.net/TESIS_UPC/AVAILABLE/TDX.../12ANNEX2.pdf
[17] Na Luo et al., "Using Co-training and Semantic Feature Extraction for Positive and Unlabeled Text Classification," 2008 International Seminar on Future Information Technology and Management Engineering, http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04746478
[18] http://www.brothersoft.com/downloads/nzb-senet.html
[19] http://WordNet.princeton.edu/WordNet/download/
[20] http://en.wikipedia.org/wiki/Latent_semantic_analysis
[21] M.A.K. Halliday and R. Hasan, Cohesion in English, Longman, 1976.
[22] Gliozzo et al., "Investigating Unsupervised Learning for Text Categorization Bootstrapping," Proc. of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 129-136, October 2005.
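
Appendix (editorial): the representative index of Step 3 and the category assignment rule of Step 4 can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the function names, the per-document TF-IDF dictionary layout, and the reading of the assignment factor F as "assign every document whose r-score is at least F times the maximum" are all our assumptions.

```python
def representative_index(category_keywords, doc_tfidf, all_docs_tfidf):
    """rho(AK(c), d): the fraction of category keywords present in d, multiplied
    by the ratio of d's TF-IDF mass on those keywords to the total TF-IDF mass
    that all documents place on the category's keywords."""
    if not category_keywords:
        return 0.0
    present = [w for w in category_keywords if doc_tfidf.get(w, 0.0) > 0.0]
    fraction_present = len(present) / len(category_keywords)
    doc_mass = sum(doc_tfidf[w] for w in present)
    total_mass = sum(d.get(w, 0.0) for d in all_docs_tfidf for w in category_keywords)
    return fraction_present * (doc_mass / total_mass) if total_mass else 0.0

def assign_documents(r_scores, F):
    """Step 4 (assumed reading): keep the top-scoring document plus every
    document whose r-score is within factor F of the maximum (0 < F <= 1)."""
    max_score = max(r_scores)
    return [i for i, s in enumerate(r_scores) if max_score > 0 and s >= F * max_score]

# Hypothetical example: three documents scored against one category.
keywords = ["ball", "goal", "team"]
docs = [{"ball": 2.0, "goal": 1.0},   # covers 2 of 3 keywords
        {"ball": 1.0},                # covers 1 of 3 keywords
        {"market": 5.0}]              # covers none
scores = [representative_index(keywords, d, docs) for d in docs]
training_set = assign_documents(scores, F=0.5)
```

With these toy numbers the first document scores (2/3)·(3/4) = 0.5 and, with F = 0.5, it alone is selected as a positive training example for the category, mirroring how Step 4 seeds the SVM training set.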
