International Journal Of Computational Engineering Research ( Vol. 2 Issue. 5           A Survey of Text M...
International Journal Of Computational Engineering Research ( Vol. 2 Issue. 52.1. MEASURES FOR TEXT RETRIE...
International Journal Of Computational Engineering Research ( Vol. 2 Issue. 5extract more related keywords...
International Journal Of Computational Engineering Research ( Vol. 2 Issue. 5of frequent patterns derived ...
Upcoming SlideShare
Loading in …5

IJCER ( International Journal of computational Engineering research


Published on

Published in: Technology, Education
  • Be the first to comment

  • Be the first to like this

IJCER ( International Journal of computational Engineering research

  1. 1. International Journal Of Computational Engineering Research ( Vol. 2 Issue. 5 A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques 1 R. Sagayam, 2 S.Srinivasan, 3 S. Roshni 1, 2, 3 Department Of Computer Science Govt. Arts College (Autonomous) Salem-7 2 Periyar University Salem-636011AbstractText mining is the analysis of data contained in natural And relevance of the documents, or find patterns and trendslanguage text. Text mining works by transposing words and across multiple documents. Thus, text mining has becomephrases in unstructured data into numerical values which an increasingly popular and essential theme in data mining.can then be linked with structured data in a database andanalyzed with traditional data mining techniques. Text 2. Information Retrievalinformation retrieval and data mining has thus become Information retrieval is a field that has been developing inincreasingly important. In this paper a survey of Text parallel with database systems for many years. Unlike themining have been presented. field of database systems, which has focused on query and transaction processing of structured data, informationKeyword: Information Retrieval, Information Extraction retrieval is concerned with the organization and retrieval ofand Indexing Techniques information from a large number of text-based documents. Since information retrieval and database systems each handle different kinds of data, some database system1. Introduction problems are usually not present in information retrievalText mining is a variation on a field called data mining, that systems, such as concurrency control, recovery, transactiontries to find interesting patterns from large databases. Text management, and update. Also, some common informationdatabases are rapidly growing due to the increasing amount retrieval problems are usually not encountered in traditionalof information available in electronic form, such as database systems, such as unstructured documents,electronic publications, various kinds of electronic approximate search based on keywords, and the notion ofdocuments, e-mail, and the World Wide Web . Nowadays relevance. Due to the abundance of text information,most of the information in government, industry, business, information retrieval has found many applications. Thereand other institutions are stored electronically, in the form exist many information retrieval systems, such as on-lineof text databases. Data stored in most text databases are library catalog systems, on-line document managementsemi structured data in that they are neither completely systems, and the more recently developed Web searchunstructured nor completely structured. For example, a engines. A typical information retrieval problem is to locatedocument may contain a few structured fields, such as title, relevant documents in a document collection based on aauthors, publication date, category, and so on, but also user’s query, which is often some keywords describing ancontain some largely unstructured text components, such as information need, although it could also be an exampleabstract and contents. There have been a great deal of relevant document. In such a search problem, a user takesstudies on the modeling and implementation of semi the initiative to “pull” the relevant information out from thestructured data in recent database research. Moreover, collection; this is most appropriate when a user has some adinformation retrieval techniques, such as text indexing hoc information need, such as finding information to buy amethods, have been developed to handle unstructured used car. When a user has a long-term information need , adocuments. Traditional information retrieval techniques retrieval system may also take the initiative to “push” anybecome inadequate for the increasingly vast amounts of text newly arrived information item to a user if the item isdata. Typically, only a small fraction of the many available judged as being relevant to the user’s information need.documents will be relevant to a given individual user. Such an information access process is called informationWithout knowing what could be in the documents, it filtering, and the corresponding systems are often called filtering systems or recommender systems. From a technicalIs difficult to formulate effective queries for analyzing and viewpoint, however, search and filtering share manyextracting useful information from the data. Users need common techniques. Below we briefly discuss the majortools to compare different documents, rank the importance techniques in information retrieval with a focus on search techniques.Issn 2250-3005(online) September| 2012 Page 1443
  2. 2. International Journal Of Computational Engineering Research ( Vol. 2 Issue. 52.1. MEASURES FOR TEXT RETRIEVAL measure. term table consists of a set of term records, eachThe set of documents relevant to a query be denoted as containing two fields: term id and posting list, where{Relevant}, and the set of documents retrieved be denoted as posting list specifies a list of document identifiers in which{Retrieved}. The set of documents that are both relevant the term appears. With such organization, it is easy toand retrieved is denoted as {Relevant }∩{Retrieved}, as answer queries like “Find all of the documents associatedshown in the Venn diagram of Figure 1. There are two basic with a given set of terms,” or “Find all of the termsmeasures for assessing the quality of text retrieval. associated with a given set of documents.” For example, toPrecision: This is the percentage of retrieved documents that find all of the documents associated with a set of terms, weare in fact relevant to the query . It is formally defined as can first find a list of document identifiers in term table for Precision=‫{׀‬Relevant}∩{Retrieved}‫{׀/׀‬Retrieved}‫׀‬ each term, and then intersect them to obtain the set ofRecall: This is the percentage of documents that are relevant relevant documents. Inverted indices are widely used into the query and were, in fact, retrieved. It is formally industry. They are easy to implement. The posting listsdefined as could be rather long, making the storage requirement quite Recall=‫{׀‬Relevant}∩{Retrieved}‫{׀/׀‬Relevant} ‫׀‬ large. They are easy to implement, but are not satisfactory atAn information retrieval system often needs to trade off handling synonymy (where two very different words canrecall for precision or vice versa. One commonly used trade- have the same meaning) and polysemy (where an individualoff is the F-score, which is defined as the harmonic mean of word may have many meanings). A signature file is a filerecall and precision that stores a signature record for each document in theF-score =recall×precision/(recall+ precision)/2 database. Each signature has a fixed size of b bitsThe harmonic mean discourages a system that sacrifices one representing terms. A simple encoding scheme goes asmeasure for another too drastically follows. Each bit of a document signature is initialized to 0. A bit is set to 1 if the term it represents appears in the document. A signature S1 matches another signature S2 if each bit that is set in signature S2 is also set in S1. Since there are usually more terms than available bits, multiple terms may be mapped into the same bit. Such multiple-to one mappings make the search expensive because a document that matches the signature of a query does not necessarily contain the set of keywords of the query. The document has to be retrieved, parsed, stemmed, and checked. Improvements can be made by first performing frequency analysis, stemming, and by filtering stop words, Figure 1. Relationship between the set of relevant and then using a hashing technique and superimposeddocuments and the set of retrieved documents. coding technique to encode the list of terms into bitPrecision, recall, and F-score are the basic measures of a representation. Nevertheless, the problem of multiple-to-oneretrieved set of documents. These three measures are not mappings still exists, which is the major disadvantage ofdirectly useful for comparing two ranked lists of documents this approach. Readers can refer to [2] for more detailedbecause they are not sensitive to the internal ranking of the discussion of indexing techniques, including how todocuments in a retrieved set. In order to measure the quality compress an index.of a ranked list of documents, it is common to compute anaverage of precisions at all the ranks where a new relevant 4. Query Processing Techniquesdocument is returned. It is also common to plot a graph of Once an inverted index is created for a document collection,precisions at many different levels of recall; a higher curve a retrieval system can answer a keyword query quickly byrepresents a better-quality information retrieval system. For looking up which documents contain the query keywords.more details about these measures, readers may consult an Specifically, we will maintain a score accumulator for eachinformation retrieval textbook, such as [3]. document and update these accumulators as we go through each query term. For each query term, we will fetch all of3. Text Indexing Techniques the documents that match the term and increase their scores.There are several popular text retrieval indexing techniques, More sophisticated query processing techniques areincluding inverted indices and signature files. An inverted discussed in [2].When examples of relevant documents areindex is an index structure that maintains two hash indexed available, the system can learn from such examples toor B+-tree indexed tables: document table and term table, improve retrieval performance. This is called relevancewhere document table consists of a set of document records, feedback and has proven to be effective in improvingeach containing two fields: doc id and posting list, where retrieval performance. When we do not have such relevantposting list is a list of terms (or pointers to terms) that occur examples, a system can assume the top few retrievedin the document, sorted according to some relevance documents in some initial retrieval results to be relevant andIssn 2250-3005(online) September| 2012 Page 1444
  3. 3. International Journal Of Computational Engineering Research ( Vol. 2 Issue. 5extract more related keywords to expand a query. Such version of the Porter stemming algorithm with a fewfeedback is called pseudo-feedback or blind feedback and is changes towards the end in which we have omitted someessentially a process of mining useful keywords from the retrieved documents. Pseudo-feedback also often leads e.g. apply – applied – appliesto improved retrieval performance. One major limitation of print – printing – prints – printedmany existing retrieval methods is that they are based on In both the cases, all words of the first example will beexact keyword matching. However, due to the complexity of treated as ‘apply’ and all words of the second example willnatural languages, keyword based retrieval can encounter be treated as ‘print’.two major difficulties. The first is the synonymy problem:two words with identical or similar meanings may have very 5.2 Domain dictionarydifferent surface forms. For example, a user’s query may In order to develop tools of this sort, it is essential touse the word “automobile,” but a relevant document may provide them with a knowledge base. A collective set of alluse “vehicle” instead of “automobile.” The second is the the ‘feature terms’ is the Domain dictionary (our source waspolysemy problem: the same keyword, such as mining, or The structure of the DomainJava, may mean different things in different contexts. dictionary which we implemented consisted of three levels in the hierarchy. Namely, Parent Category, Sub-category5. Information Extraction and word. Parent categories define the main category underThe general purpose of Knowledge Discovery is to “extract which any sub-category or word falls. A parent categoryimplicit, previously unknown, and potentially useful will be unique on its level in the hierarchy. Sub-categoriesinformation from data”. Information Extraction IE mainly will belong to a certain parent category and each sub-deals with identifying words or feature terms from within a category will consist of all the words associated with it. Astextual file. Feature terms can be defined as those which are an example, consider the followingdirectly related to the domain. Table 1.Structure of the Domain Dictionary Table 1 is an example that shows how we identify words which belong to the Parent Category ‘Hardware’ and Sub-Figure 2. A layered model of the Text Mining Application category ‘Input Devices’.These are the terms which can be recognized by the tool. Inorder to perform this function optimally, we had to look into 5.3 Exclusion Listfew more aspects which are as follows: A lot of words in a text file can be treated as unwanted noise. To eliminate these, we devised a separate file which5.1 Stemming includes all such words. These include words such as the, a,Stemming refers to identifying the root of a certain word. an, if, off, on etc.There are basically two types of stemming techniques, oneis inflectional and other is derivational. Derivationalstemming can create a new word from an existing word, 6. Research Directionssometimes by simply changing grammatical category (for With abundant literature published in research into frequentexample, changing a noun to a verb) [Wikipedia]. The type pattern mining, one may wonder whether we have solvedof stemming we were able to implement is called most of the critical problems related to frequent patternInflectional Stemming. A commonly used algorithms is the mining so that the solutions provided are good enough for‘Porter’s Algorithm’ for stemming. When the normalization most of the data mining tasks. However, based on our view,is confined to regularizing grammatical variants such as there are still several critical research problems that need tosingular/plural or past/present, it is referred to inflectional be solved before frequent pattern mining can become astemming [10].To minimize the effects of inflection and cornerstone approach in data mining applications. First, themorphological variations of Words (stemming), our most focused and extensively studied topic in frequentapproach has pre-processed each word using a provided pattern mining is perhaps scalable mining methods. The setIssn 2250-3005(online) September| 2012 Page 1445
  4. 4. International Journal Of Computational Engineering Research ( Vol. 2 Issue. 5of frequent patterns derived by most of the current pattern textual sources (e.g.: the World Wide Web). Discoveringmining methods is too huge for effective usage. There are such hidden know ledge is an essential requirement forproposals on reduction of such a huge set, including closed many corporations, due to its wide spectrum of applications.patterns, maximal patterns, approximate patterns, condensed In this short survey, the notion of text mining have beenpattern bases, representative patterns, clustered patterns, and introduced and several techniques available have beendiscriminative frequent patterns, as introduced in the presented. Due to its novelty, there are many potentialprevious sections. However, it is still not clear what kind of research areas in the field of Text Mining, which includespatterns will give us satisfactory pattern sets in both finding better intermediate forms for representing thecompactness and representative quality for a particular outputs of information extraction, an XML document mayapplication, and whether we can mine such patterns directly be a good choice. Mining texts in different languages is aand efficiently. Much research is still needed to substantially major problem, since text mining tools should be able toreduce the size of derived pattern sets and enhance the work with many languages and multilingual documents.quality of retained patterns. Frequent pattern mining: current Integrating a domain knowledge base with a text miningstatus and future directions. Second, although we have engine would boost its efficiency, especially in theefficient methods for mining precise and complete set of information retrieval and information extraction phases.frequent patterns, approximate frequent patterns could bethe best choice in many applications. For example, in the Referencesanalysis of DNA or protein sequences, one would like to [1] M. A. Hearst. What is text mining?find long sequence patterns that approximately match the˜hearst/text-mining.html, Oct.sequences in biological entities, similar to BLAST. Much 2003.research is still needed to make such mining more effective [2] Ian H. Witten, Alistair Moffat, and Timothy C. Bell.than the currently available tools in bioinformatics. Third, to Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, San Francisco,make frequent pattern mining an essential task in data CA, 1999.mining, much research is needed to further develop pattern- [3] R.Baeza-Yates and B.Ribeiro-Neto, Modern Informationbased mining methods. For example, classification is an Retrieval addition-Wesley,Boston,1999essential task in data mining. Fourth, we need mechanisms [4] Jochen Dorre, Peter Gersti, Roland Seiffert (1999), Textfor deep understanding and interpretation of patterns, e.g., Mining: Finding Nuggets in Mountains of Textual Data,semantic annotation for frequent patterns, and contextual ACM KDD 1999 in San Diego, CA, USA.analysis of frequent patterns. The main research work on [5] Ah-Hwee Tan, (1999), Text Mining: The state of art and thepattern analysis has been focused on pattern composition challenges, In proceedings, PAKDD99 Workshop on(e.g., the set of items in item-set patterns) and frequency. Knowledge discovery from Advanced Databases (KDAD99), Beijing, pp. 71-76, April 1999.The semantic of a frequent pattern includes deeper [6] Danial Tkach, (1998), Text Mining Technology Turninginformation: what is the meaning of the pattern; what are the Information Into Knowledge A white paper from IBM .synonym patterns; and what are the typical transactions that [7]. Helena Ahonen, Oskari Heinonen, Mika Klemettinen, A.this pattern resides? In many cases, frequent patterns are Inkeri Verkamo, (1997), Applying Data Mining Techniquesmined from certain data sets which also contain structural in Text Analysis, Report C-1997-23, Department of Computerinformation. Finally, applications often raise new research Science, University of Helsinki, 1997issues and bring deep insight on the strength and weakness [8]. Mark Dixon,(1997), An Overview of Document Miningof an existing solution. This is also true for frequent pattern Technology,mining. On one side, it is important to go to the core part of mark/writings/dixm 97_dm.pspattern mining algorithms, and analyze the theoretical [9] Juan José García Adeva and Rafael Calvo, “Mining Text withproperties of different solutions. On the other side, although Pimiento”, University of Sydneywe only cover a small subset of applications in this article, [10] Text Mining Application Programming by Manu Konchadi.frequent pattern mining has claimed a broad spectrum of Published by Charles River Media. ISBN: 1584504609applications and demonstrated its strength at solving some [11] Automated Concept Extraction From Plain Text. Boris,problems. Much work is needed to explore new applications GARAGe Michigan State University, East Lansing MI 48824.of frequent pattern mining. For example, bioinformatics has [12] Dr. Antoine Spinakis; Asanoula Chatzimakri, “Comparisonraised a lot of challenging problems, and we believe Study of Text Mining Tools”.frequent pattern mining may contribute a good deal to it [13] Raymond J. Mooney and Razvan Bunescu, “Mining Knowledge from Text Using Information Extraction”,with further research efforts. University of Texas at Austin [14] Tova, Milo, “Active views for Electronic commerce”. Paper7. Conclusion number: EUROPE64.Most of knowledge hidden in electronic media of anorganization is encapsulated in documents. Acquiring thisknowledge implies effective querying of the documents aswell as the combination of information pieces from differentIssn 2250-3005(online) September| 2012 Page 1446