A comparative study on different types of effective methods in text mining



International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976 – 6375 (Online), Volume 4, Issue 2, March – April (2013), pp. 535-542, © IAEME: www.iaeme.com/ijcet.asp. Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

A COMPARATIVE STUDY ON DIFFERENT TYPES OF EFFECTIVE METHODS IN TEXT MINING: A SURVEY

#1 Inje Bhushan V. and #2 Prof. Mrs. Ujwala Patil
#1 Post Graduate Student M.E (Computer), Department of Computer Science and Engineering, R.C. Patel Institute of Technology, Shirpur, Dist. Dhule, Maharashtra, India.
#2 Associate Professor, Department of Computer Science and Engineering, R.C. Patel Institute of Technology, Shirpur, Dist. Dhule, Maharashtra, India.

ABSTRACT

Text mining is one of the most recent areas of research. Databases increasingly store information in text form, and extracting information from that text is the challenge that motivates text mining. This survey paper tries to cover the text mining methods that address this challenge. We present an exhaustive survey of different pattern mining methods proposed in the literature. Pattern mining methods have been used to analyze this data and identify patterns. Text mining is the discovery by computer of new, previously unknown information through automatic extraction of information from different written resources. In this survey paper we discuss the techniques that have proved effective for information retrieval in text mining.

Keywords: Text mining, Information Retrieval, Sequential pattern model, Pattern taxonomy model.

1 INTRODUCTION

Nowadays most of the information in business, industry, government and other institutions is stored as text in databases. These text databases contain semi-structured data, in that they are neither completely unstructured nor completely structured. For example,
a document may contain a few structured fields, such as title, names of authors, date of publication and category, but also some largely unstructured text components, such as the abstract and the detailed content. There has been a great deal of study on the modeling and implementation of semi-structured data in recent database research. In addition, information retrieval techniques [18], such as text indexing methods, have been developed to handle unstructured documents. In traditional search, on the other hand, the user is typically looking for already known terms in material that has been written by someone else. The problem is that the results contain a great deal of material that is not currently relevant to the user's needs, and the user must sift through it to find the relevant information. The goal of text mining is different: to discover unknown information, something that no one yet knows and so could not yet have written down.

Text mining is a variation on a field called data mining [2], which tries to find interesting patterns in large databases. Text mining, also known as Intelligent Text Analysis, Text Data Mining or Knowledge-Discovery in Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text.

Figure 1 shows a generic process model for a text mining application [1]. Starting with a collection of documents, a text mining tool would retrieve a particular document and preprocess it by checking format and character sets. Then it would go through a text analysis phase, sometimes repeating techniques until information is extracted. Three text analysis techniques are shown in the example, but many other combinations of techniques could be used depending on the goals of the organization.
The resulting information can be placed in a management information system, yielding an abundant amount of knowledge for the user of that system.

Figure 1. Generic Process Model for a Text Mining Application

Information Extraction

A starting point for computers to analyze unstructured text is information extraction [2]. An information extraction technique identifies key phrases and relationships within text. It does this by looking for predefined sequences in text, a process called pattern matching. The technique infers the relationships between all the identified people, places, and times to provide the user with meaningful information. This technology can be very useful when dealing with large volumes of text. Traditional data mining techniques assume that the information to be "mined" is already in the form of a relational database. Unfortunately, for many applications, electronic information is only available in the form of free natural-language documents rather than structured databases.
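The pattern matching step described above can be illustrated with a minimal sketch. The pattern, the function name `extract_visits` and the sample sentences are hypothetical and chosen only for illustration; practical information extraction systems rely on much richer rules or trained models than a single regular expression.

```python
import re

# Hypothetical predefined sequence: a capitalised name, the word "visited",
# a capitalised place name, and a four-digit year.
VISIT_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) visited "
    r"(?P<place>[A-Z][a-z]+(?: [A-Z][a-z]+)*) in (?P<year>\d{4})"
)


def extract_visits(text):
    """Return one structured record (a dict) per match of the pattern."""
    return [match.groupdict() for match in VISIT_PATTERN.finditer(text)]


sample = "Ada Lovelace visited London in 1843. Alan Turing visited Princeton in 1936."
for record in extract_visits(sample):
    # Each record has the fields person, place and year, i.e. one row of a
    # structured table that a later mining step could consume.
    print(record)
```

Each match turns a stretch of free text into a record with fixed fields, which is the sense in which extraction converts unstructured documents into something closer to a relational table.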
Since IE addresses the problem of transforming a corpus of textual documents into a more structured database, the database constructed by an IE module can be provided to the KDD module for further mining of knowledge, as illustrated in Figure 2 [2].

Figure 2. Overview of Information Extraction Based Text Mining

Knowledge discovery

Knowledge discovery [19] and data mining have attracted a great deal of attention because of an imminent need to turn data into useful information and knowledge. Many applications, such as market analysis and business management, can benefit from the information and knowledge extracted from a large amount of data. Knowledge discovery can be viewed as the process of nontrivial extraction of information from large databases, information that is implicitly present in the data, previously unknown and potentially useful for users. Data mining is therefore an essential step in the process of knowledge discovery in databases. In the past decade, a significant number of data mining techniques have been presented in order to perform different knowledge tasks. These techniques include association rule mining, frequent itemset mining, sequential pattern mining, maximum pattern mining and closed pattern mining.

Most of them are proposed for the purpose of developing efficient mining algorithms to find particular patterns within a reasonable and acceptable time frame. With a large number of patterns generated by data mining approaches, how to use and update these patterns effectively is still an open research issue. In this paper, we focus on the development of a knowledge discovery model to effectively use and update the discovered patterns and apply it to the field of text mining.

Text mining is the discovery of interesting knowledge in text documents.
It is a challenging issue to find accurate knowledge (or features) in text documents to help users find what they want. In the beginning, Information Retrieval (IR) provided many term-based methods to solve this challenge, such as Rocchio and probabilistic models [4], rough set models [4], and Okapi BM25 and SVM [20] based filtering models.

The advantages of term-based methods lie in their computational performance, which has improved through work in IR and machine learning. However, term-based methods suffer from the problems of polysemy and synonymy, where polysemy means a word has multiple meanings, and synonymy means multiple words have the same meaning. The semantic meaning of many discovered terms is therefore uncertain when it comes to answering what users want.
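The synonymy and polysemy problems can be made concrete with a small sketch. It uses plain term-frequency vectors and cosine similarity as a stand-in for a term-based method; this is an illustrative simplification, not one of the cited models (Rocchio, BM25, SVM), and the example texts and function names are assumptions.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for a whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Synonymy: same meaning, different words, so the score is zero.
synonymy_score = cosine(term_vector("car repair"),
                        term_vector("automobile maintenance"))

# Polysemy: "bank" is shared but means different things, so an unrelated
# document still receives a positive score.
polysemy_score = cosine(term_vector("river bank"),
                        term_vector("bank loan"))

print(synonymy_score, polysemy_score)
```

A relevant document that uses a synonym is missed entirely, while an irrelevant document that shares an ambiguous word is matched, which is the weakness the phrase-, concept- and pattern-based methods try to address.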
2. METHODS AND MODELS USED IN TEXT MINING

Many techniques have been developed to solve the central problem in text mining, which is the retrieval of relevant information according to the user's requirement. Research in text mining is therefore broadly divided into several directions. The data mining techniques often used to address the problem include association rule mining [8], sequential pattern mining [16], closed pattern mining [4], frequent itemset mining [16], maximum pattern mining [4] and minimum pattern mining [4].

From the information retrieval point of view, four basic methods are used:
1) Term Based Method (TBM).
2) Phrase Based Method (PBM).
3) Concept Based Method (CBM).
4) Pattern Taxonomy Method (PTM).

Some further models are used to evaluate and improve efficiency in text mining:
A. Sequential pattern mining (SPM).
B. Sequential closed pattern mining (SCPM).
C. Frequent itemset mining (NSPM).
D. Frequent closed itemset mining (NSCPM).

The algorithms from the data mining community inherited some characteristics from the association rule mining algorithms, and are best suited to work with many (from hundreds of thousands to millions) sequences of relatively small length (from 4 to 20). The first algorithms proposed for this task were AprioriAll [2] and GSP [16], from Agrawal and Srikant. Other algorithms, such as FreeSpan [8], PrefixSpan [4], SPADE [19], CloSpan [18] and SPAM [3], were developed afterwards and successively improved the task of finding frequent sequence patterns.
Algorithms with particular features, such as MEMISP [11], which is a memory indexing approach, or SPIRIT [5], which integrates constraints into the mining process through regular expressions, can also be found in the literature.

Term Based Method (TBM)

The advantage of TBM [3] is efficient computational performance. Its limitation is that it suffers from the problems of polysemy and synonymy: polysemy means a word has multiple meanings, and synonymy means multiple words have the same meaning. Methods based on TBM include:
1. Rocchio and probabilistic models [4].
2. Rough set models [4].
3. BM25 and SVM based filtering models [4].

Phrase Based Method (PBM)

In PBM [4], phrases are less ambiguous and more discriminative than individual terms. Nevertheless, phrase-based methods have shown discouraging performance. The likely reasons include:
1) Phrases have inferior statistical properties to terms,
2) They have low frequency of occurrence, and
3) There are large numbers of redundant and noisy phrases among them [4].
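The low-frequency property of phrases listed above can be seen on even a tiny corpus. The following sketch counts single terms and two-word phrases (bigrams) in three made-up documents; the corpus and the helper name `count_ngrams` are assumptions for illustration, and a bigram is only a rough proxy for a phrase.

```python
from collections import Counter


def count_ngrams(docs, n):
    """Count every run of n consecutive tokens across all documents."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


docs = [
    "text mining finds patterns in text",
    "pattern mining in text documents",
    "mining text for useful patterns",
]

term_counts = count_ngrams(docs, 1)
phrase_counts = count_ngrams(docs, 2)

# The most frequent term occurs more often than the most frequent phrase,
# and there are more distinct phrases than distinct terms.
print(term_counts.most_common(1), phrase_counts.most_common(1))
print(len(term_counts), len(phrase_counts))
```

As the phrase length grows, the counts become sparser still, which is one way to read reasons 1) and 2) in the list above.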
Concept Based Method

Text mining techniques are mostly based on statistical analysis of a word or phrase. The statistical analysis of term frequency captures the importance of a term within a document only. However, two terms can have the same frequency in the same document while the meaning that one term contributes is more appropriate than the meaning contributed by the other. Hence, the terms that capture the semantics of the text should be given more importance. To this end, a concept-based mining model has been introduced [6].

In "Concept-Based Information Retrieval Using Explicit Semantic Analysis" [5], the authors describe concept-based IR using Explicit Semantic Analysis (ESA), which makes use of concepts that encompass human world knowledge, encoded into resources such as Wikipedia (from which an ESA model is generated), and that allow intuitive reasoning and analysis. Feature selection is applied to the query concepts to optimize the representation and remove noise and ambiguity.

Pattern Taxonomy Method

Pattern mining has been extensively studied in data mining communities for many years. Many data mining techniques have been proposed in the last decade. These techniques include association rule mining, frequent itemset mining, sequential pattern mining, maximum pattern mining, and closed pattern mining. However, using this discovered knowledge (or these patterns) in the field of text mining is difficult and ineffective. The reason is that some useful long patterns with high specificity lack support (i.e., the low-frequency problem).
Ning Zhong et al. argue that not all frequent short patterns are useful. Hence, misinterpretations of patterns derived from data mining techniques lead to ineffective performance.

In that work [3], an effective pattern discovery technique is proposed to overcome the low-frequency and misinterpretation problems for text mining. The proposed technique uses two processes, pattern deploying and pattern evolving, to refine the discovered patterns in text documents. The experimental results show that the proposed model outperforms not only other pure data mining-based methods and the concept-based model, but also term-based state-of-the-art models, such as BM25 and SVM-based models.

Sequential pattern mining (SPM)

Before elaborating on SPM, we first consider what sequence data is. Sequence data is omnipresent. Customer shopping sequences, medical treatment data, data related to natural disasters, science and engineering process data, stock and market data, telephone calling patterns, weblog click streams, program execution sequences, DNA sequences, and gene expression and structure data are some examples of sequence data.

A sequential pattern mining algorithm should:
A. Find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold,
B. Be highly efficient and scalable, involving only a small number of database scans, and
C. Be able to incorporate various kinds of user-specific constraints.

There are two major difficulties in sequential pattern mining:
(1) Effectiveness: the mining may return a huge number of patterns, many of which could be uninteresting to users, and
(2) Efficiency: it often takes substantial computational time and space to mine the complete set of sequential patterns in a large sequence database.
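To make the notions of support and of a complete set of patterns concrete, here is a naive level-wise sketch over sequences of single items. It is not GSP, SPADE, PrefixSpan or any other cited algorithm; it rescans the whole database for every candidate, which is exactly the cost those algorithms are designed to avoid. The function names and the toy database are assumptions for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order,
    not necessarily next to each other."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def mine_sequential_patterns(database, min_sup):
    """Return every sequential pattern with support >= min_sup (naive search)."""
    if min_sup < 1:
        raise ValueError("min_sup must be at least 1")
    items = sorted({item for seq in database for item in seq})
    level = [(i,) for i in items if support((i,), database) >= min_sup]
    frequent_items = [p[0] for p in level]
    frequent = {}
    while level:
        for pattern in level:
            frequent[pattern] = support(pattern, database)
        # Extend each frequent pattern by one frequent item. This is safe
        # because any prefix of a frequent pattern is itself frequent.
        candidates = [p + (i,) for p in level for i in frequent_items]
        level = [c for c in candidates if support(c, database) >= min_sup]
    return frequent


database = [
    ("text", "mining", "pattern"),
    ("text", "pattern", "mining"),
    ("mining", "text", "pattern"),
    ("text", "mining"),
]
patterns = mine_sequential_patterns(database, min_sup=3)
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```

Lowering `min_sup` quickly increases the number of returned patterns, which illustrates the effectiveness difficulty, while the repeated calls to `support` illustrate the efficiency difficulty.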
Table 1. Comparative summary of text mining methods

| Model / Method | Approach / Algorithm | Author | Parameters | Inference | Inputs |
|---|---|---|---|---|---|
| Information Retrieval | Rocchio [24] | Thorsten Joachims | Models documents and queries as TF-IDF vectors | Learning is very fast in this method | Set of documents that are not relevant |
| Association Rule Mining [8] | Apriori [8] | Agrawal and Srikant, 1994 | Association rules with high support and confidence | The proposed algorithms always outperform AIS and SETM | Items in a large database of transactions |
| Sequential Pattern Mining [16] | SPADE [14] | Mohammed J. Zaki et al. | Database D for sequence mining consists of a collection Sid, Eid | SPADE outperforms by a factor of two, and by an order of magnitude with some pre-processed data | Text document |
| Sequential Pattern Mining [16] | SPAM [] | Jay Ayres et al., 2002 | Sequence of document mining consists of a collection | SPAM outperforms previous works by up to an order of magnitude | Vertical bitmap representation of the database with efficient support counting |
| Close Pattern Mining [7] | CHARM [23] | Mohammed J. Zaki et al. | Set of all frequent closed itemsets | CHARM performed to discover the longest pattern | IBM Almaden, pumsb and pumsb contain census data |
| Close Pattern Mining [7] | CloSpan [7] | X. Yan, J. Han, and R. Afshar | Frequent patterns in the dataset | It mines long sequences for KDD and produces a significantly smaller number of discovered sequences | Sequence of s, projected DB D8 and min_sup |
| Frequent Itemset Mining [13][27] | FPgrowth [27] | C. Borgelt, 2005 | Prefix tree representation of the given database of transactions | This algorithm can save considerable amounts of memory for storing the transactions | Set of transactions |
| Maximal Pattern Mining [4][28] | MaxMiner [28] | Mohammed J. Zaki | Real and synthetic datasets | MaxMiner shows good performance on some datasets which were not used in previous studies | Frequent itemset |
| Maximal Pattern Mining [4][28] | GenMax [22] | Karam Gouda, Mohammed J. Zaki, 2003 | Frequent items and the frequent 2-itemsets | This algorithm works 2 times faster than others like Mafia PP using dataset "Chess and pumsb" | Dataset is in the vertical "tidset" format |
| Pattern Taxonomy [3] | D-Pattern Mining [3] | Ning Zhong et al., 2012 | Deploying process, which consists of the d-pattern discovery and term support evaluation | The proposed technique uses two processes, pattern deploying and pattern evolving, to refine the discovered patterns in text documents | Positive documents D+, minimum support min_sup |
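Table 1 distinguishes frequent, closed and maximal pattern mining. The difference can be sketched with a brute-force enumeration over a toy transaction database. This is only a definition-level illustration, not CHARM, GenMax, MaxMiner or FP-growth, and it is exponential in the number of items; the function names and the data are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Map every itemset with support >= min_sup to its support count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(set(candidate) <= t for t in transactions)
            if count >= min_sup:
                result[frozenset(candidate)] = count
    return result


def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of equal support."""
    return {
        x: s for x, s in frequent.items()
        if not any(x < y and s == sy for y, sy in frequent.items())
    }


def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {x: s for x, s in frequent.items() if not any(x < y for y in frequent)}


transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
frequent = frequent_itemsets(transactions, min_sup=2)
closed = closed_itemsets(frequent)
maximal = maximal_itemsets(frequent)
print(len(frequent), len(closed), len(maximal))
```

On this data there are five frequent itemsets, three closed ones and two maximal ones. Closed itemsets preserve all support counts in fewer patterns, while maximal itemsets are more compact still but lose the supports of their subsets.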
CONCLUSIONS

We discussed the basics of text mining methods. We presented an exhaustive survey of different pattern mining methods proposed in the literature. Pattern mining methods have been used to analyze this data and identify patterns. Such patterns have been used to implement efficient systems that can make recommendations based on previously observed patterns, help in making predictions, improve the usability of systems, detect events and, in general, help in making strategic product decisions. We envision that the power of text mining methods such as sequential pattern mining and the pattern taxonomy model has not yet been fully exploited. We hope to see many more strong applications of these methods in a variety of domains in the years to come.

REFERENCES

1. Weiguo Fan, Linda Wallace, Stephanie Rich, and Zhongju Zhang (2005), "Tapping into the Power of Text Mining", Journal of ACM, Blacksburg.
2. N. Kanya and S. Geetha (2007), "Information Extraction: A Text Mining Approach", IET-UK International Conference on Information and Communication Technology in Electrical Sciences, IEEE, Dr. M.G.R. University, Chennai, Tamil Nadu, India, 1111-1118.
3. Ning Zhong, Yuefeng Li, "Effective pattern discovery in text mining", IEEE Transactions on Knowledge and Data Engineering, VOL. , NO.
4. Y. Li and N. Zhong, "Interpretations of Association Rules by Granular Computing," Proc. IEEE Third Int'l Conf. Data Mining (ICDM '03), pp. 593-596, 2003.
5. Ofer Egozi, Shaul Markovitch, and Evgeniy Gabrilovich, "Concept-Based Information Retrieval Using Explicit Semantic Analysis", ACM Transactions on Information Systems, Vol. 29, No. 2, Article 8, Publication date: April 2011.
6. Shady Shehata, Fakhri Karray, and Mohamed Kamel, "Enhancing text clustering using concept-based mining model". In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), pages 1043–1048, Hong Kong, 2006.
7. X. Yan, J. Han, and R. Afshar, "Clospan: Mining Closed Sequential Patterns in Large Datasets," Proc. SIAM Int'l Conf. Data Mining (SDM '03), pp. 166-177, 2003.
8. R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), pp. 478-499, 1994.
9. J.S. Park, M.S. Chen, and P.S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '95), pp. 175-186, 1995.
10. R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. 21th Int'l Conf. Very Large Data Bases (VLDB '95), pp. 407-419, 1995.
11. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, "Prefixspan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Proc. 17th Int'l Conf. Data Eng. (ICDE '01), pp. 215-224, 2001.
12. J. Han and K.C.-C. Chang, "Data Mining for Web Intelligence," Computer, vol. 35, no. 11, pp. 64-70, Nov. 2002.
13. J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1-12, 2000.
14. M. Zaki, "Spade: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, vol. 42, pp. 31-60, 2001.
15. M. Seno and G. Karypis, "Slpminer: An Algorithm for Finding Frequent Sequential Patterns Using Length-Decreasing Support Constraint," Proc. IEEE Second Int'l Conf. Data Mining (ICDM '02), pp. 418-425, 2002.
16. Y. Huang and S. Lin, "Mining Sequential Patterns Using Graph Search Techniques," Proc. 27th Ann. Int'l Computer Software and Applications Conf., pp. 4-9, 2003.
17. M. Gupta and J. Han, "Approaches for Pattern Discovery Using Sequential Data Mining", 2011, Information Science Reference.
18. S.T. Dumais, "Improving the Retrieval of Information from External Sources," Behavior Research Methods, Instruments, and Computers, vol. 23, no. 2, pp. 229-236, 1991.
19. Fatudimu I.T, Musa A.G and Ayo C.K, "Knowledge Discovery in Online Repositories: A Text Mining Approach", European Journal of Scientific Research, ISSN 1450-216X, Vol. 22, No. 2 (2008), pp. 241-250, © EuroJournals Publishing, Inc. 2008.
20. S. Robertson and I. Soboroff, "The Trec 2002 Filtering Track Report," TREC, 2002, trec.nist.gov/ pubs/ trec11/ papers/ OVER. FILTERING.ps.gz.
21. R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements", 5th Intl Conf. on Extending Database Technology (EDBT), Avignon, France, March 1996.
22. Karam Gouda, Mohammed J. Zaki, "GenMax: An Efficient Algorithm for Mining Maximal Frequent Itemsets", Data Mining and Knowledge Discovery 11, 1–20, 2005, Springer Science + Business Media, Inc.
23. Mohammed J. Zaki and Ching-Jui Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining".
24. T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with tfidf for Text Categorization," Proc. 14th Int'l Conf. Machine Learning (ICML '97), pp. 143-151, 1997.
25. Bart Goethals, "Frequent Set Mining", Data Mining and Knowledge Discovery Handbook, chapter no. 17.
26. Gosta Grahne and Jianfei Zhu, "Efficiently Using Prefix-trees in Mining Frequent Itemsets".
27. C. Borgelt, 2005, "An Implementation of the FP-growth Algorithm", Workshop Open Source Data Mining Software, OSDM05, Chicago, IL, 1-5. ACM Press, USA.
28. Mohammed J. Zaki, "Mining Closed & Maximal Frequent Itemsets", NSF CAREER Award IIS-0092978, DOE Early Career Award DE-FG02-02ER25538, NSF grant EIA-0103708.
29. Prakasha S, Shashidhar Hr and Dr. G T Raju, "A Survey on Various Architectures, Models and Methodologies for Information Retrieval", International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182 - 194, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
30. M. Karthikeyan, M. Suriya Kumar and Dr. S. Karthikeyan, "A Literature Review on the Data Mining and Information Security", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141 - 146, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.