ISSN: 2277 – 9043                              International Journal of Advanced Research in Computer Science and Electron...

ISSN: 2277 – 9043
International Journal of Advanced Research in Computer Science and Electronics Engineering
Volume 1, Issue 6, August 2012
All Rights Reserved © 2012 IJARCSEE

Frequent Itemsets Patterns in Data Mining

Leena Nanda, Vikas Beniwal

Manuscript received Aug 3, 2012.
Leena Nanda, Computer Science, N.C. College of Engineering, Israna, Panipat, India.
Vikas Beniwal, Computer Science, N.C. College of Engineering, Israna, Gohana, India.

Abstract

We are in the information age. In this age, we believe that information leads to power and success. Efficient database management systems have been very important assets for managing a large corpus of data, and especially for effective and efficient retrieval of particular information from a large collection whenever needed. Unfortunately, these massive collections of data very rapidly became overwhelming. In data mining, the discovery of frequently occurring subsets of items, called itemsets, is the core of many methods. Most previous studies adopt Apriori-like algorithms, which iteratively generate candidate itemsets and check their occurrence frequencies in the database. These approaches suffer from the serious cost of repeated passes over the analyzed database. To address this problem, we propose two novel methods: the Impression method, which reduces the database activity of frequent itemset discovery algorithms, and the Transaction Database Snip Algorithm, which generates large itemsets efficiently and effectively reduces the size of the transaction database. We compare them with various existing algorithms. The proposed method requires fewer scans over the source database.

Keywords: Itemsets, Apriori, Database mining

INTRODUCTION

Data mining can be characterized in several ways:

  • Non-trivial extraction of implicit, previously unknown and potentially useful information from a data warehouse.
  • Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
  • “Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, from data warehouse.”
  • A process that uses various techniques to discover “patterns” or knowledge from data.
  • A search for hidden patterns and trends in data that are not immediately apparent from summarizing the data.

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Knowledge Discovery in Databases:

Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. Fig 1 shows data mining as a step in an iterative knowledge discovery process. The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

  • Data cleaning: also known as data cleansing, a phase in which noisy and irrelevant data are removed from the collection.
  • Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into a common source.
  • Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.
  • Data transformation: also known as data consolidation, a phase in which the selected data is transformed into forms appropriate for the mining procedure.
  • Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
  • Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
  • Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

Fig 1: Traditional process for Data Mining

How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, the following types of relationships are sought:

Classification: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.

Related Work:

Let I = {I1, I2, …, Im} be a set of m distinct attributes, also called literals. Let D be a database in which each record (tuple) T has a unique identifier and contains a set of items such that T ⊆ I. An association rule is an implication of the form X ⇒ Y, where X, Y ⊆ I are sets of items called itemsets, and X ∩ Y = ∅. Here, X is called the antecedent and Y the consequent. Two important measures for association rules, support (s) and confidence (α), can be defined as follows.

Support: The support (s) of an association rule is the ratio (in percent) of the number of records that contain X ∪ Y to the total number of records in the database. For example, if the support of a rule is 5%, then 5% of the total records contain X ∪ Y.

Confidence: For a given number of records, confidence (α) is the ratio (in percent) of the number of records that contain X ∪ Y to the number of records that contain X. For example, if a rule has a confidence of 85%, then 85% of the records containing X also contain Y. The confidence of a rule indicates the degree of correlation in the dataset between X and Y. Confidence is a measure of a rule's strength. Often a large confidence is required for association rules. [10] Different situations for support and confidence are shown in Fig 2.1.

Fig 2.1: Support and confidence situations

Basic Algorithm “APRIORI”:

Apriori is by far the best-known association rule algorithm. It uses the property that any subset of a large itemset must itself be a large itemset. Apriori generates the candidate itemsets by joining the large itemsets of the previous pass and deleting those candidates that have a subset found to be small in the previous pass, without considering the transactions in the database. By considering only the large itemsets of the previous pass, the number of candidate large itemsets is significantly reduced. In the first pass, the itemsets with only one item are counted. The large itemsets discovered in the first pass are used to generate the candidate sets of the second pass using the apriori_gen() function. Once the candidate itemsets are found, their supports are counted by scanning the database to discover the large itemsets of size two. This iterative process terminates when no new large itemsets are found. Each pass i of the algorithm scans the database once and determines the large itemsets of size i. Li denotes the large itemsets of size i, while Ci denotes the candidates of size i.

The apriori_gen() function has two steps. During the first step, Lk-1 is joined with itself to obtain Ck. The second step prunes the result.
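The level-wise Apriori procedure described above (count single items, generate candidates from the previous pass's large itemsets, prune, then scan the database once per pass) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `apriori` and `confidence` are hypothetical, `minsup` is taken as an absolute record count, and the join is done by set union rather than the ordered prefix join, which yields the same candidates once the prune step has run.

```python
from itertools import combinations


def apriori_gen(prev_large, k):
    """Build candidate k-itemsets from the large (k-1)-itemsets."""
    prev = set(prev_large)
    candidates = set()
    for p in prev:
        for q in prev:
            union = p | q
            # Join: keep unions of two (k-1)-itemsets that have exactly k items.
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset of a candidate must itself be large.
            if all(frozenset(s) in prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:  # pass 1: count the 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {c: n for c, n in counts.items() if n >= minsup}
    result = dict(large)
    k = 2
    while large:  # stop when a pass finds no new large itemsets
        candidates = apriori_gen(large.keys(), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:  # pass k: one scan of the database
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup}
        result.update(large)
        k += 1
    return result


def confidence(supports, x, y):
    """Confidence of the rule X => Y: count(X u Y) / count(X)."""
    x, y = frozenset(x), frozenset(y)
    return supports[x | y] / supports[x]


# Illustrative data (not from the paper).
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
supports = apriori(transactions, minsup=3)
print(confidence(supports, {"A"}, {"B"}))  # 0.75
```

With this toy data every single item and every pair reaches the threshold, while {A, B, C} occurs in only two records and is dropped in the third pass, at which point the loop ends.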
apriori_gen() deletes from the join result every itemset that has some (k-1)-subset not in Lk-1 (the prune step), then returns the remaining candidate k-itemsets. Consider the solved example in Fig 2.3.

Method: apriori_gen()
Input: the set of all large (k-1)-itemsets, Lk-1
Output: a superset of the set of all large k-itemsets

    // Join step (Ii denotes item i)
    insert into Ck
        select p.I1, p.I2, ..., p.Ik-1, q.Ik-1
        from Lk-1 p, Lk-1 q
        where p.I1 = q.I1 and ... and p.Ik-2 = q.Ik-2 and p.Ik-1 < q.Ik-1;

    // Pruning step
    for all itemsets c ∈ Ck do
        for all (k-1)-subsets s of c do
            if (s ∉ Lk-1) then delete c from Ck;

The subset() function returns the candidate itemsets that appear in a transaction. Counting the support of candidates is a time-consuming step in the algorithm. Consider the algorithm below.

Algorithm 1:

    // procedure largeItemsets
    C1 := I;    // candidate 1-itemsets
    Generate L1 by traversing the database and counting each occurrence of an attribute in a transaction;
    for (k = 2; Lk-1 ≠ ∅; k++)
        // Candidate itemset generation:
        // new candidate k-itemsets are generated from the large (k-1)-itemsets
        Ck = apriori_gen(Lk-1);
        // Counting the support of Ck
        Count(Ck, D);
        Lk = {c ∈ Ck | c.count ≥ minsup};
    end
    L := ∪k Lk;

Apriori always outperforms AIS. Apriori incorporates buffer management to handle the fact that the large itemsets Lk-1 and the candidate itemsets Ck, which must both be stored during the candidate generation phase of pass k, may not fit in memory. A similar problem may arise during the counting phase, where storage is needed for Ck and for at least one page to buffer the database transactions.

Improving Apriori Algorithm Efficiency

  • Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
  • Partitioning: any itemset that is potentially frequent in the database must be frequent in at least one of the partitions of the database.
  • Sampling: mining on a subset of the given data, with a lower support threshold and a method to determine completeness.

Proposed Work

Impression Algorithm:

The approaches above suffer from the serious cost of repeated passes over the analyzed database. To address this problem, we propose a novel method, called the Impression method, for reducing the database activity of frequent itemset discovery algorithms. The idea of this method is to use an Impression table for pruning candidate itemsets. The proposed method requires fewer scans over the source database. The first scan creates the Impressions, while the subsequent ones verify the discovered itemsets.

The advantages of the Impression algorithm are:
1) fewer database scans;
2) faster pruning of the candidate itemsets.

Most previous studies on frequent itemsets adopt Apriori-like algorithms, which iteratively generate candidate itemsets and check their occurrence frequencies in the database. It has been shown that Apriori in its original form suffers from the serious cost of repeated passes over the analyzed database and from the number of candidates that have to be checked, especially when the frequent itemsets to be discovered are long.

The Impression method generates an Impression table from the original database and uses it for pruning candidate itemsets in each iteration. Apriori-like algorithms use full database scans to prune candidate itemsets that fall below the support threshold. The Impression method prunes candidates by using the dynamically generated Impression table, thus reducing the number of database blocks read.

Impression Property:

The Impression table used by our method is the set of Impressions generated for each itemset in the database. The Impression of a set X is an N-bit binary number created by a bit-wise OR of the Impressions of all data items contained in X. The Impression has the following property. For any two sets X and Y, if X ⊆ Y then:

    Impr(X) AND Impr(Y) = Impr(X)

where AND is the bit-wise AND operator. The property is not reversible in general: when the formula above evaluates to TRUE, we still have to verify the result in the traditional way.

Algorithm:

    Scan D to generate Impression S and to find L1;
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori_gen(Lk-1);
        for all transactions t ∈ D do begin
            Ct = subset(Ck, t);
            for all candidates c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count ≥ minsup};
    end
    Answer = ∪k Lk;
    Scan D to verify the answer;

apriori_gen()

The apriori_gen() function works in two steps:
1. Join step
2. Prune step

Transaction Database Snip Algorithm:

Determining the large itemsets from a huge number of candidate large itemsets in the early iterations is usually the dominating factor in overall data mining performance. To address this issue, we propose an effective algorithm for candidate set generation. Explicitly, the number of candidate 2-itemsets generated by the proposed algorithm is orders of magnitude smaller than that generated by the previous method, thus resolving the performance bottleneck.

Another performance-related issue is the amount of data that has to be scanned during large itemset discovery. Generating smaller candidate sets with the snip method enables us to trim the transaction database effectively at a much earlier stage of the iteration, i.e. right after the generation of the large 2-itemsets, thereby significantly reducing the computational cost of later iterations.

The proposed algorithm has two major features:

  • Efficient generation of large itemsets.
  • Reduction of the transaction database size.

Here we add a number field to each database transaction. When the support of each candidate itemset is counted, the number field of a transaction is incremented by one each time a candidate itemset is a subset of that transaction. After the support of the candidate itemsets has been calculated, we snip those transactions of the database for which:

    number < length of the transaction

So if the transaction is ABC, its number should be 3 or greater, i.e. at least 3 candidate itemsets are subsets of this transaction.

The database is snipped on this criterion in every iteration, so the input/output cost falls as the scanning time is reduced. We also reduce the storage required for the candidate itemsets by adding to the large itemsets only those itemsets that meet the minimum support level.

Algorithm:

    Scan D to generate Impression C1 and to find L1;
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori_gen(Lk-1);
        for all transactions t ∈ D do begin
            Ct = subset(Ck, t);
            for all candidates c ∈ Ct do begin
                c.count++;
                t.n++;
            end
        end
        for all transactions t ∈ D do begin
            if (t.n < length(t))
                delete(D, t);
        end
        Lk = {c ∈ Ck | c.count ≥ minsup};
    end
    Answer = ∪k Lk;

RESULTS

Graphical representation of the time taken by the various algorithms.

Fig: Comparison of Apriori and Impression
Fig: Comparison of Apriori and Snip Algorithm
Fig: Comparison of Apriori, Impression and Snip Algorithm

The number of database scans:

- In the Apriori algorithm, the number of database scans is K, where K is the largest value of k in Lk or Ck.
- In the Impression method, the database is scanned once to create the Impression table; a final scan verifies the answer.
- In the Transaction Database Snip algorithm, the number of database scans is K, where K is the largest value of k in Lk or Ck, but the size of the database scanned is reduced in each iteration.
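The Impression property described earlier (the Impression of a set is the bit-wise OR of its items' Impressions, and X ⊆ Y implies Impr(X) AND Impr(Y) = Impr(X)) can be illustrated with a short Python sketch. The paper does not say how a single item is mapped to an N-bit number, so the sketch assumes each item sets one bit chosen by a CRC32 hash modulo N. The names `item_impression`, `impression`, `may_contain` and `count_support` and the width `N_BITS = 16` are hypothetical. For simplicity the exact subset check is done immediately after a signature match instead of in a separate verification scan.

```python
import zlib

N_BITS = 16  # signature width N; an assumed value for illustration


def item_impression(item, n_bits=N_BITS):
    """Map one item to an N-bit number with a single bit set (assumed scheme)."""
    return 1 << (zlib.crc32(str(item).encode("utf-8")) % n_bits)


def impression(itemset, n_bits=N_BITS):
    """Bit-wise OR of the impressions of all items in the set."""
    sig = 0
    for item in itemset:
        sig |= item_impression(item, n_bits)
    return sig


def may_contain(x_sig, y_sig):
    """Necessary condition for X being a subset of Y: Impr(X) AND Impr(Y) == Impr(X)."""
    return (x_sig & y_sig) == x_sig


def count_support(candidates, transactions, n_bits=N_BITS):
    """Count support, skipping transactions whose signature rules the candidate out."""
    # Impression table: one signature per transaction, built in a single pass.
    table = [(impression(t, n_bits), frozenset(t)) for t in transactions]
    counts = {}
    for c in candidates:
        c = frozenset(c)
        c_sig = impression(c, n_bits)
        # A signature match may be a false positive (two items can share a bit),
        # so it is confirmed with an exact subset test before counting.
        counts[c] = sum(
            1 for t_sig, t in table if may_contain(c_sig, t_sig) and c <= t
        )
    return counts
```

A failed signature test proves that the candidate is not contained in the transaction, so only signature matches need the more expensive exact comparison. A wider signature lowers the chance of false positives at the cost of a larger table.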
CONCLUSION

The performance study shows that the Impression method is more efficient than the Apriori algorithm for mining in terms of time, as it reduces the number of database scans. The Transaction Database Snip Algorithm is also more efficient than the Apriori algorithm in terms of time, as it snips the database in each iteration. The Impression method is roughly two times faster than the Apriori algorithm, as the Impression method does not require more than one database scan, and the Transaction Database Snip Algorithm snips the database in each iteration.

REFERENCES

[1] Arun K Pujari. Data Mining Techniques (Edition 5). Hyderabad, India: Universities Press (India) Private Limited, 2003.
[2] Ashok Savasere, Edward Omiecinski, Shamkant Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. College of Computing, Georgia Institute of Technology, Atlanta.
[3] Bing Liu, Wynne Hsu and Yiming Ma. Integrating Classification and Association Rule Mining. American Association of Artificial Intelligence.
[4] Charu C. Aggarwal and Philip S. Yu. A New Framework for Itemset Generation. Principles of Database Systems (PODS) 1998, Seattle, WA.
[5] Charu C. Aggarwal. Mining Large Itemsets for Association Rules. IBM Research Lab.
[6] Gurjit Kaur. Data Mining: An Overview. Dept. of Computer Science & IT, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India.
[7] Jiawei Han. Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann Publishers, 2004.
[8] Jiawei Han and Yongjian Fu. Discovery of Multiple-Level Association Rules from Large Databases. Proceedings of the 21st International Conference on Very Large Databases, pp. 420-431, Zurich, Switzerland, 1995.
[9] Ja-Hwung Su, Wen-Yang Lin. CBW: An Efficient Algorithm for Frequent Itemset Mining. In Proceedings of the 37th Hawaii International Conference on System Sciences, 2004.
[10] Karl Wang. Introduction to Data Mining, Chapter 4. Conference in Hawaii.
[11] M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 1996.
[12] Marek Wojciechowski, Maciej Zakrzewicz. An Hash-Mine algorithm for discovery of frequent itemsets. Institute of Computer Science, ul. Piotrowo 3a, Poland.
[13] Jong Soo Park, Ming Chen and Philip S. Yu. An Effective Hash-Based Algorithm for Mining Association Rules. IBM Thomas J. Watson Research Center, New York, 1998.
[14] Felman, Ozel. An Algorithm for Mining Association Rules Using Perfect Hashing and Database Pruning. In Proceedings of the 2007 IEEE International Conference on Data Engineering.

Leena Nanda, M.Tech (Pursuing)

Vikas Beniwal, Ph.D (Submitted)