Mining Irregular Association Rules based on Action & Non-action Type Data

325
-1

Published on

Mining Irregular Association Rules based on Action & Non-action Type
Data

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
325
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
11
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Mining Irregular Association Rules based on Action & Non-action Type Data

  1. 1. Mining Irregular Association Rules based on Action & Non-action Type Data Razan Paul, Abu Sayed Md. Latiful Hoque Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh razanpaul@yahoo.com, asmlatifulhoque@cse.buet.ac.bd Abstract rules from desire itemsets compared to any level- wise search algorithm. The algorithm treats all the Conventional positive association rules are the items as being either actions that include decision,patterns that occur frequently together. These action and output or non-actions that include facts,patterns represent what decisions are routinely made statements and any criterion. In our problem, non-based on a set of facts. Irregular association rules action items appear very frequently in the data, whileare the patterns that represent what decisions are action items rarely appear with the high frequentrarely made based on the same set of facts. Many non-action items. Negative association rules finddomains like Healthcare, Banking etc need the patterns where items are conflict with each other andirregular rule to improve their system. In this paper, do not find decision, which are made rarely. Rarewe propose a level wise search algorithm that works association rules are the patterns containing rarebased on action and non-action type data to find items which are less frequent items and do not findirregular association rules. We have observed that decisions which are made rarely. Irregular patternirregular association rules can be discovered represents wrong decision, illegal practice andefficiently based on action type and non-action type variability in decision.data from large database. To the best of ourknowledge, there is no algorithm that can determine 2. Related Worksuch type of associations. Its effectiveness has beendemonstrated by testing it for a real world patient A rare association rule forms with rare data itemsdata set. or between frequent and rare data items. However, this rule does not map items that are used to make1. Introduction decision/action to antecedent and items that represent actions or decision to consequent. Moreover it does Data mining is applied to discover new not expect antecedent to be high frequent because itinformation, which is hidden in the existing find rule with high confidence. The algorithms toinformation. One of the techniques in data mining is find these rules only assign different support basedfinding association rule. The first pioneering work to on items frequency. A number of approaches hasmine conventional positive association rules was been proposed to mine rare associations [7-11]. Inexplained in [1]. In this work, they showed finding [7], minimum support is computed for each itemassociation rule problem can be separated into two based on the frequency of the item. In [8], A fixedsub problems. After that many algorithms [1-6] have minimum support is applied to extract desiredbeen proposed to mine positive association rules itemsets, which consist of only frequent items andefficiently. These algorithms find rules that represent relative support is applied to extract desired itemsets,decisions that occur frequently based on a set of which consist of rare items. In [9], Negative-facts. In other words, rules discovered by current Binomial distribution is applied to extract Negative-association mining algorithms [1-6] are patterns that Binomial desired itemsets. In [10], the associationrepresent what decisions are usually made. In this rules are found by taking into consideration onlyproblem, we need patterns that are rarely made. We infrequent items. The approach suggested in [11]have proposed a novel mining algorithm that can finds the association rules by computing item-wiseefficiently discover the association rules from the minimum support.existing data that are strong candidates of variability. A negative association rule presents a relationshipThe algorithm uses a different candidate itemset among itemsets and states the presence of someselection process, a modified candidate generation itemsets in the absence of others. Every positiveprocess, and a different mechanism of generating association rule P Q has three corresponding978-1-4244-7571-1/10/$26.00 ©2010 IEEE 63
  2. 2. negative association rules, P ¬Q, ¬P Q and ¬P 3. Irregular Association Rules ¬Q. To extract negative association rules, mostpapers employ different correlation measures Let D = t1 , t 2 , . . . , t n be a database of nbetween attributes [12-14]. In [13], the author transactions with a set of items I = i1 , i2 , . . . , im .proposed a level-wise search algorithm for mining Let set of action items of I be AI = ai1 , ai2 , . . . , aikboth positive and negative association rules that where k is the number of action items. Let set ofemploys rule dependency measures. In [14], authors non-action items of I beproposed another level-wise search algorithm for NAI = nai1 , nai2 , . . . , naim k where is thesimultaneously extracting positive and negative number of non- action items. For an itemset P Iassociation rules using Pearson correlation and a transaction t in D, we say that t supports P if tcoefficient. In [15] , author have proposed detection has values for all the attributes in P; for conciseness,model using multi layer perceptron neural networks we also write P t. By Dp we denote the(MLP) to detect fraud/abuse problem based on transactions that contain all attributes in P. Themedical claims. It has been proposed to detect new, Dunusual and known fraudulent/abusive behaviors. It support of P is computed as P = p , i.e. theworks based on detection model which is very slow fraction of transactions containing P. A irregular ruleand need huge memory requirement to analyze is of the form: P Q, with P NAI, Q A I, Pexisting large database. In [16], author used positive Q = . To hold the rule following condition mustassociation rule to build clinical pathways, which can meet: P P or support P >=detect fraud and abuse on new data. However, this , P P , Q or support P, Q <model cannot detect fraud and abuse from the =existing large healthcare data. Our proposed P P ,Q and <= confidence where P(x)approach detects fraud and abuse from the existing P Plarge information. is the probability of x. Original Mapped Original Mapped Generate dictionary for value value value value each categorical attribute Headache 1 Yes 1 Fever 2 No 2 PatientActual Data Age Smoke Diagnosis Dictionary of Dictionary of ID Diagnosis attribute Smoke attribute 1020D 33 Yes Headache 1021D 63 No Fever Map to integer items using rule base and dictionaries Actual data If age <= 12 then 1 Medical If 13<=age<=60 then 2 domain If 60 <=age then 3 Patient Age Smoke Diagnosis knowledge If smoke = y then 1 ID If smoke = n then 2 1020D 2 1 1 If Sex = M then 1 1021D 3 2 2 If Sex = F then 2 Rule Base Data suitable for Knowledge Discovery Figure 1. Data transformation of medical data4. Mapping complex medical data to map continuous numerical data to items using thesemineable items developed rules. We have used domain dictionary approach to For knowledge discovery, the medical data have transform the data, for which medical domain expertto be transformed into a suitable transaction format knowledge is not applicable, to numerical form. Asto discover knowledge. We have addressed the cardinality of attributes except continuous numericproblem of mapping complex medical data to items data are not high in medical domain, these attributeusing domain dictionary and rule base as shown in values are mapped integer values using medicalfigure 1. The medical data are types of categorical, domain dictionaries. Therefore, the mapping processcontinuous numerical data, Boolean, interval, is divided in two phases. Phase 1: a rule base ispercentage, fraction and ratio. Medical domain constructed based on the knowledge of medicalexpert have the knowledge of how to map ranges of domain experts and dictionaries are constructed fornumerical data for each attribute to a series of items. attributes where domain expert knowledge is notFor example, there are certain conventions to applicable, Phase 2: attribute values are mapped toconsider a person is young, adult, or elder with integer values using the corresponding rule base andrespect to age. A set of rules is created for each the dictionaries.continuous numerical attribute using the knowledgeof medical domain experts. A rule engine is used to 5. The proposed algorithm 64
  3. 3. General intuition of this algorithm is as follows:based on a set of lab tests with same results, if 99% 5.1. Candidate Generationdoctors practice patients as disease x and 1 percentdoctors practice patients as other diseases, then there The idea behind candidate generation of all level-is a strong possibility that this 1 percent doctors are wise algorithms like Apriori is based on thedoing illegal practice. In other words, if consequent following simple fact: Every subset of a frequentC occurs infrequently with antecedent A and itemset is frequent so that they can reduce the number of itemsets that have to be checked.that is a strong candidate of variability. In every However, our proposed algorithm in candidatedomain, there are a set of facts. Based on these facts, generation phase check this fact if the itemsets onlydecision and action are taken. In a rule S T, if S contains non-action items. This idea makes itemsetscontains a set of facts and T contains decision or consist of both rare action items and high frequencyaction. Then such rules represent the decision T with non-action items. If the new candidate contains onetheir corresponding facts S. If S T has sufficient or more action items then it is selected as a validsupport and confidence then it represents that candidate. If the new candidate contains only non-decision or action T is taken routinely based on facts action items then, it is selected as a valid candidateS. However, if S is high frequent and rule S-T has only if every subset of new candidate is frequent.very low confidence. Then it indicates based on facts This way the algorithm keeps the new candidatesS any other decision instead of T is usually taken. It that have one or more action items.also indicates that the decision is exceptionally takenbased on these facts. The main features of the 5.2. Candidate Selectionproposed algorithm are as follows: If minimum support is only used like We have used two separate supports metrics to filter conventional association mining algorithm, out candidates. An itemset with only non-action desired itemsets that involve rarely appeared items is compared with minimum antecedent support action items with the high frequent non-action metric as non-action items can only take part in items will not be found. To find rules that antecedent part of irregular rule, which need to be involve both frequent antecedent part and rare high frequent. An itemset with one or more action consequent items, we have used two support items is compared with maximum antecedent metrics: minimum antecedent support, consequent support metric to keep rare action items maximum antecedent consequent support. with the high frequent non-action items. An itemset The proposed algorithm uses maximum with only non-action items is selected if it has confidence constraint instead of widely used support greater or equal to minimum antecedent minimum confidence constraint to form the support. An itemset with one or more action items rules. Moreover, it partitions itemsets into action is selected if it has support smaller or equal to item and non-action items instead of subset maximum antecedent consequent support. By this generation to form rules. way, itemsets are explored which has high support Rules have non-action items in the antecedent for non-action items and low support for action items and action items in the consequent. with high support non-action items. Here pruning is In candidate generation, it does not check the based mostly on minimum antecedent support, maximum antecedent consequent support and checki or more action items to keep that itemset. Let MAS is minimum antecedent support, MACSis maximum antecedent consequent support, Ij is the 5.3. Generating Association Ruleitemsets of size j, Sm is the desired itemset of size m;Ck be the sets of candidates of size k. Figure 2 shows This problem needs association rules that representthe association mining algorithm for finding irregular irregular relationships between action and non-actionrule. Like algorithm Apriori, our algorithm is also items that occur rarely together. For this reason, thebased on level wise search. Each item consists of proposed algorithm uses maximum confidenceattribute name and its value. Retrieving information constraint to form rules as it needs rule that hasof a 1-itemset, we make a new 1-itemset if this 1- high support in antecedent portion and has very lowitemset is not created already, otherwise update its support in itemset from which the rule is generated.support. The non-action 1-itemset is selected if it has It selects a rule if its confidence is less or equal tosupport greater or equal to minimum antecedent maximum confidence constraint. Moreover, it doessupport. The action 1-itemset is selected whatever not use subset generation to the itemsets to formsupport it has. By this way, 1-itemsets are explored rules. Here an itemset is partitioned into action itemwhich have high support for antecedent items and and non-action items. Action items are forhave arbitrary support for consequent items. consequent part and non-action items are for 65
  4. 4. antecedent part. Here each itemset is mapped to onlyone rule. Algorithm: Find itemsets which consist of non-action Procedure CalculateCandidatesSupport(Ck) items with high support and action items with low 1, For each transaction t of Database support based on candidate generation. 1.1 CalculateSupportFromOneTransactionFor- Input: Database, minimum antecedent support, maximum Cadidates(Ck, t); antecedent consequent support procedure CalculateSupportFromOneTransaction Output : Itemsets which are strong candidates of variability. ForCadidates(Ck, t) 1. K=1, S = {Ø}; 1.Ct =Find the subsets of Ck which are candidate 2. Read the metadata about which attributes are action t type and which are not. 2.1 c.count++ 3. Ik = Select 1-itemsets either which consist of a Algorithm : Find Assosiation rules for Variability non-action item and has support greater or equal to Finding minimum antecedent support or which consists of Input: I (Vaiavility Itemsets), maximumConfidence an action item. Output: R ( set of rules ) 4. While(Ik { 1. R = Ø 4.1 K++; 2. For each X I 4.2 Ck = Candidate_generation(Ik-1) 2.1 Antecedent set AS = (as1, as2 n){ 4.3 CalculateCandidatesSupport(Ck) where asi X and AC(asi 4.4 Ik = SelectDesiredItemSetFromCandidates 2.2 Consequent set CS = (cs1, cs2 n){ (CK, Sk , MAS, MACS); where csi X and AC(csi 4.5 S = S U Sk 2.3 if (support (AS CS)/Support (AS)) <= 5. return S maximum confidence procedure SelectDesiredItemSetFromCandidates 2.3.1 AS CS is a valid rule. (CK , Sk , MAS, MACS) 2.3.2 R = R U (AS CS) 1. For each Itemset c Ck procedure Candidate_generation(Ik-1) 1.1 If c contains only non-action items 1.For each Itemset i1 Ik-1 1.1. 1 If c.support >= MAS 1.1 For each Itemset i2 Ik-1 1.1.2 Add it to I 1.1.1 Newcandidate, NC = Union(i1,i2); 1.2 else if c contains one or more action items 1.1.2 If Size of NC is k with non-action items. 1.1.2.1 If NC contains one or more action items 1.2.1 If c.support <= MACS 1.1.2.1.1 Add it to Ck if every subset of 1.2.2 Add it to I & Sk non-action items is frequent. 1.3 If c contains only action items 1.1.2.2 else 1.3.1 Add it to I 1.1.2.2.1 If every subset of NC is frequent 2. return I 1.1.2.2.1.1 Add it to Ck othewise remove it. 2. return Ck; Figure 2. Association mining algorithm for finding irregular rule5.3.1. Lemma 1. Number of rules is equal to number maximum confidence support. Action item and nonof desired itemsets and number of discarded rules = action item of a desired itemset is mapped tomp S where S is the number of desired itemsets. antecedent items and consequent items of a rule. SoProof: A single desired itemset consists of action every desired itemset is mapped to a single validtype items and non-action type items. Action items rule. Total rules = number of desired itemsets = S.and non-action items are mapped to consequent and Let m is the average number of distinct value, eachantecedent parts respectively. Let I = { i1, i2 n} multidimensional attribute holds. P is the number ofbe the set of items to be mined, where items can be attributes to be mined. Number of possible differenteither action type or non-action type. Let AI = {ai1, rules = . Number of discarded rules =ai2 u) be the set of action items to be mined. where S is the number of desired itemsets.Let NAI= { nai1, nai2 v) be the set of non-action items to be mined. Each nai has to have 6. Results and discussionconfidence greater than minimum confidence supportto be included as 1- itemset and all ai are included as The experiments were done using PC with core 21- itemset. Let, C= {c1,c2,c3 n} be the set of duo processor with a clock rate of 1.8 GHz and 3GBcandidate itemsets. A new candidate NC is added to of main memory. The operating system wasC if the non-action part of NC named NCNA holds Microsoft Vista and implementation language wasthe following property: support (each subset of c#. We used a patient dataset to verify our method.NCNA) >= minimum antecedent support. A The dataset contains items, which are either actionscandidate c is selected for rule generation if and only that include decision, diagnosis and cost or non- actions that include lab tests, any symptom of patient 66
  5. 5. and any criterion of disease. Each instance represents which has 50273 instances and 514 attributesthe data of one patient. We have filtered out (included 150 discrete and 364 numerical attributes).instances which has noisy or missing values. The All these data are converted into mineable itemsdata set of interest has collected and preprocessed (integer representation) using domain dictionary andfrom the different local hospitals of Bangladesh, rule base. 2000 6 times in secs Number of rules 1500 A priori 4 MAS = .7 1000 MACS = .1 500 2 Proposed MAS = .85 0 0 Algorithm MACS = .5 10K 25K 50K 2 5 10 Number of transactions Maximum Confidence Figure 3. Time comparison of Apriori and Figure 4. Number of rules based on maximum proposed algorithm for the patient dataset confidence 2000 2000 Time(in seconds) times in secs 1500 1500 MAS = .85 MACS = .1 1000 1000 MC = .1 MC = .05 500 500 MACS = .05 MAS = .70 0 0 MC = .05 MC = .1 90% 70% 50% 3% 5% 10% minimum antecedent support MACS Figure 5. Time comparison of different Figure 6. Time comparison of different maximum maximum antecedent supports antecedent consequent supports 1 1 Accuracy Accuracy MACS = .1 MAS = .85 0.5 0.5 MCS = .05 MC = .1 MACS = .1 MAS = .7 0 0 MC = .05 MCS = .1 50% 70% 90% 10% 5% 2% Maximum confidence minimum antecedent support Figure 7. Accuracy of the proposed algorithm Figure 8. Accuracy of the proposed algorithm based on irregular metric based on maximum Confidence Table 1. Test result for patient dataset on minimum antecedent support, maximumMinimum antecedent support 70% 85% antecedent consequent support and checking theMaximum antecedent 10% 5%consequent support -action items. Figure 4 presents ifMaximum confidence 10% 5% maximum confidence (MC) increases, number ofNumber of desired itemsets 49 31 valid rules increases. Figure 5 shows how time isNumber of Desired rules 5 3 varied with different minimum antecedent supportTime (Seconds) 922.2 1634.56 (MAS) values for irregular rule finding algorithm.Table 1 shows test result for patient dataset, after Here we measured the performance of irregular rulerunning the program of the proposed algorithm with finding algorithm in terms of MAS keeping MACS,different parameters. Second column of the table MC, number of action items, number of non-actionpresents the test result, where we used minimum items constant. Time is not varied significantlyantecedent support of 70%, maximum antecedent because MAS has no lead to reduce disk access asconsequent support of 10% and maximum the patient data set has all sizes of candidates forconfidence of 10%. 49 desired itemsets were these MAS values. It has only lead to the number ofgenerated in total. 3 rules were discovered in total. It valid candidate generations and it can save sometook about 922.2013 seconds to find these rules. CPU time. As it has lead to the CPU time, the threeThird column of the table presents the test result, different cases take slightly different time.where we used minimum antecedent support of 85%, Figure 6 shows how time is varied with differentmaximum antecedent consequent support of 5% and MACS by keeping MAS , MC, number of actionmaximum confidence of 5%. 31 desired itemsets items, number of non-action items constant. Time iswere generated in total. 5 rules were discovered in not varied significantly because MACS has no leadtotal. It took about 1634.5634 seconds to find these to reduce disk access as the patient data set has allrules. sizes of candidates for these MAS values. It has only Figure 3 shows Apriori has taken significant lead to the number of valid candidate generationshigher time compared to the proposed algorithm. It is and it can save some CPU time. As it has lead to thebecause pruning in the proposed algorithm is based CPU time, the three different cases take slightly different time. As maximum consequent support 67
  6. 6. decreases, number of valid candidate generation New York, NY, USA, 1995, pp. 175 - 186.decreases For this reason, case with 5% MACS takes [6] A. Savasere, E. Omiecinski, and S. B. Navathe, "Anmore time than case with 3% MACS and case with Efficient Algorithm for Mining Association Rules in10% MACS takes more time than case with 5% Large Databases," in Proceedings of the 21thMACS. Figure 7 illustrates accuracy results for our International Conference on Very Large Data Bases, 1995, pp. 432 - 444.proposed algorithm based on minimum antecedent [7] B. Liu, W. Hsu, and Y. Ma, "Mining Associationsupport. The value of minimum antecedent support Rules with Multiple Minimum Supports.," infor each presented result is also indicated. The figure SIGKDD Explorations, 1999, pp. 337--341.presents MAS has no lead in accuracy, as it is not [8] H. Yun, D. Ha, B. Hwang, and K. H. Ryu, "Miningused as a parameter in selecting valid candidate and association rules on significant rare data using relativerules. Figure 8 illustrates accuracy results for our support.," Journal of Systems and Software archive,proposed algorithm based on maximum confidence. vol. 67, no. 3, pp. 181 - 191, 2003.The figure presents maximum confidence has lead in [9] M. Hahsler, "A Model-Based Frequency Constraintaccuracy as it is used as parameter in selecting valid for Mining Associations from Transaction Data.,"rules. As maximum confidence decreases, accuracy Data Mining and Knowledge Discovery, vol. 13, no. 2, pp. 137 - 166, 2006.increases and the number of discovered rules [10] L. Zhou and S. Yau, "Association rule anddecreases. It is because less confidence indicates that quantitative association rule mining among infrequentantecedent and consequent occurs rarely together in items," in International Conference on Knowledgethe dataset. Discovery and Data Mining, San Jose, California, 2007, pp. 156-167.7. Conclusion [11] R. U. Kiran and P. K. Reddy, "An improved multiple minimum support based approach to mine rare Irregular patterns represent wrong decision, association rules.," in The IEEE Symposium onillegal practice and variability in decision. In this Computational Intelligence and Data Mining, Nashville, TN, USA, 2009, pp. 340-347.paper, we propose a level wise search algorithm that [12] S. Brin, R. Motwani, and C. Silverstein, "Beyondworks based on action and non-action type data to Market Baskets: Generalizing Association Rules tofind irregular association rule. The proposed Correlations," in In The Proceedings of SIGMOD,algorithm has been applied to a real world patient AZ,USA, 1997, pp. 265-276.data set. We have shown significant accuracy in the [13] X. Wu, C. Zhang, and S. Zhang, "Efficient Mining ofoutput of the proposed algorithm. Although we have Both Positive and Negative Association Rules," ACMused level-wise search for finding irregular patterns, Transactions on Information Systems, vol. 22, no. 3,each step of our algorithm is different from any other p. 381 405, 2004.level-wise search algorithm. Rules generation from [14] M. L. Antonie and O. R. Zaïane, "Mining positive anddesired item sets is also different from conventional negative association rules: an approach for confined rules," in Proceedings of the 8th Europeanassociation mining algorithms. Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy, 2004, pp. 27 - 38.8. References [15] P. A. Ortega, C. J. Figueroa, and G. A. Ruz, "A Medical Claim Fraud/Abuse Detection System based [1] R. Agrawal, T. . Swami, "Mining on Data Mining: A Case Study in Chile," in DMIN, Association Rules between Sets of Items in Very 2006, pp. 224-231. Large Databases," in Proceedings of the 1993 ACM [16] W. S. Yanga and S. Y. Hwangb, "A Process-Mining SIGMOD international conference on Management of Framework for the Detection of Healthcare Fraud and data, Washington, D.C., 1993, pp. 207-216. Abuse.," Expert Systems with Applications, vol. 31,[2] R. Agrawal and R. Srikant, "Fast Algorithms for no. 1, p. 56 68, July 2006. Mining Association Rules in Large Databases," in Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, 1994, pp. 487 - 499.[3] S. Brin, R. Motwani, J. D. Ullman, and Shalom Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data," in Proceedings of the 1997 ACM SIGMOD international conference on Management of data, Tucson, Arizona, United States, 1997, pp. 255-264.[4] H. Mannila, H. Toivonen, and A. I. Verkamo, "Efficient Algorithms for Discovering Association Rules," in AAAI Workshop on Knowledge Discovery in Databases, 1994, pp. 181-192.[5] J. S. Park, M. S. Chen, and P. S. Yu, "An Effctive Hash based Algorithm for mining association rules," in Prof. ACM SIGMOD Conf Management of Data, 68

×