Finding Symmetric Association Rules to Support Medical Qualitative Research



Razan Paul, Abu Sayed Md. Latiful Hoque
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh

978-1-4244-7571-1/10/$26.00 ©2010 IEEE

Abstract

In medical qualitative research, medical researchers analyze historical patient data to verify known relationships and to discover unknown relationships among medical attributes. All the existing algorithms for this problem use asymmetric measures, so only one direction of a rule (P -> Q or Q -> P) is taken into account. However, medical researchers are interested in finding both asymmetric and symmetric relationships among medical attributes. We have developed pruning strategies and devised an efficient algorithm for the symmetric relationship problem. We propose measuring the interestingness of known and unknown symmetric relationships via the correlation between antecedent items and consequent items. We have demonstrated the effectiveness of the algorithm by testing it on a real dataset.

1. Introduction

In medical qualitative research, medical researchers are interested in finding association rules to see the relationships among specified items and to see how one group of items is related to a different group of items. For instance, a medical researcher can discover the relationship between the age and the HbA1c% of a patient. Medical researchers are interested in finding relationships among various diseases, lab tests, symptoms, etc. Due to the high dimensionality of medical data, conventional association mining algorithms [1-8] discover a very high number of rules with many attributes, which are tedious and redundant to medical researchers and not among their desired set of attributes. Medical researchers may need to find the relationship between rare and highly frequent medical items, but conventional mining processes for association rules explore interesting relationships between data items that occur frequently together [1-7].

The rare item problem is presented in [9]. According to this problem, if minimum support is set too high, the algorithms produce rules that eliminate all infrequent itemsets. On the other hand, if minimum support is set too low, the algorithms produce far too many rules that are meaningless. To deal with this problem, many algorithms have been proposed to mine rare associations [10-13]. However, these algorithms do not find the relationship between rare and highly frequent medical items. In [7], the authors propose a few algorithms that allow a user to specify Boolean expressions over the presence or absence of items in an association rule, or to specify a certain hierarchy [6] of items in an association rule. These approaches are not enough to mine the rules desired by medical researchers.

All the existing algorithms [14-18] for discovering interesting association rules in medical data find only asymmetric patterns, whereas medical researchers are interested in finding both asymmetric and symmetric relationships among medical attributes. For these reasons, we have proposed an association-mining algorithm that finds rules among the attributes of interest to the researcher, so that it can support the researchers' decision making. The problem in discovering relationships is to avoid redundant relationships and to control their quality. This algorithm allows the researchers to define the following constraints: group information of attributes, minimum confidence and support for each group, which items will appear in the antecedent, which items will appear in the consequent, and which attributes may appear in both. One attribute can belong to several groups.

2. Mapping complex medical data to mineable items

For knowledge discovery, the medical data have to be transformed into a suitable transaction format. We have addressed the problem of mapping complex medical data to items using a domain dictionary and a rule base, as shown in Figure 1. The medical data are of the following types: categorical, continuous numerical, boolean, interval, percentage, fraction and ratio.
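As an illustration only, the mapping with a rule base and domain dictionaries might be sketched in Python as follows. The age thresholds and the two sample records follow Figure 1. The names `age_rule`, `RULE_BASE`, `build_dictionary` and `transform` are hypothetical and are not part of the authors' implementation, which was written in C#. Assigning dictionary codes in order of first appearance is also an assumption.

```python
from typing import Callable, Dict, List, Tuple

# Rule base: one expert-defined rule per continuous attribute.
# The thresholds follow Figure 1; the function name is hypothetical.
def age_rule(age: float) -> int:
    if age <= 12:
        return 1
    if age <= 60:
        return 2
    return 3

RULE_BASE: Dict[str, Callable[[float], int]] = {"Age": age_rule}

def build_dictionary(values: List[str]) -> Dict[str, int]:
    """Assign integer codes 1, 2, ... to distinct values in order of first appearance."""
    codes: Dict[str, int] = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
    return codes

def transform(records: List[dict], rule_base: dict) -> Tuple[List[dict], dict]:
    """Map every attribute value of every record to an integer item code."""
    # Build one dictionary per attribute that has no rule in the rule base.
    categorical = [a for a in records[0] if a not in rule_base]
    dictionaries = {a: build_dictionary([r[a] for r in records]) for a in categorical}
    # Map each value with its rule if one exists, otherwise with its dictionary.
    mapped = [
        {a: (rule_base[a](v) if a in rule_base else dictionaries[a][v])
         for a, v in record.items()}
        for record in records
    ]
    return mapped, dictionaries

# The two patients shown in Figure 1 (patient IDs omitted).
records = [
    {"Age": 33, "Smoke": "Yes", "Diagnosis": "Headache"},
    {"Age": 63, "Smoke": "No", "Diagnosis": "Fever"},
]
mapped, dictionaries = transform(records, RULE_BASE)
print(mapped)
```

Keeping the rule base as a mapping from attribute name to function means that attributes without expert rules fall through to the dictionaries without further configuration.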
Medical domain experts have the knowledge of how to map ranges of numerical data for each attribute to a series of items. For example, there are certain conventions for considering a person young, adult, or elderly with respect to age. A set of rules is created for each continuous numerical attribute using the knowledge of medical domain experts. A rule engine is used to map continuous numerical data to items using these rules.

We have used a domain dictionary approach to transform the data for which medical domain expert knowledge is not applicable into numerical form. As the cardinality of attributes other than continuous numeric data is not high in the medical domain, these attribute values are mapped to integer values using medical domain dictionaries. Therefore, the mapping process is divided into two phases. Phase 1: a rule base is constructed from the knowledge of medical domain experts, and dictionaries are constructed for attributes where domain expert knowledge is not applicable. Phase 2: attribute values are mapped to integer values using the corresponding rule base and the dictionaries.

Dictionary of Diagnosis attribute (generated for each categorical attribute):
  Original value | Mapped value
  Headache | 1
  Fever | 2

Dictionary of Smoke attribute:
  Original value | Mapped value
  Yes | 1
  No | 2

Actual data:
  Patient ID | Age | Smoke | Diagnosis
  1020D | 33 | Yes | Headache
  1021D | 63 | No | Fever

Rule base (medical domain knowledge):
  If age <= 12 then 1
  If 13 <= age <= 60 then 2
  If age > 60 then 3
  If smoke = y then 1
  If smoke = n then 2
  If Sex = M then 1
  If Sex = F then 2

Data suitable for knowledge discovery (mapped to integer items using the rule base and dictionaries):
  Patient ID | Age | Smoke | Diagnosis
  1020D | 2 | 1 | 1
  1021D | 3 | 2 | 2

Figure 1. Data transformation of medical data

3. The proposed algorithm

The main theme of this algorithm is based on the following two statements. Interesting relationships among various medical attributes are concealed in subsets of the attributes, but do not come out on all attributes taken together. Not all interesting relationships among medical attributes have the same support and confidence. The algorithm constructs candidate itemsets based on the group constraints and uses the corresponding support of each group in the candidate selection process to discover all possible desired itemsets of that group. The goals of this algorithm are to find the rules desired by medical researchers and to run fast. The features of the proposed algorithm are as follows:

- It allows grouping of attributes to find relationships among medical attributes. This provides control over the search process.
- Minimum confidence and support can vary from one group to another.
- One item can belong to several groups.
- Attributes are constrained to appear on the antecedent side, the consequent side, or either side of the rule.
- It does not generate subsets of the full desired itemset, but generates subsets only of the items that can appear in both consequent and antecedent.
- Uninteresting relationships among medical attributes are avoided in the candidate generation phase, which reduces the number of rules, finds only interesting relationships and makes the algorithm fast.

Confidence is not an ideal method for ranking symmetric medical relationships because it does not account for the frequency of the consequent. For ranking medical relationships, a direct measure of association between variables is a more suitable scheme. For a medical relationship $s \rightarrow t$, $s$ is a group of medical items where each item is constrained to appear in the antecedent or in both, and $t$ is a group of medical items where each item is constrained to appear in the consequent or in both. Moreover, $s \cap t = \emptyset$. For this relationship, the support is defined as $P(s, t)$ and the confidence is defined as $P(s, t)/P(s)$, where $P$ is the probability. The correlation coefficient (also known as the $\phi$-coefficient) measures the degree of relationship between two random variables by measuring the degree of linear interdependency. It is defined as the covariance between the two variables divided by the product of their standard deviations:

$$\rho_{st} = \frac{\mathrm{Cov}(s, t)}{\sigma_s \sigma_t}$$

Here $\mathrm{Cov}(s, t)$ represents the covariance of the two variables, and $\sigma_s$ and $\sigma_t$ stand for their standard
deviations. The covariance measures how the two variables change together:

$$\mathrm{Cov}(s, t) = P(s, t) - P(s)P(t)$$

As we know, the standard deviation is the square root of the variance, and variance is a special case of covariance when the two variables are identical:

$$\sigma_s = \sqrt{\mathrm{Var}(s)} = \sqrt{\mathrm{Cov}(s, s)} = \sqrt{P(s, s) - P(s)P(s)} = \sqrt{P(s) - P(s)^2}$$

Similarly, $\sigma_t = \sqrt{P(t) - P(t)^2}$. Therefore

$$\rho_{st} = \frac{P(s, t) - P(s)P(t)}{\sqrt{P(s) - P(s)^2}\,\sqrt{P(t) - P(t)^2}}$$

Here $P(s, t)$ is the support of the itemset consisting of both $s$ and $t$. Let the support of this itemset be $S_{st}$. $P(s)$ and $P(t)$ are the supports of antecedent $s$ and consequent $t$ respectively; let them be $S_s$ and $S_t$. The values of $S_{st}$, $S_s$ and $S_t$ are computed during the desired itemset generation of our proposed algorithm. Using these values, we can calculate the correlation of every medical relationship rule between one group of medical items and another group of medical items. The correlation value indicates to medical researchers how strong a medical relationship is from the perspective of historical data.

$$\rho_{st} = \frac{S_{st} - S_s S_t}{\sqrt{S_s - S_s^2}\,\sqrt{S_t - S_t^2}}$$

So, by substituting the values of $S_{st}$, $S_s$ and $S_t$ in the association rule generation phase, we obtain a single metric, the correlation coefficient, that represents how strongly antecedent and consequent are medically related to each other. For each medical relationship or rule, this metric is used to indicate the strength of the relationship between one group of items and another group of items to support medical qualitative research. The value of $\rho_{st}$ ranges between -1 and +1. If two variables are independent then $\rho_{st}$ equals 0. When $\rho_{st}$ equals +1 the variables are considered perfectly positively correlated. A positive correlation is evidence of a general tendency that when a group of attribute values $s$ occurs for a patient, another group of attribute values $t$ occurs for the same patient. A more positive value means a stronger relationship. When $\rho_{st}$ equals -1 the variables are considered perfectly negatively correlated.

Figure 2 shows the association-mining algorithm to support medical research. Like Apriori, our algorithm is based on level-wise search. The major difference between our proposed algorithm and Apriori is the candidate generation process. Each item consists of an attribute name and its value. Having retrieved the information of a 1-itemset, we create a new 1-itemset if it has not been created already; otherwise we update its support. A 1-itemset can belong to zero or more groups. A 1-itemset is selected if its support is greater than or equal to the support of one of its corresponding groups. As a medical attribute value carries patient information that is multidimensional, the algorithm performs the count operation by comparing the values of attributes, instead of determining the presence or absence of attribute values, to calculate support.

3.1. Candidate Generation and Selection

The intuition behind candidate generation in all level-wise algorithms like Apriori is based on the following simple fact: every subset of a frequent itemset is frequent, so the number of itemsets that have to be checked can be reduced. However, the idea behind candidate generation in the proposed algorithm is that every item in an itemset has to be in the same group. This idea produces new candidates that consist of items in the same group and keeps itemsets that consist of both rare items and highly frequent items. If all the items in a new candidate set are in the same group, it is selected as a valid candidate; otherwise the new candidate is not added to the valid candidate itemsets. Each group has its own support and confidence, and each candidate itemset belongs to a particular group. After finding the group id of a candidate itemset, the algorithm uses the corresponding support for candidate selection, whereas Apriori uses a single support threshold for all candidate itemsets. In this way, the itemsets desired by medical researchers are explored.

3.2. Generating association rules

Let AC(item) be the function which returns one of three values: 1 if the item is constrained to be in the antecedent of a rule, 2 if it is constrained to be in the consequent and 0 if it can be in either. Using this function, an itemset is partitioned into an antecedent set, a consequent set and a both set. Moreover, the algorithm does not apply subset generation to whole itemsets to form rules as conventional association mining algorithms do; it applies subset generation only to the both set. Each subset of the both set is added to the antecedent part in one rule and to the consequent part in another rule. Each itemset belongs to a particular group. In addition, there is a different confidence for each group, whereas Apriori uses a single confidence for all itemsets. After finding the group id of an itemset, the algorithm uses the corresponding confidence to form rules. In this way, the rules desired by medical researchers are explored.
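The rule generation of Section 3.2, together with the correlation ranking defined earlier in Section 3, can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: items are represented as (attribute, value) tuples, each patient record is a Python set of such items, the dictionary `ac` stands in for AC(item), a single confidence threshold is passed in instead of being looked up per group, and the correlation is reported as 0.0 when a standard deviation is zero. All names and the four sample records are hypothetical. Each subset Y of the both set is enumerated once; because the complement of Y is itself a subset, this covers both assignments described in the text.

```python
from itertools import combinations
from math import sqrt

def support(itemset, transactions):
    """Fraction of transactions containing every (attribute, value) item of itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def correlation(s_st, s_s, s_t):
    """Correlation of antecedent and consequent computed from S_st, S_s and S_t."""
    denom = sqrt(s_s - s_s ** 2) * sqrt(s_t - s_t ** 2)
    # Assumption: report 0.0 when either side has zero variance.
    return (s_st - s_s * s_t) / denom if denom > 0 else 0.0

def generate_rules(itemset, ac, transactions, min_conf):
    """ac maps item -> 1 (antecedent only), 2 (consequent only) or 0 (either side)."""
    fixed_as = {i for i in itemset if ac[i] == 1}
    fixed_cs = {i for i in itemset if ac[i] == 2}
    both = sorted(i for i in itemset if ac[i] == 0)
    s_st = support(itemset, transactions)
    rules = []
    # Subsets are generated only for the both set, never for the full itemset.
    for r in range(len(both) + 1):
        for y in combinations(both, r):
            antecedent = fixed_as | set(y)
            consequent = fixed_cs | (set(both) - set(y))
            if not antecedent or not consequent:
                continue
            s_s = support(antecedent, transactions)
            s_t = support(consequent, transactions)
            if s_s > 0 and s_st / s_s >= min_conf:
                rules.append((frozenset(antecedent), frozenset(consequent),
                              correlation(s_st, s_s, s_t)))
    return rules

# Hypothetical mapped patient records.
transactions = [
    {("Age", 3), ("Smoke", 1), ("Diagnosis", 2)},
    {("Age", 3), ("Smoke", 1), ("Diagnosis", 2)},
    {("Age", 2), ("Smoke", 2), ("Diagnosis", 1)},
    {("Age", 3), ("Smoke", 2), ("Diagnosis", 1)},
]
itemset = {("Age", 3), ("Smoke", 1), ("Diagnosis", 2)}
ac = {("Age", 3): 1, ("Smoke", 1): 0, ("Diagnosis", 2): 2}

for antecedent, consequent, rho in generate_rules(itemset, ac, transactions, 0.6):
    print(sorted(antecedent), "->", sorted(consequent), round(rho, 3))
```

With one item in the both set, the sketch examines two rules for this itemset, one for each side on which that item can be placed.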
Algorithm: Find itemsets which have high support and are in the same group.
Input: Data and metadata files.
Output: Itemsets which are desired by medical researchers.
  1. K = 1;
  2. Read the metadata about which attributes can only appear in the antecedent of a rule, can only appear in the consequent, and can appear in either.
  3. Read the group information along with each group's support and confidence from the configuration file and make a dictionary; here the key is the attribute number and the value is a list of the group numbers to which the corresponding attribute belongs.
  4. Ik = Select 1-itemsets that have support greater than or equal to one of their corresponding group supports.
  5. While (Ik ≠ Ø) {
     5.1 K++;
     5.2 CK = Candidate_generation(Ik-1)
     5.3 CalculateCandidatesSupport(CK)
     5.4 Ik = SelectDesiredItemSetFromCandidates(CK, GroupSupports);
     5.5 I = I U Ik
     }
  6. return I

procedure Candidate_generation(Ik-1: frequent (k-1)-itemsets)
  1. for each itemset i1 ∈ Ik-1
     1.1 for each itemset i2 ∈ Ik-1
         1.1.1 new candidate, NC = Union(i1, i2);
         1.1.2 if size of NC is k
               isInSameGroup = TestWhetherAllTheItemsInSameGroup(NC)
               if (isInSameGroup == true) add NC to CK, otherwise remove it.
  2. return CK;

procedure SelectDesiredItemSetFromCandidates(CK, GroupSupports)
  1. for each itemset c ∈ CK
     1.1 j = FindGroupNoWhichHasMinimumSupportIfMultipleGroupsExist(c)
     1.2 if support(c) >= GroupSupports[j]
     1.3 add c to I
  2. return I

Algorithm: Find association rules for decision support of medical researchers.
Input: I: itemsets, GroupConfidences
Output: R: set of rules
  1. R = Ø
  2. For each X ∈ I
     2.1 j = FindGroupNoWhichHasMinimumConfidenceIfMultipleGroupsExist(X)
     2.2 Both set B = (b1, b2, ..., bn) { where bi ∈ X and AC(bi) = 0 }
     2.3 Antecedent set AS = (as1, as2, ..., asn) { where asi ∈ X and AC(asi) = 1 }
     2.4 Consequent set CS = (cs1, cs2, ..., csn) { where csi ∈ X and AC(csi) = 2 }
     2.5 For each subset Y of B
         2.5.1 Y1 = B - Y;
         2.5.2 AS1 = AS U Y
         2.5.3 CS1 = CS U Y1
         2.5.4 if (Support(AS1 → CS1) / Support(AS1)) >= GroupConfidences[j]
               AS1 → CS1 is a valid rule.
               R = R U (AS1 → CS1)
         2.5.5 AS2 = AS U Y1
         2.5.6 CS2 = CS U Y
         2.5.7 if (Support(AS2 → CS2) / Support(AS2)) >= GroupConfidences[j]
               AS2 → CS2 is a valid rule.
               R = R U (AS2 → CS2)

Figure 2: Association mining algorithm to support medical research

3.2.1. Lemma 1. The number of rules is equal to $\sum_{i=1}^{k} 2^{L(D_{2i})}$, where $k$ is the number of desired itemsets and $L$ is the function which determines the number of items in an itemset. $D_2$ is the both set, and $D_{2i}$ is the both set of the $i$-th desired itemset. The number of discarded rules is $m^p - \sum_{i=1}^{k} 2^{L(D_{2i})}$.

Proof: Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of items. Let $G = \{g_1, g_2, g_3, \ldots, g_q\}$ be the set of groups. Let $R = \{r_1, r_2, r_3, \ldots, r_s\}$ be the set of restrictions. GS is the function which finds the group with the smallest confidence. If not all items are in the same group, GS returns NULL. A 1-itemset is selected if S(1-itemset) >= S(GS(1-itemset)), where S is the function which returns the support. Let $C = \{c_1, c_2, c_3, \ldots, c_x\}$ be the set of candidate itemsets. A new candidate NC is added to $C$ if all of its items are in the same group, that is, if GS(NC) is not NULL. A candidate $c_i$ is selected for rule generation if $S(c_i) \geq S(GS(c_i))$. A desired itemset $D$ is partitioned into three parts, $D = \{D_0, D_1, D_2\}$. $D_0$ is mapped to antecedent items, $D_1$ is mapped to consequent items and $D_2$ is mapped to both. Each subset $d$ of $D_2$ is added to both antecedent and consequent. When $d$ is added to the antecedent, $D_2 - d$ is added to the consequent. On the other hand, when $d$ is added to the consequent, $D_2 - d$ is added to the antecedent. $L$ is the function which determines the number of items in an itemset. The number of rules from $D$ is $2^{L(D_2)}$. So the total number of rules is $\sum_{i=1}^{k} 2^{L(D_{2i})}$, where $k$ is the number of desired itemsets. Let $m$ be the average number of distinct values that each multidimensional attribute holds, and let $p$ be the number of attributes. The number of possible different rules is $m^p$. The number of discarded rules is $m^p - \sum_{i=1}^{k} 2^{L(D_{2i})}$.

4. Results and discussion

The experiments were done on a PC with a Core 2 Duo processor with a clock rate of 1.8 GHz and 3 GB of main memory. The operating system was Microsoft Vista and the implementation language was C#. We used one dataset to verify our method. The dataset of interest is a patient dataset collected and preprocessed from Bangladeshi hospitals, which has 50273 instances and 514 attributes (150 discrete and 364 numerical attributes). It contains all categories of healthcare data: ratio, interval, decimal, integer, percentage, etc. All these data are converted into mineable items (integer representation) using the domain dictionary and rule base. We have taken an
average value from 10 trials for each test result.

Table 1. Test result for patient dataset

Number of groups | 4 | 8
-----------------|---|---
Support for each group | .55, .64, .76, .45 | .47, .84, .66, .55, .85, .94, .86, .35
Correlation for each group | .71, .41, .51, .61 | .63, .85, .82, .76, .91, .73, .82, .71
Number of items to be constrained in antecedent for each group | 4, 4, 4, 4 | 5, 4, 5, 6, 4, 5, 5, 7
Number of items to be constrained in consequent for each group | 1, 2, 2, 1 | 1, 2, 2, 1, 1, 2, 2, 1
Number of items to be constrained in both for each group | 0, 0, 0, 0 | 1, 1, 1, 1, 1, 1, 1, 0
Total number of desired itemsets | 125 | 311
Total number of desired rules | 21 | 28
Time (seconds) | 173.09 | 556.11

Table 1 shows the test results for the patient dataset after running the program of the proposed algorithm with different parameters. The second column of the table presents the test result where we used 4 groups, minimum support of 45%-76% and correlation of .41-.71 to mine symmetric association rules for medical researchers. The maximum number of items in a rule was 6. In total, 125 desired itemsets were generated and 21 rules were discovered. It took about 3461 seconds to find these rules. The third column of the table presents the test result where we used 8 groups, minimum support of 35%-94% and correlation of .63-.91 to mine symmetric association rules for medical researchers. The maximum number of items in a rule was 8. In total, 311 desired itemsets were generated and 28 rules were discovered. It took about 11122 seconds to find these rules.

Figure 3: Time comparison of the proposed algorithms for the patient dataset based on number of groups (plot of time in seconds against number of groups 4, 8 and 12, with one series each for group size 4, 10 and 18)

Figure 3 shows how time varies with the number of groups for the medical research algorithm. We measured the performance of the algorithm in terms of the number of groups while keeping constant the group size, the support and confidence of each group, and the antecedent and consequent constraints on attributes. Time does not vary significantly because the number of groups does not reduce disk access: it affects neither the number of candidate generation phases nor the number of support calculation phases. The number of groups affects only the number of valid candidates generated, which can save some CPU time.

Figure 4: Time comparison of the proposed algorithms for the patient dataset based on group size (plot of time in seconds against group size 4, 8 and 12, with one series each for 4, 8 and 12 groups)

Figure 4 shows how time varies with group size for the medical research algorithm. Here we measured the performance of the algorithm in terms of group size while keeping constant the number of groups, the support and confidence of each group, and the antecedent and consequent constraints on attributes. Time varies significantly because group size affects disk access: it determines the number of candidate generation phases and the number of support calculation phases.

Figure 5: Accuracy of test result for the patient dataset based on correlation (plot of accuracy against correlation 0.5, 0.7 and 0.85, with one series each for group size 4, 10 and 18)

Figure 5 illustrates the accuracy results for our proposed algorithm. The value of correlation for each presented result is also indicated. For accuracy measurement, we intentionally discovered relationships among attributes for which trends are known. Here we calculated accuracy as the ratio between the number of correctly discovered relationships and the total number of discovered relationships. A discovered relationship is correct if it is one of the known trends of the medical domain. The figure shows that an average accuracy of 55% is achieved with correlation 0.5. The proposed algorithm with correlation 0.7 achieves an average accuracy of 85.66%. The proposed algorithm with correlation 0.85 achieves an average accuracy of 94.66%. As
accuracy refers to the rate of correct values in the data, the figure represents the success of our proposed data mining algorithm.

5. Conclusion

Medical researchers are interested in finding relationships among various diseases, lab tests, symptoms, etc. Due to the high dimensionality of medical data, conventional association mining algorithms discover a very high number of rules with many attributes, which are tedious and redundant to medical researchers and not among their desired set of attributes. In this paper, we have proposed an association rule mining algorithm for finding symmetric association rules to support medical qualitative research. The main theme of this algorithm is based on the following two statements: interesting relationships among various medical attributes are concealed in subsets of the attributes, but do not come out on all attributes taken together; and not all interesting relationships among medical attributes have the same support and correlation. The algorithm constructs candidate itemsets based on the group constraints and uses the corresponding support of each group in the candidate selection process to discover all possible desired itemsets of that group. We propose measuring the interestingness of known and unknown symmetric relationships via the correlation between antecedent items and consequent items. The proposed algorithm has been applied to a real-world medical dataset. We have shown significant accuracy in the output of the proposed algorithm. Although we have used level-wise search for finding symmetric association rules, each step of our algorithm is different from any level-wise search algorithm. Rule generation from desired itemsets is also different from conventional association mining algorithms.

6. References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, 1994, pp. 487-499.
[2] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data," in Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, United States, 1997, pp. 255-264.
[3] J. S. Park, M. S. Chen, and P. S. Yu, "An Effective Hash based Algorithm for mining association rules," in Proc. ACM SIGMOD Conf. Management of Data, New York, NY, USA, 1995, pp. 175-186.
[4] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Very Large Databases," in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., 1993, pp. 207-216.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo, "Efficient Algorithms for Discovering Association Rules," in AAAI Workshop on Knowledge Discovery in Databases, 1994, pp. 181-192.
[6] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," in Proc. of the 21st Intl Conference on Very Large Databases, Zurich, Switzerland, 1995.
[7] R. Srikant, Q. Vu, and R. Agrawal, "Mining association rules with item constraints," in Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining, 1997, pp. 67-73.
[8] A. Savasere, E. Omiecinski, and S. B. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," in Proceedings of the 21st International Conference on Very Large Data Bases, 1995, pp. 432-444.
[9] H. Mannila, "Database methods for data mining," in The Fourth International Conference on Knowledge Discovery and Data Mining, 1998.
[10] B. Liu, W. Hsu, and Y. Ma, "Mining Association Rules with Multiple Minimum Supports," in SIGKDD Explorations, 1999, pp. 337-341.
[11] H. Yun, D. Ha, B. Hwang, and K. H. Ryu, "Mining association rules on significant rare data using relative support," Journal of Systems and Software, vol. 67, no. 3, pp. 181-191, 2003.
[12] M. Hahsler, "A Model-Based Frequency Constraint for Mining Associations from Transaction Data," Data Mining and Knowledge Discovery, vol. 13, no. 2, pp. 137-166, 2006.
[13] L. Zhou and S. Yau, "Association rule and quantitative association rule mining among infrequent items," in International Conference on Knowledge Discovery and Data Mining, San Jose, California, 2007, pp. 156-167.
[14] C. Ordonez, C. Santana, and L. d. Braal, "Discovering Interesting Association Rules in Medical Data," in Proceedings of ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 2000, pp. 78-85.
[15] L. J. Sheela and V. Shanthi, "DIMAR - Discovering interesting medical association rules form MRI scans," in 6th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2009, pp. 654-658.
[16] C. Ordonez, N. Ezquerra, and C. A. Santana, "Constraining and summarizing association rules in medical data," Knowledge and Information Systems, vol. 9, no. 3, pp. 259-283, September 2005.
[17] H. Pan, J. Li, and Z. Wei, "Mining Interesting Association Rules in Medical Images," Lecture Notes in Computer Science, vol. 3584, pp. 598-609, 2005.
[18] S. Doddi, A. Marathe, S. S. Ravi, and D. C. Torney, "Discovery of association rules in medical data," Medical Informatics and the Internet in Medicine, vol. 26, no. 1, pp. 25-33, January 2001.