3. Growing privacy concerns<br />Threat to individual privacy.<br />Inference of sensitive information, including personal details or even behavioral patterns, from non-sensitive information.<br /><ul><li>An individual item in the database must not be disclosed
4. Not necessarily a person
5. Information about a corporation
6. Transaction record</li></li></ul><li>Why privacy-preserving data mining?<br />Government / public agencies. Example:<br />The Centers for Disease Control want to identify disease outbreaks<br />Insurance companies have data on disease incidents, seriousness, patient background, etc.<br />But can/should they release this information?<br />Industry collaborations / trade groups. Example:<br />An industry trade group may want to identify best practices to help members<br />But some practices are trade secrets<br />How do we provide “commodity” results to all (manufacturing using chemical supplies from supplier X has high failure rates) while still preserving secrets (manufacturing process Y gives low failure rates)?<br />
7. Why privacy preserving data mining?<br />Multinational Corporations<br />A company would like to mine its data for globally valid results<br />But national laws may prevent transborder data sharing<br />Public use of private data<br />Data mining enables research studies of large populations<br />But these populations are reluctant to release personal information<br />
8. Example: Patient Records…<br />Patient health records are split among providers:<br />Insurance company<br />Pharmacy<br />Doctor<br />Hospital<br />Each agrees not to release the data without the patient’s consent<br />A medical study wants correlations across providers<br />Rules relating complaints/procedures to “unrelated” drugs<br />Does this need patient consent?<br />And that of every other patient!<br />It shouldn’t!<br />Rules shouldn’t disclose individual patient data<br />
9. Approaches:<br /> The first approach is to alter the data before delivery to the data miner so that real values are obscured.<br /> The second approach assumes the data is distributed between two or more sites, and these sites cooperate to learn the global data mining results without revealing the data at their individual sites.<br />
10. Introduction<br />Our technique of altering the data is to selectively modify individual values in a database to prevent discovery of a set of rules.<br />Here we apply a group of heuristic solutions for reducing the number of occurrences of certain frequent itemsets below a minimum user-specified threshold.<br />The second approach is to allow users access to only a subset of the data while global data mining results can still be discovered.<br />
11. Problem statement<br />Mining of association rules.<br /> Let I = {i1, i2, …, im} be a set of literals, called items. Given a set of transactions D, where each transaction T is a set of items such that T ⊆ I, an association rule is an expression X=>Y, where X, Y ⊆ I and X ∩ Y = ∅. An example of such a rule is that 90% of customers who buy hamburgers also buy coke. The 90% here is called the confidence of the rule, which means that 90% of the transactions that contain X also contain Y. The support of the rule is the percentage of transactions that contain both X and Y. The problem of mining association rules is to find all rules whose support and confidence exceed the user-specified minimum support and minimum confidence.<br />
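The support and confidence measures defined above can be sketched in Python; the item names and transactions below are illustrative assumptions, not data from the text:

```python
# Sketch: computing support and confidence of an association rule X => Y.
# The transaction database below is an illustrative assumption.
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

transactions = [
    {"hamburger", "coke"},
    {"hamburger", "coke", "fries"},
    {"hamburger"},
    {"coke"},
]

# X = {hamburger}, Y = {coke}: 2 of the 3 hamburger transactions contain coke.
print(support({"hamburger", "coke"}, transactions))   # 0.5
print(confidence({"hamburger"}, {"coke"}, transactions))  # ~0.667
```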
12. Distributed mining architecture (diagram): each site runs local data mining over its own local data, and a data mining combiner merges the local results into combined results, e.g. the association rule A & B => C (4%).<br />
13. Apriori algorithm:<br />Apriori is an influential algorithm for mining frequent itemsets from a given database. It employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.<br />Apriori property:<br /> All non-empty subsets of a frequent itemset must also be frequent.<br /> A two-step process:<br /><ul><li>1. The join step:
14. To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. The join is performed such that two members of Lk-1 are joinable if they have their first (k-2) items in common.
15. 2. The prune step:
16. A scan of the database to determine the count of each candidate in Ck results in the determination of Lk. Any (k-1)-subset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and can be removed from Ck.</li></li></ul><li>Apriori Algorithm:<br />Input: database, D, of transactions; minimum support threshold, min_sup.<br />Output: L, frequent itemsets in D.<br />Method:<br />1. L1 = find_frequent_1-itemsets(D);<br />2. for (k = 2; Lk-1 ≠ ∅; k++) {<br />3. Ck = apriori_gen(Lk-1, min_sup);<br />4. for each transaction t є D { // scan D for counts<br />5. Ct = subset(Ck, t); // get the subsets of t that are candidates<br />6. for each candidate c є Ct<br />7. c.count++;<br />8. }<br />9. Lk = {c є Ck | c.count ≥ min_sup};<br />10. }<br />11. return L = ∪k Lk;<br />
17. Procedure apriori_gen(Lk-1: frequent (k-1)-itemsets; min_sup: minimum support)<br />1. for each itemset l1 є Lk-1<br />2. for each itemset l2 є Lk-1<br />3. if (l1[1] = l2[1]) ^ (l1[2] = l2[2]) ^ … ^ (l1[k-2] = l2[k-2]) ^ (l1[k-1] < l2[k-1]) then {<br />4. c = l1 join l2; // join step: generate candidates<br />5. if has_infrequent_subset(c, Lk-1) then<br />6. delete c; // prune step: remove unfruitful candidate<br />7. else add c to Ck;<br />8. }<br />9. return Ck;<br />Procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets); // use prior knowledge<br />1. for each (k-1)-subset s of c<br />2. if s ∉ Lk-1 then<br />3. return TRUE;<br />4. return FALSE;<br />
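The join, prune, and scan steps of the level-wise search above can be sketched compactly in Python; the transaction database and min_sup below are illustrative assumptions, and this is a sketch rather than the pseudocode's exact implementation:

```python
from itertools import combinations

# Sketch of the Apriori level-wise search: join Lk-1 with itself, prune
# candidates with an infrequent (k-1)-subset, then scan D for counts.
def apriori(transactions, min_sup):
    """Return all frequent itemsets (as frozensets) with count >= min_sup."""
    # L1: count individual items to find frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Join step: unions of Lk-1 members that share k-2 items give k-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: drop candidates having any infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Scan D to count the surviving candidates.
        cand_counts = {c: sum(1 for t in transactions if c <= t)
                       for c in candidates}
        Lk = {c for c, n in cand_counts.items() if n >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(apriori(transactions, min_sup=2))
```

With this data, every 1- and 2-itemset is frequent, and {A, B, C} survives with a support count of 2.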
18. Example:<br />Transaction database D (table): scan D for the count of each candidate to form C1, then compare each candidate’s support count with the minimum support count of 2 to obtain L1.<br />
19. Generate the C2 candidates from L1 and scan D for the count of each candidate to obtain L2; then generate the C3 candidates from L2 and scan D again for the count of each candidate to obtain L3.<br />Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.<br />
20. Generating Association Rules:<br />Consider the frequent 2-itemsets L2 = {AB, AC, BC}. The non-empty proper subsets of these itemsets are {A}, {B} and {C}. The resulting association rules are:<br /><ul><li>B=>A Confidence=4/4=100%
21. A=>C Confidence=4/6=66%
22. C=>A Confidence=4/4=100%
23. B=>C Confidence=3/4=75%
24. C=>B Confidence=3/4=75%</li></ul>If the minimum confidence threshold is 70%, all rules except the second are strong.<br /><ul><li>Consider the frequent itemset L3={ABC}. The non-empty proper subsets of L3 are {A},{B},{C},{AB},{AC},{BC}. The resulting association rules are:
25. A=>B^C Confidence=3/6=50%
26. B=>A^C Confidence=3/4=75%
27. C=>A^B Confidence=3/4=75%
28. A^B=>C Confidence=3/4=75%
29. A^C=>B Confidence=3/4=75%
30. B^C=>A Confidence=3/3=100%
31. If the minimum confidence threshold is 70%, all rules except the first are strong.</li></li></ul><li> Problem description<br />We propose algorithms to modify the data in a database so that sensitive items cannot be inferred through association rule mining.<br />More specifically, the objective is to modify the database D such that no association rules containing H, the set of items to be hidden, on the right-hand side will be discovered.<br />
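The rule-generation procedure above, which enumerates the non-empty proper subsets of a frequent itemset and keeps the rules meeting the confidence threshold, can be sketched as follows; the transaction database is an illustrative assumption:

```python
from itertools import combinations

# Sketch: generate strong rules X => Y from one frequent itemset by
# enumerating its non-empty proper subsets. Transactions are assumptions.
def strong_rules(itemset, transactions, min_conf):
    """All rules X => (itemset - X) whose confidence meets min_conf."""
    rules = []
    items = sorted(itemset)
    for r in range(1, len(items)):          # proper subsets only
        for lhs in combinations(items, r):
            x, y = set(lhs), set(items) - set(lhs)
            count_x = sum(1 for t in transactions if x <= t)
            count_xy = sum(1 for t in transactions if x | y <= t)
            conf = count_xy / count_x
            if conf >= min_conf:
                rules.append((frozenset(x), frozenset(y), conf))
    return rules

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
for x, y, conf in strong_rules({"A", "B"}, transactions, min_conf=0.7):
    print(sorted(x), "=>", sorted(y), f"{conf:.0%}")  # A=>B and B=>A at 75%
```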
32. Proposed algorithms<br />To hide an association rule, we can decrease either its support or its confidence below the pre-specified minimum support or minimum confidence. To decrease the confidence of a rule, we propose two algorithms:<br />Increase Support of LHS First (ISLF).<br />Decrease Support of RHS First (DSRF).<br /> The first algorithm tries to increase the support of the left-hand side of the rule. If that is not successful, it tries to decrease the support of the right-hand side of the rule.<br />
33. Algorithm ISLF:<br />Input:<br />(1) A source database D,<br /> (2) A min_support,<br /> (3) A min_confidence,<br /> (4) A set of hidden items H<br />Output:<br />A transformed database D’, where rules containing H on the RHS will be hidden<br />Algorithm:<br />1. Find large 1-itemsets from D;<br />2. For each hidden item h є H<br />3. If h is not a large 1-itemset, then H := H - {h};<br />4. If H is empty, then EXIT; // no AR contains H in RHS<br />5. Find large 2-itemsets from D;<br />6. For each h є H {<br />7. For each large 2-itemset containing h {<br />
34. 8. Compute confidence of rule U, where U is a rule x -> h;<br />9. If confidence > min_conf, then { // Increase Support of LHS<br />10. Find T1 = {t in D | t partially supports LHS(U)};<br />11. Sort T1 in descending order by the number of supported items;<br />12. Repeat {<br />13. Choose the first transaction t from T1;<br />14. Modify t to support LHS(U);<br />15. Compute support and confidence of U; }<br />16. Until (confidence(U) < min_conf or T1 is empty);<br />17. }; // end if confidence > min_conf<br />18. If confidence > min_conf, then { // Decrease Support of RHS<br />19. Find T2 = {t in D | t supports RHS(U)};<br />20. Sort T2 in descending order by the number of supported items;<br />21. Repeat {<br />
35. 22. Choose the first transaction t from T2;<br />23. Modify t to partially support RHS(U);<br />24. Compute support and confidence of U; }<br />25. Until (confidence(U) < min_conf or T2 is empty);<br />26. }; // end if confidence > min_conf<br />27. If confidence > min_conf, then<br />28. CANNOT HIDE h;<br />29. Else<br /> 30. Update D with new transaction t;<br />31. } // end of for each large 2-itemset<br />32. Remove h from H;<br />33. } // end of for each h є H<br />Output updated D as the transformed D’;<br />
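The two confidence-lowering moves used by ISLF can be sketched in Python under simplifying assumptions (binary transactions stored as sets, a single rule x => h, no sorting of candidate transactions); this is an illustrative simplification, not the full ISLF listing above:

```python
# Sketch of ISLF's two moves for a rule lhs => rhs:
# ISL adds the LHS items to transactions lacking them (raising support(LHS)),
# DSR removes the RHS items from transactions supporting the rule.
# Transactions are plain sets; the data below is an illustrative assumption.
def confidence(lhs, rhs, db):
    both = sum(1 for t in db if lhs <= t and rhs <= t)
    left = sum(1 for t in db if lhs <= t)
    return both / left if left else 0.0

def hide_rule(lhs, rhs, db, min_conf):
    """Lower confidence(lhs => rhs) below min_conf; try ISL first, then DSR."""
    # ISL: increase support of LHS in transactions not yet containing it.
    for t in db:
        if confidence(lhs, rhs, db) < min_conf:
            return True
        if not lhs <= t:
            t |= lhs          # modify t to fully support LHS
    # DSR: decrease support of RHS in transactions supporting the rule.
    for t in db:
        if confidence(lhs, rhs, db) < min_conf:
            return True
        if lhs <= t and rhs <= t:
            t -= rhs          # modify t so it no longer supports RHS
    return confidence(lhs, rhs, db) < min_conf

db = [{"B", "C"}, {"B", "C"}, {"B", "C"}, {"B"}, {"A"}]
print(hide_rule({"B"}, {"C"}, db, min_conf=0.7))  # True once confidence drops
```

Here B => C starts at confidence 3/4 = 75%; adding B to the last transaction lowers it to 3/5 = 60%, below the 70% threshold, so no RHS item has to be deleted.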
36. Examples Running the ISLF Algorithm<br />Example 1:<br /> To hide item C, the rule B=>C (50%, 75%) is hidden if transaction T5 is modified from 100 to 110 using ISL. To hide item B, the rule A=>B (67%, 83%) is hidden if transaction T1 is modified from 111 to 101 using DSR.<br />Database before and after hiding items C, B using ISLF<br />
37. Example 2:<br />Here we reverse the order of hiding items. To hide item B, the rule C=>B (50%, 75%) is hidden if transaction T5 is modified from 100 to 101 using ISL. To hide item C, the rule A=>C (83%, 83%) is hidden if transaction T1 is modified from 111 to 110 using DSR.<br />Database before and after hiding items B, C using ISLF<br />
38. Examples Running the DSRF Algorithm<br />Example 3:<br /> To hide item C, the rule B=>C (50%, 75%) is hidden if transaction T1 is modified from 111 to 110 using DSR. To hide item B, the rule C=>B (50%, 67%) is hidden because transaction T1 has been modified.<br />Database before and after hiding items C, B using DSRF<br />
39. Example 4:<br />Here we reverse the order of hiding items. To hide item B, the rule C=>B (50%, 75%) is hidden if transaction T1 is modified from 111 to 101 using DSR. To hide item C, the rule B=>C is hidden because transaction T1 has been modified.<br />Database before and after hiding items B, C using DSRF<br />
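The transaction encodings used in the examples (e.g. modifying T1 from 111 to 101) can be reproduced with a small helper; the item ordering A, B, C is assumed from the examples:

```python
# Sketch: the examples encode each transaction as a bit string over the
# item order (A, B, C); e.g. "111" = {A, B, C}. Modifying T1 from 111 to
# 101 deletes item B from the transaction (the DSR move).
ITEMS = ["A", "B", "C"]  # assumed item order

def decode(bits):
    """Bit string -> itemset."""
    return {item for item, b in zip(ITEMS, bits) if b == "1"}

def encode(itemset):
    """Itemset -> bit string."""
    return "".join("1" if item in itemset else "0" for item in ITEMS)

t1 = decode("111")          # {A, B, C}
t1.discard("B")             # DSR: drop the RHS item B
print(encode(t1))           # 101
```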
40. Analysis:<br />The first characteristic is that the transformed databases differ under different orderings of the hidden items. In the examples above, databases D2 and D4 are generated using the ISLF algorithm, and D5 and D6 using the DSRF algorithm.<br />The second characteristic we analyze is the efficiency of the proposed algorithms compared with Dasseni’s algorithm. The ISLF and DSRF algorithms require fewer database scans and prune more association rules than Dasseni’s algorithm.<br />DB scans and rules pruned in hiding item C using ISLF<br />
41. One of the reasons that Dasseni’s approach does not prune rules is that the hidden rules are given in advance.<br />Our approach needs to hide all rules containing hidden items on the right-hand side, whereas Dasseni’s approach can hide a selected subset of the rules containing a hidden item on the right-hand side.<br />The third characteristic we analyze is the relative efficiency of the ISLF and DSRF algorithms. The DSRF algorithm appears more effective when the support count of the hidden item is large: when the support of the right-hand side of the rule is large, increasing the support of the left-hand side usually does not reduce the confidence of the rule, but decreasing the support of the right-hand side usually does.<br />
42. Conclusions:<br />We have examined the database privacy problems caused by data mining technology and proposed two algorithms for hiding sensitive data in association rule mining.<br />The proposed algorithms are based on modifying the database transactions so that the confidence of the association rules can be reduced. Examples demonstrating the proposed algorithms are shown.<br /> The efficiency of the proposed approach is further compared with Dasseni’s approach. It was shown that our approach requires fewer database scans and prunes more hidden rules. However, our approach must hide all rules containing the hidden items on the right-hand side, whereas Dasseni’s approach can hide some of the specified rules.<br />
43. Software requirement specification:<br />The proposed algorithms can be implemented using Java as the front end and Oracle 9i as the back end under a Windows environment.<br />Hardware requirement specification:<br />Intel Core 2 Duo processor<br />RAM size<br />RAM speed<br />