"Privacy in Data Mining"

CSci 5708 Project Report, Fall 2004. Privacy in Data Mining. K. Vamshi Krishna Beemanapalli Kalyan
Abstract

This report presents a survey of privacy and its implications for data mining, covering the perspectives of the owners of information, the collectors of information, and the research community. The survey begins with an introduction to the diverse but related fields of privacy and data mining and explores how the privacy problem is aggravated by improper use of data mining techniques. It then enumerates the techniques that researchers around the world have developed to combat this issue and gives a brief overview and evaluation of each. The report closes with the open issues in this field that are yet to be solved, and with potential research directions.

Introduction

Data mining and knowledge discovery in databases are two research areas that investigate the automatic extraction of previously unknown patterns from large amounts of data. Recent developments in data collection, data dissemination, and internet technologies have opened a new domain of issues relating to the privacy of individuals. Data mining algorithms are being revisited from a different point of view, that of privacy preservation. It is well understood among researchers that the limitless spread of data through the internet and other media has reached a point where threats to privacy arise daily and require serious attention. Privacy preserving data mining [9, 10] is a novel research direction in data mining and statistical databases, in which data mining algorithms are analyzed for the side effects they incur on data privacy. The problem is twofold. First, identifying information such as name and age should be trimmed from the original database so that the recipient of the data cannot compromise the privacy of the people it describes. Second, techniques must be developed to limit the mining of sensitive patterns and rules, because such knowledge can equally well compromise data privacy, as we will show. The main objective in privacy preserving data mining is therefore to develop algorithms for modifying the original data in some way so that private data and private knowledge remain private even after the mining process. The problem of confidential information being derivable from released data by unauthorized users is commonly called the "database inference problem". It has long plagued statistical databases, and it is now the turn of databases used for data mining to take it seriously.
Relationship

The relationship between data mining and privacy follows directly. Since the primary goal of data mining is to extract hidden and unknown patterns and relationships among data items, a side effect of the tool is the extraction of private and sensitive information. We must therefore be concerned about privacy whenever data mining techniques are applied to databases holding private data: the database inference problem is made worse by data mining. Hence all data mining algorithms need to be revisited from the angle of privacy, secrecy, and civil liberties.

Privacy violations in the recent past

1. Kaiser, a major US health provider, accidentally sent 858 email messages containing member IDs and responses to questions about various illnesses to the wrong members. (Washington Post, 10 August 2000)
2. GlobalHealthtrax, which sells health products online, inadvertently revealed the names, home phone numbers, bank account details, and credit card information of thousands of customers on its Web site. (MSNBC, 19 January 2000)
3. Medical Marketing Service advertises a database available to pharmaceutical marketers that includes the names of 4.3 million people with allergies, 923,000 with bladder control problems, and 380,000 who suffer from clinical depression. (www.mmslists.com)
4. Boston University has created a private company to sell the data collected over more than 50 years as part of the Framingham Heart Study. Data collected on more than 12,000 people, including medical records and genetic samples, will be sold. (New York Times, 17 June 2000)
5. The drug store chains CVS and Giant Food admitted to making patient prescription records available to a direct mail and pharmaceutical company. (Washington Post, 15 February 1998)

Data mining, a new monster?

In the information age, information is captured, collated, bartered, and sold, and everyone is in on the act. Companies such as New York-based DoubleClick faithfully capture each Web browser's mouse clicks and use the information to direct consumer ads. Others, from Redmond, Wash.-based Microsoft to Mountain View, Calif.-based Netscape to Cambridge, Mass.-based FireFly Networks, track individual interests in everything from music to Web pages. And Boulder, Colo.-based MessageMedia Inc. (a Softbank Holdings Inc. company) links traditional direct marketing databases to cyberspace pitches. Privacy issues become all the more complex because of the Internet, which allows unlimited access to data anytime, anywhere in the world [11, 12, 13, 14, 15].

Having surveyed the literature, we have come to realize that privacy preserving data mining is an emerging field of research. Several groups are working on several different problems; for example, one group might be devising new privacy preservation techniques for the association rule algorithm in the financial domain. It is clear that research in this field is not yet mature enough to ensure total privacy protection for everyone worried about privacy, but that is not going to stop credit companies, telemarketers, and e-commerce websites from mining personal data. So, would a strong federal policy for preserving privacy bring any order out of the chaos? We really do not know; we suspect the corporations are always two steps ahead of the law. Unfortunately, the US lags far behind the EU and Japan, which have very strong privacy laws and enforce them. Two laws that have received a lot of public attention in the US in the recent past are the 1996 Health Insurance Portability and Accountability Act (www.hhs.gov/ocr/hipaa), which gives patients control over how their personal medical information is used and disclosed, and the 1999 Gramm-Leach-Bliley Financial Services Modernization Act (www.banking.senate.gov/conf/), which requires financial institutions to disclose their privacy policies and allows consumers to opt out of sharing personal information with nonaffiliated third parties. It would be interesting to know what violations of these Acts have been reported and how they have been dealt with by the law.

Statistics speak louder than words!

Data mining has found extensive use in counter terrorism, especially in the recent past.
This has raised the hackles of several civil liberties unions, because it can blatantly violate an individual's right to privacy. What is more important, protecting nations from terrorist attacks or protecting the privacy of individuals? This is one of the major challenges faced by technologists, sociologists, and lawyers: how can we have privacy and at the same time ensure the safety of nations? What should we sacrifice, and to what extent?

Related Work in Statistical Databases

Privacy preserving data mining draws heavily on related research in statistical databases, where the inference problem has been studied exhaustively. The key difference between data mining and statistical database queries is that in the latter case we know what to preserve and protect: we are mostly interested in protecting means, standard deviations, and other statistics. In data mining we do not even know what to expect, so there is no clear-cut notion of when privacy is violated or how to detect a violation beforehand. Work in statistical databases has taken two approaches:

Query restriction
• restricting the size of query results
• controlling the overlap among successive queries
• keeping an audit trail of all answered queries and constantly checking for possible compromise
• suppressing data cells of small size
• clustering entities into mutually exclusive atomic populations

Data perturbation
• swapping values between records
• replacing the original database by a sample from the same distribution
• adding noise to the values in the database
• adding noise to the results of a query
• sampling the result of a query

Statistical databases and data mining share the goal of preventing disclosure of confidential information. However, the goal in data mining is not to obtain high-quality point estimates but to reconstruct, with reasonable accuracy, the original distribution of attribute values from a distorted and perturbed dataset, so that the aggregate knowledge gained is close to what would be obtained from the original dataset. The general problem in data mining is the following. Say X is a data warehouse owner and Y is a data miner. X is interested in knowing more about hidden relationships in his data warehouse, but does not trust Y and does not want Y to learn any confidential attributes. So he distorts and perturbs the data set and hands it to Y. Y's job is to reconstruct the original distribution as faithfully as possible and build a model from it. There is always a trade-off between the amount of privacy ensured and the accuracy of the knowledge model developed: if X distorts the data too much to preserve privacy, the knowledge models Y develops will have very poor accuracy.

Policy and Mechanism

Just as the ACID properties have guided the architecture and design of all OLTP systems, several groups are interested in developing a set of guidelines for privacy preserving databases. These guidelines are like a protocol to which all ideal databases are supposed to adhere. Unfortunately, ensuring privacy is a complex and hard problem, and it is not easy to satisfy everyone all the time.
The research group of Rakesh Agrawal at IBM [17] is at the forefront of developing policies for such databases. They call them Hippocratic databases, and the principles governing them are:

1. Purpose Specification. For personal information stored in the database, the purposes for which the information has been collected shall be associated with that information.
2. Consent. The purposes associated with personal information shall have the consent of the donor of that information.
3. Limited Collection. The personal information collected shall be limited to the minimum necessary for accomplishing the specified purposes.
4. Limited Use. The database shall run only those queries that are consistent with the purposes for which the information has been collected.
5. Limited Disclosure. The personal information stored in the database shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information.
6. Limited Retention. Personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected.
7. Accuracy. Personal information stored in the database shall be accurate and up-to-date.
8. Safety. Personal information shall be protected by security safeguards against theft and other misappropriations.
9. Openness. A donor shall be able to access all information about the donor stored in the database.
10. Compliance. A donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance.

Key Issues and Key Results

This section gives a brief description of the techniques used by various researchers to ensure privacy preserving data mining. The problem has two sides. One is to modify the databases so that sensitive information is hidden from the data miners; the other is to modify the data mining algorithms so that they do not produce rules which reveal sensitive information in the data.

Classification Taxonomy

Many approaches have been adopted for privacy preserving data mining. We can classify them along the following dimensions:
• data distribution
• data modification
• data mining algorithm
• data or rule hiding
• privacy preservation

Data distribution refers to how the data is distributed in the database, that is, whether the system is centralized or distributed; horizontally partitioned and vertically partitioned databases have been treated in different ways. The second dimension is data modification.
Data modification changes the original values of a database that needs to be released to the public, in order to ensure high privacy protection. It is important that a data modification technique be in concert with the privacy policy adopted by an organization. Methods of modification include data perturbation, data blocking, aggregation or merging, swapping, and sampling, which refers to releasing only part of the entire population. The third dimension is the data mining algorithm. Although we do not know in advance which data mining algorithm will be used, considering a specific algorithm facilitates the analysis and design of data hiding techniques; ideas have been developed for classification rules, association rules, clustering, Bayesian networks, and rough sets. Data or rule hiding refers to whether raw data or aggregated data is to be released. Lessening the data allows only weaker inference rules to be produced, rules that do not allow inference of confidential values; this process is called "rule confusion". The last dimension refers to selective modification of data, which is required in order to achieve higher utility for the modified data given that privacy is not jeopardized. For this purpose, heuristic-based techniques such as adaptive modification, cryptography-based techniques such as secure multi-party computation, and reconstruction-based techniques, where the original distribution of the data is reconstructed from randomized data, have been proposed. This report briefly describes these three families of techniques and discusses the trade-offs among them and the performance and accuracy levels they have been able to achieve.

Techniques and Methods

As mentioned in the last section, privacy preserving techniques are broadly divided into three types:
• heuristic-based techniques
• cryptography-based techniques
• reconstruction-based techniques

Heuristic-Based Techniques

Techniques of this kind have been developed for data mining tasks such as classification, association rule discovery, and clustering, based on the premise that selective data modification or sanitization is an NP-hard problem [6]; for this reason, heuristics are used to address the complexity. Various combinations of association rule confusion algorithms exist for centralized and distributed databases, applying different data modification techniques.
The algorithms can be enumerated as follows:
• Centralized Data Perturbation-Based Association Rule Confusion
• Centralized Data Blocking-Based Association Rule Confusion

Similarly, for classification rules, many algorithms have been developed from various combinations of data modification and data distribution techniques. Those considered so far are:
• Centralized Data Blocking-Based Classification Rule Confusion
• Horizontally Partitioned Data Perturbation Classification Rule Confusion
• Vertically Partitioned Data Modification Classification Rule Confusion
• Centralized Data Modification Classification Rule Confusion
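To make the rule confusion idea concrete, the following sketch (our own illustration, not an algorithm from the surveyed papers; the function names are hypothetical) shows centralized perturbation-based hiding of an association rule: items of a sensitive itemset are deleted from supporting transactions until the itemset's support falls below the mining threshold, so no rule built from it can be discovered.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def hide_itemset(transactions, sensitive, min_sup):
    """Greedily perturb transactions until the sensitive itemset's support
    drops below min_sup, so a miner using that threshold cannot find it.
    Returns a sanitized copy; the original transactions are untouched."""
    txns = [set(t) for t in transactions]
    victim = next(iter(sensitive))          # item to delete from supporters
    for t in txns:
        if support(txns, sensitive) < min_sup:
            break                           # itemset is already hidden
        if sensitive <= t:
            t.discard(victim)               # break this transaction's support
    return txns

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
hidden = hide_itemset(txns, {"a", "b"}, min_sup=0.5)
assert support(hidden, {"a", "b"}) < 0.5
```

The price of hiding, as the survey notes, is reduced utility: every deleted item also weakens the legitimate, non-sensitive rules that transaction supported.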
Cryptography-Based Techniques

We have identified two research areas here.

1. Order Preserving Encryption Scheme for Numeric Data. Encryption is a well established technology for protecting sensitive data. However, once data is encrypted it can no longer be easily queried, aside from exact matches. Rakesh Agrawal's group at IBM [17] has proposed an order preserving encryption scheme for numeric data that allows any comparison operator to be applied directly to encrypted data.

[Figure: original and encrypted data distributions]

As the figure shows, the ordering of data tuples before and after encryption is the same, while the encrypted data items bear no resemblance to the original items. Such encrypted distributions can therefore be used to answer queries involving the <, =, and > operators, including exact matches. For further details, see the reference paper [17] by Rakesh Agrawal's group.

2. Secure Distributed Computation. Secure distributed computation has been studied extensively as part of a larger body of research in the theory of cryptography. Benny Pinkas [16] argues that this work has direct relevance to privacy preserving computation of data mining algorithms. Consider, for example, separate medical institutions that wish to conduct joint research while preserving the privacy of their patients. In this scenario it is necessary to protect privileged information while still enabling its use for research or other purposes. In particular, although the parties realize that combining their data has some mutual benefit, none of them is willing to reveal its database to any other party.
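To give the flavor of such protocols (this is the classic secure-sum protocol, our own illustration rather than a specific scheme from [16]), several parties can compute a joint total, say a symptom count across hospitals, without any party revealing its own value: the initiator masks its value with a random number, each party adds its own value to the masked running total, and the initiator removes the mask at the end.

```python
import random

def secure_sum(local_values, modulus=10**9):
    """Ring-based secure sum: party 0 adds a random mask to its value,
    every other party adds its own value to the running total it receives,
    and party 0 subtracts the mask at the end. No intermediate total
    reveals any single party's value (assuming the true sum < modulus)."""
    mask = random.randrange(modulus)
    running = (local_values[0] + mask) % modulus   # party 0 masks its value
    for v in local_values[1:]:                     # each party adds its own
        running = (running + v) % modulus
    return (running - mask) % modulus              # party 0 removes the mask

counts = [120, 75, 305]  # e.g. symptom counts held by three hospitals
assert secure_sum(counts) == sum(counts)
```

The sketch collapses the message-passing into one loop; in a real deployment each party holds only its own value and forwards the masked total to its neighbor. The scheme resists a single honest-but-curious party but not two colluding neighbors, which is exactly the kind of leakage analysis this line of research formalizes.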
It is obvious that if a data mining algorithm is run against the union of the databases and its output becomes known to one or more of the parties, it reveals something about the contents of the other databases. (For example, if a researcher from a medical institution learns that the overall percentage of patients with a certain symptom is 50%, while he knows that the percentage in his own patient population is 40%, then he also learns that more than 50% of the patients of the other institutions have the symptom.) The first step in gaining a foothold is to define privacy, and to use this definition to limit the information leaked by the distributed computation to what can be learned from the designated output of the computation. For more details, see the reference paper by Pinkas [16].

Reconstruction-Based Techniques

The work presented in [3] addresses the problem of building a decision tree classifier from training data in which the values of individual records have been perturbed. While it is not possible to accurately estimate the original values in individual data records, the authors propose a reconstruction procedure to accurately estimate the distribution of the original data values. Using the reconstructed distributions, they are able to build classifiers whose accuracy is comparable to that of classifiers built with the original data. For distorting values the authors consider a discretization approach and a value distortion approach; for reconstructing the original distribution they consider a Bayesian approach, and they propose three algorithms for building accurate decision trees that rely on reconstructed distributions.

[Figure: number of people versus age for the original, randomized, and reconstructed distributions, with the reconstructed curve closely tracking the original]
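The randomize-then-reconstruct idea can be sketched as follows (our own illustration of an iterative Bayesian estimate in the spirit of this approach, not the authors' code; the bin spacing, noise width, and iteration count are arbitrary choices): values are perturbed with uniform noise, and the original distribution over a set of bins is estimated from only the perturbed values and the known noise density.

```python
import random

def reconstruct(perturbed, noise_pdf, bins, iters=20):
    """Iteratively estimate the distribution of the original values over
    the given bin centers, given only noise-added values and the known
    noise density. Each pass computes, for every perturbed value, the
    posterior over bins under the current estimate, then averages."""
    n_bins = len(bins)
    est = [1.0 / n_bins] * n_bins                 # start from uniform
    for _ in range(iters):
        new = [0.0] * n_bins
        for w in perturbed:
            denom = sum(noise_pdf(w - a) * est[k] for k, a in enumerate(bins))
            for k, a in enumerate(bins):
                new[k] += noise_pdf(w - a) * est[k] / denom
        total = sum(new)
        est = [v / total for v in new]            # renormalize
    return est

random.seed(0)
# True ages cluster at 25 and 60; each is perturbed with noise in [-10, 10].
truth = [25] * 300 + [60] * 200
perturbed = [x + random.uniform(-10, 10) for x in truth]
uniform_pdf = lambda y: 0.05 if -10 <= y <= 10 else 0.0
bins = list(range(10, 80, 5))
est = reconstruct(perturbed, uniform_pdf, bins)
# est concentrates its mass near ages 25 and 60, roughly in a 3:2 ratio,
# even though no individual original value can be recovered.
```

This mirrors the trade-off discussed earlier: individual records stay private behind the noise, while the aggregate shape of the data is recovered well enough to train a classifier.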
As the graph shows, the reconstruction techniques do a fairly good job of recovering the original distribution from the randomized data.

Open Issues / Research Directions

• data warehousing and the inference problem
• preparing perturbed databases for a combination of algorithms
• social effects (working with social scientists to preserve privacy across cultures)
• formulating legal rules and developing data mining algorithms accordingly
• a privacy inference controller
• quantifying privacy

The bulleted open issues are just a few among many: this field is so young that many issues must be resolved before the algorithms can be considered for commercial purposes. A data warehouse is the staging area for holding aggregate data, and data mining tools are at their most powerful when working on aggregated data, so data mining algorithms should also be studied from the data warehouse point of view, including how much the inference problem affects the techniques employed for data warehouses. It is also important to develop techniques that work across many types of data mining: we might modify a database to protect against one kind of rule, such as classification rules, with no idea how the modification fares against association rules, which an intruder might then use to compromise the privacy of the data. Researchers have suggested working with social scientists, because privacy issues vary between regions and cultures, and it is important to take their views into account when developing data mining algorithms. Similarly, legal issues should be considered by working closely with national agencies to gain a better understanding of the relevant rules and regulations. Research can also be done on constructing a privacy inference controller, a layer between the data mining algorithm and the association rules it emits, whose job is to filter the sensitive rules from the entire set produced by the data miner. Finally, we need to quantify privacy: little work has been done on metrics that quantitatively measure privacy violation. A policy similar to that adopted by insurance companies could be considered, and this is the direction in which researchers are heading.
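One concrete starting point for quantifying privacy is the interval-width measure proposed in [6]: if values are perturbed with uniform noise in [-a, a], the privacy offered can be measured as the width of the interval within which an adversary can localize the true value at a given confidence level. A minimal sketch (the function name is ours):

```python
def interval_privacy(noise_halfwidth, confidence):
    """Interval-width privacy measure: with uniform noise on [-a, a],
    a centered interval of width 2*a*c around the perturbed value
    contains the true value with probability c, so that width is the
    amount of privacy offered at confidence level c."""
    return 2 * noise_halfwidth * confidence

# Perturbing ages with noise in [-10, 10] yields 19 "years of privacy"
# at 95% confidence: the adversary cannot pin the age down more tightly.
assert interval_privacy(10, 0.95) == 19.0
```

Even this simple metric exposes the central tension of the field: raising the noise half-width buys privacy linearly but degrades every estimate the miner can make from the released data.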
References

[1] Chris Clifton and Don Marks. Security and Privacy Implications of Data Mining. 1996.
[2] Chris Clifton, Murat Kantarcioglu, and Jaideep Vaidya. Defining Privacy for Data Mining. Purdue University.
[3] Bhavani Thuraisingham. Data Mining, National Security, Privacy and Civil Liberties. The National Science Foundation.
[4] Md. Zahidul Islam and Ljiljana Brankovic. A Framework for Privacy Preserving Classification in Data Mining. 2004.
[5] Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and Johannes Gehrke. Privacy Preserving Mining of Association Rules. IBM Almaden Research Center, 2002.
[6] Rakesh Agrawal and Ramakrishnan Srikant. Privacy Preserving Data Mining. IBM Research, Almaden, 2000.
[7] Y. Lindell and Benny Pinkas. Privacy Preserving Data Mining. Advances in Cryptology, 2000.
[8] Peter Fule and John Roddick. Detecting Privacy and Ethical Sensitivity in Data Mining Results. 2004.
[9] Alexandre Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. 2003.
[10] Vassilios S. Verykios, Elisa Bertino, Igor Nai Fovino, Provenza, Yucel Saygin, and Yannis Theodoridis. State-of-the-art in Privacy Preserving Data Mining.
[11] Business Week. Privacy on the Net. March 2000.
[12] L. Cranor, J. Reagle, and M. Ackerman. Beyond Concern: Understanding Net Users' Attitudes About Online Privacy. Technical Report TR 99.4.3, AT&T Labs-Research, April 1999.
[13] A. Westin. E-commerce and Privacy: What Net Users Want. Technical report, Louis Harris & Associates, June 1998.
[14] A. Westin. Privacy Concerns & Consumer Choice. Technical report, Louis Harris & Associates, December 1998.
[15] A. Westin. Freebies and Privacy: What Net Users Think. Technical report, Opinion Research Corporation, July 1999.
[16] Benny Pinkas, HP. SIGKDD Explorations, Volume 4, Issue 2.
[17] IBM Almaden Research Group. http://www.almaden.ibm.com/software/quest/Publications/papers