Csci 5708 Project Report
Privacy in Data Mining
K. Vamshi Krishna
We provide here the details of a survey on privacy and its implications for data
mining. We present the perspective of the owners of information, the collectors of
information, and the opinions of researchers. The survey begins with an introduction
to the diverse but related fields of privacy in general and data mining. We explore
their relationship and how the problem of privacy is aggravated by improper use of
data mining techniques. We then enumerate the techniques which have been developed by
researchers all over the world to combat this issue and give a brief overview of each.
A brief evaluation is performed and some conclusions are drawn. The paper ends by
introducing the open issues in this field which are yet to be solved, and potential
research directions.
Data mining and knowledge discovery in databases are two new research areas that
investigate the automatic extraction of previously unknown patterns from large amounts
of data. Recent developments in data collection, data dissemination and internet
technologies have opened a new domain of issues relating to the privacy of individuals
in connection with data mining. The data mining algorithms are being revisited from a
different point of view: that of privacy preservation. It is well understood among the
community of researchers that the limitless expansion of data through the internet and
other media has reached a point where threats against privacy are common on a daily
basis and require serious countermeasures.
Privacy preserving data mining [9, 10], is a novel research direction in data mining and
statistical databases, where data mining algorithms are analyzed for the side-effects they
incur in data privacy. The problem is twofold. First, sensitive information such as name
and age should be trimmed from the original database so that the recipient of the
information cannot compromise the privacy of individuals. Second, techniques must be
developed to minimize the mining of sensitive patterns and rules by data mining
algorithms, because such knowledge can equally well compromise data privacy, as we
will indicate. The main objective in privacy preserving data mining is to develop
algorithms for modifying the original data in some way, so that the private data and
private knowledge remain private even after the mining process. The problem that arises
when confidential information can be derived from released data by unauthorized users is
also commonly called the ‘database inference problem’. This problem has plagued
statistical databases, and now it’s the turn of the databases used for data mining purposes
to seriously consider this problem.
Relationship: Here, then, is the relationship between data mining and privacy. Since
the primary goal of data mining is to extract hidden and unknown patterns and
relationships among data items, there is a side-effect: the same tools can be used to
extract private and sensitive information. We therefore have to be concerned about
privacy whenever data mining techniques are applied to databases holding private data.
In other words, the 'database inference problem' is made worse by data mining. Hence
there is a need to revisit all data mining algorithms from the angle of privacy,
secrecy, and civil liberties.
Privacy violations in the recent past
1. Kaiser, a major US health provider, accidentally sent out 858 email messages containing
member IDs and responses to questions on various illnesses to the wrong members.
(Washington Post, 10 August 2000).
2. GlobalHealthtrax, which sells health products online, inadvertently revealed customer
names, home phone numbers, bank account, and credit card information of thousands of
customers on their Web site. (MSNBC, 19 January 2000).
3. Medical Marketing Service advertises a database available to pharmaceutical
marketers which includes the names of 4.3 million people with allergies,
923,000 with bladder control problems, and 380,000 who suffer from clinical depression.
4. Boston University has created a private company to sell the data collected for more
than 50 years as part of the Framingham Heart Study. Data collected on more than 12,000
people, including medical records and genetic samples, will be sold. (New York Times,
17 June 2000)
5. The chain drug stores CVS and Giant Food admitted to making patient prescription
records available for use by a direct mail and pharmaceutical company.
(Washington Post, 15 February 1998).
Data Mining, a new monster?
In the information age, information is captured, collated, bartered and sold. Everyone is
in on the act. Companies such as New York-based DoubleClick faithfully capture each
Web browser's mouse click and use the information to direct consumer ads. Others, from
Redmond, Wash.-based Microsoft to Mountain View, Calif.-based Netscape to
Cambridge, Mass.-based FireFly Networks, track individual interests in everything from
music to Web pages. And Boulder, Colo.-based MessageMedia Inc. (a Softbank Holdings
Inc. company) links traditional direct marketing databases to cyberspace pitches.
Privacy issues become all the more complex because of the Internet, which allows
unlimited access to data anytime, anywhere in the world [11, 12, 13, 14, 15].
After having surveyed the literature, we have come to realize that Privacy Preserving
Data Mining is an emerging field of research. There are several groups working on
several different problems. For example, one group might be interested in devising new
privacy preserving techniques for the financial domain, specific to association rule
algorithms. It is clear that the research in this field is not mature enough to ensure total
privacy protection for all those concerned about privacy. But that is not going to stop
credit companies, telemarketers and e-commerce websites from mining personal data. So is
a strong federal policy for preserving privacy going to bring any order out of the
chaos? I really don't know; I suspect the corporations are always two steps ahead of
the law. Unfortunately, the US lags far behind the EU and Japan, which have very strong
laws for preserving privacy and enforce them. Two laws that have
received a lot of public attention in the recent past in the US are the 1996 Health
Insurance Portability and Accountability Act (www.hhs.gov/ocr/hipaa) giving patients
control over how their personal medical information is used and disclosed and the
1999 Gramm-Leach-Bliley Financial Services Modernization Act
(www.banking.senate.gov/conf/) that requires financial institutions to disclose their
privacy policies and allows consumers to opt-out of sharing personal information with
nonaffiliated third parties. It will be interesting to know whether any violations of
these Acts have been reported and how they have been dealt with by the law. Statistics speak
louder than words!
Data mining has found extensive use in dealing with counter terrorism especially in the
recent past. This has raised the hackles of several civil liberties unions because it
blatantly violates an individual’s right to privacy. What is more important? Protecting
nations from terrorist attacks or protecting the privacy of individuals? This is one of the
major challenges faced by technologists, sociologists and lawyers. That is, how can we
have privacy but at the same time ensure the safety of the nations? What should we be
sacrificing and to what extent?
Related Work in Statistical Databases
Privacy preserving data mining heavily draws upon ideas from related research in
Statistical Databases. The Inference Problem in Statistical Databases has been beaten to
death in the literature. The key difference between Data Mining and Statistical Database
Queries is that in the latter case, we know what to preserve and protect. We are mostly
interested in preserving mean, standard deviations and other statistics. But in Data
Mining, we don’t even know what to expect. So there is no clear cut notion as to when
one’s privacy is violated or how to detect it beforehand.
Work in statistical databases has taken two approaches. The first restricts queries, and
includes:
• restricting the size of the query result
• controlling the overlap amongst successive queries
• keeping an audit trail of all answered queries and constantly checking for possible
compromise
• suppression of data cells of small size
• clustering entities into mutually exclusive atomic populations
The second perturbs the data or the answers, and includes:
• swapping values between records
• replacing the original database by a sample from the same distribution
• adding noise to the values in the database
• adding noise to the results of a query
• sampling the result of a query
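A few of the listed safeguards can be combined in a short sketch (the salary data, the query-size threshold and the noise scale below are hypothetical, chosen only for illustration):

```python
import random

# Toy statistical database of salaries (hypothetical data, illustration only).
DB = {"alice": 52000, "bob": 61000, "carol": 58000, "dave": 47000, "erin": 64000}

MIN_QUERY_SIZE = 3   # query-set-size restriction
NOISE_SCALE = 1000   # magnitude of the noise added to each answer

def mean_salary(names):
    """Answer an aggregate query, refusing query sets that are too small
    and adding random noise to the result (output perturbation)."""
    if len(names) < MIN_QUERY_SIZE:
        raise ValueError("query set too small; refused to limit inference")
    true_mean = sum(DB[n] for n in names) / len(names)
    return true_mean + random.uniform(-NOISE_SCALE, NOISE_SCALE)
```

Repeated or overlapping queries could still leak information over time, which is why real systems combine such controls with overlap checks and auditing.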
It is clear that statistical databases and data mining share the same goals of preventing
disclosure of confidential information. However the goal in data mining is not to obtain
high quality point estimates but to reconstruct with reasonable accuracy the original
distribution of attribute values (given a distorted and perturbed dataset) so that the
aggregate knowledge gained is close enough to what might be obtained from the original
data.
The general problem in Data Mining is the following. Say X is a data warehouse owner
and Y is a Data Miner. X is interested in knowing more about hidden relationships in his
data warehouse but does not trust Y and does not want Y to learn any confidential
attributes. So he distorts and perturbs the data set and hands it to Y. Y's job is to
reconstruct the original distribution as faithfully as possible and build a model out of it.
It is always a trade-off between the amount of privacy ensured and the accuracy of the
knowledge model developed. For example, if X distorts the data too much to preserve
privacy, Y will achieve very poor accuracy with the developed knowledge models.
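The exchange between X and Y, and the trade-off itself, can be illustrated with a small sketch (the salary-like figures and the noise scale are made up for the example):

```python
import random

def distort(values, scale):
    """X's side of the exchange: perturb every confidential value with
    zero-mean uniform noise in [-scale, scale] before releasing it."""
    return [v + random.uniform(-scale, scale) for v in values]

random.seed(1)
original = [random.gauss(50000, 8000) for _ in range(20000)]  # e.g. salaries
released = distort(original, 25000)  # heavy distortion, for privacy

# Individual records are now far from their true values (privacy) ...
record_error = sum(abs(o - r) for o, r in zip(original, released)) / len(original)

# ... yet an aggregate such as the mean is still recovered closely by Y
# (utility), which is exactly the trade-off described above.
mean_error = abs(sum(original) / len(original) - sum(released) / len(released))
```

Shrinking the noise scale improves Y's accuracy at the cost of X's privacy, and vice versa.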
Policy and Mechanism
Just as the ACID properties have guided the architecture and design of all OLTP systems,
several groups are interested in developing a set of guidelines for Privacy Preserving
databases. These guidelines are like a protocol which all ideal databases are supposed to
adhere to. Unfortunately, ensuring privacy is a complex and hard problem and it is not
easy to satisfy everyone all the time. The research group of Rakesh Agrawal at IBM 
is at the forefront in developing policies for such databases. They call such databases
Hippocratic databases and the policies governing them are:
1. Purpose Specification For personal information stored in the database, the purposes
for which the information has been collected shall be associated with that information.
2. Consent The purposes associated with personal information shall have consent of the
donor of the personal information.
3. Limited Collection The personal information collected shall be limited to the
minimum necessary for accomplishing the specified purposes.
4. Limited Use The database shall run only those queries that are consistent with the
purposes for which the information has been collected.
5. Limited Disclosure The personal information stored in the database shall not be
communicated outside the database for purposes other than those for which there is
consent from the donor of the information.
6. Limited Retention Personal information shall be retained only as long as necessary
for the fulfillment of the purposes for which it has been collected.
7. Accuracy Personal information stored in the database shall be accurate and up-to-date.
8. Safety Personal information shall be protected by security safeguards against theft and
other misappropriations.
9. Openness A donor shall be able to access all information about the donor stored in the
database.
10. Compliance A donor shall be able to verify compliance with the above principles.
Similarly, the database shall be able to address a challenge concerning compliance.
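As a rough illustration of how principles 2, 4 and 5 might be enforced, consider a metadata check in which every attribute carries the purposes its donor has consented to, and a query runs only if its declared purpose is consented for every attribute it touches (the attribute names and purposes below are hypothetical):

```python
# Hypothetical purpose metadata attached to each stored attribute.
ATTRIBUTE_PURPOSES = {
    "name":      {"billing", "treatment"},
    "diagnosis": {"treatment"},
    "email":     {"billing"},
}

def query_allowed(attributes, purpose):
    """Admit a query only when its declared purpose is among the consented
    purposes of every attribute it reads (the Limited Use principle)."""
    return all(purpose in ATTRIBUTE_PURPOSES[a] for a in attributes)
```

The same gate, applied to outbound data transfers instead of queries, would cover Limited Disclosure.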
Key Issues and Key Results
This section gives a brief description of the techniques used by various researchers to ensure
privacy preserving data mining. This has two sides to it. People have thought about it from the
angle of modifying the databases so that sensitive information is hidden from the data miners.
The other side is to modify the data mining algorithms so that they do not produce rules which
reveal sensitive information in the data.
There are many approaches which have been adopted for privacy preserving data mining. We
can classify them based on the following dimensions:
• Data distribution
• Data modification
• Data mining Algorithm
• Data or rule hiding
• Privacy preservation
The data distribution refers to how the data is distributed in the database, whether the system is
centralized or distributed. People have considered different ways for horizontally distributed
databases and vertically distributed databases. Then we have the second dimension which is the
data modification. Data modification is used in order to modify the original values of a
database that needs to be released to the public, and in this way to ensure high privacy
protection. It is important that a data modification technique be in concert with the
privacy policy adopted by an organization. Methods of modification include data
perturbation, data blocking, aggregation or merging, swapping, and sampling, which
refers to releasing only part of the entire data.
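Of these modification methods, swapping is perhaps the easiest to sketch: values of one sensitive attribute are permuted across records, preserving the attribute's marginal distribution exactly while breaking the record-level linkage. A minimal sketch, not any specific published algorithm:

```python
import random

def swap_attribute(records, attribute):
    """Data swapping: randomly permute one sensitive attribute across all
    records. Marginal statistics of the attribute are preserved exactly,
    but the link between an individual and a value is broken."""
    values = [r[attribute] for r in records]
    random.shuffle(values)
    # Build new records; the originals are left untouched.
    return [dict(r, **{attribute: v}) for r, v in zip(records, values)]
```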
The third dimension is the data mining algorithm. Though we do not know in advance which
data mining algorithm will be used, knowing it facilitates the analysis and design of the
data hiding algorithm. Various data mining algorithms have been considered in isolation
for research purposes. Ideas have been developed for classification rules, association
rules, clustering, Bayesian networks and rough sets. Data or rule hiding refers to
whether raw data or aggregated data is to be released. Lessening the data allows for
producing weaker inference rules that will not allow inference of confidential values.
This process is called "Rule Confusion".
The last dimension refers to selective modification of data, which is required in order
to achieve higher utility for the modified data given that privacy is not jeopardized.
For this purpose, heuristic-based techniques like adaptive modification,
cryptography-based techniques like secure multi-party computation, and
reconstruction-based techniques, where the original distribution of data is
reconstructed from randomized data, have been proposed.
This paper briefly discusses these three classes of techniques, the various trade-offs
which exist among them, and the performance and accuracy levels these techniques have
been able to achieve.
Techniques and Methods
As mentioned in the last section, the privacy preserving techniques are broadly divided
into three classes:
• Heuristic-Based Techniques
• Cryptography-based Techniques
• Reconstruction-based Techniques
A number of heuristic-based techniques have been developed for data mining tasks like
classification, association rule discovery and clustering, based on the premise that
selective data modification or sanitization is an NP-hard problem, and that heuristics
can therefore be used to address the complexity issues.
Various association rule confusion algorithms exist for centralized and distributed
databases. In these databases, various data modification techniques have been applied.
The algorithms can be enumerated as shown below:
• Centralized Data Perturbation-Based Association Rule Confusion
• Centralized Data Blocking-Based Association Rule Confusion
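The perturbation-based flavor of rule confusion can be sketched as follows: the support of a sensitive itemset is pushed below the mining threshold by deleting one of its items from supporting transactions. This is a greedy toy version with made-up market-basket data, not any specific published algorithm:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def hide_itemset(transactions, sensitive, min_support):
    """Greedy perturbation: remove one item of the sensitive itemset from
    supporting transactions until its support falls below the mining
    threshold, so rules involving it can no longer be discovered."""
    victim = sorted(sensitive)[0]          # the item chosen for deletion
    out = [set(t) for t in transactions]   # work on a copy
    for t in out:
        if support(out, sensitive) < min_support:
            break
        if sensitive <= t:
            t.discard(victim)
    return out
```

The cost of hiding is collateral damage: non-sensitive rules involving the deleted item are weakened too, which is the utility/privacy trade-off these algorithms try to minimize.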
Similarly, for classification rules many algorithms have been developed based on various
combinations of data modification and data distribution techniques. The algorithms which
have been considered so far are enumerated below:
• Centralized Data Blocking-Based Classification Rule Confusion
• Horizontally partitioned data perturbation Classification Rule Confusion
• Vertically partitioned data modification Classification Rule Confusion
• Centralized data modification Classification Rule Confusion
Two notable research areas we have been able to identify here are:
1. Order Preserving Encryption Scheme for Numeric Data
Encryption is a well established technology for protecting sensitive data. However, once
encrypted, data can no longer be easily queried aside from exact matches. Rakesh
Agrawal's group at IBM has proposed an order preserving encryption scheme for
numeric data that allows any comparison operator to be applied directly to encrypted
data.
[Figure: Original and Encrypted data]
As can be seen from the above figure, the ordering of data tuples before and after
encryption is the same. It is also to be noted that the encrypted data items bear no
similarity to the original data items. Such encrypted distributions can therefore be
used to answer queries that involve the <, =, > comparison operators as well as exact
matches. For further details, see the reference paper by Rakesh Agrawal's group.
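The order-preserving property itself is easy to illustrate with a toy mapping in which ciphertexts are cumulative sums of key-dependent positive gaps. This is emphatically not the scheme proposed by Agrawal's group, only a demonstration of the property that comparisons survive encryption (the key is a made-up constant):

```python
import hashlib
import hmac

KEY = b"demo-key"   # hypothetical secret key, for illustration only

def _gap(i):
    """Key-dependent pseudorandom positive gap for position i."""
    digest = hmac.new(KEY, str(i).encode(), hashlib.sha256).digest()
    return 1 + digest[0] % 16

def encrypt(p):
    """Toy order-preserving mapping of a small non-negative integer: the
    ciphertext is the cumulative sum of secret gaps up to p, so
    p1 < p2 if and only if encrypt(p1) < encrypt(p2)."""
    return sum(_gap(i) for i in range(p + 1))
```

Because the gaps are strictly positive, range queries and sorting work directly on ciphertexts; without the key, the gaps (and hence the plaintexts) are not recoverable from a single ciphertext.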
2. Secure Distributed Computation
Secure distributed computation has been studied extensively as part of a larger body of
research in the theory of cryptography. Benny Pinkas believes that this has direct
relevance to privacy preserving computation of data mining algorithms.
For example, consider separate medical institutions that wish to conduct a joint research
while preserving the privacy of their patients. In this scenario it is required to protect
privileged information, but it is also required to enable its use for research or for other
purposes. In particular, although the parties realize that combining their data has some
mutual benefit, none of them is willing to reveal its database to any other party.
It is obvious that if a data mining algorithm is run against the union of the databases, and
its output becomes known to one or more of the parties, it reveals something about the
contents of the other databases. (For example, if a researcher from a medical institution
learns that the overall percentage of patients that have a certain symptom is 50%, while
he knows that this percentage in his population of patients is 40%, then he also learns
that more than 50% of the patients of the other institutions have this symptom.)
The first step in gaining a foothold is to define privacy and use this definition to limit the
information that is leaked by the distributed computation to the information that can be
learned from the designated output of the computation. For more details, see the
reference paper by Pinkas.
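A classic toy example from this body of work is the secure-sum protocol, which lets several parties (say, the medical institutions above) compute the total of their private counts without any party revealing its own count. The sketch below simulates all parties in one process:

```python
import random

def secure_sum(private_values, modulus=10**9):
    """Toy secure-sum protocol: the initiator starts a circulating total at
    a random mask; each party adds its private value to the total it
    receives and passes it on; the initiator finally removes the mask.
    Every party sees only a masked running total, never another party's
    input (assuming the parties do not collude)."""
    mask = random.randrange(modulus)
    total = mask
    for v in private_values:        # the value each 'institution' holds
        total = (total + v) % modulus
    return (total - mask) % modulus
```

Note that even this protocol leaks the designated output (the sum), which is precisely the residual leakage the definition of privacy above is meant to bound.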
The work of Agrawal and Srikant addresses the problem of building a decision tree
classifier from training data in which the values of individual records have been
perturbed. While it is not possible to accurately estimate original values in individual
data records, the authors propose a
reconstruction procedure to accurately estimate the distribution of original data values. By using
the reconstructed distributions, they are able to build classifiers whose accuracy is comparable to
the accuracy of classifiers built with the original data. For the distortion of values,
the authors have considered a discretization approach and a value distortion approach.
For reconstructing the original distribution, they have considered a Bayesian approach,
and they propose three algorithms for building accurate decision trees that rely on
reconstructed distributions.
The following graph compares the distribution reconstructed from randomized data with
the original distribution.
[Figure: original vs. reconstructed distribution; y-axis: Number of People]
The graph shows that the reconstruction techniques do a fairly good job of recovering
the original distribution from randomized data.
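The reconstruction idea can be sketched with a small iterative Bayesian estimator over discretized bins. This is a simplified illustration under assumed uniform noise, not the authors' exact algorithm:

```python
import random

def reconstruct(perturbed, noise_pdf, bins, iterations=30):
    """Iterative Bayesian reconstruction in the spirit of the approach
    described above: given values perturbed with additive noise of known
    density, repeatedly re-estimate the probability of each bin of
    original values from the posterior of every observation."""
    probs = [1.0 / len(bins)] * len(bins)   # start from a uniform prior
    for _ in range(iterations):
        new = [0.0] * len(bins)
        for w in perturbed:
            # posterior over original-value bins for this observation
            weights = [noise_pdf(w - b) * p for b, p in zip(bins, probs)]
            total = sum(weights)
            if total > 0:
                for k, wt in enumerate(weights):
                    new[k] += wt / total
        probs = [v / len(perturbed) for v in new]
    return probs

# Demo: every original value equals 5; noise is uniform on [-1, 1].
random.seed(2)
uniform_pdf = lambda y: 0.5 if -1 <= y <= 1 else 0.0
perturbed = [5 + random.uniform(-1, 1) for _ in range(2000)]
estimate = reconstruct(perturbed, uniform_pdf, bins=list(range(10)))
```

In the demo the estimated mass concentrates on the bins around the true value 5, mirroring the behavior shown in the graph above.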
Open Issues / Research Directions
• Data warehousing and the inference problem
• Preparing perturbed databases for a combination of algorithms
• Social effects [working with social scientists to preserve privacy across cultures]
• Formulating legal rules and developing data mining algorithms accordingly
• Privacy Inference Controller
The bulleted open issues are just a few of many. This field is so immature that many
issues must be solved before the algorithms can be considered for commercial purposes.
A data warehouse is the staging area for holding aggregate data, and data mining tools
are at their most powerful when working on aggregated data. So people should start
considering the
data mining algorithms from the data warehouse point of view. They should also consider the
Inference problem and how much it can affect the techniques employed for data warehouses.
It is very important to develop techniques which work across many types of data mining
algorithms. That is, we might modify a database to protect against a particular
technique, such as classification rules, but have no idea how the modification fares
against association rules. So an intruder might use association
rules to compromise the privacy of the data.
Researchers have suggested working with social scientists, because privacy issues vary
between regions and cultures. So it is very important to consider what these people say
while developing data mining algorithms. Similarly, the legal issues should also be
considered by working closely with various national agencies to gain a better
understanding of the legal issues, rules and regulations.
Research can be done on the construction of a Privacy Inference Controller, which forms
a layer between the data mining algorithm and the consumer of the association rules. The
job of the controller is to filter the sensitive association rules from the entire set
of association rules produced by the data miner.
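Such a controller might, in its simplest form, be a filter over the mined rules (the sensitive attribute names and the (antecedent, consequent) rule representation below are hypothetical):

```python
# Hypothetical policy: attributes that must never appear in released rules.
SENSITIVE = {"diagnosis", "hiv_status"}

def filter_rules(rules):
    """Release only the association rules that mention no sensitive
    attribute on either side of the rule."""
    return [
        (lhs, rhs) for lhs, rhs in rules
        if not (set(lhs) | set(rhs)) & SENSITIVE
    ]
```

A real controller would also have to catch indirect inferences, where combinations of innocuous rules imply a sensitive one, which is a much harder problem.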
Finally, we need to quantify privacy. Little work has been done on formulating metrics
which can quantitatively measure privacy violation. A policy similar to that adopted by
insurance companies could be considered, which is the direction in which researchers are
heading.
References
[1] Security and Privacy Implications of Data Mining, 1996, Chris Clifton and Don Marks
[2] Defining Privacy for Data Mining, Chris Clifton, Murat Kantarcioglu and Jaideep
Vaidya, Purdue University
[3] Data Mining, National Security, Privacy and Civil Liberties, Bhavani Thuraisingham,
The National Science Foundation
[4] A Framework for Privacy Preserving Classification in Data Mining, 2004,
Md. Zahidul Islam and Ljiljana Brankovic
[5] Privacy Preserving Mining of Association Rules, 2002, Alexandre Evfimievski,
Ramakrishnan Srikant, Rakesh Agrawal and Johannes Gehrke, IBM Almaden Research
[6] Privacy Preserving Data Mining, 2000, Rakesh Agrawal and Ramakrishnan Srikant,
IBM Research, Almaden
[7] Privacy Preserving Data Mining, Advances in Cryptology, 2000, Y. Lindell and
Benny Pinkas
[8] Detecting Privacy and Ethical Sensitivity in Data Mining Results, 2004, Peter Fule
and John Roddick
[9] Limiting Privacy Breaches in Privacy Preserving Data Mining, 2003, Alexandre
Evfimievski, Johannes Gehrke and Ramakrishnan Srikant
[10] State-of-the-art in Privacy Preserving Data Mining, Vassilios S. Verykios, Elisa
Bertino, Igor Nai Fovino, Loredana Parasiliti Provenza, Yucel Saygin and
Yannis Theodoridis
[11] Business Week, Privacy on the Net, March 2000
[12] L. Cranor, J. Reagle, and M. Ackerman. Beyond Concern: Understanding Net Users'
Attitudes About Online Privacy. Technical Report TR 99.4.3, AT&T Labs-Research, 1999
[13] A. Westin. E-commerce and Privacy: What Net Users Want. Technical report, Louis
Harris & Associates, June 1998
[14] A. Westin. Privacy Concerns & Consumer Choice. Technical report, Louis Harris &
Associates
[15] A. Westin. Freebies and Privacy: What Net Users Think. Technical report, Opinion
Research Corporation, July 1999
[16] Benny Pinkas, HP. SIGKDD Explorations, Volume 4, Issue 2
IBM Almaden Research Group.