Data Perturbation for Privacy Preserving Data Mining
Saif Mohammad Asif Kabir
A Thesis
in
The Department
of
Electrical and Computer Engineering
Presented in Partial Fulfillment of the Requirements
for the Degree of Master of Applied Science (Electrical Engineering) at
Concordia University
Montreal, Quebec, Canada
June 2007
©Saif Mohammad Asif Kabir, 2007
Abstract
Data Perturbation for Privacy Preserving Data Mining
Saif Mohammad Asif Kabir
Because of the increasing ability to trace and collect large amounts of personal information, privacy preservation in data mining applications has become an important concern. Data perturbation is one of the well-known techniques for privacy preserving data mining. The objective of these data perturbation techniques is to distort the individual data values while preserving the underlying statistical distribution properties. These data perturbation techniques are usually assessed in terms of both their privacy parameters and their associated utility measures. While the privacy parameters reflect the ability of these techniques to hide the original data values, the data utility measures assess whether the dataset keeps the performance of data mining techniques after the data distortion.
In this thesis, we investigate the use of truncated non-negative matrix factorization (NMF) with sparseness constraints, and the Discrete Wavelet Transform (DWT) with truncation, as data distortion techniques for data mining privacy. Privacy parameters previously proposed in the literature are used to measure the privacy of the data after perturbation, and the accuracy of a simple K-nearest neighbor (KNN) classifier is used as the data utility measure. Our experimental results show that these two techniques not only improve privacy, but also improve the classification accuracy.
Acknowledgments
This is one of the best moments in my Master's program: to publicly acknowledge those people who have contributed to making my success a part of their own in many different ways. First of all, I would like to express my earnest gratitude to my supervisors, Dr. Elhakeem and Dr. Youssef, for their perpetual support and guidance with enduring patience.
I would also like to express my appreciation to all the faculty and people at the Concordia Institute for Information Systems Engineering (CIISE) who contributed to my success in one way or another.
I would also like to acknowledge the innumerable contributions of Dr. Ben Hamza and my lab partner Ren Wang. Special thanks go to all my research partners at the EV 2.224 lab, whose help and patience made my work a lot easier. Thank you all.
For my parents, no words can suffice. My deepest love, supreme appreciation and gratitude to my parents, Enamul Kabir and Zulfia Yasmin, for their unceasing love, relentless patience and devotion in raising me. It would have been impossible for me to accomplish this endeavor without their support.
Finally, thanks to my beloved wife Shahrin Zaman for her patience, love and tender care from a distance.
Table of Contents
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS
CHAPTER 1  INTRODUCTION TO DATA MINING PRIVACY
1.1 Introduction
1.2 Related Work
1.2.1 P3P and Secure Database
1.2.2 Secure Multi-Party Computation
1.2.3 Data Swapping
1.2.4 Security of Statistical Database
1.2.5 Privacy Preserving Distributed Data Mining
1.2.6 Rule Hiding
1.2.7 Data Perturbation
1.3 Contribution of This Work
CHAPTER 2  BACKGROUND CONCEPTS
2.1 Introduction
2.2 Non-negative Matrix Factorization
2.2.1 Cost Function
2.2.2 Multiplicative Update Rule
2.2.3 NMF with Sparseness Constraints
2.3 Wavelet Overview
2.4 Privacy Measure
2.5 Data Utility Measures
2.6 Performance Measure for Classifier
2.7 Bayesian Estimation
2.8 Conclusion
CHAPTER 3  NON-NEGATIVE MATRIX FACTORIZATION FOR DATA PERTURBATION
3.1 Introduction
3.2 Experimental Results
3.3 Adding Sparseness Constraint
3.4 Dealing with Negative Values
3.5 Conclusions
CHAPTER 4  DATA DISTORTION USING DISCRETE WAVELET TRANSFORM
4.1 Introduction
4.2 Experimental Results
4.3 Conclusion
CHAPTER 5  BAYESIAN ESTIMATION OF ORIGINAL DATA
5.1 Introduction
5.2 Experimental Results
5.3 Conclusions
CHAPTER 6  CONCLUSIONS AND FUTURE WORK
6.1 Conclusions
6.2 Future Work
REFERENCES
List of Figures
Figure 3.1 Effect of the reduced rank M on the privacy parameters
Figure 3.2 Effect of sparseness constraint on privacy parameters (Wisconsin Breast Cancer dataset)
Figure 3.3 Effect of sparseness constraint on privacy parameters (Ionosphere dataset using first approach)
Figure 3.4 Effect of truncation threshold on privacy parameters (Ionosphere dataset using first approach)
Figure 3.5 Effect of sparseness constraint on privacy parameters (Ionosphere dataset using second approach)
Figure 3.6 Effect of truncation threshold on privacy parameters (Ionosphere dataset using second approach)
Figure 4.1 Proposed data distortion technique
Figure 4.2 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Wisconsin Breast Cancer dataset)
Figure 4.3 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Breast Cancer dataset)
Figure 4.4 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Ionosphere dataset)
Figure 4.5 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Ionosphere dataset)
Figure 5.1 Triangular signal distribution
Figure 5.2 Uniform signal distribution
Figure 5.3 Distribution of Gaussian noise in the Triangular signal
Figure 5.4 Distribution of equivalent noise in the NMF-perturbed Triangular signal
Figure 5.5 Distribution of equivalent noise in the DWT-perturbed Triangular signal
Figure 5.6 Distribution of Gaussian noise in the Uniform signal
Figure 5.7 Distribution of equivalent noise in the NMF-perturbed Uniform signal
Figure 5.8 Distribution of equivalent noise in the DWT-perturbed Uniform signal
List of Tables
Table 2.1 Illustration of the one-dimensional Haar wavelet transform
Table 2.2 Illustration of the one-dimensional Haar wavelet transform applied to f(x) = [7 5 6 2]
Table 2.3 Confusion matrix
Table 3.1 Experimental results for KNN (K = 30)
Table 3.2 Effect of the sparseness constraint on the privacy parameters and accuracy
Table 3.3 The effect of the threshold ε on the privacy parameters and accuracy
Table 3.4 Privacy parameters for the Ionosphere dataset using the absolute value approach with M = 16
Table 3.5 Privacy parameters for the Ionosphere dataset using the absolute value approach with M = 16 and $S_h = 0.08$
Table 3.6 Privacy parameters for the Ionosphere dataset using the absolute value approach with truncation
Table 3.7 Trade-off between the privacy parameters and accuracy for the Ionosphere dataset using the biasing approach
Table 4.1 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Wisconsin Breast Cancer dataset)
Table 4.2 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Breast Cancer dataset)
Table 4.3 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Ionosphere dataset)
Table 4.4 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Ionosphere dataset)
Table 5.1 Estimation results for triangular signal distribution
Table 5.2 Estimation results for uniform signal distribution
List of Symbols
KDD Knowledge Discovery in Database
P3P Platform for Privacy Preferences
KNN K-Nearest Neighbor
SMC Secure Multi-Party Computation
PCA Principal Component Analysis
ACC Accuracy
NMF Non-negative Matrix Factorization
DWT Discrete Wavelet Transform
SVM Support Vector Machine
ICA Independent Component Analysis
PMF Positive Matrix Factorization
FFT Fast Fourier Transform
PDF Probability Density Function
MSE Mean Squared Error
XML Extensible Markup Language
W3C World Wide Web Consortium
Chapter 1
Introduction to Data Mining Privacy
1.1 Introduction
Data mining or knowledge discovery in databases [1] is the process of searching
for useful and understandable patterns in large volumes of data using tools such as
classification and association rule mining.
A huge amount of data is practically useless when one is not able to extract the valuable information hidden in it. Data mining is a promising approach to meet this challenging requirement and has emerged as a momentous technology for gaining knowledge from vast quantities of data. The fascination with the promise of analyzing large amounts of data has led to an increasing number of successful data mining applications in marketing, business, medical analysis and other domains in which pattern discovery is paramount for strategic decision making.
On the other hand, modern information technology also collects and analyzes millions of transactions containing personal information. With the advent of new technology, there has been an explosive growth in the amount of personal data generated or collected and stored in electronic form. Several applications deal with privacy-sensitive data such as financial transactions, health care records, criminal records, and credit records.
This has led to a growing concern that the use of this technology may violate the privacy of individuals or groups, creating a backlash against the technology. Privacy advocates have been demanding efforts to stop bringing more information into integrated collections. One example is the public protest in Japan against the creation of a national registry containing information previously held by the prefectures [2]. Another example is the U.S. Senate's halting of all data mining research and development by the U.S. Department of Defense [3] through the introduction of the Data Mining Moratorium Act. Thus, despite the fact that most data mining methods aim to develop generalized knowledge rather than identify information about specific individuals, and despite its benefits in various areas, one has to acknowledge that improper use of data mining techniques can also result in new threats to privacy and information security.
On the other hand, one also has to acknowledge that the main problem is not with data mining itself, but with the infrastructure used to support it. According to the Data Mining Moratorium Act, the halting of the Total/Terrorism Information Awareness program was not because preventing terrorism is a bad concept but because of the possible exploitation of the data. As long as the data was distributed among multiple databases under several authorities, it was difficult to collect the data for misuse. Building a data warehouse for data mining may change this assumption. Another problem is with the results themselves. Publishing summaries of census data carries a risk of violating privacy, a risk well recognized by the census community. Summary tables may not identify an individual, but it may be possible to isolate an individual and determine private information by combining results from different tables. Thus it is clear that, in order to be fair to this technology, both the data mining and information security communities must address these issues, and different techniques have to be adapted to resolve these problems.
Numerous techniques have been developed that allow mining without direct access to the data, in order to avoid the potential for misuse posed by an integrated data warehouse. This work falls into two main categories: data perturbation [4] and secure multiparty computation [9]. In data perturbation techniques, the original dataset is perturbed with the aim that the disclosed dataset does not reveal any private information. The data mining challenge in this case is how to obtain useful (non-private) information from such perturbed data. The second category depends on separation of authority: data is presumed to be controlled by different entities, and the goal is for those entities to cooperate to obtain valid data mining results without disclosing the real data to others. The goal of this thesis is to develop and evaluate new data perturbation techniques for privacy preserving data mining.
Before we move into the details of our proposed work, in the next section, we
briefly overview some of the related literature on privacy preserving data mining.
1.2 Related Work
A growing body of literature exists on different approaches to privacy preserving data mining. Some of the approaches adopted for preserving privacy in data mining are:
• Secure Database / Platform for Privacy Preferences (P3P)
• Secure Multi-Party Computation (SMC)
• Data Swapping
• Security of Statistical Database
• Privacy Preserving Distributed Data Mining
• Rule Hiding
• Data Perturbation
In the next few sections, we provide a brief description of these approaches.
1.2.1 P3P and Secure Database
Nowadays a huge portion of the data related to individuals is collected by different web sites providing different services. The World Wide Web Consortium (W3C) Platform for Privacy Preferences (P3P) [5] is considered one of the most well-known infrastructures enabling web users to gain more control over the information that web sites collect. P3P covers the system and architecture design perspectives of privacy preserving data mining. While P3P is criticized for relying on each individual website to be honest with its policy files, it is still considered a good effort to maintain a privacy standard for personal data collected over the Internet. P3P provides a way for web site owners to encode their privacy policies in a standard XML format so that users can check them against their privacy preferences to decide whether or not to release their personal data to the web site. A detailed survey of current P3P implementations is given in [6]. The basic P3P architecture is client based, i.e., the privacy of the client is defined at the web-client end. In contrast with these client-centric implementations, the author in [7] proposed a server-centric architecture for P3P. However, the current P3P standard only provides a schema for web users to check the consistency of their privacy preferences with the web sites' privacy policies. No mechanism has been specified so far to force web sites to act according to their stated policies.
1.2.2 Secure Multi-Party Computation
Secure multi-party computation (SMC) is the problem of evaluating a function of two or more parties' secret inputs. Each party finally obtains a share of the function output, and no other information is revealed to the parties except what is implied by each party's own inputs and outputs. Yao [9] introduced the secure multi-party computation concept, which was later extended by Micali et al. in [10]. The circuit evaluation protocol [10,12], 1-out-of-k oblivious transfer [13], homomorphic encryption [14], commutative encryption [15], Yao's millionaire problem (secure comparison) and some other cryptographic techniques serve as the building blocks of SMC. Detailed discussions of the SMC framework can be found in [16], [17]. A variety of new SMC applications and open problems for a spectrum of cooperative computation domains is presented in [18]. Related works include privacy preserving information retrieval [19], privacy preserving statistical analysis [20], [21], privacy preserving geometric computation [22], and privacy preserving scientific computation [23]. The work presented in [24] discusses a wide array of new secure multi-party computation applications. The SMC ideas have also been applied to privacy preserving decision tree induction [25], naïve Bayesian classification [26] of horizontally partitioned data, privacy preserving Bayesian network structure computation for vertically partitioned data [27], K-Means clustering over vertically partitioned data [28] and many others.
1.2.3 Data Swapping
Tore Dalenius and Steven Reiss [29] proposed the basic idea of data swapping. This simple technique maintains the confidentiality of the attributes without changing the aggregate statistical properties of the data. The database is transformed by switching a subset of attributes between selected pairs of records so that the lower-order frequency counts or marginal totals are preserved while data confidentiality is uncompromised. This technique can be considered a data perturbation technique, which will be discussed in a later section. Since its initial appearance, a variety of refinements of data swapping have been suggested. The reader is referred to [30] for a thorough treatment.
1.2.4 Security of Statistical Database
The objective of this technique is to find mechanisms for preventing the disclosure of individual values while maximizing the number of statistical queries that can be answered about subsets of records of a database. The security of statistical databases against confidentiality disclosure is surveyed in [31]. Security control methods can be classified into conceptual, output perturbation, query restriction and data perturbation methods [32]. Statistical microdata protection using record linkage is discussed in [33].
1.2.5 Privacy Preserving Distributed Data Mining
The distributed data mining approach supports the computation of data mining models and the extraction of patterns at a certain node by exchanging only the minimal necessary information among the participating nodes. Several distributed algorithms have been proposed in the field of distributed data mining [34, 35]. The Fourier spectrum based approach to representing and constructing decision trees [36, 37] and collective hierarchical clustering [38] are examples of distributed data mining algorithms which can be used for privacy preserving data mining after minor modification. Several distributed techniques to mine multiparty data have been reported; a privacy preserving technique to construct decision trees [39], a multi-party secure computation framework [40], and association rule mining from homogeneous [41] and heterogeneous [42] distributed datasets are other examples.
1.2.6 Rule Hiding
The main idea of rule hiding is to transform the database such that the sensitive rules are masked while all other underlying patterns can still be discovered. Optimal sanitization, i.e., hiding sensitive large item sets in the context of association rule mining, has been shown to be an NP-hard problem [43], which is why heuristic approaches have been applied to address the complexity issues. The perturbation based rule hiding technique [44, 45] is implemented by toggling the 0 and 1 values so that the frequent item sets that generate the rule are hidden, or the support of sensitive rules is lowered to a user-specified threshold. The blocking based association rule hiding technique [46] is another example of this class of techniques.
1.2.7 Data Perturbation
Data perturbation approaches can be categorized into two main categories: the probability distribution approach and the value distortion approach. The first approach replaces the data with another sample estimated from the same distribution [47]. The second approach perturbs the values of the data elements directly by some additive or multiplicative noise before their release to the miner. Some randomized methods [49] fall under this approach. The authors in [50] proposed a value distortion technique to protect privacy by adding Gaussian random noise to the original data. They masked the original data while estimating the original distribution and building a decision tree model with good accuracy. This approach was further extended by applying an expectation-maximization (EM) based algorithm [51] for better reconstruction of the distribution. Evfimievski [52] and Rizvi [53] considered the same approach for association rule mining and suggested techniques for limiting privacy breaches. Chong [54] proposed probability distribution data distortion where sensitive variables are replaced by a distorted set of values. To avoid some of the drawbacks of additive noise, other approaches make use of multiplicative noise [55], [56] for protecting the privacy of the data while maintaining some of the original analytic properties. Two methods of multiplicative noise are used for data perturbation. The first method is based on generating random numbers that have a truncated Gaussian distribution with mean one and small variance, and multiplying each element of the original data by the noise. The second method is to take a logarithmic transformation of the data first (for positive data only), compute the covariance, generate random noise following a multivariate Gaussian distribution with mean zero and variance equal to a constant times the computed covariance, add this noise to each element of the transformed data, and finally take the antilog of the noise-added data. Multiplicative perturbation overcomes the scale problem, and it has been proved that the mean and variance/covariance of the original data elements can be estimated from the perturbed version. In practice, the first method is good if the data disseminator only wants to make minor changes to the original data; the second method assures higher security than the first one while still maintaining the data utility very well. One of the main problems of traditional additive and multiplicative perturbation is that they perturb each data element independently; therefore, the similarity between attributes or observations, which are considered as vectors in the original data space, is not well preserved. Liu [57] proposed a random projection based multiplicative data perturbation method that uses random projection matrices to construct a perturbed representation instead of multiplying each element by noise.
1.3 Contribution of this work
This work investigates two privacy preserving data perturbation techniques: truncated non-negative matrix factorization (NMF) with sparseness constraints, and the Discrete Wavelet Transform (DWT) with truncation. Similar to other techniques, our main objective is to conceal the individual data items while preserving the underlying statistical distribution, i.e., the statistical properties of the perturbed data should still match those of the original data. Privacy parameters previously proposed in the literature are used to measure the privacy of the data after perturbation, and the accuracy of a simple K-nearest neighbor (KNN) classifier is used as the data utility measure. Our experimental results on real world datasets [64] show that these two techniques not only improve privacy, but also improve the classification accuracy.
The rest of the thesis is organized as follows. Chapter 2 briefly reviews the essential concepts and definitions that we will refer to throughout the thesis. In chapter 3, we investigate the use of non-negative matrix factorization (NMF) with sparseness constraints for data perturbation. The results obtained when applying our proposed technique to the original Wisconsin Breast Cancer and Ionosphere databases, downloaded from the UCI Machine Learning Repository, are also presented. Chapter 4 presents the corresponding results when using the Discrete Wavelet Transform (DWT) with truncation. Chapter 5 presents the results obtained when using the Bayesian estimation technique to estimate the original signal distorted with the above two techniques. Finally, conclusions and future work are given in chapter 6.
The results presented in chapters 3 and 4 have been partially published in [65] and [66], respectively.
Chapter 2
Background Concepts
2.1 Introduction
The objective of data perturbation techniques is to distort the individual data values while preserving the underlying statistical distribution properties. These techniques are usually assessed in terms of both their privacy parameters and their utility measures. While the privacy parameters reflect the ability of these techniques to hide the original data values, the data utility measures assess whether the dataset keeps the performance of data mining techniques after the data distortion. Non-negative Matrix Factorization (NMF) and the Discrete Wavelet Transform (DWT) are two important techniques used in many signal processing applications. In this work, we investigate the use of both NMF and DWT as data perturbation techniques for privacy preserving data mining. In this chapter, we briefly review the formal theory of NMF and DWT. Different privacy parameters and utility measures previously introduced in the literature are also reviewed. Finally, we describe the vulnerability of the privacy methods in terms of original signal estimation.
2.2 Non-negative Matrix Factorization
Non-negative matrix factorization (NMF) is a matrix factorization technique which produces a useful decomposition for data analysis. NMF decomposes the data as a product of two matrices having non-negative elements. This yields a reduced representation of the original data that can be seen either as a feature extraction or as a dimensionality reduction technique. Because of its non-negativity constraints, NMF can be interpreted as a parts-based representation of the data. NMF is an active area of research in several fields, and the notion of low-rank approximation, of which NMF is one instance, arises in a wide range of important applications.
The NMF approach can be formulated as follows: given a non-negative $N \times R$ data matrix V, we can approximately factorize V into the product of two non-negative matrices W and H with sizes $N \times M$ and $M \times R$ respectively, that is, $V \approx WH$, where the reduced rank M of the factorization is generally chosen so that $(N + R)M < NR$, so the product WH can be regarded as a compressed form of the data V. In this approximation, W contains the basis vectors as its columns, whereas each column of H contains the coefficients of one measurement vector. Note that each measurement vector is written in terms of the same basis vectors. The optimal choices of the matrices W and H are defined to be those non-negative matrices that minimize the reconstruction error between V and WH. In the next section, we present two of the most commonly used error functions.
2.2.1 Cost Function
We have to define a cost function in order to find the approximation matrices W and H. The Euclidean distance is one natural way to evaluate the approximation between the two matrices V and WH. The (squared) Euclidean distance between any two matrices A and B is defined as

$$\|A - B\|^2 = \sum_{i,j} (A_{ij} - B_{ij})^2.$$

Here the lower bound is zero, and equality is achieved when A = B.
Another useful cost function is the divergence of A from B, which is defined as follows:

$$D(A \| B) = \sum_{i,j} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right).$$

Similar to the Euclidean distance, the lower bound of the divergence measure is zero, and it is achieved when A = B. This measure is called the divergence from A to B as it is not symmetric, i.e., in general, $D(A \| B) \neq D(B \| A)$. In the above measure, A and B can be interpreted as normalized probability distributions when $\sum_{i,j} A_{ij} = \sum_{i,j} B_{ij} = 1$.
Many NMF factorization algorithms have been proposed. Among these techniques, the multiplicative update [58] is the simplest to describe.
2.2.2 Multiplicative update rule
Assuming the use of the Euclidean distance, the objective of the multiplicative update is to minimize $\|V - WH\|^2$ with respect to W and H, subject to the constraints $W, H \geq 0$. The multiplicative update algorithm [58] can be summarized as follows:
1. Initialize W and H as non-negative random matrices.
2. Update both W and H until convergence of $\|V - WH\|^2$ with the following update rules:

$$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}.$$

It can be shown that $\|V - WH\|^2$ is non-increasing under the above update rules [58]. One should also note that both W and H should be updated simultaneously during the above updates. In other words, instead of updating the whole matrix W first and then updating H, we should update one row of W and the corresponding column of H. During the update of a row of W or a column of H, we do not need to calculate the whole matrices $W^T V$, $W^T W H$, $V H^T$ and $W H H^T$, as we only need one row and one column of these matrices during one update.
If the divergence measure is used, then the update rules in step 2 above should be replaced by

$$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}.$$

Similar to the Euclidean update, both W and H should be updated simultaneously. It can also be shown that the divergence $D(V \| WH)$ is non-increasing under the above update rules [58].
It should be noted that minimizing the cost function above can also be seen as a traditional bound-constrained optimization problem, which can be solved by some simple and effective techniques such as the projected gradient method [58], [59].
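To make the procedure concrete, the following is a minimal sketch of the Euclidean multiplicative updates in Python/NumPy. For simplicity it updates the full matrices W and H in turn (rather than one row and column at a time as described above), and the small constant eps added to the denominators to avoid division by zero is an implementation assumption.

```python
import numpy as np

def nmf_multiplicative(V, M, n_iter=200, eps=1e-9, seed=0):
    # V: non-negative (N x R) data matrix; M: reduced rank.
    # Returns non-negative W (N x M) and H (M x R) with V ~ W @ H.
    rng = np.random.default_rng(seed)
    N, R = V.shape
    W = rng.random((N, M))  # step 1: random non-negative initialization
    H = rng.random((M, R))
    for _ in range(n_iter):  # step 2: alternate the multiplicative updates
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy usage: the reconstruction error should decrease with n_iter.
V = np.abs(np.random.randn(20, 10))
W, H = nmf_multiplicative(V, M=2)
print(np.linalg.norm(V - W @ H))
```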
2.2.3 NMF with Sparseness Constraints
Sparse coding is a coding of data in which only a few of the components of the code are significantly active for any given input vector. While the non-negativity constraints enforce some sparseness during NMF, traditional NMF techniques do not provide explicit control over the degree of sparseness. On the other hand, NMF with sparseness constraints has already been used in applications that require additional sparseness. The intent of adding a sparseness constraint to NMF is to find a decomposition that results in a parts-based representation where only a few units are active in representing a typical data vector. As a result, most of the values are close to zero while a few components take significantly non-zero values.
One sparseness measure of a given vector x is based on the relationship between its $L_1$ norm and its $L_2$ norm [1] and is given by

$$\text{sparseness}(x) = \frac{\sqrt{n} - \left( \sum_i |x_i| \right) \Big/ \sqrt{\sum_i x_i^2}}{\sqrt{n} - 1},$$

where n is the dimensionality of x. The sparseness constraint can be enforced either on W, or on H, or on both of them, depending on the application.
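As a quick illustration, the following is a small sketch of this measure in Python/NumPy; it evaluates to 1 for a vector with a single non-zero entry and to 0 for a vector whose entries are all equal.

```python
import numpy as np

def sparseness(x):
    # Sparseness of a vector, from the L1/L2 ratio formula above.
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

print(sparseness([0, 0, 0, 1]))  # 1.0: maximally sparse
print(sparseness([1, 1, 1, 1]))  # 0.0: minimally sparse
```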
Assuming that one wants to enforce sparseness on H, the (Euclidean) cost function becomes

$$E(W, H) = \sum_{i,j} \left( V_{ij} - (WH)_{ij} \right)^2 + \lambda \sum_{i,j} g(H_{ij}),$$

where $\lambda \geq 0$ is the sparseness parameter and g is the sparseness penalty function. This formulation suffers from a scaling problem: by scaling up the basis vectors and correspondingly scaling down the measurement vectors, the same reconstruction error can be obtained with a lower sparsity penalty, so the cost is minimized as the sparsity term tends to zero and the basis vectors grow without bound.
To solve this scaling problem, a normalization step $\overline{W}_j = \frac{W_j}{\alpha_j}$, where $\alpha_j = \alpha(W_j)$ is some norm of $W_j$, is incorporated when minimizing the above equation. Reformulating the cost function to work with the normalized vectors, the cost function above becomes

$$E(W, H) = \sum_{i} \Bigl( V_i - \sum_j \frac{W_j}{\|W_j\|} H_{ji} \Bigr)^2 + \lambda \sum_{i,j} g(H_{ij}).$$

Thus, the new cost function depends on the variables $E(\{\overline{W}_j\}, H)$, where $\overline{W}_j := \frac{W_j}{\|W_j\|}$ is the normalized basis vector and $\|\cdot\|$ denotes any differentiable norm.
The modified update rule for NMF with a sparseness constraint using gradient descent is summarized by the following steps:
1. Calculate $\nabla_{W_j} \|W_j\|$.
2. Normalize the basis vectors according to $W_j \leftarrow \frac{W_j}{\|W_j\|}$.
3. Calculate the approximate factorization according to $\hat{V}_i = \sum_j H_{ji} W_j$.
4. Update the measurement vectors according to

$$H_{ji} \leftarrow H_{ji} \otimes \frac{V_i^T W_j}{\hat{V}_i^T W_j + \lambda g'(H_{ji})}.$$

5. Calculate the reconstruction with the new coefficient vectors according to $\hat{V}_i = \sum_j H_{ji} W_j$.
6. Update the non-normalized basis vectors according to

$$W_j \leftarrow W_j \otimes \frac{\sum_i H_{ji} \left[ V_i + \bigl( \hat{V}_i^T W_j \bigr) \nabla_{W_j} \|W_j\| \right]}{\sum_i H_{ji} \left[ \hat{V}_i + \bigl( V_i^T W_j \bigr) \nabla_{W_j} \|W_j\| \right]},$$

where $\otimes$ denotes element-wise multiplication of the corresponding matrices.
7. Repeat the above steps until convergence.
For a proof of convergence, the reader is referred to [58].
2.3 Wavelet Overview
Wavelet transforms are mathematical tools for hierarchically decomposing functions. They allow a function to be described in terms of a coarse overall level, plus details that range from broad to narrow, and thus offer an elegant technique for representing the levels of detail present. A wavelet transformation converts data from the original domain to a wavelet domain by expanding the raw data in an orthonormal basis generated by dilation and translation of a father and mother wavelet. The wavelet transformation preserves the structure of the data. A contracted, high-frequency version of the wavelet performs temporal analysis, while a dilated, low-frequency version of the same wavelet performs frequency analysis.
Unlike the Fourier transform, which uses sines and cosines as basis functions, the wavelet transform allows more complicated basis functions. While individual wavelet functions are localized in space, the Fourier sine and cosine functions are not. Moreover, the wavelet transform does not have a single fixed set of basis functions like the Fourier transform. The localization property of wavelets makes many functions sparse when transformed into the wavelet domain. This sparseness has been exploited in various applications such as data compression and noise removal.
In this work, we use the Haar wavelet transform [60], which is based on one of the simplest possible wavelets, the Haar wavelet, described by the step function

$$\phi(x) = \begin{cases} 1 & 0 \leq x < \tfrac{1}{2}, \\ -1 & \tfrac{1}{2} \leq x \leq 1, \\ 0 & \text{otherwise}. \end{cases}$$
In order to avoid unnecessary mathematical notation, we describe the Haar wavelet transform through a simple example. The one-dimensional Haar transform of a function f can be viewed as a series of averaging and differencing operations on a discrete function: we compute the averages and differences between every two adjacent values of f(x). The procedure for finding the Haar transform of a discrete function f(x) = [a0 a1 a2 a3] is shown in Table 2.1. In this example, resolution 4 is the full resolution.

Resolution   Approximation                          Detail coefficients
4            [a0 a1 a2 a3]
2            [b0 b1] = [(a0+a1)/2, (a2+a3)/2]       [(a0-a1)/2, (a2-a3)/2]
1            [(b0+b1)/2]                            [(b0-b1)/2]

Table 2.1 Illustration of the one-dimensional Haar wavelet transform

For example, if f(x) = [7 5 6 2], then we have

Resolution   Approximation   Detail coefficients
4            [7 5 6 2]
2            [6 4]           [1 2]
1            [5]             [1]

Table 2.2 Illustration of the one-dimensional Haar wavelet transform applied to f(x) = [7 5 6 2]
Note that another definition of the Haar wavelet transform, differing from the above definition by a factor of 2, also exists. However, this constant factor is irrelevant to our work, since the distorted dataset is obtained by applying the inverse transform afterwards. It can also be shown that the above transform is equivalent to multiplying the vector representation of f by an integer (wavelet) matrix, which can be computed more efficiently than the analogous Fourier matrix.
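For illustration, a minimal sketch of this averaging and differencing procedure in Python/NumPy follows; the input length is assumed to be a power of two.

```python
import numpy as np

def haar_1d(f):
    # Full one-dimensional Haar transform by repeated pairwise
    # averaging and differencing, as in Tables 2.1 and 2.2.
    f = np.asarray(f, dtype=float)
    details = []
    while f.size > 1:
        details.insert(0, (f[0::2] - f[1::2]) / 2)  # detail coefficients
        f = (f[0::2] + f[1::2]) / 2                 # next approximation
    return f, details

approx, details = haar_1d([7, 5, 6, 2])
print(approx, details)  # [5.] [array([1.]), array([1., 2.])]
```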
Multi-dimensional wavelets are usually defined via the tensor product: the two-dimensional wavelet basis consists of all possible tensor products of the one-dimensional basis functions. Applying the Haar wavelet transform to a two-dimensional matrix results in four sets of coefficients: the approximation coefficients matrix cA, and the horizontal, vertical and diagonal detail coefficients matrices (called cH, cV, and cD respectively). For example, when applying a single-level Haar wavelet transform to the matrix

$$\begin{pmatrix} 3 & 5 \\ 9 & 8 \end{pmatrix}$$

we obtain:

cA (the overall average) = (3 + 5 + 9 + 8)/4 = 6.25,
cH (the average of the difference of the row sums) = ((3 + 5) - (9 + 8))/4 = -2.25,
cV (the average of the difference of the column sums) = ((3 + 9) - (5 + 8))/4 = -0.25, and
cD (the average of the difference of the diagonal sums) = ((3 + 8) - (9 + 5))/4 = -0.75.

We can think of the approximation coefficients as a coarser resolution version of the original signal, and the detail (difference) coefficients as the higher resolution details.
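These four coefficients are easy to verify numerically; the following few lines simply recompute them for the 2 × 2 example above.

```python
import numpy as np

m = np.array([[3.0, 5.0], [9.0, 8.0]])
cA = m.sum() / 4                                      # 6.25
cH = (m[0].sum() - m[1].sum()) / 4                    # -2.25
cV = (m[:, 0].sum() - m[:, 1].sum()) / 4              # -0.25
cD = ((m[0, 0] + m[1, 1]) - (m[1, 0] + m[0, 1])) / 4  # -0.75
print(cA, cH, cV, cD)
```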
2.4 Privacy Measure
Throughout this work, we adopt the same set of privacy parameters proposed in [62]. The value difference (VD) parameter is used as a measure of the value difference after the data distortion algorithm is applied to the original data matrix. Let V and $\overline{V}$ denote the original and distorted data matrices, respectively. Then VD is given by

$$VD = \frac{\| \overline{V} - V \|}{\| V \|},$$

where $\|\cdot\|$ denotes the Frobenius norm of the enclosed argument.
After a data distortion, the order of the values of the data elements also changes. Several metrics are used to measure the position difference of the data elements. For a dataset V with n data objects and m attributes, let $Rank_j^i$ denote the rank (in ascending order) of the jth element in attribute i, and let $\overline{Rank}_j^i$ denote the rank of the corresponding distorted element. The RP parameter is used to measure the position difference. It indicates the average change of rank for all attributes after distortion and is given by

$$RP = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} \left| Rank_j^i - \overline{Rank}_j^i \right|.$$

RK represents the percentage of elements that keep their rank in each column after distortion and is given by

$$RK = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} Rk_j^i,$$

where $Rk_j^i = 1$ if the element keeps its position in the order of values, and $Rk_j^i = 0$ otherwise.
Similarly, the CP parameter is used to measure how the rank of the average value of each attribute varies after the data distortion. In particular, CP measures the change of rank of the average values of the attributes and is given by

$$CP = \frac{1}{m} \sum_{i=1}^{m} \left| RankVV^i - \overline{RankVV}^i \right|,$$

where $RankVV^i$ and $\overline{RankVV}^i$ denote the rank of the average value of the ith attribute before and after the data distortion, respectively.
Similar to RK, CK is used to measure the percentage of attributes that keep the rank of their average value after distortion.
From the data privacy perspective, a good data distortion algorithm should result in high values for the RP and CP parameters and low values for the RK and CK parameters.
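A minimal sketch of these five parameters in Python/NumPy is given below; it assumes a matrix with one row per data object and one column per attribute, and its handling of rank ties is a simplifying assumption.

```python
import numpy as np

def privacy_parameters(V, Vd):
    # V: original (n x m) data matrix; Vd: its distorted version.
    n, m = V.shape
    VD = np.linalg.norm(Vd - V) / np.linalg.norm(V)     # relative value difference

    rank = np.argsort(np.argsort(V, axis=0), axis=0)    # per-attribute element ranks
    rank_d = np.argsort(np.argsort(Vd, axis=0), axis=0)
    RP = np.abs(rank - rank_d).mean()                   # average rank change
    RK = (rank == rank_d).mean()                        # fraction keeping their rank

    crank = np.argsort(np.argsort(V.mean(axis=0)))      # ranks of attribute averages
    crank_d = np.argsort(np.argsort(Vd.mean(axis=0)))
    CP = np.abs(crank - crank_d).mean()
    CK = (crank == crank_d).mean()
    return VD, RP, RK, CP, CK
```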
2.5 Data Utility Measures
The data utility measures assess whether the dataset keeps the performance of data mining techniques after the data distortion. In other words, one mark of a good privacy preserving data mining technique is that relevant statistics can still be computed and prediction models can still be constructed without having access to the original data.
The success of machine learning techniques in data mining has recently led researchers to explore the applicability of learning algorithms to privacy preserving data mining. A supervised learning algorithm is fed a block of data that has been manually labeled with two or more classes and builds a classifier, which is then used to assign observations to their class.
Throughout this work, we use the accuracy of a simple K-nearest neighbor (KNN) classifier as our data utility measure. KNN classification is a simple instance-based learning algorithm that has been shown to be effective in data classification. The success of this algorithm is due to the availability of an effective similarity measure among the K nearest neighbors.
The algorithm starts by calculating the similarity between the test data and all data in the training set. It then picks the K closest instances and assigns the test data to the most common class among these nearest neighbors. All observations of the training and test data are considered as vectors, and the classifier finds the K training vectors that are most similar to the test vector. In our work, we use the Euclidean distance as the similarity measure.
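The following is a small sketch of this utility measure in Python/NumPy; the value k = 30 matches the breast cancer experiments reported later, but is otherwise an arbitrary choice.

```python
import numpy as np

def knn_accuracy(X_train, y_train, X_test, y_test, k=30):
    # Majority-vote KNN with Euclidean distance, reporting accuracy.
    correct = 0
    for x, y in zip(X_test, y_test):
        dist = np.linalg.norm(X_train - x, axis=1)   # distance to every training vector
        nearest = y_train[np.argsort(dist)[:k]]      # labels of the k closest
        labels, counts = np.unique(nearest, return_counts=True)
        correct += labels[np.argmax(counts)] == y    # majority vote
    return correct / len(y_test)
```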
2.6 Performance Measure for Classifier
There are many measures used to assess the performance of a classifier. Here, we introduce the performance measures used throughout the thesis. Let N = A + B + C + D be the total number of observations in the test data, where A, B, C and D are as given in Table 2.3.

                                Class 1   Class 0
Classifier decision: Class 1      A         B
Classifier decision: Class 0      C         D

Table 2.3 Confusion matrix

If Table 2.3 denotes the confusion matrix of the data classifier, then we define the accuracy, precision, recall, and F1 for the class 1 classifier as follows:

$$ACCURACY = \frac{A + D}{N}, \quad PRECISION\,(P) = \frac{A}{A + B}, \quad RECALL\,(R) = \frac{A}{A + C}, \quad F1 = \frac{2PR}{P + R}.$$

Similar measures can be defined for class 0:

$$ACCURACY = \frac{A + D}{N}, \quad PRECISION\,(P) = \frac{D}{C + D}, \quad RECALL\,(R) = \frac{D}{B + D}, \quad F1 = \frac{2PR}{P + R}.$$
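As a sanity check, a small sketch computing the class 1 measures from the confusion matrix counts follows; the counts in the usage line are made up for illustration.

```python
def class1_metrics(A, B, C, D):
    # Accuracy, precision, recall and F1 for class 1 (Table 2.3 layout).
    N = A + B + C + D
    accuracy = (A + D) / N
    precision = A / (A + B)
    recall = A / (A + C)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(class1_metrics(A=50, B=5, C=3, D=42))  # hypothetical counts
```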
2.7 Bayesian Estimation
The Bayesian estimation approach is directly based on Bayes' theorem [48]. Assuming that some prior knowledge of the random variable to be estimated is available, we can incorporate this knowledge into our estimator. For this mechanism, we have to assume the prior probability density function (PDF) of the random variable s. The resulting estimator is said to be optimal on the average, i.e., with respect to the assumed prior pdf of s. From noise-free realizations of the variable s, we may gather the statistics needed to estimate the pdf of s. This can be done by a histogram of the samples or by approximating the densities with parameterized models that are flexible enough to account for the variety of probability densities encountered in practice.
Consider the denoising problem of a scalar variable $x = s + n$, where the original variable s and the noise n are assumed to be statistically independent. To perform the denoising, it is essential to find an estimate $\hat{s}$, using the statistical properties of s and n, that is very close to s in some meaningful sense. While it is practically impossible to extract s exactly from the noisy variable, it is possible to find estimates which are better than the noisy sample x.
Let $p_s(s)$ and $p_n(n)$ denote the prior probability density functions of s and n, respectively. We can calculate the posterior pdf of s given x using the basic axioms of probability theory. In particular, using Bayes' rule we have

$$p(s | x) = \frac{p(x | s) \, p(s)}{p(x)}.$$

Since $x = s + n$, we have $p(x | s) = p_n(x - s)$. Thus, p(x) can be obtained as follows:

$$p(x) = \int_{-\infty}^{\infty} p_s(s) \, p_n(x - s) \, ds.$$

Hence, at least theoretically, we now have complete knowledge of $p(s | x)$. This posterior pdf (i.e., the pdf of s after the data have been observed) can help us find the value of s from the noisy variable x.
Let $MSE_B(\hat{s}) = E\left( (s - \hat{s})^2 \right)$ denote the Bayesian mean square error, where the expectation operator is defined with respect to the joint pdf $p(x, s)$. In what follows, we show how to obtain the $\hat{s}$ that minimizes the Bayesian MSE.
1. Noting that $p(x, s) = p(s | x) \, p(x)$, we have

$$MSE_B(\hat{s}) = \int \left[ \int (s - \hat{s})^2 \, p(s | x) \, ds \right] p(x) \, dx.$$

2. Since $p(x) \geq 0$ for all x, if the integral in brackets can be minimized for each x, then $MSE_B$ will be minimized. This can be achieved by setting the derivative of the integral in brackets (with respect to $\hat{s}$) to zero. Hence

$$\frac{\partial}{\partial \hat{s}} \int (s - \hat{s})^2 \, p(s | x) \, ds = \int \frac{\partial}{\partial \hat{s}} (s - \hat{s})^2 \, p(s | x) \, ds = -2 \int s \, p(s | x) \, ds + 2 \hat{s} \int p(s | x) \, ds = 0.$$

Then we have $\hat{s} = \int s \, p(s | x) \, ds$.
3. Substituting the value of $p(s | x)$, we get

$$\hat{s} = \frac{\int s \, p(x | s) \, p(s) \, ds}{p(x)} = \frac{\int s \, p_n(x - s) \, p_s(s) \, ds}{\int p_n(x - s) \, p_s(s) \, ds}.$$

Since the conditional pdf must integrate to 1, we get $\hat{s} = E(s | x)$. In other words, the optimal estimator in terms of minimizing the Bayesian MSE is the mean of the posterior pdf $p(s | x)$.
The above estimation approach will be used in chapter 5 to estimate the original dataset from its perturbed version. The associated mean square error serves as a measure of how well the data perturbation techniques conceal the original data.
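For concreteness, the following is a small numerical sketch of this posterior-mean estimator in Python/NumPy, evaluating the last expression on a discrete grid; the uniform prior, the Gaussian noise pdf and the grid limits in the usage lines are illustrative assumptions.

```python
import numpy as np

def bayes_estimate(x, prior_pdf, noise_pdf, s_grid):
    # s_hat = E[s|x] = integral of s p_n(x-s) p_s(s) ds divided by the
    # integral of p_n(x-s) p_s(s) ds, approximated by sums on s_grid
    # (the uniform grid spacing cancels in the ratio).
    w = noise_pdf(x - s_grid) * prior_pdf(s_grid)  # unnormalized posterior
    return float(np.sum(s_grid * w) / np.sum(w))

# Toy usage: uniform prior on [0, 1], Gaussian noise with sigma = 0.2.
sigma = 0.2
prior = lambda s: ((s >= 0) & (s <= 1)).astype(float)
noise = lambda n: np.exp(-n**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
grid = np.linspace(-0.5, 1.5, 2001)
print(bayes_estimate(1.3, prior, noise, grid))  # pulled back toward [0, 1]
```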
2.8 Conclusion
In this chapter, we presented some basic concepts related to privacy preserving data
mining. We also introduced some privacy parameters and performance measures used
throughout our thesis.
Chapter 3
Non-negative Matrix Factorization
for Data Perturbation
3.1 Introduction
In this chapter, we investigate the use of non-negative matrix factorization
(NMF) with sparseness constraints for data perturbation. In order to test the performance
of our proposed method, we conducted a series of experiments on some real world
datasets. In this chapter, we present the results obtained when applying our technique to
the original Wisconsin breast cancer and ionosphere databases downloaded from UCI
machine Learning Repository [64].
3.2 Experimental Results
As explained in chapter 2, given a non-negative $N \times R$ data matrix V, we can approximately factorize V into the product of two non-negative matrices W and H with sizes $N \times M$ and $M \times R$ respectively, that is, $V \approx WH$, where the reduced rank M of the factorization is generally chosen so that $(N + R)M < NR$, so the product WH can be regarded as a compressed form of the data V. Throughout our experiments, the rows of the (dataset) matrix V correspond to the dataset attributes and the columns correspond to the specific observations. To select the reduced rank M, we examine the accuracy of a KNN classifier on the reduced-rank dataset. We conducted our experiment on real world data downloaded from the UCI Machine Learning Repository [64]: the original Wisconsin Breast Cancer database. For this breast cancer database, we used 569 observations and 30 attributes (with positive values) to perform our experiment. For the classification task, 80% of the data was used for training and the other 20% was used for testing. Throughout our experiments, we set K = 30 for the KNN classifier. The corresponding classification parameters on the original dataset are shown in Table 3.1.
Accuracy   Class 1     Class 1   Class 1   Class 0     Class 0   Class 0
           Precision   Recall    F1        Precision   Recall    F1
92.11%     89.9%       96.9%     93.2%     95.6%       86.0%     90.5%

Table 3.1 Experimental results for KNN (K = 30)
Figure 3.1 shows the effect of the reduced rank M on the privacy parameters. From
Figure 3.1, it is clear that M=2 provides the best choice with respect to the privacy
parameters. So, we fixed M=2 throughout the rest of our experiments with this dataset.
Figure 3.1 Effect of the reduced rank M on the privacy parameters (privacy parameters ACC, VD, RP, RK, CP and CK, on a logarithmic scale, versus the reduced rank)
3.3 Adding Sparseness Constraint
Figure 3.2 shows the effect of the sparseness constraint on the privacy parameters for the Wisconsin Breast Cancer dataset.
Table 3.2 shows how the privacy parameters and accuracy vary with the sparseness constraint $S_h$ for some points which display a good trade-off between the privacy parameters and the utility measure.
$S_h$   RP      RK      CP      CK      VD       ACC
0       128.2   0.036   0.133   0.866   0.0341   92.11
0.15    124.4   0.034   0.266   0.733   0.0452   92.10
0.3     125.0   0.114   0.266   0.733   0.0551   92.98
0.65    128.1   0.005   0.6     0.6     0.4696   93.86

Table 3.2 Effect of the sparseness constraint on the privacy parameters and accuracy
From the results in Table 3.2, it is clear that $S_h = 0.65$ not only improves the values of the privacy parameters, but also improves the classification accuracy.
Figure 3.2 Effect of sparseness constraint on privacy parameters (Wisconsin Breast Cancer dataset; privacy parameters RP, VD, RK, CP and CK and the accuracy, on a logarithmic scale, versus the sparseness constraint)
Table 3.3 shows the effect of the threshold ε on the privacy parameters and accuracy for $S_h = 0.65$. From the table, it is clear that there is a trade-off between the privacy parameters and the accuracy.
ε       RP       RK       CP    CK    VD        ACC
0.001   128.62   0.0058   0.6   0.6   0.46997   93.86
0.005   130.31   0.0057   0.6   0.6   0.47249   93.86
0.01    133      0.0055   0.6   0.6   0.48265   93.86
0.02    141.21   0.005    0.6   0.6   0.50483   44.74

Table 3.3 The effect of the threshold ε on the privacy parameters and accuracy
3.4 Dealing with Negative Values
Throughout the rest of this section, we show how to use the above technique to perform
data perturbation for datasets with both positive and negative values. Two approaches
were used to deal with this situation. In the first approach, we take the absolute value of
all the attributes, perform the data perturbation using the NMF as described above, and
then restore the sign of the attributes from the original data set. In the second approach,
we bias the data with some constant so that all the attributes become positive. After
performing the data perturbation, the value of this constant is subtracted from the
perturbed data.
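A minimal sketch of the two approaches in Python/NumPy is shown below; nmf_perturb stands for any routine, such as the truncated sparse NMF above, that maps a non-negative matrix to its perturbed reconstruction, and the particular choice of bias constant is an assumption.

```python
import numpy as np

def perturb_abs(V, nmf_perturb):
    # First approach: perturb |V|, then restore the original signs.
    return np.sign(V) * nmf_perturb(np.abs(V))

def perturb_bias(V, nmf_perturb, c=None):
    # Second approach: bias the data so all attributes become positive,
    # perturb, then subtract the same constant from the perturbed data.
    if c is None:
        c = -V.min() + 1.0  # any constant making the data positive
    return nmf_perturb(V + c) - c
```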
To test the above two approaches, we used the Ionosphere database (351 observations and 35 attributes in the range of -1 to +1). The first 200 instances were used as training data and the other 151 were used as test data. We set K = 13 for the KNN classifier. The corresponding classification accuracy on the original dataset is 93.38%.
When using the first approach, the best classification result (93.37%) was obtained on the NMF data with reduced rank M = 16. Table 3.4 shows the corresponding privacy parameters.
M    CK     CP     RK      RP      VD     Acc
16   0.67   0.41   0.017   35.63   0.35   0.9337

Table 3.4 Privacy parameters for the Ionosphere dataset using the absolute value approach with M = 16
Figure 3.3 shows the effect of the sparseness constraint on the privacy parameters for the Ionosphere dataset.
Figure 3.3 Effect of sparseness constraint on privacy parameters (Ionosphere dataset using first approach; privacy parameters RP, VD, RK, CP and CK and the accuracy, on a logarithmic scale, versus the sparseness constraint)
When varying the sparseness constraint from 0 to 1, the best trade-off between the accuracy and the privacy parameters was obtained for $S_h = 0.08$. Table 3.5 shows the corresponding accuracy and privacy parameters.
$S_h$   CK     CP      RK      RP     VD      Acc
0.08    0.47   0.941   0.064   23.1   0.311   0.9337

Table 3.5 Privacy parameters for the Ionosphere dataset using the absolute value approach with M = 16 and $S_h = 0.08$
Table 3.6 shows the effect of the truncation threshold ε on the accuracy and privacy parameters.
ε       CK       CP      RK      RP      VD      Acc
0.01    0.441    0.94    0.065   23.1    0.310   0.9338
0.027   0.470    1.00    0.062   23.71   0.305   0.9007
0.037   0.411    1.00    0.056   29.18   0.304   0.8543
0.05    0.205    1.82    0.049   35.59   0.421   0.7748
0.08    0.117    10.11   0.039   77.38   0.930   0.8741
0.09    0.0588   12.00   0.036   91.18   0.960   0.8344

Table 3.6 Privacy parameters for the Ionosphere dataset using the absolute value approach with truncation
Figure 3.4 shows the effect of the truncation threshold on the privacy parameters for the Ionosphere dataset using the first approach.
Figure 3.4 Effect of truncation threshold on privacy parameters (Ionosphere dataset using first approach; privacy parameters RP, VD, RK, CP and CK and the accuracy, on a logarithmic scale, versus the truncation threshold)
Table 3.7, Figure 3.5 and Figure 3.6 show the corresponding results when the second approach was used to deal with the negative data values. In this case, the optimum trade-off between the privacy parameters and the classification accuracy was obtained for $S_h = 0.5$.
M       CK      CP      RK      RP       VD      Acc
16      0.67    0.41    0.017   35.63    0.35    0.9337

$S_h$   CK      CP      RK      RP       VD      Acc
0.5     0.647   0.470   0.012   38.85    0.376   0.9337

ε       CK      CP      RK      RP       VD      Acc
0.017   0.147   4.470   0.037   78.55    1.41    0.9470
0.022   0.147   4.588   0.037   78.494   1.42    0.9536
0.026   0.117   4.411   0.038   78.449   1.43    0.9602
0.031   0.147   4.411   0.038   78.583   1.48    0.9139
0.036   0.176   5.117   0.038   78.04    1.53    0.8675
0.04    0.058   5.470   0.038   77.426   1.56    0.8410

Table 3.7 Trade-off between the privacy parameters and accuracy for the Ionosphere dataset using the biasing approach
Figure 3.5 Effect of sparseness constraint on privacy parameters (Ionosphere dataset using second approach; privacy parameters RP, VD, RK, CP and CK and the accuracy, on a logarithmic scale, versus the sparseness constraint)
Figure 3.6 Effect of truncation threshold on privacy parameters (Ionosphere dataset using second approach; privacy parameters RP, VD, RK, CP and CK and the accuracy, on a logarithmic scale, versus the truncation threshold)
3.5 Conclusions
Non-negative matrix factorization with sparseness constraints provides an
effective data perturbation tool for privacy preserving data mining.
It should be noted that more work needs to be done in order to systematically determine
the range of NMF parameters that optimize both the utility function and privacy
parameters.
Chapter 4
Data Distortion using Discrete
Wavelet Transform
4.1 Introduction
The primary focus of this chapter is to explore the use of the discrete wavelet transform
(DWT) with truncation for data perturbation. Our preliminary experimental results show that
the proposed method is effective in concealing the sensitive information while preserving
the performance of data mining techniques after the data distortion. The proposed data
distortion approach (see Figure 4.1) can be summarized as follows, with a sketch of one iteration given after the list:
1. Perform the 2D Haar wavelet transform on the original dataset.
2. Truncate the detail coefficients that are below a pre-specified threshold value.
3. Perform the inverse transform to obtain the perturbed dataset.
4. Iterate the above steps until satisfactory privacy parameters and utility measures are
obtained.
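A minimal sketch of one iteration of steps 1-3, assuming the PyWavelets package as the implementation of the 2D Haar transform (the threshold eps and the choice of which detail coefficients to truncate are the tunable parameters):

import numpy as np
import pywt

def dwt_perturb(X, eps, coeffs_to_truncate=('cH', 'cD')):
    """One iteration: 2D Haar DWT, truncate the selected detail
    coefficients that are below the threshold eps, inverse DWT."""
    cA, (cH, cV, cD) = pywt.dwt2(X, 'haar')
    details = {'cH': cH, 'cV': cV, 'cD': cD}
    for name in coeffs_to_truncate:
        d = details[name]
        details[name] = np.where(np.abs(d) < eps, 0.0, d)
    coeffs = (cA, (details['cH'], details['cV'], details['cD']))
    return pywt.idwt2(coeffs, 'haar')

In the experiments below, the selected detail coefficient matrices are zeroed entirely (e.g., cH=0), which corresponds to choosing eps larger than every coefficient magnitude.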
[Block diagram: Original Dataset → Haar DWT → Detail Coefficient Truncation → Haar IDWT → Perturbed Dataset]
Figure 4.1 Proposed data distortion technique
4.2 Experimental results
In order to test the performance of our proposed method, we conducted a series of
experiments on some real world datasets. In this section, we present a sample of the
results obtained when applying our technique to the original Wisconsin Breast Cancer
and Ionosphere databases downloaded from the UCI Machine Learning Repository [64].
For the Breast Cancer database, we used 569 observations and 30 attributes to perform
our experiment. For the classification task, 80% of the data was used for training and the
other 20% was used for testing. Throughout our experiments, we set K=30 for the KNN
classifier. The corresponding classification accuracy on the original dataset is 92.11%.
Table 4.1 shows how the accuracy and privacy parameters vary when the truncation
process is applied to one of the detail coefficients (cV, cH or cD) in the Haar wavelet
transform and the dataset is then reconstructed by applying the inverse transform using
these truncated coefficients.
Truncated coefficients   RP    RK     CP    CK   VD    ACC%
cV                       60.4  0.013  1.73  0.2  0.60  96.49
cH                       91.9  0.009  0     1    0.16  98.25
cD                       60.6  0.014  0     1    0.14  98.25

Table 4.1 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Wisconsin Breast Cancer dataset)
Figure 4.2 shows the effect of truncating the detail coefficients on the privacy and accuracy
parameters.
[Plot: accuracy, RP, VD, RK, CP, and CK (log scale) for each truncated detail coefficient (cV, cH, cD)]
Figure 4.2 Influence of truncating the detail coefficients on the privacy and accuracy
parameters (Wisconsin Breast Cancer dataset)
It should be noted that the classification accuracy in all three cases is higher than
the corresponding accuracy of the original dataset. On the other hand, for this particular
dataset, truncating cH and cD resulted in CP=0, i.e., the rank of the average value of each
attribute did not change in the distorted dataset, which is a somewhat undesirable feature.
Table 4.2 shows the corresponding results when a pair of the detail coefficients (i.e., cH
and cV; cH and cD; or cV and cD) in the Haar wavelet transform are truncated to zero
and the dataset is then reconstructed by applying the inverse transform to these truncated
coefficients.
Again, a high level of classification accuracy is maintained in all three cases. On the other
hand, it is clear that truncating both cH and cV (denoted as cHV in the table) results in
better privacy parameters. This conclusion should be interpreted with care since the above
results may change when the same technique is applied to a different dataset. Figure 4.3
shows the effect of truncating pairs of detail coefficients on the privacy and accuracy
parameters (breast cancer dataset).
[Plot: accuracy, RP, VD, RK, CP, and CK (log scale) for each truncated pair of detail coefficients (cHV, cHD, cVD)]
Figure 4.3 Influence of truncating pairs of detail coefficients on the privacy and accuracy
parameters (breast cancer dataset)
Truncated coefficients   RP     RK     CP   CK    VD    ACC
cHV                      77.51  0.017  1.8  0.26  0.63  98.25%
cHD                      74.7   0.007  0    1     0.23  98.25%
cVD                      50.22  0.06   1.7  0.26  0.63  92.11%

Table 4.2 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (breast cancer dataset)
Tables 4.3, 4.4 and Figures 4.4, 4.5 show the corresponding results when we run our
algorithm on the Ionosphere database from the UCI repository [64]. This dataset has 351
instances and 35 attributes including the class attribute. We used 200 instances as training data
and the other 151 as test data. We also set K=13 (for which the classification accuracy
of the original dataset was 93.38%). The results obtained clearly show some tradeoff
between the accuracy and the privacy parameters.
Detail coefficient   RP    RK     CP    CK    VD    ACC%
cV                   50.3  0.019  8.76  0.03  0.56  90.07
cH                   33.8  0.016  0     1     0.33  92.72
cD                   34.7  0.014  0     1     0.35  91.39

Table 4.3 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Ionosphere dataset)
Coefficients   RP     RK     CP   CK    VD     ACC
cHV            61.33  0.039  8.9  0.05  0.64   87.42%
cHD            47.26  0.039  0    1     0.242  93.38%
cVD            61.5   0.03   9.1  0.03  0.65   91.39%

Table 4.4 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Ionosphere dataset)
[Plot: accuracy, RP, VD, RK, CP, and CK (log scale) for each truncated detail coefficient (cV, cH, cD)]
Figure 4.4 Influence of truncating the detail coefficients on the privacy and accuracy
parameters (Ionosphere dataset)
[Plot: accuracy, RP, VD, RK, CP, and CK (log scale) for each truncated pair of detail coefficients (cHV, cHD, cVD)]
Figure 4.5 Influence of truncating pairs of detail coefficients on the privacy and accuracy
parameters (Ionosphere dataset)
4.3 Conclusion
In this chapter, we have presented a new algorithm for privacy preserving data mining
based on the DWT with truncated coefficients. Based on the presented experimental results,
the proposed method is effective in concealing the sensitive information while preserving
the performance of data mining techniques after the data distortion.
Chapter 5
Bayesian Estimation of Original
Data
5.1 Introduction
The main objective of data perturbation techniques is to distort the individual
data values while preserving the properties of the underlying statistical distribution. In
this chapter we use the Bayesian estimation algorithm that was explained in chapter 2 to
estimate the original dataset from its perturbed version. The mean square error between
the original data and the estimated dataset can serve as an added measure for the
effectiveness of the data perturbation technique in hiding the original data.
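As a rough illustration of the estimation idea (a sketch only; the actual algorithm is the one described in chapter 2), the posterior-mean estimate of an original value from its perturbed version, assuming additive noise with a known density and a discretized prior over the original values, can be computed as follows:

import numpy as np

def bayes_estimate(z, x_grid, prior, noise_pdf):
    """Posterior-mean estimate E[X | Z = z] for Z = X + N, given a
    discretized prior over X and the noise density (illustrative
    sketch, not the exact chapter 2 algorithm)."""
    likelihood = noise_pdf(z - x_grid)   # p(z | x) on the grid
    posterior = likelihood * prior
    posterior /= posterior.sum()         # normalize p(x | z)
    return np.sum(x_grid * posterior)    # E[X | Z = z]

For zero-mean Gaussian noise with variance s2, for example, noise_pdf would be n -> exp(-n**2 / (2 * s2)) / sqrt(2 * pi * s2).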
5.2 Experimental results
In this section, we present the results of our experiments when applying the
Bayesian estimation algorithm on datasets that were artificially generated in such a way
that the values of their attributes follow the triangular distribution and the uniform
distribution shown in Figure 5.1 and Figure 5.2 respectively. In these figures (as well as
the rest of the figures in this chapter), the x-axis shows the signal value and the y-axis
shows its corresponding frequency.
Figure 5.1 Triangular signal distribution
Figure 5.2 Uniform signal distribution
Figure 5.3 Distribution of Gaussian noise in the Triangular signal
Figure 5.4 Distribution of equivalent noise in the NMF-perturbed Triangular signal
Figure 5.5 Distribution of equivalent noise in the DWT-perturbed Triangular signal
Figure 5.6 Distribution of Gaussian noise in the Uniform signal
Figure 5.7 Distribution of equivalent noise in the NMF-perturbed Uniform signal
Figure 5.8 Distribution of equivalent noise in the DWT-perturbed Uniform signal
The data perturbation was performed in three ways: (i) by adding zero-mean
Gaussian noise, (ii) by using the NMF approach described in chapter 3 (reduced rank
M=35 and sparseness constraint Sh = 0.9), and (iii) by using the DWT approach
described in chapter 4 (horizontal detail coefficient matrix cH=0 and diagonal detail
coefficient matrix cD=0).
When applying the above data perturbation methods, the parameters of the data
perturbation algorithms were adjusted in such a way that the added noise variance is fixed
in all three methods. Let O denote the original N × R data matrix and D denote the
corresponding N × R distorted data matrix. Then the mean square error between the
original signal and the perturbed one is given by

$$\mathrm{MSE}_{in} = \frac{1}{N \times R} \sum_{i,j} \big( O(i,j) - D(i,j) \big)^2$$

Similarly, let E denote the estimated signal from the distorted signal. Then the mean
square error between the original data and the estimated one is given by

$$\mathrm{MSE}_{out} = \frac{1}{N \times R} \sum_{i,j} \big( O(i,j) - E(i,j) \big)^2$$
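A direct transcription of these two measures, shown here with the zero-mean Gaussian noise baseline (the data shape and noise standard deviation are illustrative only):

import numpy as np

def mse(A, B):
    """Mean square error between two N x R matrices."""
    return np.mean((A - B) ** 2)

rng = np.random.default_rng(0)
O = rng.uniform(-100, 100, size=(1000, 35))  # illustrative original data
D = O + rng.normal(0.0, 29.0, size=O.shape)  # zero-mean Gaussian perturbation
mse_in = mse(O, D)    # MSE between the original and perturbed data
# mse_out = mse(O, E) # E is the Bayesian estimate recovered from D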
Table 5.1 shows the privacy parameters, MSEin and MSEout obtained for a triangular
data set.
Distortion Method   VD     RP     RK     CP     CK     MSEin   MSEout
Gaussian Noise      0.699  26.81  0.005  27.94  0.005  834.6   715.4
NMF                 0.55   25.73  0.55   24.26  0.015  832.37  982.45
DWT                 0.70   35.5   0.012  24.26  30.45  832.53  949.53

Table 5.1 Estimation results for Triangular signal distribution
Table 5.2 shows the corresponding results when the original dataset has a
uniform distribution.
Distortion Method   VD    RP     RK     CP     CK     MSEin   MSEout
Gaussian Noise      0.49  20.3   0.018  22.22  0.015  826.2   606.02
NMF                 0.5   26.27  0.014  23.57  0.025  827.2   946.4
DWT                 0.5   23.1   0.015  36.25  0.025  826.98  1013

Table 5.2 Estimation results for Uniform signal distribution
5.3 Conclusions
In this chapter, we applied Bayesian estimation to estimate the original dataset
from the perturbed one. When using the mean square error between the estimated dataset
and the original dataset as a measure for the effectiveness of the data perturbation technique,
our experimental results show that perturbing the data using the NMF and DWT
approaches described earlier outperforms the traditional method of data perturbation by
AWGN.
Chapter 6
Conclusions and Future Work
6.1 Conclusions
Privacy issues have created innovative challenges in the area of data mining
technology. These technical issues should not simply be addressed by restricting data
collection or even by restricting the secondary use of information technology. In this
thesis, we addressed two data perturbation approaches that can be used to conceal
sensitive information while preserving the general patterns and trends from the original
database. Using a set of privacy parameters previously proposed in the literature, our
experimental results show that, by the proper choice of the algorithm parameters, both the
sparsified NMF and DWT with truncation techniques are effective data perturbation tools
that preserve the privacy as well as maintain the utility measure.
The mean square error of the Bayesian estimation technique, which we used to
estimate the original signal from the distorted one, also adds to the list of privacy
parameters that can be used to test the effectiveness of any newly proposed data
perturbation tools.
To sum up, we believe that different tools from well established fields such as
signal processing and communications theory can be used to open a new era in
privacy preserving data mining and hence present a flourishing trend in privacy and
security research.
6.2 Future Works
Throughout our analysis, we used a simple KNN classifier. We believe that
better accuracy can be obtained by using better classifiers such as support vector
machines (SVMs). The use of a more accurate classifier may allow a better trade off
between the accuracy and the privacy parameters.
Proposing a good set of privacy parameters for privacy preserving data
mining is still an area in its infancy. The privacy concept is too complex to capture
with the current set of parameters used throughout this thesis. In fact, one drawback of
the parameters used in this thesis is that they do not reflect the effort required to break the
privacy preserving algorithm. There is almost no literature related to this topic. Although
our work described one additional privacy parameter (the MSE of the estimation algorithm),
we believe that more work needs to be done in this area, especially on how to
quantitatively relate these parameters to the actual work required to break these data
perturbation techniques.
Other promising data perturbation techniques, such as adding data dependent
noise, need to be explored. Using a combination of rule hiding and randomization also
presents an interesting direction of research. These techniques can play a vital role in
privacy preserving data mining. Sanitization is a challenging problem and it is sometimes
restrictive. One can combine sanitization and randomization under the same framework
to reduce the side effects of the sanitization process. Randomization, on the other hand, does
not remove items from a dataset, but it in general introduces false drops to the data, i.e.,
some patterns that do not exist in the original database.
References
[1] M. Chen, J. Han, and P. Yu, "Data Mining: An Overview from a Database
Perspective", IEEE Trans. Knowledge and Data Engineering, 8, 1996.
[2] Doug Struck. Don't store my data, Japanese tell government. International Herald
Tribune, page 1, August 24-25 2002.
[3] M. Feingold, M. Corzine, M. Wyden, and M. Nelson, “Data-mining moratorium act
of 2003.” U.S. Senate Bill (proposed), January 16 2003.
[4] Krishnamurty Muralidhar, “Security of Random Data Perturbation Methods,” ACM
Transactions on Database Systems, Vol.24, No.4, December 1999.
[5] L. Cranor, M. Langheinrich, M. Marchiori, M. Presler-Marshall, and J. Reagle. “The
platform for privacy preferences 1.0 specification,” In W3C Recommendation, April
2002.
[6] Yuichi Koike. “References for p3p implementations,” available through
http://www.w3.org/P3P/implementations, 2004.
[7] Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu.
“Implementing p3p using database technology.” In 19th International Conference on
Data Engineering, Bangalore, India, March 2003.
[8] Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu. “Hippocratic
databases,” In 28th International Conference on Very Large Data Bases,Hong Kong,
China, August 2002.
[9] A. C. Yao. “How to generate and exchange secrets.” In Proceedings 27th IEEE
Symposium on Foundations of Computer Science, pages 162–167, 1986.
[10] O. Goldreich, S. Micali, and A. Wigderson. “How to play any mental game”. In
Proceedings of the 19th annual ACM symposium on Theory of Computing, pages
218–229, 1987.
[11] Patrik O. Hoyer. “Non-negative Matrix Factorization with Sparseness Constraints,”
Journal of Machine Learning Research 5 (2004) 1457–1469
[12] Oded Goldreich. “Secure Multi-Party Computation (Working Draft).” Department of
Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot,
Israel, June 1998.
[13] G. Brassard, C. Crepeau and J. Robert. “All-or-nothing disclosure of secrets.” In
Advances in Cryptology - Crypto86, LNCS, volume :263, Springer verlag, 1987 pp. 234-
238
[14] K. Sako and M. Hirt, “Efficient receipt-free voting based on homomorphic
encryption,” in Proceedings of Advances in Cryptology (EUROCRYPT2000), Bruges,
Belgium, May 2000, pp. 539–556.
[15] J. C. Benaloh and M. De Mare. “One-way accumulators: A decentralized alter-
native to digital signatures.” Advances in Cryptology – EUROCRYPT’93. Workshop on
the Theory and Application of Cryptographic Techniques. Lecture Notes in Computer
Science., 765:274–285, May 1993.
[16] O. Goldreich, “Secure Multi-Party Computation (Working Draft),” Department of
Computer Science and Applied Mathematics, Weizmann Institute of
Science, Rehovot, Israel, June 1998.
[17] B. Pinkas, “Cryptographic techniques for privacy preserving data mining,” SIGKDD
Explorations, vol. 4, no. 2, pp. 12–19, 2002.
[Online]. Available:http://portal.acm.org/citation.cfm?id=772865
[18] W. Du and M. J. Atallah, “Secure multi-party computation problems and their
applications: A review and open problems,” in Proceedings of the 2001 Workshop on
New Security Paradigms. Cloudcroft, NM: ACM Press, September 2001, pp. 13–22.
[19] M. J. Atallah, W. Du and F. Kerschbaum, “Protocols for secure remote database
access with approximate matching,” in 7th ACM Conference on Computer and
Communications Security(ACMCCS 2000). The first workshop on Security of Privacy in
E-Commerce, Athens, Greece, November 2000.
[20] M. J. Atallah and W. Du, “Secure multi-party computational geometry,” in
WADS2001: Seventh International Workshop on Algorithms and Data Structures,
Providence, Rhode Island, August 2001, pp. 165–179.
[21] W. Du, Y. S. Han, and S. Chen, “Privacy-preserving multivariate statistical analysis:
Linear regression and classification,” in Proceedingsof 2004 SIAM International
Conference on Data Mining (SDM04), Lake Buena Vista, FL, April 2004.
[Online]. Available: http://www.cis.syr.edu/wedu/Research/paper/sdm2004 privacy.pdf
[22] Pentti Paatero and Unto Tapper. “Positive matrix factorization: A non-negative
factor model with optimal utilization of error.” Environmetrics, 5:111-126, 1994.
[23] W. Du and M. J. Atallah, “Privacy preserving cooperative scientific computations,”
in Proceedings of the 14th IEEE Computer Security Foundations Workshop, Nova
Scotia, Canada, June 2001, pp. 273–282
[24] W. Du and M. J. Atallah. “Secure multi-party computation problems and their
applications: A review and open problems.” In New Security Paradigms Workshop,
pages 11–20, Cloudcroft, New Mexico, USA, September 11-13 2001.
[25] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in Advances in
Cryptology (CRYPTO’00), ser. Lecture Notes in Computer Science, vol.1880. Springer-
Verlag, 2000, pp. 36–53.
[26] M. Kantarcoglu and J. Vaidya, “Privacy preserving naive bayes classifier for
horizontally partitioned data,” in IEEE ICDM Workshop on Privacy Preserving
Data Mining, Melbourne, FL, November 2003, pp. 3–9.
[27] R. Wright and Z. Yang, “Privacy-preserving bayesian network structure computation
on distributed heterogeneous data,” in Proceedings of the Tenth
ACM SIGKDD Conference (SIGKDD’04), Seattle, WA, August 2004.
[Online] Available: http://www.cs.stevens.edu/rwright/Publications/
[28] J. Vaidya and C. Clifton, “Privacy-preserving k-means clustering over vertically
partitioned data,” in The Ninth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, Washington, D.C., August 2003.
[29] Tore Dalenius and Steven P. Reiss. “Data-swapping: A technique for disclosure
control.” Journal of Statistical Planning and Inference, 6:73–85, 1982.
[30] Stephen E. Fienberg and Julie McIntyre. “Data swapping: Variations on a theme by
dalenius and reiss.” Technical report, National Institute of Statistical Sciences, Research
Triangle Park, NC, 2003.
[31] Nabil R. Adam and John C.Worthmann. “Security-control methods for statistical
databases: a comparative study.” ACM Comput. Surv., 21(4):515–556, 1989.
[32] N. S. Matloff. “Inference control via query restriction vs. data modification: a
perspective.” In on Database Security: Status and Prospects, pages 159–166. North-
Holland Publishing Co., 1988.
[33] J. Domingo-Ferrer and Torra. “Statistical data protection in statistical micro data
protection via advanced record linkage.” Statistics and Computing, 13(4):343–354, 2003.
[34] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. “Collective Data Mining:
A New Perspective Towards Distributed Data Mining.” In Hillol Kargupta and Philip
Chan, editors, Advances in Distributed and Parallel Knowledge Discovery, pages 133–
184. MIT/AAAI Press, 2000.
[35] B. Park, H. Kargupta, E. Johnson, E. Sanseverino, D. Hershberger, and L. Silvestre.
“Distributed, Collaborative Data Analysis from Heterogeneous Sites Using a Scalable
Evolutionary Technique.” Journal of Applied Intelligence, 16(1):19–42,
2002.
[36] H. Kargupta, H. Park, S. Pittie, L. Liu, D. Kushraj, and K. Sarkar. “MobiMine:
Monitoring the stock market from a PDA.” ACM SIGKDD Explorations, 3:37–47, 2001.
[37] B. Park, R. Ayyagari, and H. Kargupta. “A fourier analysis-based approach to learn
classifier from distributed heterogeneous data.” In Proceedings of the First SIAM
International Conference on Data Mining, Chicago, US, 2001.
[38] E. Johnson and H. Kargupta. “Collective, Hierarchical Clustering From Distributed,
Heterogeneous Data.” In M. Zaki and C. Ho, editors, Large-Scale Parallel KDD Systems.
Lecture Notes in Computer Science, volume 1759, pages 221–244. Springer-Verlag,
1999.
[39] J. Ross Quinlan. “Induction of decision trees.” In Machine Learning, pages 1(1): 81–
106, 1986.
[40] W. Du and M. J. Atallah. “Secure multi-party computation problems and their
applications: A review and open problems.” In New Security Paradigms Workshop,
pages 11 – 20, 2001.
[41] M. Kantarcioglu and C. Clifton. “Privacy-preserving distributed mining of
association rules on horizontally partitioned data.” In SIGMOD Workshop on DMKD,
Madison, WI, June 2002.
[42] J. Vaidya and C. Clifton. “Privacy Preserving Association Rule Mining in Vertically
Partitioned Data.” In The Eighth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, Edmonton, Canada, July 2002.
[43] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim, and V. S. Verykios,
“Disclosure limitation of sensitive rules,” in Proceedings of the IEEE Knowledge and
Data Engineering Workshop, 1999, pp. 45–52.
[44] S. Oliveira and O. R. Zaiane, “Privacy preserving frequent itemset mining,” in
Proceedings of the IEEE International Conference on Privacy, Security and Data
Mining. Maebashi City, Japan: Australian Computer Society, Inc., 2002, pp. 43–54.
[Online]. Available:http://portal.acm.org/citation.cfm?id=850789
[45] V. S. Verykios, A. K. Elmagarmid, B. Elisa, Y. Saygin, and D. Elena, “Association
rule hiding,” in IEEE Transactions on Knowledge and Data Engineering, 2003.
[46] Y. Saygin, V. S. Verykios, and C. Clifton, “Using unknowns to prevent discovery of
association rules,” SIGMOD Record, vol. 30, no. 4, pp. 45–54, December 2001.
[Online]. Available: http://portal.acm.org/citation.cfm?id=604271
[47] C. K. Liew, U. J. Choi, and C. J. Liew, “A data distortion by probability
distribution,” ACM Transactions on Database Systems (TODS), vol. 10, no. 3, pp. 395–
411, 1985.
[Online]. Available: http://portal.acm.org/citation.cfm?id=4017
[48] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification (2nd ed.), John Wiley
and Sons, 2001
[49] S. Warner, “Randomized response: A survey technique for eliminating evasive
answer bias,” Journal of the American Statistical Association, vol. 60, pp. 63–69, 1965.
[50] Rakesh Agrawal and Ramakrishnan Srikant. “Privacy-preserving data mining,” In
Proceeding of the ACM SIGMOD Conference on Management of Data, pp. 439–450,
Dallas, Texas, May 2000. ACM Press.
[51] D. Agrawal and C. C. Aggarwal, “On the design and quantification of privacy
preserving data mining algorithms,” in Proceedings of the twentieth ACM SIGMOD-
SIGACT-SIGART symposium on Principles of Database Systems, Santa Barbara, CA,
2001, pp. 247–255.
[Online]. Available: http://portal.acm.org/citation.cfm?id=375602
[52] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, “Privacy preserving mining
of association rules,” in Proceedings of 8th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD’02), July 2002.
[53] A. Evfimevski, J. Gehrke, and R. Srikant, “Limiting privacy breaches in privacy
preserving data mining,” in Proceedings of the ACM SIGMOD/PODS Conference, San
Diego, CA, June 2003.
[54] Chong K. Liew, Uinam J. Choi, and Chung J. Liew. “A data distortion by
probability distribution.” ACM Trans. Database Syst., 10(3):395–411, 1985.
[55] J. J. Kim and W. E. Winkler, “Multiplicative noise for masking continuous data,”
Statistical Research Division, U.S. Bureau of the Census, Washington D.C., Tech. Rep.
Statistics #2003-01, April 2003.
[56] K. Muralidhar, D. Batrah, and P. J. Kirs, “Accessibility, security, and accuracy in
statistical databases: The case for the multiplicative fixed data perturbation approach,”
Management Science, vol. 41, no. 9, pp. 1549–1584, 1995.
[57] K. Liu, H. Kargupta, and J. Ryan. “Random projection and privacy preserving
correlation computation from distributed data.” Technical report, University of Maryland
Baltimore County, Computer Science and Electrical Engineering Department, Technical
Report TR-CS-03-24, 2003.
[58] Daniel D. Lee and H. Sebastian Seung. “Algorithms for non-negative matrix
factorization.” In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp, editors,
Advances in Neural Information Processing Systems 13, pages 556–562. MIT
Press, 2001.
[59] C.-J. Lin. Projected gradient methods for non-negative matrix factorization.
Technical report, Department of Computer Science, National Taiwan University,
www.csie.ntu.edu.tw/~cjlin (2005)
[60] Amara Graps, “An introduction to Wavelets”, IEEE Computational Science and
Engineering, Summer 1995, vol 2, num 2. IEEE Computer society.
[61] T.Li, Q. Li, S. Zhu, M Ogihara. “A Survey on wavelet Application in Data Mining.”
SIGKDD Explorations, Vol 4,issue 2,pages-49-68, 2002.
[62] Shuting Xu, Jun Zhang, Dianwei Han, and Jie Wang, “Data distortion for privacy
protection in a terrorist Analysis system.” P. Kantor et al(Eds.):ISI 2005,LNCS 3495,
pp.459-464,2005
[63] J. F. Pang, D. B. Bo and S.Bai, “Research and Implementation of Text
Categorization System Based on VSM,” International Conf. on Multilingual Information
Processing [C], pp. 31-36, .2000
[64] UCI Machine Learning Repository.
http://www.ics.uci.edu/mlearn/mlsummary.html.
[65] Saif M. A. Kabir, Amr M. Youssef and Ahmed K. Elhakeem, "On data distortion
for privacy preserving data mining," in Proc. of the 20th Canadian Conference on Electrical
and Computer Engineering (CCECE 2007), Vancouver, BC, April 24-26, 2007.
[66] Saif M. A. Kabir, Amr M. Youssef and Ahmed K. Elhakeem, "Data distortion by
Non-negative Matrix Factorization for preserving privacy," in Proc. of the 3rd
International Conference on Computational Intelligence (CI 2007), Banff, Alberta,
Canada, July 2-4, 2007.

More Related Content

Viewers also liked

Hikmah Perseteruan BAZNAZ dengan LAZ
Hikmah Perseteruan BAZNAZ dengan LAZHikmah Perseteruan BAZNAZ dengan LAZ
Hikmah Perseteruan BAZNAZ dengan LAZAnas Ferdian
 
Fall 2015-Sunset Cove Update
Fall 2015-Sunset Cove UpdateFall 2015-Sunset Cove Update
Fall 2015-Sunset Cove Updateecowatchers
 
IPFW Department of Nursing
IPFW Department of NursingIPFW Department of Nursing
IPFW Department of NursingMelinda Conley
 
Weppler jamaica bay task force 29 oct15_sand
Weppler jamaica bay task force 29 oct15_sandWeppler jamaica bay task force 29 oct15_sand
Weppler jamaica bay task force 29 oct15_sandecowatchers
 
Поиск списков в неструктурированных данных
Поиск списков в неструктурированных данныхПоиск списков в неструктурированных данных
Поиск списков в неструктурированных данныхOleksii Holubovych
 
O lado humano da saúde
O lado humano da saúdeO lado humano da saúde
O lado humano da saúdeHelena Martins
 
Realization of ofdm based underwater acoustic communication
Realization of ofdm based underwater acoustic communicationRealization of ofdm based underwater acoustic communication
Realization of ofdm based underwater acoustic communicationeSAT Journals
 
Film poster research
Film poster researchFilm poster research
Film poster researchnieevequinn
 
10 More than a Pretty Picture: Visual Thinking in Network Studies
10 More than a Pretty Picture: Visual Thinking in Network Studies10 More than a Pretty Picture: Visual Thinking in Network Studies
10 More than a Pretty Picture: Visual Thinking in Network Studiesdnac
 
Michael Levy. Unarmed - 5
Michael Levy. Unarmed - 5Michael Levy. Unarmed - 5
Michael Levy. Unarmed - 5michael levy
 
Translation , Transcription and Transduction
Translation , Transcription and TransductionTranslation , Transcription and Transduction
Translation , Transcription and TransductionMicrobiology
 

Viewers also liked (15)

Hikmah Perseteruan BAZNAZ dengan LAZ
Hikmah Perseteruan BAZNAZ dengan LAZHikmah Perseteruan BAZNAZ dengan LAZ
Hikmah Perseteruan BAZNAZ dengan LAZ
 
O mestre da ciencia
O mestre da cienciaO mestre da ciencia
O mestre da ciencia
 
Fall 2015-Sunset Cove Update
Fall 2015-Sunset Cove UpdateFall 2015-Sunset Cove Update
Fall 2015-Sunset Cove Update
 
IPFW Department of Nursing
IPFW Department of NursingIPFW Department of Nursing
IPFW Department of Nursing
 
Weppler jamaica bay task force 29 oct15_sand
Weppler jamaica bay task force 29 oct15_sandWeppler jamaica bay task force 29 oct15_sand
Weppler jamaica bay task force 29 oct15_sand
 
Unidad 27
Unidad   27Unidad   27
Unidad 27
 
Поиск списков в неструктурированных данных
Поиск списков в неструктурированных данныхПоиск списков в неструктурированных данных
Поиск списков в неструктурированных данных
 
O lado humano da saúde
O lado humano da saúdeO lado humano da saúde
O lado humano da saúde
 
Realization of ofdm based underwater acoustic communication
Realization of ofdm based underwater acoustic communicationRealization of ofdm based underwater acoustic communication
Realization of ofdm based underwater acoustic communication
 
Film poster research
Film poster researchFilm poster research
Film poster research
 
10 More than a Pretty Picture: Visual Thinking in Network Studies
10 More than a Pretty Picture: Visual Thinking in Network Studies10 More than a Pretty Picture: Visual Thinking in Network Studies
10 More than a Pretty Picture: Visual Thinking in Network Studies
 
Michael Levy. Unarmed - 5
Michael Levy. Unarmed - 5Michael Levy. Unarmed - 5
Michael Levy. Unarmed - 5
 
Pms
PmsPms
Pms
 
Translation , Transcription and Transduction
Translation , Transcription and TransductionTranslation , Transcription and Transduction
Translation , Transcription and Transduction
 
Personification
PersonificationPersonification
Personification
 

Similar to Data Perturbation Techniques for Privacy Preserving Data Mining

Agile Business Continuity Planning Using Business Process Modeling Notation
Agile Business Continuity Planning Using Business Process Modeling NotationAgile Business Continuity Planning Using Business Process Modeling Notation
Agile Business Continuity Planning Using Business Process Modeling NotationBrandi Gonzales
 
Deep Learning for Health Informatics
Deep Learning for Health InformaticsDeep Learning for Health Informatics
Deep Learning for Health InformaticsJason J Pulikkottil
 
Iris based Human Identification
Iris based Human IdentificationIris based Human Identification
Iris based Human Identificationdswazalwar
 
RELIABLE AND SECURE SCADA FRAMEWORK FOR RESIDENTIAL MICROG.docx
RELIABLE AND SECURE SCADA FRAMEWORK FOR RESIDENTIAL MICROG.docxRELIABLE AND SECURE SCADA FRAMEWORK FOR RESIDENTIAL MICROG.docx
RELIABLE AND SECURE SCADA FRAMEWORK FOR RESIDENTIAL MICROG.docxaudeleypearl
 
Final submission (2).pdf
Final submission (2).pdfFinal submission (2).pdf
Final submission (2).pdfhabtamu292245
 
barış_geçer_tez
barış_geçer_tezbarış_geçer_tez
barış_geçer_tezBaris Geçer
 
IT-Service-Catalog.pdf
IT-Service-Catalog.pdfIT-Service-Catalog.pdf
IT-Service-Catalog.pdfssuser53d67b
 
Pharma statistic 2018
Pharma statistic 2018Pharma statistic 2018
Pharma statistic 2018Majdi Ayoub
 
THE CRYPTO CLUSTERING FOR ENHANCEMENT OF DATA PRIVACY
THE CRYPTO CLUSTERING FOR ENHANCEMENT OF DATA PRIVACYTHE CRYPTO CLUSTERING FOR ENHANCEMENT OF DATA PRIVACY
THE CRYPTO CLUSTERING FOR ENHANCEMENT OF DATA PRIVACYIRJET Journal
 
Project report on An Energy Efficient Routing Protocol in Wireless Sensor Net...
Project report on An Energy Efficient Routing Protocol in Wireless Sensor Net...Project report on An Energy Efficient Routing Protocol in Wireless Sensor Net...
Project report on An Energy Efficient Routing Protocol in Wireless Sensor Net...divya_prabha
 
Comprehensive STATCOM Control For Distribution And Transmission System Applic...
Comprehensive STATCOM Control For Distribution And Transmission System Applic...Comprehensive STATCOM Control For Distribution And Transmission System Applic...
Comprehensive STATCOM Control For Distribution And Transmission System Applic...alimeziane3
 
BLOCKCHAIN HYPERLEDGER IN MEDICAL FIELD.pdf
BLOCKCHAIN HYPERLEDGER IN MEDICAL FIELD.pdfBLOCKCHAIN HYPERLEDGER IN MEDICAL FIELD.pdf
BLOCKCHAIN HYPERLEDGER IN MEDICAL FIELD.pdfASHMILA K P
 
Asistencia | Live Face Recognition | Python
Asistencia | Live Face Recognition | Python Asistencia | Live Face Recognition | Python
Asistencia | Live Face Recognition | Python Naomi Kulkarni
 
Application of linear transformation in computer
Application of linear transformation in computerApplication of linear transformation in computer
Application of linear transformation in computerFavour Chukwuedo
 
Singh_uta_2502M_10903.pdf
Singh_uta_2502M_10903.pdfSingh_uta_2502M_10903.pdf
Singh_uta_2502M_10903.pdfsehat maruli
 
Missing Data Problems in Machine Learning
Missing Data Problems in Machine LearningMissing Data Problems in Machine Learning
Missing Data Problems in Machine Learningbutest
 
Report on I-LEACH: An Energy Efficient Routing Protocol in Wireless Sensor Ne...
Report on I-LEACH: An Energy Efficient Routing Protocol in Wireless Sensor Ne...Report on I-LEACH: An Energy Efficient Routing Protocol in Wireless Sensor Ne...
Report on I-LEACH: An Energy Efficient Routing Protocol in Wireless Sensor Ne...divya_prabha
 

Similar to Data Perturbation Techniques for Privacy Preserving Data Mining (20)

Agile Business Continuity Planning Using Business Process Modeling Notation
Agile Business Continuity Planning Using Business Process Modeling NotationAgile Business Continuity Planning Using Business Process Modeling Notation
Agile Business Continuity Planning Using Business Process Modeling Notation
 
Deep Learning for Health Informatics
Deep Learning for Health InformaticsDeep Learning for Health Informatics
Deep Learning for Health Informatics
 
Iris based Human Identification
Iris based Human IdentificationIris based Human Identification
Iris based Human Identification
 
RELIABLE AND SECURE SCADA FRAMEWORK FOR RESIDENTIAL MICROG.docx
RELIABLE AND SECURE SCADA FRAMEWORK FOR RESIDENTIAL MICROG.docxRELIABLE AND SECURE SCADA FRAMEWORK FOR RESIDENTIAL MICROG.docx
RELIABLE AND SECURE SCADA FRAMEWORK FOR RESIDENTIAL MICROG.docx
 
COMPLETE 2
COMPLETE 2COMPLETE 2
COMPLETE 2
 
Final submission (2).pdf
Final submission (2).pdfFinal submission (2).pdf
Final submission (2).pdf
 
SI Thesis
SI ThesisSI Thesis
SI Thesis
 
barış_geçer_tez
barış_geçer_tezbarış_geçer_tez
barış_geçer_tez
 
IT-Service-Catalog.pdf
IT-Service-Catalog.pdfIT-Service-Catalog.pdf
IT-Service-Catalog.pdf
 
Pharma statistic 2018
Pharma statistic 2018Pharma statistic 2018
Pharma statistic 2018
 
THE CRYPTO CLUSTERING FOR ENHANCEMENT OF DATA PRIVACY
THE CRYPTO CLUSTERING FOR ENHANCEMENT OF DATA PRIVACYTHE CRYPTO CLUSTERING FOR ENHANCEMENT OF DATA PRIVACY
THE CRYPTO CLUSTERING FOR ENHANCEMENT OF DATA PRIVACY
 
Project report on An Energy Efficient Routing Protocol in Wireless Sensor Net...
Project report on An Energy Efficient Routing Protocol in Wireless Sensor Net...Project report on An Energy Efficient Routing Protocol in Wireless Sensor Net...
Project report on An Energy Efficient Routing Protocol in Wireless Sensor Net...
 
Comprehensive STATCOM Control For Distribution And Transmission System Applic...
Comprehensive STATCOM Control For Distribution And Transmission System Applic...Comprehensive STATCOM Control For Distribution And Transmission System Applic...
Comprehensive STATCOM Control For Distribution And Transmission System Applic...
 
BLOCKCHAIN HYPERLEDGER IN MEDICAL FIELD.pdf
BLOCKCHAIN HYPERLEDGER IN MEDICAL FIELD.pdfBLOCKCHAIN HYPERLEDGER IN MEDICAL FIELD.pdf
BLOCKCHAIN HYPERLEDGER IN MEDICAL FIELD.pdf
 
Novak_Final_Thesis
Novak_Final_ThesisNovak_Final_Thesis
Novak_Final_Thesis
 
Asistencia | Live Face Recognition | Python
Asistencia | Live Face Recognition | Python Asistencia | Live Face Recognition | Python
Asistencia | Live Face Recognition | Python
 
Application of linear transformation in computer
Application of linear transformation in computerApplication of linear transformation in computer
Application of linear transformation in computer
 
Singh_uta_2502M_10903.pdf
Singh_uta_2502M_10903.pdfSingh_uta_2502M_10903.pdf
Singh_uta_2502M_10903.pdf
 
Missing Data Problems in Machine Learning
Missing Data Problems in Machine LearningMissing Data Problems in Machine Learning
Missing Data Problems in Machine Learning
 
Report on I-LEACH: An Energy Efficient Routing Protocol in Wireless Sensor Ne...
Report on I-LEACH: An Energy Efficient Routing Protocol in Wireless Sensor Ne...Report on I-LEACH: An Energy Efficient Routing Protocol in Wireless Sensor Ne...
Report on I-LEACH: An Energy Efficient Routing Protocol in Wireless Sensor Ne...
 

Data Perturbation Techniques for Privacy Preserving Data Mining

  • 1. Data Perturbation for Privacy Preserving Data Mining Saif Mohammad Asif Kabir A Thesis in The Department of Electrical and Computer Engineering Presented in Partial Fulfillment of the Requirements for the Degree of Master of Applied Science (Electrical Engineering) at Concordia University Montreal, Quebec, Canada June 2007 ©Saif Mohammad Asif Kabir, 2007
  • 2.
  • 3. iii Abstract Data Perturbation for Privacy Preserving Data Mining Saif Mohammad Asif Kabir Because of the increasing ability to trace and collect large amount of personal information, privacy preserving in data mining applications has become an important concern. Data perturbation is one of the well known techniques for privacy preserving data mining. The objective of these data perturbation techniques is to distort the individual data values while preserving the underlying statistical distribution properties. These data perturbation techniques are usually assessed in terms of both their privacy parameters as well as its associated utility measures. While the privacy parameters present the ability of these techniques to hide the original data values, the data utility measures assess whether the dataset keeps the performance of data mining techniques after the data distortion. In this thesis, we investigate the use of truncated non-negative matrix factorization (NMF) with sparseness constraints, and the Discrete Wavelet Transform (DWT) with truncation as data distortion techniques for data mining privacy. Privacy parameters, previously proposed in the literatures, are used to measure the privacy of the data after perturbation and the accuracy of a simple K-nearest neighborhood (KNN) classifier is used as the data utility measure. Our experimental results show that these two techniques not only improve privacy, but also improve the classification accuracy.
  • 4. iv Acknowledgments This is one of the best moments in my Masters’ program - to publicly acknowledge those people who have contributed to make my success a part of their own in many different ways. First of all, I would like to express my earnest gratitude to my supervisors, Dr. Elhakeem and Dr. Youssef for their perpetual support and guidance with enduring patience. I'd like also to express my appreciation to all the faculty and people at Concordia Institute for Information Systems Engineering (CIISE) who contributed to my success one way or the other. I would also like to acknowledge the innumerable contributions of Dr. Ben Hamza and lab partner Ren wang. Special thanks go to all my research partners at the EV 2.224 lab, whose help and patience made my work a lot easier. Thank you all. For my parents, no words can suffice. My deepest love, supreme appreciation and gratitude to my parents, Enamul Kabir and Zulfia Yasmin for their unceasing love, relentless patience and devotion raising. It would have been impossible for me to accomplish this endeavor without their support. Finally thanks to my beloved wife Shahrin Zaman for her patient, love, tender and caring from far distance.
  • 5. v Table of Contents LIST OF FIGURES ...........................................................................................................................VII LIST OF TABLES .............................................................................................................................. IX LIST OF SYMBOLS........................................................................................................................... XI CHAPTER 1 ...........................................................................................................................................1 INTRODUCTION TO DATA MINING PRIVACY...........................................................................1 1.1 INTRODUCTION........................................................................................................................1 1.2 RELATED WORK......................................................................................................................3 1.2.1 P3P and Secure Database .....................................................................................................4 1.2.2 Secure Multi-Party Computation...........................................................................................5 1.2.3 Data Swapping.......................................................................................................................5 1.2.4 Security of Statistical Database.............................................................................................6 1.2.5 Privacy Preserving Distributed Data Mining...................................................................6 1.2.6 Rule Hiding.......................................................................................................................7 1.2.7 Data Perturbation ............................................................................................................7 1.3 CONTRIBUTION OF THIS WORK ................................................................................................9 CHAPTER 2 .........................................................................................................................................11 BACKGROUND CONCEPTS............................................................................................................11 2.1 INTRODUCTION......................................................................................................................11 2.2 NON-NEGATIVE MATRIX FACTORIZATION ............................................................................12 2.2.1 Cost Function .................................................................................................................13 2.2.2 Multiplicative update rule ..............................................................................................13 2.2.3 NMF with Sparseness Constraints..................................................................................15 2.3 WAVELET OVERVIEW ...........................................................................................................17 2.4 PRIVACY MEASURE ...............................................................................................................21 2.5 DATA UTILITY MEASURES ....................................................................................................22 2.6 PERFORMANCE MEASURE FOR CLASSIFIER ...........................................................................23 2.7 BAYESIAN 
ESTIMATION.........................................................................................................24 2.8 CONCLUSION.........................................................................................................................27 CHAPTER 3 .........................................................................................................................................28 NON-NEGATIVE MATRIX FACTORIZATION FOR DATA PERTURBATION .....................28 3.1 INTRODUCTION......................................................................................................................28 3.2 EXPERIMENTAL RESULTS......................................................................................................28 3.3 ADDING SPARSENESS CONSTRAINT.......................................................................................30 3.4 DEALING WITH NEGATIVE VALUES........................................................................................32 3.5 CONCLUSIONS .......................................................................................................................39
  • 6. vi CHAPTER 4 .........................................................................................................................................40 DATA DISTORTION USING DISCRETE WAVELET TRANSFORM .......................................40 4.1 INTRODUCTION......................................................................................................................40 4.2 EXPERIMENTAL RESULTS ......................................................................................................41 4.3 CONCLUSION.........................................................................................................................48 CHAPTER 5 .........................................................................................................................................49 BAYESIAN ESTIMATION OF ORIGINAL DATA ........................................................................49 5.1 INTRODUCTION......................................................................................................................49 5.2 EXPERIMENTAL RESULTS ......................................................................................................49 5.3 CONCLUSIONS .......................................................................................................................55 CHAPTER 6 .........................................................................................................................................56 CONCLUSIONS AND FUTURE WORK..........................................................................................56 6.1 CONCLUSIONS .......................................................................................................................56 6.2 FUTURE WORKS ....................................................................................................................57 REFERENCES.....................................................................................................................................59
  • 7. vii List of Figures Figure 3.1 Effect of the reduced rank m on the privacy parameters 30 Figure 3.2 Effect of sparseness constraint on privacy parameters (Wisconsin Breast Cancer dataset) 31 Figure 3.3 Effect of sparseness constraint on privacy parameter (Ionosphere dataset using first approach) 34 Figure 3.4 Effect of truncation threshold on privacy parameters (Ionosphere dataset using first approach) 36 Figure 3.5 Effect of sparseness constraint on privacy parameters (Ionosphere dataset using second approach) 38 Figure 3.6 Effect of truncation threshold on privacy parameters (Ionosphere dataset using second approach) 38 Figure 4.1 Proposed data distortion Technique 41 Figure 4.2 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Wisconsin Breast Cancer dataset) 43 Figure 4.3 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (breast cancer dataset) 44 Figure 4.4 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Ionosphere dataset) 47 Figure 4.4 Influence of truncating pairs of detail coefficients on the privacy and
  • 8. viii accuracy parameters (Ionosphere dataset) 47 Figure 5.1 Triangular signal distribution 50 Figure 5.2 Uniform signal distribution 50 Figure 5.3 Distribution of Gaussian noise in the Triangular signal 51 Figure 5.4 Distribution of equivalent noise in the NMF-perturbed Triangular signal 51 Figure 5.5 Distribution of equivalent noise in the DWT-perturbed Triangular signal 51 Figure 5.6 Distribution of Gaussian noise in the Uniform signal 52 Figure 5.7 Distribution of equivalent noise in the NMF-perturbed Uniform signal 52 Figure 5.8 Distribution of equivalent noise in the DWT-perturbed Uniform signal 52
  • 9. ix List of Tables Table 2.1 Illustration of one-dimensional Haar wavelet transform 19 Table 2.2 Illustration of one-dimensional Haar wavelet transform to f(x)=[7 5 6 2] 19 Table 2.3 Confusion Matrix 23 Table 3.1 Experimental results for KNN (K=30) 29 Table 3.2 Effect of the sparseness constraint on the privacy parameters and accuracy 31 Table 3.3 The effect of threshold ∈ on the privacy parameters and accuracy 32 Table 3.4 Privacy parameters for the ionosphere dataset using the absolute value approach with m=16 33 Table 3.5 Privacy parameters for the ionosphere dataset using the absolute value approach with 16=m and 08.0=hS 34 Table 3.6 Privacy parameters for the ionosphere dataset using the absolute value approach with truncation 35 Table 3.7 Trade off between the privacy parameters and accuracy for the ionosphere dataset using biasing approach 37 Table 4.1: Influence of truncating the detail coefficients on the privacy and accuracy parameters (Wisconsin Breast Cancer dataset) 42 Table 4.2: Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (breast cancer dataset) 45
  • 10. x Table 4.3: Influence of truncating the detail coefficients on the privacy and accuracy parameters (Ionosphere dataset) 46 Table 4.4: Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Ionosphere dataset) 46 Table 5.1: Estimation results for triangular signal distribution 54 Table 5.2: Estimation results for uniform signal distribution 54
  • 11. xi List of Symbols KDD Knowledge Discovery in Database P3P Platform for Privacy Preference KNN K-Nearest Neighboring SMC Secure Multi-Party Computation PCA Principle Component Analysis ACC Accuracy NMF Non-negative Matrix Factorization DWT Discrete Wavelet Transform SVM Support Vector Machine ICA Independent Component Analysis PMF Positive matrix factorization FFT Fast Fourier Transform PDF Probability Density Function MSE Mean Squared Error XML Extensible Markup Language W3C World Wide Web Consortium
  • 12.
  • 13. 1 Chapter 1 Introduction to Data mining Privacy 1.1 Introduction Data mining or knowledge discovery in databases [1] is the process of searching for useful and understandable patterns in large volumes of data using tools such as classification and association rule mining. A huge amount of data is practically useless when one is not able to extract the valuable information hidden in it. Data mining is a promising approach to meet this challenging requirement and has emerged as a momentous technology for gaining knowledge from vast quantities of data. The fascination with the promise of analysis of large amount of data has led to an increasing number of successful applications of data mining which are very useful in marketing, business, medical analysis and other applications in which pattern discovery is supreme for strategic decision making. On the other hand, modern information technology also collect and analyze millions of transactions containing personal information. With the new advent of technology, there has been an explosive growth in huge amount of personal data generated or collected and stored in electronic form. Several applications deal with privacy sensitive data such as financial transactions, health care records, criminal records, and credit records.
  • 14. 2 This leads to a growing concern that the use of this technology is violating individual or group of individuals’ privacy which led to a drawback against the technology. Privacy advocates have been demanding effort to stop bringing more information into integrated collection. One example is the public protest in Japan against the creation of a national registry containing information previously held by the prefectures [2]. Another example is the U.S. senate siege of all data mining research and development by the U.S. Department of defense [3] and introducing Data Mining Moratorium Act. Thus, despite the fact that most data mining methods aim to develop generalized knowledge, rather than identify information about specific individuals, and despite of its benefits in various areas, one has to acknowledge that improper use of data mining techniques can also result in new threats to privacy and information security. On the other hand, one also has to acknowledge the fact that the main problem is not with data mining itself, but the infrastructure used to support it. According to the Data- Mining Moratorium Act, the siege of the Total/Terrorism Information awareness program was not because preventing terrorism is a bad concept but because of the possible exploitation of the data. As those data was distributed among multiple databases under several authorities, it was difficult to collect the data for misuse. Building data warehouse for data mining may change this assumption. Another problem is with the results themselves. Publishing summaries of census data carries risk of violating privacy which is recognized by census community. Summary tables may not be identifying an individual but it may be possible to isolate an individual and determine private information by combining results from different tables. Thus it is clear that, in order to be fair to this technology, both the data mining and information security communities must
  • 15. 3 address these issues and different techniques have to be adapted to resolve those problems. Numerous techniques have been developed which allow mining when we are not allowed to see the data to avoid the potential for misuse posed by an integrated data warehouse. This work basically falls into two main categories: Data perturbation [4] and secure multiparty computation [9]. In data perturbation techniques, the original dataset is perturbed with the aim that the disclosed dataset may not reveal any private information. The data mining challenge in this case is how to obtain useful (non-private) information from such perturbed data. The second category depend on separation of authority: Data is presumed to be controlled by different entities and the goal is for those entities to cooperate to obtain valid data mining results without disclosing the real data to others. The goal of this dissertation is to develop and evaluate new data perturbation techniques to privacy preserving data mining. Before we move into the details of our proposed work, in the next section, we briefly overview some of the related literature on privacy preserving data mining. 1.2 Related Work A growing body of literature exists on different approaches of privacy preserving data mining. Some of theses approaches adopted for preserving privacy in data mining are: • Secure Database/ Platform for Privacy Preference(P3P) • Secure Multi-Party Computation(SMC) • Data Swapping
  • 16. 4 • Security of Statistical Database • Privacy Preserving Distributed Data Mining • Rule Hiding • Data Perturbation In next few sections, we provide a brief description of these approaches. 1.2.1 P3P and Secure Database Nowadays a huge portion of the data related to individuals is composed by different web sites providing different services. The World Wide Web Consortium (W3C) Platform for Privacy Preferences (P3P) [5] is considered as one of the most well- known infrastructures to enable web users to gain more control about the information that web sites collect. P3P includes the system and architecture design perspectives of privacy preserving data mining. While P3P is criticized by relying on each individual website to be honest with its policy files, it is still considered a good effort to maintain privacy standard for personal data collected over the Internet. P3P provides a way for web site owners to encode their privacy policies in a standard XML format so that users can check against their privacy preferences to decide whether or not to release their personal data to the web site. There is a detailed survey of current P3P implementations in [6]. The basic P3P architecture is client based, i.e. privacy of client is defined at the web-client end. In contrast with these client-centric implementations, the author in [7] proposed a server- centric architecture for P3P. However, the current P3P standard only provides a schema for web users to check the consistency of their privacy preferences with the web sites’
privacy policies. No mechanism has been specified so far to force web sites to act according to their stated policies. 1.2.2 Secure Multi-Party Computation Secure multi-party computation (SMC) is the problem of evaluating a function of two or more parties' secret inputs. Each party finally gets a share of the function output, and no other information is revealed to the parties except what is implied by each party's own inputs and outputs. Yao [9] introduced the secure multi-party computation concept, which was later extended by Goldreich et al. in [10]. The circuit evaluation protocol [10, 12], 1-out-of-k oblivious transfer [13], homomorphic encryption [14], commutative encryption [15], Yao's millionaire problem (secure comparison), and some other cryptographic techniques serve as the building blocks of SMC. Detailed discussions of the SMC framework can be found in [16], [17]. A variety of new SMC applications and open problems for a spectrum of cooperative computation domains is presented in [18]. Related works include privacy preserving information retrieval [19], privacy preserving statistical analysis [20], [21], privacy preserving geometric computation [20], and privacy preserving scientific computation [23]. The work presented in [24] discusses a wide array of new secure multi-party computation applications. The SMC ideas have also been applied to privacy preserving decision tree induction [25], naïve Bayesian classification [26] of horizontally partitioned data, privacy preserving Bayesian network structure computation for vertically partitioned data [27], K-means clustering over vertically partitioned data [28], and many others. 1.2.3 Data Swapping Tore Dalenius and Steven Reiss [29] proposed the basic idea of data swapping. This simple technique maintains the confidentiality of the attributes without changing the aggregate statistical properties of the data. The database is transformed by switching a subset of attributes between selected pairs of records so that the lower order
frequency counts or marginal totals are preserved while data confidentiality is uncompromised. This technique can be considered a data perturbation technique, and it will be discussed in a later section. Since its initial appearance, a variety of refinements of data swapping have been suggested. The reader is referred to [30] for a thorough treatment. 1.2.4 Security of Statistical Databases The objective of this technique is to find mechanisms that prevent the disclosure of individual values while maximizing the number of statistical queries that can be answered about subsets of records of a database. The security of statistical databases against confidentiality disclosure is discussed in [31]. Security control methods can be classified into conceptual, output perturbation, query restriction, and data perturbation approaches [32]. Statistical microdata protection using record linkage is discussed in [33]. 1.2.5 Privacy Preserving Distributed Data Mining The distributed data mining approach supports computation of data mining models and extraction of patterns at a certain node by exchanging only the minimal necessary information among the participating nodes. Several distributed algorithms have been proposed in the field of distributed data mining [34, 35]. The Fourier spectrum based approach to represent and construct decision trees [36, 37] and the collective hierarchical clustering [38] are examples of distributed data mining algorithms which can be used for privacy preserving data mining after minor modification. Several
distributed techniques to mine multiparty data have been reported. A privacy preserving technique to construct decision trees [39], a multi-party secure computation framework [40], and association rule mining from homogeneously [41] and heterogeneously [42] distributed datasets are other examples. 1.2.6 Rule Hiding The main idea of rule hiding is to transform the database such that the sensitive rules are masked while all other underlying patterns can still be discovered. Hiding sensitive large itemsets in the context of association rule mining is an example of a rule hiding technique, and the corresponding optimal sanitization is an NP-hard problem [43]. This is why heuristic approaches have been applied to address the complexity issues. The perturbation based rule hiding technique [44, 45] is implemented by toggling 0 and 1 values so that the frequent itemsets that generate the rule are hidden, or the support of sensitive rules is lowered to a user specified threshold. The blocking based association rule hiding technique [46] is another example of this approach. 1.2.7 Data Perturbation Data perturbation approaches fall into two main categories: the probability distribution approach and the value distortion approach. The first approach replaces the data with another estimated sample from the same distribution [47]. The second approach perturbs the values of data elements directly by some additive or multiplicative noise before their release to the miner. Some randomized methods [49] fall into this category. The authors in [50] proposed a value distortion technique
to protect privacy by adding Gaussian random noise to the original data. They masked the original data while still estimating the original distribution and a decision tree model with good accuracy. This approach was further extended by applying an expectation-maximization (EM) based algorithm [51] for better reconstruction of the distribution. Evfimievski [52] and Rizvi [53] considered the same approach for association rule mining and suggested techniques for limiting privacy breaches. The authors in [54] proposed probability distribution data distortion, where sensitive variables are replaced by a distorted set of values. To avoid some of the drawbacks of additive noise, other approaches make use of multiplicative noise [55], [56] for protecting the privacy of the data while maintaining some of the original analytic properties. Two methods of multiplicative noise are used for data perturbation. The first method is based on generating random numbers that have a truncated Gaussian distribution with mean one and small variance, and multiplying each element of the original data by this noise. The second method is to take a logarithmic transformation of the data first (for positive data only), compute the covariance, generate random noise following a multivariate Gaussian distribution with mean zero and covariance equal to a constant times the covariance computed in the previous step, add this noise to each element of the transformed data, and finally take the antilog of the noise-added data. Multiplicative perturbation overcomes the scale problem, and it has been proved that the mean and variance/covariance of the original data elements can be estimated from the perturbed version. In practice, the first method is good if the data disseminator only wants to make minor changes to the original data; the second method assures higher security than the first one while still maintaining the data utility very well. One of the main problems of the traditional additive perturbation and multiplicative
perturbation is that they perturb each data element independently; therefore, the similarity between attributes or observations, which are considered as vectors in the original data space, is not well preserved. Liu [57] proposed a random projection based multiplicative data perturbation method that uses random projection matrices to construct a perturbed representation instead of multiplying each element by noise. 1.3 Contribution of this work This work investigates two privacy preserving data perturbation techniques: truncated non-negative matrix factorization (NMF) with sparseness constraints, and the Discrete Wavelet Transform (DWT) with truncation. Similar to other techniques, our main objective is to conceal the individual data items while preserving the underlying statistical distribution, i.e., the statistical properties of the perturbed data should still match those of the original data. Privacy parameters previously proposed in the literature are used to measure the privacy of the data after perturbation, and the accuracy of a simple K-nearest neighbor (KNN) classifier is used as the data utility measure. Our experimental results on real world datasets [64] show that these two techniques not only improve privacy, but also improve the classification accuracy. The rest of the thesis is organized as follows. Chapter 2 briefly reviews the essential concepts and definitions which we will refer to throughout the thesis. In Chapter 3, we investigate the use of non-negative matrix factorization (NMF) with sparseness
constraints for data perturbation. The results obtained when applying our proposed technique to the original Wisconsin Breast Cancer and Ionosphere databases, downloaded from the UCI Machine Learning Repository, are also presented. Chapter 4 presents the corresponding results when using the Discrete Wavelet Transform (DWT) with truncation. Chapter 5 presents the results obtained when using the Bayesian estimation technique to estimate the original signal distorted with the above two techniques. Finally, conclusions and future work are given in Chapter 6. The results presented in Chapters 3 and 4 have been partially published in [65] and [66], respectively.
Chapter 2 Background Concepts 2.1 Introduction The objective of data perturbation techniques is to distort the individual data values while preserving the underlying statistical distribution properties. These data perturbation techniques are usually assessed in terms of both their privacy parameters and their associated utility measures. While the privacy parameters represent the ability of these techniques to hide the original data values, the data utility measures assess whether the dataset retains the performance of data mining techniques after the data distortion. Non-negative Matrix Factorization (NMF) and the Discrete Wavelet Transform (DWT) are two important techniques used in many signal processing applications. In this work, we investigate the use of both NMF and DWT as data perturbation techniques for privacy preserving data mining. In this chapter, we briefly review the formal theory of NMF and DWT. Different privacy parameters and utility measures, previously introduced in the literature, are also reviewed. Finally, we describe the vulnerability of the privacy methods in terms of the original signal estimation.
2.2 Non-negative Matrix Factorization Non-negative matrix factorization (NMF) is a matrix factorization technique which produces a useful decomposition for data analysis. NMF decomposes the data as a product of two matrices having non-negative elements. This yields a reduced representation of the original data that can be seen either as a feature extraction or as a dimensionality reduction technique. NMF can be interpreted as a parts-based representation of the data because of its non-negativity constraints. The factorization process of NMF is an active area of research in several fields, and the subject is certainly a fertile area of research. The notion of low rank approximations arises in a wide range of important applications, and non-negative matrix factorization is one such approximation. The NMF approach can be formulated as follows: given a non-negative $N \times R$ data matrix $V$, we can approximately factorize $V$ into a product of two non-negative matrices $W$ and $H$ with sizes $N \times M$ and $M \times R$ respectively, that is, $V \approx WH$, where the reduced rank $M$ of the factorization is generally chosen so that $(N + R)M < NR$, and the product $WH$ can be regarded as a compressed form of the data $V$. In this approximation, $W$ contains the basis vectors as its columns, whereas each column of $H$ contains the coefficients of the corresponding measurement vector. Note that each measurement vector is written in terms of the same basis vectors. The optimal choices of the matrices $W$ and $H$ are those non-negative matrices that minimize the reconstruction error between $V$ and $WH$. In the next section, we present two of the most commonly used error functions.
2.2.1 Cost Function We have to define a cost function in order to find the approximating matrices W and H. The Euclidean distance is one natural way to evaluate the approximation between the two matrices V and WH. The (squared) Euclidean distance between any two matrices A and B is defined as
$$\|A - B\|^2 = \sum_{i,j} (A_{ij} - B_{ij})^2.$$
Here the lower bound is zero, and equality is achieved when A = B. Another useful cost function is the divergence of A from B, which is defined as
$$D(A \,\|\, B) = \sum_{i,j} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right).$$
Similar to the Euclidean distance, the lower bound of the divergence measure is zero, and it is achieved when A = B. This measure is called the divergence from A to B because it is not symmetric, i.e., in general, $D(A \,\|\, B) \neq D(B \,\|\, A)$. In the above measure, A and B can be regarded as normalized probability distributions when $\sum_{i,j} A_{ij} = \sum_{i,j} B_{ij} = 1$. Many NMF factorization algorithms have been proposed. Among these techniques, the multiplicative update [58] is the simplest to describe. 2.2.2 Multiplicative update rule Assuming the use of the Euclidean distance, the objective of the multiplicative update is to minimize $\|V - WH\|^2$ with respect to W and H, subject to
the constraint $W, H \ge 0$. The multiplicative update algorithm [58] can be summarized as follows: 1. Initialize W and H as non-negative random matrices. 2. Update both W and H until convergence of $\|V - WH\|^2$ using the following update rules:
$$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}, \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}.$$
It can be shown that $\|V - WH\|^2$ is non-increasing under the above update rules [58]. One should also note that both W and H should be updated simultaneously during the above updates. In other words, instead of updating the whole matrix W first and then updating H, we should update one row of W and the corresponding column of H. During the update of a row of W or a column of H, we do not need to calculate the whole matrices $W^T V$, $W^T W H$, $V H^T$ and $W H H^T$, since we only need one row and one column of these matrices during one update. If the divergence measure is used, then the update rules in step 2 above should be replaced by
$$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}}, \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}}.$$
Similar to the Euclidean update, both W and H should be updated simultaneously, and it can also be shown that the divergence $D(V \,\|\, WH)$ is non-increasing under the above update rules [58].
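To make the procedure concrete, the following sketch implements the Euclidean-cost multiplicative updates in Python/NumPy. It is an illustration of the rules above, not the exact code used in this thesis; the function name, the fixed iteration count, and the small constant added to the denominators to avoid division by zero are our own choices.

```python
import numpy as np

def nmf_multiplicative(V, M, n_iter=500, eps=1e-9, seed=0):
    """Euclidean-cost NMF, V (N x R) ~ W (N x M) @ H (M x R),
    via the multiplicative update rules of [58]."""
    rng = np.random.default_rng(seed)
    N, R = V.shape
    W = rng.random((N, M)) + eps   # non-negative random initialization
    H = rng.random((M, R)) + eps
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W <- W * (V H^T) / (W H H^T)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

For brevity, this sketch alternates full-matrix updates; as noted above, interleaving row and column updates avoids forming the full matrix products.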
It should be noted that minimizing the cost function above can also be seen as a traditional bound-constrained optimization problem, which can be solved by some simple and effective techniques such as the projected gradient method [58], [59]. 2.2.3 NMF with Sparseness Constraints Sparse coding is a coding of data where only a few components of the code are significantly active for any given input vector. While the non-negativity constraints enforce some sparseness during the NMF, traditional NMF techniques do not provide explicit control over the degree of sparseness. On the other hand, NMF with sparseness constraints has already been used in applications that require additional sparseness. The intent of adding a sparseness constraint to NMF is to find a decomposition that results in a part-based representation where only a few units are highlighted to represent a typical data vector. As a result, most of the values are close to zero while a few components take significantly non-zero values. One sparseness measure of a given vector $x$ is based on the relationship between its $L_1$ norm and its $L_2$ norm [11] and is given by
$$\mathrm{sparseness}(x) = \frac{\sqrt{n} - \left(\sum_i |x_i|\right) \big/ \sqrt{\sum_i x_i^2}}{\sqrt{n} - 1},$$
where $n$ is the dimensionality of $x$. The sparseness constraint can be enforced on W, on H, or on both of them, depending on the application. Assuming that one wants to enforce sparseness on H, the (Euclidean) cost function changes to:
$$E(W, H) = \sum_{i,j} \Bigl( V_{ij} - \sum_k W_{ik} H_{kj} \Bigr)^2 + \lambda \sum_{i,j} g(H_{ij}).$$
Here, $\lambda \ge 0$ is the sparseness parameter and $g$ is the sparseness function. Note that one can keep the same reconstruction cost, yet obtain a lower value of the cost function, by scaling up the basis vectors and scaling down the measurement vectors; letting the sparsity term go to zero while the basis vectors grow without bound would thus lead to the "optimal" solution. To solve this scaling problem, a normalization step such as $W_j \leftarrow W_j / \alpha_j$, where $\alpha_j = \alpha(W_j)$ is some norm of $W_j$, is incorporated when minimizing the above equation. Reformulating the cost function to work with normalized vectors, the cost function above becomes
$$E(W, H) = \sum_{i,j} \Bigl( V_{ij} - \sum_k \frac{W_{ik}}{\|W_k\|} H_{kj} \Bigr)^2 + \lambda \sum_{i,j} g(H_{ij}).$$
Thus, the new cost function depends on the normalized basis vectors $\overline{W}_j = W_j / \|W_j\|$, where $\|\cdot\|$ denotes any differentiable norm. The modified update rule for NMF with sparseness constraints using gradient descent is summarized by the following steps: 1. Calculate the gradient $\nabla_{W_j} \|W_j\|$. 2. Normalize the basis vectors according to $W_j \leftarrow W_j / \|W_j\|$.
3. Calculate the approximate factorization according to $\hat{v}_i = \sum_j H_{ji} W_j$. 4. Update the measurement vectors according to
$$H_{ji} \leftarrow H_{ji} \, \frac{W_j^T v_i}{W_j^T \hat{v}_i + \lambda\, g'(H_{ji})}.$$
5. Recalculate the reconstruction with the new coefficient vectors according to $\hat{v}_i = \sum_j H_{ji} W_j$. 6. Update the non-normalized basis vectors according to
$$W_j \leftarrow W_j \otimes \frac{\sum_i H_{ji} \left[ v_i + \bigl(\hat{v}_i^T \overline{W}_j\bigr)\, \nabla_{W_j} \|W_j\| \right]}{\sum_i H_{ji} \left[ \hat{v}_i + \bigl(v_i^T \overline{W}_j\bigr)\, \nabla_{W_j} \|W_j\| \right]},$$
where $\otimes$ denotes element-wise multiplication of the corresponding matrices (and the division is likewise element-wise). 7. Repeat the above steps until convergence. For a proof of convergence, the reader is referred to [58]. 2.3 Wavelet Overview Wavelet transforms are mathematical tools for hierarchically decomposing functions. They allow a function to be described in terms of a coarse overall level, plus details that range from broad to narrow. Wavelets offer an elegant technique for representing the levels of detail present. A wavelet transformation converts data from the original domain to the wavelet domain by expanding the raw data in an orthonormal basis generated by dilation and translation of a father and mother wavelet. Wavelet
transformation preserves the structure of the data. A contracted, high frequency version of a wavelet performs temporal analysis, whereas a dilated, low frequency version of the same wavelet performs frequency analysis. Unlike the Fourier transform, which uses sine and cosine as basis functions, the wavelet transform uses more complicated basis functions. While individual wavelet functions are localized in space, the Fourier sine and cosine functions are not. Moreover, the wavelet transform does not have a single set of basis functions like the Fourier transform. The localization property of wavelets makes many functions sparse when transformed into the wavelet domain. This sparseness has been utilized in various applications such as data compression and removing noise from data. In this work, we use the Haar wavelet transform [60], which is based on one of the simplest possible wavelets, the Haar wavelet, described by the step function
$$\phi(x) = \begin{cases} 1, & 0 \le x < \tfrac{1}{2}, \\ -1, & \tfrac{1}{2} \le x \le 1, \\ 0, & \text{otherwise}. \end{cases}$$
In order to avoid unnecessary mathematical notation, we will describe the Haar wavelet transform through a simple example. The one-dimensional Haar transform of a function f can be viewed as a series of averaging and differencing operations on a discrete function. We compute the averages and differences between every two adjacent
values of f(x). The procedure to find the Haar transform of a discrete function f(x) = [a0 a1 a2 a3] is shown in Table 2.1. In this example, resolution 4 is the full resolution.

Resolution | Approximation | Detail coefficients
4 | [a0 a1 a2 a3] |
2 | [b0 b1] = [(a0+a1)/2, (a2+a3)/2] | [(a0-a1)/2, (a2-a3)/2]
1 | [(b0+b1)/2] | [(b0-b1)/2]

Table 2.1 Illustration of the one-dimensional Haar wavelet transform

For example, if f(x) = [7 5 6 2], then we have

Resolution | Approximation | Detail coefficients
4 | [7 5 6 2] |
2 | [6 4] | [1 2]
1 | [5] | [1]

Table 2.2 Illustration of the one-dimensional Haar wavelet transform applied to f(x) = [7 5 6 2]

Note that another definition of the Haar wavelet transform, which differs from the above definition by a factor of 2, also exists. However, this constant factor is irrelevant to our work since the distorted dataset is obtained by applying the inverse transform again.
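The averaging-and-differencing procedure of Tables 2.1 and 2.2 can be written in a few lines. The sketch below is our own illustration, using the unnormalized average-based convention above; it reproduces the decomposition of f(x) = [7 5 6 2].

```python
def haar_1d(signal):
    """Full 1D Haar decomposition by repeated averaging and differencing.
    Returns [coarsest average, coarsest detail, ..., finest details]."""
    approx = list(signal)
    details = []
    while len(approx) > 1:
        avg, diff = [], []
        for a, b in zip(approx[0::2], approx[1::2]):
            avg.append((a + b) / 2)    # approximation at the next coarser level
            diff.append((a - b) / 2)   # detail coefficient
        details = diff + details       # keep finer details to the right
        approx = avg
    return approx + details

print(haar_1d([7, 5, 6, 2]))  # [5.0, 1.0, 1.0, 2.0], matching Table 2.2
```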
It can also be shown that the above transform is equivalent to multiplying the vector representation of f by an integer (wavelet) matrix, which can be computed more efficiently than the analogous Fourier matrix. Multi-dimensional wavelets are usually defined via the tensor product. The two-dimensional wavelet basis consists of all possible tensor products of the one-dimensional basis functions. Applying the Haar wavelet transform to a 2-dimensional matrix results in four sets of coefficients: the approximation coefficients matrix, cA, and the horizontal, vertical and diagonal detail coefficients matrices (called cH, cV, and cD, respectively). For example, when applying a single level Haar wavelet transform to the matrix
$$\begin{bmatrix} 3 & 5 \\ 9 & 8 \end{bmatrix}$$
we obtain: cA (the overall average) = (3 + 5 + 9 + 8)/4 = 6.25; cH (the average difference between the row sums) = ((3 + 5) - (9 + 8))/4 = -2.25; cV (the average difference between the column sums) = ((3 + 9) - (5 + 8))/4 = -0.25; and cD (the average difference between the diagonal sums) = ((3 + 8) - (9 + 5))/4 = -0.75. We can think of the approximation coefficients as a coarser resolution version of the original signal, and of the detail (difference) coefficients as the higher resolution details.
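As a check on the 2x2 example above, the single-level coefficients can be computed directly. The helper below is a sketch of our own, following the average-based convention of this section.

```python
import numpy as np

def haar2d_level1_block(block):
    """Single-level 2D Haar coefficients of a 2x2 block (average convention)."""
    (a, b), (c, d) = block
    cA = (a + b + c + d) / 4        # overall average
    cH = ((a + b) - (c + d)) / 4    # difference of row sums
    cV = ((a + c) - (b + d)) / 4    # difference of column sums
    cD = ((a + d) - (b + c)) / 4    # difference of diagonal sums
    return cA, cH, cV, cD

print(haar2d_level1_block(np.array([[3, 5], [9, 8]])))
# (6.25, -2.25, -0.25, -0.75)
```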
2.4 Privacy measure Throughout this work, we adopt the same set of privacy parameters proposed in [62]. The value difference (VD) parameter is used as a measure of the value difference after the data distortion algorithm is applied to the original data matrix. Let $V$ and $\overline{V}$ denote the original and distorted data matrices, respectively. Then VD is given by
$$VD = \frac{\|V - \overline{V}\|}{\|V\|},$$
where $\|\cdot\|$ denotes the Frobenius norm of the enclosed argument. After a data distortion, the order of the values of the data elements also changes. Several metrics are used to measure the position difference of the data elements. For a dataset $V$ with $n$ data objects and $m$ attributes, let $Rank_j^i$ denote the rank (in ascending order) of the $j$th element in attribute $i$, and let $\overline{Rank}_j^i$ denote the rank of the corresponding distorted element. The RP parameter is used to measure the position difference. It indicates the average change of rank for all attributes after distortion and is given by
$$RP = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} \left| Rank_j^i - \overline{Rank}_j^i \right|.$$
RK represents the percentage of elements that keep their rank in each column after distortion and is given by
$$RK = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} Rk_j^i,$$
where $Rk_j^i = 1$ if the element keeps its position in the order of values, and $Rk_j^i = 0$ otherwise.
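The parameters defined so far translate directly into code. The sketch below is our own illustration, with columns as attributes; ties in the ranking are broken by position, a detail the definition leaves open.

```python
import numpy as np

def _ranks(col):
    """Ascending ranks of a column; ties broken by position."""
    r = np.empty(len(col), dtype=int)
    r[np.argsort(col, kind="stable")] = np.arange(len(col))
    return r

def vd_rp_rk(V, V_dist):
    """VD, RP and RK of a distorted matrix (rows = objects, cols = attributes)."""
    VD = np.linalg.norm(V - V_dist) / np.linalg.norm(V)  # Frobenius norms
    r0 = np.apply_along_axis(_ranks, 0, V)               # ranks per attribute
    r1 = np.apply_along_axis(_ranks, 0, V_dist)
    RP = np.abs(r0 - r1).mean()        # average rank change
    RK = (r0 == r1).mean()             # fraction of elements keeping their rank
    return VD, RP, RK
```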
Similarly, the CP parameter is used to measure how the rank of the average value of each attribute varies after the data distortion. In particular, CP defines the change of rank of the average value of the attributes and is given by
$$CP = \frac{1}{m} \sum_{i=1}^{m} \left| RankVV^i - \overline{RankVV}^i \right|,$$
where $RankVV^i$ and $\overline{RankVV}^i$ denote the rank of the average value of the $i$th attribute before and after the data distortion, respectively. Similar to RK, CK is used to measure the percentage of attributes that keep the rank of their average value after distortion. From the data privacy perspective, a good data distortion algorithm should result in high values for the RP and CP parameters and low values for the RK and CK parameters. 2.5 Data Utility Measures The data utility measures assess whether the dataset keeps the performance of data mining techniques after the data distortion. In other words, one measure of a good privacy preserving data mining technique is the ability to compute relevant statistics and construct prediction models without having access to the original data. The success of machine learning techniques in data mining has recently led researchers to explore the applicability of learning algorithms to privacy preserving data mining. A supervised learning algorithm is fed with a block of data that has been
classified manually into two or more classes, and builds a classifier, which is then used to assign new observations to their class. Throughout this work, we use the accuracy of a simple K-nearest neighbor (KNN) classifier as our data utility measure. KNN classification is a simple instance-based learning algorithm that has been shown to be effective in data classification. The success of this algorithm is due to the availability of an effective similarity measure among the K nearest neighbors. The algorithm starts by calculating the similarity between the test data and all data in the training set. It then picks the K closest instances and assigns the test data to the most common class among these nearest neighbors. All observations in the training data and test data are treated as vectors, and the classifier finds the K training vectors that are most similar to the test vector. In our work, we use the Euclidean distance as the similarity measure. 2.6 Performance Measure for Classifier There are many mechanisms used to measure the performance of a classifier. Here, we introduce the performance measures used throughout the thesis. Let N = A + B + C + D be the total number of observations in the test data.

 | Class 1 | Class 0
Classifier Decision: Class 1 | A | B
Classifier Decision: Class 0 | C | D

Table 2.3 Confusion Matrix
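A minimal version of this classifier is easy to state. The sketch below is illustrative (it assumes non-negative integer class labels), implementing the Euclidean-distance majority vote described above.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """K-nearest neighbor prediction with Euclidean distance and majority vote."""
    preds = []
    for x in X_test:
        dist = np.linalg.norm(X_train - x, axis=1)   # distance to every training vector
        nearest = y_train[np.argsort(dist)[:k]]      # labels of the k closest instances
        preds.append(np.bincount(nearest).argmax())  # most common class wins
    return np.array(preds)

# Utility measure: accuracy = (knn_predict(X_tr, y_tr, X_te, 30) == y_te).mean()
```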
If Table 2.3 denotes the confusion matrix of the data classifier, then we define the accuracy, precision, recall, and F1 measures for the class 1 classifier as follows:
$$\mathrm{ACCURACY} = \frac{A + D}{N}, \quad \mathrm{PRECISION}(P) = \frac{A}{A + B}, \quad \mathrm{RECALL}(R) = \frac{A}{A + C}, \quad F1 = \frac{2PR}{P + R}.$$
Similar measures can be defined for class 0:
$$\mathrm{ACCURACY} = \frac{A + D}{N}, \quad \mathrm{PRECISION}(P) = \frac{D}{C + D}, \quad \mathrm{RECALL}(R) = \frac{D}{B + D}, \quad F1 = \frac{2PR}{P + R}.$$
2.7 Bayesian Estimation The Bayesian estimation approach is directly based on Bayes' theorem [48]. Assuming that we have available some prior knowledge of the random variable to be estimated, we can incorporate this knowledge into our estimator. For this mechanism, we have to assume the prior probability density function (PDF) of the random variable s. The resulting estimator is said to be optimal on the average, i.e., with respect to the assumed prior pdf of s. From realizations of the variable s without noise, we may gather
the statistics needed to calculate the pdf of s. This can be done by a histogram of samples or by approximating the densities with parameterized models that are flexible enough to account for the variety of probability densities encountered. Consider the denoising problem for a scalar variable $x = s + n$, where the original variable s and the noise n are assumed to be statistically independent. To perform the denoising, it is essential to find an estimate $\hat{s}$, using the statistical properties of s and n, that is very close to s in some meaningful sense. While it is practically impossible to extract s exactly from the noisy variable, it is possible to find estimates which are better than the noisy sample x. Let $p_s(s)$ and $p_n(n)$ denote the prior probability density functions of s and n, respectively. We can calculate the posterior pdf of s given x using the basic axioms of probability theory. In particular, using the Bayes rule we have
$$p(s \mid x) = \frac{p(x \mid s)\, p(s)}{p(x)}.$$
Since $x = s + n$, we have $p(x \mid s) = p_n(x - s)$. Thus, $p(x)$ can be obtained as follows:
$$p(x) = \int_{-\infty}^{\infty} p_s(s)\, p_n(x - s)\, ds.$$
Hence, at least theoretically, we now have complete knowledge of $p(s \mid x)$. This posterior pdf (i.e., the pdf of s after the data have been observed) can help us find the value of s from the noisy variable x. Let $\mathrm{MSE}_B(\hat{s}) = E\bigl[(s - \hat{s})^2\bigr]$ denote the Bayesian mean square error, where the expectation operator is defined with respect to the joint pdf $p(x, s)$. In what follows, we show how to obtain the $\hat{s}$ that minimizes the Bayesian MSE.
1. Note that $p(x, s) = p(s \mid x)\, p(x)$. Thus,
$$\mathrm{MSE}_B(\hat{s}) = \int \left[ \int (s - \hat{s})^2\, p(s \mid x)\, ds \right] p(x)\, dx.$$
2. Since $p(x) \ge 0$ for all x, if the integral in brackets can be minimized for each x, then $\mathrm{MSE}_B$ will be minimized. This can be achieved by setting the derivative of the integral in brackets (with respect to $\hat{s}$) to zero. Hence,
$$\frac{\partial}{\partial \hat{s}} \int (s - \hat{s})^2\, p(s \mid x)\, ds = \int \frac{\partial}{\partial \hat{s}} (s - \hat{s})^2\, p(s \mid x)\, ds = -2 \int s\, p(s \mid x)\, ds + 2 \hat{s} \int p(s \mid x)\, ds = 0.$$
Then we have $\hat{s} = \int s\, p(s \mid x)\, ds$.
3. Substituting the value of $p(s \mid x)$, we get
$$\hat{s} = \frac{\int s\, p(x \mid s)\, p(s)\, ds}{p(x)} = \frac{\int s\, p_n(x - s)\, p_s(s)\, ds}{\int p_n(x - s)\, p_s(s)\, ds}.$$
Since the conditional pdf must integrate to 1, we get $\hat{s} = E[s \mid x]$. In other words, the optimal estimator in terms of minimizing the Bayesian MSE is the mean of the posterior pdf $p(s \mid x)$. The above estimation approach will be used in Chapter 5 to estimate the original dataset from its perturbed version. The associated mean square error serves as a measure of how well the data perturbation techniques conceal the original data.
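Numerically, the posterior mean can be evaluated on a grid. The following sketch is our own illustration of $\hat{s} = \int s\, p_n(x-s)\, p_s(s)\, ds \,/\, \int p_n(x-s)\, p_s(s)\, ds$; the Gaussian noise density, the triangular prior on [-100, 100], and all parameter values are assumptions chosen for the example, not values used in the thesis.

```python
import numpy as np

def mmse_estimate(x, s_grid, prior, noise_pdf):
    """Posterior-mean (minimum Bayesian MSE) estimate of s from x = s + n."""
    w = noise_pdf(x - s_grid) * prior        # integrand p_n(x - s) p_s(s)
    return np.trapz(s_grid * w, s_grid) / np.trapz(w, s_grid)

sigma = 5.0                                   # assumed noise standard deviation
noise_pdf = lambda n: np.exp(-n**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
s_grid = np.linspace(-100, 100, 2001)
prior = np.maximum(0.0, 1 - np.abs(s_grid) / 100)   # triangular prior shape
prior /= np.trapz(prior, s_grid)                    # normalize to a pdf
print(mmse_estimate(12.0, s_grid, prior, noise_pdf))
```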
2.8 Conclusion In this chapter, we presented some basic concepts related to privacy preserving data mining. We also introduced some privacy parameters and performance measures used throughout our thesis.
Chapter 3 Non-negative Matrix Factorization for Data Perturbation 3.1 Introduction In this chapter, we investigate the use of non-negative matrix factorization (NMF) with sparseness constraints for data perturbation. In order to test the performance of our proposed method, we conducted a series of experiments on some real world datasets. In this chapter, we present the results obtained when applying our technique to the original Wisconsin Breast Cancer and Ionosphere databases downloaded from the UCI Machine Learning Repository [64]. 3.2 Experimental Results As explained in Chapter 2, given a non-negative $N \times R$ data matrix $V$, we can approximately factorize $V$ into a product of two non-negative matrices $W$ and $H$ with sizes $N \times M$ and $M \times R$ respectively, that is, $V \approx WH$, where the reduced rank $M$ of the factorization is generally chosen so that $(N + R)M < NR$, and the product $WH$ can be regarded as a compressed form of the data $V$. Throughout our experiments, the rows of the (dataset) matrix $V$ correspond to the dataset attributes and the columns correspond to
the specific observations. To select the reduced rank M, we examine the accuracy of a KNN classifier on the reduced rank dataset. We conducted our experiment on real world data downloaded from the UCI Machine Learning Repository [64]. The dataset is the original Wisconsin Breast Cancer Database. For this breast cancer database, we used 569 observations and 30 attributes (with positive values) to perform our experiment. For the classification task, 80% of the data was used for training and the other 20% was used for testing. Throughout our experiments, we set K=30 for the KNN classifier. The corresponding classification parameters on the original dataset are shown in Table 3.1.

Accuracy | Class 1 Precision | Class 1 Recall | Class 1 F1 | Class 0 Precision | Class 0 Recall | Class 0 F1
92.11% | 89.9% | 96.9% | 93.2% | 95.6% | 86.0% | 90.5%

Table 3.1 Experimental results for KNN with K=30

Figure 3.1 shows the effect of the reduced rank M on the privacy parameters. From Figure 3.1, it is clear that M=2 provides the best choice with respect to the privacy parameters. We therefore fixed M=2 throughout the rest of our experiments with this dataset.
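A sketch of this rank-selection loop, reusing the illustrative helpers from Chapter 2 (nmf_multiplicative, vd_rp_rk, knn_predict), could look as follows. The dataset matrix V, the label vector y, and the train/test index arrays are assumed to be loaded already; this is not the exact experimental code.

```python
# V: attributes x observations, y: class labels (assumed already loaded)
for M in range(2, 31):
    W, H = nmf_multiplicative(V, M)
    V_dist = W @ H                                # perturbed dataset
    vd, rp, rk = vd_rp_rk(V.T, V_dist.T)          # privacy parameters
    preds = knn_predict(V_dist.T[train_idx], y[train_idx],
                        V_dist.T[test_idx], k=30)
    acc = (preds == y[test_idx]).mean()           # utility measure
    print(M, vd, rp, rk, acc)
```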
Figure 3.1 Effect of the reduced rank M on the privacy parameters (log-scale plot of acc, VD, RP, RK, CP and CK versus the reduced rank)

3.3 Adding a Sparseness Constraint Figure 3.2 shows the effect of the sparseness constraint on the privacy parameters for the Wisconsin Breast Cancer dataset. Table 3.2 shows how the privacy parameters and accuracy vary with the sparseness constraint $S_h$ for some points which display a good trade-off between the privacy parameters and the utility measure.
S_h | RP | RK | CP | CK | VD | ACC
0 | 128.2 | 0.036 | 0.133 | 0.866 | 0.0341 | 92.11
0.15 | 124.4 | 0.034 | 0.266 | 0.733 | 0.0452 | 92.10
0.3 | 125.0 | 0.114 | 0.266 | 0.733 | 0.0551 | 92.98
0.65 | 128.1 | 0.005 | 0.6 | 0.6 | 0.4696 | 93.86

Table 3.2 Effect of the sparseness constraint on the privacy parameters and accuracy

From the results in Table 3.2, it is clear that $S_h = 0.65$ not only improves the values of the privacy parameters, but also improves the classification accuracy.

Figure 3.2 Effect of the sparseness constraint on the privacy parameters (Wisconsin Breast Cancer dataset)
Table 3.3 shows the effect of the truncation threshold ε on the privacy parameters and accuracy for $S_h = 0.65$. From the table, it is clear that there is a trade-off between the privacy parameters and the accuracy.

ε | RP | RK | CP | CK | VD | ACC
0.001 | 128.62 | 0.0058 | 0.6 | 0.6 | 0.46997 | 93.86
0.005 | 130.31 | 0.0057 | 0.6 | 0.6 | 0.47249 | 93.86
0.01 | 133 | 0.0055 | 0.6 | 0.6 | 0.48265 | 93.86
0.02 | 141.21 | 0.005 | 0.6 | 0.6 | 0.50483 | 44.74

Table 3.3 The effect of the threshold ε on the privacy parameters and accuracy

3.4 Dealing with negative values Throughout the rest of this section, we show how to use the above technique to perform data perturbation for datasets with both positive and negative values. Two approaches were used to deal with this situation. In the first approach, we take the absolute value of all the attributes, perform the data perturbation using NMF as described above, and then restore the signs of the attributes from the original dataset. In the second approach, we bias the data with some constant so that all the attributes become positive. After
performing the data perturbation, the value of this constant is subtracted from the perturbed data. To test the above two approaches, we used the Ionosphere database (351 observations and 35 attributes in the range of -1 to +1). The first 200 instances were used as training data and the other 151 were used as test data. We set K=13 for the KNN classifier. The corresponding classification accuracy on the original dataset is 93.38%. When using the first approach, the best classification result (93.37%) was obtained on the NMF data with reduced rank M = 16. Table 3.4 shows the corresponding privacy parameters.

M | CK | CP | RK | RP | VD | Acc
16 | 0.67 | 0.41 | 0.017 | 35.63 | 0.35 | 0.9337

Table 3.4 Privacy parameters for the Ionosphere dataset using the absolute value approach with M=16

Figure 3.3 shows the effect of the sparseness constraint on the privacy parameters for the Ionosphere dataset.
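The two approaches can be sketched as follows, reusing the illustrative nmf_multiplicative helper from Chapter 2. The choice of the biasing constant in the second approach (the magnitude of the most negative entry) is our own assumption; the text leaves the constant unspecified.

```python
import numpy as np

def perturb_abs_sign(V, M):
    """First approach: factor |V| with NMF, then restore the original signs."""
    W, H = nmf_multiplicative(np.abs(V), M)
    return np.sign(V) * (W @ H)

def perturb_bias(V, M):
    """Second approach: bias V to be non-negative, perturb, subtract the bias."""
    c = max(0.0, -float(V.min()))   # assumed choice of biasing constant
    W, H = nmf_multiplicative(V + c, M)
    return (W @ H) - c
```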
Figure 3.3 Effect of the sparseness constraint on the privacy parameters (Ionosphere dataset using the first approach)

When varying the sparseness constraint from 0 to 1, the best trade-off between the accuracy and the privacy parameters was obtained for $S_h = 0.08$. Table 3.5 shows the corresponding accuracy and privacy parameters.

S_h | CK | CP | RK | RP | VD | Acc
0.08 | 0.47 | 0.941 | 0.064 | 23.1 | 0.311 | 0.9337

Table 3.5 Privacy parameters for the Ionosphere dataset using the absolute value approach with M=16 and $S_h = 0.08$
Table 3.6 shows the effect of the truncation threshold ε on the accuracy and privacy parameters.

ε | CK | CP | RK | RP | VD | Acc
0.01 | 0.441 | 0.94 | 0.065 | 23.1 | 0.310 | 0.9338
0.027 | 0.470 | 1.00 | 0.062 | 23.71 | 0.305 | 0.9007
0.037 | 0.411 | 1.00 | 0.056 | 29.18 | 0.304 | 0.8543
0.05 | 0.205 | 1.82 | 0.049 | 35.59 | 0.421 | 0.7748
0.08 | 0.117 | 10.11 | 0.039 | 77.38 | 0.930 | 0.8741
0.09 | 0.0588 | 12.00 | 0.036 | 91.18 | 0.960 | 0.8344

Table 3.6 Privacy parameters for the Ionosphere dataset using the absolute value approach with truncation

Figure 3.4 shows the effect of the truncation threshold on the privacy parameters (Ionosphere dataset using the first approach).
Figure 3.4 Effect of the truncation threshold on the privacy parameters (Ionosphere dataset using the first approach)

Table 3.7, Figure 3.5 and Figure 3.6 show the corresponding results when we used the second approach to deal with the negative data values. In this case, the optimum trade-off between the privacy parameters and the classification accuracy was obtained for $S_h = 0.5$.
M | CK | CP | RK | RP | VD | Acc
16 | 0.67 | 0.41 | 0.017 | 35.63 | 0.35 | 0.9337

S_h | CK | CP | RK | RP | VD | Acc
0.5 | 0.647 | 0.470 | 0.012 | 38.85 | 0.376 | 0.9337

ε | CK | CP | RK | RP | VD | Acc
0.017 | 0.147 | 4.470 | 0.037 | 78.55 | 1.41 | 0.9470
0.022 | 0.147 | 4.588 | 0.037 | 78.494 | 1.42 | 0.9536
0.026 | 0.117 | 4.411 | 0.038 | 78.449 | 1.43 | 0.9602
0.031 | 0.147 | 4.411 | 0.038 | 78.583 | 1.48 | 0.9139
0.036 | 0.176 | 5.117 | 0.038 | 78.04 | 1.53 | 0.8675
0.04 | 0.058 | 5.470 | 0.038 | 77.426 | 1.56 | 0.8410

Table 3.7 Trade-off between the privacy parameters and accuracy for the Ionosphere dataset using the biasing approach
Figure 3.5 Effect of the sparseness constraint on the privacy parameters (Ionosphere dataset using the second approach)

Figure 3.6 Effect of the truncation threshold on the privacy parameters (Ionosphere dataset using the second approach)
3.5 Conclusions Non-negative matrix factorization with sparseness constraints provides an effective data perturbation tool for privacy preserving data mining. It should be noted that more work needs to be done in order to systematically determine the range of NMF parameters that optimize both the utility function and privacy parameters.
Chapter 4 Data Distortion using the Discrete Wavelet Transform 4.1 Introduction The primary focus of this chapter is to explore the use of the discrete wavelet transform (DWT) with truncation for data perturbation. Our preliminary experimental results show that the proposed method is effective in concealing the sensitive information while preserving the performance of data mining techniques after the data distortion. The proposed data distortion approach (see Figure 4.1) can be summarized as follows, with a small sketch of steps 1-3 given after the list: 1. Perform the 2D Haar wavelet transform on the original dataset. 2. Truncate the detail coefficients that are below a pre-specified threshold value. 3. Perform the inverse transform to obtain the perturbed dataset. 4. Iterate the above steps until satisfactory privacy parameters and utility measures are obtained.
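One possible realization of steps 1-3, sketched with the PyWavelets package (pywt); the function and parameter names are our own choices, not the thesis code. Zeroing an entire detail matrix, as in the experiments below, corresponds to an infinite threshold for that coefficient set.

```python
import numpy as np
import pywt

def dwt_perturb(V, threshold, parts=("cH", "cV", "cD")):
    """Single-level Haar perturbation: truncate small detail coefficients."""
    cA, (cH, cV, cD) = pywt.dwt2(V, "haar")        # step 1: 2D Haar DWT
    details = {"cH": cH, "cV": cV, "cD": cD}
    for name in parts:                              # step 2: truncation
        d = details[name]
        d[np.abs(d) < threshold] = 0.0
    coeffs = (cA, (details["cH"], details["cV"], details["cD"]))
    return pywt.idwt2(coeffs, "haar")               # step 3: inverse transform

# e.g., zeroing cV entirely: V_dist = dwt_perturb(V, np.inf, parts=("cV",))
```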
Figure 4.1 Proposed data distortion technique (Original dataset → Haar DWT → detail coefficient truncation → Haar IDWT → perturbed dataset)

4.2 Experimental results In order to test the performance of our proposed method, we conducted a series of experiments on some real world datasets. In this section, we present a sample of the results obtained when applying our technique to the original Wisconsin Breast Cancer and Ionosphere databases downloaded from the UCI Machine Learning Repository [64]. For the Breast Cancer database, we used 569 observations and 30 attributes to perform our experiment. For the classification task, 80% of the data was used for training and the other 20% was used for testing. Throughout our experiments, we set K=30 for the KNN classifier. The corresponding classification accuracy on the original dataset is 92.11%.
Table 4.1 shows how the accuracy and privacy parameters vary when the truncation process is applied to one of the detail coefficients (cV, cH or cD) of the Haar wavelet transform and the dataset is then reconstructed by applying the inverse transform using these truncated coefficients.

Truncated coefficients | RP | RK | CP | CK | VD | ACC%
cV | 60.4 | 0.013 | 1.73 | 0.2 | 0.60 | 96.49
cH | 91.9 | 0.009 | 0 | 1 | 0.16 | 98.25
cD | 60.6 | 0.014 | 0 | 1 | 0.14 | 98.25

Table 4.1 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Wisconsin Breast Cancer dataset)
Figure 4.2 shows the effect of truncating the detail coefficients on the privacy and accuracy parameters.

Figure 4.2 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Wisconsin Breast Cancer dataset)

It should be noted that the classification accuracy in all three cases is higher than the corresponding accuracy on the original dataset. On the other hand, for this particular dataset, truncating cH or cD resulted in CP=0, i.e., the rank of the average value of each attribute did not change in the distorted dataset, which is a somewhat undesirable feature. Table 4.2 shows the corresponding results when a pair of the detail coefficients (i.e., cH and cV; cH and cD; or cV and cD) in the Haar wavelet transform are truncated to zero
and the dataset is then reconstructed by applying the inverse transform to these truncated coefficients. Again, a high level of classification accuracy is maintained in all three cases. On the other hand, it is clear that truncating both cH and cV (denoted as cHV in the table) results in better privacy parameters. This conclusion should be interpreted with care, since the above results may change when the same technique is applied to a different dataset. Figure 4.3 shows the effect of truncating pairs of detail coefficients on the privacy and accuracy parameters (Breast Cancer dataset).

Figure 4.3 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Breast Cancer dataset)
Truncated coefficients | RP | RK | CP | CK | VD | ACC
cHV | 77.51 | 0.017 | 1.8 | 0.26 | 0.63 | 98.25%
cHD | 74.7 | 0.007 | 0 | 1 | 0.23 | 98.25%
cVD | 50.22 | 0.06 | 1.7 | 0.26 | 0.63 | 92.11%

Table 4.2 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Breast Cancer dataset)

Tables 4.3 and 4.4 and Figures 4.4 and 4.5 show the corresponding results when we ran our algorithm on the Ionosphere database from the UCI repository [64]. This dataset has 351 instances and 35 attributes, including the class attribute. We used 200 instances as training data and the other 151 as test data. We also set K=13 (for which the classification accuracy on the original dataset was 93.38%). The results obtained clearly show some trade-off between the accuracy and the privacy parameters.
Detail coefficient | RP | RK | CP | CK | VD | ACC%
cV | 50.3 | 0.019 | 8.76 | 0.03 | 0.56 | 90.07
cH | 33.8 | 0.016 | 0 | 1 | 0.33 | 92.72
cD | 34.7 | 0.014 | 0 | 1 | 0.35 | 91.39

Table 4.3 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Ionosphere dataset)

Coefficients | RP | RK | CP | CK | VD | ACC
cHV | 61.33 | 0.039 | 8.9 | 0.05 | 0.64 | 87.42%
cHD | 47.26 | 0.039 | 0 | 1 | 0.242 | 93.38%
cVD | 61.5 | 0.03 | 9.1 | 0.03 | 0.65 | 91.39%

Table 4.4 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Ionosphere dataset)
Figure 4.4 Influence of truncating the detail coefficients on the privacy and accuracy parameters (Ionosphere dataset)

Figure 4.5 Influence of truncating pairs of detail coefficients on the privacy and accuracy parameters (Ionosphere dataset)
4.3 Conclusion In this chapter, we have presented a new algorithm for privacy preserving data mining based on the DWT with truncated coefficients. Based on the presented experimental results, the proposed method is effective in concealing the sensitive information while preserving the performance of data mining techniques after the data distortion.
Chapter 5 Bayesian Estimation of the Original Data 5.1 Introduction The main objective of data perturbation techniques is to distort the individual data values while preserving the properties of the underlying statistical distribution. In this chapter, we use the Bayesian estimation algorithm that was explained in Chapter 2 to estimate the original dataset from its perturbed version. The mean square error between the original dataset and the estimated dataset can serve as an added measure of the effectiveness of the data perturbation technique in hiding the original data. 5.2 Experimental results In this section, we present the results of our experiments when applying the Bayesian estimation algorithm to datasets that were artificially generated in such a way that the values of their attributes follow the triangular distribution and the uniform
distributions shown in Figure 5.1 and Figure 5.2, respectively. In these figures (as well as the rest of the figures in this chapter), the x-axis shows the signal value and the y-axis shows its corresponding frequency.

Figure 5.1 Triangular signal distribution

Figure 5.2 Uniform signal distribution
Figure 5.3 Distribution of Gaussian noise in the Triangular signal

Figure 5.4 Distribution of the equivalent noise in the NMF-perturbed Triangular signal

Figure 5.5 Distribution of the equivalent noise in the DWT-perturbed Triangular signal
Figure 5.6 Distribution of Gaussian noise in the Uniform signal

Figure 5.7 Distribution of the equivalent noise in the NMF-perturbed Uniform signal

Figure 5.8 Distribution of the equivalent noise in the DWT-perturbed Uniform signal
The data perturbation was performed in three ways: (i) by adding zero mean Gaussian noise, (ii) by using the NMF approach described in Chapter 3 (reduced rank M=35 and sparseness constraint $S_h = 0.9$), and (iii) by using the DWT approach described in Chapter 4 (horizontal detail coefficient matrix cH=0 and diagonal detail coefficient matrix cD=0). When applying the above data perturbation methods, the parameters of the data perturbation algorithms were adjusted in such a way that the added noise variance is the same in all three methods. Let $O$ denote the original $N \times R$ data matrix and $D$ denote the corresponding $N \times R$ distorted data matrix. Then the mean square error between the original signal and the perturbed one is given by
$$MSE_{in} = \frac{1}{N \times R} \sum_{i,j} \bigl( O(i,j) - D(i,j) \bigr)^2.$$
Similarly, let $E$ denote the signal estimated from the distorted signal. Then the mean square error between the original data and the estimated one is given by
$$MSE_{out} = \frac{1}{N \times R} \sum_{i,j} \bigl( O(i,j) - E(i,j) \bigr)^2.$$
Table 5.1 shows the privacy parameters, $MSE_{in}$ and $MSE_{out}$ obtained for the triangular dataset.
Distortion Method | VD | RP | RK | CP | CK | MSE_in | MSE_out
Gaussian Noise | 0.699 | 26.81 | 0.005 | 27.94 | 0.005 | 834.6 | 715.4
NMF | 0.55 | 25.73 | 0.55 | 24.26 | 0.015 | 832.37 | 982.45
DWT | 0.70 | 35.5 | 0.012 | 24.26 | 30.45 | 832.53 | 949.53

Table 5.1 Estimation results for the Triangular signal distribution

Table 5.2 shows the corresponding results when the original dataset has a uniform distribution.

Distortion Method | VD | RP | RK | CP | CK | MSE_in | MSE_out
Gaussian Noise | 0.49 | 20.3 | 0.018 | 22.22 | 0.015 | 826.2 | 606.02
NMF | 0.5 | 26.27 | 0.014 | 23.57 | 0.025 | 827.2 | 946.4
DWT | 0.5 | 23.1 | 0.015 | 36.25 | 0.025 | 826.98 | 1013

Table 5.2 Estimation results for the Uniform signal distribution
5.3 Conclusions In this chapter, we applied Bayesian estimation to estimate the original dataset from the perturbed one. Using the mean square error between the estimated dataset and the original dataset as a measure of the effectiveness of the data perturbation technique, our experimental results show that perturbing the data using the NMF and DWT approaches described earlier outperforms the traditional method of data perturbation by AWGN.
Chapter 6 Conclusions and Future Work 6.1 Conclusions Privacy issues have created new challenges in the area of data mining technology. These technical issues should not simply be addressed by restricting data collection or even by restricting the secondary use of information technology. In this thesis, we addressed two data perturbation approaches that can be used to conceal sensitive information while preserving the general patterns and trends of the original database. Using a set of privacy parameters previously proposed in the literature, our experimental results show that, with the proper choice of the algorithm parameters, both the sparsified NMF and the DWT with truncation techniques are effective data perturbation tools that preserve privacy as well as maintain the utility measure. The mean square error of the Bayesian estimation technique, which we used to estimate the original signal from the distorted one, also adds to the list of privacy parameters that can be used to test the effectiveness of any newly proposed data perturbation tool. To sum up, we believe that different tools from well established fields such as signal processing and communications theory can be used to open a new era in
privacy preserving data mining and hence foster a flourishing trend in privacy and security research. 6.2 Future Work Throughout our analysis, we used a simple KNN classifier. We believe that better accuracy can be obtained by using better classifiers such as support vector machines (SVMs). The use of a more accurate classifier may allow a better trade-off between the accuracy and the privacy parameters. Proposing a good set of privacy parameters for privacy preserving data mining is still an area in its infancy. The privacy concept is too complex to capture with the current set of parameters used throughout this thesis. In fact, one drawback of the parameters used in this thesis is that they do not reflect the effort required to break the privacy preserving algorithm. There is almost no literature related to this topic. Although our work described one additional privacy parameter (the MSE of the estimation algorithm), we believe that more work needs to be done in this area, especially on how to quantitatively relate these parameters to the actual work required to break these data perturbation techniques. Other promising data perturbation techniques, such as adding data dependent noise, need to be explored. Using a combination of rule hiding and randomization also presents an interesting direction of research. These techniques can play a vital role in privacy preserving data mining. Sanitization is a challenging problem, and it is sometimes restrictive. One can combine sanitization and randomization under the same framework to reduce the side effects of the sanitization process. On the other hand, randomization does
not remove items from a dataset, but it does, in general, introduce false drops into the data, i.e., some patterns that are not supposed to exist in the original database.
References
[1] M. Chen, J. Han, and P. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. Knowledge and Data Engineering, 8, 1996.
[2] Doug Struck. Don't store my data, Japanese tell government. International Herald Tribune, page 1, August 24-25, 2002.
[3] M. Feingold, M. Corzine, M. Wyden, and M. Nelson, "Data-Mining Moratorium Act of 2003." U.S. Senate Bill (proposed), January 16, 2003.
[4] Krishnamurty Muralidhar, "Security of Random Data Perturbation Methods," ACM Transactions on Database Systems, Vol. 24, No. 4, December 1999.
[5] L. Cranor, M. Langheinrich, M. Marchiori, M. Presler-Marshall, and J. Reagle. "The Platform for Privacy Preferences 1.0 specification," in W3C Recommendation, April 2002.
[6] Yuichi Koike. "References for P3P implementations," available through http://www.w3.org/P3P/implementations, 2004.
[7] Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu. "Implementing P3P using database technology." In 19th International Conference on Data Engineering, Bangalore, India, March 2003.
[8] Rakesh Agrawal, Jerry Kiernan, Ramakrishnan Srikant, and Yirong Xu. "Hippocratic databases," In 28th International Conference on Very Large Data Bases, Hong Kong, China, August 2002.
[9] A. C. Yao. "How to generate and exchange secrets." In Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162–167, 1986.
[10] O. Goldreich, S. Micali, and A. Wigderson. "How to play any mental game." In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pages 218–229, 1987.
[11] Patrik O. Hoyer. "Non-negative Matrix Factorization with Sparseness Constraints," Journal of Machine Learning Research 5 (2004) 1457–1469.
[12] Oded Goldreich. "Secure Multi-Party Computation (Working Draft)." Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel, June 1998.
[13] G. Brassard, C. Crepeau and J. Robert. "All-or-nothing disclosure of secrets." In Advances in Cryptology - Crypto86, LNCS, volume 263, Springer-Verlag, 1987, pp. 234–238.
[14] K. Sako and M. Hirt, "Efficient receipt-free voting based on homomorphic encryption," in Proceedings of Advances in Cryptology (EUROCRYPT 2000), Bruges, Belgium, May 2000, pp. 539–556.
[15] J. C. Benaloh and M. De Mare. "One-way accumulators: A decentralized alternative to digital signatures." Advances in Cryptology – EUROCRYPT'93. Workshop on the Theory and Application of Cryptographic Techniques. Lecture Notes in Computer Science, 765:274–285, May 1993.
[16] O. Goldreich, "Secure Multi-Party Computation (Working Draft)," Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel, June 1998.
[17] B. Pinkas, "Cryptographic techniques for privacy preserving data mining," SIGKDD Explorations, vol. 4, no. 2, pp. 12–19, 2002. [Online]. Available: http://portal.acm.org/citation.cfm?id=772865
[18] W. Du and M. J. Atallah, "Secure multi-party computation problems and their applications: A review and open problems," in Proceedings of the 2001 Workshop on New Security Paradigms. Cloudcroft, NM: ACM Press, September 2001, pp. 13–22.
[19] M. J. Atallah, W. Du and F. Kerschbaum, "Protocols for secure remote database access with approximate matching," in 7th ACM Conference on Computer and Communications Security (ACM CCS 2000), The First Workshop on Security and Privacy in E-Commerce, Athens, Greece, November 2000.
[20] M. J. Atallah and W. Du, "Secure multi-party computational geometry," in WADS 2001: Seventh International Workshop on Algorithms and Data Structures, Providence, Rhode Island, August 2001, pp. 165–179.
[21] W. Du, Y. S. Han, and S. Chen, "Privacy-preserving multivariate statistical analysis: Linear regression and classification," in Proceedings of the 2004 SIAM International Conference on Data Mining (SDM04), Lake Buena Vista, FL, April 2004. [Online]. Available: http://www.cis.syr.edu/wedu/Research/paper/sdm2004 privacy.pdf
[22] Pentti Paatero and Unto Tapper. "Positive matrix factorization: A non-negative factor model with optimal utilization of error." Environmetrics, 5:111–126, 1994.
[23] W. Du and M. J. Atallah, "Privacy preserving cooperative scientific computations," in Proceedings of the 14th IEEE Computer Security Foundations Workshop, Nova Scotia, Canada, June 2001, pp. 273–282.
[24] W. Du and M. J. Atallah. "Secure multi-party computation problems and their applications: A review and open problems." In New Security Paradigms Workshop, pages 11–20, Cloudcroft, New Mexico, USA, September 11-13, 2001.
[25] Y. Lindell and B. Pinkas, "Privacy preserving data mining," in Advances in Cryptology (CRYPTO'00), ser. Lecture Notes in Computer Science, vol. 1880. Springer-Verlag, 2000, pp. 36–53.
[26] M. Kantarcioglu and J. Vaidya, "Privacy preserving naive Bayes classifier for horizontally partitioned data," in IEEE ICDM Workshop on Privacy Preserving Data Mining, Melbourne, FL, November 2003, pp. 3–9.
[27] R. Wright and Z. Yang, "Privacy-preserving Bayesian network structure computation on distributed heterogeneous data," in Proceedings of the Tenth ACM SIGKDD Conference (SIGKDD'04), Seattle, WA, August 2004. [Online]. Available: http://www.cs.stevens.edu/rwright/Publications/
[28] J. Vaidya and C. Clifton, "Privacy-preserving k-means clustering over vertically partitioned data," in The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., August 2003.
[29] Tore Dalenius and Steven P. Reiss. "Data-swapping: A technique for disclosure control." Journal of Statistical Planning and Inference, 6:73–85, 1982.
[30] Stephen E. Fienberg and Julie McIntyre. "Data swapping: Variations on a theme by Dalenius and Reiss." Technical report, National Institute of Statistical Sciences, Research Triangle Park, NC, 2003.
[31] Nabil R. Adam and John C. Worthmann. "Security-control methods for statistical databases: a comparative study." ACM Comput. Surv., 21(4):515–556, 1989.
[32] N. S. Matloff. "Inference control via query restriction vs. data modification: a perspective." In Database Security: Status and Prospects, pages 159–166. North-Holland Publishing Co., 1988.
[33] J. Domingo-Ferrer and V. Torra, "Disclosure risk assessment in statistical microdata protection via advanced record linkage," Statistics and Computing, vol. 13, no. 4, pp. 343–354, 2003.
[34] H. Kargupta, B. Park, D. Hershberger, and E. Johnson, "Collective data mining: A new perspective towards distributed data mining," in H. Kargupta and P. Chan, editors, Advances in Distributed and Parallel Knowledge Discovery, MIT/AAAI Press, 2000, pp. 133–184.
[35] B. Park, H. Kargupta, E. Johnson, E. Sanseverino, D. Hershberger, and L. Silvestre, "Distributed, collaborative data analysis from heterogeneous sites using a scalable evolutionary technique," Journal of Applied Intelligence, vol. 16, no. 1, pp. 19–42, 2002.
[36] H. Kargupta, H. Park, S. Pittie, L. Liu, D. Kushraj, and K. Sarkar, "MobiMine: Monitoring the stock market from a PDA," ACM SIGKDD Explorations, vol. 3, pp. 37–47, 2001.
[37] B. Park, R. Ayyagari, and H. Kargupta, "A Fourier analysis-based approach to learn classifiers from distributed heterogeneous data," in Proceedings of the First SIAM International Conference on Data Mining, Chicago, US, 2001.
[38] E. Johnson and H. Kargupta, "Collective, hierarchical clustering from distributed, heterogeneous data," in M. Zaki and C. Ho, editors, Large-Scale Parallel KDD Systems, Lecture Notes in Computer Science, vol. 1759, Springer-Verlag, 1999, pp. 221–244.
[39] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[40] W. Du and M. J. Atallah, "Secure multi-party computation problems and their applications: A review and open problems," in New Security Paradigms Workshop, 2001, pp. 11–20.
[41] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," in SIGMOD Workshop on DMKD, Madison, WI, June 2002.
[42] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada, July 2002.
[43] M. J. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim, and V. S. Verykios, "Disclosure limitation of sensitive rules," in Proceedings of the IEEE Knowledge and Data Engineering Workshop, 1999, pp. 45–52.
[44] S. Oliveira and O. R. Zaiane, "Privacy preserving frequent itemset mining," in Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, Maebashi City, Japan, Australian Computer Society, Inc., 2002, pp. 43–54. [Online]. Available: http://portal.acm.org/citation.cfm?id=850789
[45] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni, "Association rule hiding," IEEE Transactions on Knowledge and Data Engineering, 2003.
[46] Y. Saygin, V. S. Verykios, and C. Clifton, "Using unknowns to prevent discovery of association rules," SIGMOD Record, vol. 30, no. 4, pp. 45–54, December 2001. [Online]. Available: http://portal.acm.org/citation.cfm?id=604271
[47] C. K. Liew, U. J. Choi, and C. J. Liew, "A data distortion by probability distribution," ACM Transactions on Database Systems (TODS), vol. 10, no. 3, pp. 395–411, 1985. [Online]. Available: http://portal.acm.org/citation.cfm?id=4017
[48] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., John Wiley and Sons, 2001.
[49] S. Warner, "Randomized response: A survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, pp. 63–69, 1965.
[50] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the ACM SIGMOD Conference on Management of Data, Dallas, Texas, ACM Press, May 2000, pp. 439–450.
[51] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Santa Barbara, CA, 2001, pp. 247–255. [Online]. Available: http://portal.acm.org/citation.cfm?id=375602
[52] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy preserving mining of association rules," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'02), July 2002.
[53] A. Evfimievski, J. Gehrke, and R. Srikant, "Limiting privacy breaches in privacy preserving data mining," in Proceedings of the ACM SIGMOD/PODS Conference, San Diego, CA, June 2003.
[54] C. K. Liew, U. J. Choi, and C. J. Liew, "A data distortion by probability distribution," ACM Transactions on Database Systems, vol. 10, no. 3, pp. 395–411, 1985.
[55] J. J. Kim and W. E. Winkler, "Multiplicative noise for masking continuous data," Statistical Research Division, U.S. Bureau of the Census, Washington, D.C., Tech. Rep. Statistics #2003-01, April 2003.
[56] K. Muralidhar, D. Batra, and P. J. Kirs, "Accessibility, security, and accuracy in statistical databases: The case for the multiplicative fixed data perturbation approach," Management Science, vol. 41, no. 9, pp. 1549–1584, 1995.
[57] K. Liu, H. Kargupta, and J. Ryan, "Random projection and privacy preserving correlation computation from distributed data," Technical Report TR-CS-03-24, Computer Science and Electrical Engineering Department, University of Maryland Baltimore County, 2003.
[58] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, MIT Press, 2001, pp. 556–562.
[59] C.-J. Lin, "Projected gradient methods for non-negative matrix factorization," Technical report, Department of Computer Science, National Taiwan University, 2005. [Online]. Available: www.csie.ntu.edu.tw/~cjlin
[60] A. Graps, "An introduction to wavelets," IEEE Computational Science and Engineering, vol. 2, no. 2, Summer 1995, IEEE Computer Society.
[61] T. Li, Q. Li, S. Zhu, and M. Ogihara, "A survey on wavelet applications in data mining," SIGKDD Explorations, vol. 4, no. 2, pp. 49–68, 2002.
[62] S. Xu, J. Zhang, D. Han, and J. Wang, "Data distortion for privacy protection in a terrorist analysis system," in P. Kantor et al., editors, ISI 2005, Lecture Notes in Computer Science, vol. 3495, 2005, pp. 459–464.
[63] J. F. Pang, D. B. Bo, and S. Bai, "Research and implementation of text categorization system based on VSM," in International Conference on Multilingual Information Processing, 2000, pp. 31–36.
[64] UCI Machine Learning Repository. [Online]. Available: http://www.ics.uci.edu/mlearn/mlsummary.html
[65] S. M. A. Kabir, A. M. Youssef, and A. K. Elhakeem, "On data distortion for privacy preserving data mining," in Proceedings of the 20th Canadian Conference on Electrical and Computer Engineering (CCECE 2007), Vancouver, BC, April 24–26, 2007.
[66] S. M. A. Kabir, A. M. Youssef, and A. K. Elhakeem, "Data distortion by non-negative matrix factorization for preserving privacy," in Proceedings of the 3rd International Conference on Computational Intelligence (CI 2007), Banff, Alberta, Canada, July 2–4, 2007.