CCECE 2007- CCGEI 2007, Vancouver, April 2007
0-7803-8253-6/04/$17.00 ©2007 IEEE
ON DATA DISTORTION FOR PRIVACY PRESERVING DATA MINING
Saif M. A. Kabir¹, Amr M. Youssef² and Ahmed K. Elhakeem¹
¹ Department of Electrical and Computer Engineering
² Concordia Institute for Information System Engineering
Concordia University, Montreal, Quebec, Canada
{sm_asif,youssef,ahmed}@ece.concordia.ca
Abstract
Because of the increasing ability to trace and collect large amounts of personal information, privacy preservation in data mining applications has become an important concern. Data perturbation is one of the well-known techniques for privacy preserving data mining. The objective of these data perturbation techniques is to distort the individual data values while preserving the underlying statistical distribution properties. These techniques are usually assessed in terms of both their privacy parameters and their associated utility measures. While the privacy parameters quantify the ability of these techniques to hide the original data values, the data utility measures assess whether the dataset preserves the performance of data mining techniques after the data distortion. In this paper, we investigate the use of truncated non-negative matrix factorization (NMF) with sparseness constraints for data perturbation.
Keywords: Privacy preserving data mining, Non-negative
matrix factorization.
1. INTRODUCTION
Data mining [1] is the process of searching for patterns in
large volumes of data using tools such as classification and
association rule mining. Several data mining applications deal with privacy-sensitive data such as financial transactions and health care records. Because of the increasing ability to trace and collect large amounts of personal data, privacy preservation in data mining applications has become an important concern.
Data can be either collected in a centralized location or
collected and stored at distributed or scattered locations.
Different privacy concerns arise depending on the collection procedure. For example, with centralized storage of data, the major privacy issue is to protect the exact attribute values from the data analysts. In contrast, in a distributed database situation, the primary goal is to maintain the independence of the distributed data ownership, which is related to the issue of data mining in a distributed environment. Among the techniques used for privacy preserving data mining are query restriction, secure multi-party computation, data swapping, distributed data mining, and data perturbation. In this work, we focus on the latter approach, i.e., data perturbation. The objective of data perturbation is to distort the individual data values while preserving the underlying statistical distribution properties. These data perturbation techniques are usually assessed in terms of both their privacy parameters and their associated utility measures. While the privacy parameters quantify the ability of these techniques to hide the original data values, the data utility measures assess whether the dataset preserves the performance of data mining techniques after the data distortion.
The primary focus of this work is to explore a new data
perturbation approach for privacy preserving data mining. In
particular, we investigate the use of truncated non-negative
matrix factorization (NMF) with sparseness constraints for data
perturbation. Our preliminary experimental results show that the proposed method is effective in concealing the sensitive information while preserving the performance of data mining techniques after the data distortion.
The rest of the paper is organized as follows. In section 2, we
briefly review the non-negative matrix factorization technique.
The data distortion and the utility measures used in this work
are reviewed in section 3 and section 4 respectively. The
experimental results on some real world datasets are presented
in section 5. Finally, conclusions and future work are given in section 6.
2. NON-NEGATIVE MATRIX FACTORIZATION
Non-negative matrix factorization (NMF) [9] refers to a class of algorithms that can be formulated as follows: given a non-negative n × r data matrix V, NMF finds an approximate factorization V ≈ WH, where W and H are both non-negative matrices of size n × m and m × r, respectively. The reduced rank m of the factorization is generally chosen so that (n + r)m < nr, and hence the product WH can be regarded as a compressed form of the data matrix V. The optimal choices of W and H are defined to be those non-negative matrices that minimize the reconstruction error between V and WH. Various error functions have been proposed; the most widely used is the squared error (Euclidean distance) function
$$E(W, H) = \sum_{i,j} \left( V_{ij} - (WH)_{ij} \right)^2 .$$
Unlike other matrix factorization methods (such as principal component analysis and independent component analysis), non-negative matrix factorization requires all entries of both matrices to be non-negative, i.e., the data is described using additive components only. In section 5 we show how to deal with datasets with both positive and negative attributes.
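As a concrete illustration, the following is a minimal sketch of the classical multiplicative-update rules of Lee and Seung [9] for the squared-error objective above; the function name, iteration count, and random initialization are our own illustrative choices, not part of the paper.

    import numpy as np

    def nmf(V, m, n_iter=500, eps=1e-9, seed=0):
        """Approximate a non-negative V (n x r) as W @ H, W (n x m), H (m x r)."""
        rng = np.random.default_rng(seed)
        n, r = V.shape
        W = rng.random((n, m)) + eps
        H = rng.random((m, r)) + eps
        for _ in range(n_iter):
            # Multiplicative updates monotonically decrease the squared error.
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H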
NMF with Sparseness Constraint
Several measures for sparseness have been proposed. In this
work, the sparseness of a vector X of dimension n is given by [8]:
$$S(X) = \frac{\sqrt{n} - \left( \sum_i |x_i| \right) \Big/ \sqrt{\sum_i x_i^2}}{\sqrt{n} - 1} .$$
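This measure ranges from 0 (all components equal) to 1 (a single non-zero component). A direct NumPy transcription, with our own function name:

    import numpy as np

    def sparseness(x):
        """Hoyer's sparseness [8]: 1 for one non-zero entry, 0 for all-equal entries."""
        x = np.asarray(x, dtype=float)
        n = x.size
        l1 = np.abs(x).sum()
        l2 = np.sqrt((x ** 2).sum())
        return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)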
Most NMF algorithms naturally produce a somewhat sparse representation of the data, i.e., one that encodes much of the data using few active components. However, the sparseness given by these techniques is a side effect rather than a controlled parameter: one cannot directly control the degree to which the representation is sparse.
Our aim is to constrain NMF to find solutions with desired
degrees of sparseness. The sparseness constraint can be
imposed on either W or H, or on both. For example, a doctor analyzing a dataset that describes disease patterns might assume that most diseases are rare (hence sparse) but that each disease can cause a large number of symptoms. Assuming that symptoms make up the rows of her matrix and the columns denote different individuals, in this case it is the coefficients which should be sparse and the basis vectors unconstrained. Throughout our work, we used the projected gradient descent algorithm for NMF with sparseness constraints proposed in [8], applying the sparseness constraint only to the H matrix.
Truncation on NMF with Sparseness Constraint
In order to control the degree of achievable data distortion, the
elements in the sparsified H matrix with values less than a
specified truncation threshold ε are truncated to zero.
Thus the overall data distortion can be summarized as follows: (i) perform NMF with sparseness constraint S_h imposed on H to obtain H_{S_h}; (ii) truncate the elements of H_{S_h} that are less than ε to obtain H_{S_h,ε}. The perturbed dataset is given by W H_{S_h,ε}.
Thus the new dataset is distorted twice by our proposed algorithm, which has three parameters: the reduced rank m, the sparseness parameter S_h, and the truncation threshold ε.
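A sketch of this three-parameter procedure is shown below. For brevity it uses scikit-learn's plain NMF for step (i); a faithful implementation would substitute the sparseness-constrained NMF of [8] with constraint S_h on H, which scikit-learn does not provide. The function name and defaults are illustrative.

    import numpy as np
    from sklearn.decomposition import NMF

    def perturb(V, m=2, eps=0.01, seed=0):
        """Distort a non-negative dataset V by truncated low-rank NMF.

        Step (i) here is plain NMF; the paper instead uses NMF with a
        sparseness constraint S_h imposed on H [8].
        """
        model = NMF(n_components=m, init='random', max_iter=500, random_state=seed)
        W = model.fit_transform(V)   # step (i): V is approximated by W @ H
        H = model.components_
        H[H < eps] = 0.0             # step (ii): truncate small entries of H
        return W @ H                 # perturbed dataset W H_eps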
3. DATA DISTORTION MEASURES
Throughout this work, we adopt the same set of privacy
parameters proposed in [6]. The value difference (VD)
parameter is used as a measure for value difference after the
data distortion algorithm is applied to the original data
matrix. Let V and V̄ denote the original and distorted data matrices, respectively. Then, VD is given by
$$VD = \| V - \bar{V} \| \, / \, \| V \| ,$$
where || · || denotes the Frobenius norm of the enclosed argument.
After a data distortion, the order of the values of the data elements also changes. Several metrics are used to measure the position difference of the data elements. For a dataset V with n data objects and m attributes, let $Rank_j^i$ denote the rank (in ascending order) of the j-th element in attribute i. Similarly, let $\overline{Rank}_j^i$ denote the rank of the corresponding distorted element. The RP parameter is used to measure the position difference. It indicates the average change of rank for all attributes after distortion and is given by
$$RP = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} \left| Rank_j^i - \overline{Rank}_j^i \right| .$$
RK represents the percentage of elements that keep their rank in each column after distortion and is given by
$$RK = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} Rk_j^i ,$$
where $Rk_j^i = 1$ if an element keeps its position in the order of values, and $Rk_j^i = 0$ otherwise.
Similarly, the CP parameter is used to measure how the rank of the average value of each attribute varies after the data distortion. In particular, CP defines the change of rank of the average value of the attributes and is given by
$$CP = \frac{1}{m} \sum_{i=1}^{m} \left| RankVV^i - \overline{RankVV}^i \right| ,$$
where $RankVV^i$ and $\overline{RankVV}^i$ denote the rank of the average value of the i-th attribute before and after the data distortion, respectively.
Similar to RK, CK is used to measure the percentage of the
attributes that keep their ranks of average value after
distortion.
From the data privacy perspective, a good data distortion algorithm should result in high values for the RP and CP parameters and low values for the RK and CK parameters.
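These parameters are straightforward to compute. The following is a possible NumPy/SciPy sketch, assuming the data matrix stores the n objects as rows and the m attributes as columns; the function name is ours.

    import numpy as np
    from scipy.stats import rankdata

    def privacy_metrics(V, V_bar):
        """Compute VD, RP, RK, CP and CK for original V and distorted V_bar."""
        VD = np.linalg.norm(V - V_bar, 'fro') / np.linalg.norm(V, 'fro')
        # Ascending ranks of each element within its attribute (column).
        R  = np.apply_along_axis(rankdata, 0, V)
        Rb = np.apply_along_axis(rankdata, 0, V_bar)
        RP = np.abs(R - Rb).mean()    # average rank change per element
        RK = (R == Rb).mean()         # fraction of elements keeping their rank
        # Ranks of the attribute averages before and after distortion.
        rv, rvb = rankdata(V.mean(axis=0)), rankdata(V_bar.mean(axis=0))
        CP = np.abs(rv - rvb).mean()  # average rank change of attribute means
        CK = (rv == rvb).mean()       # fraction of attributes keeping their rank
        return {'VD': VD, 'RP': RP, 'RK': RK, 'CP': CP, 'CK': CK}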
4. UTILITY MEASURE
The data utility measures assess whether the dataset keeps
the performance of data mining techniques after the data
distortion. Throughout this work, we use the accuracy of a simple K-nearest neighbor (KNN) classifier [11] as our data utility measure.
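A possible scikit-learn sketch of this utility check (the function name is ours; the 80/20 split and K = 30 follow the breast cancer experiment in section 5):

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def knn_utility(V_pert, labels, k=30, test_size=0.2, seed=0):
        """Classification accuracy of KNN trained and tested on perturbed data."""
        X_tr, X_te, y_tr, y_te = train_test_split(
            V_pert, labels, test_size=test_size, random_state=seed)
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        return clf.score(X_te, y_te)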
5. EXPERIMENTAL RESULTS
In order to test the performance of our proposed
method, we conducted a series of experiments on some real
world datasets. In this section, we present a sample of the
results obtained when applying our technique to the original
Wisconsin breast cancer and ionosphere databases
downloaded from the UCI Machine Learning Repository [10].
For the breast cancer database, we used 569 observations and
30 attributes (with positive values) to perform our
experiment. For the classification task, 80% of the data was
used for training and the other 20% was used for testing.
Throughout our experiments, we set K=30 for the KNN
classifier. The corresponding classification accuracy on the
original dataset is 92.11%. Figure 1 shows the effect of the
reduced rank m on the privacy parameters.
From Figure 1, it is clear that m = 2 provides the best choice with respect to the privacy parameters, so we fixed m = 2 throughout the rest of our experiments with this dataset.
[Figure 1 here: privacy parameters (acc, RP, VD, RK, CP, CK) on a logarithmic scale versus the reduced rank m, for m from 0 to 30.]
Figure 1 Effect of the reduced rank m on the privacy parameters.
Table 1 shows how the privacy parameters and accuracy vary with the sparseness constraint S_h.

S_h    RP     RK     CP     CK     VD      ACC
0      128.2  0.036  0.133  0.866  0.0341  92.11
0.15   124.4  0.034  0.266  0.733  0.0452  92.10
0.3    125.0  0.114  0.266  0.733  0.0551  92.98
0.65   128.1  0.005  0.6    0.6    0.4696  93.86

Table 1 Effect of the sparseness constraint on the privacy parameters and accuracy
From the results in Table 1, it is clear that S_h = 0.65 not only improves the values of the privacy parameters, but also improves the classification accuracy.
Table 2 shows the effect of threshold ε on the privacy
parameters and accuracy. From the table, it is clear that there is
a trade-off between the privacy parameters and the accuracy.
ε      RP      RK      CP   CK   VD       ACC
0.001  128.62  0.0058  0.6  0.6  0.46997  93.86
0.005  130.31  0.0057  0.6  0.6  0.47249  93.86
0.01   133     0.0055  0.6  0.6  0.48265  93.86
0.02   141.21  0.005   0.6  0.6  0.50483  44.74

Table 2 The effect of threshold ε on the privacy parameters and accuracy
Dealing with Negative Values
Throughout the rest of this section we show how to use the
above technique to perform data perturbation for datasets with
both positive and negative values. Two approaches were used
to deal with this situation. In the first approach, we take the
absolute value of all the attributes, perform the data
perturbation using the NMF as described above, and then
restore the sign of the attributes from the original data set. In
the second approach, we bias the data with some constant so
that all the attributes become positive. After performing the
data perturbation, the value of this constant is subtracted from
the perturbed data.
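Both strategies can be sketched as thin wrappers around any non-negative perturbation routine, such as the NMF pipeline sketched in section 2; the wrapper names and the perturb callable are our own notation.

    import numpy as np

    def perturb_abs(V, perturb):
        """Approach 1: distort |V|, then restore the original signs."""
        return np.sign(V) * perturb(np.abs(V))

    def perturb_bias(V, perturb):
        """Approach 2: shift V to be non-negative, distort, then shift back."""
        c = max(0.0, -float(V.min()))  # smallest constant making all entries non-negative
        return perturb(V + c) - c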
To test the above two approaches, we used the ionosphere database (351 observations and 35 attributes in the range of −1 to +1). The first 200 instances were used as training data and the other 151 were used as test data. We set K=13 for the KNN classifier.
The corresponding classification accuracy on the original
dataset is 93.38%.
When using the first approach, the best classification result (93.37%) was obtained on the NMF data with reduced rank m = 16. Table 3 shows the corresponding privacy parameters.
m    CK    CP    RK     RP     VD    Acc
16   0.67  0.41  0.017  35.63  0.35  0.9337

Table 3 Privacy parameters for the ionosphere dataset using the absolute value approach with m = 16
When varying the sparseness constraint from 0 to 1, the best trade-off between the accuracy and the privacy parameters was obtained for S_h = 0.08. Table 4 shows the corresponding accuracy and privacy parameters.
S_h    CK    CP     RK     RP    VD     Acc
0.08   0.47  0.941  0.064  23.1  0.311  0.9337

Table 4 Privacy parameters for the ionosphere dataset using the absolute value approach with m = 16 and S_h = 0.08.
Table 5 shows the effect of the truncation threshold ε on the
accuracy and privacy parameters.
ε      CK      CP     RK     RP     VD     Acc
0.01   0.441   0.94   0.065  23.1   0.310  0.9338
0.027  0.470   1      0.062  23.71  0.305  0.9007
0.037  0.411   1      0.056  29.18  0.304  0.8543
0.05   0.205   1.82   0.049  35.59  0.421  0.7748
0.08   0.117   10.11  0.039  77.38  0.930  0.8741
0.09   0.0588  12     0.036  91.18  0.960  0.8344

Table 5 Privacy parameters and accuracy versus the truncation threshold ε for the ionosphere dataset using the absolute value approach with m = 16 and S_h = 0.08.
Table 6 shows the corresponding results when we used the second approach to deal with the negative data values. In this case the optimum trade-off between the privacy parameters and the classification accuracy was obtained for S_h = 0.5.
m    CK    CP    RK     RP     VD    Acc
16   0.67  0.41  0.017  35.63  0.35  0.9337

S_h   CK     CP     RK     RP     VD     Acc
0.5   0.647  0.470  0.012  38.85  0.376  0.9337

ε      CK     CP     RK     RP      VD    Acc
0.017  0.147  4.470  0.037  78.55   1.41  0.9470
0.022  0.147  4.588  0.037  78.494  1.42  0.9536
0.026  0.117  4.411  0.038  78.449  1.43  0.9602
0.031  0.147  4.411  0.038  78.583  1.48  0.9139
0.036  0.176  5.117  0.038  78.04   1.53  0.8675
0.04   0.058  5.470  0.038  77.426  1.56  0.8410

Table 6 Trade-off between the privacy parameters and accuracy for the ionosphere dataset using the biasing approach
6. CONCLUSIONS
Non-negative matrix factorization with sparseness
constraints provides an effective data perturbation tool for
privacy preserving data mining.
On the other hand, while the privacy parameters used in this work provide some indication of the ability of these techniques to hide the original data values, it would be interesting to quantitatively relate these parameters to the actual work required to break these data perturbation techniques.
References
[1] M. Chen, J. Han, and P. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. Knowledge and Data Engineering, vol. 8, 1996.
[2] Z. Yang, S. Zhong, and R. N. Wright, "Privacy-preserving classification of customer data without loss of accuracy," in Proceedings of the 5th SIAM International Conference on Data Mining, Newport Beach, CA, April 21-23, 2005.
[3] R. Agrawal and A. Evfimievski, "Information sharing across private databases," in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, pp. 86-97.
[4] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the 20th ACM SIGMOD Symposium on Principles of Database Systems, pp. 247-255, Santa Barbara, May 2001.
[5] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, Dallas, Texas, May 2000. ACM Press.
[6] S. Xu, J. Zhang, D. Han, and J. Wang, "Data distortion for privacy protection in a terrorist analysis system," in P. Kantor et al. (Eds.): ISI 2005, LNCS 3495, pp. 459-464, 2005.
[7] V. P. Pauca, F. Shahnaz, M. Berry, and R. Plemmons, "Text mining using non-negative matrix factorizations," in Proc. SIAM International Conference on Data Mining, Orlando, April 2004.
[8] Patrik O. Hoyer. Non-negative Matrix Factorization
with Sparseness Constraints. Journal of Machine
Learning Research 5 (2004) 1457–1469
[9] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems 13 (Proc. NIPS 2000), MIT Press, 2001.
[10] UCI Machine Learning Repository.
http://www.ics.uci.edu/mlearn/mlsummary.html.
[11] R. Duda, P. Hart, and D. Stork, “Pattern
Classification,” John Wiley and Sons, 2001.
