CCECE 2007- CCGEI 2007, Vancouver, April 2007
0-7803-8253-6/04/$17.00 ©2007 IEEE
ON DATA DISTORTION FOR PRIVACY PRESERVING DATA MINING
Saif M. A. Kabir¹, Amr M. Youssef² and Ahmed K. Elhakeem¹
¹Department of Electrical and Computer Engineering
²Concordia Institute for Information System Engineering
Concordia University, Montreal, Quebec, Canada
{sm_asif,youssef,ahmed}@ece.concordia.ca
Abstract
Because of the increasing ability to trace and collect large amounts of
personal information, privacy preservation in data mining applications has
become an important concern. Data perturbation is one of the well-known
techniques for privacy preserving data mining. The objective of these data
perturbation techniques is to distort the individual data values while
preserving the underlying statistical distribution properties. These data
perturbation techniques are usually assessed in terms of both their privacy
parameters and their associated utility measures. While the privacy
parameters reflect the ability of these techniques to hide the original data
values, the data utility measures assess whether the dataset preserves the
performance of data mining techniques after the data distortion. In this
paper, we investigate the use of truncated non-negative matrix factorization
(NMF) with sparseness constraints for data perturbation.
Keywords: Privacy preserving data mining, Non-negative
matrix factorization.
1. INTRODUCTION
Data mining [1] is the process of searching for patterns in
large volumes of data using tools such as classification and
association rule mining. Several data mining applications deal
with privacy-sensitive data such as financial transactions, and
health care records. Because of the increasing ability to trace
and collect large amounts of personal data, privacy preservation in
data mining applications has become an important concern.
Data can be either collected in a centralized location or
collected and stored at distributed or scattered locations.
According to the collection procedure, there exist different
privacy concerns. For example, for the centralized storage of
data, the major privacy issue is to protect the exact values of the
attributes from the data analysts. In contrast, in a
distributed database setting, the preeminent purpose is to
maintain the independence of the distributed data ownership,
which is related to the issue of data mining in a distributed
environment. Among the techniques used for privacy
preserving data mining are: query restriction, secure multi-party
computation, data swapping, distributed data mining, and data
perturbation. In this work, we focus on the latter approach, i.e.,
data perturbation. The objective of data perturbation is to distort
the individual data values while preserving the underlying
statistical distribution properties. These data perturbation
techniques are usually assessed in terms of both their privacy
parameters and their associated utility measures. While the
privacy parameters reflect the ability of these techniques to
hide the original data values, the data utility measures assess
whether the dataset preserves the performance of data mining
techniques after the data distortion.
The primary focus of this work is to explore a new data
perturbation approach for privacy preserving data mining. In
particular, we investigate the use of truncated non-negative
matrix factorization (NMF) with sparseness constraints for data
perturbation. Our primary experimental results show that the
proposed method is effective in concealing the sensitive
information while preserving the performance of data mining
techniques after the data distortion.
The rest of the paper is organized as follows. In section 2, we
briefly review the non-negative matrix factorization technique.
The data distortion and the utility measures used in this work
are reviewed in section 3 and section 4 respectively. The
experimental results on some real world datasets are presented
in section 5. Finally, the conclusions and future works are given
in section 6.
2. NON-NEGATIVE MATRIX FACTORIZATION
Non-negative matrix factorization (NMF) [9] refers to a class of
algorithms that can be formulated as follows: given a non-negative
n × r data matrix V, NMF finds an approximate factorization V ≈ WH,
where W and H are both non-negative matrices of size n × m and m × r,
respectively. The reduced rank m of the factorization is generally
chosen so that (n + r)m < nr, and hence the product WH can be regarded
as a compressed form of the data matrix V. The optimal choices of the
matrices W and H are defined to be those non-negative matrices that
minimize the reconstruction error between V and WH. Various error
functions have been proposed; the most widely used is the squared-error
(Euclidean distance) function

    E(W, H) = Σ_{i,j} (V_{ij} − (WH)_{ij})².
Unlike other matrix factorization methods (such as principle
component analysis and independent component analysis), non-
negative matrix factorization requires all entries of both
matrices to be non negative, i.e., the data is described by using
additive components only. In section 5 we show how to deal
with datasets with both positive and negative attributes.
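As a concrete illustration of the formulation above, the squared-error NMF objective can be minimized with the classical multiplicative update rules of Lee and Seung [9]. The sketch below (plain NumPy, without the sparseness constraint discussed next, and with function and parameter names of our own choosing) is illustrative rather than the exact algorithm used in this paper:

```python
import numpy as np

def nmf(V, m, n_iter=200, seed=0):
    """Approximate a non-negative matrix V (n x r) as W @ H with
    W (n x m) and H (m x r), using Lee & Seung's multiplicative
    updates for the squared-error objective E(W, H) = ||V - WH||^2."""
    rng = np.random.default_rng(seed)
    n, r = V.shape
    W = rng.random((n, m)) + 1e-4   # strictly positive initialization
    H = rng.random((m, r)) + 1e-4
    tiny = 1e-9                     # guard against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + tiny)
        W *= (V @ H.T) / (W @ H @ H.T + tiny)
    return W, H
```

Because the updates are multiplicative, W and H remain non-negative throughout, and the squared error is non-increasing at each iteration.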
NMF with Sparseness Constraint
Several measures for sparseness have been proposed. In this
work, the sparseness of a vector X of dimension n is given by
[8]:
    S(X) = (√n − (Σᵢ |xᵢ|) / √(Σᵢ xᵢ²)) / (√n − 1).
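This sparseness measure from [8] evaluates to 1 for a vector with a single non-zero component and to 0 for a vector whose components all have equal magnitude. It can be computed directly (a small helper with a name of our own choosing):

```python
import numpy as np

def sparseness(x):
    """Hoyer's sparseness measure of a 1-D vector:
    (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()            # L1 norm
    l2 = np.sqrt((x ** 2).sum())    # L2 norm
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)
```

For example, sparseness([0, 0, 3, 0]) is 1, while sparseness([1, 1, 1, 1]) is 0.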
Most NMF algorithms produce a sparse representation of the data. Such a
representation encodes much of the data using few active components.
However, the sparseness achieved by these techniques is a side-effect
rather than a controlled parameter, i.e., one cannot in any way control
the degree to which the representation is sparse.
Our aim is to constrain NMF to find solutions with desired
degrees of sparseness. The sparseness constraint can be
imposed on either W or H or on both of them. For example, a
doctor analyzing a dataset that describes disease patterns, might
assume that most diseases are rare (hence sparse) but that each
disease can cause a large number of symptoms. Assuming that
symptoms make up the rows of her matrix and the columns
denote different individuals, in this case it is the coefficients
which should be sparse and the basis vectors unconstrained.
Throughout our work, we used the projected gradient descent algorithm for
NMF with sparseness constraints proposed in [8], where we imposed the
sparseness constraint only on the H matrix.
Truncation on NMF with Sparseness Constraint
In order to control the degree of achievable data distortion, the
elements in the sparsified H matrix with values less than a
specified truncation threshold ε are truncated to zero.
Thus the overall data distortion can be summarized as follows:
(i) perform NMF with sparseness constraint S_h on H to obtain H_{S_h};
(ii) truncate the elements in H_{S_h} that are less than ε to zero to
obtain H_{S_h,ε}. The perturbed dataset is given by W H_{S_h,ε}.
Thus the new dataset is distorted twice by our proposed algorithm,
which has three parameters: the reduced rank m, the sparseness
parameter S_h, and the truncation threshold ε.
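The two-stage distortion can be sketched as follows. For brevity, this sketch substitutes plain multiplicative-update NMF for the sparseness-constrained projected-gradient algorithm of [8], so the `perturb` function below is illustrative only, and its names are our own:

```python
import numpy as np

def perturb(V, m, eps_t, n_iter=200, seed=0):
    """Sketch of the two-stage distortion: factor V ~ W @ H, zero out
    the entries of H below the truncation threshold eps_t, and release
    W @ H_trunc as the perturbed dataset."""
    rng = np.random.default_rng(seed)
    n, r = V.shape
    W = rng.random((n, m)) + 1e-4
    H = rng.random((m, r)) + 1e-4
    tiny = 1e-9
    for _ in range(n_iter):                       # stage 1: NMF
        H *= (W.T @ V) / (W.T @ W @ H + tiny)
        W *= (V @ H.T) / (W @ H @ H.T + tiny)
    H_trunc = np.where(H < eps_t, 0.0, H)         # stage 2: truncation
    return W @ H_trunc
```

Raising eps_t zeroes more of H and therefore increases the distortion, which is exactly the trade-off explored experimentally in section 5.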
3. DATA DISTORTION MEASURES
Throughout this work, we adopt the same set of privacy
parameters proposed in [6]. The value difference (VD)
parameter is used as a measure for value difference after the
data distortion algorithm is applied to the original data
matrix. Let V and V̄ denote the original and distorted data
matrices, respectively. Then VD is given by

    VD = ||V − V̄|| / ||V||,

where ||·|| denotes the Frobenius norm of the enclosed argument.
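In code, VD is a one-liner (the function name is ours):

```python
import numpy as np

def value_difference(V, V_bar):
    """VD = ||V - V_bar||_F / ||V||_F, where ||.||_F is the
    Frobenius norm (NumPy's default matrix norm)."""
    return np.linalg.norm(V - V_bar) / np.linalg.norm(V)
```

VD is 0 when the distorted matrix equals the original and grows with the relative magnitude of the distortion.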
After a data distortion, the order of the value of the data
elements also changes. Several metrics are used to measure
the position difference of the data elements. For a dataset V
with n data objects and m attributes, let Rank_j^i denote the rank
(in ascending order) of the jth element in attribute i, and let
Rank'_j^i denote the rank of the corresponding distorted element.
The RP parameter is used to measure the position difference. It
indicates the average change of rank for all attributes after
distortion and is given by

    RP = (1 / (n m)) Σ_{i=1}^{m} Σ_{j=1}^{n} |Rank_j^i − Rank'_j^i|.
RK represents the percentage of elements that keep their rank in each
column after distortion and is given by

    RK = (1 / (n m)) Σ_{i=1}^{m} Σ_{j=1}^{n} Rk_j^i,

where Rk_j^i = 1 if an element keeps its position in the order of
values, and Rk_j^i = 0 otherwise.
Similarly, the CP parameter is used to measure how the rank of the
average value of each attribute varies after the data distortion. In
particular, CP defines the change of rank of the average value of the
attributes and is given by

    CP = (1 / m) Σ_{i=1}^{m} |RankVV_i − RankVV'_i|,

where RankVV_i and RankVV'_i denote the rank of the average value of
the ith attribute before and after the data distortion, respectively.
Similar to RK, CK is used to measure the percentage of the
attributes that keep their ranks of average value after
distortion.
From the data privacy perspective, a good data distortion algorithm
should result in high values for the RP and CP parameters and low
values for the RK and CK parameters.
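Under the definitions above, the four rank-based parameters can be computed as follows (a sketch assuming no ties among ranked values; the function and variable names are ours):

```python
import numpy as np

def ranks(a):
    """Ascending 0-based rank of each element of a 1-D array
    (assumes no tied values)."""
    return np.argsort(np.argsort(a))

def rank_metrics(V, V_bar):
    """RP: average per-element rank change; RK: fraction of elements
    keeping their rank; CP: average rank change of the attribute means;
    CK: fraction of attributes whose mean keeps its rank.
    Attributes are the columns of V."""
    R = np.apply_along_axis(ranks, 0, V)        # element ranks, original
    R_bar = np.apply_along_axis(ranks, 0, V_bar)  # element ranks, distorted
    RP = np.abs(R - R_bar).mean()
    RK = (R == R_bar).mean()
    c = ranks(V.mean(axis=0))                   # ranks of attribute means
    c_bar = ranks(V_bar.mean(axis=0))
    CP = np.abs(c - c_bar).mean()
    CK = (c == c_bar).mean()
    return RP, RK, CP, CK
```

For an undistorted dataset (V̄ = V) this yields RP = CP = 0 and RK = CK = 1, the worst possible values from the privacy perspective.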
4. UTILITY MEASURE
The data utility measures assess whether the dataset preserves the
performance of data mining techniques after the data distortion.
Throughout this work, we use the classification accuracy of a simple
K-nearest neighbor (KNN) classifier [11] as our data utility measure.
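A minimal KNN accuracy routine (Euclidean distance, majority vote) suffices for this measure; the sketch below uses names and defaults of our own choosing:

```python
import numpy as np

def knn_accuracy(X_tr, y_tr, X_te, y_te, k=5):
    """Accuracy of a plain K-nearest-neighbor classifier: each test
    point is assigned the majority label among its k nearest training
    points under Euclidean distance."""
    # pairwise squared distances, shape (n_test, n_train)
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d, axis=1)[:, :k]           # indices of k nearest
    votes = y_tr[nn]                            # their labels
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return (pred == y_te).mean()
```

The utility of a perturbed dataset is then the KNN accuracy obtained when training and testing on the distorted data, compared against the accuracy on the original data.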
5. EXPERIMENTAL RESULTS
In order to test the performance of our proposed
method, we conducted a series of experiments on some real
world datasets. In this section, we present a sample of the
results obtained when applying our technique to the original
Wisconsin breast cancer and ionosphere databases
downloaded from the UCI Machine Learning Repository [10].
For the breast cancer database, we used 569 observations and
30 attributes (with positive values) to perform our
experiment. For the classification task, 80% of the data was
used for training and the other 20% was used for testing.
Throughout our experiments, we set K=30 for the KNN
classifier. The corresponding classification accuracy on the
original dataset is 92.11%. Figure 1 shows the effect of the
reduced rank m on the privacy parameters.
From Figure 1, it is clear that m = 2 provides the best choice with
respect to the privacy parameters. So we fixed m = 2 throughout the
rest of our experiments with this dataset.
[Figure 1 omitted: plot of acc, RP, VD, RK, CP and CK (logarithmic scale)
versus the reduced rank m.]
Figure 1 Effect of the reduced rank m on the privacy parameters.
Table 1 shows how the privacy parameters and accuracy vary with the
sparseness constraint S_h.

S_h    RP     RK     CP     CK     VD      ACC
0      128.2  0.036  0.133  0.866  0.0341  92.11
0.15   124.4  0.034  0.266  0.733  0.0452  92.10
0.3    125.0  0.114  0.266  0.733  0.0551  92.98
0.65   128.1  0.005  0.6    0.6    0.4696  93.86

Table 1 Effect of the sparseness constraint on the privacy parameters
and accuracy
From the results in Table 1, it is clear that S_h = 0.65 not only
improves the values of the privacy parameters, but also improves the
classification accuracy.
Table 2 shows the effect of threshold ε on the privacy
parameters and accuracy. From the table, it is clear that there is
a trade-off between the privacy parameters and the accuracy.
ε RP RK CP CK VD ACC
0.001 128.62 0.0058 0.6 0.6 0.46997 93.86
0.005 130.31 0.0057 0.6 0.6 0.47249 93.86
0.01 133 0.0055 0.6 0.6 0.48265 93.86
0.02 141.21 0.005 0.6 0.6 0.50483 44.74
Table 2 The effect of threshold ε on the privacy
parameters and accuracy
Dealing with negative values:
Throughout the rest of this section we show how to use the
above technique to perform data perturbation for datasets with
both positive and negative values. Two approaches were used
to deal with this situation. In the first approach, we take the
absolute value of all the attributes, perform the data
perturbation using the NMF as described above, and then
restore the sign of the attributes from the original data set. In
the second approach, we bias the data with some constant so
that all the attributes become positive. After performing the
data perturbation, the value of this constant is subtracted from
the perturbed data.
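Both approaches are thin wrappers around the positive-data perturbation. In the sketch below, `perturb_fn` is a placeholder for the NMF-based distortion described above, and the function names are ours:

```python
import numpy as np

def sign_restore_perturb(V, perturb_fn):
    """Approach 1: perturb the absolute values, then restore the
    signs of the attributes from the original dataset."""
    return np.sign(V) * perturb_fn(np.abs(V))

def bias_perturb(V, perturb_fn):
    """Approach 2: shift V by a constant so all entries become
    non-negative, perturb, then subtract the constant back."""
    c = max(0.0, -float(V.min()))   # smallest shift making V non-negative
    return perturb_fn(V + c) - c
```

Note that Approach 1 leaks the sign pattern of the original data by construction, while Approach 2 perturbs signs along with magnitudes; the experiments below compare the two.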
To test the above two approaches, we used the ionosphere database
(351 observations and 35 attributes in the range of -1 to +1). The
first 200 instances were used as training data and the other 151 were
used as test data. We set K=13 for the KNN classifier.
The corresponding classification accuracy on the original
dataset is 93.38%.
When using the first approach, the best classification result
(93.37%) was obtained on the NMF data with reduced rank m = 16.
Table 3 shows the corresponding privacy parameters.

m    CK    CP    RK     RP     VD    Acc
16   0.67  0.41  0.017  35.63  0.35  0.9337

Table 3 Privacy parameters for the ionosphere dataset using the
absolute value approach with m = 16
When varying the sparseness constraint from 0 to 1, the best trade-off
between the accuracy and the privacy parameters was obtained for
S_h = 0.08. Table 4 shows the corresponding accuracy and privacy
parameters.

S_h    CK    CP     RK     RP    VD     Acc
0.08   0.47  0.941  0.064  23.1  0.311  0.9337

Table 4 Privacy parameters for the ionosphere dataset using the
absolute value approach with m = 16 and S_h = 0.08.
Table 5 shows the effect of the truncation threshold ε on the
accuracy and privacy parameters.
ε CK CP RK RP VD Acc
0.01 0.441 0.94 0.065 23.1 0.310 0.9338
0.027 0.470 1 0.062 23.71 0.305 0.9007
0.037 0.411 1 0.056 29.18 0.304 0.8543
0.05 0.205 1.82 0.049 35.59 0.421 0.7748
0.08 0.117 10.11 0.039 77.38 0.930 0.8741
0.09 0.0588 12 0.036 91.18 0.960 0.8344
Table 5 Effect of the truncation threshold ε on the privacy parameters
and accuracy for the ionosphere dataset using the absolute value
approach with m = 16 and S_h = 0.08.
Table 6 shows the corresponding results when we used the second
approach to deal with the negative data values. In this case the
optimum trade-off between the privacy parameters and the
classification accuracy was obtained for S_h = 0.5.
m CK CP RK RP VD Acc
16 0.67 0.41 0.017 35.63 0.35 0.9337
S_h CK CP RK RP VD Acc
0.5 0.647 0.470 0.012 38.85 0.376 0.9337
ε CK CP RK RP VD Acc
0.017 0.147 4.470 0.037 78.55 1.41 0.9470
0.022 0.147 4.588 0.037 78.494 1.42 0.9536
0.026 0.117 4.411 0.038 78.449 1.43 0.9602
0.031 0.147 4.411 0.038 78.583 1.48 0.9139
0.036 0.176 5.117 0.038 78.04 1.53 0.8675
0.04 0.058 5.470 0.038 77.426 1.56 0.8410
Table 6 Trade off between the privacy parameters and accuracy
for the ionosphere dataset using the biasing approach
6. CONCLUSIONS
Non-negative matrix factorization with sparseness
constraints provides an effective data perturbation tool for
privacy preserving data mining.
On the other hand, while the privacy parameters used in this work
provide some indication of the ability of these techniques to hide the
original data values, it would be interesting to quantitatively relate
these parameters to the actual work required to break these data
perturbation techniques.
References
[1] M. Chen, J. Han, and P. Yu, "Data Mining: An Overview from a
Database Perspective", IEEE Trans. Knowledge and Data Engineering,
8, 1996.
[2] Z. Yang, S. Zhong, R. N. Wright, “Privacy-
preserving classification of customer data without
loss of accuracy,” In proceedings of the 5th SIAM
International Conference on Data Mining, Newport
Beach, CA, April 21-23, 2005.
[3] R. Agrawal, and A. Evfimievski, “Information
sharing across private database,” Proceedings of the
2003 ACM SIGMOD international conference on
management of data, San Diego, CA, pp. 86-97.
[4] D. Agrawal and C. C. Aggarwal, "On the design and quantification
of privacy preserving data mining algorithms," In Proceedings of the
20th ACM SIGMOD Symposium on Principles of Database Systems, pages
247-255, Santa Barbara, May 2001.
[5] Rakesh Agrawal and Ramakrishnan Srikant,
“Privacy-preserving data mining,” In Proceeding of
the ACM SIGMOD Conference on Management of
Data, pages 439–450, Dallas, Texas, May 2000.
ACM Press.
[6] Shuting Xu, Jun Zhang, Dianwei Han, and Jie Wang, "Data distortion
for privacy protection in a terrorist analysis system," P. Kantor et
al. (Eds.): ISI 2005, LNCS 3495, pp. 459-464, 2005.
[7] V. P. Pauca, F. Shahnaz, M. Berry and R. Plemmons.
Text Mining using non-negative Matrix
Factorizations, Proc. SIAM Inter. Conf. on Data
Mining, Orlando, April, 2004.
[8] Patrik O. Hoyer. Non-negative Matrix Factorization
with Sparseness Constraints. Journal of Machine
Learning Research 5 (2004) 1457–1469
[9] D. D. Lee and H. S. Seung. Algorithms for non-
negative matrix factorization. In Advances in Neural
Information Processing 13 (Proc. NIPS 2000). MIT
Press, 2001.
[10] UCI Machine Learning Repository.
http://www.ics.uci.edu/mlearn/mlsummary.html.
[11] R. Duda, P. Hart, and D. Stork, “Pattern
Classification,” John Wiley and Sons, 2001.
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...
IJECEIAES
 
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
Clustering heterogeneous categorical data using enhanced mini  batch K-means ...Clustering heterogeneous categorical data using enhanced mini  batch K-means ...
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
IJECEIAES
 
F017533540
F017533540F017533540
F017533540
IOSR Journals
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
IJDKP
 
61_Empirical
61_Empirical61_Empirical
61_Empirical
Boshra Albayaty
 
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...Influence over the Dimensionality Reduction and Clustering for Air Quality Me...
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...
IJAEMSJORNAL
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
Sitamarhi Institute of Technology
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
Sitamarhi Institute of Technology
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
Sitamarhi Institute of Technology
 
Predicting electricity consumption using hidden parameters
Predicting electricity consumption using hidden parametersPredicting electricity consumption using hidden parameters
Predicting electricity consumption using hidden parameters
IJLT EMAS
 
C054
C054C054
C054
Weili Xu
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
IJRES Journal
 
Classification of Breast Cancer Diseases using Data Mining Techniques
Classification of Breast Cancer Diseases using Data Mining TechniquesClassification of Breast Cancer Diseases using Data Mining Techniques
Classification of Breast Cancer Diseases using Data Mining Techniques
inventionjournals
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
Alexander Decker
 

Similar to Saif_CCECE2007_full_paper_submitted (20)

Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...Q UANTUM  C LUSTERING -B ASED  F EATURE SUBSET  S ELECTION FOR MAMMOGRAPHIC I...
Q UANTUM C LUSTERING -B ASED F EATURE SUBSET S ELECTION FOR MAMMOGRAPHIC I...
 
Survey paper on Big Data Imputation and Privacy Algorithms
Survey paper on Big Data Imputation and Privacy AlgorithmsSurvey paper on Big Data Imputation and Privacy Algorithms
Survey paper on Big Data Imputation and Privacy Algorithms
 
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
 
1376846406 14447221
1376846406  144472211376846406  14447221
1376846406 14447221
 
Ijariie1117 volume 1-issue 1-page-25-27
Ijariie1117 volume 1-issue 1-page-25-27Ijariie1117 volume 1-issue 1-page-25-27
Ijariie1117 volume 1-issue 1-page-25-27
 
An Efficient Frame Embedding Using Haar Wavelet Coefficients And Orthogonal C...
An Efficient Frame Embedding Using Haar Wavelet Coefficients And Orthogonal C...An Efficient Frame Embedding Using Haar Wavelet Coefficients And Orthogonal C...
An Efficient Frame Embedding Using Haar Wavelet Coefficients And Orthogonal C...
 
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...
Hybrid Method HVS-MRMR for Variable Selection in Multilayer Artificial Neural...
 
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
Clustering heterogeneous categorical data using enhanced mini  batch K-means ...Clustering heterogeneous categorical data using enhanced mini  batch K-means ...
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
 
F017533540
F017533540F017533540
F017533540
 
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA...
 
61_Empirical
61_Empirical61_Empirical
61_Empirical
 
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...Influence over the Dimensionality Reduction and Clustering for Air Quality Me...
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Predicting electricity consumption using hidden parameters
Predicting electricity consumption using hidden parametersPredicting electricity consumption using hidden parameters
Predicting electricity consumption using hidden parameters
 
C054
C054C054
C054
 
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...
 
Classification of Breast Cancer Diseases using Data Mining Techniques
Classification of Breast Cancer Diseases using Data Mining TechniquesClassification of Breast Cancer Diseases using Data Mining Techniques
Classification of Breast Cancer Diseases using Data Mining Techniques
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
 

Saif_CCECE2007_full_paper_submitted

According to the collection procedure, different privacy concerns arise. For centralized storage of data, the major privacy issue is to protect the exact values of the attributes from the data analysts. In a distributed database setting, by contrast, the main goal is to maintain the independence of the distributed data ownership, which is related to the problem of data mining in a distributed environment.

Among the techniques used for privacy preserving data mining are query restriction, secure multi-party computation, data swapping, distributed data mining, and data perturbation. In this work, we focus on the latter approach, i.e., data perturbation. The objective of data perturbation is to distort the individual data values while preserving the underlying statistical distribution properties. These data perturbation techniques are usually assessed in terms of both their privacy parameters and their associated utility measures. While the privacy parameters reflect the ability of these techniques to hide the original data values, the data utility measures assess whether the dataset preserves the performance of data mining techniques after the data distortion.

The primary focus of this work is to explore a new data perturbation approach for privacy preserving data mining. In particular, we investigate the use of truncated non-negative matrix factorization (NMF) with sparseness constraints for data perturbation. Our preliminary experimental results show that the proposed method is effective in concealing the sensitive information while preserving the performance of data mining techniques after the data distortion.

The rest of the paper is organized as follows. In section 2, we briefly review the non-negative matrix factorization technique. The data distortion and utility measures used in this work are reviewed in sections 3 and 4, respectively.
The experimental results on some real-world datasets are presented in section 5. Finally, conclusions and future work are given in section 6.

2. NON-NEGATIVE MATRIX FACTORIZATION

Non-negative matrix factorization (NMF) [9] refers to a class of algorithms that can be formulated as follows: given a non-negative n \times r data matrix V, NMF finds an approximate factorization V \approx WH, where W and H are non-negative matrices of size n \times m and m \times r, respectively. The reduced rank m of the factorization is generally chosen so that (n + r)m < nr, and hence the product WH can be regarded as a compressed form of the data matrix V. The optimal choices of W and H are those non-negative matrices that minimize the reconstruction error between V and WH. Various error functions have been proposed; the most widely used is the squared error (Euclidean distance) function

E(W, H) = \sum_{i,j} (V_{ij} - (WH)_{ij})^2.

Unlike other matrix factorization methods (such as principal component analysis and independent component analysis), non-negative matrix factorization requires all entries of both matrices to be non-negative, i.e., the data is described using additive components only. In section 5 we show how to deal with datasets with both positive and negative attributes.

NMF with Sparseness Constraint

Several measures of sparseness have been proposed. In this work, the sparseness of a vector x of dimension n is given by [8]:

S(x) = (\sqrt{n} - (\sum_i |x_i|) / \sqrt{\sum_i x_i^2}) / (\sqrt{n} - 1).

Most NMF algorithms produce a sparse representation of the data; such a representation encodes much of the data using few active components. However, the sparseness given by these techniques is a side effect rather than a controlled parameter, i.e., one cannot control the degree to which the representation is sparse. Our aim is to constrain NMF to find solutions with desired degrees of sparseness. The sparseness constraint can be imposed on W, on H, or on both. For example, a doctor analyzing a dataset that describes disease patterns might assume that most diseases are rare (hence sparse) but that each disease can cause a large number of symptoms. Assuming that the symptoms make up the rows of her matrix and the columns denote different individuals, in this case it is the coefficients that should be sparse and the basis vectors unconstrained. Throughout our work, we used the projected gradient descent algorithm for NMF with sparseness constraints proposed in [8], where we imposed the sparseness constraint only on the H matrix.

Truncation of NMF with Sparseness Constraint

In order to control the degree of achievable data distortion, the elements of the sparsified H matrix with values less than a specified truncation threshold \epsilon are truncated to zero. Thus the overall data distortion can be summarized as follows:
(i) Perform NMF with sparseness constraint s_h on H to obtain H_{s_h}.
(ii) Truncate the elements of H_{s_h} that are less than \epsilon to obtain H_{s_h,\epsilon}.
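The two-step perturbation procedure above can be sketched in a few lines of numpy. As a simplifying assumption, this sketch uses plain multiplicative-update NMF (Lee and Seung [9]) as a stand-in for the projected gradient algorithm with an explicit sparseness constraint from [8] that the paper actually uses; the function names, iteration count, and random initialization are illustrative. Hoyer's sparseness measure is included for reference.

```python
import numpy as np

def hoyer_sparseness(x):
    """Hoyer's sparseness measure from [8]: 1 for a vector with a
    single non-zero entry, 0 when all entries have equal magnitude."""
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

def nmf_truncated(V, m, eps_trunc, n_iter=500, seed=0):
    """Step (i): factor the non-negative n x r matrix V as V ~ W H with
    reduced rank m (here via multiplicative updates, a stand-in for the
    sparseness-constrained algorithm of [8]).
    Step (ii): truncate entries of H below eps_trunc to zero."""
    rng = np.random.default_rng(seed)
    n, r = V.shape
    W = rng.random((n, m))
    H = rng.random((m, r))
    tiny = 1e-9  # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + tiny)   # update coefficients
        W *= (V @ H.T) / (W @ H @ H.T + tiny)   # update basis vectors
    H[H < eps_trunc] = 0.0                      # truncation step
    return W, H
```

The perturbed dataset released to the analyst is then the product W @ H, never the original V.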
The perturbed dataset is given by W H_{s_h,\epsilon}. Thus the new dataset is distorted twice by our proposed algorithm, which has three parameters: the reduced rank m, the sparseness parameter s_h, and the truncation threshold \epsilon.

3. DATA DISTORTION MEASURES

Throughout this work, we adopt the set of privacy parameters proposed in [6]. The value difference (VD) parameter measures the value difference after the data distortion algorithm is applied to the original data matrix. Let V and \bar{V} denote the original and distorted data matrices, respectively. Then VD is given by

VD = ||V - \bar{V}|| / ||V||,

where ||\cdot|| denotes the Frobenius norm.

After data distortion, the order of the values of the data elements also changes. Several metrics are used to measure the position difference of the data elements. For a dataset V with n data objects and m attributes, let Rank_j^i denote the rank (in ascending order) of the jth element in attribute i, and let \bar{Rank}_j^i denote the rank of the corresponding distorted element. The RP parameter measures the position difference; it indicates the average change of rank over all attributes after distortion and is given by

RP = (1/(nm)) \sum_{i=1}^{m} \sum_{j=1}^{n} |Rank_j^i - \bar{Rank}_j^i|.

RK represents the percentage of elements that keep their rank in each column after distortion and is given by

RK = (1/(nm)) \sum_{i=1}^{m} \sum_{j=1}^{n} Rk_j^i,

where Rk_j^i = 1 if the element keeps its position in the order of values, and Rk_j^i = 0 otherwise.

Similarly, the CP parameter measures how the rank of the average value of each attribute varies after the data distortion. In particular, CP defines the change of rank of the average values of the attributes and is given by

CP = (1/m) \sum_{i=1}^{m} |RankVV_i - \bar{RankVV}_i|,

where RankVV_i and \bar{RankVV}_i denote the rank of the average value of the ith attribute before and after the data distortion, respectively.
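The measures defined so far (VD, RP, RK, and CP) can be computed directly with numpy. This is a minimal sketch under our own conventions: the helper names are ours, ranks are taken in ascending order within each attribute column, and ties are broken by position rather than averaged.

```python
import numpy as np

def ranks(a):
    """Ascending ranks of a 1-D array (0-based; ties broken by order)."""
    return np.argsort(np.argsort(a))

def distortion_measures(V, Vp):
    """VD, RP, RK and CP as defined in [6]: V is the original n x m data
    matrix (n objects as rows, m attributes as columns), Vp its
    distorted version."""
    vd = np.linalg.norm(V - Vp) / np.linalg.norm(V)  # Frobenius norms
    rank_o = np.apply_along_axis(ranks, 0, V)        # per-attribute ranks
    rank_p = np.apply_along_axis(ranks, 0, Vp)
    rp = np.abs(rank_o - rank_p).mean()              # average rank change
    rk = (rank_o == rank_p).mean()                   # fraction keeping rank
    cp = np.abs(ranks(V.mean(axis=0)) - ranks(Vp.mean(axis=0))).mean()
    return vd, rp, rk, cp
```

For an undistorted copy the measures reduce to VD = 0, RP = 0, RK = 1, CP = 0, which is a convenient sanity check.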
Similar to RK, CK measures the percentage of attributes that keep the rank of their average value after distortion.

From the data privacy perspective, a good data distortion algorithm should result in high values for the RP and CP parameters and low values for the RK and CK parameters.

4. UTILITY MEASURE

The data utility measures assess whether the dataset preserves the performance of data mining techniques after the data distortion. Throughout this work, we use the accuracy of a simple K-nearest neighbor (KNN) classifier [11] as our data utility measure.

5. EXPERIMENTAL RESULTS

In order to test the performance of our proposed method, we conducted a series of experiments on some real
world datasets. In this section, we present a sample of the results obtained when applying our technique to the Wisconsin breast cancer and ionosphere databases downloaded from the UCI Machine Learning Repository [10].

For the breast cancer database, we used 569 observations and 30 attributes (with positive values). For the classification task, 80% of the data was used for training and the other 20% for testing. Throughout our experiments, we set K = 30 for the KNN classifier. The corresponding classification accuracy on the original dataset is 92.11%.

Figure 1 shows the effect of the reduced rank m on the privacy parameters. From the figure, it is clear that m = 2 provides the best choice with respect to the privacy parameters, so we fixed m = 2 throughout the rest of our experiments with this dataset.

[Figure 1: Effect of the reduced rank m on the privacy parameters (acc, RP, VD, RK, CP, and CK versus m, logarithmic scale).]

Table 1 shows how the privacy parameters and accuracy vary with the sparseness constraint s_h.

  s_h    RP      RK      CP      CK      VD       ACC
  0      128.2   0.036   0.133   0.866   0.0341   92.11
  0.15   124.4   0.034   0.266   0.733   0.0452   92.10
  0.3    125.0   0.114   0.266   0.733   0.0551   92.98
  0.65   128.1   0.005   0.6     0.6     0.4696   93.86

Table 1 Effect of the sparseness constraint on the privacy parameters and accuracy

From the results in Table 1, it is clear that s_h = 0.65 not only improves the values of the privacy parameters but also improves the classification accuracy. Table 2 shows the effect of the threshold \epsilon on the privacy parameters and accuracy. From the table, it is clear that there is a trade-off between the privacy parameters and the accuracy.
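The KNN utility measure used in these experiments can be reproduced with a generic Euclidean majority-vote classifier; this is a minimal sketch, not the exact implementation from [11], and the function name is ours. Labels are assumed to be non-negative integers.

```python
import numpy as np

def knn_accuracy(X_train, y_train, X_test, y_test, k):
    """Accuracy of a plain Euclidean K-nearest-neighbour classifier
    with majority vote over the k closest training points."""
    # pairwise distances: one row per test point, one column per training point
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]   # indices of k closest rows
    votes = y_train[nearest]                 # their class labels
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return (pred == y_test).mean()
```

Under the paper's protocol, this accuracy is computed once on the original dataset and once on the perturbed dataset, and the two values are compared.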
  \epsilon   RP       RK       CP    CK    VD        ACC
  0.001      128.62   0.0058   0.6   0.6   0.46997   93.86
  0.005      130.31   0.0057   0.6   0.6   0.47249   93.86
  0.01       133      0.0055   0.6   0.6   0.48265   93.86
  0.02       141.21   0.005    0.6   0.6   0.50483   44.74

Table 2 The effect of the threshold \epsilon on the privacy parameters and accuracy

Dealing with negative values: In the rest of this section, we show how to use the above technique to perform data perturbation on datasets with both positive and negative values. Two approaches were used. In the first approach, we take the absolute values of all attributes, perform the data perturbation using NMF as described above, and then restore the signs of the attributes from the original dataset. In the second approach, we bias the data with a constant so that all attributes become positive; after performing the data perturbation, this constant is subtracted from the perturbed data.

To test the above two approaches, we used the ionosphere database (351 observations and 35 attributes in the range -1 to +1). The first 200 instances were used as training data and the other 151 as test data. We set K = 13 for the KNN classifier. The corresponding classification accuracy on the original dataset is 93.38%.

When using the first approach, the best classification result (93.37%) was obtained on the NMF data with reduced rank m = 16. Table 3 shows the corresponding privacy parameters.

  m    CK     CP     RK      RP      VD     Acc
  16   0.67   0.41   0.017   35.63   0.35   0.9337

Table 3 Privacy parameters for the ionosphere dataset using the absolute value approach with m = 16

When varying the sparseness constraint from 0 to 1, the best trade-off between the accuracy and the privacy parameters was obtained for s_h = 0.08. Table 4 shows the corresponding accuracy and privacy parameters.

  s_h    CK     CP      RK      RP     VD      Acc
  0.08   0.47   0.941   0.064   23.1   0.311   0.9337

Table 4 Privacy parameters for the ionosphere dataset using the absolute value approach with m = 16 and s_h = 0.08
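The two sign-handling approaches just described can be sketched generically. Here `perturb` stands for any non-negative perturbation routine (e.g., the truncated sparse NMF pipeline); the function names are ours, and this is only an illustration of the wrapping logic, not the paper's code.

```python
import numpy as np

def perturb_abs(V, perturb):
    """Approach 1: perturb the absolute values, then restore the
    signs from the original dataset."""
    return np.sign(V) * perturb(np.abs(V))

def perturb_bias(V, perturb):
    """Approach 2: bias the data into the non-negative range,
    perturb, then subtract the bias from the perturbed data."""
    c = max(0.0, -float(V.min()))  # smallest shift making V non-negative
    return perturb(V + c) - c
```

Both wrappers guarantee that the inner perturbation only ever sees non-negative data, as NMF requires.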
Table 5 shows the effect of the truncation threshold \epsilon on the accuracy and privacy parameters.

  \epsilon   CK       CP      RK      RP      VD      Acc
  0.01       0.441    0.94    0.065   23.1    0.310   0.9338
  0.027      0.470    1       0.062   23.71   0.305   0.9007
  0.037      0.411    1       0.056   29.18   0.304   0.8543
  0.05       0.205    1.82    0.049   35.59   0.421   0.7748
  0.08       0.117    10.11   0.039   77.38   0.930   0.8741
  0.09       0.0588   12      0.036   91.18   0.960   0.8344

Table 5 Effect of the truncation threshold \epsilon on the privacy parameters and accuracy for the ionosphere dataset using the absolute value approach with m = 16 and s_h = 0.08

Table 6 shows the corresponding results when the second approach was used to deal with the negative data values. In this case, the optimum trade-off between the privacy parameters and the classification accuracy was obtained for s_h = 0.5.

  m    CK     CP     RK      RP      VD     Acc
  16   0.67   0.41   0.017   35.63   0.35   0.9337

  s_h   CK      CP      RK      RP      VD      Acc
  0.5   0.647   0.470   0.012   38.85   0.376   0.9337

  \epsilon   CK      CP      RK      RP       VD     Acc
  0.017      0.147   4.470   0.037   78.55    1.41   0.9470
  0.022      0.147   4.588   0.037   78.494   1.42   0.9536
  0.026      0.117   4.411   0.038   78.449   1.43   0.9602
  0.031      0.147   4.411   0.038   78.583   1.48   0.9139
  0.036      0.176   5.117   0.038   78.04    1.53   0.8675
  0.04       0.058   5.470   0.038   77.426   1.56   0.8410

Table 6 Trade-off between the privacy parameters and accuracy for the ionosphere dataset using the biasing approach

6. CONCLUSIONS

Non-negative matrix factorization with sparseness constraints provides an effective data perturbation tool for privacy preserving data mining. On the other hand, while the privacy parameters used in this work give some indication of the ability of these techniques to hide the original data values, it would be interesting to quantitatively relate these parameters to the actual work required to break these data perturbation techniques.

References
[1] M. Chen, J. Han, and P. Yu, "Data mining: an overview from a database perspective," IEEE Trans. Knowledge and Data Engineering, 8, 1996.
[2] Z. Yang, S. Zhong, and R. N. Wright, "Privacy-preserving classification of customer data without loss of accuracy," in Proceedings of the 5th SIAM International Conference on Data Mining, Newport Beach, CA, April 21-23, 2005.
[3] R. Agrawal and A. Evfimievski, "Information sharing across private databases," in Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, CA, pp. 86-97.
[4] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the 20th ACM SIGMOD Symposium on Principles of Database Systems, pp. 247-255, Santa Barbara, May 2001.
[5] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, Dallas, Texas, May 2000. ACM Press.
[6] S. Xu, J. Zhang, D. Han, and J. Wang, "Data distortion for privacy protection in a terrorist analysis system," in P. Kantor et al. (Eds.): ISI 2005, LNCS 3495, pp. 459-464, 2005.
[7] V. P. Pauca, F. Shahnaz, M. Berry, and R. Plemmons, "Text mining using non-negative matrix factorizations," in Proc. SIAM International Conference on Data Mining, Orlando, April 2004.
[8] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," Journal of Machine Learning Research 5 (2004), pp. 1457-1469.
[9] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing 13 (Proc. NIPS 2000). MIT Press, 2001.
[10] UCI Machine Learning Repository. http://www.ics.uci.edu/mlearn/mlsummary.html
[11] R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley and Sons, 2001.