Privacy preserving clustering on centralized data through scaling transf


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Privacy preserving clustering on centralized data through scaling transf

  1. 1. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976- 6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME 449 PRIVACY PRESERVING CLUSTERING ON CENTRALIZED DATA THROUGH SCALING TRANSFORMATION Khatri Nishant P. Ms. Preeti Gupta M. Tech. (CSE) CSE Dept. Amity School of Engg. & Tech. Amity School of Engg & Tech Amity University Rajasthan, Amity University Rajasthan, Jaipur, India Jaipur, India Tusal Patel M. Tech. (CSE) Amity School of Engg. & Tech. Amity University Rajasthan, Jaipur, India ABSTRACT Data sharing among organizations is considered to be useful as it offers mutual benefits for effective decision making and business growth. Data mining techniques can be applied on this shared data which can help in extracting meaningful, useful, previously unknown and ultimately comprehensible information from large databases. This ultimately leads to knowledge discovery and the mined knowledge can be used for irrefutable profits by both the parties. However information which is an important asset to business organizations, when shared raises an issue of privacy breach. Though this paper, privacy preserving clustering for centralized data through scaling based transformation is being introduced. Keywords: Data mining, Clustering, Privacy Preservation, Scaling I INTRODUCTION The information age has enabled many organizations to gather large volume of data. However, the usefulness of this data is negligible if “meaningful information” or “knowledge” cannot be extracted from it and is not put to best use in future to increase effectiveness. Data mining otherwise known as knowledge discovery is the technique used by INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) ISSN 0976 – 6367(Print) ISSN 0976 – 6375(Online) Volume 4, Issue 3, May-June (2013), pp. 449-454 © IAEME: Journal Impact Factor (2013): 6.1302 (Calculated by GISI) IJCET © I A E M E
  2. 2. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976- 6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME 450 analysts to find out the hidden and unknown pattern from the collection of data which can be put to great use for deducing convincing opportunities. In contrast to standard statistical methods, data mining techniques search for interesting information. Many techniques like classification, clustering, association rule mining, etc. can be applied for mining knowledge from large databases. Confidentiality Issues in Data Mining: It can be seen that there are situations where sharing of data among organizations can lead to mutual gain. But a key issue that arises in any kind of sharing of data is that of confidentiality. The need for privacy is sometimes due to law (e.g., for medical databases) or can be motivated by business interests. Therefore the issue raises a challenge for researchers for finding techniques to preserve the privacy of data among the communicating parties. Most privacy preserving data mining methods use some form of transformation on data to perform privacy preservation. Typically, such methods reduce the granularity of representation to preserve privacy. This paper presents a technique of privacy preserving clustering where irreversible scaling transformation applied on centralized data stored in a data matrix can lead to preserving of confidentiality yet not changing the nature of the data and the relationship existing between the data objects. II. RELATED WORK [1] suggests the method of privacy preserving computation of cluster means. It is done using two protocols ( one based on oblivious polynomial evaluation and second on homomorphic encryption). In [2], the k-means technique is used to preserve privacy of vertically partitioned data. Vertically partitioned data means the complete attribute set of database is divided into two or more sets and each set serves as individual database. [3] suggests the decision tree technique for privacy preserving over vertically partitioned data. [4] suggests the method for privacy preserving clustering by Rotation Based Technique(RBT) which is very effective method concentrated mainly on isometric transformation. [5] presents an algorithm for privacy preservation for Support Vector Machine(SVM) based classification using local and global models. Local models are local to each party which are not disclosed while generating global model jointly. The global model remains the same for every party which is then used for classifying new data objects. [6] represents the modified k-means algorithm for privacy preserving. A privacy preserving protocol for k-clustering is used on horizontally partitioned databases. Many more privacy preservation techniques has been presented in [6] for Naive Bayes and Decision Tree classification. [7] presented various techniques for privacy preservation for different procedures of data mining. An algorithm is suggested for preserving privacy in association rule mining. A subroutine has also been presented for securely finding the closest cluster in k-means clustering for privacy preservation. [8] represents various cryptographic techniques for privacy preserving. [9] presents the theoretical and experimental results to demonstrate that most probably the random data distortion preserves little data privacy.
  3. 3. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976- 6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME 451 III. PRIVACY PRESERVING CLUSTERING BY DATA MATRIX TRANSFORMATION A. Terms Used a. Data Matrix Objects (e.g. individuals, patterns, events) are usually represented as points (vectors) in a multidimensional space. Each dimension represents a distinct attribute describing the object. Thus, an object is represented as an m x n matrix D, where there are m rows, one for each object, and n columns, one for each attribute. This matrix is referred to as a data matrix, represented as follows:             mnmkm nk nk aaa aaa aaa .. ..... .. .. 1 2221 1111 B. Assumption 1) In the paper an effort to secure attributes with numeric values is made, with an assumption that numeric data (e.g. salary, age, phone number, etc.). is definitely the most sensitive data that needs to be secured. C. General Approach Scaling Based Transformation (SBT) method: Let Dmxn be a data matrix, where each row represents an object, and each object contains values for each of n numerical attributes. The SBT method of dimension n is an ordered pair, defined as SBT = (D, fs), where: 1. D R mxn is a normalized data matrix of objects to be clustered 2. fs is scaling based transformation function In this procedure as the scaling operation of data matrix is used , which is taken as 2D transformation. So it is mandatory to decide the scaling factor. Here it is supposed to be kept same in both the x and y direction. Doing so will lead to shifting point on a higher scale. This is the key factor of maintaining the cluster distribution before and after the transformation. Even the points will be distorted as compared to original data points, the cluster distribution remains the same. Thus this procedure preserves the privacy without distorting the data mining results before transformation.
  4. 4. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976- 6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME 452 D. Proposed Algorithm SBT_Algorithm Input : Dmxn // Dmxn is normalized data matrix Output: D'mxn 1. k ← n/2 2. Pk ← k pairs(Ai,Aj) in D such that 1 ≤ i,j ≤ n and i ≠ j 3. Decide scaling factor s. 4. For each selected pair Pk in pairs(d) do a. V(A'i,A'j) ← S X V(Ai,Aj) // S is scaling matrix with s as scaling factor End for End E. Results For performing the proposed procedure iris2D dataset is used which contains 150 records. We have performed the clustering operation using Weka 3.6. We have used simple k-Means clustering algorithm for our dataset. 1) Cluster distribution before transformation. Figure 1- Cluster Distribution before transformation This output shows that 100 records belong to first cluster (cluster 0) and rest of 50 records belong to second cluster (cluster 1). After this the transformed data set is supplied to Weka for k-Means clustering and the visualized output is as shown below. 2) Cluster distribution after transformation.
  5. 5. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976- 6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME 453 Figure 2- Cluster Distribution after transformation Comparing Figure 1 and Figure 2 it is clear that the cluster distribution before and after transformation remains the same. Hence our procedure works effectively to maintain privacy for the confidential numeric data. F. Security The above stated procedure provides security to the numeric data. It means even if the standard deviation and mean of the numeric dataset is published then also the original numeric data of dataset before transformation cannot be interpreted correctly. This is accomplished mainly in two steps: 1) Data Camouflage: First we try to conceal raw data by normalization. Obviously it is not secure but it is beneficial in two ways a) It gives an equal weight to all attributes and b) It makes difficult the re-identification of objects with other datasets. 2) Attribute Distortion: By scaling two attribute values at a time attribute distortion is achieved IV. CONCLUSION In this paper, a scaling based transformation method has been introduced for Privacy Preserving Clustering on Centralized Data. The proposed method is designed to preserve privacy only for numeric confidential data. This procedure also ensures the similar cluster distributions before and after transformation. This method is clustering algorithm independent. Moreover unsuccessful attempt is also made to recover original data from normalized data which ensures the security of data after transformation without changes in cluster distribution. Nowadays whatever data is required at particular site only that data is stored locally. So the complete dataset is stored in distributed manner. Doing so maintains the availability of data and also reduces the load of data server. Hence as a part of future work this procedure can be applied to the distributed data by making some changes for preserving privacy. This would lead to better method for maintaining confidentiality of distributed (Horizontally/Vertically) data.
  6. 6. International Journal of Computer Engineering and Technology (IJCET), ISSN 0976- 6367(Print), ISSN 0976 – 6375(Online) Volume 4, Issue 3, May – June (2013), © IAEME 454 V. REFERENCES [1] "Privacy Preserving Clustering" by S.Jha, L. Kruger, P. McDaniel [2] "Privacy Preserving KMeans Clustering over Vertically Partitioned Data" by Jaideep Vaidya,Chris Clifton in SIGKDD 2003. [3] "Privacy Preserving Decision Trees over Vertically Partitioned Data" by Jaideep Vaidya,Chris Clifton, Murat Kantarcioglu, A. Scott Patterson at ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 3, Article 14, Publication date: October 2008. [4] "Privacy Preserving Spatio-Temporal Clustering on Horizontally Partitioned Data" Ali Inan, Yucel Saygin [5] "Privacy Preserving SVM Classification on Vertically Partitioned Data" Hwanjo Yu, Jaideep Vaidya, Xiaoqian Jiang. [6] "Communication Efficient Privacy-Preserving Clustering" Geetha Jagannathan, Krishnan Pillaipakkamnatt, Rebecca N. Wright, Daryl, Umano [7] A thesis on "Privacy Preserving Data Mining Over Vertically Partitioned Data" by Jaideep Shrikant Vaidya. [8] "Cryptographic techniques for privacy-preserving data mining" by Benny Pinkas. [9] "Random Data Perturbation Techniques and Privacy Preserving Data Mining " by Hillol Kargupta, Souptik Dutta, Qi Wang, Krishnamoorthy Sivakumar. [10] Deepika Khurana and Dr. M.P.S Bhatia, “Dynamic Approach To K-Means Clustering Algorithm”, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, 2013, pp. 204 - 219, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.