- The document discusses privacy-preserving clustering on distorted data using singular value decomposition (SVD) and sparsified singular value decomposition (SSVD).
- It applies SVD and SSVD to a real-world dataset of 100 terrorists with 42 attributes, generating distorted versions of the data.
- K-means clustering is then performed on the original and distorted datasets for different numbers of clusters (k). The results show that SSVD groups the data objects into clusters more effectively than either the original or the SVD-distorted dataset, while preserving data privacy as measured by several metrics.
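As an illustration of the pipeline described above, here is a minimal sketch assuming a rank-10 truncation and a simple magnitude threshold for sparsification; the paper's actual ranks, thresholds, and privacy metrics are not reproduced.

```python
# Minimal sketch of SVD/SSVD data distortion followed by k-means.
import numpy as np
from sklearn.cluster import KMeans

def svd_distort(X, r):
    """Rank-r SVD approximation of the data matrix X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

def ssvd_distort(X, r, eps=0.05):
    """Sparsified SVD: zero out small entries of U and V before reconstructing."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U = np.where(np.abs(U) < eps, 0.0, U)
    Vt = np.where(np.abs(Vt) < eps, 0.0, Vt)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 42))            # stand-in for the 100x42 dataset
for name, data in [("original", X),
                   ("svd", svd_distort(X, r=10)),
                   ("ssvd", ssvd_distort(X, r=10))]:
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(data)
    print(name, np.bincount(labels))      # cluster sizes per dataset
```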
Data mining techniques are used to retrieve knowledge from large databases, helping organizations run their business effectively in a competitive world. Sometimes, however, this violates the privacy of individual customers. In this paper we propose an algorithm that addresses the privacy issues of individual customers, along with a transformation technique based on the Walsh-Hadamard transformation (WHT) and rotation. The WHT generates an orthogonal matrix that transfers the entire dataset into a new domain while maintaining the distances between data records; because such records can be reconstructed by statistical techniques (i.e., inverting the matrix), we resolve this problem by applying a rotation transformation. In this work, the rotation transformation increases the difficulty for unauthorized persons of recovering the original data of other organizations. The experimental results show that the proposed transformation gives the same classification accuracy as the original data set. We compare the results with existing techniques such as data perturbation with Simple Additive Noise (SAN) and Multiplicative Noise (MN), the Discrete Cosine Transformation (DCT), wavelets, and the First and Second order sum and Inner product Preservation (FISIP) transformation. Based on privacy measures, the paper concludes that the proposed transformation technique better maintains the privacy of individual customers.
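The distance-preservation property and the extra rotation step can be sanity-checked in a few lines. The sketch below assumes a normalized Hadamard matrix of power-of-two order and a random orthogonal rotation; the paper's actual rotation parameters are not given here.

```python
# Sketch of WHT-based transformation plus a secret random rotation.
import numpy as np
from scipy.linalg import hadamard
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
n, d = 8, 8                       # Hadamard order must be a power of two
X = rng.normal(size=(n, d))

H = hadamard(d) / np.sqrt(d)      # orthonormal Walsh-Hadamard matrix
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # secret random rotation

Y = X @ H                          # WHT alone: distances kept, but invertible
Z = Y @ Q                          # rotation raises the cost of inversion

print(np.allclose(pdist(X), pdist(Y)))  # True: WHT preserves distances
print(np.allclose(pdist(X), pdist(Z)))  # True: the rotation is also orthogonal
```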
Privacy Preservation and Restoration of Data Using Unrealized Data Sets (IJERA Editor)
In today’s world, advances in hardware technology have increased the capability to store and record personal data about consumers and individuals. Data mining extracts knowledge that successfully supports a variety of areas such as marketing, medical diagnosis, weather forecasting, and national security. Still, it remains a challenge to extract certain kinds of data without violating the data owners’ privacy. As data mining becomes more pervasive, such privacy concerns are increasing. This has given birth to a new category of data mining method called privacy preserving data mining (PPDM) algorithms, whose aim is to protect the sensitive information contained in large data sets. The privacy preservation of a data set can be expressed in the form of a decision tree. This paper proposes a privacy preservation scheme based on data set complement algorithms, which store the information of the real dataset, so that the private data are safe from unauthorized parties; if some portion of the data is lost, the original data set can be recreated from the unrealized dataset and the perturbed data set.
Distance based transformation for privacy preserving data mining using hybrid... (csandit)
Data mining techniques are used to retrieve knowledge from large databases, helping organizations run their business effectively in a competitive world. Sometimes, however, this violates the privacy of individual customers. This paper addresses the privacy issues of individual customers and proposes a transformation technique based on the Walsh-Hadamard transformation (WHT) and rotation. The WHT generates an orthogonal matrix that transfers the entire dataset into a new domain while maintaining the distances between data records; because such records can be reconstructed by statistical techniques (i.e., inverting the matrix), this problem is resolved by applying a rotation transformation, which increases the difficulty for unauthorized persons of recovering the original data of other organizations. The experimental results show that the proposed transformation gives the same classification accuracy as the original data set. We compare the results with existing techniques such as data perturbation with Simple Additive Noise (SAN) and Multiplicative Noise (MN), the Discrete Cosine Transformation (DCT), wavelets, and the First and Second order sum and Inner product Preservation (FISIP) transformation. Based on privacy measures, the paper concludes that the proposed transformation technique better maintains the privacy of individual customers.
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Mining (idescitation)
Nowadays, data sharing between two organizations is common in many application areas such as business planning or marketing. When data are to be shared between parties, there may be sensitive data which should not be disclosed to the other parties. Medical records in particular are highly sensitive, so privacy protection is taken more seriously there. As required by the Health Insurance Portability and Accountability Act (HIPAA), it is necessary to protect the privacy of patients and ensure the security of medical data. To address this problem, released datasets must unavoidably be modified. We propose and implement a method called the hybrid approach for privacy preserving. First we randomize the original data; then we apply generalization to the randomized data. This technique protects private data with better accuracy, and it can also reconstruct the original data and provide data with no information loss, preserving data usability.
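A minimal sketch of the two-stage hybrid follows, assuming additive Gaussian noise for the randomization step and decade-wide bins for the generalization step; both choices are illustrative, not the paper's.

```python
# Two-stage hybrid: randomize, then generalize into ranges.
import numpy as np

rng = np.random.default_rng(2)
ages = np.array([23, 31, 45, 52, 38, 29, 61])

noisy = ages + rng.normal(0, 2.0, size=ages.shape)       # step 1: randomize
bins = (noisy // 10 * 10).astype(int)                    # step 2: generalize
generalized = [f"{b}-{b + 9}" for b in bins]
print(generalized)   # e.g. ['20-29', '30-39', '40-49', ...]
```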
Using Randomized Response Techniques for Privacy-Preserving Data Mining (14894)
Privacy is an important issue in data mining and knowledge discovery. In this paper, we propose to use randomized response techniques to conduct the data mining computation. Specifically, we present a method to build decision tree classifiers from the disguised data. We conduct experiments to compare the accuracy of our decision tree with the one built from the original undisguised data. Our results show that although the data are disguised, our method can still achieve fairly high accuracy. We also show how the parameter used in the randomized response techniques affects the accuracy of the results.
Keywords
Privacy, security, decision tree, data mining
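The classic Warner-style variant of randomized response can be sketched as follows; the truth probability p = 0.8 and the true "yes" rate are assumptions for illustration.

```python
# Warner-style randomized response: answer truthfully with probability p,
# otherwise flip; the analyst inverts the distortion to estimate the truth.
import numpy as np

rng = np.random.default_rng(3)
p = 0.8                                   # truth probability (assumed value)
truth = rng.random(10_000) < 0.3          # 30% true "yes" rate

flip = rng.random(truth.size) >= p
reported = np.where(flip, ~truth, truth)  # disguised responses

lam = reported.mean()                     # observed "yes" proportion
pi_hat = (lam - (1 - p)) / (2 * p - 1)    # unbiased estimate of the true rate
print(round(pi_hat, 3))                   # close to 0.30
```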
On distributed fuzzy decision trees for big data (nexgentechnology)
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION (cscpconf)
Huge volumes of detailed personal data are regularly collected and analyzed by applications using data mining, and sharing these data is beneficial to the application users. On one hand such data is an important asset to business organizations and governments for decision making; at the same time, analysing such data opens threats to privacy if not done properly. This paper aims to release useful information while protecting sensitive data. We use a vector quantization technique for preserving privacy: quantization is performed on training data samples to produce a transformed data set. This transformed data set does not reveal the original data, hence privacy is preserved.
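A minimal sketch of the idea, assuming a k-means codebook as the vector quantizer; the codebook size is an illustrative choice.

```python
# VQ-based release: replace each record with its nearest codeword.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))                      # training data samples

vq = KMeans(n_clusters=16, n_init=10, random_state=0).fit(X)
codebook = vq.cluster_centers_                     # the VQ code book
X_released = codebook[vq.predict(X)]               # quantized, transformed data

print(X_released.shape, np.unique(X_released, axis=0).shape)  # only 16 distinct rows
```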
A statistical data fusion technique in virtual data integration environment (IJDKP)
Data fusion in the virtual data integration environment starts after detecting and clustering duplicated records from the different integrated data sources. It refers to the process of selecting or fusing attribute values from the clustered duplicates into a single record representing the real-world object. In this paper, a statistical technique for data fusion is introduced, based on probabilistic scores from both the data sources and the clustered duplicates.
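A toy sketch of score-based fusion follows; the per-source reliability scores and the attribute set are assumptions, not values from the paper.

```python
# Fuse a cluster of duplicates: per attribute, keep the value whose summed
# source-reliability score is highest.
from collections import defaultdict

duplicates = [                       # one cluster of duplicate records
    {"source": "A", "city": "Pune",  "age": 34},
    {"source": "B", "city": "Pune",  "age": 35},
    {"source": "C", "city": "Poona", "age": 34},
]
reliability = {"A": 0.9, "B": 0.6, "C": 0.5}   # assumed probabilistic scores

fused = {}
for attr in ("city", "age"):
    scores = defaultdict(float)
    for rec in duplicates:
        scores[rec[attr]] += reliability[rec["source"]]
    fused[attr] = max(scores, key=scores.get)
print(fused)   # {'city': 'Pune', 'age': 34}
```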
Additive gaussian noise based data perturbation in multi level trust privacy ...IJDKP
Data perturbation is one of the most popular models used in pr
ivacy preserving data mining. It is specially
convenient for applications where the data owners need to export/publi
sh the privacy-sensitive data. This
work proposes that an Additive Perturbation based Privacy Pre
serving Data Mining (PPDM) to deal with
the problem of increasing accurate models about all data without
knowing exact details of individual
values. To Preserve Privacy, the approach establishes R
andom Perturbation to individual values before
data are published. In Proposed system the PPDM approach introd
uces Multilevel Trust (MLT) on data
miners. Here different perturbed copies of the similar data a
re available to the data miner at different trust
levels and may mingle these copies to jointly gather extra infor
mation about original data and release the
data is called diversity attack. To prevent this attack ML
T-PPDM approach is used along with the addition
of random Gaussian noise and the noise is properly correlated to
the original data, so the data miners
cannot get diversity gain in their combined reconstruction.
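A sketch of the correlated-noise idea, assuming noise covariance proportional to the sample covariance and three illustrative trust levels; the cross-copy correlation structure of the full MLT-PPDM scheme is not reproduced.

```python
# Gaussian noise whose covariance is proportional to the data covariance,
# generated at several trust levels (proportionality constants assumed).
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], size=500)

cov = np.cov(X, rowvar=False)
copies = {}
for trust, scale in [("high", 0.1), ("medium", 0.5), ("low", 1.0)]:
    noise = rng.multivariate_normal(np.zeros(2), scale * cov, size=len(X))
    copies[trust] = X + noise     # correlated noise per trust level

print({t: np.round(np.cov(c, rowvar=False), 2).tolist()
       for t, c in copies.items()})
```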
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata is the information about the data to be stored in a data warehouse, and it is a mandatory element for building an efficient data warehouse. Metadata helps in data integration, lineage, data quality, and populating transformed data into the warehouse. Spatial data warehouses are based on spatial data, mostly collected from Geographical Information Systems (GIS) and the transactional systems specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where it is mandatory to bring spatial information and data modeling together. In this paper, we present a holistic metadata framework that drives metadata creation for a spatial data warehouse. Theoretically, the proposed framework improves the efficiency of accessing data in response to frequent queries on SDWs; in other words, it decreases query response time while accurate information, including the spatial information, is fetched from the data warehouse.
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVA... (IJDKP)
Huge volumes of data from domain-specific applications such as medical, financial, library, telephone, and shopping records are regularly generated. Sharing these data has proved beneficial for data mining applications: on one hand such data is an important asset for business decision making through analysis; on the other hand, data privacy concerns may prevent data owners from sharing information for data analysis. In order to share data while preserving privacy, the data owner must come up with a solution that achieves the dual goals of privacy preservation and accuracy of the data mining task, namely clustering and classification. An efficient and effective approach is proposed that aims to protect the privacy of sensitive information and obtain data clustering with minimum information loss.
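A hedged sketch of tuple-level multiplicative perturbation follows; the per-tuple factor distribution is an assumption, and the paper's exact tuple-value-based rule is not reproduced.

```python
# Multiplicative perturbation: scale every tuple by its own random factor.
import numpy as np

rng = np.random.default_rng(6)
X = rng.uniform(10, 100, size=(8, 3))              # original tuples

factors = rng.normal(1.0, 0.1, size=(len(X), 1))   # one factor per tuple
X_pert = X * factors                               # released, perturbed data

print(np.round(X_pert / X, 2))   # each row shares one multiplicative factor
```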
Distributed Digital Artifacts on the Semantic Web (Editor IJCATR)
Distributed digital artifacts incorporate cryptographic hash values into URIs, called trusty URIs, in a distributed environment, building verifiable and immutable web resources of good quality to prevent the rising man-in-the-middle attack. The greatest challenge of a centralized system is that it gives users no way to check whether data have been modified, and communication is limited to a single server. The solution is a distributed digital artifact system, where resources are distributed among different domains to enable inter-domain communication. Due to emerging developments in the web, attacks have increased rapidly, among which the man-in-the-middle attack (MIMA) is a serious issue that threatens user security. This work tries to prevent MIMA to an extent by providing self-reference and trusty URIs even in a distributed environment. Any manipulation of the data is efficiently identified, and any further access to that data is blocked by informing the user that the uniform location has changed. The system uses self-reference to embed a trusty URI in each resource, a lineage algorithm for generating seeds, and the SHA-512 hash algorithm to ensure security. It is implemented on the semantic web, an extension of the world wide web, using RDF (Resource Description Framework) to identify resources. The framework thus overcomes existing challenges by distributing the digital artifacts on the semantic web, enabling secure communication between different domains across the network and thereby preventing MIMA.
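The trusty-URI verification step can be sketched with SHA-512 as follows; the URI layout is illustrative, not the exact trusty-URI specification.

```python
# Embed a SHA-512 digest of the resource content in the URI, verify on fetch.
import hashlib

def make_trusty_uri(base: str, content: bytes) -> str:
    digest = hashlib.sha512(content).hexdigest()
    return f"{base}#sha512={digest}"

def verify(uri: str, content: bytes) -> bool:
    expected = uri.rsplit("#sha512=", 1)[1]
    return hashlib.sha512(content).hexdigest() == expected

uri = make_trusty_uri("https://example.org/artifact/42", b"original RDF data")
print(verify(uri, b"original RDF data"))   # True
print(verify(uri, b"tampered RDF data"))   # False: access can be blocked
```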
MDAV2K: A VARIABLE-SIZE MICROAGGREGATION TECHNIQUE FOR PRIVACY PRESERVATION (cscpconf)
Public and private organizations are collecting personal data about the day-to-day lives of individuals and accumulating them in large databases. Data mining techniques may be applied to such databases to extract useful hidden knowledge, but releasing the databases for data mining purposes may breach individual privacy. Therefore the databases must be protected by privacy preservation techniques before being released. Microaggregation is a privacy preservation technique used by the statistical disclosure control community as well as the data mining community for microdata protection. Maximum Distance to Average Vector (MDAV) is a very popular multivariate fixed-size microaggregation technique studied by many researchers. The principal goal of such techniques is to preserve privacy without much information loss. In this paper we propose a variable-size, improved MDAV technique having low information loss.
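For reference, a compact sketch of the fixed-size MDAV baseline that the paper improves on; the classical algorithm forms two groups per iteration, while this simplification forms one.

```python
# Fixed-size MDAV sketch: repeatedly cluster the record farthest from the
# centroid with its k-1 nearest neighbours, then publish cluster means.
import numpy as np

def mdav(X, k):
    idx = list(range(len(X)))
    out = np.empty_like(X, dtype=float)
    while len(idx) >= 2 * k:
        pts = X[idx]
        centroid = pts.mean(axis=0)
        far = idx[int(np.argmax(np.linalg.norm(pts - centroid, axis=1)))]
        d = np.linalg.norm(X[idx] - X[far], axis=1)
        group = [idx[i] for i in np.argsort(d)[:k]]   # far record + neighbours
        out[group] = X[group].mean(axis=0)            # publish the cluster mean
        idx = [i for i in idx if i not in group]
    if idx:
        out[idx] = X[idx].mean(axis=0)                # leftovers form one cluster
    return out

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 2))
print(len(np.unique(mdav(X, k=3), axis=0)))   # few distinct published rows
```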
Evaluating the efficiency of rule techniques for file classification (eSAT Journals)
Text mining refers to the process of deriving high-quality information from text. Also known as knowledge discovery from text (KDT), it deals with the machine-supported analysis of text and is used in areas such as information retrieval, marketing, information extraction, natural language processing, and document similarity. Document similarity is one of the important techniques in text mining, and its first step is to classify files based on their category. In this research work, various classification rule techniques are used to classify computer files based on their extensions (for example, pdf, doc, ppt, xls, and so on). There are several rule classifier algorithms, such as decision table, JRip, Ridor, DTNB, NNge, PART, OneR, and ZeroR. Here, three of them, namely decision table, DTNB, and OneR, are used to classify computer files by extension. The results produced by these algorithms are analyzed using classification accuracy and error rate as performance factors. From the experimental results, DTNB proves to be more efficient than the other two techniques. Index Terms: Data mining, Text mining, Classification, Decision table, DTNB, OneR
Recommendation system using bloom filter in mapreduce (IJDKP)
Many clients like to use the Web to discover product details in the form of online reviews provided by other clients and specialists. Recommender systems provide an important response to the information overload problem, as they present users with more practical and personalized information services. Collaborative filtering methods are a vital component of recommender systems, as they generate high-quality recommendations by leveraging the preferences of a community of similar users; the method assumes that people with similar tastes choose the same items. Conventional collaborative filtering systems suffer from the sparse-data problem and a lack of scalability, so a new recommender system is required to deal with sparse data and produce high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model widely used for large-scale data analysis. The described recommendation mechanism for mobile commerce is user-based collaborative filtering using MapReduce, which reduces the scalability problem of conventional CF systems. One of the essential operations for this analysis is the join, but MapReduce is not very efficient at joins because it always processes all records in the datasets even when only a small fraction is relevant to the join. This problem can be reduced by applying the bloomjoin algorithm: Bloom filters are constructed and used to filter out redundant intermediate records. The proposed algorithm reduces the number of intermediate results and improves join performance.
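A small sketch of the Bloom-join idea; the filter size, hash count, and datasets are illustrative choices.

```python
# Build a Bloom filter over the join keys of the smaller dataset, then drop
# non-matching records before the expensive join.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

users = {"u1", "u2", "u3"}                      # small side of the join
ratings = [("u1", 5), ("u9", 2), ("u3", 4)]     # large side (map input)

bf = BloomFilter()
for u in users:
    bf.add(u)
# mappers emit only records that might join, cutting intermediate output
print([r for r in ratings if bf.might_contain(r[0])])  # drops ('u9', 2)
```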
TWO PARTY HIERARICHAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET (IJDKP)
Data mining is a task in which data is extracted from a large database and put into an understandable form or structure for further use. In this paper we present an approach that applies hierarchical clustering over a horizontally partitioned data set. We explain the required algorithms, such as hierarchical clustering and algorithms for finding the minimum closest clusters, and we also explain two-party computation. Privacy of data is of the utmost importance these days, hence we present an approach by which privacy preservation can be applied over two parties that distribute their data horizontally, and we explain the hierarchical clustering applied in our method.
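The "minimum closest clusters" step can be sketched as a single-linkage closest-pair search; the cluster contents are illustrative.

```python
# Find the pair of clusters with minimum single-link distance.
import numpy as np

def closest_pair(clusters):
    """Return the indices of the two clusters to merge next."""
    best, pair = np.inf, (None, None)
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(np.linalg.norm(a - b)
                    for a in clusters[i] for b in clusters[j])
            if d < best:
                best, pair = d, (i, j)
    return pair

clusters = [np.array([[0, 0], [1, 0]]),   # one party's partition (illustrative)
            np.array([[5, 5]]),
            np.array([[1, 1]])]
print(closest_pair(clusters))   # (0, 2): merge these two next
```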
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... (acijjournal)
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items, and it stands as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in the scientific domain are voluminous, and processing such data requires state-of-the-art computing machines whose infrastructure is expensive to set up. Hence a distributed environment such as a clustered setup is employed for tackling such scenarios. The Apache Hadoop distribution is one of the cluster frameworks in distributed environments that helps by distributing voluminous data across a number of nodes in the framework. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
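A toy map/reduce-flavoured counting pass of Apriori follows; a full implementation would iterate over itemset sizes with candidate pruning, and the minimum support here is illustrative.

```python
# One map/reduce pass: mappers emit candidate itemsets per transaction,
# the reducer sums counts and keeps the frequent ones.
from collections import Counter
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "beer"},
                {"bread", "milk", "beer"}, {"milk", "beer"}]
min_support = 2

def map_phase(tx, size):
    # emit (itemset, 1) pairs, as a mapper would
    return [(frozenset(c), 1) for c in combinations(sorted(tx), size)]

def reduce_phase(pairs):
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return {s: c for s, c in counts.items() if c >= min_support}

emitted = [p for tx in transactions for p in map_phase(tx, 2)]
print(reduce_phase(emitted))   # frequent 2-itemsets with their counts
```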
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy (Kato Mivule)
Genomic data provides clinical researchers with vast opportunities to study various patient ailments. Yet the same data contains revealing information, some of which a patient might want to remain concealed. The question then arises: how can an entity transact in full DNA data while concealing certain sensitive pieces of information in the genome sequence, and maintain DNA data utility? As a response to this question, we propose a codon frequency obfuscation heuristic, in which a redistribution of codon frequency values with highly expressed genes is done in the same amino acid group, generating an obfuscated DNA sequence. Our preliminary results show that it might be possible to publish an obfuscated DNA sequence with a desired level of similarity (utility) to the original DNA sequence. http://arxiv.org/abs/1405.5410
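A heavily simplified sketch of the obfuscation idea: replacing codons with random synonymous codons changes codon frequencies while keeping the encoded amino acids, a proxy for the utility notion in the abstract. The truncated codon table and the uniform choice rule are assumptions.

```python
# Swap each codon for a random synonymous codon (same amino acid group).
import random

SYNONYMS = {                      # partial codon table, grouped by amino acid
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",   # alanine
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",   # glycine
}
BY_AA = {}
for codon, aa in SYNONYMS.items():
    BY_AA.setdefault(aa, []).append(codon)

def obfuscate(seq, rng=random.Random(0)):
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join(rng.choice(BY_AA[SYNONYMS[c]]) if c in SYNONYMS else c
                   for c in codons)

print(obfuscate("GCTGGTGCAGGG"))  # same protein, different codon frequencies
```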
A Survey Paper on an Integrated Approach for Privacy Preserving In High Dimen... (IJSRD)
Data mining is a technique used to extract knowledge and information from large amounts of data collected by hospitals, governments, and individuals; the term is also referred to as knowledge mining from databases. The major challenge in data mining is ensuring the security and privacy of data in databases, because data sharing is common at the organizational level. The data in databases come from a number of sources, such as medical, financial, library, marketing, and shopping records, so keeping those data secure is a foremost task. The objective is to achieve fully privacy-preserved data without affecting data utility, i.e., to control how data is used or transferred between organizations so that data integrity remains while sensitive and confidential data is preserved. This paper presents a brief study of different PPDM techniques, such as randomization, perturbation, slicing, and summarization, by which data privacy can be preserved. The technique achieving the best computational and theoretical outcome is chosen for privacy preserving in high-dimensional data.
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma... (IJECEIAES)
Leakage and misuse of sensitive data is a challenging problem for enterprises, and it has become more serious with the advent of cloud and big data. The rationale behind this is the increase in outsourcing of data to the public cloud and publishing data for wider visibility. Therefore Privacy Preserving Data Publishing (PPDP), Privacy Preserving Data Mining (PPDM), and Privacy Preserving Distributed Data Mining (PPDDM) are crucial in the contemporary era. PPDP and PPDM can protect privacy at the data and process levels, respectively. With big data, privacy protection has become indispensable, because data is stored and processed in semi-trusted environments. In this paper we propose a comprehensive methodology for effective sanitization of data based on a misusability measure, to preserve privacy and eliminate data leakage and misuse. We follow a hybrid approach that caters to the needs of privacy-preserving MapReduce programming, and we propose the Misusability Measure-Based Privacy Preserving Algorithm (MMPP), which considers the level of misusability before choosing and applying the appropriate sanitization to big data. Our empirical study with Amazon EC2 and EMR revealed that the proposed methodology is useful in realizing privacy-preserving MapReduce programming.
Performance analysis of perturbation-based privacy preserving techniques: an ... (IJECEIAES)
Nowadays, enormous amounts of data are produced every second, including private information from sources such as media platforms, the banking sector, finance, healthcare, and criminal histories. Data mining is a method for searching and analyzing massive volumes of data to find usable information. Preserving personal data during data mining has become difficult, so privacy-preserving data mining (PPDM) is used to do so. Data perturbation is one of the several tactics used by the PPDM data privacy protection mechanism: datasets are perturbed in order to preserve personal information, addressing both data accuracy and data privacy. This paper explores and compares several hybrid perturbation strategies that may be used to protect data privacy. Two perturbation-based techniques, improved random projection perturbation (IRPP) and an enhanced principal component analysis-based technique (EPCAT), are used to assess the precision, run time, and accuracy of the experimental results. The paper presents the impact of perturbation-based privacy-preserving techniques, and it is observed that the hybrid approaches are more efficient than the traditional approach.
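For orientation, a generic random-projection perturbation sketch of the family IRPP belongs to; this is not the IRPP or EPCAT algorithm itself, and the dimensions are illustrative.

```python
# Project data through a random matrix: exact values are hidden while
# pairwise distances are approximately preserved (Johnson-Lindenstrauss).
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 50))

k = 20                                            # reduced dimension
R = rng.normal(0, 1 / np.sqrt(k), size=(50, k))   # random projection matrix
Y = X @ R                                         # released, perturbed data

err = np.abs(pdist(Y) - pdist(X)) / pdist(X)
print(round(err.mean(), 3))   # average relative distance distortion stays modest
```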
Cluster Based Access Privilege Management Scheme for Databases (Editor IJMTER)
Knowledge discovery is carried out using data mining techniques; association rule mining, classification, and clustering operations are carried out under data mining. Clustering is used to group records based on relevancy, with distance or similarity measures used to estimate the relationships between transactions. Census data and medical data are referred to as microdata. Data publishing schemes are used to provide private data for analysis, and privacy preservation is used to protect private data values, with anonymity considered in the privacy preservation process.
Data values are made available to authorized users through access control models. A Privacy Protection Mechanism (PPM) uses suppression and generalization of relational data to anonymize and satisfy privacy needs. An accuracy-constrained privacy-preserving access control framework is used to manage access control in relational databases: the access control policies define the selection predicates available to roles, while the privacy requirement is to satisfy k-anonymity or l-diversity, and an imprecision bound constraint is assigned to each selection predicate. k-anonymous Partitioning with Imprecision Bounds (k-PIB) is used to estimate accuracy and privacy constraints. Role-based Access Control (RBAC) allows defining permissions on objects based on roles in an organization. The Top-Down Selection Mondrian (TDSM) algorithm is used for query workload-based anonymization and is constructed using greedy heuristics and a kd-tree model. Query cuts are selected with minimum bounds in the Top-Down Heuristic 1 algorithm (TDH1); the query bounds are updated as partitions are added to the output in the Top-Down Heuristic 2 algorithm (TDH2); and the cost of reduced precision in the query results is used in the Top-Down Heuristic 3 algorithm (TDH3). A repartitioning algorithm is used to reduce the total imprecision for the queries.
The privacy-preserved access privilege management scheme is enhanced to provide incremental mining features. Data insert, delete, and update operations are connected with the partition management mechanism. Cell-level access control is provided with a differential privacy method, and a dynamic role management model is integrated with the access control policy mechanism for query predicates.
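A compact sketch of the Mondrian-style top-down partitioning that TDSM and the TDH heuristics build on; the data and k are illustrative, and the workload-aware cut selection is not reproduced.

```python
# Mondrian-style partitioning for k-anonymity: recursively split on the
# widest attribute at the median while both halves keep at least k records.
import numpy as np

def mondrian(X, k):
    if len(X) < 2 * k:
        return [X]                         # cannot split without violating k
    dim = np.argmax(X.max(axis=0) - X.min(axis=0))   # widest attribute
    med = np.median(X[:, dim])
    left, right = X[X[:, dim] <= med], X[X[:, dim] > med]
    if len(left) < k or len(right) < k:
        return [X]
    return mondrian(left, k) + mondrian(right, k)

rng = np.random.default_rng(9)
data = rng.integers(18, 90, size=(40, 2)).astype(float)   # e.g. two QI columns
parts = mondrian(data, k=5)
print([len(p) for p in parts])   # every partition holds at least 5 records
```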
Privacy Preserving Approaches for High Dimensional Data (ijtsrd)
This paper proposes a model for hiding sensitive association rules for privacy preserving in high-dimensional data. Privacy preservation is a big challenge in data mining: the protection of sensitive information becomes a critical issue when releasing data to outside parties. Association rule mining could be very useful in such situations, since it can identify all the possible ways by which "non-confidential" data can reveal "confidential" data, commonly known as the "inference problem". This issue is solved using Association Rule Hiding (ARH) techniques in Privacy Preserving Data Mining (PPDM). Association rule hiding aims to conceal these association rules so that no sensitive information can be mined from the database. Tata Gayathri | N Durga, "Privacy Preserving Approaches for High Dimensional Data", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-1, Issue-5, August 2017. URL: http://www.ijtsrd.com/papers/ijtsrd2430.pdf http://www.ijtsrd.com/engineering/computer-engineering/2430/privacy-preserving-approaches-for-high-dimensional-data/tata-gayathri
A Review on Privacy Preservation in Data Mining (ijujournal)
The main focus of privacy preserving data publishing is to enhance traditional data mining techniques for masking sensitive information through data modification. The major issues are how to modify the data and how to recover the data mining results from the altered data; solutions are often tightly coupled with the data mining algorithms under consideration. Privacy preserving data publishing focuses on techniques for publishing data, not techniques for data mining; in this case, it is expected that standard data mining techniques will be applied to the published data. Anonymization of the data is done by hiding the identity of record owners, whereas privacy preserving data mining seeks to directly hide the sensitive data. This survey covers the various privacy preservation techniques and algorithms.
PRIVACY PRESERVING DATA MINING BY USING IMPLICIT FUNCTION THEOREM (IJNSA Journal)
Data mining has become a broad, significant multidisciplinary field used in vast application domains; it extracts knowledge by identifying structural relationships among the objects in large databases. Privacy preserving data mining is a newer area of data mining research that protects the privacy of sensitive knowledge extracted from a data mining system, so that it is shared only with the intended persons rather than being accessible to everyone. In this paper, we propose a new approach to privacy preserving data mining that uses the implicit function theorem for secure transformation of sensitive data obtained from a data mining system. We propose a two-way enhanced security approach: first, transforming the original values of sensitive data into different partial-derivative functional values to perturb the data; second, generating a symmetric key from the eigenvalues of the Jacobian matrix for secure computation. We give an example of converting sensitive academic data into vector-valued functions to explain the proposed concept, and we present implementation-based results of the new approach.
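A heavily hedged sketch of the two steps named above, using sympy; the vector-valued function f is an arbitrary illustrative choice, not the paper's construction.

```python
# Represent sensitive values through partial derivatives of a chosen
# vector-valued function, and derive key material from the eigenvalues of
# its Jacobian.
import sympy as sp

x, y = sp.symbols("x y")
f = sp.Matrix([x**2 + y, x * y])           # illustrative vector-valued function

J = f.jacobian([x, y])                     # matrix of partial derivatives
record = {x: 7, y: 3}                      # a sensitive record's values

perturbed = J.subs(record)                 # publish derivative values instead
key_material = list(perturbed.eigenvals().keys())   # symmetric-key seed

print(perturbed)      # Matrix([[14, 1], [3, 7]])
print(key_material)   # eigenvalues used as the shared key seed
```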
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Knowledge engineering: from people to machines and back
Privacy Preserving Clustering on Distorted data
IOSR Journal of Computer Engineering (IOSRJCE)
ISSN: 2278-0661, ISBN: 2278-8727, Volume 5, Issue 2 (Sep-Oct. 2012), PP 25-29
www.iosrjournals.org
Privacy Preserving Clustering on Distorted data
Thanveer Jahan¹, Dr. G. Narasimha², Dr. C.V. Guru Rao³
¹ M.Tech, Ph.D. (CSE), JNTU, Kukatpally, Hyderabad, A.P., India
² Assistant Professor (CSE), JNTU, Kukatpally, Hyderabad, A.P., India
³ HOD (CSE), SR Engineering College, Warangal, A.P., India
Abstract: Privacy preservation has become a major concern in designing security- and privacy-related data mining applications. Protecting sensitive or confidential information in data mining is an important long-term goal, since increased disclosure risks may be encountered when data is released. Various data distortion techniques are widely used to protect sensitive data; these approaches protect data by adding noise or by applying matrix decomposition methods. In this paper we focus primarily on the data distortion methods singular value decomposition (SVD) and sparsified singular value decomposition (SSVD). Various privacy metrics are used to measure the difference between the original and distorted datasets and the degree of privacy protection. The data mining utility, k-means clustering, is applied on these distorted datasets. Our experimental results use a real-world dataset. An efficient solution meeting the privacy requirements is achieved using singular value decomposition and sparsified singular value decomposition: the accuracy obtained on the distorted data is almost equal to that on the original dataset.
Keywords: Privacy Preserving, Data Distortion, Singular Value Decomposition (SVD), Sparsified Singular Value Decomposition (SSVD), k-means clustering.
I. Introduction
The rapid increase of data mining applications has raised a major concern in corporations: private information about individuals is collected and used for data mining purposes. With modern technology, data of various kinds are collected and exchanged at an unprecedented speed and scale. A fruitful direction for future research is to efficiently discover valuable information from large datasets while developing techniques that incorporate privacy concerns. Nowadays data is an important asset of companies, governments, and research institutions [10] and is used for various public and private interests, and much of it is sensitive to privacy issues; defense applications, financial transactions, healthcare records and network communication traffic are a few examples. Preserving privacy in sensitive domains has therefore become a major concern in data mining applications: many such applications would not be acceptable without an adequate level of privacy for sensitive information. Data can be collected at a centralized or a distributed location. In the centralized case, the major concern is to shield the exact values of the attributes from the data analysts, whereas in the distributed case data storage patterns differ, i.e. the data is horizontally or vertically distributed [1, 8]. There has been much research on privacy preserving data mining (PPDM) based on data perturbation or data distortion, randomization, and secure multi-party computation. The goal of privacy-preserving data mining techniques is to hide sensitive or confidential data values from unauthorized users while preserving data patterns; these patterns and semantics are then used to build a valid decision model on the distorted datasets. Different data mining techniques, such as classification and clustering, have been adapted for privacy protection in data processing. The best scenario is to construct a data pattern model on distorted data that is equivalent to or better than one built on the original data.
There are two approaches in this setting: distort the data so that the analysts are unaware of the original values, or modify the data mining algorithms themselves. In this paper we follow the first approach: the analysts use the distorted data matrix \bar{D}, not the original dataset D. The matrix \bar{D} cannot be used to reconstruct the original matrix D without knowing the error part E = D - \bar{D}. The analysts are thus unable to learn the original attribute (column) values, yet can still apply data mining algorithms. In this way data privacy preservation is premised on the maintenance of the data's analytical value. We transform the original dataset into a distorted dataset to protect privacy. Singular value decomposition (SVD) and its derivative, sparsified singular value decomposition (SSVD), are among the most popular techniques used to address these issues. SSVD was first introduced by Gao and Zhang [4] to reduce the cost and enhance the performance of SVD in text retrieval applications. Xu et al. applied SVD and SSVD methods in a terrorist analysis system [15, 16]. SSVD was further studied in [5], in which structural partition strategies were proposed to partition data into submatrices. In Ref. [7] privacy preserving clustering in singular value decomposition (SVD) was proposed, and the results showed that the accuracies on the original and distorted datasets are equivalent. In our work, we take a closer look at performing data distortion by singular value decomposition and
sparsified singular value decomposition. The data mining technique k-means clustering is then applied on the distorted dataset, which retains an inherent property of privacy protection.
The remainder of the paper is organized as follows. Section 2 briefly introduces related work on data analysis systems and the data distortion methods SVD and sparsified SVD, as well as k-means clustering. Section 3 discusses the various data perturbation metrics. The experiments and results are presented and discussed in Section 4. We sum up the paper and outline future plans in Section 5.
II. Related Work
2.1 Privacy preserving data mining
There has been rising concern about the disclosure of private information as data mining techniques gain popularity and become widely used in business and research. Two parties holding private data may wish to collaborate without revealing that data to the other party; in such cases privacy preserving data mining (PPDM) has major significance. PPDM develops algorithms for modifying the original data in such a way that the data and knowledge remain private even after the mining process [12]. Common techniques include data perturbation, blocking feature values, swapping tuples, etc. A PPDM scheme should maximize the degree of data modification while retaining the maximum level of data utility.
2.2 Analysis system and data distortion
A simplified model of a data analysis system consists of two parts, data manipulation and data analysis, as illustrated in Fig. 1. The original data is manipulated by the authorized user or data owner using a data distortion process, i.e. a matrix decomposition method. Data distortion is one of the important parts of many privacy preserving data mining tasks: the distortion method must preserve data privacy while at the same time keeping the utility of the data after distortion. The distorted or perturbed data is collected by analysts to perform actions such as clustering; the protected data maintains privacy because the analyst does not know the actual data values. Classical data distortion methods are based on random value perturbation [8]. Singular value decomposition (SVD) is a popular method in data mining and information retrieval [9]. SVD has numerous applications in data mining, information retrieval and image compression, in which it is often used to approximate a given matrix by a lower-rank matrix with minimum distance between them. SVD is used to reduce the dimensionality of the original dataset D. A matrix D of dimension p×q represents the original dataset; the rows and columns are the data objects and attributes, respectively. The singular value decomposition of the matrix D is [3]
D = U S V^T,
where U is a p×p orthonormal matrix, S = diag[σ1, σ2, …, σs] (s = min{p, q}) is a p×q diagonal matrix whose nonnegative diagonal entries are arranged in descending order, and V^T is a q×q orthonormal matrix. The number of nonzero diagonals of S is equal to the rank of the matrix D. The SVD transformation has the property that the maximal variation among objects is captured in the first dimension, since σ1 ≥ σi for i ≥ 2; the remaining variations are captured similarly in the second dimension, and so on. Thus, a transformed matrix with a lower dimension can be constructed to represent the original matrix, i.e.
D_r = U_r S_r V_r^T,
where U_r contains the first r columns of U, S_r contains the first r nonzero diagonals of S, and V_r^T contains the first r rows of V^T. The rank of the matrix D_r is r, and with r small, the dimensionality of the dataset is reduced from min{p, q} to r (assuming all attributes are linearly independent). D_r is provably the best rank-r approximation of D in the sense of the Frobenius norm. In data mining applications the use of D_r to represent D serves an important function: the removed part E_r = D - D_r can be considered as noise in the original dataset D [8], and mining on the reduced dataset D_r may yield better results than on the original dataset D. Thus, the distorted data D_r can provide effective protection for data privacy.
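As an illustration, here is a minimal sketch of computing the rank-r approximation D_r, written in Python with NumPy (the paper does not prescribe a language or library; the 100×42 random matrix stands in for a real dataset):

```python
import numpy as np

def svd_rank_r(D, r):
    """Return the rank-r SVD approximation D_r = U_r S_r V_r^T of D."""
    # Economy SVD: D = U S V^T, singular values in descending order.
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    # Keep only the first r singular triplets.
    Ur, sr, Vtr = U[:, :r], s[:r], Vt[:r, :]
    return Ur @ np.diag(sr) @ Vtr

# Example: distort a stand-in 100x42 dataset with a rank-10 approximation.
D = np.random.rand(100, 42)
Dr = svd_rank_r(D, r=10)
Er = D - Dr   # the removed "noise" part E_r
```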
[Figure 1. A data analysis system for clustering: Original Data → Data Manipulation (by the Authorized User) → Distorted Data → Data Analyst → K-means Clustering]
Sparsified SVD is a data distortion method better than SVD at preserving privacy. After reducing the rank of the SVD matrices, we set to zero those entries of U_r and V_r^T whose magnitudes are smaller than a certain threshold ε. This operation is referred to as a dropping operation [4]: drop u_ij in U_r if |u_ij| < ε, and drop v_ij in V_r^T if |v_ij| < ε. Let \bar{U}_r denote U_r with dropped elements and \bar{V}_r^T denote V_r^T with dropped elements; the distorted data matrix \bar{D}_r is then represented as
\bar{D}_r = \bar{U}_r S_r \bar{V}_r^T.
The sparsified SVD method is equivalent to further distorting the dataset D_r. Denoting
E_ε = D_r - \bar{D}_r,
D = \bar{D}_r + E_r + E_ε,
the data provided to the analysts is \bar{D}_r, which is thus twice distorted by the sparsified SVD method. Sparsified SVD was proposed by Gao and Zhang [4] for reducing the storage cost and enhancing the performance of SVD in text retrieval applications.
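A sketch of the dropping operation, continuing the NumPy example above (the parameter names r and eps are ours; the paper only fixes the threshold value used in the experiments):

```python
import numpy as np

def ssvd_distort(D, r, eps=1e-3):
    """Sparsified SVD: truncate to rank r, then zero out small entries
    of U_r and V_r^T before recomposing the distorted matrix."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    Ur, sr, Vtr = U[:, :r].copy(), s[:r], Vt[:r, :].copy()
    # Dropping operation: zero all entries below the threshold eps.
    Ur[np.abs(Ur) < eps] = 0.0
    Vtr[np.abs(Vtr) < eps] = 0.0
    return Ur @ np.diag(sr) @ Vtr   # \bar{D}_r

D = np.random.rand(100, 42)
Dbar = ssvd_distort(D, r=10, eps=1e-3)
```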
2.3 K-means Clustering
Clustering is a well-known problem in statistics and engineering: how to arrange a set of vectors (measurements) into a number of groups (clusters). Clustering is an important area of application for a variety of fields including data mining, statistical data analysis and vector quantization [6]. The problem has been formulated in various ways in the machine learning, pattern recognition, optimization and statistics literature. The fundamental clustering problem is that of grouping together data items that are similar to each other: given a set of data items, clustering algorithms group similar items together. Clustering has many applications, such as customer behavior analysis, targeted marketing, forensics, and bioinformatics.
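For reference, a minimal k-means run on a distorted matrix, here using scikit-learn (one possible implementation; the paper does not name a clustering library):

```python
import numpy as np
from sklearn.cluster import KMeans

Dbar = np.random.rand(100, 42)   # stand-in for the SSVD-distorted matrix
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Dbar)
print(labels[:10])               # cluster assignment of the first 10 data objects
```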
III. Data Perturbation Metrics
Privacy metrics have been proposed in the literature [2, 10]. Ref. [8] shows that the metrics of Ref. [2] are incomplete, since they require the density function of each attribute to be known a priori, which may be difficult to obtain for real-world datasets. We therefore use privacy measures which depend only on the original matrix D and its distorted matrix \bar{D}.
3.1 Value difference (VD)
The elements of the data matrix change after distortion. The value difference (VD) of the datasets is the relative value difference in the Frobenius norm: the ratio of the Frobenius norm of the difference between D and \bar{D} to the Frobenius norm of D,
VD = \|D - \bar{D}\|_F / \|D\|_F.
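This is a one-line computation in NumPy (a sketch under the definition above):

```python
import numpy as np

def value_difference(D, Dbar):
    """VD = ||D - Dbar||_F / ||D||_F (relative Frobenius-norm difference)."""
    return np.linalg.norm(D - Dbar, 'fro') / np.linalg.norm(D, 'fro')
```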
3.2 Position difference
Several metrics are used to measure the position difference of the data elements, since the order of an element changes after distortion. The dataset D has n data objects and m attributes. Let Ord_i^j denote the ascending order (rank) of the j-th element in attribute i, and \bar{Ord}_i^j denote the ascending order of the corresponding distorted element in \bar{D}. RP denotes the average change of order over all attributes:
RP = (\sum_{i=1}^{m} \sum_{j=1}^{n} |Ord_i^j - \bar{Ord}_i^j|) / (m × n).
RK represents the percentage of elements that keep their order of value in each column after the distortion:
RK = (\sum_{i=1}^{m} \sum_{j=1}^{n} RK_i^j) / (m × n),
where RK_i^j = 1 if Ord_i^j = \bar{Ord}_i^j, and 0 otherwise.
The metric CP is used to define the change of order of the average value of the attributes:
CP = (\sum_{i=1}^{m} |OrdDV_i - \bar{OrdDV}_i|) / m,
where OrdDV_i is the ascending order of the average value of attribute i and \bar{OrdDV}_i denotes that order after distortion. Like RK, we define CK to measure the percentage of attributes that keep their order of average value after distortion:
CK = (\sum_{i=1}^{m} CK_i) / m,
where CK_i = 1 if OrdDV_i = \bar{OrdDV}_i, and 0 otherwise.
The higher the values of RP and CP, and the lower the values of RK and CK, the more privacy is preserved. The privacy metrics for our dataset are calculated as shown in Table 1. The values of VD, RP and CP are higher for SSVD than for SVD; of the two distortion methods, SSVD therefore preserves privacy better than SVD, as shown in Fig. 2.
Table 1. Comparison of privacy metrics for the distortion methods

Data    VD       RP     RK      CP     CK
Org     --       --     --      --     --
SVD     0.0525   31.2   0.0251  12.2   0.12
SSVD    1.0422   37.5   0.0066  13.1   0.05
Figure 2. Performance of privacy metrics
IV. Experiments And Results
We conduct experiments on a real dataset of 100 data points. For a real-world dataset, we downloaded information about 100 terrorists (the data objects, p = 100) with 42 attributes (q = 42), such as age, place, relationship, etc. The original matrix is therefore of dimension 100×42.
4.1 Proposed Algorithm
Input: data matrix D, number of clusters k.
Output: distorted data matrix \bar{D}_r and its clusters.
Step 1: Find the sensitive or confidential attributes p_i, i = 0, 1, …, 41, in D.
Step 2: Form the matrix C = [p_0, p_1, p_2, …, p_41].
Step 3: Apply SVD to the matrix C: SVD(C) = U S V^T; the rank-r distorted matrix is C_r = U_r S_r V_r^T.
Step 4: Apply SSVD to the matrix C_r, choosing the rank r and the dropping threshold ε = 10^-3; the distorted matrix is SSVD(C_r) = \bar{U}_r S_r \bar{V}_r^T.
Step 5: Update C_r in D, giving \bar{D}_r.
Step 6: Generate clusters for the sensitive attributes in \bar{D}_r.
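A sketch of the whole pipeline under the assumptions above (NumPy plus scikit-learn; the choice of sensitive columns, rank, and k here is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def distort_and_cluster(D, sensitive_cols, r, eps=1e-3, k=3):
    """Steps 1-6: distort the sensitive attribute columns with SSVD,
    write them back into D, and cluster the result with k-means."""
    C = D[:, sensitive_cols]                                # Steps 1-2
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Ur, sr, Vtr = U[:, :r].copy(), s[:r], Vt[:r, :].copy()  # Step 3
    Ur[np.abs(Ur) < eps] = 0.0                              # Step 4: dropping
    Vtr[np.abs(Vtr) < eps] = 0.0
    Dbar = D.copy()
    Dbar[:, sensitive_cols] = Ur @ np.diag(sr) @ Vtr        # Step 5
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(Dbar)       # Step 6
    return Dbar, labels

D = np.random.rand(100, 42)        # stand-in for the 100x42 terrorist dataset
Dbar, labels = distort_and_cluster(D, sensitive_cols=list(range(42)), r=10, k=3)
```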
Table 2. Data objects and clusters (cluster assignments of 10 data objects for k = 2, 3, 4)

                Original data     SVD               SSVD
Data object     k=2  k=3  k=4     k=2  k=3  k=4     k=2  k=3  k=4
1                2    1    2       2    1    2       2    2    1
2                1    3    3       1    3    1       2    2    2
3                1    3    3       1    3    1       2    3    2
4                1    3    1       1    3    1       2    3    2
5                1    3    1       1    3    1       2    3    2
6                2    1    2       2    1    2       2    2    2
7                1    3    1       1    3    1       2    3    2
8                1    3    1       1    3    1       2    3    2
9                1    3    1       1    3    1       2    3    2
10               1    3    1       1    3    1       2    3    2
The method is illustrated for 10 data objects in Table 2. We analyzed cluster counts ranging from 2 to 4. Effectiveness is measured as the proportion of points that are grouped into the same clusters after the transformation is applied to the data; such points are called legitimate points. The transformed attribute considered here is the relationship with a terrorist group, and k denotes the number of clusters used to group the data objects. For k = 2 and k = 3 the data objects are grouped identically in the original and SVD-distorted datasets. In the SSVD dataset the data objects are grouped more effectively than in the original and SVD datasets for k = 2, 3, 4.
4.2 Measuring Accuracy
Efficiency is measured by the number of data points that are legitimate, i.e. grouped into the same clusters in the original and distorted datasets; k-means clustering does not consider noise. A misclassification error is used to quantify the potential problem of a data point migrating from one cluster to a different cluster:
ME = (1/N) × \sum_{i=1}^{k} | |Cluster_i(D)| - |Cluster_i(\bar{D}_r)| |.
Table 3. Results of misclassification error

Data objects    Original data set    Distorted dataset (SVD)    Distorted dataset (SSVD)
(points)        k=2   k=3   k=4      k=2   k=3   k=4            k=2   k=3   k=4
10              0.00  0.00  0.00     0.00  0.00  0.02           0.00  0.00  0.00
100             0.00  0.00  0.00     0.00  0.00  0.02           0.00  0.00  0.00
Ideally the misclassification error is 0%. Here N represents the number of points in the original dataset, k is the number of clusters under analysis, and |Cluster_i(D)| represents the number of legitimate data points in the i-th cluster of dataset D. The results are tabulated in Table 3. The cluster analysis yields good results for the original and distorted datasets using the SVD and SSVD distortion techniques, suggesting that our techniques achieve a feasible solution: the accuracy on the distorted data is the same as on the original data. Thus, complete privacy can be obtained in k-means cluster analysis, as also shown by the privacy metrics.
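A possible computation of ME from two clustering label vectors, under our reading of the formula above (how cluster labels are matched between the two clusterings is not specified in the paper, so this sketch simply compares sizes per cluster index):

```python
import numpy as np

def misclassification_error(labels_orig, labels_dist, k):
    """ME = (1/N) * sum_i | |Cluster_i(D)| - |Cluster_i(Dbar)| |,
    comparing cluster sizes between original and distorted clusterings."""
    N = len(labels_orig)
    sizes_orig = np.array([(labels_orig == i).sum() for i in range(k)])
    sizes_dist = np.array([(labels_dist == i).sum() for i in range(k)])
    return np.abs(sizes_orig - sizes_dist).sum() / N
```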
V. Conclusion
We propose an approach for a data analysis system that uses the data distortion techniques singular value decomposition (SVD) and sparsified SVD to preserve privacy. We have presented a privacy preserving data mining application which distorts the original dataset to meet privacy requirements. Experimental results demonstrate its effectiveness by comparing the accuracy obtained on the original and distorted data: a high degree of data distortion can maintain a high level of data utility under k-means clustering. Future work may address other data protection scenarios along with different data mining algorithms.
References
[1] B. Gilburd, A. Schuster and R. Wolff. "k-TTP: A new privacy model for large-scale distributed environments". In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, USA, 2004.
[2] D. Agrawal and C. C. Aggarwal. "On the design and quantification of privacy-preserving data mining algorithms". In Proceedings of the 20th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2001.
[3] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd ed., 1996.
[4] J. Gao and J. Zhang. "Sparsification strategies in latent semantic indexing". In Proceedings of the 2003 Text Mining Workshop, M. W. Berry and W. M. Pottenger (eds.), pp. 93-103, San Francisco, CA, May 3, 2003.
[5] J. Wang, W. Zhong, S. Xu and J. Zhang. "Selective data distortion via structural partition and SSVD for privacy preservation". In Proceedings of the 2006 International Conference on Information & Knowledge Engineering, pp. 114-120, CSREA Press, Las Vegas, 2006.
[6] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
[7] N. Maheswari and K. Duraiswamy. "CLUST-SVD: Privacy preserving clustering in singular value decomposition". World Journal of Modelling and Simulation, 4(4): 250-256, 2008.
[8] R. Agrawal, A. Evfimievski and R. Srikant. "Information sharing across private databases". In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 86-97, San Diego, CA, 2003.
[9] R. Agrawal and R. Srikant. "Privacy-preserving data mining". In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, 2000.
[10] S. Deerwester, S. Dumais, et al. "Indexing by latent semantic analysis". Journal of the American Society for Information Science, 41: 391-407, 1990.
[11] V. S. Verykios, E. Bertino, I. Fovino, L. Provenza, Y. Saygin and Y. Theodoridis. "State-of-the-art in privacy preserving data mining". SIGMOD Record, 33(1): 50-57, 2004.
[12] V. S. Verykios, E. Bertino, I. Fovino, L. Provenza, Y. Saygin and Y. Theodoridis. "State-of-the-art in privacy preserving data mining". SIGMOD Record, 33(1): 50-57, 2004.
[13] W. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1992.
[14] V. Estivill-Castro, L. Brankovic and D. L. Dowe. "Privacy in data mining". Australian Computer Society, NSW Branch, Australia. Available at www.acs.org.au/nsw/articles/199082.html
[15] S. Xu, J. Zhang, D. Han and J. Wang. "Data distortion for privacy protection in a terrorist analysis system". In Proceedings of the 2005 IEEE International Conference on Intelligence and Security Informatics, pp. 459-464, 2005.
[16] S. Xu, J. Zhang, D. Han and J. Wang. "Singular value decomposition based data distortion strategy for privacy protection". Knowledge and Information Systems, 10(3): 383-397, 2006.