Kato Mivule, "Utilizing Noise Addition for Data Privacy, an Overview", Proceedings of the International Conference on Information and Knowledge Engineering (IKE 2012), Pages 65-71, Las Vegas, NV, USA.
Towards A Differential Privacy and Utility Preserving Machine Learning Classifier - Kato Mivule
Kato Mivule, Claude Turner, Soo-Yeon Ji, "Towards A Differential Privacy and Utility Preserving Machine Learning Classifier", Procedia Computer Science (Complex Adaptive Systems), 2012, Pages 176-181, Washington DC, USA.
An Investigation of Data Privacy and Utility Preservation Using KNN Classification as a Gauge - Kato Mivule
Kato Mivule and Claude Turner, "An Investigation of Data Privacy and Utility Preservation Using KNN Classification as a Gauge", International Conference on Information and Knowledge Engineering (IKE 2013), July 22-25, Pages 203-204, Las Vegas, NV, USA.
Applying Data Privacy Techniques on Published Data in Uganda - Kato Mivule
Kato Mivule, Claude Turner, "Applying Data Privacy Techniques on Published Data in Uganda", Proceedings of the 2012 International Conference on e-Learning, e-Business, Enterprise Information Systems, and e-Government (EEE 2012), Pages 110-115, Las Vegas, NV, USA.
A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Using Machine Learning Classification as a Gauge - Kato Mivule
Kato Mivule, Claude Turner, "A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Using Machine Learning Classification as a Gauge", Procedia Computer Science, Volume 20, 2013, Pages 414-419, Baltimore MD, USA
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy - Kato Mivule
Genomic data provides clinical researchers with vast opportunities to study various patient ailments. Yet the same data contains revealing information, some of which a patient might want to remain concealed. The question then arises: how can an entity transact in full DNA data while concealing certain sensitive pieces of information in the genome sequence, and maintain DNA data utility? As a response to this question, we propose a codon frequency obfuscation heuristic, in which a redistribution of codon frequency values with highly expressed genes is done in the same amino acid group, generating an obfuscated DNA sequence. Our preliminary results show that it might be possible to publish an obfuscated DNA sequence with a desired level of similarity (utility) to the original DNA sequence. http://arxiv.org/abs/1405.5410
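The core idea of the abstract above, redistributing codon frequencies within the same amino acid group so that the published sequence differs while its biological meaning is preserved, can be sketched roughly as swapping codons for synonyms. This is a minimal illustration, not the paper's actual heuristic: the codon table is a small subset, and the swap rate, seed, and similarity measure are assumptions chosen for the example.

```python
import random

# Minimal synonymous-codon table (a small subset of the genetic code,
# for illustration only; a full table would cover all 64 codons).
SYNONYMS = {
    "GCT": ["GCT", "GCC", "GCA", "GCG"],  # Alanine
    "GCC": ["GCT", "GCC", "GCA", "GCG"],
    "GCA": ["GCT", "GCC", "GCA", "GCG"],
    "GCG": ["GCT", "GCC", "GCA", "GCG"],
    "GAA": ["GAA", "GAG"],                # Glutamate
    "GAG": ["GAA", "GAG"],
}

def split_codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def obfuscate(seq, rate=0.5, seed=7):
    """Swap codons for random synonyms at the given rate; the encoded
    amino-acid sequence (and hence the protein) is unchanged."""
    rng = random.Random(seed)
    out = []
    for c in split_codons(seq):
        if c in SYNONYMS and rng.random() < rate:
            out.append(rng.choice(SYNONYMS[c]))
        else:
            out.append(c)
    return "".join(out)

original = "GCTGCTGAAGAG"
masked = obfuscate(original)
# Utility here = fraction of codon positions left unchanged.
pairs = list(zip(split_codons(original), split_codons(masked)))
similarity = sum(a == b for a, b in pairs) / len(pairs)
```

Tuning `rate` trades privacy (more codons changed) against utility (similarity to the original sequence), which mirrors the privacy-utility trade-off the abstract describes.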
Implementation of Data Privacy and Security in an Online Student Health Records System - Kato Mivule
Kato Mivule, Stephen Otunba, Tattwamasi Tripathy, and Sharad Sharma, "Implementation of Data Privacy and Security in an Online Student Health Records System", Proceedings of the ISCA 21st International Conference on Software Engineering and Data Engineering (SEDE-2012), Pages 143-148, Los Angeles, CA, USA.
Lit Review Talk - Signal Processing and Machine Learning with Differential Privacy - Kato Mivule
Literature Review Talk, by Kato Mivule, COSC891 Fall 2013, Computer Science Department, Bowie State University
"Signal Processing and Machine Learning with Differential Privacy: Algorithms and Challenges for Continuous Data", Sarwate and Chaudhuri (2013)
Using Randomized Response Techniques for Privacy-Preserving Data Mining
Privacy is an important issue in data mining and knowledge discovery. In this paper, we propose to use randomized response techniques to conduct the data mining computation. Specifically, we present a method to build decision tree classifiers from the disguised data. We conduct experiments to compare the accuracy of our decision tree with the one built from the original undisguised data. Our results show that although the data are disguised, our method can still achieve fairly high accuracy. We also show how the parameter used in the randomized response techniques affects the accuracy of the results.
Keywords: Privacy, security, decision tree, data mining
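The randomized response mechanism the abstract refers to can be sketched in a few lines: each respondent flips their true answer with a known probability, and the analyst inverts the known bias to recover the population fraction. The probability 0.8, the true fraction 0.3, and the sample size are illustrative assumptions, not values from the paper.

```python
import random

def randomize(truth, p, rng):
    """Warner-style randomized response: with probability p report the
    true answer, otherwise report its negation."""
    return truth if rng.random() < p else not truth

def estimate_true_fraction(responses, p):
    # E[observed] = p*pi + (1-p)*(1-pi)  =>  pi = (observed - (1-p)) / (2p - 1)
    observed = sum(responses) / len(responses)
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(42)
truths = [rng.random() < 0.3 for _ in range(50_000)]   # ~30% true "yes"
responses = [randomize(t, 0.8, rng) for t in truths]   # what gets published
pi_hat = estimate_true_fraction(responses, 0.8)        # recovers roughly 0.3
```

This is exactly the parameter trade-off the abstract mentions: as `p` approaches 0.5 individual answers reveal almost nothing, but the estimator's variance grows, lowering the accuracy of any classifier built on the disguised data.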
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge - Kato Mivule
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge By Kato Mivule for the Degree of D.Sc. in Computer Science - Bowie State University
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Mining - idescitation
Nowadays, data sharing between two organizations is common in many application areas such as business planning or marketing. When data are to be shared between parties, there may be sensitive data that should not be disclosed to the other parties. Medical records are especially sensitive, so their privacy protection is taken more seriously. As required by the Health Insurance Portability and Accountability Act (HIPAA), it is necessary to protect the privacy of patients and ensure the security of medical data. To address this problem, released datasets must unavoidably be modified. We propose and implement a method called the hybrid approach for privacy preserving. First, we randomize the original data; then we apply generalization to the randomized data. This technique protects private data with better accuracy; it can also reconstruct the original data and provide data with no information loss, preserving the usability of the data.
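The two-step pipeline described above, randomize first, then generalize, can be sketched as follows. The noise range, bucket width, and age values are illustrative assumptions; the paper's actual randomization and generalization parameters are not given in the abstract.

```python
import random

def randomize_ages(ages, noise=3, seed=1):
    """Step 1 (randomization): add small uniform integer noise to each value."""
    rng = random.Random(seed)
    return [a + rng.randint(-noise, noise) for a in ages]

def generalize(age, width=10):
    """Step 2 (generalization): replace the perturbed value with a coarse range."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

ages = [23, 37, 41, 58]
released = [generalize(a) for a in randomize_ages(ages)]
# First entry is always '20-29', since 23 +/- 3 stays inside [20, 29].
```

Generalizing after randomizing means the released ranges absorb most of the noise, which is one way the hybrid approach can claim better accuracy than noise addition alone.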
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVACY... - IJDKP
Huge volumes of data from domain-specific applications such as medical, financial, library, telephone, and shopping records are regularly generated. Sharing these data has proved beneficial for data mining applications. On one hand, such data is an important asset for business decision making when analyzed. On the other hand, data privacy concerns may prevent data owners from sharing information for data analysis. In order to share data while preserving privacy, the data owner must come up with a solution that achieves the dual goals of privacy preservation and accuracy of the data mining tasks, namely clustering and classification. An efficient and effective approach is proposed that aims to protect the privacy of sensitive information while obtaining data clustering with minimum information loss.
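A generic multiplicative perturbation, one of the family the abstract's approach belongs to, can be sketched by scaling each attribute with a small random factor. The noise level, seed, and record values are assumptions for illustration; the paper's tuple-value-based scheme is more specific than this.

```python
import random

def multiplicative_perturb(rows, sigma=0.05, seed=0):
    """Scale every attribute of every tuple by an independent (1 + noise)
    factor; a small sigma roughly preserves relative distances, so
    clustering on the perturbed tuples stays close to clustering on the
    originals while hiding the exact values."""
    rng = random.Random(seed)
    return [[x * (1 + rng.gauss(0, sigma)) for x in row] for row in rows]

records = [[170.0, 65.0], [160.0, 72.0], [180.0, 80.0]]  # e.g. height, weight
perturbed = multiplicative_perturb(records)
```

The dual goal stated in the abstract shows up directly in `sigma`: larger values hide more (privacy) but distort inter-record distances, degrading clustering quality (information loss).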
PRIVACY PRESERVING DATA MINING BY USING IMPLICIT FUNCTION THEOREM - IJNSA Journal
Data mining has become a broad and significant multidisciplinary field used across vast application domains; it extracts knowledge by identifying structural relationships among the objects in large databases. Privacy preserving data mining is a new area of data mining research concerned with keeping sensitive knowledge extracted by a data mining system accessible only to the intended persons, not to everyone. In this paper, we propose a new approach to privacy preserving data mining that uses the implicit function theorem for the secure transformation of sensitive data obtained from a data mining system. We propose a two-way enhanced security approach: first, transforming the original values of sensitive data into partial derivatives of functional values to perturb the data; second, generating a symmetric key from the eigenvalues of the Jacobian matrix for secure computation. We give an example of academic sensitive data converted into vector-valued functions to explain the proposed concept, and present implementation-based results of the new approach.
A Comparative Study on Privacy Preserving Data Mining Techniques - IJMER
Privacy protection has become very important in recent years because of the increasing ability to store data. In particular, recent advances in the data mining field have led to increased concerns about privacy. Data in its original form, however, typically contains sensitive information about individuals, and publishing such data will violate individual privacy. Current practice in data publishing is based on what type of data can be released and how that data is used. Recently, PPDM has received immense attention in research communities, and many approaches have been proposed for different data publishing scenarios. In this comparative study we systematically summarize and evaluate different approaches to PPDM, study the challenges, differences, and requirements that distinguish PPDM from other related problems, and propose future research directions.
A Review Study on the Privacy Preserving Data Mining Techniques and Approaches
In this paper we review the various privacy preserving data mining techniques, such as data modification and secure multiparty computation, from several different aspects.
Index Terms: Privacy and Security, Data Mining, Privacy Preserving, Secure Multiparty Computation (SMC), Data Modification
Cluster Based Access Privilege Management Scheme for Databases - Editor IJMTER
Knowledge discovery is carried out using data mining techniques. Association rule mining, classification, and clustering operations are carried out under data mining. The clustering method is used to group records based on relevancy. Distance or similarity measures are used to estimate the transaction relationship. Census data and medical data are referred to as microdata. Data publishing schemes are used to provide private data for analysis. Privacy preservation is used to protect private data values. Anonymity is considered in the privacy preservation process.
Data values are made available to authorized users through access control models. The Privacy Protection Mechanism (PPM) uses suppression and generalization of relational data to anonymize it and satisfy privacy needs. An accuracy-constrained privacy-preserving access control framework is used to manage access control in a relational database. The access control policies define the selection predicates available to roles, while the privacy requirement is to satisfy k-anonymity or l-diversity. An imprecision bound constraint is assigned for each selection predicate. k-anonymous Partitioning with Imprecision Bounds (k-PIB) is used to estimate accuracy and privacy constraints. Role-based Access Control (RBAC) allows defining permissions on objects based on roles in an organization. The Top Down Selection Mondrian (TDSM) algorithm is used for query workload-based anonymization; it is constructed using greedy heuristics and a kd-tree model. Query cuts are selected with minimum bounds in the Top-Down Heuristic 1 algorithm (TDH1). The query bounds are updated as the partitions are added to the output in the Top-Down Heuristic 2 algorithm (TDH2). The cost of reduced precision in the query results is used in the Top-Down Heuristic 3 algorithm (TDH3). A repartitioning algorithm is used to reduce the total imprecision for the queries.
The privacy-preserved access privilege management scheme is enhanced to provide incremental mining features. Data insert, delete, and update operations are connected with the partition management mechanism. Cell-level access control is provided with a differential privacy method. A dynamic role management model is integrated with the access control policy mechanism for query predicates.
Privacy Preservation and Restoration of Data Using Unrealized Data Sets - IJERA Editor
In today's world, advances in hardware technology have increased the capability to store and record personal data about consumers and individuals. Data mining extracts knowledge to successfully support a variety of areas such as marketing, medical diagnosis, weather forecasting, and national security. Still, there is a challenge in extracting certain kinds of data without violating the data owners' privacy. As data mining becomes more pervasive, such privacy concerns are increasing. This gives rise to a new category of data mining methods called privacy preserving data mining (PPDM) algorithms. The aim of these algorithms is to protect the sensitive information within a large data set. The privacy preservation of a data set can be expressed in the form of a decision tree. This paper proposes a privacy preservation approach based on data set complement algorithms, which store the information of the real data set so that the private data are safe from unauthorized parties; if some portion of the data is lost, the original data set can be recreated from the unrealized data set and the perturbed data set.
TWO PARTY HIERARCHICAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET - IJDKP
Data mining is a task in which data is extracted from a large database and put into an understandable form or structure so that it can be used further. In this paper we present an approach that applies the concept of hierarchical clustering over a horizontally partitioned data set. We also describe the required algorithms, such as hierarchical clustering and algorithms for finding the minimum closest cluster, and explain two-party computation. Since privacy of data is most important these days, we present an approach by which privacy preservation can be applied over two parties that distribute their data horizontally, and we explain the hierarchical clustering applied in the present method.
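The clustering step the abstract refers to, agglomerative merging of the two closest clusters, can be sketched as below. Note this shows only the plain clustering on pooled records; a real two-party protocol would compute the minimum closest-cluster distance via secure two-party computation so that neither party reveals its records. The data values are illustrative assumptions.

```python
def single_linkage(points, k):
    """Naive agglomerative clustering: start from singleton clusters and
    repeatedly merge the pair of clusters with the smallest closest-point
    distance until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

# Horizontal partitioning: each party holds different records of the same
# attribute; here the records are simply pooled before clustering.
party_a = [1.0, 1.2, 9.8]
party_b = [0.9, 10.1, 10.3]
merged = single_linkage(party_a + party_b, k=2)
```

On this toy input the two resulting clusters mix records from both parties, which is exactly why the distance computations, not the final clusters, are what the privacy protocol must protect.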
AN EFFICIENT SOLUTION FOR PRIVACY-PRESERVING, SECURE REMOTE ACCESS TO SENSITIVE... - cscpconf
Sharing data that contains personally identifiable or sensitive information, such as medical records, always has privacy and security implications. The issues can become rather complex when the methods of access can vary, and accurate individual data needs to be provided whilst mass data release for specific purposes (for example for medical research) also has to be catered for. Although various solutions have been proposed to address the different aspects individually, a comprehensive approach is highly desirable. This paper presents a solution for maintaining the privacy of data released en masse in a controlled manner, and for providing secure access to the original data for authorized users. The results show that the solution is provably secure and maintains privacy in a more efficient manner than previous solutions.
Performance analysis of perturbation-based privacy preserving techniques: an ... - IJECEIAES
Nowadays, enormous amounts of data are produced every second. These data also contain private information from sources including media platforms, the banking sector, finance, healthcare, and criminal histories. Data mining is a method for searching through and analyzing massive volumes of data to find usable information. Preserving personal data during data mining has become difficult, so privacy-preserving data mining (PPDM) is used to do so. Data perturbation is one of several tactics used by the PPDM data privacy protection mechanism: datasets are perturbed in order to preserve personal information, addressing both data accuracy and data privacy. This paper explores and compares several hybrid perturbation strategies that may be used to protect data privacy. Two perturbation-based techniques, improved random projection perturbation (IRPP) and the enhanced principal component analysis-based technique (EPCAT), are used to assess the precision, run time, and accuracy of the experimental results. The paper presents the impacts of perturbation-based privacy preserving techniques; it is observed that the hybrid approaches are more efficient than the traditional approach.
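The random projection family that IRPP builds on can be sketched generically: multiply each record by a secret random matrix that maps it to a lower dimension. This is a plain random projection, not the paper's improved variant, and the matrix shape, seed, and record values are assumptions for illustration.

```python
import random

def random_projection(rows, out_dim, seed=0):
    """Project d-dimensional records through a secret Gaussian random
    matrix into out_dim dimensions; pairwise distances are approximately
    preserved (Johnson-Lindenstrauss), which is what keeps clustering and
    classification accuracy usable on the perturbed data."""
    rng = random.Random(seed)
    d = len(rows[0])
    scale = 1 / out_dim ** 0.5  # keeps expected squared norms comparable
    R = [[rng.gauss(0, scale) for _ in range(out_dim)] for _ in range(d)]
    return [[sum(x * R[i][j] for i, x in enumerate(row)) for j in range(out_dim)]
            for row in rows]

records = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
perturbed = random_projection(records, out_dim=2)
```

Privacy comes from keeping `R` secret and from the projection being many-to-one: an attacker seeing only the 2-dimensional outputs cannot uniquely invert them back to the 4-dimensional originals.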
In this era, there is a need to secure data in distributed database systems. For collaborative data publishing, some anonymization techniques are available, such as generalization and bucketization. We consider an attack, which we call an "insider attack," by colluding data providers who may use their own records to infer the records of others. To protect the database from these types of attacks we use the slicing technique for anonymization, as the above techniques are not suitable for high-dimensional data: they cause loss of data and also need a clear separation of quasi-identifiers and sensitive attributes. We consider this threat and make several contributions. First, we introduce a notion of data privacy and use a slicing technique, which shows that the anonymized data satisfies privacy and security by partitioning the data vertically and horizontally. Second, we present verification algorithms that prove security against a number of data providers and ensure high utility and data privacy of the anonymized data with efficiency. For experimental results we use hospital patient datasets, and the results suggest that our slicing approach achieves better or comparable utility and efficiency than baseline algorithms while satisfying data security. Our experiment successfully demonstrates the difference between the computation time of the encryption algorithm used to secure the data and that of our system.
A Survey Paper on an Integrated Approach for Privacy Preserving In High Dimen...IJSRD
Data mining is a technique used for the extraction of knowledge and information from the large amounts of data collected by hospitals, governments, and individuals. The term data mining is also referred to as knowledge mining from databases. The major challenge in data mining is ensuring the security and privacy of data in databases, because data sharing is common at the organizational level. The data in databases comes from a number of sources, such as medical, financial, library, marketing, and shopping records, so keeping that data secure is a foremost task. The objective is to achieve fully privacy-preserved data without affecting data utility in databases, i.e., how data is used or transferred between organizations so that data integrity remains in the database but sensitive and confidential data is preserved. This paper presents a brief study of different PPDM techniques, such as randomization, perturbation, slicing, and summarization, by use of which data privacy can be preserved. The technique for which the best computational and theoretical outcome is achieved is chosen for privacy preserving in high-dimensional data.
Data Anonymization Process Challenges and Context Missionsijdms
Data anonymization is one of the solutions allowing companies to comply with the GDPR in terms of data protection. In this context, developers must follow several steps in the process of data anonymization in development and testing environments. Indeed, real personal and sensitive data must not leave the production environment, which is very secure. Often, anonymization experts face difficulties including the lack of data flows and mapping between data sources, the non-cooperation of the database project teams (refusal to change), or the lack of skills on teams maintaining systems built long ago by experienced teams who have unfortunately left the project. Another problem is the lack of data models. The aim of this paper is to discuss an anonymization process for databases of banking applications and to present our context-based recommendations to overcome the different issues encountered and the solutions to improve data anonymization process methodologies.
Data Transformation Technique for Protecting Private Information in Privacy P...acijjournal
Data mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. Data
mining can be utilized in any organization that needs to find patterns or relationships in its data, using a group of techniques that find relationships that have not previously been discovered. In many situations, the extracted patterns are highly private and should not be disclosed. In order to maintain the secrecy of data,
several techniques and algorithms are needed for modifying the original data so as to limit the extraction of confidential patterns. There have been two types of privacy in data mining. The first type is that the data is altered so that the mining result will preserve certain privacy. The second type is that the data is manipulated so that the mining result is not affected or is minimally affected. The aim of privacy preserving data mining researchers is to develop data mining techniques that can be
applied on databases without violating the privacy of individuals. Many techniques for privacy preserving data mining have come up over the last decade. Some of them are statistical, cryptographic, and randomization methods, the k-anonymity model, l-diversity, etc. In this work, we propose a new perturbative masking technique, known as the data transformation technique, that can be used for protecting sensitive information.
Experimental results show that the proposed technique gives better results compared with the existing technique.
Towards A More Secure Web Based Tele Radiology System: A Steganographic ApproachCSCJournals
While it is possible to make a patient's medical images available to a practicing radiologist online e.g. through open network systems inter connectivity and email attachments, these methods don't guarantee the security, confidentiality and tamper free reliability required in a medical information system infrastructure. The possibility of securely and covertly transmitting such medical images remotely for clinical interpretation and diagnosis through a secure steganographic technique was the focus of this study.
We propose a method that uses an Enhanced Least Significant Bit (ELSB) steganographic insertion method to embed a patient's Medical Image (MI) in the spatial domain of a cover digital image and his/her health records in the frequency domain of the same cover image as a watermark to ensure tamper detection and non-repudiation. The ELSB method uses the Mersenne Twister (MT) Pseudo Random Number Generator (PRNG) to randomly embed and conceal the patient's data in the cover image. This technique significantly increases the imperceptibility of the hidden information to steganalysis, thereby enhancing the security of the embedded patient's data.
In measuring the effectiveness of the proposed method, the study adopted the Design Science Research (DSR) methodology, a paradigm for problem solving in computing and Information Systems (IS) that involves design and implementation of artefacts and methods considered novel and the analytical testing of the performance of such artefacts in pursuit of understanding and enhancing an existing method, artefact or practice.
The fidelity measures of the stego images from the proposed method were compared with those from the traditional Least Significant Bit (LSB) method in order to establish the imperceptibility of the embedded information. The results demonstrated improvements of between 1 and 2.6 decibels (dB) in the Peak Signal to Noise Ratio (PSNR), and up to 0.4 in MSE ratios, for the proposed method.
DATA SCIENCE METHODOLOGY FOR CYBERSECURITY PROJECTS cscpconf
Cybersecurity solutions are traditionally static and signature-based. The
traditional solutions, along with the use of analytic models, machine
learning, and big data, could be improved to automatically trigger
mitigation or provide relevant awareness to control or limit the
consequences of threats. This kind of intelligent solution is covered in
the context of Data Science for Cybersecurity. Data Science plays a
significant role in cybersecurity by utilizing the power of data (and big
data), high-performance computing, and data mining (and machine learning)
to protect users against cybercrime. For this purpose, a successful data
science project requires an effective methodology to cover all issues and
provide adequate resources. In this paper, we introduce popular data
science methodologies and compare them with respect to cybersecurity
challenges. A comparative discussion is also delivered to explain the
methodologies' strengths and weaknesses for cybersecurity projects.
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATIONcscpconf
Huge volumes of detailed personal data are regularly collected and analyzed by applications
using data mining, and sharing these data is beneficial to the application users. On one hand the data is
an important asset to business organizations and governments for decision making; at the same
time, analyzing such data opens threats to privacy if not done properly. This paper aims to reveal
useful information while protecting sensitive data. We use a vector quantization technique for
preserving privacy. Quantization is performed on training data samples to produce a
transformed data set. This transformed data set does not reveal the original data; hence privacy
is preserved.
Similar to Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview (20)
A Study of Usability-aware Network Trace Anonymization Kato Mivule
The publication and sharing of network trace data is critical to the advancement of collaborative research among various entities in government, the private sector, and academia. However, due to the sensitive and confidential nature of the data involved, entities have to employ various anonymization techniques to meet legal requirements in compliance with confidentiality policies. Nevertheless, the very composition of network trace data makes it a challenge when applying anonymization techniques. On the other hand, basic application of microdata anonymization techniques on network traces is problematic and does not deliver the necessary data usability. Therefore, as a contribution, we point out some of the ongoing challenges in network trace anonymization. We then suggest usability-aware anonymization heuristics by employing microdata privacy techniques while giving consideration to usability of the anonymized data. Our preliminary results show that with trade-offs, it might be possible to generate anonymized network traces with enhanced usability, on a case-by-case basis, using microdata anonymization techniques.
Kato Mivule - Towards Agent-based Data Privacy EngineeringKato Mivule
Towards Agent-based Data Privacy Engineering - Given any original data set X, a set of data privacy engineering phases should be followed from start to completion in the generation of a privatized data set Y. Could we have agents that autonomously implement privacy?
Lit Review Talk by Kato Mivule: A Review of Genetic AlgorithmsKato Mivule
Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms and Paper Review: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction for the analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.
An Investigation of Data Privacy and Utility Using Machine Learning as a GaugeKato Mivule
Dissertation Defense: "An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge" by Kato Mivule, Bowie State University, April 17, 2014.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see demand and the changing evolution of supply, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
Utilizing Noise Addition for Data Privacy, an Overview
Kato Mivule
Computer Science Department
Bowie State University
14000 Jericho Park Road Bowie, MD 20715
Mivulek0220@students.bowiestate.edu
Abstract – The internet is increasingly becoming a
standard for both the production and consumption of
data while at the same time cyber-crime involving the
theft of private data is growing. Therefore in efforts to
securely transact in data, privacy and security concerns
must be taken into account to ensure that the
confidentiality of individuals and entities involved is not
compromised, and that the data published is compliant with
privacy laws. In this paper, we take a look at noise
addition as one of the data privacy providing techniques.
Our endeavor in this overview is to give a foundational
perspective on noise addition data privacy techniques,
provide statistical consideration for noise addition
techniques and look at the current state of the art in the
field, while outlining future areas of research.
Keywords: Data Privacy, Security, Noise Addition,
Data Perturbation
1. Introduction
Large data collection organizations such as the
Census Bureau often release statistics to the public in the
form of statistical databases, often transformed to some
extent, omitting sensitive information such as personal
identifying information (PII). Researchers have shown
that with such publicly released statistical databases in
conjunction with supplemental data, adversaries are able
to launch inference attacks and reconstruct identities of
individuals or an entity's sensitive information [1].
Therefore while data de-identification is essential, it
should be taken as an initial step in the process of privacy
preserving data publishing but other methods such as
noise addition should strongly be considered after PII has
been removed from data sets to ensure greater levels of
confidentiality [1] [2]. A generalized data privacy
procedure would involve both data de-identification and
perturbation as shown in Figure 1.
Figure 1: Generalized Data Privacy with Noise Addition
2. Background
In this section we take a look at some of the
terms used in the noise addition procedure. Data Privacy
and Confidentiality is the protection of an entity or an
individual against illegitimate information revelation [1].
Data Security is concerned with legitimate accessibility
of data [2]. Data de-identification is the removal of
personally identifiable information (PII) from a data set
[3] [4]. The data de-identification process, also referred to as
data anonymization, data sanitization, and statistical
disclosure control (SDC), is a process in which PII
attributes are excluded or denatured to such an extent that
when the data is made public, a person's identity, or an
entity's sensitive data, cannot be reconstructed [5] [6].
Statistical disclosure control methods are classified as
non-perturbative and perturbative, with the former being
a procedure in which original data is not denatured, while
with the latter, original data is denatured before
publication to provide confidentiality [1]. Therefore de-
identification of data ensures to some extent that
sensitive and personal data does not suffer from inference
and reconstruction attacks, which are methods of attack
in which isolated pieces of data are used to infer a
supposition about a person or an entity [7].
Data utility versus privacy concerns how useful a published
dataset is to the consumer of that published dataset. In
most instances, when publishers of large data sets do so,
they ensure that PII is removed and data is distorted by
noise addition techniques. However, in doing so, the
original data suffers loss of some of its statistical
properties even while confidentiality is granted, thus
making the dataset almost meaningless to the user of the
published dataset. Therefore a balance between privacy
and utility needs is always sought [24] [25] [26]. Data
privacy scholars have noted that achieving optimal data
privacy while not shrinking data utility is an ongoing NP-
hard task [27]. Statistical databases are non-changing
data sets often published in aggregated format [28].
While data de-identification will ensure the removal of
PII attributes, it has been deemed a novice method by
researchers; the remaining sanitized data set could still be
compromised and used to reconstruct an individual's
identity or an entity's sensitive data [1] [2]. Therefore the
remaining confidential attributes that contain sensitive
information for example salary, student's GPA, need to be
transformed to such an extent that they cannot be linked
with outside information in an inference attack. It is in
this context that we focus on noise addition as a
perturbation methodology that seeks to transform
numerical attributes to grant confidentiality.
3. Related work
With an increasing interest in data privacy and
security research, a number of surveys have been done
articulating the progress and state of the art in the data
privacy and security research field. In their survey on
data privacy and security, Santos et al., present an
overview on state of the art in data security techniques,
placing emphasis on data security solutions for data
warehousing [40]. Furthermore, in their overview,
Matthews and Harel, offer a more comprehensive
summary of current statistical disclosure limitation
techniques, noting that the balance between privacy
and utility is still being sought with data privacy
enhancing techniques [41]. Additionally Joshi and Kuo,
offer an outline of state of the art data privacy techniques
in Online Social Networks, in which they note how a
balance is always pursued between privacy requirements
for users and using private data for advertisements [42].
Yet still, in their review, Ying-hua et al., take a closer
look at the current data privacy preserving techniques in
data mining, providing advantages and disadvantages of
various data privacy procedures [43]. While a number of
current overviews on data privacy focus on the general
data privacy enhancing techniques, in this paper, we
focus on noise addition methods while providing
statistical considerations for data perturbation.
4. Noise Addition
In this section, we take a look at noise addition
perturbation methods that transform confidential
attributes by adding noise to provide confidentiality.
Noise addition works by adding or multiplying a
stochastic or randomized number to confidential
quantitative attributes. The stochastic value is chosen
from a normal distribution with zero mean and a
diminutive standard deviation [10] [11].
4.1. Additive Noise
Work on additive noise was first publicized by
Kim [12] with the general expression that
Z = X + ε (1)
Where Z is the transformed data point, X is the original
data point, and ε is the random variable (noise) with a
distribution ε ~ N(0, σ²). This is then added to X. The X
is then replaced with the Z for the data set to be
published [13]. With stochastic noise, random data is
added to confidential attributes to conceal the
distinguishing values, an example includes increasing a
student's GPA by a diminutive percentage, say from 3.45
to 3.65 GPA [14]. In their work on additive noise,
Domingo-Ferrer et al., outline that in additive noise, also
referred to as white noise, concealment by additive noise
anticipates that the vector of measurements x_j of the
j-th variable of the original data set is replaced by the
vector
z_j = x_j + ε_j (2)
Where ε_j is the vector of normally distributed noise
acquired from a random variable ε_j ~ N(0, σ²_j), such
that Cov(ε_t, ε_l) = 0 for all t ≠ l; thus the method
preserves the mean and covariance [20]. Therefore additive
noise can be expressed in a simple format as follows [21]:
Z = X + ε (3)
Z is the masked data value to be published, after the
transformation X + ε. X is the original unmasked data
value in the raw data set. ε (epsilon) is the random
variable (noise) added to X, whose distribution is
ε ~ N(0, σ²). Ciriani et al. note that additive noise, also
known as uncorrelated noise, preserves the mean and
covariance of the original data but the correlation
coefficients and variances are not sustained. Another
variation of additive noise is correlated additive noise
that keeps the mean and allows the sustenance of
correlation coefficients in the original data [22].
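The additive noise scheme above can be sketched in a few lines of Python. This is an illustrative sketch assuming NumPy; the GPA values and the σ setting are invented for the example and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def additive_noise(x, sigma):
    """Mask a confidential numeric attribute with uncorrelated
    (white) additive noise: Z = X + eps, where eps ~ N(0, sigma^2)."""
    eps = rng.normal(loc=0.0, scale=sigma, size=len(x))
    return x + eps

# Hypothetical confidential attribute: student GPAs.
gpa = np.array([3.45, 2.80, 3.90, 3.10, 2.55])
masked = additive_noise(gpa, sigma=0.1)

# The masked values replace the originals in the published set;
# the mean is approximately preserved, individual values are not.
print(gpa.mean(), masked.mean())
```

As noted above, uncorrelated additive noise approximately preserves means and covariances but not variances or correlation coefficients, so σ must be chosen with the intended downstream analyses in mind.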
4.2. Multiplicative Noise
Multiplicative noise is another type of stochastic
noise outlined by Kim and Winkler [23] in which they
describe that multiplicative noise is rendered by
generating random numbers with a mean = 1, which then
are used as noise and multiplied to the original data set.
Each data element is multiplied by a random number
with a short Gaussian distribution, with mean = 1 and a
small variance:
Y = X · ε (4)
Where Y is the perturbed data; X is the original data; ε is
the generated random variable (noise) with a normal
distribution with mean µ = 1 and variance σ² [23].
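A minimal sketch of multiplicative noise in the style described by Kim and Winkler, assuming NumPy; the salary figures and σ are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

def multiplicative_noise(x, sigma):
    """Perturb data by multiplying each element by a random
    number e ~ N(1, sigma^2), i.e. Y = X * e with mean(e) = 1."""
    e = rng.normal(loc=1.0, scale=sigma, size=len(x))
    return x * e

# Hypothetical confidential attribute: salaries.
salary = np.array([42000.0, 55000.0, 61000.0, 48000.0])
perturbed = multiplicative_noise(salary, sigma=0.05)
```

Because the noise has mean 1 and small variance, each perturbed value stays close to the original in relative terms, which makes multiplicative noise convenient when attribute magnitudes vary widely.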
4.3 Logarithmic multiplicative noise
Kim and Winkler [23] describe another variation
of multiplicative noise, in which a logarithmic alteration
is taken on the original data:
Y = ln(X) (5)
The random number (noise) ε is then generated and then
added to the altered data [23]:
Z = Y + ε (6)
Where X is the original data; Y is the logarithmically
altered data; Z is the logarithmically altered data with
noise added to it; exp is the exponential function used to
calculate the antilog.
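The logarithmic variant can be sketched as follows, assuming NumPy; the income values and σ are invented for illustration. Note that the logarithmic transform requires strictly positive data.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_multiplicative_noise(x, sigma):
    """Logarithmic multiplicative noise: transform Y = ln(X),
    add Gaussian noise Z = Y + eps, then take the antilog exp(Z)."""
    y = np.log(x)                                  # logarithmic alteration of the data
    z = y + rng.normal(0.0, sigma, size=len(x))    # noise added to the altered data
    return np.exp(z)                               # antilog back to the data scale

# Hypothetical positive-valued confidential attribute.
income = np.array([1200.0, 3400.0, 560.0, 9800.0])
obfuscated = log_multiplicative_noise(income, sigma=0.1)
```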
4.4. Differential Privacy
In this section, we take a look at Differential
privacy, a current state of the art data perturbation
method that utilizes Laplace noise addition techniques
and was proposed by Dwork (2006). Differential privacy
is the latest state-of-the-art methodology in data privacy
that enforces confidentiality by returning perturbed
aggregated query results from databases, such that users
of the databases cannot discern if a particular data item has
been altered or not. This means that with the perturbed
results of the query, an attacker cannot derive information
about any data item in the database [33]. The database in
this case is a collection of rows that represent each
individual or entity we seek to conceal [34].
According to Dwork (2008), two databases D1 and D2 are
considered identical or similar, if they differ or disagree
in only one element or row, that is, ∆(D1, D2) = 1.
Therefore, a procedure that grants confidentiality
satisfies ε-differential privacy if the result of any same
query run on database D1 and again run on database D2
should probabilistically be similar, as long as those
results satisfy the following requirement [36]:
P[qn(D1) ∈ R] / P[qn(D2) ∈ R] ≤ e^ε (7)
Where D1 and D2 are the two databases;
P is the probability of the perturbed query
results from D1 and D2 respectively;
qn() is the privacy granting procedure
(perturbation);
qn(D1) is the privacy granting procedure on
query results from database D1;
qn(D2) is the privacy granting procedure on
query results from database D2;
R is the perturbed query results from the
databases D1 and D2 respectively;
e^ε is the exponential epsilon value.
Therefore, to satisfy differential privacy, the probability of the perturbed query results from D1 divided by the probability of the perturbed query results from D2 should be less than or equal to an exponential epsilon value. That is to say, if we run the same query on database D1, and then run the same query again on database D2, our query results should probabilistically be similar. If this condition holds in the presence or absence of the most influential observation for a particular query, then it will also hold for any other observation. The effect of the most influential observation for a given query is given by ∆f and assessed in the following way:
∆f = max |f(D1) − f(D2)| (8)
For all possible realizations of D1 and D2, where f(D1) and f(D2) represent the true responses to the query from D1 and D2 [33] [34] [35] [36]. According to Dwork (2006), the results of a query are perturbed with noise in the following way:
f(X) + Laplace(0, b) (9)
Where b is defined as follows for Laplace noise:
b = ∆f / ε (10)
X represents a particular realization of the database, while f(X) represents the true response to the query; the perturbed response satisfies ε-differential privacy. The ∆f must take into account all possible realizations of D1 and D2 [33] [34] [35] [36] [37]. We could take an example in which
we query the GPA of students at Bowie State University.
If our Min GPA in the database is 2.0, for smallest
possible GPA, and our Max GPA is 4.0 for largest
possible GPA, we then calculate Δf as 2.0. We choose a small ε value of 0.01. The parameter b of the Laplace noise is set to Δf/ε = 2.0/0.01 = 200. Thus we have a Laplace(0, 200) noise distribution. Therefore the unperturbed result of the query + noise from Laplace(0, 200) = the perturbed query result satisfying ε-differential privacy [34]. It has been noted by researchers that a smaller epsilon value creates greater privacy. However, utility risks degeneration with a much smaller epsilon value [38]. For example, ε = 0.0001 gives b = 20000, a Laplace(0, 20000) noise distribution.
Figure 2: A general Differential Privacy satisfying
procedure.
General steps for differential privacy shown in Figure 2:
1. Run the query on the database.
2. Calculate the most influential observation (Δf).
3. Calculate the Laplace noise distribution.
4. Add Laplace noise to the query results.
5. Publish the perturbed query results.
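The steps above can be sketched in Python; the GPA data, the query (a mean), and the inverse-transform Laplace sampler are illustrative assumptions, while Δf = 2.0 and ε = 0.01 follow the Bowie State example in the text:

```python
import math
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) via inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# Illustrative student GPAs (assumed data, not from the paper).
gpas = [2.8, 3.4, 3.9, 2.1, 3.0]

# Step 1: run the query on the database (here, the mean GPA).
true_result = sum(gpas) / len(gpas)

# Step 2: the most influential observation, following the text:
# GPAs are bounded in [2.0, 4.0], so Delta_f = 4.0 - 2.0 = 2.0.
delta_f = 4.0 - 2.0

# Step 3: the Laplace noise scale b = Delta_f / epsilon, Eq. (10).
epsilon = 0.01
b = delta_f / epsilon  # 2.0 / 0.01 = 200

# Step 4: add Laplace(0, b) noise to the query result, Eq. (9).
perturbed_result = true_result + laplace_noise(b)

# Step 5: publish only the perturbed query result.
print(perturbed_result)
```

Note how large b is relative to the query answer at this small ε, which previews the utility concern discussed next.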
4.5. Differential privacy pros and cons
Differential privacy grants across-the-board privacy and is easy to implement with SQL for aggregated data publication [39]. However, utility is a challenge, as statistical properties change with a much smaller ε, since Laplace noise addition takes into account the outliers and the most influential observation [38]. Adding noise to the data at the level of the most influential observation can render the data useless; thus the balance between privacy and utility is still a challenge [34] [37].
4.6. Statistical background for Noise addition
In this section, we take a look at statistical
considerations for data perturbation utilizing noise
addition. With noise addition, transformed data has to
keep the same statistical properties as the original data.
Therefore consideration has to be made for statistical
characteristics such as normal distribution, mean,
variance, standard deviation, covariance, and correlation
for both the original and perturbed data sets.
The mean μ is the average of the values: the sum of the values divided by n, the number of values. The mathematical statement for the mean μ is straightforward [16]:
μ = (1/n) Σ xᵢ (11)
The normal distribution, also known as the Gaussian distribution, used in generating the additive noise, is a bell-shaped continuous probability distribution used as an approximation to depict real-valued random variables that cluster around a single mean. The formula for the normal distribution is as follows [15]:
f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) (12)
The parameter μ represents the mean, the point of the peak of the bell curve, while the parameter σ² represents the variance, the width of the distribution. The notation N(μ, σ²) represents a normal distribution with mean μ and variance σ². Therefore X ~ N(μ, σ²) indicates that X is distributed N(μ, σ²). The distribution with μ = 0 and σ² = 1 is referred to as the standard normal.
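As a quick sketch of drawing additive noise from N(μ, σ²) and checking that the sample statistics match (μ = 0 and σ = 1, the standard normal, chosen for illustration):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

mu, sigma = 0.0, 1.0  # standard normal N(0, 1)
noise = [random.gauss(mu, sigma) for _ in range(10000)]

# The sample mean and variance should approximate mu and sigma^2.
sample_mean = sum(noise) / len(noise)
sample_var = sum((x - sample_mean) ** 2 for x in noise) / len(noise)
```

Such a sample is what would be added, value by value, to a numerical attribute in additive noise perturbation.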
The variance σ², in noise addition, is a measure of how the data distributes itself about the mean value. The expression for the variance is given by [17]:
σ² = Σ(Xᵢ − μ)² / N (12)
Where σ² is the variance, μ is the mean, Xᵢ are the individual data values, N is the number of values, and Σ(Xᵢ − μ)² is the sum, over all data values X, of the squared difference from the mean μ.
The standard deviation σ is a measure of how spread out the data is; thus we would say the standard deviation describes how the data points deviate from the mean. The mathematical expression is simply the square root of the variance σ² [18]:
σ = √(σ²) (13)
Covariance: With noise addition, measuring how related the original data and the perturbed data are is crucial. The covariance Cov(X, Y) is a calculation of how related the deviations between the data points X and Y are. If the covariance is positive, then X and Y tend to increase together; if the covariance is negative, then as one of the two variables increases, the other decreases. If the covariance is zero, then the data points are independent of each other. The expression for the covariance is given as follows [19]:
Cov(X, Y) = (1/N) Σ (xᵢ − μx)(yᵢ − μy) (14)
Correlation ρ, also known as the Pearson product-moment correlation, measures the strength and direction of an additive or linear relation between two variables. The correlation is dimensionless, independent of the units in which the data points x and y are measured [19]. If ρ = −1, this indicates a perfect negative linear relation between x and y. If ρ = 0, no linear relation between x and y exists; however, a nonlinear relation might still exist. If ρ = +1, there is a perfect positive linear relation between x and y. The expression used for the correlation is [19]:
ρ(x, y) = Cov(x, y) / (σx σy) (15)
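As a brief sketch, the statistics above can be computed for an original and a noise-added series using Python's statistics module (the two short series are illustrative assumptions):

```python
import statistics

x = [10.0, 12.0, 9.0, 14.0, 11.0]   # original values (illustrative)
y = [10.4, 11.7, 9.3, 14.5, 10.9]   # noise-added values (illustrative)

mu_x = statistics.mean(x)            # Eq. (11)
var_x = statistics.pvariance(x)      # population variance, Eq. (12)
sd_x = statistics.pstdev(x)          # Eq. (13): square root of the variance

# Eq. (14): population covariance of the two series (computed directly
# with a 1/N divisor to match the expression in the text).
mu_y = statistics.mean(y)
cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / len(x)

# Eq. (15): Pearson correlation, the covariance scaled by both deviations.
corr_xy = cov_xy / (sd_x * statistics.pstdev(y))
```

A correlation near +1 between the two series is the kind of evidence used later in the illustration to argue that the perturbed data retains the statistical behavior of the original.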
4.7. Signal Noise Ratio (SNR)
In this section, we take a look at SNR in relation to
data perturbation using noise addition, with the aim that
SNR could be employed to achieve optimal data utility
while preserving privacy, by measuring how much noise
we need to optimally obfuscate data. In electronic signals, SNR is used to quantify how much a signal has been corrupted by noise, by taking the ratio of signal power to noise power, basically the ratio of the power of the signal without noise over the power of the noise:
SNR = P_signal / P_noise (16)
With data perturbation, we could further borrow from the definition of SNR employed in image processing, where the ratio of the mean to the standard deviation of a signal is used; typically SNR is computed as the ratio of the mean pixel value to the standard deviation of the pixel values in a certain neighborhood [29] [30]:
SNR = μ / σ (17)
The parameter μ in this case represents the mean of the signal and the parameter σ the standard deviation of the noise. A presumed threshold for SNR in image processing is based on the Rose criterion, which stipulates that an SNR of 5 is needed in order to distinguish image details with 100 percent confidence. Therefore, an SNR of less than 5 will result in less than 100 percent confidence in recognizing the details of an image [31].
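A minimal sketch of the mean-to-standard-deviation form of SNR in Eq. (17); the Rose-criterion threshold of 5 is from the text, while the two sample series are illustrative assumptions:

```python
import statistics

def snr(values):
    """SNR as the ratio of the mean to the standard deviation, Eq. (17)."""
    return statistics.mean(values) / statistics.pstdev(values)

# Illustrative perturbed data: heavier noise lowers the SNR.
lightly_perturbed = [100.2, 99.7, 100.5, 99.9, 100.1]
heavily_perturbed = [80.0, 130.0, 95.0, 115.0, 70.0]

print(snr(lightly_perturbed))   # well above the Rose criterion of 5
print(snr(heavily_perturbed))   # below the Rose criterion of 5
```

In this framing, tuning the noise so the SNR sits near the desired threshold is one way to reason about how much obfuscation the data can tolerate before utility is lost.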
5. Illustration
In this section, we provide an example of data
perturbation with noise addition for illustrative purposes.
We follow a simple algorithm in implementing noise
addition perturbation methodology to provide
confidentiality in a published data set. The first step is the de-identification of the data set by the removal of PII, after which we apply noise addition. In our
implementation, we created a data set of 10 records for
illustrative purposes and then applied the algorithm
below. The original data set contained PII; we de-identified it, after which we applied
additive noise to the numerical attributes, and we then
plotted the results in a graph, comparing the statistical
properties of the original and perturbed data.
Steps for De-identification and Noise Addition
1. For all values of the data set to be published,
2. Do data de-identification
2.1. Find PII
2.2. Remove PII
3. For the remaining data, void of PII, to be published,
3.1. Find the quantitative attributes in the data set
3.2. Apply additive noise to the quantitative data values
4. Publish the data set
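The steps above can be sketched as follows; the record layout, the assumed PII fields, and the noise range of 1000 to 9000 (matching Table 3 of the illustration) are illustrative assumptions:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

PII_FIELDS = {"name", "ssn"}  # assumed PII attributes for this sketch

records = [
    {"name": "Student A", "ssn": "000-00-0001", "scholarship": 5000.0},
    {"name": "Student B", "ssn": "000-00-0002", "scholarship": 7500.0},
]

# Steps 1-2: de-identify by removing PII attributes.
deidentified = [
    {k: v for k, v in rec.items() if k not in PII_FIELDS}
    for rec in records
]

# Step 3: apply additive noise to the quantitative attributes.
published = [
    {k: v + random.uniform(1000, 9000) if isinstance(v, float) else v
     for k, v in rec.items()}
    for rec in deidentified
]

# Step 4: publish the perturbed data set.
print(published)
```

The published records retain the quantitative attribute, only shifted by the additive noise, so aggregate statistics can still be compared against the original.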
5.1. Results of Illustration
Table 1: Original Data Set (All data for illustrative
purposes).
Table 2: Result after de-identification on original data.
Table 3: Random noise between 1000 and 9000 added to the Scholarship attribute.
Table 4: Results of the normal distribution of the original and perturbed Scholarship amount.
Figure 3: Results of the normal distribution of the original and perturbed scholarship amount.
Covariance between Original Scholarship Data set and
Perturbed Scholarship Data set = 1055854875.465. Since
Covariance is positive, it shows that the two data sets
move together in the same direction. Correlation between
Original Scholarship Data set and Perturbed Scholarship
Data set = 0.999. Since the correlation is strongly positive, it shows a strong linear relationship between the two data sets, which increase and decrease together.
6. Conclusion
We have taken a look at data perturbation
utilizing noise addition as a methodology used to provide
privacy for published data sets. We also took a look at the statistical considerations when utilizing noise addition.
We provided an illustrative example showing that de-
identification of data when done in concert with noise
addition would add more to the privacy of published data
sets while maintaining the statistical properties of the
original data set. However, generating perturbed data sets that are statistically close to the original data sets is still a challenge, as consideration has to be made for the tradeoff between utility and privacy: the closer the perturbed data is to the original, the less confidential that data set becomes; the more distant the perturbed data set is from the original, the more secure it is, but utility might be lost when the statistical characteristics of the original data set are lost. Noise generation certainly affects the level of perturbation of the original data set. Yet still, striking the right balance between privacy and utility remains a challenge. While state-of-the-art data
perturbation techniques such as differential privacy
provide hope for achieving greater confidentiality,
achieving optimal data privacy while not shrinking data
utility is an ongoing NP-hard task. Therefore more
research needs to be done on how optimal privacy could
be achieved without degrading data utility. Another area
of research is how noise addition techniques could be
optimally applied in the cloud and mobile computing
arena, given the ubiquitous computing era.
7. References
[1] V. Ciriani, et al, 2007. Secure Data Management in
Decentralized System, Springer, ISBN 0387276947,
2007, pp 291-321.
[2] D.E Denning and P.J Denning, 1979. Data Security,
ACM Computing Surveys, Vol. II, No. 3, September
1, 1979.
[3] US Department of Homeland Security, 2008.
Handbook for Safeguarding Sensitive Personally
Identifiable Information at The Department of
Homeland Security, October 2008. [Online].
Available at:
http://www.dhs.gov/xlibrary/assets/privacy/privacy_
guide_spii_handbook.pdf
[4] E. Mccallister and K. Scarfone, 2010. Guide to
Protecting the Confidentiality of Personally
Identifiable Information ( PII ) Recommendations of
the National Institute of Standards and Technology,
NIST Special Publication 800-122, 2010.
[5] S.R. Ganta, et al, 2008. Composition attacks and
auxiliary information in data privacy, Proceeding of
the 14th ACM SIGKDD international conference on
Knowledge discovery and data mining -
SIGKDD ’08, 2008, p. 265.
[6] A. Oganian, and J. Domingo-Ferrer, 2001. On the
complexity of optimal microaggregation for
statistical disclosure control, Statistical Journal of
the United Nations Economic Commission for
Europe, Vol. 18, No. 4. (2001), pp. 345-353.
[7] K.F. Brewster, 1996. The National Computer
Security Center (NCSC) Technical Report - 005, Volume 1/5, Library No. S-243,039, 1996.
[8] P. Samarati, 2001. Protecting Respondent’s Privacy
in Microdata Release. IEEE Transactions on
Knowledge and Data Engineering 13, 6 (Nov./Dec.
2001): pp. 1010-1027.
[9] L. Sweeney, 2002. k-anonymity: A Model for
Protecting Privacy. International Journal on
Uncertainty, Fuzziness and Knowledge-based
Systems 10, 5 (Oct. 2002): pp. 557-570.
[10]Md Zahidul Islam, Privacy Preservation in Data
Mining Through Noise Addition, PhD Thesis,
School of Electrical Engineering and Computer
Science, University of Newcastle, Callaghan, New
South Wales 2308, Australia, November 2007
[11]Mohammad Ali Kadampur, Somayajulu D.V.L.N., A
Noise Addition Scheme in Decision Tree for, Privacy
Preserving Data Mining, JOURNAL OF
COMPUTING, VOLUME 2, ISSUE 1, JANUARY
2010, ISSN 2151-9617
[12]Jay Kim, A Method For Limiting Disclosure in
Microdata Based Random Noise and
Transformation, Proceedings of the Survey Research
Methods, American Statistical Association, Pages
370-374, 1986.
[13]J. Domingo-Ferrer, F. Sebé, and J. Castellà-Roca,
“On the Security of Noise Addition for Privacy in
Statistical Databases,” in Privacy in Statistical
Databases, vol. 3050, Springer Berlin / Heidelberg,
2004, p. 519.
[14]Huang et al, Deriving Private Information from
Randomized Data, Special Interest Group on
Management of Data - SIGMOD 2005 June 2005.
[15]Lyman Ott and Michael Longnecker, An introduction
to statistical methods and data analysis, Cengage
Learning, 2010, ISBN 0495017582,
9780495017585, Pages 171-173
[16]Martin Sternstein, Barron's AP Statistics, Barron's
Educational Series, 2010, ISBN 0764140892,
Pages 49-51.
[17]Chris Spatz, Basic Statistics: Tales of Distributions,
Cengage Learning, 2010, ISBN 0495808911, Page
68.
[18]David Ray Anderson, Dennis J. Sweeney, Thomas
Arthur Williams, Statistics for Business and
Economics, Cengage Learning, 2008, ISBN
0324365055, Pages 95.
[19]Michael J. Crawley, Statistics: an introduction using
R, John Wiley and Sons, 2005, ISBN 0470022973,
Pages 93-95.
[20]J. Domingo-Ferrer and V. Torra (Eds.), On the
Security of Noise Addition for Privacy in Statistical
Databases, LNCS 3050, pp. 149–161, 2004. Springer-Verlag Berlin Heidelberg 2004.
[21]Ruth Brand, Microdata Protection Through Noise
Addition, LNCS 2316, pp. 97–116, 2002. Springer-
Verlag Berlin Heidelberg 2002.
[22]Ciriani et al, Microdata Protection, Secure Data
Management in Decentralized System, pages 291-
321, Springer, 2007.
[23]Jay J. Kim and William E. Winkler, Multiplicative
Noise for Masking Continuous Data, Research
Report Series, Statistics #2003-01, Statistical
Research Division, U.S. Bureau of the Census.
[24]Rastogi et al, The boundary between privacy and
utility in data publishing, VLDB ,September 2007,
pp. 531-542.
[25]Sramka et al, A Practice-oriented Framework for
Measuring Privacy and Utility in Data Sanitization
Systems, ACM, EDBT 2010.
[26]Sankar, S.R., Utility and Privacy of Data Sources:
Can Shannon Help Conceal and Reveal
Information?, presented at CoRR, 2010.
[27]Wong, R.C., et al, Minimality attack in privacy
preserving data publishing, VLDB, 2007. pp.543-
554.
[28]Adam, N.R. and Wortmann, J.C., A Comparative
Methods Study for Statistical Databases: Adam and
Wortmann, ACM Comp. Surveys, vol.21, 1989.
[29]Jeffrey J. Goldberger, Practical Signal and Image
Processing in Clinical Cardiology, Springer, 2010,
Page 28-42
[30]John L. Semmlow, Biosignal and biomedical image
processing: MATLAB-based applications, Volume
22 of Signal processing and communications CRC
Press, 2004, ISBN 9780824750688, Page 11.
[31]Jerrold T. Bushberg, The essential physics of
medical imaging, Edition 2, Lippincott Williams &
Wilkins, 2002, ISBN 0683301187, 9780683301182,
Page 278-280.
[32]Narayanan, A. and Shmatikov, V., 2010. Myths and
fallacies of "personally identifiable information". In
Proceedings of Commun. ACM. 2010, 24-26.
[33]Dwork, C., Differential Privacy, in ICALP, Springer,
2006
[34]Muralidhar, K., and Sarathy, R., Does Differential
Privacy Protect Terry Gross’ Privacy?, In Privacy in
Statistical Databases, Vol. 6344 (2011), pp. 200-209.
[35]Muralidhar, K., and Sarathy, R., Some Additional
Insights on Applying Differential Privacy for
Numeric Data, In Privacy in Statistical Databases,
Vol. 6344 (2011), pp. 210-219.
[36]Dwork, C., Differential Privacy: A Survey of
Results, In Theory and Applications of Models of
Computation TAMC , pp. 1-19, 2008
[37]M. S. Alvim, M. E. Andrés, K. Chatzikokolakis, P.
Degano, and C. Palamidessi, "Differential privacy:
on the trade-off between utility and information
leakage," Aug. 2011. [Online]. Available:
http://arxiv.org/abs/1103.5188
[38]Fienberg, S.E., et al, Differential Privacy and the
Risk-Utility Tradeoff for Multi-dimensional
Contingency Tables In Privacy in Statistical
Databases, Vol. 6344 (2011), pp. 187-199.
[39]A. Haeberlem, B.C. Pierce, and A. Narayan,
"Differential privacy under fire," in Proceedings of
the 20th USENIX Security Symposium, Aug. 2011.
[40]Santos, R.J.; Bernardino, J.; Vieira, M.; , "A survey
on data security in data warehousing: Issues,
challenges and opportunities," EUROCON -
International Conference on Computer as a Tool
(EUROCON), 2011 IEEE , vol., no., pp.1-4, 27-29
April 2011
[41]Joshi, P.; Kuo, C.-C.J.; , "Security and privacy in
online social networks: A survey," Multimedia and
Expo (ICME), 2011 IEEE International Conference
on , vol., no., pp.1-6, 11-15 July 2011
[42]Matthews, Gregory J., Harel, Ofer, Data
confidentiality: A review of methods for statistical
disclosure limitation and methods for assessing
privacy, Statistics Surveys, 5,
(2011), 1-29 (electronic).
[43]Liu Ying-hua; Yang Bing-ru; Cao Dan-yang; Ma
Nan; , "State-of-the-art in distributed privacy
preserving data mining," Communication Software
and Networks (ICCSN), 2011 IEEE 3rd International
Conference on , vol., no., pp.545-549, 27-29 May
2011