This document summarizes research on anonymizing network trace data while maintaining usability. It discusses challenges in applying traditional anonymization techniques to network traces due to their unique structure. The paper proposes heuristics for usability-aware anonymization that apply microdata privacy techniques separately to different network trace attributes. Preliminary results suggest the potential to generate anonymized traces with improved usability through trade-offs determined on a case-by-case basis. The document also reviews related work on network trace anonymization and attacks against anonymized data.
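One widely used building block for the address attribute of a trace is consistent keyed pseudonymization: repeated hosts stay linkable (preserving usability for flow analysis) while real addresses are hidden. The sketch below is an illustration under assumptions, not the paper's method; the HMAC key is hypothetical, and mapping pseudonyms into the reserved 10.0.0.0/8 range is a common convention, not a requirement.

```python
import hmac, hashlib, ipaddress

SECRET_KEY = b"trace-key"  # hypothetical per-dataset key; keep secret, rotate per release

def pseudonymize_ip(ip: str) -> str:
    """Map an IP address to a consistent pseudonym in 10.0.0.0/8.

    The same input always yields the same output, so flow structure
    survives anonymization, but the real address is not recoverable
    without the key.
    """
    digest = hmac.new(SECRET_KEY, ip.encode(), hashlib.sha256).digest()
    host = int.from_bytes(digest[:3], "big")   # 24 bits of pseudonym space
    return str(ipaddress.IPv4Address((10 << 24) | host))

trace = ["192.168.1.5", "192.168.1.9", "192.168.1.5"]
anon = [pseudonymize_ip(ip) for ip in trace]
assert anon[0] == anon[2]          # repeated hosts stay linkable
assert anon[0] != "192.168.1.5"    # original address is hidden
```

Other attributes (timestamps, ports, payload sizes) would get their own technique, which is the per-attribute trade-off the abstract describes.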
A Review on Privacy Preservation in Data Mining - ijujournal
The initial focus of privacy-preserving data mining was to extend traditional data mining techniques to mask sensitive information through data modification. The major issues were how to modify the data and how to recover the data mining result from the altered data, and the resulting approaches were often tightly coupled with the data mining algorithms under consideration. Privacy-preserving data publishing, in contrast, focuses on techniques for publishing data rather than techniques for data mining; it is expected that standard data mining techniques will be applied to the published data. Such anonymization hides the identity of record owners, whereas privacy-preserving data mining seeks to directly hide the sensitive data. This survey reviews the various privacy-preservation techniques and algorithms.
Data mining over diverse data sources is a useful means of discovering valuable patterns, associations, trends, and dependencies in data. Many variants of this problem exist, depending on how the data is distributed, what type of data mining we wish to do, how privacy of the data is to be achieved, and what restrictions are placed on sharing of information. A transactional database owner lacking the expertise or computational resources can outsource its mining tasks to a third-party service provider or server. However, both the itemsets and the association rules of the outsourced database are considered private property of the database owner.
In this paper, we consider a scenario where multiple data sources are willing to share their data with a trusted third party, called the combiner, who runs data mining algorithms over the union of their data, as long as each data source is guaranteed that its information that does not pertain to another data source will not be revealed. The proposed algorithm is characterized by (1) secret-sharing-based secure key transfer with lightweight encryption for preserving the privacy of distributed transactional databases, and (2) a rough-set-based mechanism for association rule extraction for an efficient mining task. Performance analysis and experimental results are provided to demonstrate the effectiveness of the proposed algorithm.
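The secret-sharing building block in point (1) can be sketched as additive secret sharing over a prime field: a key is split so that any subset of fewer than all shares reveals nothing. The prime and the key value below are illustrative assumptions, not the paper's parameters.

```python
import random

P = 2**61 - 1  # a Mersenne prime modulus; illustrative choice, not the paper's

def share(secret: int, n: int) -> list:
    """Split `secret` into n additive shares mod P.

    The first n-1 shares are uniformly random; the last is the
    correction term, so any n-1 shares alone reveal nothing.
    """
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares: list) -> int:
    """Only the sum of ALL shares recovers the secret."""
    return sum(shares) % P

key = 123456789
shares = share(key, 4)
assert reconstruct(shares) == key
```

In the combiner scenario, each source would send one share per recipient over a private channel, so the key materializes only where all shares meet.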
Ijeee 7-11-privacy preserving distributed data mining with anonymous id assig... - Kumar Goud
Privacy Preserving Distributed Data Mining with Anonymous ID Assignment
Chikkudu Chandrakanth, M.Tech Scholar (CSE), Sri Indu College of Engg and Tech, Ibrahimpatan, Hyderabad, TS, India
Bheemari Santhoshkumar, M.Tech Scholar (CSE), Sri Indu College of Engg and Tech, Ibrahimpatan, Hyderabad, TS, India
Tejavath Charan Singh, Assistant Professor, Dept of CSE, Sri Indu College of Engg and Tech, Ibrahimpatan, Hyderabad, TS, India
Abstract: This paper develops an algorithm for anonymous sharing of private data among parties, built on top of a secure sum data mining operation using Newton’s identities and Sturm’s theorem. The ID assignment is anonymous in that the identities received are unknown to the other members of the group, and resistance to collusion among other members is verified in an information-theoretic sense when private communication channels are used. This assignment of serial numbers allows more complex data to be shared and has applications to other problems in privacy-preserving data mining, collision avoidance in communications, and distributed database access. An algorithm for the distributed solution of certain polynomials over finite fields enhances the scalability of the approach.
Key words: Cloud, Website, information sharing, DBMS, ID, ODBC, ASP.NET
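The secure sum primitive that the abstract builds on can be sketched as the classic ring protocol: the initiating party adds a random mask before the total circulates, so every intermediate value each party sees is uniformly random, and only the final unmasked total is revealed. This is a simplified single-process simulation, not the paper's full protocol.

```python
import random

def secure_sum(private_values, modulus=2**32):
    """Ring-based secure sum over a modulus.

    Party 0 seeds the running total with a random mask, each party
    adds only its own value, and party 0 removes the mask at the end.
    Intermediate totals leak nothing about individual inputs.
    """
    mask = random.randrange(modulus)
    running = mask
    for v in private_values:          # each party adds its private value in turn
        running = (running + v) % modulus
    return (running - mask) % modulus  # party 0 unmasks: only the sum is public

assert secure_sum([10, 20, 30]) == 60
```

Anonymous ID assignment then builds on this primitive to let parties claim slots without revealing which slot belongs to whom.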
International Journal of Engineering Research and Development (IJERD) - IJERD Editor
A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin... - Kato Mivule
Kato Mivule, Claude Turner, "A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Using Machine Learning Classification as a Gauge", Procedia Computer Science, Volume 20, 2013, Pages 414-419, Baltimore MD, USA
Cluster Based Access Privilege Management Scheme for Databases - Editor IJMTER
Knowledge discovery is carried out using data mining techniques. Association rule mining, classification, and clustering operations are carried out under data mining. Clustering is used to group records based on relevancy, with distance or similarity measures used to estimate the relationship between transactions. Census data and medical data are referred to as microdata. Data publishing schemes are used to provide private data for analysis; privacy preservation is used to protect private data values, and anonymity is considered in the privacy-preservation process.
Data values are made available to authorized users through access control models. A Privacy Protection Mechanism (PPM) uses suppression and generalization of relational data to anonymize and satisfy privacy needs. An accuracy-constrained privacy-preserving access control framework is used to manage access control in relational databases: the access control policies define the selection predicates available to roles, while the privacy requirement is to satisfy k-anonymity or l-diversity. An imprecision bound constraint is assigned to each selection predicate, and k-anonymous Partitioning with Imprecision Bounds (k-PIB) is used to estimate accuracy and privacy constraints. Role-Based Access Control (RBAC) allows defining permissions on objects based on roles in an organization. The Top-Down Selection Mondrian (TDSM) algorithm, constructed using greedy heuristics and a kd-tree model, is used for query workload-based anonymization. Query cuts are selected with minimum bounds in the Top-Down Heuristic 1 algorithm (TDH1); the query bounds are updated as partitions are added to the output in the Top-Down Heuristic 2 algorithm (TDH2); and the cost of reduced precision in the query results is used in the Top-Down Heuristic 3 algorithm (TDH3). A repartitioning algorithm is used to reduce the total imprecision for the queries.
The privacy-preserving access privilege management scheme is enhanced to provide incremental mining features. Data insert, delete, and update operations are connected with the partition management mechanism. Cell-level access control is provided with a differential privacy method, and a dynamic role management model is integrated with the access control policy mechanism for query predicates.
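The Mondrian-style top-down partitioning that TDSM refines can be sketched as greedy median splits on the widest attribute, stopping whenever a split would leave a partition smaller than k. The toy data below is illustrative; the workload-aware cut selection of TDSM/TDH1-3 is not modeled here.

```python
def mondrian_partition(records, k):
    """Top-down median split (Mondrian-style kd partitioning).

    Recursively split on the attribute with the widest range until a
    further split would violate k-anonymity; each returned partition
    would then be generalized to its bounding ranges.
    """
    dims = range(len(records[0]))
    # greedy choice: the attribute with the widest value range
    dim = max(dims, key=lambda d: max(r[d] for r in records) - min(r[d] for r in records))
    records = sorted(records, key=lambda r: r[dim])
    mid = len(records) // 2
    left, right = records[:mid], records[mid:]
    if len(left) < k or len(right) < k:
        return [records]               # cannot split further without breaking k-anonymity
    return mondrian_partition(left, k) + mondrian_partition(right, k)

data = [(25, 50), (27, 60), (40, 55), (43, 70), (50, 65), (52, 80)]
parts = mondrian_partition(data, k=2)
assert all(len(p) >= 2 for p in parts)           # every partition satisfies k
assert sum(len(p) for p in parts) == len(data)   # no record lost
```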
Abstract: Data mining has wide applications in many areas such as banking, medicine, scientific research, and government. Classification is one of the commonly used tasks in data mining applications. With cloud computing, users have the opportunity to outsource their data, in encrypted form, as well as the data mining tasks, to the cloud. Since the data on the cloud is in encrypted form, existing privacy-preserving classification techniques are not applicable. We therefore focus on solving the classification problem over encrypted data and propose a secure k-NN classifier over encrypted data in the cloud. The k-NN protocol protects the confidentiality of the data, the user’s input query, and the data access patterns, and is developed under the standard semi-honest model. We also empirically analyze the efficiency of our solution through various experiments.
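The plaintext core of what the secure protocol computes can be sketched as ordinary k-NN classification: distances to the query, a top-k selection, and a majority vote. In the paper these steps run over encrypted values; the cryptographic layer is omitted here, and the toy data is illustrative.

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Plain k-NN: squared-distance ranking plus majority vote.

    This is the computation the secure protocol performs without
    revealing the data, the query, or the access pattern.
    """
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((8, 8), "B"),
         ((9, 8), "B"), ((2, 1), "A")]
assert knn_classify(train, (1.5, 1.5), k=3) == "A"
```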
Data Anonymization for Privacy Preservation in Big Data - rahulmonikasharma
Cloud computing provides scalable IT infrastructure to support the processing of various big data applications in sectors such as healthcare and business. Electronic health record data sets in such applications generally contain privacy-sensitive data. The most popular technique for data privacy preservation is anonymizing the data through generalization. The proposal is to examine the issue of proximity privacy breaches in big data anonymization and to identify a scalable solution to this problem. A two-phase scalable clustering approach, consisting of a clustering algorithm and a k-anonymity scheme with generalization and suppression, is intended to address it. The algorithms are designed with MapReduce to achieve high scalability by carrying out data-parallel execution in the cloud. Extensive experiments on real data sets substantiate that the method significantly improves the capability of defending against proximity privacy breaches, as well as the scalability and efficiency of anonymization, over existing methods. Anonymizing data sets through generalization to satisfy privacy models such as k-anonymity is a popular class of privacy-preserving methods. Currently, the scale of data in many clouds grows enormously in line with big data, making it a challenge for commonly used tools to capture, manage, and process such large-scale data within an acceptable time frame. Hence, existing anonymization approaches struggle to achieve privacy preservation for big data due to scalability issues.
PRIVACY PRESERVING DATA MINING BY USING IMPLICIT FUNCTION THEOREM - IJNSA Journal
Data mining has become a broadly significant multidisciplinary field, used in vast application domains, that extracts knowledge by identifying structural relationships among the objects in large databases. Privacy-preserving data mining is a newer area of data mining research concerned with keeping sensitive knowledge extracted from a data mining system accessible only to the intended parties rather than to everyone. In this paper, we propose a new approach to privacy-preserving data mining that uses the implicit function theorem for secure transformation of sensitive data obtained from a data mining system. We propose a two-way enhanced security approach: first, transforming the original values of sensitive data into different partial derivatives of functional values to perturb the data; and second, generating a symmetric key from the eigenvalues of the Jacobian matrix for secure computation. We give an example of converting sensitive academic data into vector-valued functions to explain the proposed concept, and present implementation-based results for the new approach.
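As a rough illustration of the second step (a key derived from Jacobian eigenvalues), the sketch below uses a hypothetical vector-valued map F with components F_i(x) = x_i · x_{(i+1) mod n}, forms its square Jacobian at the data point, and hashes the sorted eigenvalues into a 256-bit key. The choice of map, the rounding, and the hashing are assumptions for illustration, not the paper's construction.

```python
import hashlib
import numpy as np

def jacobian_key(x):
    """Derive a symmetric key from the eigenvalues of a Jacobian.

    For the hypothetical map F_i(x) = x_i * x_{(i+1) mod n}, the
    Jacobian has dF_i/dx_i = x_{i+1} and dF_i/dx_{i+1} = x_i.
    Rounding guards against tiny floating-point differences before
    hashing, so both ends of a channel derive the same key.
    """
    n = len(x)
    J = np.zeros((n, n))
    for i in range(n):
        J[i, i] = x[(i + 1) % n]      # partial derivative w.r.t. x_i
        J[i, (i + 1) % n] = x[i]      # partial derivative w.r.t. x_{i+1}
    eig = np.sort(np.linalg.eigvals(J))
    return hashlib.sha256(np.round(eig, 6).tobytes()).hexdigest()

k1 = jacobian_key([3.0, 1.0, 4.0])
k2 = jacobian_key([3.0, 1.0, 4.0])
assert k1 == k2 and len(k1) == 64   # both parties derive the same 256-bit key
```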
SECURED FREQUENT ITEMSET DISCOVERY IN MULTI PARTY DATA ENVIRONMENT FREQUENT I... - Editor IJMTER
Security and privacy methods are used to protect data values. Private data values are secured with confidentiality and integrity methods. A privacy model hides individual identities within public data values, and sensitive attributes are protected using anonymity methods. Two or more parties hold their own private data in a distributed environment; the parties can collaborate to compute any function on the union of their data. Secure Multiparty Computation (SMC) protocols are used in privacy-preserving data mining in distributed environments. Association rule mining techniques are used to fetch frequent patterns, and the Apriori algorithm is used to mine association rules in databases. Homogeneous databases share the same schema but hold information on different entities; horizontal partitioning refers to a collection of homogeneous databases maintained by different parties. The Fast Distributed Mining (FDM) algorithm is an unsecured distributed version of the Apriori algorithm, and the Kantarcioglu and Clifton protocol is used for secure mining of association rules in horizontally distributed databases. The Unifying lists of locally Frequent Itemsets, Kantarcioglu and Clifton (UniFI-KC) protocol is used for the rule mining process in a partitioned database environment, and it is enhanced with two methods for stronger security: a secure threshold-function computation algorithm to compute the union of the private subsets held by the interacting players, and a set-inclusion computation algorithm to test the inclusion of an element held by one player in a subset held by another. The system is improved to support secure rule mining in a vertically partitioned database environment, and the subgroup discovery process is adapted for the partitioned setting. The system can further be improved to support generalized association rule mining and to control security leakages in the rule mining process.
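The Apriori core that FDM distributes can be sketched as level-wise candidate growth: size-(k+1) candidates are built only from surviving size-k frequent itemsets, and each level is pruned by a support threshold. This is a minimal single-site version; FDM runs this logic per site and then unions the locally frequent sets.

```python
def apriori(transactions, min_support):
    """Minimal Apriori: grow candidate itemsets level by level,
    keeping only those whose support meets the threshold."""
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]
    while level:
        kept = [s for s in level if support(s) >= min_support]
        frequent.update({s: support(s) for s in kept})
        # size-(k+1) candidates from unions of surviving size-k sets
        level = sorted({a | b for a in kept for b in kept
                        if len(a | b) == len(a) + 1}, key=sorted)
    return frequent

T = [frozenset(t) for t in [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]]
freq = apriori(T, min_support=2)
assert freq[frozenset({"a", "b"})] == 2
```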
An Investigation of Data Privacy and Utility Preservation Using KNN Classific... - Kato Mivule
Kato Mivule and Claude Turner, An Investigation of Data Privacy and Utility Preservation Using KNN Classification as a Gauge, International Conference on Information and Knowledge Engineering (IKE 2013), July 22-25, Pages 203-204, Las Vegas, NV, USA
Privacy Preserving in Cloud Using Distinctive Elliptic Curve Cryptosystem (DECC) - ElavarasaN GanesaN
Securing data over a cloud network has been a challenging problem for researchers over the past decade. Many conventional algorithms and techniques claim to ensure secure transmission, storage, and retrieval of data over the cloud platform, and these mechanisms mainly focus on preserving the privacy of client/user data. This work proposes Distinctive Elliptic Curve Cryptography (DECC), based on the algebraic structure of elliptic curves over finite fields. DECC is used for privacy preservation since it requires smaller keys than the other existing cryptographic algorithms. Performance metrics such as average relative error, time, anonymization time, and information loss are taken into account. Implementations are carried out in MATLAB. The results show that the proposed DECC outperforms the existing methods.
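The elliptic-curve arithmetic underlying any ECC scheme can be illustrated on a standard textbook curve, y² = x³ + 2x + 2 over F₁₇ (far too small for real use; DECC would work over a large field). The private key is just the scalar in a point multiplication, which is why ECC keys can stay small for a given security level. This sketch is generic ECC, not the paper's DECC variant.

```python
def ec_add(P, Q, a=2, p=17):
    """Point addition on the curve y^2 = x^3 + 2x + 2 over F_17.
    None represents the point at infinity (the group identity)."""
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                 # P + (-P) = infinity
    if P == Q:
        m = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # tangent slope
    else:
        m = (y2 - y1) * pow(x2 - x1, -1, p) % p          # chord slope
    x3 = (m * m - x1 - x2) % p
    return (x3, (m * (x1 - x3) - y1) % p)

def ec_mul(k, P):
    """Double-and-add scalar multiplication: k is the private key."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R

G = (5, 1)                      # a generator of this textbook curve
assert ec_mul(2, G) == (6, 3)   # known doubling result for this curve
```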
Applying Data Privacy Techniques on Published Data in Uganda - Kato Mivule
Kato Mivule, Claude Turner, "Applying Data Privacy Techniques on Published Data in Uganda", Proceedings of the 2012 International Conference on e-Learning, e-Business, Enterprise Information Systems, and e-Government (EEE 2012), Pages 110-115, Las Vegas, NV, USA.
Privacy-Preserving Updates to Anonymous and Confidential Databases - ijdmtaiir
The current trend in the application space toward systems of loosely coupled and dynamically bound components that enable just-in-time integration jeopardizes the security of information shared between the broker, the requester, and the provider at runtime. In particular, new advances in data mining and knowledge discovery, which allow the extraction of hidden knowledge from enormous amounts of data, impose new threats on the seamless integration of information. We consider the problem of building privacy-preserving algorithms for one category of data mining techniques, association rule mining. Suppose Alice owns a k-anonymous database and needs to determine whether her database, after insertion of a tuple owned by Bob, is still k-anonymous. Suppose also that access to the database is strictly controlled, because, for example, the data are used for experiments that must be kept confidential. Clearly, allowing Alice to directly read the contents of the tuple breaks the privacy of Bob (e.g., a patient’s medical record); on the other hand, the confidentiality of the database managed by Alice is violated once Bob has access to its contents. Thus, the problem is to check whether the database with the inserted tuple is still k-anonymous, without letting Alice and Bob know the contents of the tuple and the database, respectively. In this paper, we propose two protocols solving this problem for suppression-based and generalization-based k-anonymous and confidential databases. The protocols rely on well-known cryptographic assumptions, and we provide theoretical analyses to prove their soundness and experimental results to illustrate their efficiency. Since the proposed protocols ensure that the updated database remains k-anonymous, the results returned from a user’s (or a medical researcher’s) query are also k-anonymous, so the patient’s or data provider’s privacy cannot be violated by any query. As long as the database is properly updated using the proposed protocols, user queries under our application domain are always privacy-preserving.
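The property the protocols check can be stated in a few lines once privacy is set aside: after inserting the new tuple, does every quasi-identifier group still contain at least k rows? The sketch below is this non-private check for illustration; the paper's contribution is performing it cryptographically, without either party seeing the other's data. Column choices and values are hypothetical.

```python
from collections import Counter

def still_k_anonymous(table, new_tuple, k, qi_cols):
    """Return True iff the table stays k-anonymous after inserting
    new_tuple, grouping rows by their quasi-identifier columns."""
    rows = table + [new_tuple]
    groups = Counter(tuple(r[c] for c in qi_cols) for r in rows)
    return all(count >= k for count in groups.values())

# quasi-identifiers: generalized age range and zip prefix (columns 0, 1)
db = [("20-30", "130**", "flu"),  ("20-30", "130**", "cold"),
      ("40-50", "148**", "flu"),  ("40-50", "148**", "asthma")]
assert still_k_anonymous(db, ("20-30", "130**", "ulcer"), k=2, qi_cols=(0, 1))
assert not still_k_anonymous(db, ("60-70", "555**", "flu"), k=2, qi_cols=(0, 1))
```

A tuple that joins an existing group preserves 2-anonymity; one that opens a fresh group of size one breaks it, which is exactly what the suppression- and generalization-based protocols must detect blindly.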
Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
Kato Mivule, "Utilizing Noise Addition for Data Privacy, an Overview", Proceedings of the International Conference on Information and Knowledge Engineering (IKE 2012), Pages 65-71, Las Vegas, NV, USA.
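Noise addition as surveyed above can be sketched with the Laplace mechanism, which also underlies differential privacy: each numeric value is perturbed by noise scaled to sensitivity/epsilon, where larger epsilon means less noise and weaker privacy. The parameter values and data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility of the sketch

def add_laplace_noise(values, sensitivity=1.0, epsilon=0.5):
    """Perturb each value with Laplace noise of scale sensitivity/epsilon.

    The parameter choices trade utility (less noise) against privacy
    (more noise); they are illustrative, not prescriptive.
    """
    scale = sensitivity / epsilon
    return values + rng.laplace(0.0, scale, size=len(values))

salaries = np.array([50_000.0, 62_000.0, 47_500.0])
noisy = add_laplace_noise(salaries)
assert noisy.shape == salaries.shape
assert not np.allclose(noisy, salaries)   # values were actually perturbed
```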
Towards A Differential Privacy and Utility Preserving Machine Learning Classi... - Kato Mivule
Kato Mivule, Claude Turner, Soo-Yeon Ji, "Towards A Differential Privacy and Utility Preserving Machine Learning Classifier", Procedia Computer Science (Complex Adaptive Systems), 2012, Pages 176-181, Washington DC, USA.
Accessing secured data in cloud computing environment - IJNSA Journal
The number of businesses using cloud computing has increased dramatically over the last few years due to attractive features such as scalability, flexibility, fast start-up, and low costs. Services provided over the web range from using the provider’s software and hardware to managing security and other issues. Some of the biggest challenges at this point are providing privacy and data security to subscribers of public cloud servers. The efficient encryption technique presented in this paper can be used for secure access to and storage of data on a public cloud server, and for moving and searching encrypted data through communication channels while protecting data confidentiality. The method protects data against both external and internal intruders: data can be decrypted only with the key provided by the data owner, while the public cloud server is unable to read encrypted data or queries. Answering a query does not depend on its size and is done in constant time. Data access is managed by the data owner, and the proposed scheme allows detection of unauthorized modifications.
AN EFFICIENT SOLUTION FOR PRIVACY-PRESERVING, SECURE REMOTE ACCESS TO SENSITIV... - cscpconf
Sharing data that contains personally identifiable or sensitive information, such as medical records, always has privacy and security implications. The issues can become rather complex when the methods of access can vary, and accurate individual data needs to be provided whilst mass data release for specific purposes (for example, for medical research) also has to be catered for. Although various solutions have been proposed to address the different aspects individually, a comprehensive approach is highly desirable. This paper presents a solution for maintaining the privacy of data released en masse in a controlled manner, and for providing secure access to the original data for authorized users. The results show that the solution is provably secure and maintains privacy in a more efficient manner than previous solutions.
Abstract: Data Mining has wide applications in many areas such as banking, medicine, scientific research and among government agencies. Classification is one of the commonly used tasks in data mining applications. The cloud computing, users have the opportunity to outsource their data, in encrypted form, as well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy preserving classification techniques are not applicable. On solving the classification problem over encrypted data. A secure k-NN classifier over encrypted data in the cloud. The k-NN protocol protects the confidentiality of the data, user’s input query, and data access patterns. To develop a secure k-NN classifier over encrypted data under the standard semi-honest model. Also, we empirically analyze the efficiency of our solution through various experiments.
Data Anonymization for Privacy Preservation in Big Datarahulmonikasharma
Cloud computing provides capable ascendable IT edifice to provision numerous processing of a various big data applications in sectors such as healthcare and business. Mainly electronic health records data sets and in such applications generally contain privacy-sensitive data. The most popular technique for data privacy preservation is anonymizing the data through generalization. Proposal is to examine the issue against proximity privacy breaches for big data anonymization and try to recognize a scalable solution to this issue. Scalable clustering approach with two phase consisting of clustering algorithm and K-Anonymity scheme with Generalisation and suppression is intended to work on this problem. Design of the algorithms is done with MapReduce to increase high scalability by carrying out dataparallel execution in cloud. Wide-ranging researches on actual data sets substantiate that the method deliberately advances the competence of defensive proximity privacy breaks, the scalability and the efficiency of anonymization over existing methods. Anonymizing data sets through generalization to gratify some of the privacy attributes like k- Anonymity is a popularly-used type of privacy preserving methods. Currently, the gauge of data in numerous cloud surges extremely in agreement with the Big Data, making it a dare for frequently used tools to actually get, manage, and process large-scale data for a particular accepted time scale. Hence, it is a trial for prevailing anonymization approaches to attain privacy conservation for big data private information due to scalabilty issues.
PRIVACY PRESERVING DATA MINING BY USING IMPLICIT FUNCTION THEOREM (IJNSA Journal)
Data mining has become a broad and significant multidisciplinary field, used in vast application domains, that extracts knowledge by identifying structural relationships among objects in large databases. Privacy-preserving data mining is a newer area of data mining research concerned with protecting sensitive knowledge extracted from a data mining system so that it can be shared only with the intended parties. In this paper, we propose a new approach to privacy-preserving data mining that uses the implicit function theorem for secure transformation of sensitive data obtained from a data mining system. We propose a two-way enhanced security approach: first, the original values of the sensitive data are transformed into partial-derivative functional values to perturb the data; second, a symmetric key is generated from the eigenvalues of the Jacobian matrix for secure computation. We give an example of academic sensitive data converted into vector-valued functions to explain the proposed concept, and present implementation-based results for the new approach.
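The two steps can be illustrated with a small sketch. The vector-valued function f below is an assumption chosen purely for illustration (the paper does not specify one here): sensitive values are replaced by Jacobian (partial-derivative) entries, and the Jacobian's eigenvalues serve as shared key material.

```python
import math

def perturb(x, y):
    """Map a sensitive pair (x, y) to the Jacobian of an illustrative
    vector-valued function f(x, y) = (x**2 * y, x + y**2).
    The published values are partial derivatives, not the raw data."""
    return [[2 * x * y, x ** 2],   # d(f1)/dx, d(f1)/dy
            [1.0,       2 * y]]    # d(f2)/dx, d(f2)/dy

def eigen_key(J):
    """Derive symmetric key material from the eigenvalues of the
    2x2 Jacobian, computed via its trace and determinant."""
    tr = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    root = math.sqrt(abs(tr * tr - 4 * det))
    return round((tr + root) / 2, 6), round((tr - root) / 2, 6)

J = perturb(3.0, 2.0)       # Jacobian [[12, 9], [1, 4]]
print(eigen_key(J))         # (13.0, 3.0)
```

Both parties who know f and the record can derive the same eigenvalues, so the pair can act as a shared symmetric key without transmitting the raw values.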
SECURED FREQUENT ITEMSET DISCOVERY IN MULTI PARTY DATA ENVIRONMENT FREQUENT I... (Editor IJMTER)
Security and privacy methods are used to protect data values. Private data values are secured with confidentiality and integrity methods. A privacy model hides individual identity behind the public data values, and sensitive attributes are protected using anonymity methods. In a distributed environment, two or more parties each hold their own private data, and the parties can collaborate to compute any function on the union of their data. Secure Multiparty Computation (SMC) protocols are used for privacy-preserving data mining in distributed environments. Association rule mining techniques are used to fetch frequent patterns; the Apriori algorithm is used to mine association rules in databases. Homogeneous databases share the same schema but hold information on different entities; a horizontal partition is a collection of homogeneous databases maintained by different parties. The Fast Distributed Mining (FDM) algorithm is an unsecured distributed version of the Apriori algorithm, and the Kantarcioglu and Clifton protocol is used for secure mining of association rules in horizontally distributed databases. The Unifying lists of locally Frequent Itemsets Kantarcioglu and Clifton (UniFI-KC) protocol is used for the rule mining process in a partitioned database environment, and it is enhanced here in two ways to improve security. A secure threshold-function computation algorithm is used to compute the union of the private subsets held by the interacting players, and a set-inclusion computation algorithm is used to test the inclusion of an element held by one player in a subset held by another. The system is improved to support secure rule mining in a vertically partitioned database environment, and the subgroup discovery process is adapted for the partitioned environment. The system can be further improved to support generalized association rule mining and to control security leakages in the rule mining process.
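A common building block in such SMC protocols is a secure sum over a ring of parties. A minimal sketch, under the simplifying assumptions of honest-but-curious parties and no collusion, of how sites could total their local support counts for an itemset without revealing any individual count:

```python
import random

def secure_sum(local_counts, modulus=10**6):
    """Ring-based secure sum: the initiating site masks its count with
    a random offset, each subsequent site adds its own count modulo
    `modulus`, and the initiator removes the mask at the end.
    No site sees another site's individual count in the clear."""
    mask = random.randrange(modulus)
    running = (mask + local_counts[0]) % modulus
    for c in local_counts[1:]:
        running = (running + c) % modulus
    return (running - mask) % modulus

# Three parties holding local support counts for one candidate itemset
print(secure_sum([120, 45, 80]))  # 245
```

The global count (245) can then be compared against the global support threshold, which is the kind of threshold-function computation the protocol above secures.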
An Investigation of Data Privacy and Utility Preservation Using KNN Classific... (Kato Mivule)
Kato Mivule and Claude Turner, An Investigation of Data Privacy and Utility Preservation Using KNN Classification as a Gauge, International Conference on Information and Knowledge Engineering (IKE 2013), July 22-25, Pages 203-204, Las Vegas, NV, USA
Privacy Preserving in Cloud Using Distinctive Elliptic Curve Cryptosystem (DECC) (ElavarasaN GanesaN)
Securing data over a cloud network has been a challenging problem for researchers over the past decade. Many conventional algorithms and techniques claim to ensure secure transmission, storage, and retrieval of data over the cloud platform, and these mechanisms mainly focus on preserving the privacy of client/user data. This research work proposes Distinctive Elliptic Curve Cryptography (DECC), which is based on the algebraic structure of elliptic curves over finite fields. DECC is used for privacy preservation because it uses smaller keys than the other existing cryptographic algorithms. Performance metrics such as average relative error, time, anonymization time, and information loss are taken into account. Implementations are carried out in MATLAB. Results show that the proposed DECC outperforms the existing methods.
Applying Data Privacy Techniques on Published Data in Uganda (Kato Mivule)
Kato Mivule, Claude Turner, "Applying Data Privacy Techniques on Published Data in Uganda", Proceedings of the 2012 International Conference on e-Learning, e-Business, Enterprise Information Systems, and e-Government (EEE 2012), Pages 110-115, Las Vegas, NV, USA.
Privacy-Preserving Updates to Anonymous and Confidential Database (ijdmtaiir)
The current trend in the application space towards systems of loosely coupled and dynamically bound components that enable just-in-time integration jeopardizes the security of information shared between the broker, the requester, and the provider at runtime. In particular, new advances in data mining and knowledge discovery, which allow the extraction of hidden knowledge from enormous amounts of data, impose new threats on the seamless integration of information. We consider the problem of building privacy-preserving algorithms for one category of data mining techniques, association rule mining. Suppose Alice owns a k-anonymous database and needs to determine whether her database, when inserted with a tuple owned by Bob, is still k-anonymous. Also suppose that access to the database is strictly controlled, because, for example, the data are used for experiments that must remain confidential. Clearly, allowing Alice to directly read the contents of the tuple breaks the privacy of Bob (e.g., a patient's medical record); on the other hand, the confidentiality of the database managed by Alice is violated once Bob has access to its contents. Thus, the problem is to check whether the database inserted with the tuple is still k-anonymous, without letting Alice and Bob know the contents of the tuple and the database, respectively. In this paper, we propose two protocols solving this problem on suppression-based and generalization-based k-anonymous and confidential databases. The protocols rely on well-known cryptographic assumptions, and we provide theoretical analyses to prove their soundness and experimental results to illustrate their efficiency. We have presented two secure protocols for privately checking whether a k-anonymous database retains its anonymity once a new tuple is inserted into it. Since the proposed protocols ensure that the updated database remains k-anonymous, the results returned from a user's (or a medical researcher's) query are also k-anonymous. Thus, the privacy of the patient or the data provider cannot be violated by any query. As long as the database is updated properly using the proposed protocols, user queries under our application domain are always privacy-preserving.
Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
Kato Mivule, "Utilizing Noise Addition for Data Privacy, an Overview", Proceedings of the International Conference on Information and Knowledge Engineering (IKE 2012), Pages 65-71, Las Vegas, NV, USA.
Towards A Differential Privacy and Utility Preserving Machine Learning Classi... (Kato Mivule)
Kato Mivule, Claude Turner, Soo-Yeon Ji, "Towards A Differential Privacy and Utility Preserving Machine Learning Classifier", Procedia Computer Science (Complex Adaptive Systems), 2012, Pages 176-181, Washington DC, USA.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Accessing secured data in cloud computing environment (IJNSA Journal)
The number of businesses using cloud computing has increased dramatically over the last few years due to attractive features such as scalability, flexibility, fast start-up, and low costs. Services provided over the web range from using the provider's software and hardware to managing security and other issues. Some of the biggest challenges at this point are providing privacy and data security to subscribers of public cloud servers. The efficient encryption technique presented in this paper can be used for secure access to and storage of data on a public cloud server, and for moving and searching encrypted data through communication channels while protecting data confidentiality. The method ensures data protection against both external and internal intruders. Data can be decrypted only with the key provided by the data owner, while the public cloud server is unable to read encrypted data or queries. Answering a query does not depend on its size and is done in constant time. Data access is managed by the data owner, and the proposed scheme also allows detection of unauthorized modifications.
AN EFFICIENT SOLUTION FOR PRIVACY-PRESERVING, SECURE REMOTE ACCESS TO SENSITIV... (cscpconf)
Sharing data that contains personally identifiable or sensitive information, such as medical
records, always has privacy and security implications. The issues can become rather complex
when the methods of access can vary, and accurate individual data needs to be provided whilst
mass data release for specific purposes (for example for medical research) also has to be
catered for. Although various solutions have been proposed to address the different aspects
individually, a comprehensive approach is highly desirable. This paper presents a solution for
maintaining the privacy of data released en masse in a controlled manner, and for providing
secure access to the original data for authorized users. The results show that the solution is provably secure and maintains privacy in a more efficient manner than previous solutions
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy (Kato Mivule)
Genomic data provides clinical researchers with vast opportunities to study various patient ailments. Yet the same data contains revealing information, some of which a patient might want to remain concealed. The question then arises: how can an entity transact in full DNA data while concealing certain sensitive pieces of information in the genome sequence, yet maintain DNA data utility? In response to this question, we propose a codon frequency obfuscation heuristic in which codon frequency values of highly expressed genes are redistributed within the same amino acid group, generating an obfuscated DNA sequence. Our preliminary results show that it might be possible to publish an obfuscated DNA sequence with a desired level of similarity (utility) to the original DNA sequence. http://arxiv.org/abs/1405.5410
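A simplified sketch of the redistribution idea (here, averaging frequencies within one synonymous codon group; this averaging rule is a stand-in for illustration, not the paper's actual heuristic):

```python
def obfuscate_codon_freqs(freqs, synonym_groups):
    """Redistribute codon usage within each synonymous-codon group
    (codons encoding the same amino acid) by averaging. The encoded
    protein is unchanged, but the exact codon-usage bias, which can
    act as an identifying signature, is masked."""
    out = dict(freqs)
    for group in synonym_groups:
        mean = sum(freqs[c] for c in group) / len(group)
        for c in group:
            out[c] = mean
    return out

# Toy usage frequencies for three leucine codons (one synonymous group)
freqs = {"CTG": 0.40, "CTC": 0.20, "CTT": 0.10}
print(obfuscate_codon_freqs(freqs, [["CTG", "CTC", "CTT"]]))
```

Because only frequencies within an amino acid group are moved, the total frequency per amino acid is preserved, which is one way to retain utility while reducing identifiability.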
A Review on Privacy Preservation in Data Mining (ijujournal)
The main focus of privacy-preserving data publishing is to enhance traditional data mining techniques by masking sensitive information through data modification. The major issues are how to modify the data and how to recover the data mining results from the altered data. The reported solutions are often tightly coupled with the data mining algorithms under consideration. Privacy-preserving data publishing focuses on techniques for publishing data, not techniques for data mining; in this case, it is expected that standard data mining techniques will be applied to the published data. Anonymization of the data is done by hiding the identity of record owners, whereas privacy-preserving data mining seeks to directly conceal the sensitive data. This survey reviews the various privacy preservation techniques and algorithms.
A Comparative Study on Privacy Preserving Datamining Techniques (IJMER)
Privacy protection has become very important in recent years because of the increasing ability to store data. In particular, recent advances in the data mining field have led to increased concerns about privacy. Data in its original form typically contains sensitive information about individuals, and publishing such data will violate individual privacy. Current practice in data publishing is based on what type of data can be released and how that data will be used. Recently, PPDM has received immense attention in research communities, and many approaches have been proposed for different data publishing scenarios. In this comparative study we systematically summarize and evaluate different approaches to PPDM, study the challenges, differences, and requirements that distinguish PPDM from other related problems, and propose future research directions.
Use of network forensic mechanisms to formulate network security (IJMIT JOURNAL)
Network forensics is a fairly new area of research used after an intrusion in organizations ranging from small and mid-size private companies and government corporations to the defence secretariat of a country. At the point of an investigation, valuable information may be mishandled, leading to difficulties in the examination and wasted time. Additionally, the intruder could obliterate tracks such as the intrusion entry point, the vulnerabilities used in an entry, the destruction caused, and, most importantly, the identity of the intruder. The aim of this research was to map the correlation between network security and network forensic mechanisms. Three sub research questions were studied, identifying network security issues, network forensic investigations used in an incident, and the use of network forensic mechanisms to eliminate network security issues. A literature review was the research strategy used to study these questions; literature such as research papers published in journals, PhD theses, ISO standards, and other official research papers was evaluated as the basis of this research. The output of this research is a report on how network forensics has assisted in aligning network security in case of an intrusion. This research is not specific to one organization but gives a general overview of the industry. Embedding the Digital Forensics Framework, the Network Forensic Development Life Cycle, and the Enhanced Network Forensic Cycle could be used to develop a secure network. Through the mentioned framework and cycles, the author recommends implementing the 4R Strategy (Resistance, Recognition, Recovery, Redress) with the assistance of a number of tools. This research would be of interest to network administrators, network managers, network security personnel, and others interested in securing communication devices and infrastructure. It provides a framework that can be used in an organization to eliminate digital anomalies through network forensics, helps the above-mentioned people prepare infrastructure readiness for threats, and enables further research in the fields of computer, database, mobile, video, and audio forensics.
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma... (IJECEIAES)
Leakage and misuse of sensitive data is a challenging problem for enterprises, and it has become more serious with the advent of cloud and big data. The rationale behind this is the increase in outsourcing of data to the public cloud and publishing data for wider visibility. Therefore Privacy Preserving Data Publishing (PPDP), Privacy Preserving Data Mining (PPDM), and Privacy Preserving Distributed Data Mining (PPDDM) are crucial in the contemporary era. PPDP and PPDM can protect privacy at the data and process levels, respectively. With big data, privacy protection has become indispensable because data is stored and processed in a semi-trusted environment. In this paper we propose a comprehensive methodology for effective sanitization of data based on a misusability measure, to preserve privacy and prevent data leakage and misuse. We follow a hybrid approach that caters to the needs of privacy-preserving MapReduce programming. We propose an algorithm known as the Misusability Measure-Based Privacy Preserving Algorithm (MMPP), which considers the level of misusability before choosing and applying appropriate sanitization to big data. Our empirical study with Amazon EC2 and EMR reveals that the proposed methodology is useful for realizing privacy-preserving MapReduce programming.
A Frame Work for Ontological Privacy Preserved Mining (IJNSA Journal)
Data mining analyzes stored data and helps in forecasting future trends. There are different techniques by which data can be mined, and these techniques reveal different types of hidden knowledge; using the right technique, specific patterns emerge. Ontology is a specification of a conceptualization: a description of the concepts and relationships that can exist for an agent or a community of agents. To make software more user-friendly, ontology can be used to explain both technical and domain details. In the process of analyzing data, certain important details must not be revealed; therefore security is among the most important features dealt with in all technologies and workplaces. Data mining and ontology techniques, when integrated, would yield an efficient system capable of selecting the appropriate algorithm for a data mining technique, and privacy-preserving techniques as well, by exploring the domain knowledge using ontology.
In the past decade, major technical advances have appeared that bring more comfort not only to the corporate sector but also to everyday personal life. The growth and deployment of cloud computing technologies by both private and public sectors has been significant, and recently it became apparent to many organizations and businesses that their workloads should be moved to the cloud. However, protection of cloud providers' Internet-facing services remains a major problem, leaving them vulnerable to numerous attacks; although cloud storage protection mechanisms have been introduced in recent years, cloud protection remains a major concern. This survey paper tackles this problem by reviewing recent technology that enables confidentiality-conscious outsourcing of data to public cloud storage and analysis of sensitive data. Specifically, we explore outsourced data strategies based on data splitting, anonymization, and cryptographic methods. We then compare these approaches by the operations they support, accuracy, overheads, masking of outsourced data, and data processing implications. Finally, we identify promising solutions to these cloud security issues.
EFFICIENT ATTACK DETECTION IN IOT DEVICES USING FEATURE ENGINEERING-LESS MACH... (ijcsit)
Through the generalization of deep learning, the research community has addressed critical challenges in the network security domain, like malware identification and anomaly detection. However, it has yet to discuss deploying them on Internet of Things (IoT) devices for day-to-day operations. IoT devices are often limited in memory and processing power, rendering the compute-intensive deep learning environment unusable. This research proposes a way to overcome this barrier by bypassing feature engineering in the deep learning pipeline and using raw packet data as input. We introduce a feature-engineering-less machine learning (ML) process to perform malware detection on IoT devices. Our proposed model, "Feature-engineering-less ML (FEL-ML)," is a lighter-weight detection algorithm that expends no extra computation on "engineered" features. It effectively accelerates the low-powered IoT edge. It is trained on unprocessed byte-streams of packets. Aside from providing better results, it is quicker than traditional feature-based methods. FEL-ML facilitates resource-sensitive network traffic security with the added benefit of eliminating the significant investment by subject matter experts in feature engineering.
SECURITY AND PRIVACY AWARE PROGRAMMING MODEL FOR IOT APPLICATIONS IN CLOUD EN... (ijccsa)
The introduction of Internet of Things (IoT) applications into daily life has raised serious privacy concerns among consumers, network service providers, device manufacturers, and other parties involved. This paper gives a high-level overview of the three phases of data collection, transmission, and storage in IoT systems, as well as current privacy-preserving technologies. The following elements were investigated across these three phases: (1) physical and data link layer security mechanisms, (2) network remedies, and (3) techniques for distributing and storing data. Real-world systems frequently span multiple phases and incorporate a variety of methods to guarantee privacy; therefore, for IoT research, design, development, and operation, a thorough understanding of all phases and their technologies is beneficial. This study introduces two independent methodologies, generic differential privacy (GenDP) and cluster-based differential privacy (Cluster-based DP), for handling metadata as intents and intent scope to maintain the privacy and security of IoT data in cloud environments. With their help, we can virtualize and connect enormous numbers of devices, gain a clearer understanding of the IoT architecture, and store data persistently. However, due to the dynamic nature of the environment, the diversity of devices, the ad hoc requirements of multiple stakeholders, and hardware or network failures, creating security-, privacy-, safety-, and quality-aware Internet of Things apps is very challenging, and it is becoming more and more important to improve data privacy and security through appropriate data acquisition. The proposed approach resulted in lower loss than Support Vector Machine (SVM) and Random Forest (RF) baselines.
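Neither GenDP nor Cluster-based DP is specified in the abstract, but the differential-privacy primitive they build on can be sketched: a counting query released with Laplace noise whose scale is calibrated to the query's sensitivity (the data, function names, and epsilon values below are illustrative assumptions):

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample from Laplace(0, scale) via the inverse-CDF method."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_count(values, predicate, epsilon=1.0):
    """Release a count under epsilon-differential privacy.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

# Illustrative IoT sensor readings; count how many exceed 30
readings = [21.5, 35.0, 19.2, 40.1, 22.8]
print(dp_count(readings, lambda t: t > 30, epsilon=0.5))
```

Smaller epsilon means more noise and stronger privacy; the cluster-based variant in the paper presumably applies such noise per cluster of metadata rather than globally.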
Many data mining and knowledge discovery methodologies and process models have been developed, with varying degrees of success. There are three main methods used to discover patterns in data: KDD, SEMMA, and CRISP-DM. They are presented in many publications in the area and are used in practice. To our knowledge, there is no clear methodology developed to support link mining. However, there is a well-known methodology in knowledge discovery in databases, the Cross Industry Standard Process for Data Mining (CRISP-DM), developed by a consortium of several industrial companies, which is relevant to the study of link mining. In this study, CRISP-DM has been adapted to the field of link mining to detect anomalies. An important goal in link mining is the task of inferring links that are not yet known in a given network. The approach is implemented through a case study of real-world data (co-citation data). The case study uses mutual information to interpret the semantics of anomalies identified in the co-citation dataset, which can provide valuable insights for determining the nature of a given link and potentially identifying important future link relationships.
Similar to A Study of Usability-aware Network Trace Anonymization (20)
Implementation of Data Privacy and Security in an Online Student Health Recor... (Kato Mivule)
Kato Mivule, Stephen Otunba, Tattwamasi Tripathy, and Sharad Sharma, "Implementation of Data Privacy and Security in an Online Student Health Records System", Proceedings at the ISCA 21st Int Conf on Software Engineering and Data Engineering (SEDE-2012), Pages 143-148, Los Angeles, CA, USA
Kato Mivule - Towards Agent-based Data Privacy Engineering
Towards Agent-based Data Privacy Engineering - Given any original data set X, a set of data privacy engineering phases should be followed from start to completion in the generation of a privatized data set Y. Could we have agents that autonomously implement privacy?
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge (Kato Mivule)
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge By Kato Mivule for the Degree of D.Sc. in Computer Science - Bowie State University
Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms
Lit Review Talk by Kato Mivule: A Review of Genetic Algorithms and Paper Review: C. H. Ooi and P. Tan, “Genetic algorithms applied to multi-class prediction for the analysis of gene expression data,” Bioinformatics, vol. 19, no. 1, pp. 37–44, 2003.
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge (Kato Mivule)
Dissertation Defense: "An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge" by Kato Mivule, Bowie State University, April 17, 2014.
Lit Review Talk - Signal Processing and Machine Learning with Differential Pr... (Kato Mivule)
Literature Review – Talk, By Kato Mivule, COSC891 Fall 2013, Computer Science Department, Bowie State University
"Signal Processing and Machine Learning with Differential Privacy Algorithms and challenges for continuous data" Sarwate and Chaudhuri (2013)
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
A Study of Usability-aware Network Trace
Anonymization
Kato Mivule
Los Alamos National Laboratory
Los Alamos, New Mexico, USA
kmivue@gmail.com
Blake Anderson
Los Alamos National Laboratory
Los Alamos, New Mexico, USA
banderson@lanl.gov
Abstract— The publication and sharing of network trace data is critical to the advancement of collaborative research among entities in government, the private sector, and academia. However, due to the sensitive and confidential nature of the data involved, entities have to employ various anonymization techniques to meet legal requirements in compliance with confidentiality policies. Nevertheless, the very composition of network trace data makes it a challenge to apply anonymization techniques; a basic application of microdata anonymization techniques to network traces is problematic and does not deliver the necessary data usability. Therefore, as a contribution, we point out some of the ongoing challenges in network trace anonymization. We then suggest usability-aware anonymization heuristics that employ microdata privacy techniques while giving consideration to the usability of the anonymized data. Our preliminary results show that, with trade-offs, it might be possible to generate anonymized network traces with enhanced usability on a case-by-case basis using microdata anonymization techniques.
Keywords—Network Trace Anonymization; Usability;
Differential Privacy; K-anonymity; Generalization
I. INTRODUCTION
While a number of network trace anonymization techniques have been presented in the literature, data utility remains problematic due to the unique usability requirements of the different consumers of the privatized network traces. Still, a number of microdata privacy techniques from the statistical and computational sciences are difficult to apply when anonymizing network traces, owing to the low usability of the results.
Moreover, finding the right proportionality between
anonymization and data utility of network trace data is
intractable and requires trade-offs on a case-by-case basis,
after a careful consideration of privacy needs stipulated by
policy makers, and likewise the usability requirements of the
researchers, who in this case, are the consumers of the
anonymized data. Furthermore, a generalized approach fails to
deliver unique solutions, as each entity will have unique data
privacy requirements. In this study, we take a look at the
structure of the network trace data. We vertically partition the
network trace data into different attributes and apply micro-
data privatization techniques separately for each attribute. We
then suggest usability-aware anonymization heuristics for the
anonymization process. While a number of anonymization
attacks have been presented in the literature, the main goal of this study was the generation of anonymized network traces with better data usability. Therefore, the focus of the suggested heuristics and preliminary results is the generation of anonymized, usability-aware network trace data, using privacy
techniques covered in the statistical disclosure control domain;
that include the following: Generalization, Noise addition and
Multiplicative noise perturbation, Differential Privacy, and
Data swapping [38]. A measure of usability by quantifying
descriptive and inference statistics of the anonymized data in
comparison with that of the original data is also presented.
Furthermore, we apply frequency distribution analysis and
unsupervised learning techniques in the measure of usability
for the unlabeled data. The rest of the paper is organized as
follows: In Section II, we present a review of related work,
and definition of important terms pertaining to this paper. In
Section III, we present methodologies and usability-aware
anonymization heuristics. In Section IV, the experiment and
results are given. Finally, in Section V, the conclusion,
recommendations, and future works are presented.
II. RELATED WORK
One of the challenges of anonymizing network traces, is
how to keep the structure and flow of the data intact so as to
provide usability to the consumer of the anonymized data. In
such efforts, Maltz et al. (2004) demonstrated that network
trace data could be anonymized while preserving the structure
of the original data [1]. Additionally, Maltz et al. (2004)
observed and noted that some of the challenges in
anonymizing network traces included figuring out attributes in
the network trace that could leak sensitive information, and
how to anonymize the data such that the original
configurations are preserved [1]. Observations by Maltz et al.
are still relevant today, especially when considering the
intractability between privacy and usability. On the other
hand, Slagell, Wang, and Yurcik (2004) proposed Crypto-Pan,
a network trace anonymization tool that employs
cryptographic techniques in the privatization of network trace
data [2]. While anonymization using cryptographic means
might be effective in concealing sensitive data, usability of the
anonymized data is always a challenge. Bishop, Crawford,
Bhumiratana, Clark, and Levitt (2006), observed that one of
the problems in the anonymization of network traces, is that
when handling IP addresses, the set of available addresses is
finite, thus setting a limit to any anonymization prospects [3].
Each octet in the IP address would handle a range of 0 to 255.
For instance, it would not make much sense to have an
anonymized IP address with an octet value of 345. This
limitation makes the data vulnerable to de-anonymization
attacks. On the issue of de-anonymization attacks, Coull,
Wright, Monrose, Collins, and Reiter (2007) presented
inference techniques for de-anonymizing and detecting
network topologies in anonymized network trace data [4].
Coull et al. showed that topological data could be deduced as an artifact of functional network packet traces, since data on the activity of hosts can be leveraged by an adversary to defeat obfuscation of the network traces [4]. Moreover,
Coull et al., pointed out that obfuscating network trace data is
not a trivial task as publishers of the data need to be aware of
the tension between balancing privacy and data utility needs
for anonymized network traces [4]. Additionally, Ribeiro,
Chen, Miklau, and Towsley (2008), showed that systematic
attacks on prefix-preserving anonymized network traces,
could be done by adversaries using modest amount of publicly
available information about a network and employing attack
techniques such as finger printing [5]. However, Ribeiro et al.
anticipated that their proposed attack methodologies would be
employed in evaluating worst-case vulnerabilities and finding
trade-offs between privacy and utility in prefix-preserving
privatization of network traces [5]. Therefore, while
researchers might have an interest in anonymized data sets
that maintain the structure and flow of the original data,
curators of that data have to contend with the fact that such
prefix-preserving anonymization is subject to de-
anonymization attacks.
A comprehensive reference model was presented by Gattani and Daniels (2008), in which they outlined how entities should formulate the problem of anonymizing network traces [6].
Gattani and Daniels (2008) noted that the anonymization
procedure always aims at the following three goals [6]: (i)
defending the confidentiality of users, (ii) obfuscating the
inner structure of a network, and (iii) generating anonymized
network traces with acceptable levels of usability [6].
However, Gattani and Daniels (2008) observed that attaining
those three anonymization goals is problematic, as removing
too much sensitive information from a network data trace only
reduces the usability of the anonymized network traces [6].
Additionally, Gattani and Daniels (2008) categorized attacks on anonymized data as (i) active data injection
attacks, (ii) known mapping attacks, (iii) network topology
inference attacks, and (iv) cryptographic attacks [6]. On the
categorization of attacks, King, Lakkaraju, and Slagell (2009)
presented a taxonomy of attacks on anonymization techniques
with the aim of helping curators of the privatization process
negotiate trade-offs between data utility and anonymization
[7]. King et al., classified attacks on anonymization methods
as (i) fingerprinting, (ii) structure recognition, (iii) known
mapping, (iv) data injection, and (v) cryptographic attacks [7].
A combined categorization of attacks on anonymization
techniques, from Gattani and Daniels, and King et al., would
then be listed as follows [7] [6]: (i) Fingerprinting attacks: in
this category of attacks, attributes of anonymized data are
compared with traits of known network structures to uncover a
relationship between the anonymized and non-anonymized
data. (ii) Data injection attacks: in this type of exploit, an
attacker injects pseudo-traffic data in a network trace before
the anonymization process and uses the pseudo-traffic traces to
de-anonymize the network traces and network structure. (iii)
Structure recognition attacks: in this type of exploit, an
attacker seeks to determine the structure between objects in
the anonymized data to discover multiple relations between
anonymized and non-anonymized data. (iv) Network topology
inference: similar to known mapping attacks, this category of
exploits seeks to retrieve the network topology map by de-
anonymizing the nodes that make up the vertices of the
network, the edges between the nodes that represent the
connectivity and the routers. (v) Known mapping attacks: in
this category of exploit, the attacker relies on external data
(auxiliary data) to find a mapping between the anonymized
network trace data and the original network trace data in order
to retrieve the original IP addresses. (vi) Cryptographic
attacks: in this category of attacks, exploits are carried out to
break cryptographic algorithms used to encrypt the network
traces.
A comparative analysis was done by Coull, Monrose, Reiter,
and Bailey (2009) in which they pointed out the similarities
and differences between network data anonymization and
microdata privatization techniques, and how microdata
obfuscation methods could be applied to anonymize network
traces [8]. Coull, et al. observed that uncertainties did exist
about the effectiveness of network data anonymization, from
both methodological and policy view, with the research
community in need for more study to understand the
implications of publishing anonymized network data and the
utility of such data to researchers [8]. Furthermore, Coull, et
al. suggested that the extensive work that exists in the
statistical disclosure control discipline could be employed by
the network research community towards the privatization of
network flow data [8]. On network trace packet
anonymization, Foukarakis, Antoniades, and Polychronakis
(2009), proposed the anonymization of network traces at the
packet level – in the payload of a packet, due to inadequacies
found in various network trace anonymization techniques [9].
Foukarakis et al., suggested identifying revealing information
contained in the shell-code of code injection attacks, and
anonymizing such packets to grant confidentiality in published
network attack traces [9]. However, on the subject of IP-flow
intrusion detection methods, Sperotto et al. (2010) presented
an overview of IP-flow intrusion detection approach and
highlighted the classification of attacks, and defense methods
and how flow-based method can be used to discover scans,
worms, botnets and denial of service (DoS) attacks [10].
Furthermore Sperotto et al. highlighted two types of sampling;
packet sampling whereby a packet is deterministically chosen
based on a time interval for analysis; and flow sampling in
which a sample flow is chosen for analysis [10]. At the same
time, Burkhart et al. (2010), in their review of anonymization
techniques, showed that current anonymization techniques are
vulnerable to a series of injection attacks, by inserting attacker
packets into the network flow prior to anonymization, then
later retrieving the packets, thus revealing vulnerabilities and
patterns in the anonymized data [11]. As a mitigation to
injection attacks, Burkhart et al. suggested that anonymization
of network flow data should be done as part of a
comprehensive approach including both legal and technical
perspectives on data confidentiality [11].
Meanwhile, McSherry and Mahajan (2011) showed that
differential privacy could be employed to anonymize network
trace data. Yet despite privacy guarantees provided by
differential privacy, the usability of the privatized data
remains a challenge due to excessive noise from the
anonymization [12]. However, McSherry and Mahajan
(2011), in their study of applying differential privacy on
network trace data, acknowledged the challenges of balancing
usability and privacy, despite the confidentiality assurances
accorded by differential privacy [13]. On real time interactive
anonymization, Paul, Valgenti, and Kim (2011) proposed the
Real-time Netshuffle anonymization technique whereby
distortion is done to a complete graph to prevent inference
attacks in network traffic [14]. Netshuffle works by employing
k-anonymity methodology on network traces, by ensuring that
each trace record appears at least k times, with k > 1, and then shuffling is applied to the k-anonymized records, making it difficult for an attacker to decipher due to the distortion [14]. A network trace
obfuscation methodology, (k, j)-obfuscation, was proposed by
Riboni, Villani, Vitali, Bettini, and Mancini (2012), in which a
network flow is considered obfuscated if it cannot be linked
with greater assurance, to its source and destination IP
addresses [15]. Riboni, et al. observed from their
implementation of (k, j)-obfuscation, that the large set of
network flows maintained the utility of the original network
trace [15]. However, the context of data utility remains
challenging as each consumer of privatized data will have
unique usability requirements, different levels of needed
assurance, and therefore, utility becomes constrained to a
case-by-case basis, depending on an entity's privacy and
usability needs. On the issue of preserving IP consistency in
anonymized data, Qardaji and Li (2012) observed that full
prefix-preserving IP anonymization suffers from a number of
attacks yet from a usability perspective, some level of
consistency is required in anonymized IP addresses [16]. To
mitigate this problem, Qardaji and Li (2012) proposed
maintaining pseudonym consistency by dividing flow data
into buckets based on temporal closeness and separately
privatize flows in each bucket, thus maintaining consistency
only in each bucket but not globally across all buckets [16].
Mendonca, Seetharaman, and Obraczka (2012) proposed
AnonyFlow, an interactive anonymization technique that
provides end point privacy by preventing the tracking of
source behavior and location in network data [17]. However,
Mendonca et al. acknowledged that AnonyFlow does not
address issues of complete anonymity, data security,
steganography, and network trace anonymization in non-
interactive settings [17].
On generating synthetic network traces, Jeon, Yun, and Kim
(2013), proposed an anomaly-based intrusion detection system
(A-IDS) to generate pseudo-network traffic for the
obfuscation of real sensitive network traffic in supervisory
control and data acquisition (SCADA) systems [18]. An
overview of network data anonymization was presented by
Nassar, al Bouna, Malluhi (2013), in which the need to
address the problem of finding appropriate anonymization
algorithms that grant privacy but with an optimal risk-utility
trade-off, was highlighted [19]. On using entropy and
similarity distance measures, Xiaoyun, Yujie, Xiaosheng,
Xiaohong, and Yan (2013) employed similarity distance and
entropy techniques in the quantification of anonymized
network trace data [20]. Xiaoyun et al. proposed two types of
similarity measures: (i) external similarity, in which the
distance measurements are done to compute the probability
that an adversary will obtain a one-to-one mapping relation
between the anonymized and the original data, based on
auxiliary knowledge; (ii) Internal similarity, in which distance
measurements are done on the privatized and the original data
to indicate how distinguishable or indistinguishable the data
sets are [20]. On the extracting, classification, and
anonymization of packet traces, Lin, Lin, Wang, Chen, and
Lai (2014), observed that capturing and sharing real network
traffic faced two challenges, first various protocols are
associated with the packet traces and secondly, such packet
traces tend not to be well classified before deep packet
anonymization [21]. Therefore, Lin et al. proposed the PCAPLib methodology for the extraction, classification, and deep packet anonymization of packet traces [21]. In their work on Session
Initiation Protocol (SIP) used in multimedia communication
sessions, Stanek, Kencl, and Kuthan (2014), pointed out that
current network trace anonymization techniques are
insufficient for SIP traces due to the data format of the SIP
trace, which includes the IP address, the SIP URI, and the e-mail address [22]. To mitigate this problem, Stanek et al.
proposed SiAnTo, an anonymization methodology that
replaces SIP information with non-descriptive but matching
labels [22]. More recently, Riboni, Villani, Vitali, Bettini, and
Mancini (2014), cautioned that current network trace
anonymization techniques are vulnerable to various attacks
while at the same time it is problematic to apply microdata
privatization methods in obfuscating network traces [23].
Moreover, Riboni et al. noted that current obfuscation
methods depend on assumptions about an adversary's intentions, which are challenging to model, and do not
guarantee privacy against background knowledge attacks [23].
Table I presents a summary of some of the network trace anonymization challenges outlined in the literature over the past ten years.
A. Network trace anonymization techniques
In the following section, a review of some of the common
network trace anonymization techniques is presented [24] [25]
[26] [27] [28] [16]: (i) Black marker technique: in this
method, sensitive values are erased or substituted with fixed
values.
TABLE I. SUMMARY OF NETWORK TRACE ANONYMIZATION CHALLENGES

Author(s) | Network Trace Anonymization Challenges
Maltz et al. (2004) | Challenge of identifying attributes to anonymize while conserving usability.
Slagell et al. (2004) | Crypto-Pan: cryptography to anonymize IP addresses; usability a challenge.
Bishop et al. (2006) | Anonymization of IP addresses problematic; the set of IP addresses is finite.
Coull et al. (2007) | Obfuscation not a trivial task due to the tension between privacy and usability.
Ribeiro et al. (2008) | Prefix-preserving anonymized data subject to fingerprinting attacks.
King et al. (2009) | Taxonomy of attacks on anonymization techniques; anonymization challenges.
Coull et al. (2009) | Comparison between network and microdata anonymization; significant differences.
Foukarakis et al. (2009) | Network trace anonymization at the packet level; a challenge.
Burkhart et al. (2010) | Injection attacks on anonymized network trace data.
McSherry and Mahajan (2011) | Differential privacy anonymization of network trace data.
Paul, Valgenti, and Kim (2011) | Real-time anonymization with k-anonymity.
Riboni et al. (2012) | (k, j)-obfuscation: a network flow is obfuscated if it cannot be linked to the original data with high assurance.
Qardaji and Li (2012) | Global prefix consistency is subject to attacks.
Mendonca et al. (2012) | Interactive network trace anonymization.
Jeon, Yun, and Kim (2013) | Synthetic (anonymized) network trace data generation.
Nassar et al. (2013) | Balance between utility and privacy needed; still a problem.
Farah and Trajkovic (2013) | Network trace anonymization techniques: an overview.
Stanek et al. (2014) | Proposed Session Initiation Protocol (SIP) anonymization and challenges.
Riboni et al. (2014) | Caution with current network anonymization techniques; vulnerable to attacks.
(ii) Enumeration technique: in this scheme, sensitive values in
a sequence are replaced with an ordered sequence of synthetic
values. (iii) Hash technique: unique values are substituted
with a fixed size bit string in the hash technique. (iv)
Partitioning technique: with the partitioning method,
revealing values are partitioned into a subset of values and
each of the values in the subset is replaced with a generalized
value. For example, an IP address 141.121.10.12, could be
partitioned into four octets and the last two octets replaced
with zero values, 141.121.0.0. (v) Precision degradation
technique: highly specific values of a time-stamp attribute are
removed when employing the precision degradation method.
(vi) Permutation technique: A random permutation is done to
link non-anonymized IP and MAC addresses to a set of
available addresses. (vii) Prefix-preserving anonymization
technique: in this technique, values of an IP address are
replaced with synthetic values in such a way that the original
structure of the IP address is kept – the prefix values of an IP
address structure is preserved. Prefix-preservation could be
applied fully or partially on the IP address. The fully prefix-
preserving anonymization will map the full structure of the
original IP address in the anonymized data, while the partially
prefix-preserving anonymization will preserve a select
structure of the original IP address, for example the first two
octets. (viii) Random time shift technique: this methodology
works by applying a random value as an offset to each value
in the field. (ix) Truncation technique: with this technique, part of the IP or MAC address is suppressed or deleted, and the remaining part of the address is left intact. (x) Time unit
annihilation: In this partitioning anonymization methodology,
part of the time-stamp is deleted and replaced with zeros. Table I summarizes these ongoing challenges from the literature on anonymizing network traces. Although a number of network trace anonymization solutions have been proposed in the literature, usability of the anonymized data remains problematic. Among these challenges, this study focuses on the usability-aware anonymization of network traces.
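To make the techniques above concrete, the following is a minimal Python sketch of three of them (black marker, partitioning, and random time shift). The function names and parameters are illustrative assumptions, not tools from this study:

```python
import random

def black_marker(value, mask="xxx.xxx.xxx.xxx"):
    """Black marker: erase the sensitive value, substituting a fixed string."""
    return mask

def partition_ip(ip, keep_octets=2):
    """Partitioning: keep the leading octets and generalize the rest to zero,
    e.g. 141.121.10.12 -> 141.121.0.0 when keep_octets=2."""
    octets = ip.split(".")
    return ".".join(octets[:keep_octets] + ["0"] * (4 - keep_octets))

def random_time_shift(timestamps, max_offset=3600, seed=None):
    """Random time shift: add one random offset (in seconds) to every value
    in the field, hiding absolute times but preserving relative gaps."""
    offset = random.Random(seed).randint(1, max_offset)
    return [t + offset for t in timestamps]

print(partition_ip("141.121.10.12"))  # -> 141.121.0.0
```

Truncation follows the same pattern as partitioning but deletes rather than generalizes octets; prefix-preserving schemes additionally require that equal original prefixes map to equal anonymized prefixes.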
B. Statistical disclosure control techniques
The following are some of the main microdata privatization
methods used: Suppression: in this technique, revealing and
sensitive data values are deleted from a data set at the cell
level [29]. Generalization: to achieve confidentiality for
revealing values in an attribute, a single value is allocated to a
group of sensitive values in the attribute [30]. K-anonymity: in
this method, data privacy is enforced by requiring that all
values in the quasi-attributes be repeated k times, such that k
>1, thus providing confidentiality and making it harder to
uniquely distinguish individual values. K-anonymity employs
both generalization and suppression methods to achieve k >1
[31]. Data swapping: Data swapping is a data privacy
technique that involves exchanging sensitive cell values with
other cell values in the same attribute while keeping intact the
frequencies and statistical traits of the original data, and as
such, making it difficult for an attacker to map the privatized
values to the original record [32]. Noise addition: noise
addition is a data privacy method that adds random values
(noise) to revealing and sensitive numerical values in the original data, to ensure confidentiality. The random values are usually drawn based on the mean and standard deviation of the original values [33]:
X_i + ε_i = Z_i (1)
Multiplicative noise: similar to noise addition, random values generated based on the mean and variance of the original data values are multiplied with the original data, generating a privatized data set [34].
X_i × ε_i = Z_i (2)
Where X = original data, Z = privatized data, and ε = the
random values. Differential Privacy: Similar to noise addition,
differential privacy imposes privacy by adding Laplace noise
to query results from the database such that it cannot be
distinguished if a particular value has been adjusted in that
database or not; making it more difficult for an attacker to
decode items in the database [35]. ε-differential privacy is satisfied if the results of a query run on databases D1 and D2 are probabilistically similar, meeting the following condition [35]:
P[q_n(D_1) ∈ R] / P[q_n(D_2) ∈ R] ≤ e^ε (3)
Where D1 and D2 are the two databases; P is the probability of
the perturbed query results D1 and D2; qn() is the privacy
granting procedure (perturbation); qn(D1) is the privacy
granting procedure on query results from database D1; qn(D2)
is the privacy granting procedure on query results from
database D2; R is the perturbed query results from the
databases D1 and D2, respectively; and e^ε is the exponential of the privacy parameter epsilon. Differential privacy can be implemented as
follows [36]:
(i) Run the query on the database, where f(x) is the query function.
(ii) Calculate the most influential observation (the sensitivity):
Δf = max |f(D_1) − f(D_2)| (4)
(iii) Calculate the Laplace noise scale:
b = Δf / ε (5)
(iv) Add Laplace noise to the query results:
DP = f(x) + Laplace(0, b) (6)
(v) Publish perturbed query results in interactive (query
responses) or non-interactive (macro, micro data) mode.
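The five steps can be sketched as follows for a simple count query. The flow records and helper names are assumptions for illustration, and the Laplace sampling is done as a difference of two exponentials rather than via a library routine:

```python
import math
import random

def laplace_sample(b, rng=random):
    """Draw one sample from Laplace(0, b): the difference of two independent
    Exp(1) variates is Laplace(0, 1), which we scale by b."""
    e1 = -math.log(1.0 - rng.random())  # Exp(1); 1 - random() lies in (0, 1]
    e2 = -math.log(1.0 - rng.random())
    return b * (e1 - e2)

def dp_count(rows, predicate, epsilon):
    """Differentially private count following steps (i)-(iv)."""
    true_count = sum(1 for r in rows if predicate(r))  # (i) run the query f(x)
    delta_f = 1.0          # (ii) a count changes by at most 1 per record
    b = delta_f / epsilon  # (iii) Laplace noise scale b = delta_f / epsilon
    return true_count + laplace_sample(b)  # (iv) add Laplace(0, b) noise

# (v) publish the perturbed result, e.g. a noisy count of HTTPS flows
flows = [{"dst_port": 443}, {"dst_port": 80}, {"dst_port": 443}]
print(dp_count(flows, lambda r: r["dst_port"] == 443, epsilon=0.5))
```

Averaged over many runs the noisy answer centers on the true count, while any single published answer hides whether one particular record was present.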
C. Metrics used to quantify usability in this study
The Shannon entropy: entropy is used essentially to measure
the amount of randomness and uncertainty in a data set; if all
values in a set of information fall into one category, then
entropy in such cases is at zero. Probability is used to quantify
randomness of elements in an information set; normalized
entropy values range from 0 to 1, getting to the upper bound
level when all probabilities are equal [37] [36]. Entropy is
formally described using the following formula [37]:
Entropy = H(p_1, p_2, ..., p_n) = Σ_{i=1}^{n} p_i · log(1/p_i) (7)
where p_i is the probability of the i-th element and H(p_1, p_2, ..., p_n) is the entropy over those probabilities.
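A minimal sketch of computing normalized Shannon entropy for a categorical trace attribute (the protocol values are illustrative assumptions):

```python
import math
from collections import Counter

def normalized_entropy(values):
    """Normalized Shannon entropy of the empirical distribution of `values`:
    H = sum_i p_i * log(1 / p_i), divided by log(#categories) so the result
    lies in [0, 1]."""
    counts = Counter(values)
    if len(counts) <= 1:
        return 0.0  # all values fall into one category -> zero entropy
    n = len(values)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))

print(normalized_entropy(["tcp", "tcp", "tcp"]))          # 0.0: no uncertainty
print(normalized_entropy(["tcp", "udp", "icmp", "gre"]))  # 1.0: all equally likely
```

Comparing the entropy of an anonymized attribute against the original gives one signal of how much randomness the anonymization added or removed.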
Correlation Metric (between original data X and privatized data Z): the correlation r_xz measures the strength and direction of an additive linear relation between two variables; it is dimensionless, independent of the units in which the data points x and z are measured, and is expressed as follows [38]:
Correlation r_xz = Cov(x, z) / (σ_x σ_z) (8)
Where Cov(x, z) is the covariance of X and Z, and σ_x and σ_z are the standard deviations of X and Z. If r_xz = -1, a negative linear relation exists between X and Z; if r_xz = 0, no linear relation exists between X and Z; when r_xz = 1, a strong positive linear relation exists between X and Z. Descriptive
Statistics Metric: Descriptive statistics (DS) such as the mean,
standard deviation, variance, etc., are used in quantifying how
much distortion there is between the anonymized and original
data. The larger the difference, the more privacy but also an
indication of less usability; the closer the difference, the more
usability but perhaps less privacy. The format used in the
quantification is always in the form [36]:
𝑈𝑠𝑎𝑏𝑖𝑙𝑖𝑡𝑦 = 𝐷𝑆(𝑍) − 𝐷𝑆(𝑋) (9)
where Z is the anonymized data, X is the original data, and
DS the descriptive statistic.
Distance Measures Metric
(Euclidean Distance): For distance measures, we employed
clustering with k-means to evaluate how well the clustering of
the original data compares with that of the anonymized data.
In this case, the Euclidean Distance is used for k-means
clustering and is expressed as follows [39]:
𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝑥, 𝑦) = √( Σᵢ₌₁ⁿ (𝑥ᵢ − 𝑦ᵢ)² ) (10)
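The Euclidean distance of equation (10) and a plain k-means built on it can be sketched as follows. This is a self-contained NumPy sketch (our own minimal implementation, not the tool used in the study), which a curator could run on both the original and anonymized data to compare cluster assignments:

```python
import numpy as np

def euclidean(x, y):
    """Equation (10): Euclidean distance between points x and y."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means using the Euclidean distance above; returns the
    cluster label of each point and the final centroids."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid
        dists = np.array([[euclidean(p, c) for c in centroids]
                          for p in points])
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated groups of (start time, end time) pairs:
pts = [[0, 1], [1, 2], [100, 101], [101, 103]]
labels, cents = kmeans(pts, k=2)
```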
The Davies-Bouldin index was also used in the evaluation of
how well the clustering performed. The Davies-Bouldin Index
(DBI) is expressed as follows [21]:
𝐷𝐵𝐼 = (1/𝑛) Σᵢ₌₁ⁿ 𝐷ᵢ (11)
Where 𝐷ᵢ ≡ 𝑚𝑎𝑥 ⱼ≠ᵢ 𝑅ᵢ,ⱼ (12)
And 𝑅ᵢ,ⱼ ≡ (𝑆ᵢ + 𝑆ⱼ) / 𝑀ᵢ,ⱼ (13)
Here Ri,j is a quantification of how good the clustering is, Si
and Sj are the within-cluster distances of clusters i and j, and
Mi,j is the distance between the two clusters.
Classification Error Metric: With the
classification error test, both the original and anonymized data
are passed through machine learning and the classification
error (or accuracy) is returned. The classification error (CE) of
the anonymized data is subtracted from that of the original.
The larger the difference, the more privacy (due to distortion);
this might be an indication of low usability. However, a
smaller difference might indicate better usability but then low
privacy, as anonymized results might be closer to the original
data in similarity. Depending on the machine-learning
algorithm used, the classification error metric will be in this
form [36]:
𝑈𝑠𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝐺𝑎𝑢𝑔𝑒 = 𝐶𝐸(𝑍) − 𝐶𝐸(𝑋) (14)
Where Z is the anonymized data, X the original data, and
CE is the classification error.
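The metrics of equations (8), (9), and (14) reduce to correlations and differences between scores computed on the original data X and the anonymized data Z. A hedged NumPy sketch follows; the 1-nearest-neighbour classifier here is only an illustrative stand-in for whichever machine-learning algorithm a curator would actually use:

```python
import numpy as np

def correlation(x, z):
    """Equation (8): Pearson correlation between X and Z."""
    return np.corrcoef(x, z)[0, 1]

def ds_usability(x, z, stat=np.mean):
    """Equation (9): Usability = DS(Z) - DS(X) for a statistic DS."""
    return stat(z) - stat(x)

def classification_error(train, train_labels, test, test_labels):
    """Error rate of a toy 1-nearest-neighbour classifier."""
    preds = [train_labels[np.argmin(np.abs(np.asarray(train) - t))]
             for t in test]
    return float(np.mean(np.asarray(preds) != np.asarray(test_labels)))

x = np.array([1.0, 2.0, 10.0, 11.0])   # original data
y = np.array([0, 0, 1, 1])             # class labels
z = x * 1.3 + 0.5                      # a perturbed (anonymized) copy

# Equation (14): Usability Gauge = CE(Z) - CE(X)
gauge = (classification_error(z, y, z, y)
         - classification_error(x, y, x, y))
```

A gauge near zero suggests the anonymized data supports the learning task about as well as the original; a large positive gauge suggests the distortion (privacy) came at the cost of usability.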
III. METHODOLOGY
In this section, we describe the implemented methodology:
the heuristics used in the anonymization of network trace
data, geared towards usability while at the same time meeting
privacy requirements. The goal of the heuristics is to provide
anonymized data, with statistical traits close to the original,
that researchers could still use. The trade-off in this case is
that we tilt towards more utility while making it harder for an
attacker to recover the original data, assuming that the
attacker has no prior knowledge. Because of the unique data
structure of network traces, a single generalized approach is
not applicable in anonymizing all the network trace attributes.
In our approach, we apply a hybrid of anonymization
heuristics for each group of related attributes.
Combinations of microdata anonymization techniques were
used in this study, as illustrated in Figure 1. The following
attributes were anonymized in the network trace data: (i) Start
and End Time (Time-stamp), (ii) Source IP and Destination
IP, (iii) Protocol, (iv) Source Port and Destination Port, (v)
Source Packet Size and Destination Packet Size, (vi) Source
Bytes and Destination Bytes, (vii) TOS Flags. However, due
to space constraints, we only present results for the Timestamp
and IP Address attributes.
Figure 1: An illustration of the proposed anonymization heuristics for the network trace data.
A. Enumeration with multiplicative perturbation
To preserve the flow structure of the timestamp, we
employed enumeration with multiplicative perturbation, a
heuristic that combines multiplicative noise addition technique
from the microdata privatization techniques and enumeration
from network trace anonymization. The Enumeration with
Multiplicative Perturbation Heuristic is implemented as
follows: Step (i): A small epsilon constant value is chosen
arbitrarily between 0 and 1; data curators could conceal this
value as an additional layer of confidentiality. Step (ii): The
small epsilon constant is then multiplied with the original data
(timestamp, both Start and End Time attributes), generating an
enumerated set. Step (iii): The generated enumerated data is
then added to the original data, producing an anonymized data
set. Step (iv): A
test for usability is done, using descriptive statistical analysis,
entropy, correlation, and unsupervised learning using
clustering (k-means). Step (v): If the desired threshold is met,
the anonymized data is published. The goal with this heuristic
is to keep the time flow structure intact and similar to the
original data while at the same time anonymizing the time
series values. In this case, the anonymized time series data
should generate similar usability results to the original.
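Steps (i)–(iii) of the heuristic amount to multiplying the timestamps by a concealed constant ε ∈ (0, 1) and adding the product back to the original, i.e. scaling by (1 + ε), which preserves ordering and relative spacing. A minimal Python sketch, with an illustrative ε value of our own choosing:

```python
import numpy as np

def enumerate_multiplicative_perturb(timestamps, epsilon=0.73):
    """Enumeration with multiplicative perturbation:
    step (ii) multiplies the concealed epsilon with the data to form
    the enumerated set; step (iii) adds it back to the original."""
    ts = np.asarray(timestamps, dtype=float)
    enumerated = epsilon * ts     # step (ii): enumerated set
    return ts + enumerated        # step (iii): equals (1 + epsilon) * ts

start_times = np.array([1123355000.0, 1123355060.0, 1123355120.0])
anonymized = enumerate_multiplicative_perturb(start_times)
# Ordering is preserved because 1 + epsilon > 0, so the time-flow
# structure of the trace stays intact while the values change.
```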
B. Generalization and differential privacy
The IP address is one of the most challenging attributes to
anonymize since each octet of the IP address is limited to a
finite set of numbers, from 0 to 255. This makes the IP address
attribute vulnerable to attackers in attempts to de-anonymize
the privatized network trace [3]. With such restrictions, the
curator of the data is left with the choice of completely
anonymizing the IP address by employing full perturbation
techniques, which in turn distorts the flow structure and prefix
of the IP address, and thus yields poor data usability. One
solution to this problem would be to employ heuristics that
would grant anonymization and at the same time keep the
prefix of the IP address intact. However, full IP address
prefix-preserving anonymization has been shown to be prone
to de-anonymization attacks, which presents yet another
challenge [5]. Therefore, to deal with this problem, we suggest a partial
prefix-preserving heuristic in which differential privacy and
generalization are used and implemented as follows: Octet 1,
anonymization: The IP address is split into four octets.
Generalization is applied to the first octet to partially preserve
the prefix of the anonymized IP address. The goal is to give
the users of the anonymized data some level of usability by
being able to get a synthetic flow of the IP address structure in
the network trace. Step (i): A small epsilon constant value is
chosen and applied to the first octet as additive or
multiplicative noise.
The goal is to preserve the flow structure in the first octet.
Step (ii): Frequency count analysis to check that none of the
first octet values from the original data re-appear in the
anonymized data is done at this stage. Step (iii): If first octet
values reappear in the anonymized data, generalization by
replacing the reappearing values with the most frequent values
in the anonymized first octet is done. Step (iv): Finally,
generalization and k-anonymity are applied to ensure no
unique values appear, and that all values in the first octet
appear at least k > 1 times. Step (v): A test for usability is done
by comparing the original and anonymized first octet values.
Octet 2, 3,
and 4 anonymization: To make it difficult to de-anonymize
the full IP address, randomization using differential privacy is
applied to the remaining three octets. However, since each
octet is limited to a set of 0 to 255 finite numbers, the
differential privacy perturbation process will generate some
values that would exceed 255; for instance, it would not make
sense to have an octet value of 350. To mitigate this
situation, a control statement is introduced at the end of the
differential privacy process, to exclude all values outside the
valid octet range. In this case, any values
greater than 255 are excluded from the end results of the
perturbation process. Differential privacy is applied to each of
the three octets vertically and separately. Step (i): A vertical
split of octet 2, 3, and 4 into separate attributes, is done. Step
(ii): Anonymization using differential privacy on each
attribute (octet) separately is done at this stage. Step (iii): Test
to ensure that anonymized values in each octet are in range,
from 0 to 255. Step (iv): If the anonymized values in an octet
exceed the 0 to 255 range then return a generalized value
using the most frequent value in that 0 to 255 range. Step (v):
Test for usability. Step (vi): Combine all octets to a full
anonymized IP address.
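The two-part IP heuristic can be sketched as below. This is a simplified Python sketch with placeholder parameter values (the noise shift, ε, and seed are our own illustrative choices) that omits the final k-anonymity pass on octet 1; the Laplace noise is drawn as a difference of two exponential variates, which follows a Laplace(0, 1/ε) distribution:

```python
import random
from collections import Counter

def anonymize_octet1(octets, shift=155):
    """Octet 1: perturb, then generalize any value that re-appears
    from the original data to the most frequent anonymized value
    (steps (i)-(iii) of the first-octet heuristic)."""
    shifted = [(o + shift) % 256 for o in octets]         # noise step
    most_frequent = Counter(shifted).most_common(1)[0][0]
    originals = set(octets)                               # frequency check
    return [most_frequent if s in originals else s for s in shifted]

def anonymize_inner_octet(octets, epsilon=0.5, seed=0):
    """Octets 2-4: Laplace noise with sensitivity 1, replacing any
    out-of-range result with the most frequent in-range value
    (steps (i)-(iv) of the inner-octet heuristic)."""
    rng = random.Random(seed)
    noisy = [round(o + rng.expovariate(epsilon) - rng.expovariate(epsilon))
             for o in octets]
    in_range = [v for v in noisy if 0 <= v <= 255]
    fallback = Counter(in_range).most_common(1)[0][0] if in_range else 0
    return [v if 0 <= v <= 255 else fallback for v in noisy]

source_ips = [(45, 12, 200, 7), (45, 12, 201, 9), (44, 13, 10, 250)]
octet1 = anonymize_octet1([ip[0] for ip in source_ips])
octet2 = anonymize_inner_octet([ip[1] for ip in source_ips])
```

Each inner octet would be anonymized vertically and separately in this fashion, and the four anonymized octets then recombined into a full IP address (step (vi)).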
IV. RESULTS
Preliminary results are presented in this section. However,
due to space limitation in this publication, only results for the
timestamp and IP address attributes are presented. Real 2014
network trace (NetFlow) data provided by Los Alamos
National Laboratory were used in this experiment. A total of
500,000 network flow records were anonymized in this study.
Microdata obfuscation techniques were applied for the
anonymization process. Each attribute of the NetFlow trace
was anonymized separately.
A. Timestamp anonymization and usability results
Descriptive statistical analysis was done on both the original
and anonymized data sets, as shown in Table II. The aim was
to study the statistical traits of both the original and
anonymized data sets and show any similarities. In this case,
the statistical traits of the anonymized data show an
augmentation of the original data – a generation of a synthetic
data set in this case. For instance, the original mean of the start
time and end time was 1123355142 and 1123355214
respectively, while that of the anonymized data set was
1944808589 and 1944808714. The difference between the
anonymized and original data was 821453447 and
821453500 respectively. A larger difference might indicate
more privacy and less usability, while a smaller difference
might indicate better usability but less privacy. The results
presented in Table II indicate a mid-way, with both privacy
and usability needs met after trade-offs (the difference).
TABLE II. STATISTICAL TRAITS OF ORIGINAL AND ANONYMIZED
TIMESTAMP DATA
However, to meet the requirements of different users for the
anonymized data, a fine-tuning of the parameters in the
anonymization heuristics would need to be done. Additionally,
the normalized Shannon's entropy results, as shown in Table
II, were similar for both original and anonymized data at
approximately 0.77 and 0.76 for the start and end times
respectively. The entropy results indicate that the distortions
and uncertainty in both data sets might be similar. While the
entropy results might be good for usability, it could likewise
be argued that privacy levels might be inadequate since the
two data sets are similar in that regard. However, the
correlation values between the anonymized and original data
were 0.532 and 0.534 for the start and end time attributes
respectively. The results could indicate that while correlations
exist between the two data sets, the significance is not that
high since the values do not approach 1.
Figure 2: K-means clustering results for the original start and end time data.
The results might indicate that privacy is maintained in the
anonymized data, with an acceptable level of usability. In
Figure 2, results from clustering the original network trace
data (timestamp attribute) are presented. The x-axis in Figure 2
represents the start-time, while the y-axis represents the end-
time of the activity in the network trace. The value of k for the
k-means was set to 5 in this experiment. From an anecdotal
point of view, we can see that the clustering results in Figure 2
have their own skeletal structure. However, this is not the case
in Figure 3. In Figure 3, data privacy using noise addition was
applied idealistically, without much consideration given to the
issue of usability.
Figure 3: Idealistic Privacy application and clustering results
An anecdotal view of results in Figure 3 might point to better
privacy, since the skeletal cluster structure of the original data
was dismantled and replaced with a new skeletal cluster
structure.
Figure 4: K-means clustering for the anonymized start and end-time data.
However, usability remains a challenge, as the anonymized
clustering results are far from being close to the original
clustering. In the case of this study, the aim was to obtain
clustering results with better usability. Therefore, a re-tuning
of the parameters in the data privacy procedure is done to
achieve better usability. On the other hand, the goal of using
cluster analysis with k-means was to analyze how the
unlabeled original network trace data would perform in
comparison to the anonymized data. Furthermore, the Davies-
Bouldin criterion shows a value of 0.522, as depicted in Table
II, indicating how well the clustering algorithm (k-means)
performed with the original time-stamp (start and end times)
data. In Figure 4, clustering results (with k=5 for the k-means)
for the anonymized data are presented, with the x-axis
showing the start time and the y-axis presenting the end time.
Figure 5: K-means Cluster performance showing the average distance within
centroid and items in each cluster
The Davies-Bouldin criterion for the clustering performance on
the anonymized data was 0.393, as shown in Table II, a value
lower than that of the original data, and an indication of better
clustering. However, while an anecdotal view of the plots
shows that the cluster results look similar, the number of items
in each cluster in the anonymized data differ from that of the
original, as shown in Figure 5. For instance, in Figure 5, the
number of items in cluster 0 for the original data is at 310678,
while that of the anonymized data is at 291002. The trade-off
would be the difference of 19676 items. The challenge still
remains as to how to effectively balance anonymity and usability
requirements, with trade-offs. In this case, if the usability
threshold is not met, then the curator can fine-tune the
anonymization parameters. The average-within-centroid
distance returned a lower value for the anonymized data at
77865, and for the original data at 157093, with the lower
value indicating better clustering, as shown in Figure 5.
B. Source and destination IP address anonymity results
The IP address remains a challenging attribute to anonymize
due to the finite nature of the IP addresses. Each octet is
limited to a range of 0 to 255 and obfuscation becomes
constricted to that range. As we hinted earlier, it would not
make any sense to have octet values ranging between 270
and 450, for instance. In this section we present preliminary
results on the anonymization and usability of the source and
destination IP attribute values using the heuristics in section 3.
Correlation: The correlation between the original and
anonymized data, as shown in Table III, for the first octet of
the source and destination IP show values of 0.9 and 1
respectively. These strong correlation values are indicative of
a strong linear relationship between the original and
anonymized octet 1 data. The first octet of the IP address was
anonymized using noise addition and generalization to keep
the flow structure similar to the original. Since a partial prefix
preserving anonymization was used, it is noteworthy that there
are strong correlation values between the original and
anonymized data for the first octet IP values.
TABLE III. STATISTICAL TRAITS OF ORIGINAL AND ANONYMIZED SOURCE AND DESTINATION IP ADDRESSES
Our view is that a researcher could still derive general
network information from the flow structure presented by the
first octet in the IP address without compromising the
specifics of the other three inner octets. Yet the correlation
between the anonymized and original data for octets 2, 3,
and 4 shows values of 0 for the destination IP addresses
and minimal values of -0.081, 0.093, and 0.213 for the source IP
addresses, indicating a very weak relationship
between the anonymized and original data for octets 2, 3, and
4. However, the very low correlation values might be a good
indicator for stronger privacy, since we employed differential
privacy in the anonymization of octets 2, 3 and 4. Therefore
the correlation between the anonymized and original data
would be nonexistent or at least very minimal due to the
differential privacy randomization. Hence the partial prefix-
preserving heuristic works in this case: the user of the
anonymized data is only able to derive information from the
first octet while all other internal IP address information is
kept confidential.
Entropy: The Shannon Entropy test was done on both the
original and anonymized data IP addresses to study the
uncertainty and randomness in the data sets. The normalized
Shannon's entropy values range between 0 and 1, with 0
indicating certainty and 1 indicating uncertainty. As shown in
both Table III and Figure 6, the entropy values for octet 1 in
both the original and anonymized data are approximately
0.1, indicative of certainty of values and thus maintenance of
flow in the first octet. However, for octets 3 and 4, there is
much less certainty in the original data and in octets 2, 3, and
4 for the anonymized data, though much lower than the
original. Nevertheless, octet 2 in the original data provides
more certainty than octet 2 in the anonymized data. While the
entropy levels in octet 3 and 4 in the original data seem higher
than that of the anonymized data, overall, octets 2, 3, and 4 in
the anonymized data, provide more distributed uncertainty,
better randomness, and thus better anonymity. Yet still, we
constrained the random values in octet 2, 3, and 4 generated
during the differential privacy procedure not to exceed 255.
An octet value of 355 or 400 would affect the usability of the
anonymized IP address data. However, it could be argued that
the certainty levels are maintained in octet 1 for both original
and anonymized data, with distortion on octet 2, 3, and 4 in
the anonymized data, indicating that the flow structure is kept,
and thus partial prefix-preserving anonymity might be
achieved.
Figure 6: Normalized Shannon's Entropy values for the original and
anonymized IP addresses.
Frequency Distribution histogram analysis: Furthermore, we
did a frequency analysis to compare the distribution of values
in each octet in the IP address, for both the original and
anonymized IP addresses. For the original data the number of
items in octet 1 between 40 and 45, that is, source IP addresses
that start with octet values 40 to 45, came to approximately
400,000 out of 500,000 records, as shown in Figure 7. Similar
results were obtained for the destination IP address, for octet
1, with about 300,000 items with values 40 to 45, as illustrated
in Figure 8. With the exception of octet 2, the values in octet 3
and 4 are distributed across the range 0 to 85 in the original IP
address data; this correlates with results shown in Figure 6,
with higher entropy values for octet 3 and 4 in the original
data, indicating more uncertainty. The x-axis in each graph
represents the IP octet values, and the y-axis, shows the
frequency of each of those octet values. However, a look at
the anonymized IP address data shows that octet 1 had about
390,000 IP address octet 1 values beginning with 200, as
shown in Figure 9 and 10, for both source and destination IP
address data respectively. The results show the effect of
generalization used in the obfuscation of the original data for
octet 1. The values in octet 2, 3 and 4 were distributed across
the 0 to 255 range, with the highest concentration around octet
value 190 due to the constraints placed on the differential
privacy results, to prevent a return of values greater than 255. It
would not make much sense, as mentioned earlier, to have
differential privacy results that exceed 255. For octets 2, 3, and
4, the Laplace distribution is kept due to the noise distribution
used in the differential privacy process.
Figure 7: Frequency distribution for the original source IP octet values.
Figure 8: Frequency distribution for the original destination IP octet values
Our recommendation as a result of this study is that a privacy
engineering approach be highly considered by curators during
the anonymization process.
V. CONCLUSION
Anonymizing network traces while maintaining an acceptable
level of usability remains a challenge, especially when
employing privatization techniques used for microdata
obfuscation. Moreover, obfuscating network traces remains
problematic due to the IP addresses and octet values being
finite. Furthermore, generalized anonymization approaches
fail to deliver specific solutions, as each entity will have
unique data privacy and usability requirements, and the data in
most cases have varying characteristics to be considered
during the obfuscation process. In this study, we have
provided a review of literature, pointing out some of the
ongoing challenges in the network trace anonymization over
the last 10-year period. We have suggested usability-aware
anonymization heuristics by employing microdata privacy
techniques, while taking into consideration the usability of the
anonymized network trace data. Our preliminary results show
that with trade-offs, it might be possible to generate
anonymized network traces on a case-by-case basis, using
micro-data anonymization techniques, such as differential
privacy, k-anonymity, generalization, and multiplicative noise
addition.
Figure 9: Frequency distribution for anonymized source IP octet values
In the initial stage of the privacy engineering process, the
curators could gather privacy and usability requirements from
the stakeholders involved; this would include both the policy
makers and anticipated users (researchers) of the anonymized
network trace data. The curators could then model the most
applicable approach given trade-offs, on a case-by-case basis.
The generated anonymization model could then be
implemented across the enterprise for uniformity and
prevention of information leakage attacks. On the limitations
of this study, focus was placed on usability-aware
anonymization of network trace data and not on the types of
attacks on anonymized network traces. While some
consideration and mention of anonymization attacks was
given in this study, focusing on de-anonymization attacks was
beyond the scope of this study, and a subject left for future
work.
Figure 10: Frequency distribution for anonymized destination IP octet values
ACKNOWLEDGMENT
We would like to express our appreciation to the Los
Alamos National Laboratory, and more specifically, the
Advanced Computing Solutions Group, for making this work
possible.
REFERENCES
[1] D. A. Maltz, J. Zhan, G. Xie, H. Zhang, G. Hjálmtýsson, A. Greenberg,
and J. Rexford, “Structure preserving anonymization of router
configuration data”, In Proceedings of the 4th ACM SIGCOMM
conference on Internet measurement (IMC '04), 2004, Pages 239-244.
[2] A. Slagell, J. Wang, and W. Yurcik, "Network log anonymization:
Application of crypto-pan to cisco netflows." In Proceedings of the
Workshop on Secure Knowledge Management , 2004.
[3] M. Bishop, R. Crawford, B. Bhumiratana, L. Clark, and K. Levitt,
"Some problems in sanitizing network data.", 15th IEEE International
Workshops on Enabling Technologies: Infrastructure for Collaborative
Enterprises, 2006., pp. 307-312.
[4] S.E. Coull, C.V. Wright, F. Monrose, M.P. Collins, and M.K. Reiter,
"Playing Devil's Advocate: Inferring Sensitive Information from
Anonymized Network Traces." In NDSS, 2007, vol. 7, pp. 35-47.
[5] B.F. Ribeiro, W. Chen, G. Miklau, and D.F. Towsley, "Analyzing
Privacy in Enterprise Packet Trace Anonymization." In NDSS, 2008.
[6] S. Gattani and T.E. Daniels, “Reference models for network data
anonymization”, In Proceedings of the 1st ACM workshop on Network
data anonymization (NDA '08), 2008, pp. 41-48.
[7] J. King, K. Lakkaraju, and A. Slagell. "A taxonomy and adversarial
model for attacks against network log anonymization." In Proceedings
of the 2009 ACM symposium on Applied Computing, 2009, pp. 1286-
1293.
[8] S.E. Coull, F. Monrose, M.K. Reiter, M. Bailey, "The Challenges of
Effectively Anonymizing Network Data," Conference For Homeland
Security, CATCH 2009, pp.230-236.
[9] M. Foukarakis, D. Antoniades, and M. Polychronakis, “Deep packet
anonymization”, In Proceedings of the Second European Workshop on
System Security (EUROSEC '09). ACM, 2009, pp. 16-21.
[10] A. Sperotto, G. Schaffrath, R. Sadre, C. Morariu, A. Pras, and B. Stiller,
"An overview of IP flow-based intrusion detection." Communications
Surveys & Tutorials, IEEE 12, no. 3, 2010, pp. 343-356.
[11] M. Burkhart, D. Schatzmann, B. Trammell, E. Boschi, and B. Plattner.
"The role of network trace anonymization under attack.", ACM
SIGCOMM Computer Communication Review 40, no. 1, 2010, pp. 5-
11.
[12] F. McSherry, and R. Mahajan, "Differentially-private network trace
analysis.", ACM SIGCOMM Computer Communication Review 41.4,
2011, pp. 123-134.
[13] F. McSherry, and R. Mahajan., "Differentially-private network trace
analysis.", ACM SIGCOMM Computer Communication Review 41, no.
4, 2011, pp. 123-134.
[14] R.R. Paul, V.C. Valgenti, M. Kim, "Real-time Netshuffle: Graph
distortion for on-line anonymization," Network Protocols (ICNP), 19th
IEEE International Conference on, 2011, pp.133,134.
[15] D. Riboni, A. Villani, D. Vitali, C. Bettini, L.V. Mancini, "Obfuscation
of sensitive data in network flows," INFOCOM, 2012 Proceedings,
IEEE, 2012, pp.2372-2380.
[16] W. Qardaji and L. Ninghui, "Anonymizing Network Traces with
Temporal Pseudonym Consistency." IEEE 32nd International
Conference on Distributed Computing Systems Workshops (ICDCSW),
2012, pp. 622-633.
[17] M. Mendonca, S. Seetharaman, and K. Obraczka, "A flexible in-network
ip anonymization service.", In Communications (ICC), 2012 IEEE
International Conference, pp. 6651-6656.
[18] S. Jeon, J-H. Yun, and W-N. Kim, “Obfuscation of Critical
Infrastructure Network Traffic using Fake Communication”, Annual
Computer Security Applications Conference (ACSAC) 2013, Poster.
[19] M. Nassar, B. al Bouna, and Q. Malluhi, "Secure Outsourcing of
Network Flow Data Analysis.", In Big Data (BigData Congress), 2013
IEEE International Congress, 2013, pp. 431-432.
[20] C. Xiaoyun, S. Yujie, T. Xiaosheng, H. Xiaohong, and M. Yan, "On
measuring the privacy of anonymized data in multiparty network data
sharing.", Communications, China 10, no. 5, 2013, pp. 120-127.
[21] Y-D. Lin, P-C. Lin, S-H. Wang, I-W. Chen, and Y-C. Lai.
"Pcaplib: A system of extracting, classifying, and anonymizing real
packet traces.", IEEE Systems Journal, Issue 99, pp.1-12.
[22] J. Stanek, L. Kencl, and J. Kuthan, "Analyzing anomalies in anonymized
SIP traffic.", In 2014 IFIP Networking Conference, 2014,
pp. 1-9.
[23] D. Riboni, A. Villani, D. Vitali, C. Bettini, and L.V. Mancini,
"Obfuscation of Sensitive Data for Incremental Release of Network
Flows," IEEE Transactions on Networking, Issue 99, 2014, pp.1.
[24] T. Farah, and L. Trajkovic, "Anonym: A tool for anonymization of the
Internet traffic." In IEEE 2013 International Conference on Cybernetics
(CYBCONF), 2013, pp. 261-266.
[25] A.J. Slagell, K. Lakkaraju, and K. Luo, "FLAIM: A Multi-level
Anonymization Framework for Computer and Network Logs." In LISA,
vol. 6, 2006, pp. 3-8.
[26] J. Xu, J. Fan, M.H. Ammar, and Sue B. Moon, "Prefix-preserving ip
address anonymization: Measurement-based security evaluation and a
new cryptography-based scheme.", In 10th IEEE International
Conference on Network Protocols, 2002, pp. 280-289.
[27] M. Burkhart, D. Brauckhoff, M. May, and E. Boschi, "The risk-utility
tradeoff for IP address truncation." In Proceedings of the 1st ACM
workshop on Network data anonymization, 2008, pp. 23-30.
[28] W. Yurcik, C. Woolam, G. Hellings, L. Khan, B. Thuraisingham,
"Measuring anonymization privacy/analysis tradeoffs inherent to sharing
network data", IEEE Network Operations and Management Symposium,
2008, pp.991-994.
[29] V. Ciriani, S.D.C. Vimercati, S. Foresti, and P. Samarati, “Theory of
privacy and Anonymity”, In M. J. Atallah & M. Blanton (Eds.), In
Algorithms and theory of computation handbook, CRC Press, 2009, pp.
18-33.
[30] P. Samarati and L. Sweeney, “Protecting privacy when disclosing
information: k-anonymity and its enforcement through generalization
and suppression”, Technical Report SRI-CSL-98-04, SRI Computer
Science Laboratory, 1998
[31] L. Sweeney, “Achieving k-anonymity privacy protection using
generalization and suppression”, International Journal of Uncertainty
Fuzziness and Knowledge-Based Systems, 10(5), 2002, pp.571–588.
[32] T. Dalenius and S.P. Reiss, “Data-swapping: A technique for disclosure
control”, Journal of Statistical Planning and Inference, 6(1), 1982, pp.
73–85.
[33] J. Kim, “A Method For Limiting Disclosure in Microdata Based
Random Noise and Transformation”, In Proceedings of the Survey
Research Methods, American Statistical Association, Vol. A, 1986, pp.
370–374.
[34] J. Kim and W.E. Winkler, “Multiplicative Noise for Masking
Continuous Data”, Research Report Series, Statistics #2003-01,
Statistical Research Division. 2003, Washington, D.C. Retrieved from
http://www.census.gov/srd/papers/pdf/rrs2003-01.pdf
[35] C. Dwork, “Differential Privacy”, In M. Bugliesi, B. Preneel, V.
Sassone, & I. Wegener (Eds.), Automata languages and programming,
Vol. 4052, 2006, pp. 1–12. Springer.
[36] K. Mivule, “An Investigation Of Data Privacy and utility using machine
learning as a gauge”, Dissertation, Computer Science Department,
Bowie State University, 2014, ProQuest No: 3619387.
[37] M.H. Dunham, “Data Mining Introductory and Advanced Topics”,
2003, pp. 58–60, 97–99. Upper Saddle River, New Jersey: Prentice Hall.
[38] K. Mivule, (2012). “Utilizing noise addition for data privacy, an
Overview”, In Proceedings of the International Conference on
Information and Knowledge Engineering (IKE), 2012, pp. 65–71.
[39] S.E. Coull, C.V. Wright, A.D. Keromytis, F. Monrose, and M.K. Reiter,
“Taming the devil: Techniques for evaluating anonymized network
data”, In Network and Distributed System Security Symposium, 2008,
pp. 125-135.