This document summarizes statistical disclosure control techniques for protecting private data, specifically microaggregation. Microaggregation clusters individual records into small groups to anonymize the data before release, aiming to minimize information loss while preventing re-identification of individuals. The document discusses challenges with multivariate microaggregation and reviews different heuristic approaches. It also covers related topics such as k-anonymity algorithms, clustering techniques for microaggregation (e.g., k-means), and the use of genetic algorithms to handle large datasets.
PERFORMING DATA MINING IN (SRMS) THROUGH VERTICAL APPROACH WITH ASSOCIATION R... (Editor IJMTER)
This technique supports efficient data mining in SRMS (Student Records Management System) through a vertical approach with association rules in distributed databases. The current leading technique is that of Kantarcioglu and Clifton [1]. The system addresses two challenges: computing the union of the private subsets held by each of the interacting users, and testing whether an element held by one user is included in a subset held by another. The existing system relies on techniques such as the Apriori algorithm and the Fast Distributed Mining (FDM) algorithm of Cheung et al. [2], an unsecured distributed version of Apriori. The proposed system offers enhanced privacy and data mining by combining encryption techniques with association rule mining via the FP-Growth algorithm in a private cloud (the system stores subject files organized by branch). As a result, the system is expected to be simpler and more efficient in terms of communication and computational cost: execution time drops, code length decreases, data is located faster, hidden predictive information is extracted from large databases, and overall efficiency should increase by about 20%.
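For a sense of the FP-Growth step in such a pipeline, here is a minimal sketch using the third-party mlxtend library (an assumed toolchain; the paper does not name one) to mine frequent itemsets from toy student-record transactions:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth  # pip install mlxtend

# Toy transactions: subjects taken together by students (hypothetical data).
transactions = [
    ["maths", "physics", "programming"],
    ["maths", "programming"],
    ["physics", "chemistry"],
    ["maths", "physics", "programming"],
]

# One-hot encode the transactions into the boolean matrix mlxtend expects.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets appearing in at least half of the transactions.
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))
```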
Hitachi Data Systems provides storage solutions to help life sciences organizations address challenges from rapidly growing data volumes. Their solutions offer automated data migration between storage tiers to optimize storage utilization and improve computational workflow performance. Long-term data management needs are met through integrated archiving functionality allowing long-term retention of data as required by regulations.
Using Randomized Response Techniques for Privacy-Preserving Data Mining
This document proposes using randomized response techniques to conduct privacy-preserving data mining and build decision tree classifiers from disguised data. It presents a method called Multivariate Randomized Response (MRR) that extends randomized response to handle multiple attributes. Experiments show that while the data is disguised, decision trees built from it can still achieve high accuracy compared to trees built from original data, if the randomization parameter is chosen appropriately. The accuracy is affected by this randomization parameter.
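A one-attribute sketch of the randomized response mechanism that MRR generalizes (my illustration; the multivariate version in the paper handles several attributes jointly): each respondent reports the truth with probability p and the opposite value otherwise, and the true proportion is recovered from the observed proportion.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.random(100_000) < 0.3      # true sensitive attribute, 30% "yes"
p = 0.8                                # probability of answering truthfully

keep = rng.random(truth.shape) < p
reported = np.where(keep, truth, ~truth)   # disguised responses

# Observed "yes" rate: lam = p*pi + (1-p)*(1-pi); invert to estimate pi.
lam = reported.mean()
pi_hat = (lam - (1 - p)) / (2 * p - 1)
print(round(pi_hat, 3))                # close to the true 0.3
```

The same inversion is what lets a decision tree learner estimate class distributions from the disguised data, which is why accuracy depends so strongly on the randomization parameter p.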
This document discusses privacy-preserving techniques for data stream mining. It proposes a hybrid method that uses both rotation and translation based data perturbation to anonymize sensitive attributes in data streams. The key steps are:
1) Select attribute pairs and set security thresholds for perturbation.
2) Apply rotation transformations to selected attribute pairs to distort the data within the security thresholds.
3) Also apply translation perturbations by adding or subtracting random noise values to other attributes.
The goal is to anonymize the data enough to preserve privacy while maintaining accuracy for data stream mining tasks like clustering. Evaluation focuses on balancing privacy protections with preserving data utility for analysis.
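A compact sketch of the two perturbation steps (my illustration of the scheme described above; the angle, noise range, and attribute pairing are assumed parameters rather than values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(50, 10, size=(1000, 4))        # stream batch, 4 attributes

# Step 2: rotate a selected attribute pair (columns 0 and 1) by angle theta.
theta = np.deg2rad(35.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_pert = X.copy()
X_pert[:, :2] = X[:, :2] @ R.T

# Step 3: translate the remaining attributes with random additive noise.
X_pert[:, 2:] += rng.uniform(-5, 5, size=X[:, 2:].shape)

# Rotation preserves pairwise Euclidean distances on the rotated columns,
# which is why distance-based mining (e.g., clustering) keeps its accuracy.
d0 = np.linalg.norm(X[0, :2] - X[1, :2])
d1 = np.linalg.norm(X_pert[0, :2] - X_pert[1, :2])
print(np.isclose(d0, d1))                     # True
```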
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Mining (idescitation)
Nowadays, data sharing between two organizations is common in many application areas, such as business planning or marketing. When data are shared between parties, some sensitive data should not be disclosed to the other parties. Medical records are especially sensitive, so privacy protection is taken more seriously: as required by the Health Insurance Portability and Accountability Act (HIPAA), the privacy of patients must be protected and the security of medical data ensured. To address this problem, released datasets must unavoidably be modified. We propose and implement a method called the hybrid approach for privacy preservation. First, we randomize the original data; then we apply generalization to the randomized data. This technique protects private data with better accuracy, and it can reconstruct the original data with no information loss, preserving the usability of the data.
Volume 7, Issue 8, August 2019, International Journal of Research in Advent Technology (IJRAT), ISSN: 2321-9637 (Online). Published by MG Aricent Pvt Ltd.
Characterizing and Processing of Big Data Using Data Mining Techniques (IJTET Journal)
The document discusses big data and techniques for processing it, including data mining. It begins by defining big data and its key characteristics of volume, variety, and velocity. It then discusses various data mining techniques that can be used to process big data, including clustering, classification, and prediction. It introduces the HACE theorem for characterizing big data based on its huge size, heterogeneous and diverse sources, decentralized control, and complex relationships within the data. The document proposes a big data processing model involving data set aggregation, pre-processing, connectivity-based clustering, and subset selection to efficiently retrieve relevant data. It evaluates the performance of subset selection versus deterministic search methods.
SECURED FREQUENT ITEMSET DISCOVERY IN MULTI PARTY DATA ENVIRONMENT FREQUENT I... (Editor IJMTER)
Security and privacy methods are used to protect data values: private data values are secured with confidentiality and integrity methods, the privacy model hides individual identities in public data values, and sensitive attributes are protected using anonymity methods. In a distributed environment, two or more parties each hold their own private data and can collaborate to compute any function on the union of their data. Secure Multiparty Computation (SMC) protocols are used for privacy-preserving data mining in such environments. Association rule mining techniques are used to fetch frequent patterns; the Apriori algorithm is used to mine association rules in databases. Homogeneous databases share the same schema but hold information on different entities, and a horizontal partition refers to a collection of homogeneous databases maintained by different parties. The Fast Distributed Mining (FDM) algorithm is an unsecured distributed version of the Apriori algorithm, and the Kantarcioglu and Clifton protocol is used for secure mining of association rules in horizontally distributed databases. The Unifying lists of locally Frequent Itemsets Kantarcioglu and Clifton (UniFI-KC) protocol is used for rule mining in a partitioned database environment, and it is enhanced in two ways for better security: a secure threshold-function computation algorithm computes the union of the private subsets held by the interacting players, and a set-inclusion computation algorithm tests whether an element held by one player is included in a subset held by another. The system is further improved to support secure rule mining under a vertically partitioned database environment, and the subgroup discovery process is adapted to the partitioned setting. The system can be extended to support generalized association rule mining and is enhanced to control security leakages in the rule mining process.
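The secure-union step can be illustrated with a commutative-cipher sketch. The following minimal Python example (an illustration, not the UniFI-KC protocol as published; the prime, the hashing, and the two-party setup are all assumptions) uses Pohlig-Hellman-style exponentiation, where encrypting under two keys in either order yields the same value, so doubly encrypted itemsets can be unioned without revealing who contributed which.

```python
import hashlib
import math
import random

P = 2**61 - 1  # a Mersenne prime (assumed parameter choice for the demo)

def keygen(rng: random.Random) -> int:
    # Pick an exponent coprime with P-1 so encryption is invertible.
    while True:
        e = rng.randrange(3, P - 1)
        if math.gcd(e, P - 1) == 1:
            return e

def h(itemset: frozenset) -> int:
    # Hash an itemset to a nonzero group element.
    digest = hashlib.sha256(",".join(sorted(itemset)).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 2) + 2

def enc(x: int, key: int) -> int:
    return pow(x, key, P)  # commutative: enc(enc(x,a),b) == enc(enc(x,b),a)

rng = random.Random(42)
key_a, key_b = keygen(rng), keygen(rng)
party_a = [frozenset({"milk", "bread"}), frozenset({"beer"})]
party_b = [frozenset({"beer"}), frozenset({"diapers"})]

# Each party encrypts its own itemsets, then the other party adds its layer.
double_a = {enc(enc(h(s), key_a), key_b) for s in party_a}
double_b = {enc(enc(h(s), key_b), key_a) for s in party_b}

# Union of doubly encrypted values: duplicates collapse, origins stay hidden.
print("union size:", len(double_a | double_b))  # 3, since {"beer"} is shared
```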
Additive gaussian noise based data perturbation in multi level trust privacy ... (IJDKP)
This document discusses a technique called additive Gaussian noise based data perturbation for privacy preserving data mining. The technique introduces multiple perturbed copies of data for different trust levels of data miners to prevent diversity attacks. Gaussian noise is added to the original data and correlated between copies so that combining copies does not provide additional information about the original data. The goal is to limit what information adversaries can learn from individual or combined copies to within what the data owner intends to share, while still allowing accurate data mining. Experiments on banking customer data show the approach controls the normalized estimation error from individual and combined copies.
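A minimal sketch of the correlated-noise idea (my own illustration of the construction the summary describes; the variances and data sizes are assumptions): the higher-noise copy's perturbation is built on top of the lower-noise copy's, so combining the two copies can never recover more than the least-noisy copy already reveals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50.0, 10.0, size=1000)        # original attribute values

sigma_hi_trust = 2.0                          # less noise for more-trusted miners
sigma_lo_trust = 5.0                          # more noise for less-trusted miners

z1 = rng.normal(0.0, sigma_hi_trust, x.shape)
# Noise of the low-trust copy *contains* z1, so the copies are correlated:
extra_var = sigma_lo_trust**2 - sigma_hi_trust**2
z2 = z1 + rng.normal(0.0, np.sqrt(extra_var), x.shape)

copy_hi = x + z1                              # released at the high trust level
copy_lo = x + z2                              # released at the low trust level

# Averaging both copies is no better than using copy_hi alone:
combined = (copy_hi + copy_lo) / 2
print(np.std(copy_hi - x), np.std(combined - x))  # combined error > copy_hi error
```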
AN EFFICIENT SOLUTION FOR PRIVACY-PRESERVING, SECURE REMOTE ACCESS TO SENSITIV... (cscpconf)
Sharing data that contains personally identifiable or sensitive information, such as medical records, always has privacy and security implications. The issues can become rather complex when the methods of access vary and accurate individual data must be provided while mass data release for specific purposes (for example, medical research) also has to be catered for. Although various solutions have been proposed to address the different aspects individually, a comprehensive approach is highly desirable. This paper presents a solution for maintaining the privacy of data released en masse in a controlled manner, and for providing secure access to the original data for authorized users. The results show that the solution is provably secure and maintains privacy more efficiently than previous solutions.
Data has become an indispensable part of every economy, industry, organization, business function, and individual. Big Data is a term used to identify datasets whose size is beyond the ability of typical database software tools to store, manage, and analyze. Big Data introduces unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper presents a literature review of Big Data mining and its issues and challenges, with emphasis on the distinguishing features of Big Data. It also discusses some methods for dealing with big data.
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy (Kato Mivule)
Genomic data provides clinical researchers with vast opportunities to study various patient ailments. Yet the same data contains revealing information, some of which a patient might want to remain concealed. The question then arises: how can an entity transact in full DNA data while concealing certain sensitive pieces of information in the genome sequence, and maintain DNA data utility? As a response to this question, we propose a codon frequency obfuscation heuristic, in which a redistribution of codon frequency values with highly expressed genes is done in the same amino acid group, generating an obfuscated DNA sequence. Our preliminary results show that it might be possible to publish an obfuscated DNA sequence with a desired level of similarity (utility) to the original DNA sequence. http://arxiv.org/abs/1405.5410
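A loose sketch of the underlying idea (an illustration only; the paper's actual frequency-redistribution rule for highly expressed genes is not reproduced here): because several codons encode the same amino acid, codons can be swapped within a synonymous group, shifting codon usage frequencies while leaving the translated protein intact.

```python
# Synonymous codon groups for two amino acids (subset of the standard code).
SYNONYMS = {
    "L": ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"],  # leucine
    "A": ["GCT", "GCC", "GCA", "GCG"],                 # alanine
}
# Fixed permutation within each group: rotate the codons by one position.
SWAP = {}
for group in SYNONYMS.values():
    for codon, repl in zip(group, group[1:] + group[:1]):
        SWAP[codon] = repl

def obfuscate(dna: str) -> str:
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    return "".join(SWAP.get(c, c) for c in codons)

seq = "CTTGCCCTAGCG"
print(obfuscate(seq))  # codon frequencies shift; amino acid sequence unchanged
```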
A new hybrid algorithm for business intelligence recommender system (IJNSA Journal)
Business Intelligence is a set of methods, processes, and technologies that transform raw data into meaningful and useful information. A recommender system is one kind of business intelligence system, used to deliver knowledge to the active user for better decision making. Recommender systems apply data mining techniques to the problem of making personalized recommendations for information. The growth in the amount of information and in the number of users in recent years poses challenges for recommender systems. Collaborative, content-based, demographic, and knowledge-based are four different types of recommender systems. In this paper, a new hybrid algorithm is proposed for a recommender system that combines knowledge-based recommendation, user profiles, and a most-frequent-item mining technique to obtain intelligence.
IRJET- A Study of Privacy Preserving Data Mining and Techniques (IRJET Journal)
This document summarizes a study on privacy preserving data mining techniques. It begins with an abstract that introduces privacy preserving data mining as a technique for analyzing shared data while preserving data sensitivity and privacy. It then reviews literature on recent privacy preserving data mining techniques, including techniques for vertically partitioned databases using homomorphic encryption. The document proposes a new privacy preserving association rule mining model and technique. It concludes that privacy preserving data mining is an important new technique for situations where different parties need to combine data for analysis while preserving privacy.
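For intuition on the homomorphic-encryption ingredient mentioned above, here is a minimal sketch using the third-party `phe` (python-paillier) library (an assumed toolchain; the surveyed papers do not specify one). Paillier ciphertexts can be added without decryption, which is what lets parties aggregate itemset counts over vertically partitioned data without revealing their local values.

```python
from phe import paillier  # pip install phe

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

# Two parties report local itemset counts without revealing them individually.
local_count_a = public_key.encrypt(42)
local_count_b = public_key.encrypt(17)

# Additive homomorphism: the sum of ciphertexts decrypts to the sum of counts.
global_count = local_count_a + local_count_b
print(private_key.decrypt(global_count))  # 59
```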
A survey on data security in data warehousing (Rezgar Mohammad)
This document summarizes a survey on data security techniques for data warehousing environments. It discusses issues with applying existing preventative security solutions like encryption and signatures to data warehouses due to performance overhead. It also examines challenges with reactive solutions like intrusion detection systems in distinguishing normal behavior from attacks in heterogeneous data warehouses. The document concludes by identifying open research challenges to improve solutions for encryption, signatures, intrusion detection, data recovery and benchmarking security in data warehouses.
Data mining over diverse data sources is a useful means of discovering valuable patterns, associations, trends, and dependencies in data. Many variants of this problem exist, depending on how the data is distributed, what type of data mining we wish to do, how to achieve privacy of the data, and what restrictions are placed on sharing information. A transactional database owner lacking the expertise or computational resources can outsource its mining tasks to a third-party service provider or server. However, both the itemsets and the association rules of the outsourced database are considered private property of the database owner.

In this paper, we consider a scenario where multiple data sources are willing to share their data with a trusted third party, called the combiner, who runs data mining algorithms over the union of their data, as long as each data source is guaranteed that its information not pertaining to another data source will not be revealed. The proposed algorithm is characterized by (1) secret-sharing-based secure key transfer with lightweight encryption, used to preserve the privacy of the distributed transactional databases, and (2) a rough-set-based mechanism for association rule extraction, for an efficient mining task. Performance analysis and experimental results are provided to demonstrate the effectiveness of the proposed algorithm.
1. The document discusses Big Data analytics using Hadoop. It defines Big Data and explains the characteristics of volume, velocity, and variety.
2. Hadoop is introduced as a framework for distributed storage and processing of large data sets across clusters of commodity hardware. It uses HDFS for reliable storage and streaming of large data sets.
3. Key Hadoop components are the NameNode, which manages file system metadata, and DataNodes, which store and retrieve data blocks. Hadoop provides scalability, fault tolerance, and high performance on large data sets.
Providing support and services for researchers in good data governance (Robin Rice)
The University of Edinburgh provides support and services to help researchers with good data governance. This includes a research data policy, research data service with various tools across the data lifecycle, and a data safe haven for sensitive data. The research data service offers centralized storage, version control, collaboration tools, and repositories for sharing data openly or long-term retention. Training and outreach aim to educate researchers on topics like data management plans, sensitive data, and GDPR compliance.
Efficient Association Rule Mining in Heterogeneous Data Base (IJTET Journal)
This document summarizes an algorithm for efficiently mining association rules from heterogeneous databases. The algorithm distributes the database across multiple sites while ensuring privacy. Each site locally mines frequent itemsets using an algorithm like Apriori. The sites then securely combine candidate itemsets and check rule confidence to find globally frequent rules meeting minimum support and confidence thresholds. The algorithm uses commutative encryption and a map-reduce model to parallelize the distributed mining efficiently while limiting data disclosure between sites.
IRJET- Swift Retrieval of DNA Databases by Aggregating Queries (IRJET Journal)
This document summarizes a research paper that proposes a new method for securely sharing and querying genomic DNA sequences stored in the cloud without violating privacy. The method builds on existing frameworks by offering deterministic results with zero error probability, and a scheme that is twice as fast but uses twice the storage space, which is preferable given cloud storage pricing. The encoding of the data supports a richer set of query types beyond exact matching, including counting matches, logical OR matches, handling ambiguities, threshold queries, and concealing results from the decrypting server. Linear and logistic regression algorithms are used to analyze the data. The literature review discusses previous work on securely sharing genomic data and transforming protocols to ensure accountability without compromising privacy.
An Investigation of Data Privacy and Utility Preservation Using KNN Classific... (Kato Mivule)
Kato Mivule and Claude Turner, An Investigation of Data Privacy and Utility Preservation Using KNN Classification as a Gauge, International Conference on Information and Knowledge Engineering (IKE 2013), July 22-25, Pages 203-204, Las Vegas, NV, USA
Big data service architecture: a survey
This document discusses big data service architecture. It begins with an introduction to big data services and their economic benefits. It then describes the key components of big data service architecture, including data collection and storage, data processing, and applications. For data collection and storage, it covers Extract-Transform-Load tools, distributed file systems, and NoSQL databases. For data processing, it discusses batch, stream, and hybrid processing frameworks like MapReduce, Storm, and Spark. It concludes by noting big data applications in various fields and cloud computing services for big data.
Towards A Differential Privacy and Utility Preserving Machine Learning Classi... (Kato Mivule)
Kato Mivule, Claude Turner, Soo-Yeon Ji, "Towards A Differential Privacy and Utility Preserving Machine Learning Classifier", Procedia Computer Science (Complex Adaptive Systems), 2012, Pages 176-181, Washington DC, USA.
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION (cscpconf)
Huge volumes of detailed personal data are regularly collected and analyzed by applications using data mining, and sharing these data is beneficial to the application users. On one hand such data is an important asset to business organizations and governments for decision making; at the same time, analyzing it opens threats to privacy if not done properly. This paper aims to release useful information while protecting sensitive data. We use a vector quantization technique to preserve privacy: quantization is performed on training data samples to produce a transformed data set, and this transformed data set does not reveal the original data. Hence privacy is preserved.
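A minimal sketch of codebook-based quantization (my illustration, assuming a k-means codebook via scikit-learn; the paper's exact codebook generation scheme may differ): each record is replaced by its nearest codeword, so only cluster-level values are released.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))               # original sensitive records

# Build a small codebook: k centroids act as the released surrogate values.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# Quantize: every record is replaced by its nearest codeword.
X_released = km.cluster_centers_[km.labels_]

# Utility check: distortion introduced by the quantization.
print("mean squared distortion:", np.mean((X - X_released) ** 2))
```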
In this paper, the authors describe an approach for sharing sensitive medical data with the consent of the data owner. The framework builds on the advantages of Semantic Web technologies and makes sharing sensitive information secure and robust in a controlled environment. The framework uses a combination of role-based and rule-based access policies to secure a medical data repository as per the FAIR guidelines. A lightweight ontology was developed to collect consent from users, indicating which part of their data they want to share with another user having a particular role. The authors consider the scenario of the owner of the data, say the patient, sharing medical data with relevant persons such as physicians, researchers, pharmacists, etc. To prove the concept, the authors developed a prototype and validated it using the Sesame OpenRDF Workbench with 202,908 triples and a consent graph stating consents per patient.
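As a rough illustration of such a consent graph (hypothetical namespace and property names; the paper's actual ontology is not reproduced here), consent statements can be expressed as RDF triples and queried per patient:

```python
from rdflib import Graph, Literal, Namespace  # pip install rdflib

CONSENT = Namespace("http://example.org/consent#")  # hypothetical namespace
g = Graph()
g.bind("consent", CONSENT)

# Patient p1 allows physicians access and consents to share diagnoses only.
g.add((CONSENT.patient1, CONSENT.allowsRole, CONSENT.Physician))
g.add((CONSENT.patient1, CONSENT.allowsDataPart, Literal("diagnosis")))

# Which data parts has patient1 consented to share?
for _, _, part in g.triples((CONSENT.patient1, CONSENT.allowsDataPart, None)):
    print("consented:", part)
```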
Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
Kato Mivule, "Utilizing Noise Addition for Data Privacy, an Overview", Proceedings of the International Conference on Information and Knowledge Engineering (IKE 2012), Pages 65-71, Las Vegas, NV, USA.
This document presents an Intelligent Vertical Handoff Algorithm (IVHA) that uses fuzzy logic to improve handoff decisions between heterogeneous wireless networks. The algorithm has two phases: 1) The handoff initialization phase uses fuzzy logic to adaptively set the handoff threshold based on RSSI, SINR, and data rate to trigger handoffs at the right time. 2) The handoff decision phase uses fuzzy logic to select the best network among available options based on bandwidth, network load, coverage, and user velocity. The algorithm aims to improve quality of service by reducing problems like ping-ponging during handoffs.
This document summarizes research on defeating denial-of-service (DoS) attacks in wireless networks in the presence of jammers. It describes common types of jamming attacks like constant, deceptive, random, and reactive jammers. Detection techniques for jammers and methods to reduce the impact of DoS attacks are discussed. The objective is to detect jammers, lessen the effect of DoS attacks, and improve wireless communication security. Key jamming criteria like energy efficiency, detection probability, denial-of-service level, and strength against physical layer techniques are also outlined.
This document summarizes a study on landslides in the Western Ghat region of Maharashtra, India. It discusses the causes of landslides including heavy rainfall, erosion, and human activities like deforestation and construction. The major types of landslides addressed are creep, slump slides, debris avalanches, earth flows, and rock falls. The document then provides details on specific landslide events in the study area and suggests mitigation approaches like drainage control, stabilizing slope materials, planting vegetation, and setting structures back from slopes to minimize landslide risks and impacts.
This document describes a hybrid energy management system based on a fuzzy logic controller for power distribution. The system uses four power sources - wind power, photovoltaics, fuel cells, and electric power - connected to a common DC bus. An automatic energy management system provides load sharing between the power sources based on the load demand. Experimental results show that the system can successfully meet different load levels of 1000W, 2000W, and 3000W by distributing power from the sources according to the fuzzy logic controller and without wasted power.
The document analyzes the performance of M-ary modulations through the human body area channel. It simulates M-ary PAM and M-ary BOK modulation schemes at different carrier frequencies to obtain bit error rates. The simulations show that a carrier frequency of 2400MHz provides the best performance for both 16-PAM and 32-PAM modulation, with minimum bit error rates achieved using a selective rake receiver. Partial rake receivers performed poorer than selective rake receivers for all modulation schemes and frequencies tested.
This document summarizes a study on segmenting cysts in breast ultrasound images using texture features and an active contour method. The authors apply the Chan-Vese level set method to segment cyst regions based on texture features calculated from the images using different kernel sizes. Segmentation performance is evaluated using measures like area error rate, DICE coefficient, sensitivity and Hausdorff distance. The results show that mean texture features and preprocessing the images with Qui's mask produce more accurate segmentations with lower error rates compared to other texture features and kernels.
This study investigated the use of low-cost agricultural waste materials as biosorbents for removing chromium (VI) from wastewater. Batch experiments were conducted using sweetlime fruit skin and bagasse to adsorb chromium (VI) at different concentrations, pH levels, and adsorbent amounts. The results showed that adsorption was most effective at lower chromium (VI) concentrations and acidic pH levels. Sweetlime fruit skin achieved 65% removal at 40 μg/L chromium (VI) and pH 2.5, while bagasse achieved 75% removal at the same concentration and pH 5. The study suggests that locally available agricultural wastes have potential as low-cost biosorbents for wastewater treatment.
This document describes a proposed advanced car automation and security system. The system would allow for multiple user profiles to be saved, with iris recognition used for authentication. When an authorized user's iris is recognized, the car settings would automatically adjust based on that user's saved profile. If an unauthorized person tries to access the car, a security message would be sent to the car owner. The system aims to provide convenience by automatically adjusting settings for different users, and enhance security by monitoring unauthorized access and notifying the owner. It would use an FPGA for automation control, MATLAB for authentication, and a GSM module for communication.
This document describes an algorithmic approach for detecting car accidents using smartphones. It proposes using sensors in smartphones like GPS, accelerometers, and microphones to detect accidents. The algorithm uses an 11-tuple model including factors like acceleration, sound, and speed to predict accidents. If acceleration and sound thresholds are exceeded while speed is above a minimum, or if movement is below a distance threshold after speed drops, an accident is detected. The algorithm aims to provide rapid emergency notification by detecting accidents and alerting emergency contacts.
This document summarizes a research paper that proposes using a hotspot algorithm to improve node stability in mobile ad hoc networks (MANETs). The paper first provides background on MANETs and challenges with routing in dynamic network topologies. It then discusses the importance of node stability and proposes using a hotspot algorithm to identify stable nodes. The algorithm calculates stability factors for nodes based on their mobility and neighbors' mobility. Routing is done through stable nodes to improve efficiency. The paper models this approach in a network simulator and analyzes results on parameters like packet loss and throughput. Future work involves further optimizing the network using this routing method.
This document summarizes the RedTacton technology, which enables data transfer between two devices through physical contact with the human body. RedTacton uses electric fields generated by the human body as a transmission medium. It allows for connectivity between various personal devices through natural physical interactions like handshakes. The technology works by placing sensors on the body that can detect minute electric fields used to transmit data in a point-to-point way. RedTacton provides a new way of connecting devices through a human-centered approach and establishes a type of network called a Human Area Network. Future applications could include uses in healthcare, security, and other areas where device-to-device communication through touch is useful.
This document discusses material selection for the structural design of a mini milling machine. It begins by analyzing the existing cast iron structure and then proposes a hybrid structure using both casting and fabricated parts. A methodology is presented involving material selection criteria, CAD modeling, and analysis. Various materials are considered for different components, including steel, cast iron, and polymer composites for the base. An analytic hierarchy process is used to rank materials based on properties like strength, damping, and cost. Steel, cast iron, and a polymer composite are compared for the base, with the composite ranking highest based on its properties and manufacturability.
This document summarizes security issues related to mobile devices, networks, and communication. It discusses how mobile devices store sensitive data and access various networks, raising security concerns. Issues addressed include unauthorized access of data on lost or stolen devices, insecure communication channels, and vulnerabilities in mobile networks like cellular networks. The document also examines existing security measures and the need for improved solutions to address issues like authentication, encryption, and access control across mobile technologies.
This document summarizes and evaluates scheduling algorithms for wireless IP networks that support multiclass traffic. It begins by describing the challenges of providing quality of service (QoS) over wireless networks due to time-varying transmission quality and location-dependent errors. It then reviews existing scheduling algorithms like weighted fair queuing (WFQ) and discusses their limitations in wireless environments. The document proposes a new scheduling mechanism that differentiates service between traffic classes and subclasses, allows compensation for non-real time traffic, and adjusts weights of real-time flows in error states to maintain throughput. Overall, the scheduling algorithm aims to provide QoS, fairness between flows, and flexibility to adapt to changing wireless channel conditions.
This document discusses automatic view synthesis from stereoscopic 3D video through image domain warping. It begins with an introduction to stereoscopic 3D cinema and television, and the need for multi-view auto stereoscopic displays that allow glasses-free 3D viewing. It then describes image domain warping, which synthesizes new views from 2-view video using sparse disparity features and warping images to enforce the disparities, rather than using depth maps. The document outlines the image warping process and view synthesis algorithm, which extracts sparse disparity features, calculates warps to enforce the disparities for intermediate views, and warps the images to synthesize the output views.
The document discusses an algorithm called Adaptive Multichannel Component Analysis (AMMCA) for separating image sources from mixtures using adaptively learned dictionaries. It begins by reviewing image denoising using learned dictionaries, then extends this to image separation from single mixtures. The key contribution is applying this approach to separating sources from multichannel mixtures by learning local dictionaries for each source during the separation process. The algorithm is described and simulated results are shown separating two images from a noisy mixture using the learned dictionaries. In conclusion, AMMCA is able to separate sources without prior knowledge of their sparsity domains by fusing dictionary learning into the separation process.
This document reviews various video steganography methods that use neural networks. It discusses how neural networks can be used for steganalysis to detect hidden data in digital media. The document provides an overview of different neural network approaches that have been used for video watermarking and audio digital watermarking. These include using neural networks to preferentially allocate watermarks to motion coefficients in video and memorizing watermarks in the neurons of a counterpropagation neural network for audio. The conclusion states that neural network techniques can help improve the performance of various video steganography methods.
This document describes an audio steganography technique that aims to increase security by introducing randomness. It discusses how traditional least significant bit (LSB) modification is vulnerable to attacks. The proposed technique randomly selects both the bit position (1st, 2nd, or 3rd LSB) and audio sample for embedding secret message bits. This is intended to prevent attackers from detecting the embedding pattern. The technique uses character encoding like Huffman coding before message bits are hidden in an audio file using the modified LSB method. Experimental results showed the stego audio maintained quality while providing improved security over fixed LSB techniques.
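A toy sketch of the randomized-LSB idea (my illustration; the paper's exact sample-selection and Huffman-coding steps are not reproduced): a shared key seeds a PRNG that picks both the sample index and which of the three lowest bits carries each message bit, so embedder and extractor agree on positions without transmitting them.

```python
import numpy as np

def embed(samples: np.ndarray, bits: list[int], key: int) -> np.ndarray:
    rng = np.random.default_rng(key)
    idx = rng.choice(len(samples), size=len(bits), replace=False)
    planes = rng.integers(0, 3, size=len(bits))     # 1st, 2nd, or 3rd LSB
    out = samples.copy()
    for i, b, bit in zip(idx, planes, bits):
        out[i] = (out[i] & ~np.int16(1 << b)) | np.int16(bit << b)
    return out

def extract(stego: np.ndarray, nbits: int, key: int) -> list[int]:
    rng = np.random.default_rng(key)                # same key => same positions
    idx = rng.choice(len(stego), size=nbits, replace=False)
    planes = rng.integers(0, 3, size=nbits)
    return [int((stego[i] >> b) & 1) for i, b in zip(idx, planes)]

audio = np.random.default_rng(7).normal(0, 3000, 44100).astype(np.int16)
msg = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed(audio, msg, key=1234)
assert extract(stego, len(msg), key=1234) == msg
```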
This document describes a proposed detection and warning system for railway tracks using wireless sensors. The system uses MEMS sensors, GPS, GSM, and ultrasonic sensors to monitor tracks and bridges for damage like cracks or structural issues. If a problem is detected, the system would immediately notify trains in the area through wireless communication. It discusses the technical components of the system in detail, including the microcontroller, sensors, GPS/GSM modules, and wireless transmission. The system aims to more quickly detect track issues and notify trains to prevent delays compared to existing systems. It provides block diagrams of the sensor network components and how they would function on the tracks and on trains.
This document summarizes recent research on regenerated silk fibroin fibers produced through wet-spinning and electrospinning techniques. It discusses how silk fibroin is obtained from silkworm cocoons and its composition. The degumming and dissolution processes to prepare silk fibroin solutions for fiber spinning are described. Wet-spinning involves extruding a silk fibroin dope solution through a spinneret into a non-solvent bath, while electrospinning uses electric fields to spin nano-to-micrometer diameter fibers from silk fibroin solutions. The properties of regenerated silk fibers can be tailored for applications as tissue engineering scaffolds.
Big Data Processing with Hadoop : A ReviewIRJET Journal
1. This document provides an overview of big data processing with Hadoop. It defines big data and describes the challenges of volume, velocity, variety and variability.
2. Traditional data processing approaches are inadequate for big data due to its scale. Hadoop provides a distributed file system called HDFS and a MapReduce framework to address this.
3. HDFS uses a master-slave architecture with a NameNode and DataNodes to store and retrieve file blocks. MapReduce allows distributed processing of large datasets across clusters through mapping and reducing functions.
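As a concrete illustration of the mapping and reducing functions, here is a minimal word-count pair in the style of Hadoop Streaming (a sketch; a real job would be launched through the hadoop-streaming jar rather than run locally):

```python
import sys
from itertools import groupby

def mapper(lines):
    # map phase: emit (word, 1) for every word in the input split
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # reduce phase: pairs arrive grouped by key; sum the counts per word
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # e.g.  cat input.txt | python wordcount.py   (file name is hypothetical)
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```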
Data mining involves discovering hidden patterns in data, while data warehousing involves integrating data from multiple sources and storing it in a centralized location to support analysis. Some key differences are:
- Data mining uses techniques like classification, clustering, and association to discover insights from data, while data warehousing focuses on data integration and OLAP tools.
- Data mining looks for unknown relationships and makes predictions, while data warehousing provides a way to extract and analyze historical data.
- Data warehousing involves extracting, cleaning, and transforming data during an ETL process before loading it into a separate database optimized for analysis. Data mining builds on the outputs of data warehousing.
IRJET-Implementation of Threshold based Cryptographic Technique over Cloud Co...IRJET Journal
This document discusses implementing a threshold-based cryptographic technique for data and key storage security over cloud computing. It proposes a system that encrypts data stored on the cloud to prevent unauthorized access and data attacks by the cloud service provider. The system uses a threshold-based cryptographic approach that distributes encryption keys among multiple users, requiring a threshold number of keys to decrypt the data. This prevents collusion attacks and ensures data remains secure even if some user keys are compromised. The implementation results show the system can effectively secure data on the cloud and protect legitimate users from cheating or attacks from the cloud service provider or other users.
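The document does not give the exact construction, but a standard way to realize such a threshold scheme is Shamir's secret sharing, sketched below for a numeric key; PRIME, split and reconstruct are our own names:

```python
import random

PRIME = 2**127 - 1          # prime field modulus; the secret must be below it

def split(secret, n, t, rng=random.SystemRandom()):
    """Split `secret` into n shares; any t of them reconstruct it."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(t - 1)]
    # share x is the random degree-(t-1) polynomial evaluated at x = 1..n
    return [(x, sum(c * pow(x, j, PRIME) for j, c in enumerate(coeffs)) % PRIME)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = split(123456789, n=5, t=3)
assert reconstruct(shares[:3]) == 123456789   # any 3 of the 5 shares suffice
```

Fewer than t shares reveal nothing about the key, which is what blocks the collusion attacks the document mentions.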
The document proposes an approach to identify which intermediate datasets generated during cloud-based data processing need to be encrypted to preserve privacy, while avoiding encrypting all datasets which is inefficient and costly. It models the generation relationships between datasets and uses an upper-bound constraint on privacy leakage to determine which datasets exceed the threshold. This is formulated as an optimization problem to minimize privacy-preserving costs. Evaluation on real-world datasets shows the approach significantly reduces costs compared to fully encrypting all intermediate datasets.
The document discusses clinical data mining and data warehousing. It begins by introducing clinical data mining as a process to analyze and interpret available clinical data for decision making and knowledge building. It then describes approaches to clinical data mining including data collection, pre-processing, parsing, and applying knowledge to create new databases and queries. The document also discusses online clinical data mining tools, advantages of data warehousing, challenges of clinical data warehousing, and applications of data mining such as creating electronic patient files and improving healthcare quality.
A STUDY OF TRADITIONAL DATA ANALYSIS AND SENSOR DATA ANALYTICSijistjournal
The growth of smart, intelligent devices known as sensors generates large amounts of data. Accumulated over a time span, these data reach such large volumes that they are designated big data, and the repositories that hold them are largely unstructured. Traditional data analytics methods are well developed and widely used to analyze structured data and, to a limited extent, semi-structured data, which involves additional processing overheads. Methods for analyzing unstructured data differ because they rely on a distributed computing approach, whereas structured and semi-structured data can be processed centrally. The work undertaken here is confined to an analysis of both varieties of methods, and the result of this study introduces the methods available to analyze big data.
Big Data in Bioinformatics & the Era of Cloud ComputingIOSR Journals
This document discusses the challenges of big data in bioinformatics and how cloud computing can address them. It notes that high-throughput experiments are generating huge amounts of biological data from fields like genomics and proteomics. Storing and analyzing this "big data" requires massive computational resources that are costly for individual organizations. However, cloud computing provides elastic, on-demand access to storage and processing power at an affordable cost. This allows bioinformatics data to be securely stored and shared on the cloud to enable collaborative analysis and overcome issues of data transfer, storage limitations, and infrastructure maintenance.
In this paper we explore the issue of cache resolution in a mobile ad hoc network. In our vision, cache resolution should satisfy the following requirements: (i) it should incur low message overhead, and (ii) the data should be retrieved with minimum delay. In this paper we show that these goals can be achieved by splitting the one-hop neighbors into two sets based on the transmission range. The proposed approach reduces the number of messages flooded into the network to discover the requested data. The scheme is fully distributed and comes at a very low cost in terms of cache overhead. The experimental results are promising with respect to the metrics studied.
DATA SCIENCE METHODOLOGY FOR CYBERSECURITY PROJECTS cscpconf
Cybersecurity solutions are traditionally static and signature-based. These traditional solutions could be improved with analytic models, machine learning and big data so that they automatically trigger mitigation or provide relevant awareness to control or limit the consequences of threats. This kind of intelligent solution is covered in the context of Data Science for Cybersecurity. Data Science plays a significant role in cybersecurity by utilising the power of data (and big data), high-performance computing and data mining (and machine learning) to protect users against cybercrime. For this purpose, a successful data science project requires an effective methodology to cover all issues and provide adequate resources. In this paper, we introduce popular data science methodologies and compare them in accordance with cybersecurity challenges. A comparison discussion is also delivered to explain the methodologies' strengths and weaknesses in the case of cybersecurity projects.
Big data has yet to be implemented fully in real time; it is still under research. People need to know what to do with enormous data. Insurance agencies are actively participating in the analysis of patients' data, from which useful information can be extracted. Analysis is done in terms of discharge summaries, drug and pharma records, diagnostic details, doctors' reports, medical histories, allergies and insurance policies, to which MapReduce is applied and useful data extracted. More factors are analysed, such as disease types with their contributing causes, insurance policy details along with sanctioned amounts, and family grade-wise segregation.
Keywords: Big data, Stemming, MapReduce, Policy and Hadoop.
Data Deduplication: Venti and its improvementsUmair Amjad
This document summarizes the data deduplication system called Venti and improvements over it. Venti identifies duplicate data blocks using cryptographic hashes of block contents. It stores only a single copy of each unique block. The document discusses three key limitations of Venti: hash collisions, fixed-size chunking sensitivity, and access control. It then summarizes approaches taken by other systems to improve on these limitations, such as using multiple hash functions to reduce collisions, variable-length chunking, and stronger authentication and encryption. In conclusion, while Venti was effective at eliminating data duplication, later systems aimed to address its remaining challenges to handle growing archive sizes securely and efficiently.
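The core mechanism, content-addressed block storage, fits in a few lines. The sketch below (our illustration) uses fixed-size chunking, which is exactly the sensitivity noted above, and SHA-256 rather than Venti's SHA-1:

```python
import hashlib

class BlockStore:
    """Content-addressed store: identical blocks are kept only once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}                        # fingerprint -> block bytes

    def write(self, data):
        """Store data; return the list of fingerprints (the 'recipe')."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            fp = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fp, block)   # duplicate blocks cost nothing
            recipe.append(fp)
        return recipe

    def read(self, recipe):
        return b"".join(self.blocks[fp] for fp in recipe)

store = BlockStore()
r1 = store.write(b"A" * 8192)
r2 = store.write(b"A" * 8192)        # a second copy adds no new blocks
assert len(store.blocks) == 1 and store.read(r1) == b"A" * 8192
```

Shifting the data by one byte changes every fixed-size block and defeats deduplication, which is why later systems moved to variable-length chunking.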
This document discusses privacy-preserving techniques for data stream mining. It proposes a hybrid method that uses both rotation and translation transformations to perturb data streams and preserve privacy. The key steps are:
1) The data stream is represented as a matrix and only numeric attributes are considered.
2) Attribute pairs are randomly selected and perturbed using rotation transformations within a calculated "security range".
3) Additional attributes are perturbed using translation transformations, where random numbers generated by a secure function determine whether values are added to or subtracted from the original data.
4) The perturbed data stream is then used for clustering and analysis while preserving privacy. The goal is to maximize both privacy and utility of results.
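A minimal sketch of the rotation step from the list above (our illustration; the attribute-pair selection and security-range calculation are simplified to a fixed angle):

```python
import numpy as np

def rotate_pair(X, i, j, theta):
    """Rotate attribute pair (i, j) of data matrix X by angle theta.

    Rotation preserves Euclidean distances between records, which is
    why clustering results on the perturbed stream remain usable.
    """
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Y = X.astype(float).copy()
    Y[:, [i, j]] = X[:, [i, j]] @ R.T
    return Y

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
# in the paper's method theta would be drawn from the computed security range
Xp = rotate_pair(X, 0, 1, theta=np.pi / 6)
```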
A Review Paper on Big Data and Hadoop for Data Scienceijtsrd
Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool; rather, it has become a complete subject, which involves various tools, techniques and frameworks. Hadoop is an open source framework that allows one to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Mr. Ketan Bagade | Mrs. Anjali Gharat | Mrs. Helina Tandel "A Review Paper on Big Data and Hadoop for Data Science" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-1 , December 2019, URL: https://www.ijtsrd.com/papers/ijtsrd29816.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/29816/a-review-paper-on-big-data-and-hadoop-for-data-science/mr-ketan-bagade
Applying Classification Technique using DID3 Algorithm to improve Decision Su...IJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment, among many more.
This document provides a review of Hadoop storage and clustering algorithms. It begins with an introduction to big data and the challenges of storing and processing large, diverse datasets. It then discusses related technologies like cloud computing and Hadoop, including the Hadoop Distributed File System (HDFS) and MapReduce processing model. The document analyzes and compares various clustering techniques like K-means, fuzzy C-means, hierarchical clustering, and Self-Organizing Maps based on parameters such as number of clusters, size of clusters, dataset type, and noise.
8 Guiding Principles to Kickstart Your Healthcare Big Data ProjectCitiusTech
This white paper illustrates our experiences and learnings across multiple Big Data implementation projects. It contains a broad set of guidelines and best practices around Big Data management.
How Partitioning Clustering Technique For Implementing...Nicolle Dammann
This document discusses and compares different clustering and conjoint analysis techniques. It provides definitions and examples of cluster analysis and conjoint analysis, highlighting their advantages and limitations. Cluster analysis is used to group similar customers and products for market segmentation. Conjoint analysis determines how people value different attributes of a product or service. The document also compares these two techniques and discusses their various applications in business and marketing research.
A scalabl e and cost effective framework for privacy preservation over big d...amna alhabib
This document proposes a scalable and cost-effective framework called SaC-FRAPP for preserving privacy over big data on the cloud. The key idea is to leverage cloud-based MapReduce to anonymize large datasets before releasing them to other parties. Anonymized datasets are then managed using HDFS to avoid re-computation costs. A prototype system is implemented to demonstrate that the framework can anonymize and manage anonymized big data sets in a highly scalable, efficient and cost-effective manner.
This document summarizes a research paper that examines pricing strategy in a two-stage supply chain consisting of a supplier and retailer. The supplier offers a credit period to the retailer, who then offers credit to customers. A mathematical model is formulated to maximize total profit for the integrated supply chain system. The model considers three cases based on the relative lengths of the credit periods offered at each stage. Equations are developed to represent the profit functions for the supplier, retailer and overall system in each case. The goal is to determine the optimal selling price that maximizes total integrated profit.
The document discusses melanoma skin cancer detection using a computer-aided diagnosis system based on dermoscopic images. It begins with an introduction to skin cancer and melanoma. It then reviews existing literature on automated melanoma detection systems that use techniques like image preprocessing, segmentation, feature extraction and classification. Features extracted in other studies include asymmetry, border irregularity, color, diameter and texture-based features. The proposed system collects dermoscopic images and performs preprocessing, segmentation, extracts 9 features based on the ABCD rule, and classifies images using a neural network classifier to detect melanoma. It aims to develop an automated diagnosis system to eliminate invasive biopsy procedures.
This document summarizes various techniques for image segmentation that have been studied and proposed in previous research. It discusses edge-based, threshold-based, region-based, clustering-based, and other common segmentation methods. It also reviews applications of segmentation in medical imaging, plant disease detection, and other fields. While no single technique can segment all images perfectly, hybrid and adaptive methods combining multiple approaches may provide better results. Overall, image segmentation remains an important but challenging task in digital image processing and computer vision.
This document presents a test for detecting a single upper outlier in a sample from a Johnson SB distribution when the parameters of the distribution are unknown. The test statistic proposed is based on maximum likelihood estimates of the four parameters (location, scale, and two shape) of the Johnson SB distribution. Critical values of the test statistic are obtained through simulation for different sample sizes. The performance of the test is investigated through simulation, showing it performs well at detecting outliers when the contaminant observation represents a large shift from the original distribution parameters. An example application to census data is also provided.
This document summarizes a research paper that proposes a portable device called the "Disha Device" to improve women's safety. The device has features like live location tracking, audio/video recording, automatic messaging to emergency contacts, a buzzer, flashlight, and pepper spray. It is designed using an Arduino microcontroller connected to GPS and GSM modules. When the button is pressed, it sends an alert message with the woman's location, sets off an alarm, activates the flashlight and pepper spray for self-defense. The goal is to provide women a compact, one-click safety system to help them escape dangerous situations or call for help with just a single press of a button.
- The document describes a study that constructed physical fitness norms for female students attending social welfare schools in Andhra Pradesh, India.
- Researchers tested 339 students in classes 6-10 on speed, strength, agility and flexibility tests. Tests included 50m run, bend and reach, medicine ball throw, broad jump, shuttle run, and vertical jump.
- The results showed that 9th class students had the best average time for the 50m run. 10th class students had the highest flexibility on average. Strength and performance generally improved with increased class level.
This document summarizes research on downdraft gasification of biomass. It discusses how downdraft gasifiers effectively convert solid biomass into a combustible producer gas. The gasification process involves pyrolysis and reactions between hot char and gases that produce CO, H2, and CH4. Downdraft gasifiers are well-suited for biomass gasification due to their simple design and ability to manage the gasification process with low tar production. The document also reviews previous studies on gasifier configuration upgrades and their impact on performance, and the principles of downdraft gasifier operation.
This document summarizes the design and manufacturing of a twin spindle drilling attachment. Key points:
- The attachment allows a drilling machine to simultaneously drill two holes in a single setting, improving productivity over a single spindle setup.
- It uses a sun and planet gear arrangement to transmit power from the main spindle to two drilling spindles.
- Components like gears, shafts, and housing were designed using Creo software and manufactured. Drill chucks, bearings, and bits were purchased.
- The attachment was assembled and installed on a vertical drilling machine. It is aimed at improving productivity in mass production applications by combining two drilling operations into one setup.
The document presents a comparative study of different gantry girder profiles for various crane capacities and gantry spans. Bending moments, shear forces, and section properties are calculated and tabulated for 'I'-section with top and bottom plates, symmetrical plate girder, 'I'-section with 'C'-section top flange, plate girder with rolled 'C'-section top flange, and unsymmetrical plate girder sections. Graphs of steel weight required per meter length are presented. The 'I'-section with 'C'-section top flange profile is found to be optimized for biaxial bending but rolled sections may not be available for all spans.
This document summarizes research on analyzing the first ply failure of laminated composite skew plates under concentrated load using finite element analysis. It first describes how a finite element model was developed using shell elements to analyze skew plates of varying skew angles, laminations, and boundary conditions. Three failure criteria (maximum stress, maximum strain, Tsai-Wu) were used to evaluate first ply failure loads. The minimum load from the criteria was taken as the governing failure load. The research aims to determine the effects of various parameters on first ply failure loads and validate the numerical approach through benchmark problems.
This document summarizes a study that investigated the larvicidal effects of Aegle marmelos (bael tree) leaf extracts on Aedes aegypti mosquitoes. Specifically, it assessed the efficacy of methanol extracts from A. marmelos leaves in killing A. aegypti larvae (at the third instar stage) and altering their midgut proteins. The study found that the leaf extract achieved 50% larval mortality (LC50) at a concentration of 49 ppm. Proteomic analysis of larval midguts revealed changes in protein expression levels after exposure to the extract, suggesting its bioactive compounds can disrupt the midgut. The aim is to identify specific inhibitor proteins in the midgut.
This document presents a system for classifying electrocardiogram (ECG) signals using a convolutional neural network (CNN). The system first preprocesses raw ECG data by removing noise and segmenting the signals. It then uses a CNN to extract features directly from the ECG data and classify arrhythmias without requiring complex feature engineering. The CNN architecture contains 11 convolutional layers and is optimized using techniques like batch normalization and dropout. The system was tested on ECG datasets and achieved classification accuracy of over 93%, demonstrating its effectiveness at automated ECG classification.
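As a flavour of such a network, here is a much shallower stand-in in PyTorch (a sketch, not the paper's 11-layer architecture; the beat length of 180 samples and the 5 arrhythmia classes are assumptions of ours):

```python
import torch
import torch.nn as nn

class ECGConvNet(nn.Module):
    """Small 1D CNN over single-lead ECG segments (sketch, not the paper's net)."""

    def __init__(self, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2),
            nn.BatchNorm1d(16), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(2),
            nn.Dropout(0.5),
        )
        self.head = nn.Linear(32 * 45, n_classes)   # 180 -> 90 -> 45 after pooling

    def forward(self, x):                           # x: (batch, 1, 180)
        return self.head(self.features(x).flatten(1))

model = ECGConvNet()
logits = model(torch.randn(8, 1, 180))              # 8 beats -> (8, 5) class scores
```

The point of learning features with convolutions, as the document notes, is that no hand-crafted feature engineering step is needed before classification.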
This document presents a new algorithm for extracting and summarizing news from online newspapers. The algorithm first extracts news related to the topic using keyword matching. It then distinguishes different types of news about the same topic. A term frequency-based summarization method is used to generate summaries. Sentences are scored based on term frequency and the highest scoring sentences are selected for the summary. The algorithm was evaluated on news datasets from various newspapers and showed good performance in intrinsic evaluation metrics like precision, recall and F-score. Thus, the proposed method can effectively extract and summarize online news for a given keyword or topic.
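The term-frequency scoring described above is easy to sketch (our illustration, with a naive regex sentence splitter and no stop-word handling):

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Score sentences by average term frequency; keep the top scorers."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tf = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))

    def score(s):
        words = re.findall(r"\w+", s.lower())
        return sum(tf[w] for w in words) / max(len(words), 1)  # length-normalized

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in ranked)       # keep original order

text = ("Hadoop stores big data. Hadoop processes big data in parallel. "
        "The weather was pleasant.")
print(summarize(text, n_sentences=1))   # selects a high-frequency-term sentence
```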
International Journal of Research in Advent Technology, Vol.2, No.5, May 2014
E-ISSN: 2321-9637
Survey of Statistical Disclosure Control Technique
Jony N1, Anitha M2
Department of Computer Science and Engineering1,2, Adithya Institute of Technology, Coimbatore1,2
Email: nesajony@gmail.com1, anithash0911@gmail.com2
Abstract- In distributed scenarios the concept of double-phase microaggregation has been proposed, an improvement of classical microaggregation. It is mainly used for protecting privacy without fully trusted parties. The proposed architecture allows the remote collection and sharing of biomedical data. The private data concerned should be kept secret and conveyed efficiently between the numerous storage and grouping entities without any failure. Double-phase aggregation meets the essentials of privacy preservation for biomedical data in the distributed framework of mobile health. Moreover, the aggregation performs comparably to classical microaggregation in terms of information loss, disclosure risk and correlation preservation, while avoiding the limitations of a centralized method.
Index Terms- Distributed environments, Information loss, Disclosure risk.
1. INTRODUCTION
Most organizations require the user to switch from an entry mode to a reporting module when mining data, which assumes that such shared responsibilities as data collection and data quality control are handled in the same place as data entry. As an example, data can be aggregated over days, months and years, and across the enterprise, with a few clicks. All data may at any level be exported to any database by clicking the [Copy all] button. Information may of course be combined over any managerial, environmental or other view that the user chooses to operate on while working with the data.
2. RECENT TRENDS
2.1 Online Health
For the benefits of online health communities to accrue, systems must be developed that are accessible, welcoming, easy to navigate and use, and able to help members recognize information quality and interact with other participants in meaningful ways. The effective design of such systems will be assisted by collaborations among clinicians, informed designers and patients. Health professionals and patients can help explain the physical and emotional stages that individuals go through after they are diagnosed with a specific illness.
2.2 Education
An archive for gathering and analyzing data related to education will acquire, process, document and disseminate data collected by administrations. Data files, documentation and reports are downloadable from the website in public-use format. The website features an online Data Analysis System (DAS) that allows users to conduct analyses on selected datasets within the archive.
2.3 Database and data warehousing
Data warehouses are databases that are used for reporting and data analysis. Big data, however, requires different data warehouses than the traditional ones used in the past 10-20 years. There are numerous open source data warehouses available for different purposes.
2.4 Multidimensional database
A multidimensional database is optimized for online analytical processing (OLAP) applications and for data warehousing. Multidimensional databases are often created with input from relational databases. They can be used for queries about business operations or trends. Multidimensional database management systems (MDDBMS) can process the data in a database at high speed and can generate answers quickly.
2.5 Data aggregation
Data aggregation is the process of transforming scattered data from numerous sources into a single new one. The objective of data aggregation can be to combine sources so that the output is smaller than the input. This helps in processing massive amounts of data in batch jobs and in real-time applications, reduces network traffic and increases performance while in progress. Data aggregation is any process in which information is expressed in a summary form for purposes such as reporting or analysis; a toy example follows the list of capabilities below. Ineffective data aggregation is currently a major factor that limits query
performance. And, with up to 90 percent of all reports containing aggregate information, it becomes clear why proactively implementing an aggregation solution can generate significant performance benefits, opening up the opportunity for companies to enhance their organizations' analysis and reporting abilities. The Informatics solution for B2B data exchange aggregates data in a comprehensive, out-of-the-box environment, fully automating key steps of the data aggregation process and freeing your IT team to focus on your core competencies. Taking those steps further, the Informatics solution for B2B data aggregation increases efficiency, accelerates delivery times, and dramatically reduces costs with a broad range of fully integrated capabilities that include:
2.5.1 Data collection
To gather data from internal and external sources using managed file transfer, which leverages secure communication protocols such as S/FTP, AS1, AS2, HTTP/S and PGP, with data and format validation to confirm the integrity of the data's structure and syntax.
2.5.2 Data transformation
To convert and translate any external or internal file and message format into a canonical format (e.g., XML).
2.5.3 Data normalization
To cleanse and match data and handle all exceptions to ensure high-quality data.
2.5.4 Data enrichment
To access additional sources and systems to extract and append additional information necessary to create a complete data set.
2.5.5 Data mapping
To map the format and structure of data between its source and target systems according to certain transformation rules and business logic.
2.5.6 Data extraction
To select and mine relevant data using specific parameters.
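To ground the definition of data aggregation given above, here is a toy example combining records from two hypothetical source systems into a smaller summary (all names and figures are ours):

```python
from collections import defaultdict

# records gathered from two different source systems
source_a = [{"region": "north", "sales": 120}, {"region": "south", "sales": 80}]
source_b = [{"region": "north", "sales": 45},  {"region": "south", "sales": 60}]

totals = defaultdict(int)
for record in source_a + source_b:      # combine the scattered sources
    totals[record["region"]] += record["sales"]

# the aggregate output is smaller than the combined input
print(dict(totals))                     # {'north': 165, 'south': 140}
```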
The rest of this paper is organized as follows: Section 3 surveys the background and literature of SDC on which the proposed two-phase method builds, and Section 4 presents conclusions.
3. LITERATURE SURVEY
Microaggregation is a family of methods for statistical disclosure control (SDC) of microdata (records on individuals and/or companies), that is, for masking microdata so that they can be released while preserving the privacy of the underlying individuals. The principle of microaggregation is to aggregate original database records into small groups prior to publication. Each group should contain at least k records to prevent disclosure of individual information, where k is a constant value preset by the data protector. Recently, microaggregation has been shown to be useful for achieving k-anonymity, in addition to being a good masking method. Optimal microaggregation (with minimum within-groups variability loss) can be computed in polynomial time for univariate data. Unfortunately, for multivariate data it is an NP-hard problem.
Several heuristic approaches to microaggregation have been proposed in the literature. Heuristics yielding groups with fixed size k tend to be more efficient, whereas data-oriented heuristics yielding variable group sizes tend to result in lower information loss. This paper presents new data-oriented heuristics which improve on the trade-off between computational complexity and information loss and are thus usable for large datasets.
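For concreteness, the sketch below shows a fixed-size heuristic in the spirit of MDAV, the family of methods this survey keeps returning to; it is our own minimal illustration, not the surveyed paper's algorithm. Each pass groups the record farthest from the running centroid with its k-1 nearest neighbours, and every record is finally replaced by its group centroid:

```python
import numpy as np

def fixed_size_microaggregate(X, k):
    """Fixed-size microaggregation sketch in the spirit of MDAV.

    X: (n, d) array of numeric records; k: minimum group size.
    Returns a masked copy where each record is replaced by the
    centroid of its group, so records within a group are identical.
    """
    X = np.asarray(X, dtype=float)
    assert len(X) >= k
    masked = np.empty_like(X)
    remaining = list(range(len(X)))
    while len(remaining) >= 2 * k:
        sub = X[remaining]
        centroid = sub.mean(axis=0)
        # pick the record farthest from the centroid of what is left
        far = remaining[int(np.argmax(((sub - centroid) ** 2).sum(axis=1)))]
        # group it with its k-1 nearest neighbours among the remainder
        dist = ((X[remaining] - X[far]) ** 2).sum(axis=1)
        group = [remaining[i] for i in np.argsort(dist)[:k]]
        masked[group] = X[group].mean(axis=0)
        remaining = [i for i in remaining if i not in group]
    # the last k..2k-1 records form the final group
    masked[remaining] = X[remaining].mean(axis=0)
    return masked
```

Variable-size (data-oriented) heuristics such as V-MDAV relax the exactly-k grouping to exploit natural clusters, trading a little extra computation for lower information loss.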
Microaggregation is a class of perturbative SDC methods for microdata. Given an original set of microdata whose respondents (i.e., contributors) must have their privacy preserved, microaggregation yields a protected data set consisting of aggregate information (e.g., mean values) computed on small groups of records in the original dataset. Since this protected dataset contains only aggregate data, its release is less likely to violate respondent privacy. For the released dataset to stay analytically useful, the information loss caused by microaggregation must be minimized: a way to approach this minimization is for records within each group to be as homogeneous as possible.
Multivariate microaggregation (for several attributes) with maximum within-group record homogeneity is NP-hard, so heuristics are normally used. There is a dichotomy between fixed-size heuristics yielding groups with a fixed number of records and data-oriented heuristics yielding groups whose size varies depending on the distribution of the original records. Even though the latter heuristics can in principle achieve lower information loss than fixed-size microaggregation, they are often dismissed for large datasets due to complexity reasons. For example, the μ-Argus SDC package only features fixed-size microaggregation. Our contribution in this paper is an approach to turn some fixed-size heuristics for multivariate microaggregation of numerical data into data-oriented heuristics with little additional computation. The resulting new heuristics improve the trade-off between information loss and computational complexity.
One approach to facilitate health research and alleviate some of the problems documented above is to de-identify data beforehand or at the earliest opportunity. Many research ethics boards will waive the consent requirement if the data collected or disclosed is deemed to be de-identified. A commonly used de-identification criterion is k-anonymity, and many k-anonymity algorithms have been developed. This criterion stipulates that each record in a dataset is similar to at least another k-1 records on the potentially identifying variables. For example, if k = 5
and the potentially identifying variables are age and gender, then a k-anonymized dataset has at least 5 records for each value combination of age and gender.
A new k-anonymity algorithm, Optimal Lattice Anonymization (OLA), produces a globally optimal de-identification solution suitable for health datasets. We demonstrate on six datasets that OLA results in less information loss and has faster performance compared to current de-identification algorithms.
o Quasi-identifiers
The variables that are going to be de-identified in a dataset are called the quasi-identifiers. Examples of common quasi-identifiers are dates (such as birth, death, admission, discharge, visit, and specimen collection), locations (such as postal codes, hospital names, and regions), race, ethnicity, languages spoken, aboriginal status, and gender.
o Equivalence Classes
All the records that have the same values on the quasi-identifiers are called an equivalence class. For example, all the records in a dataset about 17-year-old males admitted on Jan 1, 2008 are an equivalence class. Equivalence class sizes potentially change during de-identification. For example, there may be 3 records for 17-year-old males admitted on Jan 1, 2008. When the age is recoded to a five-year interval, then there may be 8 records for males between 16 and 20 years old admitted on Jan 1, 2008.
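A small sketch makes this bookkeeping concrete (our illustration, with hypothetical column names): age is recoded to a five-year interval, and the dataset is k-anonymous when every equivalence class over the quasi-identifiers reaches size k:

```python
from collections import Counter

def recode_age(age, width=5):
    # generalize an exact age to a five-year interval such as "15-19"
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def is_k_anonymous(records, quasi_identifiers, k):
    # an equivalence class is the set of records sharing the same values
    # on all quasi-identifiers; every class must contain >= k records
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in classes.values())

records = [
    {"age": recode_age(17), "gender": "M", "diagnosis": "flu"},
    {"age": recode_age(18), "gender": "M", "diagnosis": "asthma"},
    {"age": recode_age(19), "gender": "M", "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["age", "gender"], k=3))   # True: one class of 3
```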
o De-identification Optimality Criterion
A de-identification algorithm balances the probability of re-identification with the amount of distortion to the data (the information loss). For all k-anonymity algorithms, disclosure risk is defined by the k value, which stipulates a maximum probability of re-identification. There is no generally accepted information loss metric.
Microaggregation is a clustering problem with minimum size constraints on the resulting clusters or groups; the number of groups is unconstrained and the within-group homogeneity should be maximized. In the context of privacy in statistical databases, microaggregation is a well-known approach to obtaining anonymized versions of confidential microdata. Optimally solving microaggregation on multivariate data sets is known to be difficult. Therefore, heuristic methods are used in practice. This paper presents a new heuristic approach to multivariate microaggregation, which provides variable-sized groups (and thus higher within-group homogeneity) with a computational cost similar to that of fixed-size microaggregation heuristics.
Microaggregation is a problem appearing in statistical disclosure control (SDC), where it is used to cluster a set of records in groups of at least k records, with k being a user-definable parameter. The collection of groups is called a k-partition of the data set. The microaggregated data set is built by replacing each original record by the centroid of the group it belongs to. The microaggregated data set can be released without jeopardizing the privacy of the individuals which form the original data set: records within a group are indistinguishable in the released data set. The higher the within-group homogeneity in the original data set, the lower the information loss incurred when replacing records in a group by their centroid; therefore, within-group homogeneity is inversely related to the information loss caused by microaggregation.
We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression and image segmentation.
Clustering based on k-means is closely related to a number of other clustering and location problems. These include the Euclidean k-medians problem, in which the objective is to minimize the sum of distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum distance from every point to its closest center. There are no efficient solutions known to any of these problems, and some formulations are NP-hard. An asymptotically efficient approximation for the k-means clustering problem has been presented by Matousek, but the large constant factors suggest that it is not a good candidate for practical implementation. One of the most popular heuristics for solving the k-means problem is based on a simple iterative scheme for finding a locally minimal solution. This algorithm is often called the k-means algorithm.
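For reference, a plain numpy sketch of that iterative scheme follows (our illustration); the filtering algorithm discussed above speeds up precisely the assignment step with a kd-tree, which this sketch does not attempt:

```python
import numpy as np

def lloyd_kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # converged to a local minimum
            break
        centers = new
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
centers, labels = lloyd_kmeans(X, k=2)
```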
An encryption method is presented with the novel property that publicly revealing an encryption key does not thereby reveal the corresponding decryption key. This method provides an implementation of a public-key cryptosystem, an elegant concept invented by Diffie and Hellman. This has two important consequences:
1. Couriers or other secure means are not needed to transmit keys, since a message can be enciphered using an encryption key publicly revealed by the intended recipient. Only he can decipher the message, since only he knows the corresponding decryption key.
2. A message can be signed using a privately held decryption key. Anyone can verify this signature using the corresponding publicly revealed encryption
key. Signatures cannot be forged, and a signer cannot later deny the validity of his signature. This has obvious applications in electronic mail and electronic funds transfer systems. A public-key cryptosystem can thus ensure privacy and enable signatures.
All classical encryption methods suffer from the key distribution problem. The problem is that before a private communication can begin, another private transaction is necessary to distribute corresponding encryption and decryption keys to the sender and receiver, respectively. Typically a private courier is used to carry a key from the sender to the receiver. Such a practice is not feasible if an electronic mail system is to be rapid and inexpensive. A public-key cryptosystem needs no private couriers; the keys can be distributed over the insecure communications channel.
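A toy sketch of the mechanics described (deliberately insecure textbook-sized primes, for illustration only): the public pair (e, n) encrypts or verifies, and the private exponent d decrypts or signs:

```python
from math import gcd

# toy parameters (real RSA uses primes hundreds of digits long)
p, q = 61, 53
n = p * q                    # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                       # public exponent, coprime with phi
assert gcd(e, phi) == 1
d = pow(e, -1, phi)          # private exponent: e*d = 1 (mod phi)

m = 42                       # message encoded as an integer < n
c = pow(m, e, n)             # encrypt with the public key
assert pow(c, d, n) == m     # decrypt with the private key

s = pow(m, d, n)             # sign with the private key
assert pow(s, e, n) == m     # anyone can verify with the public key
```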
Microaggregation is a Statistical Disclosure Control (SDC) technique that aims at protecting the privacy of individual respondents before their data are released. Optimally microaggregating multivariate data sets is known to be an NP-hard problem. Thus, using heuristics has been suggested as a possible strategy to tackle it. Specifically, Genetic Algorithms (GAs) have been shown to be serious candidates that can find good solutions on small data sets. However, due to the very nature of these algorithms and the coding of the microaggregation problem, GAs can hardly cope with large data sets. In order to apply them to large data sets, the latter have to be previously partitioned into smaller disjoint subsets that the GA can handle.
With the aim of protecting against re-identification of individual respondents, microdata sets are properly modified prior to their publication. The degree of modification can vary between two extremes:
(i) encrypting the microdata;
(ii) leaving the microdata intact.
In the first extreme the protection is perfect; however, the utility of the data is almost nonexistent because the encrypted microdata can hardly be studied or analysed. In the other extreme, the microdata are extremely useful (i.e. all their information remains intact); however, the privacy of the respondents is endangered. SDC methods for microdata protection aim at distorting the original data set to protect respondents from re-identification whilst maintaining, as much as possible, some of the statistical properties of the data and minimising the information loss. The goal is to find the right balance between data utility and respondent privacy. Microdata sets are organised in records that refer to individual respondents. Each record has several attributes; the attributes in a microdata set X can be classified in three categories as follows:
1) Identifiers:
Attributes in X that unambiguously identify the respondent, for example passport numbers, full names, etc. The attribute "social security number" is an identifier.
2) Key attributes:
Attributes that, if properly combined, can be linked with external information sources to re-identify some of the respondents to whom some of the records refer, for example address, age, gender, etc.
3) Confidential outcome attributes:
Attributes containing sensitive information on the respondent, namely salary, religion, political affiliation, health condition, etc.
This work presents a formal protection model named k-anonymity and a set of accompanying policies for deployment. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. This paper also examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless accompanying policies are respected. The k-anonymity protection model is important because it forms the basis on which the real-world systems known as Datafly, μ-Argus and k-Similar provide guarantees of privacy protection.
Microaggregation is a technique used by statistical agencies to limit disclosure of sensitive microdata. Noting that no polynomial algorithms are known to microaggregate optimally, Domingo-Ferrer and Mateo-Sanz have presented heuristic microaggregation methods. This paper is the first to present an efficient polynomial algorithm for optimal univariate microaggregation. Optimal partitions are shown to correspond to shortest paths in a network. It is at least as efficient as published heuristic methods and can be used on very large data sets. While the algorithm focuses on univariate data, it can be used on multivariate data when the data vectors are projected onto a single axis.
The paper formulates the microaggregation problem as a shortest-path problem: it constructs a graph and shows that optimal microaggregation corresponds to a shortest path in this graph in a natural way. Each arc of the graph corresponds to a possible group that may be part of an optimal partition, and each arc is labelled by the error that would result if that group were to be included in the partition.
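Read as dynamic programming over the sorted values, the construction is compact. The sketch below is our own rendering of the idea, not the authors' code: node j holds the minimum total within-group error over the first j values, and an arc (i, j) exists whenever the group of values i+1..j has a legal size between k and 2k-1:

```python
def optimal_univariate_microaggregation(values, k):
    """Shortest-path / DP style optimal univariate microaggregation (sketch)."""
    v = sorted(values)
    n = len(v)
    # prefix sums give the SSE of any contiguous group in O(1)
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(v, 1):
        s1[i] = s1[i - 1] + x
        s2[i] = s2[i - 1] + x * x

    def sse(i, j):                      # within-group error of values i+1..j
        m, tot = j - i, s1[j] - s1[i]
        return (s2[j] - s2[i]) - tot * tot / m

    INF = float("inf")
    best = [INF] * (n + 1)              # best[j]: minimal error for first j values
    cut = [0] * (n + 1)                 # predecessor node on the shortest path
    best[0] = 0.0
    for j in range(k, n + 1):
        for i in range(max(0, j - 2 * k + 1), j - k + 1):   # group size k..2k-1
            if best[i] + sse(i, j) < best[j]:
                best[j], cut[j] = best[i] + sse(i, j), i
    # recover the group boundaries by walking the predecessor links
    groups, j = [], n
    while j > 0:
        groups.append(v[cut[j]:j])
        j = cut[j]
    return list(reversed(groups)), best[n]

groups, err = optimal_univariate_microaggregation([1, 2, 3, 10, 11, 12, 13], k=3)
# groups -> [[1, 2, 3], [10, 11, 12, 13]], err -> 7.0 (minimal total SSE)
```

Restricting group sizes to at most 2k-1 is safe because any larger group can be split into legal groups without increasing the error, which keeps the graph, and hence the algorithm, of modest size.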
A clustering algorithm is presented for partitioning a minimum spanning tree with a constraint on minimum group size. The problem is motivated by microaggregation, a disclosure limitation technique in which similar records are aggregated into groups containing a minimum of k records. Heuristic clustering methods are needed since the minimum information loss microaggregation problem is NP-hard. The MST partitioning algorithm for microaggregation is sufficiently efficient to be practical for
large data sets and yields results that are comparable to the best available heuristic methods for microaggregation. For data that contain pronounced clustering effects, it results in significantly lower information loss. The algorithm is general enough to accommodate different measures of information loss and can be used for other clustering applications that have a constraint on minimum group size.
To protect the anonymity of the entities (called respondents) to which information refers, data holders often remove or encrypt explicit identifiers such as names, addresses and phone numbers. De-identifying data, however, provides no guarantee of anonymity. Released information often contains other data, such as race, birth date, sex and ZIP code, that can be linked to publicly available information to re-identify respondents and to infer information that was not intended for disclosure. In this paper we address the problem of releasing microdata while safeguarding the anonymity of the respondents to which the data refer. The approach is based on the definition of k-anonymity.
We show how k-anonymity can be provided without compromising the integrity (or truthfulness) of the information released by using generalization and suppression techniques. We introduce the concept of minimal generalization, which captures the property that the release process does not distort the data more than needed to achieve k-anonymity, and present an algorithm for the computation of such a generalization. We also discuss possible preference policies to choose among different minimal generalizations.
4. CONCLUSION
We have presented an architecture that allows the private gathering and sharing of biomedical data in the context of m-health. We have introduced the concept of double-phase microaggregation to limit the information accessible by intermediate entities (such as the SAS). It preserves the correlations of the original data set. We can therefore conclude that the proposed distributed double-phase microaggregation can be applied in a distributed environment to protect the privacy of individuals with the same effects as classical microaggregation. Further research might include the analysis of the influence of time in the series of data collected using our model.
REFERENCES
[1] R. Brand (2002) 'Microdata protection through noise addition', Lecture Notes in Computer Sci., vol. 2316, pp. 97–116.
[2] C. D. Brown (2000) 'Body mass index and prevalence of hypertension and dyslipidemia', Obesity Res., vol. 8, no. 9, pp. 605–619.
[3] T. Dalenius and S. P. Reiss (1982) 'Data-swapping: A technique for disclosure control', Statistical Planning and Inference, vol. 6, no. 1, pp. 73–85.
[4] J. Domingo-Ferrer, F. Sebé, and J. Castellà (2004) 'On the security of noise addition for privacy in statistical databases', Lecture Notes in Computer Sci., vol. 3050, pp. 149–161.
[5] J. Domingo-Ferrer (2006) 'Efficient multivariate data-oriented microaggregation', Int. J. Very Large Databases, vol. 15, no. 4, pp. 355–369.
[6] J. Domingo-Ferrer and J. M. Mateo-Sanz (2002) 'Practical data-oriented microaggregation for statistical disclosure control', IEEE Trans. Knowl. Data Eng., vol. 14, no. 1, pp. 189–201.
[7] K. El Emam (2009) 'Globally optimal k-anonymity for de-identification of health data', J. Amer. Med. Inform. Assoc., vol. 16, no. 5, pp. 670–682.
[8] T. ElGamal (1985) 'A public-key cryptosystem and a signature scheme based on discrete logarithms', IEEE Trans. Inf. Theory, vol. 31, no. 4, pp. 469–472.
[9] B. Greenberg (1987) 'Rank swapping for masking ordinal microdata', Tech. report, U.S. Bureau of the Census, unpublished.
[10] S. L. Hansen and S. Mukherjee (2003) 'A polynomial algorithm for optimal univariate microaggregation', IEEE Trans. Knowl. Data Eng., vol. 15, no. 4, pp. 1043–1044.
[11] M. Naehrig, K. Lauter, and V. Vaikuntanathan (2011) 'Can homomorphic encryption be practical?', in Proc. 3rd ACM Cloud Computing Security Workshop (CCSW'11), New York, NY, USA, pp. 113–124.
[12] G. J. Matthews and O. Harel (2011) 'Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy', Statist. Surveys, vol. 5, pp. 1–29.
[13] D. Pagliuca and G. Seri (1998) 'Some results of individual ranking method on the system of enterprise accounts annual survey', Esprit SDC Project, Deliverable MI3/D2.
[14] R. Rivest, A. Shamir, and L. Adleman (1978) 'A method for obtaining digital signatures and public-key cryptosystems', Commun. ACM, vol. 21, no. 2, pp. 120–126.
[15] P. Samarati (2001) 'Protecting respondents' identities in microdata release', IEEE Trans. Knowl. Data Eng., vol. 13, no. 6, pp. 1010–1027.
[16] A. Solanas and A. Martínez-Ballesté (2006) 'V-MDAV: Variable group size multivariate microaggregation', in Proc. COMPSTAT, pp. 917–925.
[17] A. Solanas, A. Martínez-Ballesté, and Ú. González-Nicolás (2010) 'A variable-MDAV-based partitioning strategy to continuous multivariate microaggregation with genetic algorithms', in Proc. Int. Joint Conf. Neural Networks (IJCNN), pp. 1–7.
[18] L. Sweeney (2002) 'k-anonymity: A model for protecting privacy', Int. J. Uncertainty, Fuzziness and Knowledge-Based Syst., vol. 10, no. 5, pp. 557–570.
[19] L. Willenborg and T. DeWaal (1996) 'Statistical Disclosure Control in Practice', Lecture Notes in Statistics, vol. 111, New York, NY, USA: Springer-Verlag.
[20] L. Willenborg and T. DeWaal (2001) 'Elements of Statistical Disclosure Control', Lecture Notes in Statistics, vol. 155, New York, NY, USA: Springer-Verlag.
[21] J. Domingo-Ferrer and J. M. Mateo-Sanz (2002) 'Practical data-oriented microaggregation for statistical disclosure control', IEEE Trans. Knowl. Data Eng., vol. 14, no. 1, pp. 189–201.
[22] J. Domingo-Ferrer and V. Torra (2001) 'A quantitative comparison of disclosure control methods for microdata', in Confidentiality, Disclosure, and Data Access: Theory and Practical Application for Statistical Agencies, P. Doyle, J. Lane, J. Theeuwes, and L. Zayatz, eds., Amsterdam: North-Holland, pp. 111–133.
[23] S. L. Hansen and S. Mukherjee (2003) 'A polynomial algorithm for optimal univariate microaggregation', IEEE Trans. Knowl. Data Eng., vol. 15, no. 4, pp. 1043–1044.
[24] A. K. Jain, M. N. Murty, and P. J. Flynn (1999) 'Data clustering: A review', ACM Computing Surveys, vol. 31, no. 3.