SlideShare a Scribd company logo
1 of 6
Download to read offline
International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 5, October 2018, pp. 2920~2925
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i5.pp2920-2925  2920
Journal homepage: http://iaescore.com/journals/index.php/IJECE
Framework to Avoid Similarity Attack in Big Streaming Data
Ganesh Dagadu Puri, D. Haritha
CSE, Koneru Lakshmaiah Education Foundation, India
Article Info ABSTRACT
Article history:
Received Dec 13, 2017
Revised Feb 24, 2018
Accepted Aug 12, 2018
The existing methods for privacy preservation are available in variety of
fields like social media, stock market, sentiment analysis, electronic health
applications. The electronic health dynamic stream data is available in large
quantity. Such large volume stream data is processed using delay free
anonymization framework. Scalable privacy preserving techniques are
required to satisfy the needs of processing large dynamic stream data. In this
paper privacy preserving technique which can avoid similarity attack in big
streaming data is proposed in distributed environment. It can process the data
in parallel to reduce the anonymization delay. In this paper the replacement
technique is used for avoiding similarity attack. Late validation technique is
used to reduce information loss. The application of this method is in medical
diagnosis, e-health applications, health data processing at third party.
Keyword:
Big data
Distributed
Privacy
Similarity
Copyright © 2018 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Ganesh Dagadu Puri,
CSE KLEF, Vaddeswaram, Guntur,
Andhra Pradesh, 522502 - India.
Email: puriganeshengg@gmail.com
1. INTRODUCTION
Nowadays electronic health information and electronic health applications are available in large
quantity [1]. The users of health information like health care providers, researchers, analysts use this data for
making inferences [2]. Since health records contain the private data of patient, the access is restricted. To
make this access easy and possible, privacy preservation techniques are useful. Electronic health records are
useful for the communication and keeping the information of patient intact. The demand of such big amount
of electronic health data has increased concern of privacy for the patients [3]. For providing privacy to
electronic health data de-identification techniques are used. These techniques provide privacy by removing
direct identifiers which can expose identity of individual or disclose sensitive information of individual. It
provides privacy by suppression, generalization or replacement of the identifiers [4-5].
Various laws in different countries are available for providing privacy to electronic health data [2].
In USA Health Insurance Portability and Accountability Act (HIPAA), Patient Safety and Quality
Improvement Act (PSQIA), HITECH Act protects privacy of electronic health data. Data Protection Act
(DPA) in UK provides options to individuals for protecting information. Russian Federal Law on Personal
Data in Russia makes it necessary to take all permissions for organizations before handing over the health
data to other. Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada give
citizens right to know the reasons behind the collection of private data [6]. IT Act and IT (Amendment) Act
in India suggests strict actions like imprisonment or fine for misusing personal information. Data Protection
Directive in European Union helps to keep fundamental rights of people with respect to accessing of
personal data.
In the anonymization of electronic health data de-identification methods are used. These methods
are further divided into K-anonymity, L-diversity, T- closeness [7-9]. In the L-diversity, there is possibility of
similarity attack. In Figure 1 architecture of delay free anonymization for privacy preservation is shown [10].
Input data is coming from source in terms of tuples. This tuple is divided in two parts.
Int J Elec& Comp Eng ISSN: 2088-8708 
Framework to Avoid Similarity Attack in Big... (Ganesh Dagadu Puri)
2921
Figure 1. Delay free anonymization: Architectural diagram
First part contains quasi identifier and group number. Second part contains sensitive tuple along
with its L-1 counterfeit values, count of each sensitive value and group number. Adding L-1 counterfeit
values with real value will make difficult to disclose the sensitive value. These counterfeit values will be
validated with the upcoming input tuples. Groupwise count of released tuples will be maintained. The
similarity of counterfeit values will affect privacy. It can be avoided by replacing similar values in the group.
2. RELATED WORK
There is need of establishing guidelines for privacy against invasive marketing and inadvertent
privacy disclosure [11]. Privacy requirements in data sharing for big data operators need scalable privacy
preserving algorithms to provide privacy to the datasets. Health information providers can benefit from cost-
profit model to take decision about sharing the health related data to other parties [12]. Privacy requirements
are important in big data collection, storage and intra and inter-organization processing. To make the
computing of big data in privacy preserved way Privacy preserving aggregation, encrypted data operations
and de-identification techniques are suggested [13]. In data privacy, it is required to understand privacy
requirements in data provider, data collector, data miner, decision maker stages [14]. Need of keeping source
or origin of data is important to identify privacy attack. In [10] delay free anonymization technique is used
for to reduce delay and increase data utility by late validation.
Distributed stream processing is done with extending storm capabilities for task management,
scheduling, and executing in distributed manner [15]. DART system propose framework for different devices
present on remote sites in distributed environment. This framework provides facility of registration and
authorization of devices on remote site, task allocation and management of user application. In the system
computation load is reduced by utilizing idle resources [16]. The distributed stream processing systems
possess different availability requirement for different applications. When one of the nodes in distributed
environment gets failed, the backup or secondary server resumes the execution. While doing this, the state
should be maintained. The type of recovery technique and performance is based on stream processing
application [17]. The new stream processing systems exploit the tasks instead of nodes for fault
tolerance [18].
3. RESEARCH METHOD
3.1. The Need and Importance of the Problem
Electronic health data is produced in large quantity. In anonymization of this data minimum
execution time and less information loss is important. Anonymization delay is minimized using delay free
framework. To avoid similarity attack on l-diverse counterfeit group, replacement of similar value is
required. Due to large amount of tuples of electronic health data, there is possibility of formation of similar
groups and it can disclose the sensitive value. Repetition of such values in group is avoided using the
synthetic value formation. The complexity of big electronic health data creates challenge for existing privacy
preserving algorithm which cannot work on large datasets.
In Figure 2 to avoid similarity attack, similarity index of each group is calculated [19]. If some
values are similar then such values will be replaced with other values. For this replacement help of past data
is taken. With the policy of past reflect future, for early late validation of counterfeit values in the group the
values from the past data are selected. Information loss and utility of the replaced data is calculated. It will be
 ISSN: 2088-8708
Int J Elec& Comp Eng, Vol. 8, No. 5, October 2018 : 2920 - 2925
2922
note down in statistic data to see if that replaced value in counterfeit group caused more or less
information loss.
Figure 2. Work flow model to avoid similarity attack
3.2. Algorithm
In the Figure 3 algorithm for the proposed method using big data as input is given. For each tuple set
of streaming big data input [20], the source is maintained. It is useful to find source of data in case of
adversary attack.
Figure 3. Algorithm for proposed method
The incoming stream data may not be in suitable format. Preprocessing is used to convert the
incoming data in suitable format. The steps used for the preprocessing are as follows.
a. Read the url or address of streaming data source.
b. Load the raw data in dataset file.
c. Read the first line of attributes in the file and split it as per the delimiter.
Int J Elec& Comp Eng ISSN: 2088-8708 
Framework to Avoid Similarity Attack in Big... (Ganesh Dagadu Puri)
2923
d. Convert the split data of first line into columns.
e. Read the files data in buffer line by line up to end of file convert it into tuple.
f. Split the data stream using delimiter and insert in the columns.
g. Identify the quasi and sensitive identifiers in data table.
After preprocessing the data is available in proper format. The Anatomy [21] technique divides
input tuples into two parts. The counterfeit values will be added to form the groups. If the counterfeit values
are repeated more no of times in the groups, synthetic values can be used to replace these repeated values
otherwise past data is sufficient to form group of counterfeit values. For each individual group of counterfeit
values similarity index of group will be calculated. If the values are similar then these values will be replaced
with other values from the past data. Late validation is done by maintaining the group count and the released
tuples in the group. Statistic data of information loss and utility measures is maintained. If information loss
ratio is more than threshold value then the process is repeated by changing the values in the group.
4. DISTRIBUTED EXECUTION FLOW
In case of analysis if the organization does not have enough processing capability and infrastructure
to process large amount of data, such stream data will be given to third party. In such situation existing
methods are inappropriate to provide enough privacy. Figure 4 show the distributed execution flow of big
streaming data. In delay free anonymization method L-diverse counterfeit values will be generated when new
tuple arrives. It generates these values from past data (domain of sensitive values). For big data, millions of
tuples are arriving in one session and randomly counterfeit values get generated [20]. There is probability of
similar values getting generated in a group. This may cause similarity attack on the patient data.
Figure 4. Distributed Execution flow for big streaming data to avoid similarity attack
To avoid this situation when the similar values get generated in the group, we can replace these
similar values with other sensitive values so that similarity attack can be avoided. At the same time repeated
values among the groups are found and such values are replaced with synthetic values. Vertical dotted lines
in Figure 4 show the execution on different nodes in distributed fashion. While tuples are anonymized and
published on first node, second node will be used for the group data formation and replacement of similar
values. Third node will keep statistic data based information loss due to replaced or synthetic values in group.
Domain of sensitive values contains limited values and these values are getting repeated. For
example for N records N/L groups of counterfeit values will be generated. For 500 records 50 groups with
L=10 will be generated. But as the big data is the input for example there are 500000 records and L=10. It
will generate 50000 groups. In each group the counterfeit sensitive values will get repeated. In sensitive
domain if we have 50 unique values. For 50000 groups, repetition of 50 values will be 1000 times in different
groups. To avoid this repetition of sensitive domain values in the groups, few values can be replaced with
synthetic values. The probability of disclosure of real sensitive value is increased if repetition of sensitive
values in groups takes place. Creating groups of counterfeit values for millions of records in very short time
 ISSN: 2088-8708
Int J Elec& Comp Eng, Vol. 8, No. 5, October 2018 : 2920 - 2925
2924
and finding repeated or similar values in groups in very short time will require executing this work in
distributed or parallel fashion.
5. RESULTS AND ANALYSIS
For processing the big streaming data, we have used task level parallelism and data level
parallelism. For the tasks like reading streamed data from source, preprocessing of streamed data and
counterfeit and loss management parallelism is applied. To achieve the result stream data is processed on
flink data processing engine [22]. It supports for processing of big datastreaming as well as batch data
processing. Flink data engine also support for complex event processing, machine learning and graph
analysis. Table 1 shows the similarity values of sensitive value of tuple of different groups obtained by
executing this data parallelly using different measures.
In Table 1 similarity between different group values calculated. When the tuple appear, it is released
using the counterfeit value addition in the group. Table 1 shows similarity values for dengue, leprosy, malaria
and diphtheria sensitive value with other counterfeit values. Similarity results are obtained using different
measures [23].
Table 1. Similarity Values for Different Groups Using Different Measures
Wu & Palmer, Path length, Jiang & Cornath, Conceptual distance and Lin measures are used to find
the similarity of values in the group [23]. Based on those measures, to avoid similarity attack similar value
can be replaced with other value in that group. Measures for four groups with real sensitive values dengue,
leprosy, malaria and diphtheria are shown in Table 1. Figure 5 shows graph comparison for similarity of the
real sensitive value with group value.
Figure 5. Graph based on similarity in different group using different measures
Int J Elec& Comp Eng ISSN: 2088-8708 
Framework to Avoid Similarity Attack in Big... (Ganesh Dagadu Puri)
2925
6. CONCLUSION
Privacy preservation framework to avoid similarity attack in electronic health streams is proposed.
To find different similarity of sensitive value with counterfeit value, similarity measures are used.
Replacement of similar counterfeit values is done by past data of tuples to increase data utility. For big
streaming data synthetic values are used for replacement of counterfeit values among groups. Anonymization
delay of framework is reduced using distributed execution.
REFERENCES
[1] Nugraha DC, Aknuranda I, “An Overview of e-Health in Indonesia: Past and Present Applications”, International
Journal of Electrical and Computer Engineering, 2017 Oct 1; 7(5): 2441.
[2] Abouelmehdi K, Beni-Hssane A, Khaloufi H, Saadi M, “Big data security and privacy in healthcare: A Review”,
Procedia Computer Science, 2017 Jan 1; 113: 73-80.
[3] Puri GD, Haritha D, “Survey big data analytics, applications and privacy concerns”, Indian Journal of Science and
Technology, 2016 May 18; 9(17).
[4] Fung B, Wang K, Chen R, Yu PS, “Privacy-preserving data publishing: A survey of recent developments”, ACM
Computing Surveys (CSUR), 2010 Jun 1; 42(4): 14.
[5] Gkoulalas-Divanis A, Loukides G, Sun J, “Publishing data from electronic health records while preserving privacy:
A survey of algorithms”, Journal of biomedical informatics, 2014 Aug 31; 50: 4-19.
[6] Jensen M, “Challenges of privacy protection in big data analytics”, InBig Data (BigData Congress), 2013 IEEE
International Congress, 2013 Jun 27 (pp. 235-238). IEEE.
[7] Sweeney L, “k-anonymity: A model for protecting privacy”, International Journal of Uncertainty, Fuzziness and
Knowledge-Based Systems, 2002 Oct; 10(05): 557-70.
[8] Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M, “l-diversity: Privacy beyond k-anonymity”, In
Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference, 2006 Apr 3 (pp. 24-24).
IEEE.
[9] Li N, Li T, Venkatasubramanian S, “t-closeness: Privacy beyond k-anonymity and l-diversity”, In Data
Engineering, 2007. ICDE 2007. IEEE 23rd International Conference, 2007 Apr 15 (pp. 106-115). IEEE.
[10] Kim S, Sung MK, Chung YD, “A framework to preserve the privacy of electronic health data streams”, Journal of
biomedical informatics, 2014 Aug 31; 50: 95-106.
[11] Puri GD, Haritha D, “A Framework to Preserve the Privacy of Electronic Health Dynamic Data Streams Using
Parallel Architecture”, International Journal of Control Theory and Applications, 2017: 10(6): 527-535.
[12] Khokhar RH, Chen R, Fung BC, Lui SM, “Quantifying the costs and benefits of privacy-preserving health data
publishing”, Journal of biomedical informatics, 2014 Aug 31; 50: 107-21.
[13] Lu R, Zhu H, Liu X, Liu JK, Shao J, “Toward efficient and privacy-preserving computing in big data era”, IEEE
Network. 2014 Jul; 28(4): 46-50.
[14] Xu L, Jiang C, Wang J, Yuan J, Ren Y, “Information security in big data: privacy and data mining”, IEEE Access,
2014; 2: 1149-76.
[15] Nardelli M, “A Framework for Data Stream Applications in a Distributed Cloud”, In ZEUS 2016 Jan 27 (pp. 56-
63).
[16] Choi JH, Park J, Park HD, Min OG., “DART: fast and efficient distributed stream processing framework for
internet of things”, ETRI Journal, 2017 Apr 1; 39(2): 202-12.
[17] Hwang JH, Balazinska M, Rasin A, “Cetintemel U, Stonebraker M, Zdonik S. High-availability algorithms for
distributed stream processing”, InData Engineering, 2005. ICDE 2005. Proceedings. 21st International
Conference, 2005 Apr 5 (pp. 779-790). IEEE.
[18] Kamburugamuve S, Fox G, Leake D, Qiu J, “Survey of distributed stream processing for large stream sources”,
Technical report, 2013 Dec 14.
[19] Erritali M, Beni-Hssane A, Birjali M, Madani Y, “An Approach of Semantic Similarity Measure between
Documents Based on Big Data”, International Journal of Electrical and Computer Engineering, 2016 Oct 1; 6(5):
2454.
[20] Nair LR, Shetty SD, Shetty SD, “Streaming Big Data Analysis for Real-Time Sentiment based Targeted
Advertising”, International Journal of Electrical and Computer Engineering (IJECE), 2017 Feb 1; 7(1): 402-7.
[21] Xiao X, Tao Y, “Anatomy: Simple and effective privacy preservation”, In Proceedings of the 32nd international
conference on Very large data bases, 2006 Sep 1 (pp. 139-150). VLDB Endowment.
[22] Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K. Apache flink: Stream and batch processing in
a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2015; 36(4).
[23] McInnes BT, Pedersen T, Pakhomov SV, “UMLS-Interface and UMLS-Similarity: open source software for
measuring paths and semantic similarity”, In AMIA Annual Symposium Proceedings 2009 (Vol. 2009, p. 431).
American Medical Informatics Association.

More Related Content

What's hot

Preprocessing and secure computations for privacy preservation data mining
Preprocessing and secure computations for privacy preservation data miningPreprocessing and secure computations for privacy preservation data mining
Preprocessing and secure computations for privacy preservation data mining
IAEME Publication
 
DLD_SYNOPSIS
DLD_SYNOPSISDLD_SYNOPSIS
DLD_SYNOPSIS
Nisha Jain
 

What's hot (19)

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
78201919
7820191978201919
78201919
 
Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
Kato Mivule - Utilizing Noise Addition for Data Privacy, an OverviewKato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
Kato Mivule - Utilizing Noise Addition for Data Privacy, an Overview
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
 
A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin...
A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin...A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin...
A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Usin...
 
Preprocessing and secure computations for privacy preservation data mining
Preprocessing and secure computations for privacy preservation data miningPreprocessing and secure computations for privacy preservation data mining
Preprocessing and secure computations for privacy preservation data mining
 
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data PrivacyA Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy
A Codon Frequency Obfuscation Heuristic for Raw Genomic Data Privacy
 
Towards A Differential Privacy and Utility Preserving Machine Learning Classi...
Towards A Differential Privacy and Utility Preserving Machine Learning Classi...Towards A Differential Privacy and Utility Preserving Machine Learning Classi...
Towards A Differential Privacy and Utility Preserving Machine Learning Classi...
 
An Investigation of Data Privacy and Utility Preservation Using KNN Classific...
An Investigation of Data Privacy and Utility Preservation Using KNN Classific...An Investigation of Data Privacy and Utility Preservation Using KNN Classific...
An Investigation of Data Privacy and Utility Preservation Using KNN Classific...
 
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATIONPRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
PRIVACY PRESERVING CLUSTERING IN DATA MINING USING VQ CODE BOOK GENERATION
 
Applying Data Privacy Techniques on Published Data in Uganda
 Applying Data Privacy Techniques on Published Data in Uganda Applying Data Privacy Techniques on Published Data in Uganda
Applying Data Privacy Techniques on Published Data in Uganda
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data mining
 
A Review Study on the Privacy Preserving Data Mining Techniques and Approaches
A Review Study on the Privacy Preserving Data Mining Techniques and ApproachesA Review Study on the Privacy Preserving Data Mining Techniques and Approaches
A Review Study on the Privacy Preserving Data Mining Techniques and Approaches
 
Implementation of Data Privacy and Security in an Online Student Health Recor...
Implementation of Data Privacy and Security in an Online Student Health Recor...Implementation of Data Privacy and Security in an Online Student Health Recor...
Implementation of Data Privacy and Security in an Online Student Health Recor...
 
A Survey on Features and Techniques Description for Privacy of Sensitive Info...
A Survey on Features and Techniques Description for Privacy of Sensitive Info...A Survey on Features and Techniques Description for Privacy of Sensitive Info...
A Survey on Features and Techniques Description for Privacy of Sensitive Info...
 
PRIVACY PRESERVING DATA MINING BASED ON VECTOR QUANTIZATION
PRIVACY PRESERVING DATA MINING BASED  ON VECTOR QUANTIZATION PRIVACY PRESERVING DATA MINING BASED  ON VECTOR QUANTIZATION
PRIVACY PRESERVING DATA MINING BASED ON VECTOR QUANTIZATION
 
DLD_SYNOPSIS
DLD_SYNOPSISDLD_SYNOPSIS
DLD_SYNOPSIS
 
AN EFFICIENT SOLUTION FOR PRIVACYPRESERVING, SECURE REMOTE ACCESS TO SENSITIV...
AN EFFICIENT SOLUTION FOR PRIVACYPRESERVING, SECURE REMOTE ACCESS TO SENSITIV...AN EFFICIENT SOLUTION FOR PRIVACYPRESERVING, SECURE REMOTE ACCESS TO SENSITIV...
AN EFFICIENT SOLUTION FOR PRIVACYPRESERVING, SECURE REMOTE ACCESS TO SENSITIV...
 
IEEE 2014 JAVA DATA MINING PROJECTS M privacy for collaborative data publishing
IEEE 2014 JAVA DATA MINING PROJECTS M privacy for collaborative data publishingIEEE 2014 JAVA DATA MINING PROJECTS M privacy for collaborative data publishing
IEEE 2014 JAVA DATA MINING PROJECTS M privacy for collaborative data publishing
 

Similar to Framework to Avoid Similarity Attack in Big Streaming Data

Big data challenges for e governance
Big data challenges for e governanceBig data challenges for e governance
Big data challenges for e governance
bharati k
 
Data attribute security and privacy in Collaborative distributed database Pub...
Data attribute security and privacy in Collaborative distributed database Pub...Data attribute security and privacy in Collaborative distributed database Pub...
Data attribute security and privacy in Collaborative distributed database Pub...
International Journal of Engineering Inventions www.ijeijournal.com
 
A Study on Big Data Privacy Protection Models using Data Masking Methods
A Study on Big Data Privacy Protection Models using Data Masking Methods A Study on Big Data Privacy Protection Models using Data Masking Methods
A Study on Big Data Privacy Protection Models using Data Masking Methods
IJECEIAES
 
Ijeee 7-11-privacy preserving distributed data mining with anonymous id assig...
Ijeee 7-11-privacy preserving distributed data mining with anonymous id assig...Ijeee 7-11-privacy preserving distributed data mining with anonymous id assig...
Ijeee 7-11-privacy preserving distributed data mining with anonymous id assig...
Kumar Goud
 

Similar to Framework to Avoid Similarity Attack in Big Streaming Data (20)

Survey paper on Big Data Imputation and Privacy Algorithms
Survey paper on Big Data Imputation and Privacy AlgorithmsSurvey paper on Big Data Imputation and Privacy Algorithms
Survey paper on Big Data Imputation and Privacy Algorithms
 
130509
130509130509
130509
 
130509
130509130509
130509
 
Performance analysis of perturbation-based privacy preserving techniques: an ...
Performance analysis of perturbation-based privacy preserving techniques: an ...Performance analysis of perturbation-based privacy preserving techniques: an ...
Performance analysis of perturbation-based privacy preserving techniques: an ...
 
Big data challenges for e governance
Big data challenges for e governanceBig data challenges for e governance
Big data challenges for e governance
 
Data attribute security and privacy in Collaborative distributed database Pub...
Data attribute security and privacy in Collaborative distributed database Pub...Data attribute security and privacy in Collaborative distributed database Pub...
Data attribute security and privacy in Collaborative distributed database Pub...
 
A Survey: Data Leakage Detection Techniques
A Survey: Data Leakage Detection Techniques A Survey: Data Leakage Detection Techniques
A Survey: Data Leakage Detection Techniques
 
Evaluation of a Multiple Regression Model for Noisy and Missing Data
Evaluation of a Multiple Regression Model for Noisy and Missing Data Evaluation of a Multiple Regression Model for Noisy and Missing Data
Evaluation of a Multiple Regression Model for Noisy and Missing Data
 
IRJET - Random Data Perturbation Techniques in Privacy Preserving Data Mi...
IRJET -  	  Random Data Perturbation Techniques in Privacy Preserving Data Mi...IRJET -  	  Random Data Perturbation Techniques in Privacy Preserving Data Mi...
IRJET - Random Data Perturbation Techniques in Privacy Preserving Data Mi...
 
A Study on Big Data Privacy Protection Models using Data Masking Methods
A Study on Big Data Privacy Protection Models using Data Masking Methods A Study on Big Data Privacy Protection Models using Data Masking Methods
A Study on Big Data Privacy Protection Models using Data Masking Methods
 
[IJET-V1I3P10] Authors : Kalaignanam.K, Aishwarya.M, Vasantharaj.K, Kumaresan...
[IJET-V1I3P10] Authors : Kalaignanam.K, Aishwarya.M, Vasantharaj.K, Kumaresan...[IJET-V1I3P10] Authors : Kalaignanam.K, Aishwarya.M, Vasantharaj.K, Kumaresan...
[IJET-V1I3P10] Authors : Kalaignanam.K, Aishwarya.M, Vasantharaj.K, Kumaresan...
 
Content an Insight to Security Paradigm for BigData on Cloud: Current Trend a...
Content an Insight to Security Paradigm for BigData on Cloud: Current Trend a...Content an Insight to Security Paradigm for BigData on Cloud: Current Trend a...
Content an Insight to Security Paradigm for BigData on Cloud: Current Trend a...
 
IRJET- Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...
IRJET-  	  Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...IRJET-  	  Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...
IRJET- Study Paper on: Ontology-based Privacy Data Chain Disclosure Disco...
 
2014 IEEE JAVA DATA MINING PROJECT M privacy for collaborative data publishing
2014 IEEE JAVA DATA MINING PROJECT M privacy for collaborative data publishing2014 IEEE JAVA DATA MINING PROJECT M privacy for collaborative data publishing
2014 IEEE JAVA DATA MINING PROJECT M privacy for collaborative data publishing
 
Ib3514141422
Ib3514141422Ib3514141422
Ib3514141422
 
IRJET- Improving the Performance of Smart Heterogeneous Big Data
IRJET- Improving the Performance of Smart Heterogeneous Big DataIRJET- Improving the Performance of Smart Heterogeneous Big Data
IRJET- Improving the Performance of Smart Heterogeneous Big Data
 
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
 
Characterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesCharacterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining Techniques
 
Ijeee 7-11-privacy preserving distributed data mining with anonymous id assig...
Ijeee 7-11-privacy preserving distributed data mining with anonymous id assig...Ijeee 7-11-privacy preserving distributed data mining with anonymous id assig...
Ijeee 7-11-privacy preserving distributed data mining with anonymous id assig...
 
IRJET- A Study of Privacy Preserving Data Mining and Techniques
IRJET- A Study of Privacy Preserving Data Mining and TechniquesIRJET- A Study of Privacy Preserving Data Mining and Techniques
IRJET- A Study of Privacy Preserving Data Mining and Techniques
 

More from IJECEIAES

Remote field-programmable gate array laboratory for signal acquisition and de...
Remote field-programmable gate array laboratory for signal acquisition and de...Remote field-programmable gate array laboratory for signal acquisition and de...
Remote field-programmable gate array laboratory for signal acquisition and de...
IJECEIAES
 
An efficient security framework for intrusion detection and prevention in int...
An efficient security framework for intrusion detection and prevention in int...An efficient security framework for intrusion detection and prevention in int...
An efficient security framework for intrusion detection and prevention in int...
IJECEIAES
 
A review on internet of things-based stingless bee's honey production with im...
A review on internet of things-based stingless bee's honey production with im...A review on internet of things-based stingless bee's honey production with im...
A review on internet of things-based stingless bee's honey production with im...
IJECEIAES
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksSeizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networks
IJECEIAES
 
Hyperspectral object classification using hybrid spectral-spatial fusion and ...
Hyperspectral object classification using hybrid spectral-spatial fusion and ...Hyperspectral object classification using hybrid spectral-spatial fusion and ...
Hyperspectral object classification using hybrid spectral-spatial fusion and ...
IJECEIAES
 
SADCNN-ORBM: a hybrid deep learning model based citrus disease detection and ...
SADCNN-ORBM: a hybrid deep learning model based citrus disease detection and ...SADCNN-ORBM: a hybrid deep learning model based citrus disease detection and ...
SADCNN-ORBM: a hybrid deep learning model based citrus disease detection and ...
IJECEIAES
 
Performance enhancement of machine learning algorithm for breast cancer diagn...
Performance enhancement of machine learning algorithm for breast cancer diagn...Performance enhancement of machine learning algorithm for breast cancer diagn...
Performance enhancement of machine learning algorithm for breast cancer diagn...
IJECEIAES
 
A deep learning framework for accurate diagnosis of colorectal cancer using h...
A deep learning framework for accurate diagnosis of colorectal cancer using h...A deep learning framework for accurate diagnosis of colorectal cancer using h...
A deep learning framework for accurate diagnosis of colorectal cancer using h...
IJECEIAES
 
Predicting churn with filter-based techniques and deep learning
Predicting churn with filter-based techniques and deep learningPredicting churn with filter-based techniques and deep learning
Predicting churn with filter-based techniques and deep learning
IJECEIAES
 
Automatic customer review summarization using deep learningbased hybrid senti...
Automatic customer review summarization using deep learningbased hybrid senti...Automatic customer review summarization using deep learningbased hybrid senti...
Automatic customer review summarization using deep learningbased hybrid senti...
IJECEIAES
 

More from IJECEIAES (20)

Remote field-programmable gate array laboratory for signal acquisition and de...
Remote field-programmable gate array laboratory for signal acquisition and de...Remote field-programmable gate array laboratory for signal acquisition and de...
Remote field-programmable gate array laboratory for signal acquisition and de...
 
Detecting and resolving feature envy through automated machine learning and m...
Detecting and resolving feature envy through automated machine learning and m...Detecting and resolving feature envy through automated machine learning and m...
Detecting and resolving feature envy through automated machine learning and m...
 
Smart monitoring technique for solar cell systems using internet of things ba...
Smart monitoring technique for solar cell systems using internet of things ba...Smart monitoring technique for solar cell systems using internet of things ba...
Smart monitoring technique for solar cell systems using internet of things ba...
 
An efficient security framework for intrusion detection and prevention in int...
An efficient security framework for intrusion detection and prevention in int...An efficient security framework for intrusion detection and prevention in int...
An efficient security framework for intrusion detection and prevention in int...
 
Developing a smart system for infant incubators using the internet of things ...
Developing a smart system for infant incubators using the internet of things ...Developing a smart system for infant incubators using the internet of things ...
Developing a smart system for infant incubators using the internet of things ...
 
A review on internet of things-based stingless bee's honey production with im...
A review on internet of things-based stingless bee's honey production with im...A review on internet of things-based stingless bee's honey production with im...
A review on internet of things-based stingless bee's honey production with im...
 
A trust based secure access control using authentication mechanism for intero...
A trust based secure access control using authentication mechanism for intero...A trust based secure access control using authentication mechanism for intero...
A trust based secure access control using authentication mechanism for intero...
 
Fuzzy linear programming with the intuitionistic polygonal fuzzy numbers
Fuzzy linear programming with the intuitionistic polygonal fuzzy numbersFuzzy linear programming with the intuitionistic polygonal fuzzy numbers
Fuzzy linear programming with the intuitionistic polygonal fuzzy numbers
 
The performance of artificial intelligence in prostate magnetic resonance im...
The performance of artificial intelligence in prostate  magnetic resonance im...The performance of artificial intelligence in prostate  magnetic resonance im...
The performance of artificial intelligence in prostate magnetic resonance im...
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksSeizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networks
 
Analysis of driving style using self-organizing maps to analyze driver behavior
Analysis of driving style using self-organizing maps to analyze driver behaviorAnalysis of driving style using self-organizing maps to analyze driver behavior
Analysis of driving style using self-organizing maps to analyze driver behavior
 
Hyperspectral object classification using hybrid spectral-spatial fusion and ...
Hyperspectral object classification using hybrid spectral-spatial fusion and ...Hyperspectral object classification using hybrid spectral-spatial fusion and ...
Hyperspectral object classification using hybrid spectral-spatial fusion and ...
 
Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...
 
SADCNN-ORBM: a hybrid deep learning model based citrus disease detection and ...
SADCNN-ORBM: a hybrid deep learning model based citrus disease detection and ...SADCNN-ORBM: a hybrid deep learning model based citrus disease detection and ...
SADCNN-ORBM: a hybrid deep learning model based citrus disease detection and ...
 
Performance enhancement of machine learning algorithm for breast cancer diagn...
Performance enhancement of machine learning algorithm for breast cancer diagn...Performance enhancement of machine learning algorithm for breast cancer diagn...
Performance enhancement of machine learning algorithm for breast cancer diagn...
 
A deep learning framework for accurate diagnosis of colorectal cancer using h...
A deep learning framework for accurate diagnosis of colorectal cancer using h...A deep learning framework for accurate diagnosis of colorectal cancer using h...
A deep learning framework for accurate diagnosis of colorectal cancer using h...
 
Swarm flip-crossover algorithm: a new swarm-based metaheuristic enriched with...
Swarm flip-crossover algorithm: a new swarm-based metaheuristic enriched with...Swarm flip-crossover algorithm: a new swarm-based metaheuristic enriched with...
Swarm flip-crossover algorithm: a new swarm-based metaheuristic enriched with...
 
Predicting churn with filter-based techniques and deep learning
Predicting churn with filter-based techniques and deep learningPredicting churn with filter-based techniques and deep learning
Predicting churn with filter-based techniques and deep learning
 
Taxi-out time prediction at Mohammed V Casablanca Airport
Taxi-out time prediction at Mohammed V Casablanca AirportTaxi-out time prediction at Mohammed V Casablanca Airport
Taxi-out time prediction at Mohammed V Casablanca Airport
 
Automatic customer review summarization using deep learningbased hybrid senti...
Automatic customer review summarization using deep learningbased hybrid senti...Automatic customer review summarization using deep learningbased hybrid senti...
Automatic customer review summarization using deep learningbased hybrid senti...
 

Recently uploaded

1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 

Recently uploaded (20)

457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
457503602-5-Gas-Well-Testing-and-Analysis-pptx.pptx
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
Bhubaneswar🌹Call Girls Bhubaneswar ❀Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❀Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❀Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❀Komal 9777949614 💟 Full Trusted CALL GIRL...
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
Jaipur ❀CALL GIRL 0000000000❀CALL GIRLS IN Jaipur ESCORT SERVICE❀CALL GIRL IN...
Jaipur ❀CALL GIRL 0000000000❀CALL GIRLS IN Jaipur ESCORT SERVICE❀CALL GIRL IN...Jaipur ❀CALL GIRL 0000000000❀CALL GIRLS IN Jaipur ESCORT SERVICE❀CALL GIRL IN...
Jaipur ❀CALL GIRL 0000000000❀CALL GIRLS IN Jaipur ESCORT SERVICE❀CALL GIRL IN...
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Introduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfIntroduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdf
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 

Framework to Avoid Similarity Attack in Big Streaming Data

  • 1. International Journal of Electrical and Computer Engineering (IJECE) Vol. 8, No. 5, October 2018, pp. 2920~2925 ISSN: 2088-8708, DOI: 10.11591/ijece.v8i5.pp2920-2925  2920 Journal homepage: http://iaescore.com/journals/index.php/IJECE Framework to Avoid Similarity Attack in Big Streaming Data Ganesh Dagadu Puri, D. Haritha CSE, Koneru Lakshmaiah Education Foundation, India Article Info ABSTRACT Article history: Received Dec 13, 2017 Revised Feb 24, 2018 Accepted Aug 12, 2018 The existing methods for privacy preservation are available in variety of fields like social media, stock market, sentiment analysis, electronic health applications. The electronic health dynamic stream data is available in large quantity. Such large volume stream data is processed using delay free anonymization framework. Scalable privacy preserving techniques are required to satisfy the needs of processing large dynamic stream data. In this paper privacy preserving technique which can avoid similarity attack in big streaming data is proposed in distributed environment. It can process the data in parallel to reduce the anonymization delay. In this paper the replacement technique is used for avoiding similarity attack. Late validation technique is used to reduce information loss. The application of this method is in medical diagnosis, e-health applications, health data processing at third party. Keyword: Big data Distributed Privacy Similarity Copyright © 2018 Institute of Advanced Engineering and Science. All rights reserved. Corresponding Author: Ganesh Dagadu Puri, CSE KLEF, Vaddeswaram, Guntur, Andhra Pradesh, 522502 - India. Email: puriganeshengg@gmail.com 1. INTRODUCTION Nowadays electronic health information and electronic health applications are available in large quantity [1]. The users of health information like health care providers, researchers, analysts use this data for making inferences [2]. Since health records contain the private data of patient, the access is restricted. To make this access easy and possible, privacy preservation techniques are useful. Electronic health records are useful for the communication and keeping the information of patient intact. The demand of such big amount of electronic health data has increased concern of privacy for the patients [3]. For providing privacy to electronic health data de-identification techniques are used. These techniques provide privacy by removing direct identifiers which can expose identity of individual or disclose sensitive information of individual. It provides privacy by suppression, generalization or replacement of the identifiers [4-5]. Various laws in different countries are available for providing privacy to electronic health data [2]. In USA Health Insurance Portability and Accountability Act (HIPAA), Patient Safety and Quality Improvement Act (PSQIA), HITECH Act protects privacy of electronic health data. Data Protection Act (DPA) in UK provides options to individuals for protecting information. Russian Federal Law on Personal Data in Russia makes it necessary to take all permissions for organizations before handing over the health data to other. Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada give citizens right to know the reasons behind the collection of private data [6]. IT Act and IT (Amendment) Act in India suggests strict actions like imprisonment or fine for misusing personal information. Data Protection Directive in European Union helps to keep fundamental rights of people with respect to accessing of personal data. In the anonymization of electronic health data de-identification methods are used. These methods are further divided into K-anonymity, L-diversity, T- closeness [7-9]. In the L-diversity, there is possibility of similarity attack. In Figure 1 architecture of delay free anonymization for privacy preservation is shown [10]. Input data is coming from source in terms of tuples. This tuple is divided in two parts.
  • 2. Int J Elec& Comp Eng ISSN: 2088-8708  Framework to Avoid Similarity Attack in Big... (Ganesh Dagadu Puri) 2921 Figure 1. Delay free anonymization: Architectural diagram First part contains quasi identifier and group number. Second part contains sensitive tuple along with its L-1 counterfeit values, count of each sensitive value and group number. Adding L-1 counterfeit values with real value will make difficult to disclose the sensitive value. These counterfeit values will be validated with the upcoming input tuples. Groupwise count of released tuples will be maintained. The similarity of counterfeit values will affect privacy. It can be avoided by replacing similar values in the group. 2. RELATED WORK There is need of establishing guidelines for privacy against invasive marketing and inadvertent privacy disclosure [11]. Privacy requirements in data sharing for big data operators need scalable privacy preserving algorithms to provide privacy to the datasets. Health information providers can benefit from cost- profit model to take decision about sharing the health related data to other parties [12]. Privacy requirements are important in big data collection, storage and intra and inter-organization processing. To make the computing of big data in privacy preserved way Privacy preserving aggregation, encrypted data operations and de-identification techniques are suggested [13]. In data privacy, it is required to understand privacy requirements in data provider, data collector, data miner, decision maker stages [14]. Need of keeping source or origin of data is important to identify privacy attack. In [10] delay free anonymization technique is used for to reduce delay and increase data utility by late validation. Distributed stream processing is done with extending storm capabilities for task management, scheduling, and executing in distributed manner [15]. DART system propose framework for different devices present on remote sites in distributed environment. This framework provides facility of registration and authorization of devices on remote site, task allocation and management of user application. In the system computation load is reduced by utilizing idle resources [16]. The distributed stream processing systems possess different availability requirement for different applications. When one of the nodes in distributed environment gets failed, the backup or secondary server resumes the execution. While doing this, the state should be maintained. The type of recovery technique and performance is based on stream processing application [17]. The new stream processing systems exploit the tasks instead of nodes for fault tolerance [18]. 3. RESEARCH METHOD 3.1. The Need and Importance of the Problem Electronic health data is produced in large quantity. In anonymization of this data minimum execution time and less information loss is important. Anonymization delay is minimized using delay free framework. To avoid similarity attack on l-diverse counterfeit group, replacement of similar value is required. Due to large amount of tuples of electronic health data, there is possibility of formation of similar groups and it can disclose the sensitive value. Repetition of such values in group is avoided using the synthetic value formation. The complexity of big electronic health data creates challenge for existing privacy preserving algorithm which cannot work on large datasets. In Figure 2 to avoid similarity attack, similarity index of each group is calculated [19]. If some values are similar then such values will be replaced with other values. For this replacement help of past data is taken. With the policy of past reflect future, for early late validation of counterfeit values in the group the values from the past data are selected. Information loss and utility of the replaced data is calculated. It will be
  • 3.  ISSN: 2088-8708 Int J Elec& Comp Eng, Vol. 8, No. 5, October 2018 : 2920 - 2925 2922 note down in statistic data to see if that replaced value in counterfeit group caused more or less information loss. Figure 2. Work flow model to avoid similarity attack 3.2. Algorithm In the Figure 3 algorithm for the proposed method using big data as input is given. For each tuple set of streaming big data input [20], the source is maintained. It is useful to find source of data in case of adversary attack. Figure 3. Algorithm for proposed method The incoming stream data may not be in suitable format. Preprocessing is used to convert the incoming data in suitable format. The steps used for the preprocessing are as follows. a. Read the url or address of streaming data source. b. Load the raw data in dataset file. c. Read the first line of attributes in the file and split it as per the delimiter.
  • 4. Int J Elec& Comp Eng ISSN: 2088-8708  Framework to Avoid Similarity Attack in Big... (Ganesh Dagadu Puri) 2923 d. Convert the split data of first line into columns. e. Read the files data in buffer line by line up to end of file convert it into tuple. f. Split the data stream using delimiter and insert in the columns. g. Identify the quasi and sensitive identifiers in data table. After preprocessing the data is available in proper format. The Anatomy [21] technique divides input tuples into two parts. The counterfeit values will be added to form the groups. If the counterfeit values are repeated more no of times in the groups, synthetic values can be used to replace these repeated values otherwise past data is sufficient to form group of counterfeit values. For each individual group of counterfeit values similarity index of group will be calculated. If the values are similar then these values will be replaced with other values from the past data. Late validation is done by maintaining the group count and the released tuples in the group. Statistic data of information loss and utility measures is maintained. If information loss ratio is more than threshold value then the process is repeated by changing the values in the group. 4. DISTRIBUTED EXECUTION FLOW In case of analysis if the organization does not have enough processing capability and infrastructure to process large amount of data, such stream data will be given to third party. In such situation existing methods are inappropriate to provide enough privacy. Figure 4 show the distributed execution flow of big streaming data. In delay free anonymization method L-diverse counterfeit values will be generated when new tuple arrives. It generates these values from past data (domain of sensitive values). For big data, millions of tuples are arriving in one session and randomly counterfeit values get generated [20]. There is probability of similar values getting generated in a group. This may cause similarity attack on the patient data. Figure 4. Distributed Execution flow for big streaming data to avoid similarity attack To avoid this situation when the similar values get generated in the group, we can replace these similar values with other sensitive values so that similarity attack can be avoided. At the same time repeated values among the groups are found and such values are replaced with synthetic values. Vertical dotted lines in Figure 4 show the execution on different nodes in distributed fashion. While tuples are anonymized and published on first node, second node will be used for the group data formation and replacement of similar values. Third node will keep statistic data based information loss due to replaced or synthetic values in group. Domain of sensitive values contains limited values and these values are getting repeated. For example for N records N/L groups of counterfeit values will be generated. For 500 records 50 groups with L=10 will be generated. But as the big data is the input for example there are 500000 records and L=10. It will generate 50000 groups. In each group the counterfeit sensitive values will get repeated. In sensitive domain if we have 50 unique values. For 50000 groups, repetition of 50 values will be 1000 times in different groups. To avoid this repetition of sensitive domain values in the groups, few values can be replaced with synthetic values. The probability of disclosure of real sensitive value is increased if repetition of sensitive values in groups takes place. Creating groups of counterfeit values for millions of records in very short time
  • 5.  ISSN: 2088-8708 Int J Elec& Comp Eng, Vol. 8, No. 5, October 2018 : 2920 - 2925 2924 and finding repeated or similar values in groups in very short time will require executing this work in distributed or parallel fashion. 5. RESULTS AND ANALYSIS For processing the big streaming data, we have used task level parallelism and data level parallelism. For the tasks like reading streamed data from source, preprocessing of streamed data and counterfeit and loss management parallelism is applied. To achieve the result stream data is processed on flink data processing engine [22]. It supports for processing of big datastreaming as well as batch data processing. Flink data engine also support for complex event processing, machine learning and graph analysis. Table 1 shows the similarity values of sensitive value of tuple of different groups obtained by executing this data parallelly using different measures. In Table 1 similarity between different group values calculated. When the tuple appear, it is released using the counterfeit value addition in the group. Table 1 shows similarity values for dengue, leprosy, malaria and diphtheria sensitive value with other counterfeit values. Similarity results are obtained using different measures [23]. Table 1. Similarity Values for Different Groups Using Different Measures Wu & Palmer, Path length, Jiang & Cornath, Conceptual distance and Lin measures are used to find the similarity of values in the group [23]. Based on those measures, to avoid similarity attack similar value can be replaced with other value in that group. Measures for four groups with real sensitive values dengue, leprosy, malaria and diphtheria are shown in Table 1. Figure 5 shows graph comparison for similarity of the real sensitive value with group value. Figure 5. Graph based on similarity in different group using different measures
  • 6. Int J Elec& Comp Eng ISSN: 2088-8708  Framework to Avoid Similarity Attack in Big... (Ganesh Dagadu Puri) 2925 6. CONCLUSION Privacy preservation framework to avoid similarity attack in electronic health streams is proposed. To find different similarity of sensitive value with counterfeit value, similarity measures are used. Replacement of similar counterfeit values is done by past data of tuples to increase data utility. For big streaming data synthetic values are used for replacement of counterfeit values among groups. Anonymization delay of framework is reduced using distributed execution. REFERENCES [1] Nugraha DC, Aknuranda I, “An Overview of e-Health in Indonesia: Past and Present Applications”, International Journal of Electrical and Computer Engineering, 2017 Oct 1; 7(5): 2441. [2] Abouelmehdi K, Beni-Hssane A, Khaloufi H, Saadi M, “Big data security and privacy in healthcare: A Review”, Procedia Computer Science, 2017 Jan 1; 113: 73-80. [3] Puri GD, Haritha D, “Survey big data analytics, applications and privacy concerns”, Indian Journal of Science and Technology, 2016 May 18; 9(17). [4] Fung B, Wang K, Chen R, Yu PS, “Privacy-preserving data publishing: A survey of recent developments”, ACM Computing Surveys (CSUR), 2010 Jun 1; 42(4): 14. [5] Gkoulalas-Divanis A, Loukides G, Sun J, “Publishing data from electronic health records while preserving privacy: A survey of algorithms”, Journal of biomedical informatics, 2014 Aug 31; 50: 4-19. [6] Jensen M, “Challenges of privacy protection in big data analytics”, InBig Data (BigData Congress), 2013 IEEE International Congress, 2013 Jun 27 (pp. 235-238). IEEE. [7] Sweeney L, “k-anonymity: A model for protecting privacy”, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2002 Oct; 10(05): 557-70. [8] Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M, “l-diversity: Privacy beyond k-anonymity”, In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference, 2006 Apr 3 (pp. 24-24). IEEE. [9] Li N, Li T, Venkatasubramanian S, “t-closeness: Privacy beyond k-anonymity and l-diversity”, In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference, 2007 Apr 15 (pp. 106-115). IEEE. [10] Kim S, Sung MK, Chung YD, “A framework to preserve the privacy of electronic health data streams”, Journal of biomedical informatics, 2014 Aug 31; 50: 95-106. [11] Puri GD, Haritha D, “A Framework to Preserve the Privacy of Electronic Health Dynamic Data Streams Using Parallel Architecture”, International Journal of Control Theory and Applications, 2017: 10(6): 527-535. [12] Khokhar RH, Chen R, Fung BC, Lui SM, “Quantifying the costs and benefits of privacy-preserving health data publishing”, Journal of biomedical informatics, 2014 Aug 31; 50: 107-21. [13] Lu R, Zhu H, Liu X, Liu JK, Shao J, “Toward efficient and privacy-preserving computing in big data era”, IEEE Network. 2014 Jul; 28(4): 46-50. [14] Xu L, Jiang C, Wang J, Yuan J, Ren Y, “Information security in big data: privacy and data mining”, IEEE Access, 2014; 2: 1149-76. [15] Nardelli M, “A Framework for Data Stream Applications in a Distributed Cloud”, In ZEUS 2016 Jan 27 (pp. 56- 63). [16] Choi JH, Park J, Park HD, Min OG., “DART: fast and efficient distributed stream processing framework for internet of things”, ETRI Journal, 2017 Apr 1; 39(2): 202-12. [17] Hwang JH, Balazinska M, Rasin A, “Cetintemel U, Stonebraker M, Zdonik S. High-availability algorithms for distributed stream processing”, InData Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference, 2005 Apr 5 (pp. 779-790). IEEE. [18] Kamburugamuve S, Fox G, Leake D, Qiu J, “Survey of distributed stream processing for large stream sources”, Technical report, 2013 Dec 14. [19] Erritali M, Beni-Hssane A, Birjali M, Madani Y, “An Approach of Semantic Similarity Measure between Documents Based on Big Data”, International Journal of Electrical and Computer Engineering, 2016 Oct 1; 6(5): 2454. [20] Nair LR, Shetty SD, Shetty SD, “Streaming Big Data Analysis for Real-Time Sentiment based Targeted Advertising”, International Journal of Electrical and Computer Engineering (IJECE), 2017 Feb 1; 7(1): 402-7. [21] Xiao X, Tao Y, “Anatomy: Simple and effective privacy preservation”, In Proceedings of the 32nd international conference on Very large data bases, 2006 Sep 1 (pp. 139-150). VLDB Endowment. [22] Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K. Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 2015; 36(4). [23] McInnes BT, Pedersen T, Pakhomov SV, “UMLS-Interface and UMLS-Similarity: open source software for measuring paths and semantic similarity”, In AMIA Annual Symposium Proceedings 2009 (Vol. 2009, p. 431). American Medical Informatics Association.