Safe Machine Learning for Patient Data Privacy
Ashwini Chaudhari, Hemangi Shinge, Mansi Chowkkar, Seemanthini Narasimha Moorthy
x18129676, x18130429, x18134599, x18141447
MSc Data Analytics
Advanced Data Mining
National College of Ireland
Abstract— The evolution of data mining processes has
contributed substantially to diverse fields by transforming
data into useful information for predicting future trends.
This has drawn the interest of many researchers in
developing machine learning models for disease prediction.
Massive growth in healthcare data has led to the migration
of data from traditional storage systems to cloud-based
systems, which increases privacy concerns. In this project, a
privacy-preserving system is proposed for protecting sensitive
healthcare data. The system first encrypts the plaintext which
is fed to the data mining model so that during the prediction
process sensitive data is always maintained in encrypted form.
The output of the prediction system is then provided to
the medical practitioner who already has the private key
for decryption which ensures data privacy. For the purpose
of encrypting patient data, Paillier Homomorphic Encryption
is used and various data mining models have been tested
with encrypted data. It was observed that for the Breast
Cancer Diagnostic Dataset from UCI Data Repository, Logistic
Regression outperformed all other implemented models without
compromising data privacy.
Keywords - Homomorphic Encryption, Data Mining,
Logistic Regression, Privacy-preserving, Disease prediction
I. INTRODUCTION
Medical prediction is ubiquitous, with applications ranging
from general healthcare to disease diagnosis. Disease
predictors support pre-diagnosis by helping clinicians reach a
clinical judgement based on the medical data of the user. Many
medical applications deal with sensitive information, hence
it is necessary to ensure privacy [1].
As the healthcare industry provides health services to
patients on a global scale and faces a sharp increase in
electronic data, information and data security is a major
concern [2]. To address these privacy issues, we propose a
privacy-preserving system using Homomorphic
Cryptosystems.
Safety and reliability are key factors to be
considered for upcoming learning algorithms in areas such
as healthcare, defence, and finance, which contain sensitive data.
Especially in healthcare, patient data is considered highly
sensitive, and the patient would not want it to be revealed to
any entities other than the concerned persons.
A supervised learning algorithm has two stages:
1. The training phase, in which the algorithm learns a model w
from a set of labeled examples.
2. The classification phase, which runs a classifier C over a
previously unseen feature vector x, using the model w to
output the prediction C(x, w).
In applications that handle sensitive data, it is crucial that
both the feature vector x and the model w remain private.
Throughout this document, we refer to this objective as
privacy-preserving classification. In practical terms, a
client holds a private input represented as a feature vector
x, and the server holds a private input consisting of a model
w. The way the model w is obtained is independent of our
protocols [3].
Open source datasets could give machine learning an
advantage, resulting in good models being produced, but
patient privacy could be compromised, since malicious
users may be able to trace patient data back to specific
patients.
In this project we use Paillier Homomorphic
Encryption (HE) to encrypt breast cancer data. Because
Homomorphic Encryption is applied over already encrypted
data, Format Preserving Encryption (FPE) is used for the
initial encryption, and HE is then applied over this encrypted
data. The public key is used for encryption and
the private key is used by the medical practitioner for
decryption, which ensures the privacy of the
sensitive data. The encrypted data is then fed to the following
widely known machine learning and deep learning
classifiers to build effective privacy-preserving
protocols: KNN, Logistic Regression, SVC,
LBFGS, Naive Bayes, Decision Tree, Bagged Decision Tree,
Random Forest, Extra Trees, AdaBoost, Stochastic Gradient
Boosting, Stochastic Voting Ensemble. The classification
models are evaluated on the basis of cross-validation
accuracy and the F1 score.
The objective of this project is to identify the
best-fit classifier model for the UCI Breast Cancer Wisconsin
(Diagnostic) Dataset. The structure of the paper is as
follows: Section III discusses the related research. The
methodology followed for implementation of this project is
explained in detail in Section IV. The evaluation and results
are discussed in Section V.
II. RESEARCH QUESTION
“How efficiently and accurately can disease prediction
models be implemented on encrypted data to secure patient
privacy using data mining techniques?”
III. LITERATURE REVIEW
In the current technological era, traditional paper-based
medical records, prescriptions, and patient data
have given way to electronic health records (EHRs).
Maintaining such a large amount of healthcare data requires
privacy-preserving techniques to keep the data
confidential. Identity spoofing attacks and information
disclosure are the main threats in healthcare. Cryptographic
approaches used to protect e-health data include
symmetric key encryption (SKE) and public key
encryption (PKE). The cryptographic techniques
mentioned by [4] are:
1. Fully Homomorphic Encryption (FHE)
2. Somewhat Homomorphic Encryption (SHE)
3. Searchable Encryption
4. Predicate/Hierarchical Encryption
Homomorphic encryption (HE) allows computation
on ciphertexts and produces encrypted
results. FHE performs an arbitrary number of additions and
multiplications on encrypted data without decryption. SHE
performs a limited number of addition and multiplication
operations on encrypted data without decryption. The study [4]
explains privacy-preserving approaches for health data.
Searchable encryption techniques based on PKE are less
efficient, hence new techniques need to be explored in the
privacy-preserving area.
Running machine learning algorithms on encrypted data is
one of the important applications in the security and
privacy-preserving area. The work in [5] describes a logistic
regression model trained on encrypted data using FHE.
A bootstrapping operation allows FHE to
run an arbitrary number of steps. In critical applications, HE
provides the highest level of data privacy at the cost of
increased computation time. With the advancement of machine
learning, a new disease prediction support system (DPSS)
has been proposed by [6]. In a DPSS, traditional
cryptographic techniques
can protect healthcare data, but they are unsuitable for
modern applications. Hence in that study, a privacy-preserving
prediction system is developed using Paillier HE. Paillier
HE encrypts the sensitive patient data, and the efficiency of the
machine learning model (FKNN-CBR) is evaluated.
In the cryptography study by [7], an implementation of
cryptographic primitives is proposed to protect sensitive data
before it is exposed to any classifier model. To
evaluate the output of a classifier model for accuracy or
sensitivity, data needs to be provided to the model. To reduce
the risk of exposing such sensitive information,
data protection is an important part of machine learning. In
[7], fully homomorphic encryption (FHE) is
implemented with a privacy-preserving Naive Bayes classifier.
The problem faced in this study is the performance of the
classifier, since FHE is a slow algorithm. The
authors implemented performance-improving techniques
for privacy-preserving classification using FHE with the HElib
library. The research in [8] proposed a privacy-preserving
disease prediction (PPDP) model based on a single-layer
perceptron (SLP). To implement the SLP, they used HE
techniques in which a modulus and a prime-generating algorithm
are used. HE allows the user to communicate with untrusted
parties using encrypted data. Because of its low efficiency, it
is not yet widely deployed in real-world applications.
In [9], new practical algorithms are implemented
using FHE. These algorithms provide a symmetric approach,
efficiency, and robustness against plaintext and ciphertext
attacks. In [10], the FHE method is used on three types of
data to perform statistical analysis on encrypted data.
Here, an encryptor encrypts the data, which is then
sent to the cloud, where homomorphic operations are performed
on the ciphertext. A decryptor receives the processed data and
decrypts it.
To reduce time and cost, SHE is now being implemented by
many researchers. In [11], SHE is implemented for the private
equality test (PET) of integers and the private equality batch
processing test (PriBET). SHE is faster than FHE and
supports many addition operations but fewer multiplication
operations. In [12], a privacy-preserving support system
is proposed to protect data in the healthcare sector. One
of the techniques suggested there is HE based on Paillier
cryptography. This algorithm is a public key additive
encryption scheme. The advantage of the Paillier scheme
is that linear operations can be performed on encrypted data
at the server side using the Paillier properties, while the
clinician keeps the private key secret. Hence
only clinicians can decrypt the data using the private key.
Here a Gaussian kernel based SVM is applied to encrypted
data. Integer and continuous variables are encrypted
using the Paillier encryption method. The results show that
Paillier encryption does not degrade accuracy. In the
study [2], encryption is applied to design
a privacy-preserving system. In this system encrypted data
is used to train a Naive Bayes classifier without leaking
patient data. The trained classifier can then be applied
to other patient data to predict risk factors of the disease.
To achieve privacy preservation, Paillier HE is used. The
key generation process generates the public key pk and the
private key sk. Given numbers are then encrypted under the
public key using a random number together with modular
addition and multiplication operations. The ciphertext can be
recovered into plaintext using the private key.
In [13], the data mining and machine learning methods
applied in Diabetes mellitus (DM) research were identified
and reviewed in a
systematic manner. Diabetes mellitus is characterized as a
cluster of metabolic disorders affecting human health all
across the globe. A broad variety of machine learning
algorithms were used. Overall, 85% of these were
supervised methods and 15% unsupervised methods, more
precisely association rules. Nephropathy,
diabetic foot, Alzheimer's disease, liver cancer, heart disease,
hypoglycaemic events, and depression are the diabetes-related
medical issues covered in this study. The advance of
biotechnology, with the huge volume of data generated,
together with the growing number of EHRs, is expected to give
rise to more in-depth exploration of DM diagnosis,
etiopathophysiology and therapy through data mining and ML
strategies applied to datasets that provide biological and
clinical data [13].
The study in [14] focuses on chronic kidney disease (CKD),
which is regarded as kidney injury that persists for more
than 3 months. ML classification algorithms were used to
distinguish the patients' CKD
and non-CKD status.
All these models were built on the CKD dataset obtained
from the UCI library, which contains 25 attributes and 400
records. 14 CKD-related characteristics
were evaluated with several machine learning
methods: Multiclass Decision Jungle,
Multiclass Decision Forest, Multiclass Neural Network, and
Multiclass Logistic Regression. The outcomes show
that the Multiclass Decision Forest algorithm reaches 99.1
percent precision. The primary focus of the implemented system
is to detect the health condition of a current CKD patient
by concentrating on the highlighted fields, to help gain a
clearer understanding of the patient's situation.
According to [15], predictive modeling for
cardiovascular disease analysis is highly difficult in the
field of healthcare informatics. The goal of this research
is to find patterns that link the predictor variables
in a health science database using data mining. In this study
the researchers suggest an ensemble model strategy that combines
the predictive capacity of various classifiers to
improve predictive precision. The ensemble integrates
five classification algorithms
to predict and identify the recurrence of cardiovascular
disease: support vector machine, artificial neural
network, Naive Bayes, regression analysis, and random
forest. The data is taken from the UCI repository. The model was
built using the “WEKA DM tool”. 10-fold cross-validation
was used in this study to divide the data into
testing and training sets, which meets the system's training and
testing requirements. Across all classifier designs
implemented in the research, the precision level obtained
was above 93 percent, and 98.17 percent is
the highest accuracy, achieved by the RF algorithm.
The article [16] presents an approximate mathematical
model using linear regression to detect a person's disease
association from homomorphically encrypted information.
A Parkinson's patients dataset is used to build a PD-patient
information analysis model outsourced to a remote server.
The primary challenge of this research is to use encrypted
data to perform the analysis needed to build a model
that can classify unseen samples based on
the training samples, using a linear regression
structure. The gradient descent algorithm is used to
fit the linear regression model. The
implementation processes encrypted, sensitive
voice recording samples of Parkinson's patients. The samples
are encoded using homomorphic encryption. The classifier's
effectiveness improves as its true positive rate rises
relative to its false positive rate.
The research carried out by [1] proposed a privacy-preserving
and highly accurate outsourced random forest disease
predictor termed PHPR. The PHPR model can carry out
secure training with health information that belongs to
various data holders and predict accurately. In addition,
original data and computed results in the rational field can be
processed and stored safely in the cloud without any privacy
leakage. Experimental results using real-world data show that
PHPR not only gives a secure prediction of disease over
ciphertexts, but also retains the predictive accuracy of the
original classifier. The study concludes that it is important
to find a suitable classification model for the prediction of
disease.
Another study [17] used fully homomorphic encryption,
which allowed the development of new privacy-preserving
machine learning schemes. It demonstrates how these
schemes can be applied to the automatic evaluation
of speech affected by medical conditions, enabling
patient confidentiality in treatment and monitoring.
More precisely, it presents results for Parkinson's disease,
detection of cold, and degree of depression. Second-degree
polynomials and linear equations replace the
activation functions, as only sums and multiplications are
feasible. After the network is trained on unencrypted data,
the resulting model is used in an encrypted
part of the network to generate encrypted predictions. The small
difference between the outcomes of the encrypted neural
networks and their unencrypted counterparts
indicates the validity of the secure approach. However, the
limited volume of records does not allow the performance
degradation of deeper networks to be analyzed thoroughly.
Machine learning classification is now used for
various purposes, for example genomic or medical
prediction, face detection, spam detection, and economic
prediction. Because of privacy issues, it is crucial that
the classifier and data remain private. In the research of
[3], three main classification protocols are built to meet
this privacy requirement: Naive Bayes, hyperplane decision,
and decision trees. Underlying these protocols is a
new library of building blocks that allows a broad variety of
privacy-preserving classifiers to be constructed. The work
illustrates how this library can be used to build classifiers
beyond the three listed above, such as a multiplexer and a
face recognition classifier. The classifiers and library were
implemented and assessed. These protocols are efficient, taking
milliseconds to a few seconds to perform a classification
when operating on actual medical data.
One of the tasks of the 2017 iDASH Secure Genome
Analysis Competition was to enable logistic regression
models to be trained over encrypted genomic records.
More specifically, it provided a list of about 1,500 patient
records, each with 18 binary features containing
information on particular diseases. The concept was that the
data owner would encrypt the records using homomorphic
encryption and deliver them to an untrusted cloud for
storage. The cloud could then apply a training method
homomorphically to the encrypted records to obtain an
encrypted logistic regression model, which can be sent
to the data holder for decryption. The data provider could thus
effectively outsource the training procedure without
disclosing either its sensitive information or the trained model
to the cloud. Homomorphic encryption of fixed-point numbers
and multi-bit plaintexts
is used. The outcome shows that training on encrypted
information is feasible but comes with a high computing
cost. On the other hand, in critical applications, this
technique can ensure the greatest level of data
confidentiality [18].
Machine learning is a very successful strategy for
early diagnosis of disease and can assist physicians in
making diagnostic decisions. The aim of the article [19] is to
build a classifier model using the WEKA tool to detect diabetes
using Naive Bayes, SVM, RF and Simple CART algorithms.
These 4 classifiers were ranked based
on training time, test time and precision value. From the
evaluation and calculations it is evident that the Support
Vector Machine performed best in predicting the
disease. The SVM's precision value is
0.784, which is the highest, and the Random Forest's accuracy
value is 0.756, which is the lowest, according to [19]. In the
study [20], researchers provided the first secure multiparty
computation (SMC) protocols for private classification
with ensemble models such as boosted decision trees and RF,
and demonstrated the protocols on the KenSci model for
medical analysis.
IV. METHODOLOGY
Fig. 1: Fayyad Methodology [21]
1. Data Selection:
The dataset chosen for this project is the Breast Cancer
Wisconsin (Diagnostic) Data Set [22] taken from the UCI
Machine Learning repository, as this is a popular dataset
among researchers in the field of healthcare. With 11 columns
and 699 rows, this is a concise dataset well suited to testing
encryption-based classifier models. The dataset consists of
binary classification data, with 458 benign and 241 malignant
cases (65.5% and 34.5% of the 699 rows). The Class attribute is
the dependent variable, with '2' denoting a benign and '4' a
malignant tumor.
2. Data Preprocessing:
The study and practice of keeping messages safe and secure
by using mathematical techniques is known as cryptography.
A cryptosystem can provide one or more
of four services, namely confidentiality, integrity,
authentication and non-repudiation, so that information can be
protected from being disclosed to unauthorized parties.
Cryptosystems can be classified into two categories:
secret key (symmetric key) cryptosystems and
public key (asymmetric key) cryptosystems.
In a symmetric key cryptosystem, the same key is used to
encrypt and decrypt a message.
DES, IDEA and AES are some of the popular symmetric
key cryptosystems. In an asymmetric key cryptosystem, two
different keys, a private key and a public key, are used for
encryption and decryption. The public key is required for
the encryption process, whereas the private key is used for
decryption [23].
In this paper, we use an algorithm that takes healthcare data
and preserves its privacy using Homomorphic Encryption.
Homomorphic Encryption is an encryption technique that
permits computations to be performed on ciphertext and
generates a result in encrypted format. When this
encrypted result is decrypted, it matches the outcome the
operation would have produced had it been performed
on the original plaintext. Rivest et al. first coined the
term Homomorphic Encryption in the year 1978. After
that, multiple schemes were proposed that
supported either addition or
multiplication on encrypted data, but not both.
Supporting only one of addition and
multiplication was not sufficient for a number of extended
computations in fields such as data mining,
machine learning, and bioinformatics. Later on, many
researchers proposed Fully Homomorphic Cryptosystems
and Somewhat Homomorphic Cryptosystems that support
multiple additions and multiplications [11].
In this system, Homomorphic Encryption is applied over
already encrypted data, and computations are performed on
the result. For the initial encryption,
this paper proposes a data masking scheme based
on Format Preserving Encryption (FPE). FPE is a
reversible symmetric encryption method. It differs from
traditional symmetric encryption in that,
instead of a completely changed, unreadable binary string, the
FPE ciphertext maintains the format and structure of the
original plaintext, and hence the result can be saved
back to the database without any changes
to the database system. FPE can be used for
masking data for function testing, performance testing and
security testing, which helps prevent privacy exposure of the
real data [24]. This paper specifically proposes the use of a
Python implementation of Format-preserving, Feistel-based
encryption (FFX), following the algorithm given in [25].
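The FFX algorithm of [25] is not reproduced here. As a rough illustration of the Feistel idea behind format-preserving encryption, the sketch below encrypts an even-length string of decimal digits into another digit string of the same length. It is a toy construction, not the FFX standard and not the implementation used in this paper: the function names, the HMAC-SHA256 round function, and the round count are all assumptions made for the example.

```python
import hashlib
import hmac

DIGITS = "0123456789"


def _round_value(key: bytes, round_no: int, half: str, modulus: int) -> int:
    """Keyed round function: HMAC-SHA256 of (round, half), reduced mod `modulus`."""
    digest = hmac.new(key, f"{round_no}:{half}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _check(digits: str) -> int:
    """Validate the input format and return the length of one half."""
    if not digits or len(digits) % 2 or any(ch not in DIGITS for ch in digits):
        raise ValueError("toy scheme expects an even-length string of decimal digits")
    return len(digits) // 2


def fpe_encrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Balanced Feistel network over decimal digits (illustrative only)."""
    k = _check(digits)
    modulus = 10 ** k
    left, right = digits[:k], digits[k:]
    for i in range(rounds):
        mixed = (int(left) + _round_value(key, i, right, modulus)) % modulus
        left, right = right, str(mixed).zfill(k)
    return left + right


def fpe_decrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Run the rounds backwards, subtracting what encryption added."""
    k = _check(digits)
    modulus = 10 ** k
    left, right = digits[:k], digits[k:]
    for i in reversed(range(rounds)):
        prev_right = left
        prev_left = (int(right) - _round_value(key, i, prev_right, modulus)) % modulus
        left, right = str(prev_left).zfill(k), prev_right
    return left + right


masked = fpe_encrypt(b"demo-key", "10020345")
restored = fpe_decrypt(b"demo-key", masked)
```

Because the ciphertext is again a digit string of the same length, it could be written back into the same database column, which is the property the paragraph above relies on.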
Paillier Homomorphic Encryption will be adopted as
the building block in this privacy preserving system. The
Paillier cryptosystem is a probabilistic asymmetric algorithm
having an additive homomorphic property for public key
cryptography and was invented in the year 1999 by Pascal
Paillier.
The output of the Format preserving, Feistel based
encryption (FFX) will be used as the input for Paillier
Homomorphic Encryption. This cryptographic technique has
three stages which are Key Generation, Encryption and
Decryption. The key notations and definitions corresponding
to them are given below [6]:
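Key Generation: Consider p and q as two independent
large prime numbers. We compute
$$N = p \cdot q, \qquad \lambda = \mathrm{lcm}(p - 1,\ q - 1).$$
Then define the function
$$L(x) = \frac{x - 1}{N}.$$
Choose an integer $g \in \mathbb{Z}_{N^2}^{*}$ of order $N$ and set
$$\mu = \left( L\left( g^{\lambda} \bmod N^{2} \right) \right)^{-1} \bmod N.$$
The public key is now given by $PK = (N, g)$
and the private key is $SK = (\lambda, \mu)$.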
Encryption:
Let $m \in \mathbb{Z}_N$ be the plaintext and $r \in \mathbb{Z}_N^{*}$ be a random
number. Then the ciphertext can be generated as
$$C = E_{PK}(m) = g^{m} \cdot r^{N} \bmod N^{2},$$
where $E_{PK}(\cdot)$ is the Paillier encryption of plaintext $m$ with
random number $r$, computed modulo $N^{2}$.
Decryption:
Given a ciphertext $C$, the plaintext $m$ can be derived by
the following equation:
$$m = D_{SK}(C) = L\left( C^{\lambda} \bmod N^{2} \right) \cdot \mu \bmod N$$
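The three stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the choice g = N + 1 (a common choice of order N), the function names, and the small demonstration primes are all assumptions, and the modular inverse via `pow(x, -1, n)` needs Python 3.8 or later.

```python
import math
import secrets


def L(x: int, n: int) -> int:
    """The function L(x) = (x - 1) / N from the key generation step."""
    return (x - 1) // n


def keygen(p: int, q: int):
    """Return (PK, SK) = ((N, g), (lambda, mu)) for two given primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1                                          # assumed choice of g
    mu = pow(L(pow(g, lam, n * n), n), -1, n)          # (L(g^lambda mod N^2))^-1 mod N
    return (n, g), (lam, mu)


def encrypt(pk, m: int) -> int:
    """C = g^m * r^N mod N^2 with a fresh random r coprime to N."""
    n, g = pk
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pk, sk, c: int) -> int:
    """m = L(C^lambda mod N^2) * mu mod N."""
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n


# Demonstration primes only: far too small for any real deployment.
pk, sk = keygen(104729, 1299709)
ciphertext = encrypt(pk, 42)
recovered = decrypt(pk, sk, ciphertext)
```

Encrypting the same plaintext twice generally yields different ciphertexts, because a new r is drawn each time; this is what makes the scheme probabilistic.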
Additive Homomorphism:
Given two plaintexts $x$ and $y$ encrypted under the same
public key $PK$, the product of the two ciphertexts
$E_{PK}(x)$ and $E_{PK}(y)$ is equal to the ciphertext of the sum of
the two plaintexts.
$$E_{PK}(x) \cdot E_{PK}(y) = \left( g^{x} r_1^{N} \bmod N^{2} \right) \cdot \left( g^{y} r_2^{N} \bmod N^{2} \right)$$
$$E_{PK}(x) \cdot E_{PK}(y) = g^{x+y}\, r_1^{N} r_2^{N} \bmod N^{2}$$
$$E_{PK}(x) \cdot E_{PK}(y) = E_{PK}(x + y)$$
Scalar-Multiplicative Homomorphism:
Given a constant $c \in \mathbb{Z}_N$, the ciphertext $E_{PK}(x)$
raised to the power of $c$ is equal to the encryption of the product
of the constant and the plaintext.
$$E_{PK}(x)^{c} = \left( g^{x} r^{N} \bmod N^{2} \right)^{c}$$
$$E_{PK}(x)^{c} = g^{xc}\, r^{cN} \bmod N^{2}$$
$$E_{PK}(x)^{c} = E_{PK}(x \cdot c)$$
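The two homomorphic properties can be checked numerically, and together they are enough to evaluate a linear score such as the weighted sum inside a logistic regression on encrypted features. The sketch below is self-contained and illustrative only: the toy primes, the choice g = N + 1, the helper names, and the restriction to non-negative integer features and weights are assumptions, not details taken from the paper.

```python
import math
import secrets

P, Q = 104729, 1299709            # toy primes, for illustration only
N = P * Q
N2 = N * N
G = N + 1                         # assumed choice of g
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow((pow(G, LAM, N2) - 1) // N, -1, N)


def encrypt(m: int) -> int:
    r = secrets.randbelow(N - 1) + 1
    while math.gcd(r, N) != 1:
        r = secrets.randbelow(N - 1) + 1
    return (pow(G, m, N2) * pow(r, N, N2)) % N2


def decrypt(c: int) -> int:
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """E(x) * E(y) mod N^2 decrypts to x + y."""
    return (c1 * c2) % N2


def scale_encrypted(c: int, k: int) -> int:
    """E(x)^k mod N^2 decrypts to x * k."""
    return pow(c, k, N2)


def encrypted_dot(enc_features, weights) -> int:
    """Encrypted weighted sum of encrypted features with plaintext weights."""
    acc = encrypt(0)
    for c, w in zip(enc_features, weights):
        acc = add_encrypted(acc, scale_encrypted(c, w))
    return acc


sum_plain = decrypt(add_encrypted(encrypt(17), encrypt(25)))
scaled_plain = decrypt(scale_encrypted(encrypt(17), 3))
score_plain = decrypt(encrypted_dot([encrypt(v) for v in (3, 1, 4)], (2, 7, 1)))
```

The server computing `encrypted_dot` never sees the feature values; only the holder of the private key, here the medical practitioner, can decrypt the score.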
With 11 columns and 699 records, this is a clean and
concise dataset. The column Bare nuclei contained 16
records holding the value "?". These were replaced with NA and
then filled with the last valid value in the
column using the fillna() method in pandas. The ID column
that identifies breast cancer patients has been removed, as
it would not add any
benefit to the classification models, and removing it further
protects patient identity.
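The cleaning steps described above can be sketched with pandas as follows. The column names `id`, `bare_nuclei` and `class` and the function name are hypothetical, since the paper does not list them. `ffill()` is used here as the equivalent of `fillna()` with forward filling, which is what "last valid occurrence" suggests.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame,
                id_col: str = "id",
                missing_col: str = "bare_nuclei") -> pd.DataFrame:
    """Replace '?' with NaN, forward-fill the gaps, and drop the ID column."""
    col = df[missing_col]
    # mask() turns every "?" into NaN; to_numeric converts the rest to numbers
    numeric = pd.to_numeric(col.mask(col == "?"))
    # ffill() propagates the last valid value forward into each gap
    return df.assign(**{missing_col: numeric.ffill()}).drop(columns=[id_col])


# Tiny made-up frame standing in for the real data.
raw = pd.DataFrame({
    "id": [1000025, 1002945, 1015425, 1016277],
    "bare_nuclei": ["1", "?", "10", "?"],
    "class": [2, 4, 2, 4],
})
cleaned = clean_frame(raw)
```

Note that a missing value in the very first row would stay NaN after forward filling, so a real pipeline would need to check for that case.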
3. Data Exploration and Transformation:
Fig. 2: Exploratory Data Analysis
Exploratory data analysis is performed on all the columns
in the dataset to ensure that the data features fulfill the
requirements of the research question with no hidden
problems. A prominent reason to perform this analysis is to
detect any correlation between data columns.
It also ensures that the expectations of the dataset are met
with statistical backing. Figure 2 visualizes the correlation
between all columns. The matplotlib library has been used for
the visualisation. The graph legend shows a gradual colour
gradient moving from bottom to top. A darker shade
indicates lower correlation, and a lighter shade indicates higher
correlation. The principal diagonal has been eliminated, since
every column correlates perfectly with itself. We observe that
the correlation between variables is significantly lower than
the limit except for the columns uniformity of cell shape and
uniformity of cell size.
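A minimal sketch of the correlation analysis is given below, assuming the features are held in a pandas DataFrame. The synthetic columns and function names are illustrative assumptions; the real analysis used the dataset columns and plotted the matrix with matplotlib.

```python
import numpy as np
import pandas as pd


def masked_correlation(features: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation matrix with the principal diagonal blanked out."""
    corr = features.corr()
    values = corr.to_numpy(copy=True)
    np.fill_diagonal(values, np.nan)   # each column trivially correlates with itself
    return pd.DataFrame(values, index=corr.index, columns=corr.columns)


def strongest_pair(masked: pd.DataFrame):
    """Return the pair of columns with the largest absolute correlation."""
    values = masked.abs().to_numpy()
    i, j = np.unravel_index(np.nanargmax(values), values.shape)
    return masked.index[i], masked.columns[j]


# Synthetic stand-in data: two closely related columns and one unrelated column.
rng = np.random.default_rng(0)
size = rng.normal(size=300)
frame = pd.DataFrame({
    "cell_size": size,
    "cell_shape": size + 0.05 * rng.normal(size=300),
    "mitoses": rng.normal(size=300),
})
masked = masked_correlation(frame)
pair = strongest_pair(masked)
```

Passing `masked` to a heatmap routine such as matplotlib's `imshow` would give a picture comparable in spirit to Figure 2.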
4. Data Mining:
K-fold cross validation: Due to the imbalance in the dataset (65.5%
Benign and 34.5% Malignant), a k-fold cross validation
procedure has been implemented using the k-fold method from the
scikit-learn library. The dataset is split into k subsets, where
k-1 subsets are used for training and the remaining subset is used
for testing. The procedure is iterative, and every subset
takes its turn as the test set. Our experiments showed an
observable increase in accuracy using k-fold cross
validation when compared to a single train-test split. 10-fold
cross validation has been used, as our experiments showed that 10
folds gave better validation accuracy than 5
folds.
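The procedure can be sketched with scikit-learn as below. The synthetic data from `make_classification`, the random seeds, and the use of shuffling are assumptions made so the example is self-contained; they stand in for the real 699-row dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with a similar size and class balance to the real data.
X, y = make_classification(n_samples=699, n_features=9, n_informative=5,
                           weights=[0.655, 0.345], random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Each sample should appear in exactly one test fold across the 10 iterations.
test_counts = np.zeros(len(y), dtype=int)
for train_idx, test_idx in kfold.split(X):
    test_counts[test_idx] += 1

# One accuracy value per fold; their mean is the cross-validation accuracy.
scores = cross_val_score(LogisticRegression(solver="liblinear"), X, y,
                         cv=kfold, scoring="accuracy")
mean_accuracy = scores.mean()
```

For imbalanced classes, scikit-learn also offers `StratifiedKFold`, which keeps the class proportions similar in every fold; whether the paper used it is not stated.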
K-nearest neighbour (KNN):
The KNN algorithm is used to solve regression and classification
problems. It is a lazy learner: it builds no explicit model
during training and defers computation to prediction time.
Data is expected to be numerical and
standardized, which is important for running the KNN
model.
Naive Bayes:
It is a probabilistic classifier derived from Bayes'
theorem, which computes the posterior probability of each
class from the prior probabilities. Bayes' theorem is given
in Equation 1 and the normalization constant is given
in Equation 2 [15].
$$P(X_i \mid y) = \frac{P(y \mid X_i)\, P(X_i)}{p(y)} \tag{1}$$
$$p(y) = \sum_{i=1}^{4} p(y \mid X_i)\, P(X_i) \tag{2}$$
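As a worked example of Equations 1 and 2, the sketch below computes posteriors from priors and likelihoods. The function name and the numeric values are illustrative assumptions; the equations sum over four hypotheses, while the example uses two (benign and malignant) to keep the arithmetic short.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' theorem: P(X_i | y) = P(y | X_i) P(X_i) / p(y)."""
    # Equation 2: the normalization constant p(y)
    evidence = sum(lik * prior for lik, prior in zip(likelihoods, priors))
    # Equation 1, evaluated for every hypothesis X_i
    return [lik * prior / evidence for lik, prior in zip(likelihoods, priors)]


# Priors taken from the class balance; likelihoods are made-up values for one observation.
priors = [0.655, 0.345]          # P(benign), P(malignant)
likelihoods = [0.10, 0.80]       # P(observation | benign), P(observation | malignant)
post = posteriors(priors, likelihoods)
```

Here p(y) = 0.10 * 0.655 + 0.80 * 0.345 = 0.3415, so the malignant posterior is 0.276 / 0.3415, roughly 0.81, even though the malignant prior is the smaller of the two.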
Logistic Regression:
It is a predictive, probability-based machine
learning algorithm used to solve classification problems. The
logistic regression approach restricts the model output
to the range 0 to 1 using the sigmoid function. The solver
used for this implementation is liblinear, as it
is well suited to small datasets (see footnote 1).
Footnote 1: https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148
Support Vector Classifier (SVC):
A Support Vector Classifier is a discriminative classifier
formally defined by a separating hyperplane. It is
derived from the Support Vector Machine. The radial basis function
kernel has been used by default, as it is easy to tune and
provides a good benchmark for comparison. The kernel
coefficient applied is the reciprocal of the number of features.
LBFGS:
LBFGS is a conventional quasi-Newton method for
optimizing smooth functions of many variables. The solver
used for the respective model is lbfgs. The logistic function is
the activation function and 11 hidden layers are used.
Decision Tree:
Decision tree learning uses a tree as a predictive model to
go from observations about an item to conclusions about
its target value. It works best with data that can be easily
divided into categorical variables. The criterion used to
measure the quality of a split is gini, for Gini impurity.
AdaBoost:
AdaBoost is a meta-estimator that first fits a classifier on the
original dataset, then fits additional copies of the classifier
on the same dataset, with the weights of incorrectly classified
cases adjusted so that subsequent classifiers concentrate more
on difficult cases. 30 estimators have been used, with a
decision tree chosen as the base estimator.
Stochastic Gradient Boosting:
This method constructs an additive model in a forward stage-wise
manner. It allows arbitrary differentiable loss functions to
be optimized. 100 boosting stages were chosen for this
implementation.
Stochastic Voting Ensemble:
To combine predictions from several kinds of models, simple
statistics such as the mean can be used. The models
that were selected for this ensemble are logistic regression,
decision tree and Support Vector Classifier.
Bagged Decision Tree: This method constructs several models of
the same type from distinct sub-samples of the training data.
Random forest: It is used for classification
as well as regression. It is an ensemble of decision tree
classifiers, each with a flowchart-like tree structure.
Extra Trees: Working on the same principle as random
forests, it selects random subsets of features to split nodes.
The main differences between them are that sampling is done
without replacement, and that nodes are split on random
splits rather than best splits.
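The settings named above map onto scikit-learn estimators roughly as sketched below. This is an assumed configuration reconstructed from the text, not the authors' code: `GaussianNB` is an assumed Naive Bayes variant, every parameter not mentioned in the text is left at its library default, and the LBFGS network is omitted because its exact layer layout is not clear from the description. In scikit-learn, `gamma="auto"` sets the kernel coefficient to 1 / n_features, and the default base estimator of AdaBoost and bagging is a decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(seed: int = 0) -> dict:
    """Return the classifiers with the hyperparameters stated in the text."""
    logreg = LogisticRegression(solver="liblinear")
    tree = DecisionTreeClassifier(criterion="gini", random_state=seed)
    svc = SVC(kernel="rbf", gamma="auto")        # gamma = 1 / n_features
    return {
        "KNN": KNeighborsClassifier(),
        "Naive Bayes": GaussianNB(),             # assumed variant
        "Logistic Regression": logreg,
        "SVC": svc,
        "Decision Tree": tree,
        "Bagged Decision Tree": BaggingClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Extra Trees": ExtraTreesClassifier(random_state=seed),
        "AdaBoost": AdaBoostClassifier(n_estimators=30, random_state=seed),
        "Gradient Boosting": GradientBoostingClassifier(n_estimators=100,
                                                        random_state=seed),
        "Voting Ensemble": VotingClassifier(
            estimators=[("logreg", logreg), ("tree", tree), ("svc", svc)]),
    }


# Synthetic stand-in data; the paper used the (encrypted) breast cancer features.
X, y = make_classification(n_samples=200, n_features=9, random_state=0)
models = build_models()
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in models.items()}
```

Sorting `results` by value gives the kind of ranking reported in the accuracy table, although the numbers here come from synthetic data and say nothing about the paper's results.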
V. EVALUATION AND RESULTS
A. Cross validation accuracy result
This metric has been calculated using the following parameters:
the classifier model, the feature columns, the class column and
the k-fold parameter. The cross validation
results of all 10 folds are collected and their mean is taken
as the cross validation accuracy. The cross_val_score method
from the model_selection module of scikit-learn is used.
The table below shows the accuracy scores obtained for the
implemented models with plaintext and encrypted input data,
sorted by the column “After encryption”, as this is the column
being evaluated.
Fig. 3: Test Accuracy Table
Fig. 4: Test Accuracy
B. F1 score
The confusion matrix was evaluated for every model, but
the most relevant classification metric chosen was the F1
score. The F1 score is calculated with the formula below:
$$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{3}$$
where precision is the ratio of true positives to all
predicted positives (true positives plus false positives), and
recall is the ratio of true positives to all actual positives
(true positives plus false negatives). The F1 score is derived
after considering all runs of k-fold cross validation, and its
weighted mean is calculated. Fig. 5 shows the F1 scores
obtained for the implemented models with plaintext and
encrypted input data, sorted by the column “After Encryption”.
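Equation 3 can be checked on a small example. The sketch below computes precision, recall and F1 for a single positive class from true and predicted labels; the function name and the label vectors are made up for illustration, with 4 standing for malignant and 2 for benign as in the dataset. It does not reproduce the weighted averaging across classes and folds mentioned above.

```python
def precision_recall_f1(y_true, y_pred, positive=4):
    """Return (precision, recall, F1) for the given positive class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [4, 4, 4, 2, 2, 2, 2, 4]
y_pred = [4, 4, 2, 2, 2, 4, 2, 4]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With 3 true positives, 1 false positive and 1 false negative, precision and recall are both 3/4, so F1 is also 0.75.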
Fig. 5: F1 Score Table
Fig. 6: F1 Score
VI. CONCLUSION
From the results, we observe that logistic regression is
the best disease prediction model for this dataset, as it
outperforms the other implemented models on encrypted data
without compromising patient privacy. Logistic
regression has the highest accuracy, with 96% for plain text
and 65% for encrypted text, followed by the support vector
classifier with very similar accuracies. The logistic regression
model works very well on this dataset as it is well suited to
binary classification problems. We see that encryption causes a
proportional deterioration for every model, and does not have
a detrimental effect on one particular model. With the dataset
having a slight imbalance, it is more appropriate to compare
the models on the basis of their F1 scores. From the plot we can
see that the logistic regression model consistently performs
well in comparison to the other models. Due to the sensitivity
of healthcare data, it was difficult to obtain a large dataset
for this research. Future work could involve
more complex datasets, which would enable the use of neural
networks.
REFERENCES
[1] Z. Ma, J. Ma, Y. Miao, and X. Liu, “Privacy-preserving and
high-accurate outsourced disease predictor on random forest,”
Information Sciences, vol. 496, pp. 225 – 241, 2019.
[2] X. Liu, R. Lu, J. Ma, L. Chen, and B. Qin, “Privacy-Preserving
Patient-Centric Clinical Decision Support System on Na¨ıve Bayesian
Classification,” IEEE Journal of Biomedical and Health Informatics,
vol. 20, no. 2, pp. 655–668, 2016.
[3] R. Bost, R. Ada Popa, S. Tu, and S. Goldwasser, “Machine learning
classification over encrypted data,” 01 2015.
[4] “A review on the state-of-the-art privacy-preserving approaches in the
e-health clouds.,” IEEE Journal of Biomedical and Health Informatics,
Biomedical and Health Informatics, IEEE Journal of, IEEE J. Biomed.
Health Inform, no. 4, p. 1431, 2014.
[5] C. Hao, G.-B. Ran, H. Kyoohyung, H. Zhicong, J. Amir, L. Kim,
and L. Kristin, “Logistic regression over encrypted data from fully
homomorphic encryption.,” BMC Medical Genomics, no. S4, p. 3,
2018.
[6] M. D., L. R., S. V., V. V., and A. K. Sangaiah, “Hybrid reasoning-based
privacy-aware disease prediction support system.,” Computers and
Electrical Engineering, vol. 73, pp. 114 – 127, 2019.
[7] H. Park, P. Kim, H. Kim, K.-W. Park, and Y. Lee, “Efficient machine
learning over encrypted data with non-interactive communication.,”
Computer Standards Interfaces, vol. 58, pp. 87 – 108, 2018.
[8] C. Zhang, L. Zhu, C. Xu, and R. Lu, “Ppdp: An efficient
and privacy-preserving disease prediction scheme in cloud-based
e-healthcare system.,” Future Generation Computer Systems, vol. 79,
no. Part 1, pp. 16 – 25, 2018.
[9] K. Hariss, H. Noura, and A. E. Samhat, “Fully enhanced homomorphic
encryption algorithm of more approach for real world applications.,”
Journal of Information Security and Applications, vol. 34, no. Part 2,
pp. 233 – 242, 2017.
[10] W.-j. Lu, S. Kawasaki, and J. Sakuma, “Using Fully Homomorphic
Encryption for Statistical Analysis of Categorical, Ordinal and
Numerical Data,” 2017.
[11] “Private equality test using ring-lwe somewhat homomorphic
encryption.,” 2016 3rd Asia-Pacific World Congress on Computer
Science and Engineering (APWC on CSE), Computer Science and
Engineering (APWC on CSE), 2016 3rd Asia-Pacific World Congress
on, APWC-ON-CSE, p. 1, 2016.
[12] Y. Rahulamathavan, S. Veluru, R. C. Phan, J. A. Chambers,
and M. Rajarajan, “Privacy-preserving clinical decision support
system using gaussian kernel-based classification,” IEEE Journal of
Biomedical and Health Informatics, vol. 18, no. 1, pp. 56–66, 2014.
[13] I. Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas,
and I. Chouvarda, “Machine learning and data mining methods
in diabetes research,” Computational and Structural Biotechnology
Journal, vol. 15, pp. 104 – 116, 2017.
[14] W. H. S. D. Gunarathne, K. D. M. Perera, and K. A. D. C. P.
Kahandawaarachchi, “Performance evaluation on machine learning
classification techniques for disease classification and forecasting
through data analytics for chronic kidney disease (ckd),” in 2017 IEEE
17th International Conference on Bioinformatics and Bioengineering
(BIBE), pp. 291–296, Oct 2017.
[15] J. M. A. A. K. M. N. S, “Research reports in clinical cardiology,”
Ensemble approach for developing a smart heart disease prediction
system using classification algorithms, vol. 9, pp. 33 – 45, 2019.
[16] T. Morshed, D. Alhadidi, and N. Mohammed, “Parallel linear
regression on encrypted data,” in 2018 16th Annual Conference on
Privacy, Security and Trust (PST), pp. 1–5, Aug 2018.
[17] F. Teixeira, A. Abad, and I. Trancoso, “Patient privacy in paralinguistic
tasks,” pp. 3428–3432, 09 2018.
[18] H. Chen, R. Gilad-Bachrach, K. Han, Z. Huang, A. Jalali, K. Laine,
and K. Lauter, “Logistic regression over encrypted data from fully
homomorphic encryption,” BMC Medical Genomics, vol. 11, p. 81,
Oct 2018.
[19] A. Mir and S. N. Dhage, “Diabetes disease prediction using machine
learning on big data of healthcare,” in 2018 Fourth International
Conference on Computing Communication Control and Automation
(ICCUBEA), pp. 1–6, Aug 2018.
[20] K. Fritchman, K. Saminathan, R. Dowsley, T. Hughes, M. De Cock,
A. Nascimento, and A. Teredesai, “Privacy-preserving scoring of tree
ensembles: A novel framework for ai in healthcare,” in 2018 IEEE
International Conference on Big Data (Big Data), pp. 2413–2422,
Dec 2018.
[21] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “The kdd process for
extracting useful knowledge from volumes of data,” vol. 39, no. 11,
pp. 27–34, 1996.
[22] D. Dua and C. Graff, “UCI machine learning repository,” 2017.
[23] “Advanced e-voting system using paillier homomorphic encryption
algorithm.,” 2016 International Conference on Informatics and
Computing (ICIC), Informatics and Computing (ICIC), International
Conference on, p. 338, 2016.
[24] “A data masking scheme for sensitive big data based on
format-preserving encryption.,” 2017 IEEE International Conference
on Computational Science and Engineering (CSE) and IEEE
International Conference on Embedded and Ubiquitous Computing
(EUC), Computational Science and Engineering (CSE) and Embedded
and Ubiquitous Computing (EUC), 2017 IEEE International
Conference on, CSE-EUC, p. 518, 2017.
[25] “Addendum to the ffx mode of operation for format-preserving
encryption,” p. 1, 2010.
VII. APPENDIX

Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 

Recently uploaded (20)

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 

Safe Machine Learning for Patient Data Privacy
Ashwini Chaudhari, Hemangi Shinge, Mansi Chowkkar, Seemanthini Narasimha Moorthy
x18129676, x18130429, x18134599, x18141447
MSc Data Analytics
Advanced Data Mining
National College of Ireland
Abstract: The evolution of data mining processes has contributed substantially to diverse fields by transforming data into useful information for predicting future trends. This has drawn the interest of many researchers to developing machine learning models for disease prediction. Massive growth in healthcare data has led to the migration of data from traditional storage systems to cloud-based systems, which increases privacy concerns. In this project, a privacy-preserving system is proposed for protecting sensitive healthcare data. The system first encrypts the plaintext that is fed to the data mining model, so that sensitive data remains in encrypted form throughout the prediction process. The output of the prediction system is then provided to the medical practitioner, who already holds the private key for decryption, which ensures data privacy. Paillier Homomorphic Encryption is used to encrypt patient data, and various data mining models have been tested with the encrypted data. For the Breast Cancer Diagnostic Dataset from the UCI Data Repository, Logistic Regression outperformed all other implemented models without compromising data privacy.
Keywords - Homomorphic Encryption, Data Mining, Logistic Regression, Privacy-preserving, Disease prediction
I. INTRODUCTION
Medical predictions are ubiquitous in several fields, ranging from healthcare to illness diagnosis. Disease predictors provide pre-diagnosis support that helps in making a clinical judgement based on the medical data of the user. Many medical applications deal with sensitive information, hence it is necessary to ensure privacy [1].
As the healthcare industry provides health services to patients on a global scale and faces a sharp increase in electronic data, information and data security is a major concern [2]. To address these privacy issues, we propose a privacy-preserving system using Homomorphic Cryptosystems. Safety and reliability are key factors for learning algorithms in areas such as healthcare, defence and finance, which handle sensitive data. In healthcare especially, patient data is considered highly sensitive, and patients would not want it revealed to anyone other than the persons concerned.
Supervised learning algorithms have two stages:
1. Training phase: the algorithm learns a model w from a set of labelled examples.
2. Classification phase: a classifier C is run over a previously unseen feature vector x, using the model w to output a prediction C(x, w).
In applications that manage sensitive data, it is crucial that both the feature vector x and the model w remain private. Throughout this paper, we refer to this goal as privacy-preserving classification. In practical terms, a client has a private input represented as a feature vector x, and the server has a private input consisting of a private model w. The way the model w is obtained is independent of our protocols [3].
Open source datasets can give machine learning an upper hand by enabling good models, but patient privacy could be compromised, since malicious users may be able to trace records back to specific patients. In this project we use Paillier Homomorphic Encryption to encrypt breast cancer data. Because Homomorphic Encryption (HE) is applied here over already encrypted data, Format Preserving Encryption (FPE) is used for the initial encryption, and HE is then applied over the FPE output.
The public key is used for encryption, and the private key is used by the medical practitioner for decryption, which ensures the privacy of the sensitive data. The encrypted data is then fed to the following widely known machine learning and deep learning classifiers to build effective privacy-preserving protocols: KNN, Logistic Regression, SVC, LBFGS, Naive Bayes, Decision Tree, Bagged Decision Tree, Random Forest, Extra Trees, AdaBoost, Stochastic Gradient Boosting, Stochastic Voting Ensemble. The classification models are evaluated on the basis of cross-validation accuracy and the F1 score. The objective of this project is to identify the best-fit classifier model for the UCI Breast Cancer Wisconsin (Diagnostic) Dataset. The structure of the paper is as follows: Section III discusses the related research. The methodology followed for the implementation of this project is explained in detail in Section IV. The evaluation and results are discussed in Section V.
II. RESEARCH QUESTION
"How efficiently and accurately can disease prediction models be implemented on encrypted data to secure patient privacy using data mining techniques?"
III. LITERATURE REVIEW
In the current technological era, traditional paper-based medical records, prescriptions, and patient data have given way to electronic health records (EHRs). Maintaining such a large amount of healthcare data requires privacy-preserving techniques to keep the data confidential. Identity spoofing attacks and information disclosure are the main threats in healthcare. Cryptographic approaches used to protect e-health data include symmetric key encryption (SKE) and public key encryption (PKE). Some of the cryptographic techniques mentioned by [4] are:
1. Fully Homomorphic Encryption (FHE)
2. Somewhat Homomorphic Encryption (SHE)
3. Searchable Encryption
4. Predicate/Hierarchical Encryption
Homomorphic encryption (HE) allows computation on ciphertexts and produces encrypted results. FHE performs an arbitrary number of additions and multiplications on encrypted data without decryption. SHE performs a limited number of additions and multiplications on encrypted data without decryption. The study in [4] explains privacy-preserving approaches for health data. Searchable encryption techniques based on PKE are less efficient, hence new techniques need to be explored in the privacy-preserving area.
Running machine learning algorithms on encrypted data is one of the important applications in the security and privacy-preserving area. The work in [5] explains a logistic regression model on encrypted data using FHE. A bootstrapping operation allows FHE to run an arbitrary number of steps. In critical applications, HE provides the highest level of data privacy at the cost of increased computation time. With the advancement of machine learning, new techniques for disease prediction support systems have been proposed by [6]. In a disease prediction support system (DPSS), traditional cryptographic techniques can protect healthcare data, but they are poorly suited to modern applications.
Hence, in that study, a privacy-preserving prediction system is developed using Paillier HE. Paillier HE encrypts sensitive patient data, and the efficiency of the machine learning model (FKNN-CBR) is evaluated. The cryptography study in [7] proposes implementing cryptographic primitives to protect sensitive data before it is exposed to any classifier model. To obtain a classifier's output, whether to assess accuracy or sensitivity, data must be provided to the model. Reducing the risk of exposing such sensitive information makes data protection an important part of machine learning. In [7], Fully Homomorphic Encryption (FHE) is implemented with a privacy-preserving Naive Bayes classifier. The problem faced in that study is the performance of the classifier, since FHE is a slow algorithm. The author applied performance-improving techniques for privacy preservation using FHE with the HElib library.
The research in [8] proposed a privacy-preserving disease prediction (PPDP) model based on a single layer perceptron (SLP). To implement the SLP, they used HE techniques involving a modulus and a prime-generating algorithm. HE allows the user to communicate with untrusted parties using encrypted data. Because of its low efficiency, it is not widely deployed in practice. In [9], new practical algorithms are implemented using FHE. These algorithms provide a symmetric approach, efficiency, and robustness against plaintext and ciphertext attacks. In [10], the FHE method is used on three types of data to perform statistical analysis on encrypted data. An encryptor encrypts the data, which is then sent to the cloud where HE operations are performed on the ciphertext. A decryptor receives the resulting data and decrypts it. To reduce time and cost, SHE is now being implemented by many researchers.
In [11], SHE is implemented for the private equality test (PET) of integers and the private batch equality test (PriBET). SHE is faster than FHE; it supports many addition operations but fewer multiplication operations. In [12], a privacy-preserving support system is proposed to protect data in the healthcare sector. One of the techniques suggested there is HE based on Paillier cryptography, a public key additive encryption scheme. The advantage of the Paillier scheme is that linear operations can be performed on encrypted data at the server side using the Paillier properties, while the clinician keeps the private key secret. Hence only clinicians can decrypt the data using the private key. A Gaussian kernel based SVM is applied to the encrypted data. Integer and continuous variables are encrypted using the Paillier encryption method. The results show that Paillier encryption does not hamper accuracy.
In the study [2], encryption is applied to design a privacy-preserving system. In this system, encrypted data is used to train a Naive Bayes classifier without leaking patient data. The trained classifier can then be applied to other patient data to predict disease risk factors. Paillier HE is used to achieve privacy preservation. The key generation process generates the public key pk and the private key sk. Given numbers are then encrypted under the public key using a random number together with modular addition and multiplication operations. The ciphertext can be recovered into plaintext using the private key.
Few studies have reviewed the use of data mining and machine learning in diabetes research; in one systematic review, studies applying these methods to Diabetes mellitus (DM) were identified and examined. Diabetes mellitus is characterized as a cluster of metabolic disorders affecting human health all across the globe. A broad variety of machine learning algorithms were used.
Overall, 85% of these used supervised methods and 15% used unsupervised methods, more specifically association rules. Nephropathy, diabetic foot, Alzheimer's disease, liver cancer, heart disease, hypoglycaemic events and depression are the diabetes-related medical issues covered in this study. Advances in biotechnology, with the huge volumes of data they generate, together with the growing number of EHRs, are expected to give rise to more in-depth exploration of DM diagnosis, etiopathophysiology and therapy through data mining and ML techniques applied to datasets that provide biological and clinical data [13].
The study in [14] focuses on chronic kidney disease (CKD), which is regarded as kidney injury that persists for more than 3 months. ML classification algorithms were evaluated on this task. The models classified patients as CKD or non-CKD using distinct classification algorithms. All models were built on the recently collected CKD dataset from the UCI repository, which has 25 attributes and 400 records. 14 CKD-related attributes were evaluated with the following machine learning methods: Multiclass Decision Jungle, Multiclass Decision Forest, Multiclass Neural Network, Multiclass Logistic Regression. The results show that the Multiclass Decision Forest algorithm achieves 99.1 percent precision. The primary focus of the implemented system is to detect the health status of a current CKD patient by concentrating on the highlighted fields, giving a clearer understanding of the patient's situation.
According to [15], predictive modelling for cardiovascular disease analysis is highly difficult in the field of healthcare informatics. The goal of that research is to find patterns linking the predictor variables in a health science database through data mining. The researchers suggest an ensemble strategy that combines the predictive capacity of several classifiers to improve predictive precision. The ensemble integrates five classification algorithms to predict and identify the recurrence of cardiovascular disease: support vector machine, artificial neural network, Naive Bayes, regression analysis, and random forest. The data is taken from the UCI repository. This model was built using the "WEKA DM tool".
10-fold cross-validation was used in these studies to divide the data into training and testing sets, which meets the training and testing requirements of the system. Across all classifiers implemented in the research, the precision obtained was above 93 percent, with the RF algorithm reaching the highest accuracy of 98.17 percent.
The article in [16] provides an approximate mathematical model, based on linear regression, for detecting a person's disease association on homomorphically encrypted information. A Parkinson's patients dataset is used to build a PD-patient analysis model that is outsourced to a remote server. The primary challenge of this research is to perform the necessary analysis on encrypted data in order to build a model that can classify unseen samples based on the training samples, using a linear regression structure. The gradient descent algorithm is used to fit the linear regression model until it converges. The implementation operates on encrypted, sensitive voice recording samples from Parkinson's patients. The samples are encrypted using Homomorphic Encryption. The classifier's effectiveness improves as the model delivers a higher true positive rate relative to its false positive rate.
The research in [1] proposed a privacy-preserving and highly accurate outsourced random forest disease predictor, termed PHPR. The PHPR model can carry out secure training with health information that belongs to various data holders and can predict accurately. In addition, original data and computed results in the rational field can be processed and stored securely in the cloud without any privacy leakage. Experimental results using real-world data show that PHPR not only gives a secure prediction of disease over ciphertexts, but also maintains the same predictive accuracy as the original classifier. The study concludes that it is important to find a suitable classification model for disease prediction.
Another study [17] used fully homomorphic encryption, which enables the development of new privacy-preserving machine learning schemes. It demonstrates how such systems can be applied to the automatic evaluation of speech affected by medical conditions, preserving patient confidentiality in treatment and monitoring scenarios. More precisely, it presents results for Parkinson's disease, detection of cold, and degree of depression. Second-degree polynomials and linear functions replace the activation functions, as only additions and multiplications are feasible under encryption. After the network is trained on unencrypted data, the resulting model is used in an encrypted version of the network to generate encrypted predictions. The small difference between the outcomes of the encrypted neural networks and their unencrypted counterparts indicates the validity of the secure approach. However, the limited volume of records does not allow performance degradation in deeper networks to be analysed thoroughly.
Machine learning classification is now used for various purposes, for example genomic or medical prediction, face detection, spam detection, and financial prediction. Because of privacy concerns, it is crucial that both the classifier and the data remain private. In [3], three major classification protocols are built to meet this privacy requirement: Naive Bayes, hyperplane decision, and decision trees. These constructions are based on a new library of building blocks that allows a broad variety of privacy-preserving classifiers to be constructed. The work also illustrates how this library can be used to build classifiers other than the three listed above, such as a multiplexer and a face recognition classifier. The classifiers and the library were implemented and evaluated. The protocols are efficient, taking from milliseconds to a few seconds to perform a classification on real medical data.
One of the tasks of the 2017 iDASH Secure Genome Analysis Competition was to train logistic regression models over encrypted genomic records. More specifically, it provided about 1,500 patient records, each with 18 binary features containing information on particular diseases. The idea was that the data owner would encrypt the records using homomorphic encryption and deliver them to an untrusted cloud for storage. The cloud could then run a training procedure homomorphically on the encrypted records to obtain an encrypted logistic regression model, which can be sent to the data holder for decryption. The data provider could thus effectively outsource the training procedure without disclosing either its sensitive information or the trained model to the cloud. Homomorphic encryption is used for the encryption of fixed-point numbers and multi-bit plaintexts. The outcome shows that training on encrypted information is feasible but comes at a high computing cost. On the other hand, in critical applications this technique can ensure the greatest level of data confidentiality [18].
Machine learning is a very successful strategy for the early diagnosis of disease and can assist physicians in making diagnostic decisions. The aim of the article in [19] is to build a classifier model using the WEKA tool to detect diabetes using Naive Bayes, SVM, RF and Simple CART algorithms. The four classifiers were ranked by training time, test time and precision value. The evaluation shows that the Support Vector Machine performed best at predicting the disease with maximum precision. The SVM's precision value of 0.784 is the highest, and the Random Forest's accuracy value of 0.756 is the lowest, as per [19].
In [20], researchers provided the first secure multiparty computation (SMC) protocols for private classification with ensemble models, namely boosted decision trees and RF, and applied the protocols to the KenSci model for medical analysis.
IV. METHODOLOGY
Fig. 1: Fayyad Methodology [21]
1. Data Selection: The dataset chosen for this project is the Breast Cancer Wisconsin (Diagnostic) Data Set [22] taken from the UCI Machine Learning repository, as this is a popular dataset among researchers in the field of healthcare. With 11 columns and 699 rows, this is a concise dataset well suited to testing encryption-based classifier models. The dataset consists of binary classification data, with 458 benign and 241 malignant cases (65.5% and 34.5% of the 699 records). The Class attribute is the dependent variable, with '2' denoting a benign and '4' a malignant tumor.
2. Data Preprocessing: The study and practice of keeping messages safe and secure by using mathematical techniques is known in general as cryptography. A cryptosystem can provide one or more of four services, namely confidentiality, integrity, authentication and non-repudiation, so that information is protected from disclosure to unauthorized parties. Cryptosystems can be classified into two categories: secret key (symmetric key) cryptosystems and public key (asymmetric key) cryptosystems. In a symmetric key cryptosystem, the same key is used to encrypt and decrypt a message. DES, IDEA and AES are some of the popular symmetric key cryptosystems. In an asymmetric key cryptosystem, two different keys, a private key and a public key, are used. The public key is required for encryption, whereas the private key is used for decryption [23].
In this paper, we use an algorithm that operates on healthcare data and preserves privacy using Homomorphic Encryption. Homomorphic Encryption is an encryption technique that permits computations on ciphertext and generates a result in encrypted form. Decrypting that result matches the outcome of the same operation performed on the original plaintext. Rivest et al. first introduced the idea of Homomorphic Encryption in 1978. After that, multiple schemes were proposed that supported either addition or multiplication on encrypted data, but not both. Supporting only one of the two operations was not sufficient for many extended computations in fields such as data mining, machine learning and bioinformatics.
Later on, many researchers proposed Fully Homomorphic Cryptosystems and Partially Homomorphic Cryptosystems that supported multiple additions and multiplications [11]. In this work, Homomorphic Encryption is applied over already encrypted data, and computations are performed on the result. For the initial encryption, this paper proposes a data masking scheme based on Format Preserving Encryption (FPE). FPE is a reversible symmetric encryption method. It differs from traditional symmetric encryption in that, instead of producing a completely changed, unreadable binary string, the FPE ciphertext maintains the format and structure of the original plaintext. The result can therefore be saved back to the database without any changes to the database system. FPE can be used to mask data for function testing, performance testing and security testing, which helps prevent privacy exposure of the real data [24]. This paper specifically proposes the use of a Python implementation of Format-preserving, Feistel-based encryption (FFX), following the algorithm given in [25].
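The format-preserving idea can be illustrated with a small Feistel network over decimal digits. This is only a sketch under stated assumptions, not the FFX algorithm of [25] or the implementation used in this project: the names `fpe_encrypt`, `fpe_decrypt` and `_round_value` are hypothetical, the round function uses HMAC-SHA256 as a stand-in, and the number of rounds is arbitrary. It shows the two properties discussed above: the ciphertext keeps the length and digit-only format of the plaintext, and the operation is reversible with the same key.

```python
import hashlib
import hmac


def _round_value(key: bytes, round_no: int, half: str, width: int) -> int:
    """Keyed round function: HMAC-SHA256 of the round number and one half,
    reduced to a number with at most `width` decimal digits."""
    digest = hmac.new(key, f"{round_no}:{half}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % (10 ** width)


def _check(digits: str) -> None:
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of at least two decimal digits")


def fpe_encrypt(digits: str, key: bytes, rounds: int = 10) -> str:
    """Encrypt a digit string into another digit string of the same length."""
    _check(digits)
    split = len(digits) // 2
    left, right = digits[:split], digits[split:]
    for i in range(rounds):
        width = len(left)
        mixed = (int(left) + _round_value(key, i, right, width)) % (10 ** width)
        # Feistel step: the old right half moves left, the mixed half moves right.
        left, right = right, str(mixed).zfill(width)
    return left + right


def fpe_decrypt(digits: str, key: bytes, rounds: int = 10) -> str:
    """Invert fpe_encrypt by running the rounds backwards."""
    _check(digits)
    short = len(digits) // 2
    # The half widths swap every round, so the split point depends on parity.
    split = short if rounds % 2 == 0 else len(digits) - short
    left, right = digits[:split], digits[split:]
    for i in reversed(range(rounds)):
        width = len(right)
        prev_left = (int(right) - _round_value(key, i, left, width)) % (10 ** width)
        left, right = str(prev_left).zfill(width), left
    return left + right


demo_key = b"demo-key-not-for-production"   # illustrative key only
masked = fpe_encrypt("1000025", demo_key)   # a 7-digit identifier stays 7 digits
restored = fpe_decrypt(masked, demo_key)
```

Because the output is again a digit string of the same length, a masked value could be written back into a column that expects, for example, a fixed-width numeric code. A production system would rely on a vetted FPE mode rather than this toy construction.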
Paillier Homomorphic Encryption is adopted as the building block of this privacy-preserving system. The Paillier cryptosystem, invented by Pascal Paillier in 1999, is a probabilistic asymmetric algorithm for public key cryptography with an additive homomorphic property. The output of the Format-preserving, Feistel-based encryption (FFX) is used as the input to Paillier Homomorphic Encryption. This cryptographic technique has three stages: Key Generation, Encryption and Decryption. The key notations and corresponding definitions are given below [6]:
Key Generation: Let $p$ and $q$ be two independent large prime numbers. Compute $N = p \cdot q$ and $\lambda = \mathrm{lcm}(p-1, q-1)$, and define the function $L(x) = \frac{x-1}{N}$. Choose an integer $g$ of order $N$ and compute $\mu = \left(L(g^{\lambda} \bmod N^2)\right)^{-1} \bmod N$. The public key is $PK = (N, g)$ and the private key is $SK = (\lambda, \mu)$.
Encryption: Let $m \in \mathbb{Z}_N$ be the plaintext and $r \in \mathbb{Z}_N$ be a random number. The ciphertext is generated as
$$C = E_{PK}(m) = g^{m} r^{N} \bmod N^2,$$
where $E_{PK}(\cdot)$ is the Paillier encryption of plaintext $m$ with random number $r$ modulo $N^2$.
Decryption: Given a ciphertext $C$, the plaintext $m$ is recovered as
$$m = D_{SK}(C) = L(C^{\lambda} \bmod N^2) \cdot \mu \bmod N.$$
Additive Homomorphism: Given two plaintexts $x$ and $y$ encrypted under the same public key $PK$, the product of the two ciphertexts $E_{PK}(x)$ and $E_{PK}(y)$ equals the ciphertext of the sum of the two plaintexts:
$$E_{PK}(x) \cdot E_{PK}(y) = g^{x} r_1^{N} \cdot g^{y} r_2^{N} \bmod N^2 = g^{x+y} (r_1 r_2)^{N} \bmod N^2 = E_{PK}(x + y).$$
Scalar-Multiplicative Homomorphism: Given a constant $c \in \mathbb{Z}_N$, the ciphertext $E_{PK}(x)$ raised to the power $c$ equals the encryption of the product of the constant and the plaintext:
$$E_{PK}(x)^{c} = (g^{x} r^{N})^{c} \bmod N^2 = g^{xc} r^{cN} \bmod N^2 = E_{PK}(x \cdot c).$$
With 11 columns and 699 records, this is a clean and concise dataset. The column Bare nuclei contained 16 records filled with ?.
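Returning briefly to the Paillier definitions above, the key generation, encryption, decryption and both homomorphic properties can be checked with a short sketch. It is a toy under stated assumptions, not the implementation used in this project: the primes 293 and 433 are far too small for real use, $g = N + 1$ is one common choice of a generator of order $N$ that the text does not specify, and the helper names (`keygen`, `encrypt`, `decrypt`, `add`, `scale`, `encrypted_score`) are hypothetical. The modular inverse via `pow(x, -1, n)` needs Python 3.8 or later.

```python
import math
import secrets


def L(x: int, n: int) -> int:
    """L(x) = (x - 1) / N; the division is exact for the values used here."""
    return (x - 1) // n


def keygen(p: int, q: int):
    """Return (PK, SK) = ((N, g), (lambda, mu)) for primes p and q."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    g = n + 1                                           # assumption: g = N + 1
    mu = pow(L(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m: int) -> int:
    """C = g^m * r^N mod N^2 with a fresh random r coprime to N."""
    n, g = pk
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(g, m % n, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c: int) -> int:
    """m = L(C^lambda mod N^2) * mu mod N."""
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n


def add(pk, c1: int, c2: int) -> int:
    """Additive homomorphism: E(x) * E(y) mod N^2 encrypts x + y."""
    return c1 * c2 % (pk[0] ** 2)


def scale(pk, c: int, k: int) -> int:
    """Scalar homomorphism: E(x)^k mod N^2 encrypts x * k."""
    return pow(c, k, pk[0] ** 2)


def encrypted_score(pk, enc_x, weights, bias: int) -> int:
    """Compute E(w . x + b) from encrypted features and plaintext weights."""
    total = encrypt(pk, bias)
    for c, w in zip(enc_x, weights):
        total = add(pk, total, scale(pk, c, w))
    return total


pk, sk = keygen(293, 433)                         # toy primes, illustration only
enc_features = [encrypt(pk, v) for v in [4, 7, 1]]
score = decrypt(pk, sk, encrypted_score(pk, enc_features, [3, 1, 2], 5))
```

The last three lines mimic the setting described in the introduction: a server holding non-negative integer weights evaluates a linear score on features it only sees in encrypted form, and only the holder of the private key can read the result. Real-valued or negative features would need an additional encoding step, which this sketch leaves out.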
These ? entries were replaced with NA and then filled with the last valid observation in the column (forward fill) using the fillna() method in pandas. The ID column, which identifies breast cancer patients, was removed: it adds no benefit to the classification models, and its removal further protects patient identity.

3. Data Exploration and Transformation:

Fig. 2: Exploratory Data Analysis
Exploratory data analysis is performed on all columns of the dataset to ensure that the features fulfil the requirements of the research question and hide no problems. A prominent reason to perform this analysis is to detect any correlation between columns; it also gives statistical backing to the expectations placed on the dataset. Figure 2 visualizes the correlation between all columns, plotted with the matplotlib library. The legend shows a colour gradient running from bottom to top: darker shades indicate lower correlation and lighter shades indicate higher correlation. The principal diagonal has been removed, since every column is perfectly correlated with itself. We observe that the correlation between variables is significantly lower than the limit, except between the columns uniformity of cell shape and uniformity of cell size.

4. Data Mining:

K-fold cross validation: Because the dataset is imbalanced (65.5% Benign and 34.5% Malignant), k-fold cross validation has been implemented using the k-fold method from the scikit-learn library. The dataset is split into k subsets; k-1 subsets are used for training and the remaining subset is used for testing. The procedure is repeated so that every subset takes its turn as the test set. Our experiments showed an observable increase in initial accuracy with k-fold cross validation compared to a single train-test split. 10-fold cross validation has been used, as in our experiments 10 folds gave better validation accuracy than 5 folds; with more folds, each iteration trains on a larger share of the data.

K-nearest neighbour (KNN): The KNN algorithm is used to solve regression and classification problems. It is a lazy learner: it retains no trained model and defers all computation to prediction time. The data is expected to be numerical and standardized, which is important for running the KNN model.
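The remark that KNN expects numerical, standardized data can be illustrated with a small standard-library sketch. The six training rows, the second wide-range feature and the helper names standardize, scale and knn_predict are hypothetical and chosen only to make the effect visible; they are not taken from the breast cancer dataset, whose features share a common scale.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def scale(row, stats):
    """Apply previously computed (mean, std) pairs to one row."""
    return [(v - m) / s for v, (m, s) in zip(row, stats)]


def standardize(rows):
    """Z-score every column; return the scaled rows and the statistics used."""
    stats = [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]
    return [scale(row, stats) for row in rows], stats


def knn_predict(train_x, train_y, query, k=1):
    """Majority label among the k training rows closest to `query` (Euclidean)."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Column 0: an informative score in 1..10; column 1: a hypothetical
# uninformative feature measured in the thousands.
train_x = [[1, 1000], [2, 5000], [1, 9000], [9, 1200], [8, 5200], [10, 8800]]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
query = [9, 5050]

raw_prediction = knn_predict(train_x, train_y, query)

scaled_x, stats = standardize(train_x)
scaled_prediction = knn_predict(scaled_x, train_y, scale(query, stats))
```

On the raw values the wide-range column dominates the distance, so the nearest neighbour is a benign row despite the high score in column 0; after standardization both columns contribute comparably and the nearest neighbour is a malignant row.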
Naive Bayes: It is a probabilistic classifier derived from Bayes' theorem, which works on the basis of the prior probabilities assigned to the root nodes. Bayes' theorem is given in Equation 1 and the normalization constant in Equation 2 [15].

$$P(X_i \mid y) = \frac{P(y \mid X_i)\,P(X_i)}{p(y)} \qquad (1)$$

$$p(y) = \sum_{i=1}^{4} p(y \mid X_i)\,P(X_i) \qquad (2)$$

Logistic Regression: It is a predictive, probability-based machine learning algorithm used to solve classification problems. Logistic regression restricts its output to the range 0 to 1 using the sigmoid function. The solver used for this implementation is liblinear, as it performs best with small datasets.¹

¹ https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148

Support Vector Classifier (SVC): A Support Vector Classifier is a discriminative classifier formally defined by a separating hyperplane. It is derived from the Support Vector Machine. The radial basis function kernel, which is the default, has been used, as it is easy to tune and provides a good benchmark for comparison. The kernel coefficient applied is the reciprocal of the number of features.

LBFGS: LBFGS is a standard quasi-Newton method for optimizing smooth functions of many variables. The solver used for the respective model is lbfgs. The logistic function is the activation function and 11 hidden layers are used.

Decision Tree: Decision tree learning uses a decision tree as a predictive model to go from observations about an item to conclusions about its target value. It works best with data that can easily be divided into categorical variables. The criterion that measures the quality of a split is set to gini, for Gini impurity.

AdaBoost: It is a meta-estimator that first fits a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, with the weights of incorrectly classified instances adjusted so that subsequent classifiers concentrate more on difficult cases. 30 estimators have been used, with a decision tree chosen as the base estimator.

Stochastic Gradient Boosting: It builds an additive model in a forward stage-wise fashion and allows arbitrary differentiable loss functions to be optimized. 100 boosting stages were chosen for this implementation.

Stochastic Voting Ensemble: It combines the predictions of several different kinds of models using simple statistics such as the mean. The models selected for this ensemble are logistic regression, decision tree and Support Vector Classifier.

Bagged Decision Tree: It constructs several models of the same type from different sub-samples of the training data.

Random forest: It is used for both classification and regression. It is an ensemble method built from decision tree classifiers, each of which has a flowchart-like tree structure.

Extra Trees: Working on the same principle as random forests, it selects random subsets of features to split nodes. The main differences between them are that sampling is done without replacement, and that nodes are split at random rather than at the best split.

V. EVALUATION AND RESULTS

A. Cross validation accuracy result

This metric is calculated from the classifier model, the feature columns, the class column and the k-fold parameter. The cross validation results of all 10 folds are collected and their mean is reported as the cross validation accuracy. The cross_val_score method from the model_selection module of scikit-learn is used for this.
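The way a cross validation accuracy is assembled from per-fold results can be sketched without scikit-learn. The sketch below is a simplified stand-in for that procedure, not the library's code: the contiguous unshuffled folds, the one-feature midpoint-threshold classifier and the 20 sample values are all hypothetical choices made for this example.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train indices, test indices) for k contiguous, unshuffled folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_val_accuracy(fit, predict, X, y, k=10):
    """Train on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores, mean(scores)


def fit_threshold(X, y):
    """Hypothetical classifier: threshold halfway between the two class means."""
    positives = [x for x, label in zip(X, y) if label == 1]
    negatives = [x for x, label in zip(X, y) if label == 0]
    return (mean(positives) + mean(negatives)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Hypothetical one-feature data: label 1 for high values, with one
# borderline negative sample (the value 6) that the classifier misses.
X = [1, 9, 2, 8, 6, 7, 1, 9, 2, 10, 3, 8, 1, 7, 2, 9, 3, 10, 2, 8]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

fold_scores, cv_accuracy = cross_val_accuracy(fit_threshold, predict_threshold, X, y, k=10)
```

Each of the 10 folds contributes one accuracy value, and the reported figure is their mean, so a single poorly predicted fold lowers the overall score by only a tenth of its own shortfall.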
The table below shows the accuracy scores obtained for the implemented models with plaintext and encrypted input data, sorted by the column "After encryption", as this is the column being evaluated.

Fig. 3: Test Accuracy Table

Fig. 4: Test Accuracy

B. F1 score

The confusion matrix was evaluated for every model, but the F1 score was chosen as the most relevant classification metric. The F1 score is calculated with the formula below:

$$F_1 = \frac{2 \cdot (\mathrm{precision} \cdot \mathrm{recall})}{\mathrm{precision} + \mathrm{recall}} \qquad (3)$$

where precision is the ratio of true positives to all predicted positives (true positives plus false positives), and recall is the ratio of true positives to the sum of true positives and false negatives. The F1 score is derived over all runs of the k-fold cross validation and its weighted mean is calculated. Figure 5 shows the F1 scores obtained for the implemented models with plaintext and encrypted input data, sorted by the column "After Encryption".

Fig. 5: F1 Score Table

Fig. 6: F1 Score

VI. CONCLUSION

From the results, we observe that logistic regression is the best disease prediction model for this dataset: it outperforms the other implemented models on encrypted data without compromising patient privacy. Logistic regression has the highest accuracy, with 96% for plaintext and 65% for encrypted text, followed by the support vector classifier with very similar accuracies. The logistic regression model works very well on this dataset, as it is well suited to binary classification problems. Encryption causes a proportional deterioration for every model and does not have a disproportionately detrimental effect on any particular model. Since the dataset has a slight imbalance, it is more appropriate to compare the models on the basis of their F1 scores. From the plot we can see that the logistic regression model consistently performs well in comparison to the other models. Due to the sensitivity of healthcare data, it was difficult to obtain a large dataset for this research. Future work could involve more complex datasets, which would enable the use of neural networks.
REFERENCES

[1] Z. Ma, J. Ma, Y. Miao, and X. Liu, “Privacy-preserving and high-accurate outsourced disease predictor on random forest,” Information Sciences, vol. 496, pp. 225–241, 2019.
[2] X. Liu, R. Lu, J. Ma, L. Chen, and B. Qin, “Privacy-Preserving Patient-Centric Clinical Decision Support System on Naïve Bayesian Classification,” IEEE Journal of Biomedical and Health Informatics, vol. 20, no. 2, pp. 655–668, 2016.
[3] R. Bost, R. Ada Popa, S. Tu, and S. Goldwasser, “Machine learning classification over encrypted data,” Jan 2015.
[4] “A review on the state-of-the-art privacy-preserving approaches in the e-health clouds,” IEEE Journal of Biomedical and Health Informatics, no. 4, p. 1431, 2014.
[5] C. Hao, G.-B. Ran, H. Kyoohyung, H. Zhicong, J. Amir, L. Kim, and L. Kristin, “Logistic regression over encrypted data from fully homomorphic encryption,” BMC Medical Genomics, no. S4, p. 3, 2018.
[6] M. D., L. R., S. V., V. V., and A. K. Sangaiah, “Hybrid reasoning-based privacy-aware disease prediction support system,” Computers and Electrical Engineering, vol. 73, pp. 114–127, 2019.
[7] H. Park, P. Kim, H. Kim, K.-W. Park, and Y. Lee, “Efficient machine learning over encrypted data with non-interactive communication,” Computer Standards & Interfaces, vol. 58, pp. 87–108, 2018.
[8] C. Zhang, L. Zhu, C. Xu, and R. Lu, “PPDP: An efficient and privacy-preserving disease prediction scheme in cloud-based e-healthcare system,” Future Generation Computer Systems, vol. 79, no. Part 1, pp. 16–25, 2018.
[9] K. Hariss, H. Noura, and A. E. Samhat, “Fully enhanced homomorphic encryption algorithm of MORE approach for real world applications,” Journal of Information Security and Applications, vol. 34, no. Part 2, pp. 233–242, 2017.
[10] W.-j. Lu, S. Kawasaki, and J. Sakuma, “Using Fully Homomorphic Encryption for Statistical Analysis of Categorical, Ordinal and Numerical Data,” 2017.
[11] “Private equality test using ring-LWE somewhat homomorphic encryption,” 2016 3rd Asia-Pacific World Congress on Computer Science and Engineering (APWC on CSE), p. 1, 2016.
[12] Y. Rahulamathavan, S. Veluru, R. C. Phan, J. A. Chambers, and M. Rajarajan, “Privacy-preserving clinical decision support system using Gaussian kernel-based classification,” IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 1, pp. 56–66, 2014.
[13] I. Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas, and I. Chouvarda, “Machine learning and data mining methods in diabetes research,” Computational and Structural Biotechnology Journal, vol. 15, pp. 104–116, 2017.
[14] W. H. S. D. Gunarathne, K. D. M. Perera, and K. A. D. C. P. Kahandawaarachchi, “Performance evaluation on machine learning classification techniques for disease classification and forecasting through data analytics for chronic kidney disease (CKD),” in 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 291–296, Oct 2017.
[15] J. M. A. A. K. M. N. S, “Ensemble approach for developing a smart heart disease prediction system using classification algorithms,” Research Reports in Clinical Cardiology, vol. 9, pp. 33–45, 2019.
[16] T. Morshed, D. Alhadidi, and N. Mohammed, “Parallel linear regression on encrypted data,” in 2018 16th Annual Conference on Privacy, Security and Trust (PST), pp. 1–5, Aug 2018.
[17] F. Teixeira, A. Abad, and I. Trancoso, “Patient privacy in paralinguistic tasks,” pp. 3428–3432, Sep 2018.
[18] H. Chen, R. Gilad-Bachrach, K. Han, Z. Huang, A. Jalali, K. Laine, and K. Lauter, “Logistic regression over encrypted data from fully homomorphic encryption,” BMC Medical Genomics, vol. 11, p. 81, Oct 2018.
[19] A. Mir and S. N. Dhage, “Diabetes disease prediction using machine learning on big data of healthcare,” in 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1–6, Aug 2018.
[20] K. Fritchman, K. Saminathan, R. Dowsley, T. Hughes, M. De Cock, A. Nascimento, and A. Teredesai, “Privacy-preserving scoring of tree ensembles: A novel framework for AI in healthcare,” in 2018 IEEE International Conference on Big Data (Big Data), pp. 2413–2422, Dec 2018.
[21] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “The KDD process for extracting useful knowledge from volumes of data,” vol. 39, no. 11, pp. 27–34, 1996.
[22] D. Dua and C. Graff, “UCI machine learning repository,” 2017.
[23] “Advanced e-voting system using Paillier homomorphic encryption algorithm,” 2016 International Conference on Informatics and Computing (ICIC), p. 338, 2016.
[24] “A data masking scheme for sensitive big data based on format-preserving encryption,” 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), p. 518, 2017.
[25] “Addendum to the FFX mode of operation for format-preserving encryption,” p. 1, 2010.

VII. APPENDIX