Safe Machine Learning for Patient Data Privacy
Ashwini Chaudhari, Hemangi Shinge, Mansi Chowkkar, Seemanthini Narasimha Moorthy
x18129676, x18130429, x18134599, x18141447
MSc Data Analytics
Advanced Data Mining
National College of Ireland
Abstract— The evolution of various data mining processes
has contributed significantly to diverse fields by enabling
the prediction of future trends through the transformation
of data into useful information. This has prompted many
researchers to develop machine learning models for disease
prediction. Massive growth in healthcare data has led to the
migration of data from traditional storage systems to
cloud-based systems, which increases privacy concerns. In
this project, a
privacy-preserving system is proposed for protecting sensitive
healthcare data. The system first encrypts the plaintext which
is fed to the data mining model so that during the prediction
process sensitive data is always maintained in encrypted form.
The output of the prediction system is then provided to
the medical practitioner who already has the private key
for decryption which ensures data privacy. For the purpose
of encrypting patient data, Paillier Homomorphic Encryption
is used and various data mining models have been tested
with encrypted data. It was observed that for the Breast
Cancer Diagnostic Dataset from UCI Data Repository, Logistic
Regression outperformed all other implemented models without
compromising data privacy.
Keywords - Homomorphic Encryption, Data Mining,
Logistic Regression, Privacy-preserving, Disease prediction
I. INTRODUCTION
Medical prediction is ubiquitous in several fields, ranging
from healthcare to illness diagnosis. Disease predictors
provide a pre-diagnosis facility that supports clinical
judgement based on the medical data of the user. Many
medical applications deal with sensitive information; hence
it is necessary to ensure privacy [1].
As the healthcare industry provides health services to
patients on a global scale and faces a sharp increase in
electronic data, information and data security is a major
concern [2]. To solve the privacy issues, we propose a
privacy-preserving system using homomorphic cryptosystems.
Safety and reliability are key factors to be
considered for upcoming learning algorithms in areas such
as healthcare, defence and finance, which contain sensitive data.
Especially in healthcare, patient data is considered highly
sensitive, and the patient would not want it to be revealed to
any entities other than the concerned persons.
Supervised learning algorithms have two stages:
1. The training phase, in which the algorithm learns a model w
from a collection of labelled examples. 2. The classification
phase, in which a classifier C is run over a previously unseen
feature vector x, using the model w to output C(x, w). In
applications that manage sensitive data, it is crucial that
both the feature vector x and the model w remain private.
Throughout this document, we refer to this objective as
privacy-preserving classification. In practical terms, a
client has a private input in the form of a feature vector x,
and the server has a private input consisting of the model w.
The method of obtaining the model w is independent of our
protocols [3].
Open source datasets can benefit machine learning by
enabling good models to be produced, but patient privacy
could be compromised, as malicious users may be able to
trace records back to specific patients.
In this project we use Paillier Homomorphic
Encryption for encrypting breast cancer data. As
Homomorphic Encryption is applied over encrypted
data, Format Preserving Encryption (FPE) is used for the
initial encryption. HE is then applied over this encrypted
data. The public key is used for encryption, and
the private key is used for decryption of the data by
the medical practitioner, which ensures privacy of the
sensitive data. The encrypted data is then fed to the
following widely known machine learning and deep learning
classifiers to build effective privacy-preserving protocols:
KNN, Logistic Regression, SVC,
LBFGS, Naive Bayes, Decision Tree, Bagged Decision Tree,
Random Forest, Extra Trees, AdaBoost, Stochastic Gradient
Boosting, Stochastic Voting Ensemble. The classification
models are evaluated on the basis of cross-validation
accuracy and the F1 score.
The objective of this project is to identify the best-fit
classifier model for the UCI Breast Cancer Wisconsin
(Diagnostic) Dataset. The structure of the paper is as
follows: Section III discusses the related research. The
methodology followed for implementation of this project is
explained in detail in Section IV. The evaluation and results
are discussed in Section V.
II. RESEARCH QUESTION
"How efficiently and accurately can disease prediction
models be implemented on encrypted data to secure patient
privacy using data mining techniques?"

III. LITERATURE REVIEW
In the current technological era, traditional paper-based
medical records, prescriptions, and patient data
have advanced to electronic health records (EHRs).
Maintaining such a large amount of healthcare data requires
privacy preservation to keep the data confidential.
Identity spoofing attacks and information disclosure
are the main threats in healthcare. Cryptographic
approaches used to protect e-health data include the
symmetric key encryption method (SKE) and the public key
encryption method (PKE). Some of the cryptographic
techniques mentioned by [4] are: 1. Fully Homomorphic
Encryption (FHE) 2. Somewhat Homomorphic Encryption
(SHE) 3. Searchable Encryption 4. Predicate/Hierarchical
Encryption. Homomorphic encryption (HE) allows computation
on ciphertexts, which yields encrypted
results. FHE performs an arbitrary number of additions and
multiplications on encrypted data without decryption. SHE
performs a limited number of addition and multiplication
operations on encrypted data without decryption. The study
[4] explains privacy-preserving approaches for health data.
Searchable encryption techniques based on PKE are less
efficient; hence new techniques need to be explored in the
privacy-preserving area.
Running machine learning algorithms on encrypted data is
one of the important applications in the security and
privacy-preserving area. The work in [5] explains a logistic
regression model on encrypted data using FHE.
A bootstrapping operation is used so that FHE can
run an arbitrary number of steps. In critical applications, HE
provides the highest level of data privacy at the cost of
increased computation time. With the advancement in machine
learning, a new disease prediction support system
has been proposed by [6]. In the disease prediction support
system (DPSS), traditional cryptographic techniques
can protect healthcare data, but they are unsuitable for
modern applications. Hence in that study, a privacy-preserving
prediction system is developed using Paillier HE. Paillier
HE encrypts sensitive patient data, and the efficiency of the
machine learning model (FKNN-CBR) is evaluated.
In the study of cryptography done by [7], a cryptographic
primitive is proposed to protect sensitive data before
exposing it to any classifier model. To
evaluate the output of a classifier model for accuracy or
sensitivity, data needs to be provided to the model. To reduce
the risk of exposing such sensitive information,
data protection is an important part of machine learning. In
[7], fully homomorphic encryption (FHE) has been
implemented with a privacy-preserving Naive Bayes classifier.
The problem faced in this study is the performance of the
classifier, since FHE is a slow algorithm. The
authors implemented performance-improving techniques
for privacy preservation using FHE with the HElib
library. The research done by [8] proposed a
privacy-preserving disease prediction (PPDP) model based on
a single layer perceptron (SLP). For implementing SLP, they
used HE techniques in which a modulus and a prime-generating
algorithm are used. HE allows the user to communicate with
untrusted parties using encrypted data. Because of its low
efficiency, it is not yet practical in real-world applications.
In [9], new practical algorithms have been implemented
using FHE. These algorithms provide a symmetric approach,
efficiency, and robustness against plaintext and ciphertext
attacks. In [10], the FHE method is used on three types of
data to perform statistical analysis on encrypted data.
Here, an encryptor encrypts the data and sends it to the
cloud, which performs HE operations on the ciphertext. A
decryptor receives the HE result and then decrypts it.
To reduce time and cost, SHE is now being implemented by
many researchers. In [11], SHE is implemented for the private
equality test (PET) of integers and the private equality batch
processing test (PriBET). SHE is faster than FHE and
supports many addition operations but fewer multiplication
operations. In [12], a privacy-preserving support system
is proposed to protect data in the healthcare sector. One
of the techniques suggested there is HE based on Paillier
cryptography. This algorithm is a public key additive
encryption scheme. The advantage of the Paillier scheme
is that linear operations can be performed on the encrypted
data at the server side using the Paillier properties, while
the clinician keeps the private key secret. Hence
only clinicians can decrypt the data using the private key.
A Gaussian kernel based SVM is applied to the encrypted
data. Integer and continuous variables are encrypted
using the Paillier encryption method. The result shows that
Paillier encryption does not hamper accuracy. In the
study [2], encryption is applied to design
a privacy-preserving system. In this system encrypted data
is used to train a Naive Bayes classifier without leaking
patient data. The trained classifier can then be applied
to other patient data to predict risk factors of the disease.
To preserve privacy, Paillier HE is used. The
key generation process generates the public key pk and the
private key sk. Given numbers are then encrypted under the
public key using a random number together with modular
addition and multiplication operations. The ciphertext can
be recovered into plaintext using the private key.
In [13], data mining and machine learning methods applied
to Diabetes mellitus (DM) research were identified and
reviewed in a systematic manner. Diabetes mellitus is
characterized as a cluster of metabolic disorders affecting
human health across the globe. A broad variety of machine
learning algorithms were used. Overall, 85% of these were
supervised methods and 15% were unsupervised methods, more
precisely association rules. Nephropathy,
diabetic foot, Alzheimer's disease, liver cancer, heart disease,
hypoglycaemic events, and depression are the diabetes-related
medical issues covered in this study. The rise of
biotechnology, with the huge volume of data generated,
together with the growing number of EHRs, is expected to give
rise to more in-depth discovery of DM diagnosis,
etiopathophysiology and therapy through data mining and ML
strategies on datasets that provide biological and clinical
data [13].
The study [14] focuses on chronic kidney disease (CKD),
which is regarded as kidney injury that persists for more
than 3 months. ML classification algorithms were evaluated.
The classification models identified the patients' CKD
and non-CKD status with distinct classification algorithms.
All these models were applied to the CKD dataset of 25
attributes and 400 records obtained from the UCI library.
14 CKD-related attributes were evaluated with various
machine learning methods: Multiclass Decision Jungle,
Multiclass Decision Forest, Multiclass Neural Network,
Multiclass Logistic Regression. The outcomes show
that the Multiclass Decision Forest algorithm achieves 99.1
percent precision. The primary focus of the implemented
system is to detect the health condition of a current CKD
patient by concentrating on the highlighted fields, to help
obtain a clearer understanding of the patient's condition.
According to [15], predictive modeling for
cardiovascular disease analysis is highly difficult in the
field of healthcare informatics. The goal of this research
is to find patterns that link the predictor variables
in a health science database through data mining. In this
study the researchers suggest an ensemble model strategy that
combines the predictive capacity of various classifiers to
improve predictive precision. Ensemble learning integrates
five classification algorithms
to predict and identify the recurrence of cardiovascular
disease: support vector machine, artificial neural
network, Naive Bayes, regression analysis, and random
forest. The data is taken from the UCI repository. The model
was built using the WEKA DM tool. 10-fold cross-validation
was used to divide the data into
testing and training sets, which meets the training and
testing requirement of the system. Across all classifier
designs implemented in the research, the precision level
acquired was above 93 percent, and 98.17 percent is
the highest accuracy, obtained by the RF algorithm.
The article [16] provides an approximate mathematical
model using linear regression to detect a person's disease
association on homomorphically encrypted information.
A Parkinson's patient dataset is used for building a
PD-patient information analysis model outsourced to a remote
server. The primary challenge of this research is to use
encrypted data to perform the necessary analysis to build a
model that can classify unseen samples based on
the training samples, using a linear regression
structure. The gradient descent algorithm is used to
fit the linear regression model. The
implementation processes encrypted, sensitive voice
recording samples of Parkinson's patients. The samples are
encrypted using homomorphic encryption. The classifier's
efficiency improves when the model delivers a higher true
positive rate than false positive rate.
The research carried out by [1] proposed a
privacy-preserving and highly accurate outsourced random
forest disease predictor termed PHPR. The PHPR model can
carry out secure training with health information that
belongs to various data holders and predict accurately. In
addition, original data and computed results in the rational
field can be processed and stored safely in the cloud without
any privacy leakage. Experimental results using real-world
data show that PHPR not only gives a secure prediction of
disease over ciphertexts, but also retains the same
predictive accuracy as the original classifier. The study
concludes that it is important to find
a suitable classification model for the prediction of disease.
Another study [17] used fully homomorphic encryption,
which allowed the development of new privacy-preserving
machine learning schemes. It demonstrates how these
schemes can be applied to the automatic evaluation
of speech affected by medical conditions, enabling
patient confidentiality in treatment and monitoring.
More precisely, it presented results for Parkinson's disease,
detection of cold, and degree of depression. Second-degree
polynomials and linear functions replace the
activation functions, as only additions and multiplications
are feasible. After the network is trained with unencrypted
data, the resulting model is used in an encrypted
version of the network to generate encrypted predictions.
The small difference between the outcomes of encrypted neural
networks and their unencrypted counterparts, furthermore,
indicates the validity of the secure strategy. However, the
limited volume of records does not allow deeper networks to
be used to thoroughly analyze performance degradation.
Machine learning classification is now used for
various tasks, for example genomic or
medical prediction, face detection, spam detection, and
economic predictions. Because of privacy issues, it is
crucial that the classifier and data remain private. In the
research [3], three main classification protocols, namely
Naive Bayes, hyperplane decision, and decision trees, are
built to meet this privacy requirement. Underlying these
constructions is a new library of building blocks that allows
a broad variety of privacy-preserving classifiers to be
constructed. The study illustrates how this library can be
used to build classifiers other than the three listed above,
such as a multiplexer and a face recognition classifier.
The classifiers and library were implemented and evaluated.
These protocols are efficient, taking milliseconds to a few
seconds to perform a classification
when operating on real medical data.
One of the tasks of the 2017 iDASH Secure Genome
Analysis Competition was to enable logistic regression
models to be trained over encrypted genomic records.
More specifically, it provided a list of about 1,500 patient
records, each with 18 binary features containing
information on particular diseases. The concept was that the
data owner would encrypt the records using homomorphic
encryption and deliver them to an untrusted cloud for
storage. The cloud could then apply a training method
homomorphically to the encrypted records to obtain an
encrypted logistic regression model, which can be sent
to the data holder for decryption. The data provider could
thus effectively outsource the training procedure without
disclosing either its sensitive information or the trained
model to the cloud. Homomorphic encryption is used for the
encryption of fixed-point numbers and multi-bit plaintexts.
The outcome shows that training on encrypted
information is feasible but comes with a high computing
cost. On the other hand, in critical applications, this
technique can ensure the greatest level of data
confidentiality [18].
Machine learning is a very successful strategy for
early diagnosis of disease that can assist physicians in
making diagnostic decisions. The aim of the article [19] is
to build a classifier model using the WEKA tool to detect
diabetes using Naive Bayes, SVM, RF and Simple CART
algorithms. These four classifiers were ranked by
training time, test time and precision value. From the
evaluation it is evident that the Support Vector Machine
predicted the disease with maximum precision. The SVM's
precision value is 0.784, which is the highest, and the
Random Forest's accuracy value is 0.756, which is the lowest,
according to [19]. In the study [20], researchers provided
the first secure multiparty computation (SMC) protocols for
private classification with tree ensembles (boosted decision
trees and RF) and applied the protocols to the KenSci Model
for Medical Analysis.
IV. METHODOLOGY
Fig. 1: Fayyad Methodology [21]
1. Data Selection:
The dataset chosen for this project is the Breast Cancer
Wisconsin (Diagnostic) Data Set [22] taken from the UCI
Machine Learning repository, as this is a popular dataset
among researchers in the field of healthcare. With 11 columns
and 699 rows, this is a concise dataset well suited to testing
encryption-based classifier models. The dataset consists of
binary classification data, with 458 benign and 241 malignant
cases (65.5% and 34.5% respectively). The Class attribute is
the dependent variable, with '2' denoting a benign and '4'
denoting a malignant tumor.
2. Data Preprocessing:
Cryptography is the study and practice of keeping messages
safe and secure using mathematical techniques. A cryptosystem
can provide one or more of four services, namely
confidentiality, integrity, authentication and
non-repudiation, so that information is protected from
disclosure to unauthorized parties. Cryptosystems fall into
two categories: secret key (symmetric key) cryptosystems and
public key (asymmetric key) cryptosystems.
In a symmetric key cryptosystem, the same key is used to
encrypt and decrypt a message.
DES, IDEA and AES are some of the popular symmetric
key cryptosystems. In an asymmetric key cryptosystem, two
different keys, a private key and a public key, are used.
The public key is required for encryption, whereas the
private key is used for decryption [23].
In this paper, we use an algorithm that operates on
healthcare data and preserves privacy using Homomorphic
Encryption. Homomorphic Encryption is an encryption technique
that permits computations on ciphertext and
generates a result in encrypted form. When decrypted, the
result matches the outcome of the same operation performed
on the original plaintext. Rivest et al. first introduced
the idea of Homomorphic Encryption in 1978. Subsequently,
multiple schemes were proposed that supported either
addition or multiplication on encrypted data, but not both.
Supporting only one of addition and multiplication was not
sufficient for many extended computations in fields such as
data mining, machine learning and bioinformatics. Later,
researchers proposed Fully Homomorphic Cryptosystems
and Somewhat Homomorphic Cryptosystems that support
multiple additions and multiplications [11].
Homomorphic Encryption is applied over already
encrypted data, and computations are performed on the
result. For the initial encryption,
this paper proposes a data masking scheme that is based
on Format Preserving Encryption (FPE). FPE is a
reversible symmetric encryption method. It differs from
traditional symmetric encryption in that, instead of
producing a completely changed, unreadable binary string, the
FPE ciphertext maintains the format and structure of the
original plaintext; hence the result can be saved
back to the database without any changes
to the database system. FPE can be used for
masking data for function testing, performance testing and
security testing, which helps prevent privacy exposure of the
real data [24]. This paper specifically proposes the use of a
Python implementation of Format-preserving, Feistel-based
encryption (FFX). The corresponding algorithm is given in
[25].
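As a rough illustration of the Feistel idea behind format preservation only, the following sketch encrypts a string of decimal digits into another string of decimal digits of the same length. It is not the FFX algorithm of [25] and is not secure for real data: the round function (HMAC-SHA256), the number of rounds, and the names toy_fpe_encrypt and toy_fpe_decrypt are all assumptions made for this example.

```python
import hashlib
import hmac


def _round_value(key: bytes, round_no: int, half: str, modulus: int) -> int:
    """Illustrative round function: HMAC-SHA256 reduced modulo `modulus`."""
    digest = hmac.new(key, f"{round_no}:{half}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def toy_fpe_encrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Map a decimal string to another decimal string of equal length."""
    assert digits.isdigit() and len(digits) >= 2 and rounds % 2 == 0
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in range(rounds):
        m = u if i % 2 == 0 else v            # current length of the left half
        c = (int(a) + _round_value(key, i, b, 10 ** m)) % 10 ** m
        a, b = b, str(c).zfill(m)             # swap halves, keep digit count
    return a + b


def toy_fpe_decrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Invert toy_fpe_encrypt by running the rounds backwards."""
    assert digits.isdigit() and len(digits) >= 2 and rounds % 2 == 0
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in reversed(range(rounds)):
        m = u if i % 2 == 0 else v
        prev_a = (int(b) - _round_value(key, i, a, 10 ** m)) % 10 ** m
        a, b = str(prev_a).zfill(m), a
    return a + b


key = b"demo-key"                             # hypothetical key
masked = toy_fpe_encrypt(key, "1000025")      # hypothetical 7-digit value
restored = toy_fpe_decrypt(key, masked)
```

The masked value keeps the length and digit-only format of the input, which is the property that lets such output be stored in the original database column and later passed on as an integer.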

Paillier Homomorphic Encryption is adopted as
the building block of this privacy-preserving system. The
Paillier cryptosystem is a probabilistic asymmetric algorithm
for public key cryptography with an additive homomorphic
property. It was invented in 1999 by Pascal Paillier.
The output of the Format preserving, Feistel based
encryption (FFX) will be used as the input for Paillier
Homomorphic Encryption. This cryptographic technique has
three stages which are Key Generation, Encryption and
Decryption. The key notations and definitions corresponding
to them are given below [6]:
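Key Generation: Consider p and q as two independent
large prime numbers.
We compute $N = p \cdot q$ and $\lambda = \mathrm{lcm}(p - 1, q - 1)$.
Then define a function $L(x) = \frac{x - 1}{N}$.
Choose an integer $g \in \mathbb{Z}^{*}_{N^2}$ whose order is a multiple of $N$
and compute $\mu = \left( L(g^{\lambda} \bmod N^2) \right)^{-1} \bmod N$.
The public key is now given by $PK = (N, g)$
and the private key is $SK = (\lambda, \mu)$.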
Encryption:
Let $m \in \mathbb{Z}_N$ be the plaintext and $r \in \mathbb{Z}_N$ be a random
number.
Then the ciphertext can be generated as
$C = E_{PK}(m) = g^{m} \cdot r^{N} \bmod N^2$
where $E_{PK}(\cdot)$ is the Paillier encryption of plaintext m with
random number r modulo $N^2$.
Decryption:
Given a ciphertext C, the plaintext m can be derived by
the following equation
$m = D_{SK}(C) = L(C^{\lambda} \bmod N^2) \cdot \mu \bmod N$
Additive Homomorphism:
Given two plaintexts x and y encrypted under the same
public key PK, the product of the two ciphertexts
$E_{PK}(x)$ and $E_{PK}(y)$ is equal to a ciphertext of the sum of
the two plaintexts.
$E_{PK}(x) \cdot E_{PK}(y) = (g^{x} r_1^{N} \bmod N^2) \cdot (g^{y} r_2^{N} \bmod N^2)$
$E_{PK}(x) \cdot E_{PK}(y) = g^{x+y} \, r_1^{N} r_2^{N} \bmod N^2$
$E_{PK}(x) \cdot E_{PK}(y) = E_{PK}(x + y)$
Scalar-Multiplicative Homomorphism:
Given a constant $c \in \mathbb{Z}_N$, the ciphertext $E_{PK}(x)$
raised to the power of c is equal to an encryption of the
product of the constant and the plaintext.
$E_{PK}(x)^{c} = (g^{x} r^{N} \bmod N^2)^{c}$
$E_{PK}(x)^{c} = g^{xc} \, r^{cN} \bmod N^2$
$E_{PK}(x)^{c} = E_{PK}(x \cdot c)$
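The key generation, encryption, decryption and the two homomorphic properties above can be checked numerically with a minimal sketch. It assumes toy primes (293 and 433) that are far too small for real use, and it fixes g = N + 1, a common choice whose order is N; neither choice is taken from the implementation used in this project.

```python
import math
import random


def L(x: int, n: int) -> int:
    """L(x) = (x - 1) / N, computed with integer division."""
    return (x - 1) // n


def keygen(p: int, q: int):
    """Return (PK, SK) = ((N, g), (lambda, mu)) for primes p and q."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    g = n + 1                                            # assumption: g = N + 1
    mu = pow(L(pow(g, lam, n * n), n), -1, n)            # modular inverse mod N
    return (n, g), (lam, mu)


def encrypt(pk, m: int) -> int:
    """C = g^m * r^N mod N^2 with a fresh random r coprime to N."""
    n, g = pk
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pk, sk, c: int) -> int:
    """m = L(C^lambda mod N^2) * mu mod N."""
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n


pk, sk = keygen(293, 433)            # toy primes, for illustration only
n_squared = pk[0] ** 2
cx, cy = encrypt(pk, 41), encrypt(pk, 17)

# Additive homomorphism: multiplying ciphertexts adds the plaintexts.
sum_plain = decrypt(pk, sk, (cx * cy) % n_squared)

# Scalar-multiplicative homomorphism: exponentiation multiplies by a constant.
scaled_plain = decrypt(pk, sk, pow(cx, 3, n_squared))
```

Here sum_plain recovers 41 + 17 and scaled_plain recovers 41 · 3, although the party computing on cx and cy never sees the plaintexts. Because r is random, encrypting the same value twice gives different ciphertexts.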
With 11 columns and 699 records, this is a clean and
concise dataset. The column Bare nuclei contained 16
records filled with '?'. These were replaced with NA and
then filled with the last valid value in the column using
the fillna() method in pandas. The ID column
that identifies breast cancer patients has been removed, as
including it would add no benefit to the classification
models, and removing it further protects patient identity.
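The two preprocessing steps can be sketched in plain Python as follows. This is an illustration of forward filling and of dropping the identifier, not the project's pandas code; the records, the column names and the helper forward_fill are hypothetical.

```python
def forward_fill(values, missing="?"):
    """Replace each missing marker with the last valid value seen before it.

    A missing marker at the very start has no predecessor and becomes None.
    """
    filled, last_valid = [], None
    for value in values:
        if value == missing:
            filled.append(last_valid)
        else:
            last_valid = value
            filled.append(value)
    return filled


# Hypothetical records in the shape of the dataset (ID, one feature, class).
records = [
    {"id": 1, "bare_nuclei": "1", "class": 2},
    {"id": 2, "bare_nuclei": "?", "class": 2},
    {"id": 3, "bare_nuclei": "10", "class": 4},
    {"id": 4, "bare_nuclei": "?", "class": 4},
]

# Step 1: forward-fill the '?' entries of the Bare nuclei column.
filled = forward_fill([r["bare_nuclei"] for r in records])

# Step 2: drop the ID column and convert the feature to an integer.
cleaned = [
    {"bare_nuclei": int(value), "class": r["class"]}
    for r, value in zip(records, filled)
]
```

Forward filling keeps every record in a small dataset, at the cost of copying a neighbouring patient's value into the gap.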
3. Data Exploration and Transformation:
Fig. 2: Exploratory Data Analysis

Exploratory data analysis is performed on all the columns
in the dataset to ensure that the data features fulfill the
requirements of the research question with no hidden
problems. A prominent reason to perform this analysis is to
detect any correlation between data columns.
This also ensures that expectations of the dataset are met
with statistical backing. Figure 2 visualizes the correlation
between all columns. The matplotlib library has been used for
the visualisation. The graph legend shows a gradual colour
gradient moving from bottom to top. A darker shade
indicates lower correlation, and a lighter shade indicates
higher correlation. The principal diagonal has been removed,
since every column is perfectly correlated with itself. We
observe that the correlation between variables is
significantly lower than the limit, except for the
columns uniformity of cell shape and uniformity of cell size.
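A correlation matrix with the principal diagonal removed can be computed as in the following sketch. It uses the Pearson coefficient, which is an assumption since the coefficient is not named above, and the three short columns are made-up values rather than data from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def correlation_matrix(columns):
    """Pairwise correlations; the diagonal is masked with None."""
    names = list(columns)
    return {
        a: {b: None if a == b else pearson(columns[a], columns[b]) for b in names}
        for a in names
    }


# Made-up values for three feature columns.
columns = {
    "cell_size": [1, 2, 3, 4, 5],
    "cell_shape": [2, 4, 6, 8, 10],
    "mitoses": [1, 1, 2, 1, 3],
}
matrix = correlation_matrix(columns)
```

In this toy input cell_shape is an exact multiple of cell_size, so their coefficient is 1, the kind of strongly correlated pair that a heatmap makes visible.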
4. Data Mining:
K-fold cross validation: Due to the imbalance in the dataset
(65.5% benign and 34.5% malignant), a k-fold cross validation
procedure has been implemented using the KFold class from the
scikit-learn library. The dataset is split into k subsets,
where k-1 subsets are used for training and the remaining
subset is used for testing. This is iterative in nature, and
every subset gets its turn to be the test set. Our
experiments showed an observable increase in initial accuracy
using k-fold cross validation when compared to a single
train-test split. 10-fold cross validation has been used, as
our experiments showed that 10 folds gave better validation
accuracy than 5 folds.
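The splitting scheme can be illustrated with a small pure-Python generator. It is a sketch of the idea without shuffling, not the scikit-learn implementation, and the name k_fold_indices is hypothetical.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for each of the k folds.

    The first n_samples % k folds receive one extra sample so that every
    sample is used for testing exactly once.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(699, 10))      # 699 records, 10 folds
test_sizes = [len(test) for _, test in folds]
```

With 699 records and 10 folds, nine test folds hold 70 records and one holds 69; the cross validation accuracy is then the mean of the ten per-fold accuracies.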
K-nearest neighbour (KNN):
The KNN algorithm is used to solve regression and
classification problems. It is a lazy learner: it builds no
explicit model during training and instead stores the
training data, deferring computation to prediction time.
Data is expected to be numerical and standardized, which is
important for running the KNN model.
Naive Bayes:
It is a probabilistic classifier derived from Bayes'
theorem, which works on the basis of prior probabilities.
Bayes' theorem is given in Equation 1 and the normalization
constant is given in Equation 2 [15].
$P(X_i \mid y) = \frac{P(y \mid X_i) \, P(X_i)}{P(y)}$ (1)
$P(y) = \sum_{i=1}^{4} P(y \mid X_i) \, P(X_i)$ (2)
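A short worked example shows how Equations 1 and 2 fit together. The four priors and likelihoods are invented numbers chosen only so that the arithmetic is easy to follow.

```python
def posteriors(priors, likelihoods):
    """Return P(X_i | y) for every i, following Equations (1) and (2)."""
    # Equation (2): normalization constant P(y).
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    # Equation (1): posterior of each X_i.
    return [l * p / evidence for l, p in zip(likelihoods, priors)]


priors = [0.4, 0.3, 0.2, 0.1]          # P(X_i), invented values
likelihoods = [0.1, 0.2, 0.5, 0.9]     # P(y | X_i), invented values
result = posteriors(priors, likelihoods)
most_likely = result.index(max(result))
```

Here P(y) = 0.04 + 0.06 + 0.10 + 0.09 = 0.29, so the third hypothesis has the largest posterior, 0.10 / 0.29, even though its prior is not the largest.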
Logistic Regression:
It is a predictive, probability-based machine
learning algorithm used to solve classification problems. The
logistic regression approach restricts its output to the
range 0 to 1 using the sigmoid activation function. The
solver that has been used for this implementation is
liblinear, as it performs best with small datasets.¹
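The role of the sigmoid function can be seen in a minimal sketch of a logistic regression prediction. The weights, bias and feature values are hypothetical; in practice the solver fits them to the training data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number into the interval (0, 1), numerically stable."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model and sample.
weights, bias = [0.8, -0.4], -1.0
probability = predict_probability(weights, bias, [3.0, 1.0])
label = 4 if probability >= 0.5 else 2     # 4 = malignant, 2 = benign
```

For this sample z = -1.0 + 2.4 - 0.4 = 1.0, so the probability is above 0.5 and the sample is assigned to the positive class.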
¹ https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148
Support Vector Classifier (SVC):
A Support Vector classifier is a discriminative classifier
formally defined by a separating hyperplane. It is
derived from the Support Vector Machine. The radial basis
function kernel, which is the default, has been used, as it
is easy to tune and provides a good benchmark for comparison.
The kernel coefficient applied is the reciprocal of the
number of features.
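The radial basis function kernel with this coefficient can be written out directly. The sketch assumes the usual form k(x, y) = exp(-gamma * ||x - y||^2) with gamma = 1 / number of features; the two sample vectors are made up.

```python
import math


def rbf_kernel(x, y):
    """RBF kernel with gamma set to the reciprocal of the feature count."""
    gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
near = rbf_kernel([1.0, 2.0, 3.0], [2.0, 2.0, 3.0])
far = rbf_kernel([1.0, 2.0, 3.0], [9.0, 9.0, 9.0])
```

The kernel equals 1 for identical points and decays towards 0 as points move apart, which is what lets the classifier draw non-linear decision boundaries.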
LBFGS:
L-BFGS is a conventional quasi-Newton method for
optimizing smooth functions of many variables. The solver
used for the respective model is lbfgs. The logistic function
is the activation function and 11 hidden layers are used.
Decision Tree:
Decision tree learning uses a decision tree as a predictive
model to proceed from observations about an item to
conclusions about its target value. This works best with data
that can be easily divided into categorical variables. The
criterion used to measure the quality of a split is set to
gini, for Gini impurity.
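Gini impurity, the split criterion named above, can be computed as in this sketch. The label lists are made-up examples using the class codes 2 and 4, and both helpers assume non-empty inputs.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


mixed = gini([2, 2, 4, 4])                    # evenly mixed node
pure = gini([2, 2, 2])                        # node with a single class
perfect = split_impurity([2, 2], [4, 4])      # split separating both classes
```

A tree chooses, at each node, the split with the lowest weighted impurity; an evenly mixed binary node scores 0.5 and a pure node scores 0.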
AdaBoost:
AdaBoost is a meta-estimator that first fits a classifier on
the original dataset, then fits additional copies of the
classifier on the same dataset, but with the weights of
incorrectly classified instances adjusted so that subsequent
classifiers concentrate more on difficult cases. 30
estimators have been used, with a decision tree chosen as
the base estimator.
Stochastic Gradient Boosting:
It constructs an additive model in a forward stage-wise
manner and enables arbitrary differentiable loss functions to
be optimized. 100 boosting stages were chosen for this
implementation.
Stochastic Voting Ensemble:
To merge the predictions of various kinds of models,
statistics such as the mean can be used. The models
that were selected for this ensemble are logistic regression,
decision tree and Support Vector Classifier.
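Two simple ways of merging predictions are sketched here: a majority vote over predicted labels and a mean over predicted probabilities. Which of the two the ensemble in this project used is not stated, so both are shown as assumptions, with made-up model outputs.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most models."""
    return Counter(predictions).most_common(1)[0][0]


def mean_probability(probabilities):
    """Average the positive-class probabilities of all models."""
    return sum(probabilities) / len(probabilities)


# Made-up outputs of logistic regression, decision tree and SVC for one sample.
hard_label = majority_vote([2, 4, 4])
soft_score = mean_probability([0.9, 0.4, 0.8])
soft_label = 4 if soft_score >= 0.5 else 2
```

With three models and two classes a majority vote can never tie, which is one practical reason to combine an odd number of classifiers.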
Bagged Decision Tree: It constructs multiple models of the
same type from different sub-samples of the training data.
Random forest: It is used for classification as well as
regression tasks. It is an ensemble method of decision tree
classifiers, each of which has a flowchart-like tree
structure.
Extra Trees: Working on the same principle as random
forests, it selects random subsets of features to split
nodes. The main differences between them are that sampling
is done without replacement, and that nodes are split on
random splits rather than best splits.
V. EVALUATION AND RESULTS
A. Cross validation accuracy result
This metric has been calculated using the parameters
classifier model, feature columns, class column and the
k-fold parameter. The cross validation results of all 10
folds are collected, and their mean is taken as the cross
validation accuracy. The cross_val_score method from the
model_selection module of scikit-learn has been used.

The table below shows the accuracy scores obtained for the
implemented models with plaintext and encrypted input data,
sorted by the column "After encryption", as this is the
column being evaluated.
Fig. 3: Test Accuracy Table
Fig. 4: Test Accuracy
B. F1 score
The confusion matrix was evaluated for every model, but
the most relevant metric for classification chosen was the F1
score. The F1 score is calculated with the below formula:
$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$ (3)
Here, precision is the ratio of true positives to all
predicted positives (true positives plus false positives),
and recall is the ratio of true positives to all actual
positives (true positives plus false negatives). The F1 score
is derived after considering all runs of k-fold cross
validation, and its weighted mean is calculated. Figure 5
shows the F1 scores obtained for the implemented models with
plaintext and encrypted input data, sorted by the column
"After Encryption".
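Equation 3 can be traced on a small example. The true and predicted labels are invented, with 4 (malignant) treated as the positive class; the helper name f1_score_binary is hypothetical.

```python
def f1_score_binary(y_true, y_pred, positive=4):
    """Return (precision, recall, F1) for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Invented labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [4, 4, 4, 2, 2, 2, 2, 4]
y_pred = [4, 4, 2, 2, 2, 4, 2, 4]
precision, recall, f1 = f1_score_binary(y_true, y_pred)
```

Precision and recall are both 3/4 here, so their harmonic mean, the F1 score, is also 0.75. The sketch assumes at least one predicted and one actual positive; otherwise the ratios are undefined.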
Fig. 5: F1 Score Table
Fig. 6: F1 Score
VI. CONCLUSION
From the results, we observe that logistic regression is
the best disease prediction model for this dataset: it
outperforms the other implemented models on encrypted data
without compromising patient privacy. Logistic
regression has the highest accuracy, with 96% for plain text
and 65% for encrypted text, followed by the support vector
classifier with very similar accuracies. The logistic
regression model works very well on this dataset, as it works
best on binary classification problems. We see that
encryption causes a proportional deterioration for every
model and does not have a detrimental effect on one
particular model. As the dataset has a slight imbalance, it
is more appropriate to compare the models on the basis of
their F1 scores. From the plot we can see that the logistic
regression model consistently performs well in
comparison to the other models. Due to the sensitivity of
healthcare data, it was difficult to obtain a large dataset
for this research. Future work could involve
more complex datasets, which would enable the use of neural
networks.

REFERENCES
[1] Z. Ma, J. Ma, Y. Miao, and X. Liu, “Privacy-preserving and
high-accurate outsourced disease predictor on random forest,”
Information Sciences, vol. 496, pp. 225–241, 2019.
[2] X. Liu, R. Lu, J. Ma, L. Chen, and B. Qin, “Privacy-Preserving
Patient-Centric Clinical Decision Support System on Naïve Bayesian
Classification,” IEEE Journal of Biomedical and Health Informatics,
vol. 20, no. 2, pp. 655–668, 2016.
[3] R. Bost, R. Ada Popa, S. Tu, and S. Goldwasser, “Machine learning
classification over encrypted data,” 01 2015.
[4] “A review on the state-of-the-art privacy-preserving approaches in the
e-health clouds,” IEEE Journal of Biomedical and Health Informatics,
no. 4, p. 1431, 2014.
[5] H. Chen, R. Gilad-Bachrach, K. Han, Z. Huang, A. Jalali, K. Laine,
and K. Lauter, “Logistic regression over encrypted data from fully
homomorphic encryption,” BMC Medical Genomics, no. S4, p. 3,
2018.
[6] M. D., L. R., S. V., V. V., and A. K. Sangaiah, “Hybrid reasoning-based
privacy-aware disease prediction support system,” Computers and
Electrical Engineering, vol. 73, pp. 114–127, 2019.
[7] H. Park, P. Kim, H. Kim, K.-W. Park, and Y. Lee, “Efficient machine
learning over encrypted data with non-interactive communication,”
Computer Standards & Interfaces, vol. 58, pp. 87–108, 2018.
[8] C. Zhang, L. Zhu, C. Xu, and R. Lu, “Ppdp: An efficient
and privacy-preserving disease prediction scheme in cloud-based
e-healthcare system,” Future Generation Computer Systems, vol. 79,
no. Part 1, pp. 16–25, 2018.
[9] K. Hariss, H. Noura, and A. E. Samhat, “Fully enhanced homomorphic
encryption algorithm of more approach for real world applications,”
Journal of Information Security and Applications, vol. 34, no. Part 2,
pp. 233–242, 2017.
[10] W.-j. Lu, S. Kawasaki, and J. Sakuma, “Using Fully Homomorphic
Encryption for Statistical Analysis of Categorical, Ordinal and
Numerical Data,” 2017.
[11] “Private equality test using ring-lwe somewhat homomorphic
encryption,” 2016 3rd Asia-Pacific World Congress on Computer
Science and Engineering (APWC on CSE), p. 1, 2016.
[12] Y. Rahulamathavan, S. Veluru, R. C. Phan, J. A. Chambers,
and M. Rajarajan, “Privacy-preserving clinical decision support
system using gaussian kernel-based classification,” IEEE Journal of
Biomedical and Health Informatics, vol. 18, no. 1, pp. 56–66, 2014.
[13] I. Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas,
and I. Chouvarda, “Machine learning and data mining methods
in diabetes research,” Computational and Structural Biotechnology
Journal, vol. 15, pp. 104–116, 2017.
[14] W. H. S. D. Gunarathne, K. D. M. Perera, and K. A. D. C. P.
Kahandawaarachchi, “Performance evaluation on machine learning
classification techniques for disease classification and forecasting
through data analytics for chronic kidney disease (ckd),” in 2017 IEEE
17th International Conference on Bioinformatics and Bioengineering
(BIBE), pp. 291–296, Oct 2017.
[15] J. M. A. A. K. M. N. S, “Ensemble approach for developing a smart
heart disease prediction system using classification algorithms,”
Research Reports in Clinical Cardiology, vol. 9, pp. 33–45, 2019.
[16] T. Morshed, D. Alhadidi, and N. Mohammed, “Parallel linear
regression on encrypted data,” in 2018 16th Annual Conference on
Privacy, Security and Trust (PST), pp. 1–5, Aug 2018.
[17] F. Teixeira, A. Abad, and I. Trancoso, “Patient privacy in paralinguistic
tasks,” pp. 3428–3432, 09 2018.
[18] H. Chen, R. Gilad-Bachrach, K. Han, Z. Huang, A. Jalali, K. Laine,
and K. Lauter, “Logistic regression over encrypted data from fully
homomorphic encryption,” BMC Medical Genomics, vol. 11, p. 81,
Oct 2018.
[19] A. Mir and S. N. Dhage, “Diabetes disease prediction using machine
learning on big data of healthcare,” in 2018 Fourth International
Conference on Computing Communication Control and Automation
(ICCUBEA), pp. 1–6, Aug 2018.
[20] K. Fritchman, K. Saminathan, R. Dowsley, T. Hughes, M. De Cock,
A. Nascimento, and A. Teredesai, “Privacy-preserving scoring of tree
ensembles: A novel framework for ai in healthcare,” in 2018 IEEE
International Conference on Big Data (Big Data), pp. 2413–2422,
Dec 2018.
[21] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “The kdd process for
extracting useful knowledge from volumes of data,” vol. 39, no. 11,
pp. 27–34, 1996.
[22] D. Dua and C. Graff, “UCI machine learning repository,” 2017.
[23] “Advanced e-voting system using paillier homomorphic encryption
algorithm,” 2016 International Conference on Informatics and
Computing (ICIC), p. 338, 2016.
[24] “A data masking scheme for sensitive big data based on
format-preserving encryption,” 2017 IEEE International Conference
on Computational Science and Engineering (CSE) and IEEE
International Conference on Embedded and Ubiquitous Computing
(EUC), p. 518, 2017.
[25] “Addendum to the ffx mode of operation for format-preserving
encryption,” p. 1, 2010.
VII. APPENDIX