outlined as one of the most crucial research directions related to per-
sonalized medicine [1,20,24,25]. CDSSs are tools that aid clinicians in
making more informed decisions, such as diagnosing various diseases.
CDSSs can be integrated into EHR systems and be a part of clinical
workflow and, as a result, help clinicians to stratify patients, diagnose
diseases, and identify the best candidates for various treatments. One of
the most promising directions in developing CDSSs is data mining and
predictive analytics using large-scale EHR data [26]. Many editorials and
commentaries such as Gupta & Sharda [20], Agarwal et al. [19],
Agarwal & Dhar [27], Chen et al. [3], Shmueli & Koppius [28], and
Baesens et al. [29] have discussed and noted the significance and value
of predictive analytics specifically in healthcare applications. As
Shmueli & Koppius [28] noted, when the goal of the research is the
predictability of a phenomenon, construct operationalization con-
siderations are trivial. Then, they introduced “assessing predictability
of empirical phenomena” as one of the contributions of predictive
analytics. Also, Agarwal & Dhar [27] argued that in the healthcare
domain, prediction could be even more important than explanation (or
causality), because of proven benefits of earlier diagnosis, intervention,
and treatment.
In EHR data, usually, there are hundreds of thousands, and some-
times millions of patients with many records and features (demo-
graphics, laboratory, medications, encounters, and outcomes) collected
over a long period of time. Therefore, the enormity and complexity of
EHR data pose significant research and practical challenges [3]. One of
the most critical challenges is dealing with many variables with very
high degrees of missing values [30]. If we consider the laboratory in-
formation, there are hundreds of different lab tests in EHR data; how-
ever, not every patient takes all of those lab tests. Therefore, for many
features (tests), the majority of values are missing (often at levels of
70% to 90% or more). Baesens et al. [29] mention,
“Access to big data and the tools to perform deep analytics suggests that
power now equals information (data) + trust.”
Then, they extensively discuss data quality as a critical and essential
topic that is frequently ignored in data analytics. Research has shown
that while many analytics and machine learning techniques might yield
comparable predictive performance, the best way to enhance this per-
formance is to work on the key element of analytics, data [31]. Data
completeness is a vital aspect of data quality [32,33]. Thus, in ana-
lyzing EHR data, we deal with data quality from the completeness point
of view [30]. Although many studies, especially in the fields of statistics
and machine learning, have focused on imputing missing values, they
only consider variables that have reasonable degrees of completeness
(roughly below 50% missing) and variables with very high degrees of
missingness (as high as 70 to 90% missing) are dropped before applying
imputation methods [30,34,35]. As a result, the challenge of dealing
with variables with very high degrees of missingness remains un-
answered. To address this challenge, a new framework that can be
applied to EHR data and other types of data with the same challenge of
having many variables with very high degrees of missing values is
proposed. This framework is called Missing Care. Using Missing Care,
data analytics researchers will be able to select and keep the most im-
portant variables at an acceptable missing values degree to use im-
putation approaches and develop predictive models with high pre-
dictive power.
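The missingness profile just described (hundreds of lab features, most values absent for most patients) can be inspected directly on a tabular extract. A minimal sketch with pandas, where the column names and the 70% threshold are purely illustrative:

```python
import numpy as np
import pandas as pd

# Toy EHR-style extract: most lab columns are sparsely populated.
df = pd.DataFrame({
    "age":      [71, 64, 58, 80, 69, 75],
    "sodium":   [138, np.nan, np.nan, 141, np.nan, np.nan],
    "caffeine": [np.nan, np.nan, np.nan, 2.1, np.nan, np.nan],
})

# Fraction of missing values per variable, sorted worst-first.
missing_frac = df.isna().mean().sort_values(ascending=False)
print(missing_frac)

# Variables in the problematic 70%-90%+ range discussed above.
very_sparse = missing_frac[missing_frac >= 0.7].index.tolist()
print(very_sparse)  # ['caffeine']
```
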
Moreover, the proposed framework, Missing Care, is applied to de-
velop a CDSS for detecting and screening for Parkinson's disease (PD).
PD is a chronic and progressive neurological disorder affecting more
than 10 million people worldwide. In the US alone, there are about
500,000 patients diagnosed with PD; however, given many un-
diagnosed or misdiagnosed cases, it is estimated that there are actually
about one million patients with PD in the US. Besides the patients them-
selves, PD affects thousands more spouses, family members, and
other caregivers [36]. The fact that the actual number of patients with
PD is twice as many as the number of diagnosed patients is a strong
indication of an urgent need for a more accessible diagnosis/screening
tool for this disease. Hence, developing tools and CDSSs for a more
accessible diagnosis of PD is vital. In the US, the total annual direct and
indirect costs of PD are about $52 billion [37]. There is no cure for PD
yet, but there are treatment options such as medications and surgery.
The current primary diagnostic method for PD is based on the sub-
jective opinion of neurologists reviewing patients' movement and
speech [38]. Researchers have discussed the challenge of healthcare
access, particularly in remote areas, and have urged the need for in-
novative solutions that are affordable and easy to implement [39–41].
Because of the limited specialty care access, many patients, especially in
remote and rural areas, may remain undiagnosed. The developed CDSS
can fill this gap and be a solution for the problem of care access in
remote areas and, as a result, alleviate the low diagnosis rate for PD.
In this study, by analyzing a unique, large EHR dataset including
demographic and routine lab test data, a CDSS for detecting PD is devel-
oped. In the development of this CDSS, an imbalanced dataset is ana-
lyzed using the "synthetic informative minority over-sampling" (SIMO)
approach [42], an over-sampling algorithm for imbalanced da-
tasets. As Von Alan et al. [43] discuss, design science is an essential
research paradigm that creates new and innovative artifacts (such as
models and systems) to address real-world, applied problems.
Harnessing big data in healthcare, using data mining techniques, is
consistent with the design science paradigm and has received a lot of
attention in recent years. The developed diagnosis/screening CDSS for
PD also belongs to this category of research.
This study's contribution is two-fold. First, to the best of my
knowledge, this is the first study that formally discusses the challenge of
having many variables with very high degrees of missing values in EHR
(and other similar) datasets and addresses it by introducing the Missing
Care framework. This framework addresses the issue of data quality as
well as trust in big data analytics tools, as discussed by Baesens et al.
[29]. Both quality and trust features are discussed in more detail in the
methodology section. And, second, from the precision medicine per-
spective, this study introduces a CDSS for detecting and screening for
PD, a neurological disease with a very low diagnosis rate. This CDSS can
be utilized as a tool integrated into EHR systems as well as an in-
dependent tool used by healthcare practitioners who are not necessarily
specialists, thereby making up for the limited access to specialized
care in rural and remote areas. The rest of this manuscript is organized
as follows. In the next section, the related literature is covered. Next, in
the methodology section, the Missing Care framework is introduced.
Following that, the data, as well as data pre-processing steps, are pre-
sented. Next, in the experiment results section, the results of the ana-
lysis in the case of PD are provided. And following that, a series of
robustness checks are presented. Finally, the findings, contributions,
and implications of this research are discussed.
2. Literature review
2.1. Predictive modeling
The primary purpose of predictive analytics is to predict the out-
come of interest for new cases rather than explaining the causal re-
lationships between features and the outcome [28]. Predictive analy-
tics, machine learning, and data mining have been extensively used in
the literature. Many researchers have applied predictive analytics and
machine learning in marketing and retail contexts, such as predicting
online customers' repeat visits [44], predicting consumers' purchase
timing and choice decisions [45], customer churn prediction [46] and,
marketing resource allocation [47]. Data mining and predictive ana-
lytics have also been used in finance-related areas. For instance, fi-
nancial fraud detection [48], evaluating firms' value [49] and pre-
dicting the type of entities in the Bitcoin blockchain [50].
S. Piri Decision Support Systems 136 (2020) 113339

Social networks, recommendation systems, and process events are
other fields that have gained interest from researchers. Examples in this
stream are predicting business process events such as early warning
systems [51], and predicting the probability that a social entity will
adopt a product [52]. Finally, many research works are related to
analyzing unstructured data, such as text, reviews, and blogs [53–56].
For a more extensive review of research in data mining, the readers are
referred to Trieu [57].
2.2. Healthcare data analytics
Kohli & Tan [22] extensively discuss how researchers can contribute
to healthcare transformation in two areas, integration and analytics,
using EHR. Healthcare analytics' primary goal is to predict
medical/healthcare outcomes such as diseases, hospital readmissions,
and mortality rates, using clinical and non-clinical data [26]. In
healthcare analytics, two types of data sources could be used. First, data
collected in clinical trials; these types of datasets are collected explicitly
for analysis purposes; however, they are usually small-sized and lim-
ited. The other source is secondary data in healthcare, such as EHR
data. Data analytics researchers typically deal with datasets from the
second source that has its own challenges.
Readmission is defined as being readmitted for the same primary
diagnosis within 30 days. Readmissions incur a huge preventable cost
to the US healthcare system. Many researchers have developed pre-
dictive models to identify and predict patients with a high risk of
readmission [34,58–60]. Another group of researchers has studied on-
line healthcare social platforms and used data analytics to investigate
and predict the patients' behavior and outcome. Examples are dis-
covering the health outcomes for patients with mental health issues in
an online health community [61], predicting the social support in a
chronic disease-focused online health community [62], and de-
termining individuals' smoking status [63].
Predicting and detecting adverse health events is another stream of
research in healthcare analytics. Lin et al. [26] proposed an approach,
which can provide multifaceted risk profiling for patients with co-
morbid conditions. Wang et al. [64] also proposed a framework to
predict multiple disease risk. Piri et al. [65] developed a CDSS to detect
diabetic retinopathy and proposed an ensemble approach to improve
the CDSS's performance further. Zhang & Ram [66] introduced a data-
driven framework that integrates multiple machine learning techniques
to identify asthma triggers and risk factors. Wang et al. [67] used ma-
chine learning to analyze patient-level data and identified patient
groups that exhibit significant differences in outcomes of cardiovascular
surgical procedures. Ahsen et al. [68] studied breast cancer diagnosis in
the presence of human bias. And finally, Hsu [69] proposed an attribute
selection method to identify cardiovascular disease risk factors.
Other related healthcare analytics works are on topics such as
treatment failures, clinical trials, and patients' pathways in hospitals. To
mention a few, Meyer et al. [70] proposed a machine learning approach
to improve dynamic decision making. They applied the proposed
method to predict treatment failures for type II diabetic patients.
Gómez-Vallejo et al. [71] developed a system to diagnose healthcare-
associated infections. Researchers have also used predictive modeling
to evaluate the kidney and heart transplant survival [72,73]. And, So-
manchi et al. [74] proposed models to predict whether emergency de-
partment patients will be admitted as inpatients or will be discharged.
2.3. Parkinson's disease
The current gold standard for diagnosing PD is a clinical evaluation
by a specialist. The criteria for diagnosis have been formalized by the
UK Parkinson's Disease Society Brain Bank [75]. Due to the disease
complexity, even the diagnosis by a specialist using the formal criteria
is not perfect, and the accuracy is about 90% [76]. Many researchers
have studied the association of PD with potential biomarkers and other
characteristics of patients. However, most of them are based on
studying only one single biomarker in very small sample sizes. For
instance, Fujimaki et al. [77] showed that caffeine level in the blood of
PD patients is lower compared to the control group. Feigenbaum et al.
[78] studied the possibilities of testing tears to find a specific protein
that has been shown to be an indication of PD. Arroyo-Gallego et al.
[79] studied 25 PD patients and 27 controls to detect PD based on
natural typing interaction. Another study analyzed the handwriting of
20 PD patients and 20 controls and performed a discriminant analysis
to classify the participants as PD or non-PD [80]. There have been other studies
applying machine learning in managing PD, such as using smartphones
to monitor PD patients' movements, calculating a score, and sending it
to doctors [81]. The significant difference between these types of works
and the current study is that this research focuses on the diagnosis
challenge in PD while these studies address the disease management for
currently diagnosed patients. There is no proven biomarker for PD [82];
therefore, personalized medicine based on big data analytics could be a
potential solution to PD diagnosis challenges.
In summary, none of the previous research studies in predictive
analytics and healthcare analytics introduced or used any formal pro-
cedure to address the challenge of having many variables with very
high degrees of missing values. And this study is the first to propose a
formal framework that addresses this common challenge in working
with EHR. Besides, to the best of my knowledge, this research, for the
first time, introduces a CDSS for detecting and screening for PD that
does not require any specific equipment or test and can be used by any
primary care provider or nurse. This CDSS can be used as a tool in-
tegrated into EHR systems or as a standalone personalized medicine
tool, especially in remote areas with limited access to specialists.
Moreover, while most of the existing studies only use balanced
datasets (balanced data are easier to learn from, but the resulting
models often perform poorly on real-world imbalanced data), in the
development of this CDSS, advanced imbalanced data
learning techniques are employed to simulate the realistic situation of
facing imbalanced distribution of patients with PD and those without
PD. Additionally, in this study, an ensemble approach is used to in-
tegrate multiple classifiers to achieve the highest possible accuracy in
detecting PD.
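SIMO [42] is the specific over-sampling algorithm used in this study. As a generic, hypothetical illustration of the over-sampling idea (not SIMO itself), the following sketch duplicates randomly chosen minority records until the classes are balanced:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority records until classes balance.
    A generic stand-in for more sophisticated methods such as SIMO."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    extra = rng.choice(np.flatnonzero(y == minority),
                       size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# One PD case per ten controls, mirroring the imbalanced setting above.
X = np.arange(22).reshape(11, 2)
y = np.array([1] + [0] * 10)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # [10 10]
```
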
3. Methodology: the Missing Care framework
In analyzing datasets similar to EHR data, we face the challenge of
having many variables, most of them with very high degrees of missing
values (as high as 70 to 90%). There are two extreme and immediate
solutions to this issue. One solution is to remove the many sparsely
populated records so that only records with reasonably complete values
remain. The other solution, which works in the opposite direction, is
first to remove the variables with very high degrees of missing values,
and then to remove the records with many missing values; the result is
a dataset with reasonable completeness that is suitable for imputation
approaches. By pursuing the first solution, we will lose numerous records
(in the case of EHR, records correspond to patients or encounter data).
And by taking the second solution, we will deprive ourselves of a lot of
variables (independent variables) that might indeed be strongly asso-
ciated with the target variable.
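The two extreme solutions above can both be written as simple completeness filters. A sketch with pandas on toy data (column names and thresholds are illustrative) that makes the trade-off visible, records lost versus variables lost:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [70, 61, 55, 82, 66],
    "sodium": [138, 140, np.nan, 141, np.nan],
    "lab_x":  [np.nan, np.nan, np.nan, 1.2, np.nan],
})

# Extreme solution 1: keep only records that are >= 80% complete.
sol1 = df[df.notna().mean(axis=1) >= 0.8]

# Extreme solution 2: drop variables > 50% missing, then sparse records.
dense_cols = df.columns[df.isna().mean() <= 0.5]
sol2 = df[dense_cols]
sol2 = sol2[sol2.notna().mean(axis=1) >= 0.8]

# Solution 1 sacrifices records; solution 2 sacrifices variables.
print(sol1.shape, sol2.shape)  # (1, 3) (3, 2)
```
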
One might say various imputation methods can be used to impute
missing values. In fact, there is an extensive literature on missing values
imputation methods in statistics and machine learning fields [83–87].
However, imputation methods are only appropriate when there is rea-
sonable completeness in each variable. Acuna & Rodriguez [88] men-
tioned "rates of less than 1% missing data are generally considered
trivial, 1-5% are manageable. However, 5-15% requires sophisticated
methods to handle, and more than 15% may severely impact any kind
of interpretation.” Many data analytics tools such as SAS Enterprise
Miner automatically (by default) remove variables with more than 50%
missing from further analysis and imputation. The reason is that to
impute missing values, we usually need to use the values of other re-
cords for the same variable, and when most of the values (or a con-
siderable portion of them) are missing, the imputation is not reliable.
For instance, in the data used in this research, the maximum missing
value degree was 98%, and many features had missing values around
60% to 90%, and imputing these variables with very high degrees of
missingness was not appropriate. It needs to be pointed out that Missing
Care is not a replacement for imputation methods. In fact, it is a pre-
processing framework that prepares the data for more meaningful use
of imputation methods when there is a reasonable degree of com-
pleteness in the data. In many fields, especially healthcare, even 50%
completeness for a variable is not acceptable.
Baesens et al. [29] extensively discuss data quality and trust in
analytics works. Data quality was elaborated on in the introduction
section, and here the “trust” part is discussed. The matter of trust in
both data and analytics approaches is a critical factor in implementing
DSS based on data analytics [29]. This is even more crucial in health-
care; for data analysts, working with physicians and clinicians entails
remarkable liability and trust. Even with a complete and clean dataset,
clinicians are reluctant to trust a CDSS's results, let alone if they learn
that the data have been heavily imputed. Therefore, to enhance data-
driven decision making in general, and in healthcare more specifically,
we need to gain the trust of organizations' leaders, and one of the
prerequisites is to limit the level of imputation.
This challenge is addressed by introducing the Missing Care frame-
work. Missing Care begins with the initial dataset, D, with p variables
and N records. Next, keeping all variables in the data, only the records
with a reasonable degree of completeness, say δ% (a suggested threshold
is about 85% to 90%, though it can differ based on the data character-
istics), are kept, and all other records are removed. At this stage, we
have dataset D1, containing all of the initial variables and only the
records with a high degree of completeness (N1 records); therefore, a
considerable number of records are removed. The next step is to identify
the variables (features or independent variables) that are strongly
associated with the target variable. In Section 3.1, a procedure is
recommended for computing an importance level for all the variables
and identifying the important ones.
Then, based on the final variables' importance (x_j^imp), the top p*
variables out of the initial p variables are identified and selected for
further analysis. In the next step, from the initial dataset D, only the
selected p* variables are kept, and all other variables are removed. At
this point, we have an updated dataset, D2, which includes only a subset
of the variables (those that are strongly associated with the target
variable) and all original records. The next step is to remove the records
with very high missing values and keep only the records with at least
δ% completeness. This dataset (D*) is the final dataset that will be used
to develop the predictive models; it has p* variables and N* records,
with N* > N1.
3.1. Computing variables' importance
Two data mining and machine learning methods for both classifi-
cation and regression problems are recommended to compute the
variables' importance. Here, it needs to be noted that classification
problems are the ones with a binary or categorical target variable, for
instance, when we want to detect and diagnose a disease or when we
want to predict the success or failure of a project. And regression pro-
blems are the cases with a numeric target variable, which is measur-
able, such as predicting the value of a house or predicting the length of
stay for a patient. At the end of this section, two Missing Care frame-
works are presented; one for regression problems, and one for classifi-
cation problems. If the problem is classification, logistic regression with
l1 regularization (the same regularization that lasso uses) and random
forests classifier [89] are used. And if the problem is regression, lasso
and random forests regressor are used. These methods are re-
commended for three reasons: first, one is representative of classical
statistics (regression), and the other one is a representative of more
advanced machine learning techniques (random forests); second, re-
gressions are highly interpretable and random forests are highly accu-
rate and capable of handling a large number of variables when there is a
relatively small number of records available [89,90]; and third, both
methods provide some sort of variable importance.
Both logistic regression with l1 regularization and lasso regression
train in a way to reduce the number of features in the predictive model
and reduce the chance of over-fitting [91], where the linear regression
is in the form of Eq. (1),

y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon \quad (1)
Lasso minimizes the following (Eq. (2)) to estimate the coefficients,

MSE + \alpha \sum_{j=1}^{p} |\beta_j| \quad (2)
MSE is the mean squared error, β_j are the coefficients of the p
features, and α is the parameter that adjusts the trade-off between
accuracy on the training data and regularization. A higher α forces
more of the β_j to zero (removing the corresponding features from the
model). Therefore,
when lasso (or logistic regression with l1 regularization) is used, during
the training process, important variables that are strongly associated
with the target variable are kept in the model and other less important
features will end up having zero coefficients, meaning they will be re-
moved from the model. An importance measure is assigned to each
feature included in the model. This importance is based on the R-square
reduction (denoted by R^2_{j,red}) after removing the variable from the
model; in the end, all of the importance measures are normalized. The
importance of variable j (denoted by Reg_x_j^imp) is calculated as in
Eq. (3),

Reg\_x_j^{imp} = \frac{R^2_{j,red}}{R^2_{1,red} + \dots + R^2_{p,red}} \quad (3)
The same is done for the logistic regression; however, instead of the
R-square reduction, the area under the curve (AUC) reduction is calcu-
lated. AUC is the area under the ROC (Receiver Operating
Characteristic) curve and takes values between 0 and 1. The AUC re-
duction after removing variable j is denoted by AUC_{j,red}, and variable
j's importance in a classification problem is calculated as in Eq. (4),

Reg\_x_j^{imp} = \frac{AUC_{j,red}}{AUC_{1,red} + \dots + AUC_{p,red}} \quad (4)
It needs to be noted that R^2_{j,red} and AUC_{j,red} for variables that
are not included in the model are equal to zero, and as a result, their
importance is zero as well.
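The drop-one metric-reduction importance just described can be sketched with scikit-learn. Here is the regression variant, where each variable selected by lasso is removed in turn, the loss in R-square is recorded, and the losses are normalized into importances (the data and the alpha value are illustrative; the classification variant would substitute AUC):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 1 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)   # variables kept by lasso
full_r2 = r2_score(y, lasso.predict(X))

# R-square reduction after removing each selected variable, as in Eq. (3).
red = {}
for j in selected:
    Xj = np.delete(X, j, axis=1)
    r2_j = r2_score(y, Lasso(alpha=0.05).fit(Xj, y).predict(Xj))
    red[j] = max(full_r2 - r2_j, 0.0)

total = sum(red.values())
importance = {int(j): r / total for j, r in red.items()}
print(importance)  # variable 0 dominates
```
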
In random forests models, the variable importance is calculated
based on impurity reduction when a variable is used for splitting [92].
The impurity for regression models is based on the variance at each
node and is calculated as shown in Eq. (5) and Eq. (6),

\bar{y}_m = \frac{\sum_{i \in N_m} y_i}{N_m} \quad (5)

v_m = \frac{\sum_{i \in N_m} (y_i - \bar{y}_m)^2}{N_m} \quad (6)
where y_i is the label (target variable value) for record i and v_m is the
variance at node m, which has N_m observations. When variable j is used
for the split at node m in a tree, the variance reduction (VR) is
calculated as in Eq. (7),

VR_j^m = v_m - w_{left}\, v_{left_m} - w_{right}\, v_{right_m} \quad (7)
where w_{left} and w_{right} are the proportions of the records in each
leaf. Variable j's importance at each tree is calculated based on the
proportion of the variance reduction from splits using variable j to the
total variance reduction over all splits, as shown in Eq. (8),

VR\_x_j^{imp} = \frac{\sum_{m \in \text{nodes split using variable } j} VR_j^m}{\sum_{m \in \text{all nodes}} VR_m} \quad (8)
Next, VR_x_j^imp is normalized using all variables' importance in the
tree (Eq. (9)),

N\_VR\_x_j^{imp} = \frac{VR\_x_j^{imp}}{\sum_{i=1}^{p} VR\_x_i^{imp}} \quad (9)
Finally, the importance of variable j in the random forests model,
denoted by RF_x_j^imp, is computed as the average normalized variance
reduction over all trees generated in the random forests model, as in
Eq. (10),

RF\_x_j^{imp} = \frac{\sum_{\text{all trees in RF}} N\_VR\_x_j^{imp}}{\text{number of trees in RF}} \quad (10)
In classification models, the impurity is the Gini index at each node
(Eq. (11)),

Gini_m = \sum_{k=1}^{K} p_{mk} (1 - p_{mk}) \quad (11)
where p_{mk} is the proportion of observations belonging to class k at
node m (there are K classes in the data). When variable j is used for the
split at node m in a tree, the Gini reduction (GR) is calculated as in
Eq. (12),

GR_j^m = Gini_m - w_{left}\, Gini_{left_m} - w_{right}\, Gini_{right_m} \quad (12)
where w_{left} and w_{right} are the proportions of the records in each
leaf. Variable j's importance at each tree is calculated based on the
proportion of the Gini reduction from splits using variable j to the total
Gini reduction over all splits (Eq. (13)),

GR\_x_j^{imp} = \frac{\sum_{m \in \text{nodes split using variable } j} GR_j^m}{\sum_{m \in \text{all nodes}} GR_m} \quad (13)
Next, GR_x_j^imp is normalized using all variables' importance in the
tree (Eq. (14)),

N\_GR\_x_j^{imp} = \frac{GR\_x_j^{imp}}{\sum_{i=1}^{p} GR\_x_i^{imp}} \quad (14)
Framework 1
Missing Care framework for regression problems.
Given: p, N, δ, α
1. From D, keep all p variables and only the records that have at least δ% completeness: D1 (p variables & N1 records)
2. Train a lasso regression on D1
3. Using an appropriate value for α, identify the p1 variables with non-zero coefficients
4. Calculate R^2_{j,red} for all variables in p1
5. Normalize the values of R^2_{j,red} and calculate the variable importance Reg_x_j^imp using Eq. (3)
6. Train a random forests regressor on D1
7. Calculate the variance reduction VR_x_j^imp for all variables at each tree
8. Normalize the variance reductions using Eq. (9)
9. Calculate the variable importance RF_x_j^imp based on the normalized variance reductions over all trees in the random forests, using Eq. (10)
10. Calculate the ultimate variable importance x_j^imp using Eq. (16)
11. Identify the top p* variables out of the initial p variables as selected variables, based on x_j^imp
12. From D, keep only the selected p* variables and all records: D2 (p* variables & N records)
13. From D2, keep only the records with at least δ% completeness: D* (p* variables & N* records, N* > N1)
14. D*: the final dataset with the selected variables and records
Finally, the importance of variable j in the random forests model,
denoted by RF_x_j^imp, is computed as the average normalized Gini
reduction over all trees generated in the random forests model, as in
Eq. (15),

RF\_x_j^{imp} = \frac{\sum_{\text{all trees in RF}} N\_GR\_x_j^{imp}}{\text{number of trees in RF}} \quad (15)
At this stage, the variables' importance in both the regression and
random forests models are available. Next, a weight is assigned to each
variable importance based on the models' R-square for regression
problems and the models' AUC for classification problems, denoted by
w_{Rsq}^{Reg}, w_{Rsq}^{RF}, w_{AUC}^{Reg}, and w_{AUC}^{RF},
respectively. The final importance of variable j is calculated as in
Eq. (16) and Eq. (17),

x_j^{imp} = \frac{w_{Rsq}^{Reg}\, Reg\_x_j^{imp} + w_{Rsq}^{RF}\, RF\_x_j^{imp}}{w_{Rsq}^{Reg} + w_{Rsq}^{RF}} \quad \text{for regression problems} \quad (16)

x_j^{imp} = \frac{w_{AUC}^{Reg}\, Reg\_x_j^{imp} + w_{AUC}^{RF}\, RF\_x_j^{imp}}{w_{AUC}^{Reg} + w_{AUC}^{RF}} \quad \text{for classification problems} \quad (17)
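Eqs. (16) and (17) are performance-weighted averages of the two normalized importance vectors. A direct sketch, with illustrative importance vectors and model-level weights:

```python
import numpy as np

def combine_importance(reg_imp, rf_imp, w_reg, w_rf):
    """Eq. (16)/(17): average the two models' normalized importances,
    weighted by each model's performance (R-square or AUC)."""
    reg_imp, rf_imp = np.asarray(reg_imp), np.asarray(rf_imp)
    return (w_reg * reg_imp + w_rf * rf_imp) / (w_reg + w_rf)

# Illustrative lasso/logistic importances, RF importances, and weights.
reg_imp = [0.7, 0.3, 0.0]
rf_imp  = [0.5, 0.3, 0.2]
x_imp = combine_importance(reg_imp, rf_imp, w_reg=0.62, w_rf=0.80)
print(np.round(x_imp, 3))  # still sums to 1
```
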
Framework 2
Missing Care framework for classification problems.
Given: p, N, δ, α
1. From D, keep all p variables and only the records that have at least δ% completeness: D1 (p variables & N1 records)
2. Train a logistic regression with l1 regularization on D1
3. Using an appropriate value for α, identify the p1 variables with non-zero coefficients
4. Calculate AUC_{j,red} for all variables in p1
5. Normalize the values of AUC_{j,red} and calculate the variable importance Reg_x_j^imp using Eq. (4)
6. Train a random forests classifier on D1
7. Calculate the Gini reduction GR_x_j^imp for all variables at each tree
8. Normalize the Gini reductions using Eq. (14)
9. Calculate RF_x_j^imp based on the normalized Gini reductions over all trees in the random forests, using Eq. (15)
10. Calculate the ultimate variable importance x_j^imp using Eq. (17)
11. Identify the top p* variables out of the initial p variables as selected variables, based on x_j^imp
12. From D, keep only the selected p* variables and all records: D2 (p* variables & N records)
13. From D2, keep only the records with at least δ% completeness: D* (p* variables & N* records, N* > N1)
14. D*: the final dataset with the selected variables and records
Formal pseudo-code for Missing Care for both regression and clas-
sification problems are shown in Framework 1 and 2, and the notations
are listed in Table 1.
An important point needs to be noted here. Even though the
recommended techniques led to better performance compared to other
variable selection methods, such as relative importance in neural
networks and stepwise regression, this will not necessarily be the case
for all settings and datasets. Therefore, it is best if analysts experiment
with various variable selection methods and apply the one that provides
the best results for their data.
4. Data
In this section, the data, and also the data pre-processing and pre-
paration steps that were taken before developing the final predictive
models are described. In this research, a unique dataset retrieved from
the largest relational healthcare data warehouse in the US, Cerner Health
Facts®, is used. Health Facts® contains more than two decades of data on
84 million unique patients in 133 million encounters from over 500
health care facilities across the US [93]. The initial data included mil-
lions of records and hundreds of variables across various tables that
needed to be integrated, cleaned, and pre-processed. Fig. 1 is a sim-
plified diagram for this data, depicting the tables that are included in it
and how they are related. In this diagram, primary and foreign keys
that are used to merge various tables are specified; the primary key in
each table is bold and italic, and the foreign key is underlined. The
patient table contains demographic information such as age, race,
marital status, and gender. The medication table holds a complete set of
variables on the medications that patients take; examples are medica-
tion name/ID, dosage, ordering physician, order date, unit costs, etc.
The diagnosis table is home to the diagnosis information (ICD 9 and 10
codes). The encounter table contains all the variables related to the
patients' visits. Variables such as admission and discharge dates, ad-
mitting physician, admission type, and admission source are in this
table. Information about the patients' vital signs such as blood pressure,
temperature, and respiratory rate and their respective collection date
and time is stored in the clinical event table. And finally, the lab pro-
cedure table contains all the information about the patients' lab tests.
This table includes variables such as lab name, lab completion date, and
results. Diagnosis and medication tables are used to label patients in the
PD and control groups. And, variables in the patient, clinical event, and
lab procedure tables are used in developing the models.
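As an illustration of how these tables can be merged on their primary/foreign keys, the sketch below joins toy patient, encounter, and lab tables with pandas; the column names are assumptions and do not reflect the actual Health Facts® schema.

```python
import pandas as pd

# Toy tables; column names are assumptions, not the actual Health Facts schema.
patient = pd.DataFrame({"patient_id": [1, 2], "gender": ["F", "M"], "age": [70, 65]})
encounter = pd.DataFrame({"encounter_id": [10, 11, 12],
                          "patient_id": [1, 1, 2],
                          "admit_date": ["2010-01-05", "2011-03-02", "2012-07-09"]})
lab = pd.DataFrame({"encounter_id": [10, 12],
                    "lab_name": ["Glucose Serum", "Sodium"],
                    "result": [104.0, 139.0]})

# patient_id links patient to encounter; encounter_id links encounter to labs.
# A left join keeps encounters that have no lab records (NaN lab columns).
merged = (encounter
          .merge(patient, on="patient_id", how="inner")
          .merge(lab, on="encounter_id", how="left"))
```

The left join matters: dropping encounters without lab records at this point would silently discard patients, which is handled as an explicit, documented step later in the paper.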
Data cleaning/pre-processing, especially for EHR, is a very critical
and time-consuming task, and therefore a great deal of time and con-
sideration was dedicated to it. This process also involved consulting
with medical professionals to incorporate their expertise. Data retrieval
from Health Facts® was based on ICD-9 (International Classification of
Diseases) and ICD-10 diagnosis codes. An imbalanced dataset³ is used in
this study to have a fair and factual setting, since in the real-world patient
population, only a small percentage of patients have PD. In fact, a
limitation of many healthcare analytics studies is that they consider
somewhat balanced data in their analysis. For the PD group, all patients
with either ICD-9 or ICD-10 PD diagnosis codes were extracted, and
that yielded data of 83,393 unique patients from about 1 million
encounters. For the control group, data for a set of patients ten times
the size of the PD group was extracted, yielding 833,921
unique patients with more than 5 million encounters.
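A minimal sketch of the ICD-based cohort split, assuming a flat diagnosis table and using two well-known PD codes (ICD-9 332.0 and ICD-10 G20) purely for illustration; the study used the full PD code sets.

```python
import pandas as pd

# Hypothetical diagnosis records; 332.0 (ICD-9) and G20 (ICD-10) are PD codes,
# shown here as a tiny subset of the full code sets used in the study.
diagnosis = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "icd_code":   ["332.0", "G20", "250.00", "E11.9", "401.9"],
})
PD_CODES = {"332.0", "G20"}

# Patients with any PD code form the PD group; everyone else is a control.
pd_patients = set(diagnosis.loc[diagnosis["icd_code"].isin(PD_CODES), "patient_id"])
control_patients = set(diagnosis["patient_id"]) - pd_patients
```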
Throughout the process, parts of data were discarded because not all
information in various tables was available for all patients. The process
started with merging the encounter⁴ and diagnosis data tables, and this
led to having data on 442,556 unique patients, out of which 83,393
were in the PD group and the rest in the control group. This data was
Table 1
Notations for the Missing Care framework.
D: initial dataset
p: number of variables in the initial data
N: number of records in the initial data
δ: record completeness percentage threshold
α: l1 regularization parameter
Notations for regression problems:
R²_red,j: R-square reduction when variable j is removed from the lasso model
Reg_x_j^imp: variable j's importance in the lasso model
VR_x_j^imp: variance reduction when variable j is used for splits in a tree
N_VR_x_j^imp: normalized variance reduction for variable j in a tree
RF_x_j^imp: variable j's importance in the random forests regressor model
wRsq_Reg: R-square of the lasso regression model
wRsq_RF: R-square of the random forests regressor model
Notations for classification problems:
AUC_red,j: AUC reduction when variable j is removed from the logistic regression model
Reg_x_j^imp: variable j's importance in the logistic regression model
GR_x_j^imp: Gini reduction when variable j is used for splits in a tree
N_GR_x_j^imp: normalized Gini reduction for variable j in a tree
RF_x_j^imp: variable j's importance in the random forests classifier model
wAUC_Reg: AUC of the logistic regression model
wAUC_RF: AUC of the random forests classifier model
x_j^imp: ultimate importance of variable j with regard to the target variable
Fig. 1. Cerner data diagram.
³ In an imbalanced dataset, the number of records belonging to one class (the majority class) outnumbers the number of records belonging to the other class (the minority class).
⁴ An encounter is a hospital or clinic visit. The encounter table contains all of the information that is specific to that visit (encounter). The encounter ID is unique within this table. The same patient can have more than one record in this table over time.
S. Piri Decision Support Systems 136 (2020) 113339
from about 3 million encounters for these patients. The imbalanced
ratio (number of patients in the PD group divided by the total number
of patients) for this data was about 18.8%. In the next step, the PD
group was double-checked against the ICD-9 and ICD-10 diagnosis codes
for PD, and 4203 patients were removed from the PD group because
they did not have an ICD-9 or ICD-10 diagnosis code for PD. To be as close
as possible to the first PD diagnosis, the first encounter with PD diag-
nosis was kept, and other encounters in the PD group were removed.
This action did not change the number of patients; it only
decreased the number of encounters. Next, the lab procedure table
was cleaned and merged with the master data. Combining lab data led
to removing 40,943 PD patients and 197,951 patients in the control
group because there was no lab data for these groups of patients. At this
stage, the imbalanced ratio was 19.2%.
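Keeping only the earliest PD-diagnosis encounter per patient, as described above, can be sketched with a sort-and-group operation; the toy table and column names are assumptions.

```python
import pandas as pd

# Toy PD-group encounters; keeping the earliest encounter per patient
# approximates the first PD diagnosis. Column names are illustrative.
enc = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2],
    "admit_date": pd.to_datetime(["2012-05-01", "2011-02-01",
                                  "2013-01-01", "2013-06-01", "2012-12-01"]),
})

# Sort chronologically, then take the first row of each patient's group:
# the patient count is unchanged, only the encounter count shrinks.
first_enc = (enc.sort_values("admit_date")
                .groupby("patient_id", as_index=False)
                .first())
```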
Considering only ICD codes to form the PD and control groups
might not be entirely accurate because of potential errors such as data
entry. To avoid this issue and ensure all patients in the PD group have
PD and patients in the control group do not have PD, an extra step
besides checking for ICD codes was taken. Reviewing the American
Parkinson Disease Association (APDA) website (www.apdaparkinson.org),
a list of medications that PD patients could take was formed (see
Appendix A). Then, using the Health Facts® medication table that stores
the information of all medications patients take, only patients in the PD
group that take PD medications were kept, and others were removed.
Additionally, all patients in the control group that take any PD medi-
cation were excluded. Conducting this two-stage identification using
ICD codes and medications, it was confirmed with a high degree of
confidence that all patients in the PD group have PD and all patients
in the control group do not. At this stage, the data was
prepared to develop predictive models. This data included 15,669 unique
PD patients and 160,722 unique patients in the control group, with an
imbalanced ratio of 8.9%. This data is called Master I. Data selection
steps are summarized in Fig. 2. There were many other data cleaning/
pre-processing steps that were taken to deal with various variables and
tables in the data. These steps are briefly depicted in Fig. 3.
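The second identification stage can be sketched as two set operations over a medication table: intersect the ICD-identified PD group with patients on PD medications, and subtract those patients from the control group. The mini-tables and group memberships below are illustrative.

```python
import pandas as pd

# Hypothetical medication records; drug names follow the Appendix A list.
meds = pd.DataFrame({"patient_id": [1, 2, 3, 4],
                     "medication": ["Carbidopa-Levodopa", "Aspirin",
                                    "Ropinirole", "Pramipexole"]})
PD_MEDS = {"Carbidopa-Levodopa", "Pramipexole", "Ropinirole"}  # subset of Appendix A

on_pd_meds = set(meds.loc[meds["medication"].isin(PD_MEDS), "patient_id"])

pd_group = {1, 2, 3}       # ICD-identified PD patients (illustrative)
control_group = {4, 5, 6}  # ICD-identified controls (illustrative)

# Stage 2: keep PD patients who also take a PD medication,
# and exclude controls who take any PD medication.
pd_group_final = pd_group & on_pd_meds
control_group_final = control_group - on_pd_meds
```

Here patient 2 drops out of the PD group (no PD medication) and patient 4 drops out of the control group (takes a PD medication), mirroring the two exclusions described above.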
After having Master I data in hand, Missing Care was applied to have
the final data that contains all important variables with a reasonable
degree of missing values. Master I included 81 variables, out of which
many had very high missing values (as high as 98% missing).
Employing Missing Care, the majority of records with high missing values
were removed to keep as many variables as possible with at most 30%
missing. This yielded a dataset with 3705 records and 81 variables that is
is much smaller than Master I. Then, Missing Care was applied to this
data to identify variables strongly associated with the target variable.
After following the Missing Care guidelines, 30 variables were identified as
important. Then, going back to Master I and keeping only
these 30 variables, records with very high missing values were removed
to reach a reasonable degree of missingness for the selected 30 variables.
This resulted in a dataset with 15,000 records, out of which 2000 were
PD, and the rest belonged to the control group. This data, called Master
II, was used to develop the final predictive models. Missingness in
Master II was acceptable (maximum missing value was 37%) to apply
imputation methods. A descriptive analysis of variables in Master II for
both PD and control groups is available in Appendix B. Not employing
Missing Care would lead to losing many variables with a strong association
with the target. Without applying Missing Care, only the variables
with at most 35% missingness (the same missingness threshold
that was used for Master II) are kept, resulting in a dataset with
many records (113,759, considerably more than Master II's records).
However, this data has only 7 variables, because most of the
variables have missingness levels of 60% to 90%. This data is called
Master III, and it is used to develop models and compare their
performance with the models built on Master II. The data obtained using
the Missing Care procedure, compared to not using it, is shown in Fig. 4.
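A simplified sketch of the two-step filtering described above (not the full Missing Care framework): first keep records whose completeness meets a threshold δ, then keep variables whose missingness in the filtered data is at most 30%. Data and thresholds are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic data: columns d and e get heavy injected missingness.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))
df.loc[df.sample(frac=0.8, random_state=1).index, "d"] = np.nan
df.loc[df.sample(frac=0.9, random_state=2).index, "e"] = np.nan

# Step 1 (record filter): keep records whose completeness meets a threshold δ.
delta = 0.8
df_records = df[df.notna().mean(axis=1) >= delta]

# Step 2 (variable filter): keep variables with at most 30% missingness in the
# record-filtered data, mirroring the 30% cap described for variable selection.
keep_cols = df_records.columns[df_records.isna().mean() <= 0.30]
df_selected = df_records[keep_cols]
```

The key effect is that a variable's missingness is re-measured after the record filter, so a variable that looks hopeless in the raw data can survive selection.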
5. Experiment results
In this section, the experimental results of predictive models de-
veloped by employing Missing Care and also not employing Missing Care
are provided. Logistic regression (LR), linear support vector machine
(SVM-L), support vector machine with RBF kernel (SVM-RBF), neural
networks (NN), random forests (RF), and gradient boosting (GB) are
used to develop predictive models. For SVM and NN, which are sensitive
to the variables' scale, all variables are transformed to a 0 to 1 scale. The
parameters for each model have been tuned to get the best possible
results, and 5-fold cross-validation is used to minimize the effect of data
partitioning bias. First, the results of the models built without using
Missing Care, after over-sampling the data with SIMO, are presented. Then,
the results of the models developed by applying Missing Care to the
SIMO-oversampled data are presented. Next, the results of the two sets
of models are compared, and the effectiveness of the Missing Care
framework is shown. Table 2 shows the results of the models when Missing
Care is not applied, and Table 3 presents the models after employing
Missing Care. As shown, using Missing Care led to a significant
improvement in the models' performance (5% to 7% AUC increase across
various models). Table 4 depicts the significance of this improvement at
the level of 99% confidence. The reason is that Missing Care addresses
data quality as a critical factor in analytics. Without using Missing Care,
many important variables with high predictive power would be dis-
carded from further analysis, and this would deteriorate the models'
performance. The improvement across various models is depicted in
Fig. 5 as well. It needs to be pointed out that after (and during) applying
Missing Care, the data still has missing values. However, the degree of
missingness at these stages is manageable by applying imputation
methods. In this study, mean and mode were used to impute the missing
values for the numeric and categorical variables, respectively. How-
ever, any other imputation methods can be used.
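The preparation steps mentioned here (mean imputation, 0-1 scaling for scale-sensitive learners, 5-fold cross-validated AUC) can be sketched as a scikit-learn pipeline on synthetic imbalanced data; this is a generic illustration, not the study's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data (~10% positives) with ~20% injected missing values.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.2] = np.nan

# Mean imputation, 0-1 scaling (needed for scale-sensitive learners),
# then a classifier; AUC is estimated with 5-fold cross-validation.
pipe = make_pipeline(SimpleImputer(strategy="mean"),
                     MinMaxScaler(),
                     LogisticRegression(max_iter=1000))
aucs = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
mean_auc = aucs.mean()
```

Fitting the imputer and scaler inside the pipeline ensures they are learned on each training fold only, avoiding leakage into the validation folds.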
Among various machine learning techniques, random forests and
gradient boosting had the best performance. AUCs for these two models
after applying Missing Care on Master II SIMO-oversampled data are
84.3% and 84.63%, respectively. Therefore, for the rest of the analysis,
the focus is on these two modeling techniques. As was mentioned in the
previous sections, imbalanced data is used in this study. There are
various remedies for imbalanced data learning problems, one of which
is synthetic informative minority over-sampling (SIMO) with two ver-
sions, SIMO and W-SIMO (Weighted-SIMO). Another popular over-
sampling technique is SMOTE (Synthetic Minority Over-sampling
Technique) [94], and a common under-sampling approach is random
under-sampling (RUS). To obtain the best results, the performance of the
predictive models after using SIMO, W-SIMO, SMOTE, and RUS is
evaluated. The results of the models based on various over-sampling
and under-sampling techniques are presented in Table 5 and Fig. 6. In
this study, models trained based on over-sampling techniques out-
performed models based on RUS. And, among over-sampling methods,
SIMO was the best. The significance of the improvement gained from
employing SIMO and W-SIMO compared with SMOTE and RUS is
shown in Table 6. As evident, not all differences are statistically sig-
nificant. The results presented are the performance of the models on the
validation data, which is in the original imbalanced distribution of the
initial dataset.
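To illustrate the two families of remedies, the sketch below implements random under-sampling and a SMOTE-style interpolation in plain NumPy. Unlike true SMOTE, which interpolates toward k-nearest minority neighbors, this simplified version pairs each minority point with a random minority partner.

```python
import numpy as np

rng = np.random.default_rng(0)
X_majority = rng.normal(0.0, 1.0, size=(90, 3))  # toy majority class
X_minority = rng.normal(2.0, 1.0, size=(10, 3))  # toy minority class

# Random under-sampling (RUS): shrink the majority class to the minority size.
keep = rng.choice(len(X_majority), size=len(X_minority), replace=False)
X_majority_rus = X_majority[keep]

# SMOTE-style over-sampling: synthesize new minority points on the line
# segment between two minority samples (true SMOTE uses k-nearest neighbors).
def smote_like(X, n_new, rng):
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    lam = rng.random((n_new, 1))
    return X[i] + lam * (X[j] - X[i])

X_minority_os = np.vstack([X_minority, smote_like(X_minority, 80, rng)])
```

RUS discards majority information, while over-sampling preserves it at the cost of synthetic points, which is consistent with the over-sampling methods performing better in this study.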
To further improve the accuracy of the models, multiple predictive
models were combined using an ensemble approach called confidence
margin ensemble [65]. Ensemble approaches are most beneficial
when different models, with relatively similar (and good)
performance, are combined. As the performance of the RF and GB models was
much better than LR, SVM, and NN, only RF and GB models were used
to develop the confidence margin ensemble models. First, the ensemble
models were developed using RF and GB built on each of the im-
balanced data learning techniques (SIMO, W_SIMO, SMOTE, and RUS).
The performance of confidence margin ensembles is shown in Table 7.
In each ensemble, there is a marginal improvement compared to the
individual models, RF and GB; this is observable in Fig. 6.
More models were integrated into the ensembles step by step to
create even more accurate models. In Table 8 and Fig. 7, the first
column (point) shows the ensemble of RF and GB models based on
SIMO-oversampled data. Next, RF and GB models based on W_SIMO-
oversampled data were added to the ensemble, and there was an im-
provement in the performance. In the next stage, RF and GB models
built based on SMOTE-oversampled data were added; at this stage,
there were six models in the ensemble, and there was an improvement
from AUC of 85.06% to 85.19%. Finally, RF and GB models based on
RUS data were added, and the AUC of the ensemble of 8 models was
85.29%. The final ensemble of the 8 models is used in the development
of the CDSS to diagnose and screen for PD.
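The confidence margin ensemble of [65] is not reproduced here; as a generic stand-in for combining RF and GB, the sketch below simply averages the two models' predicted probabilities (soft voting) and scores the combination by AUC.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Toy data standing in for the (oversampled) training data.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Average the two models' class-1 probabilities and score the combination.
p_ens = (rf.predict_proba(X_te)[:, 1] + gb.predict_proba(X_te)[:, 1]) / 2
auc_ens = roc_auc_score(y_te, p_ens)
```

Adding the W-SIMO, SMOTE, and RUS models, as done in Table 8, amounts to averaging more probability vectors into `p_ens`.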
6. Robustness checks
To ensure the effectiveness of Missing Care and the validity of the
proposed CDSS to detect PD, a series of robustness checks was conducted.
First, the effectiveness of Missing Care and the validity of the
CDSS were evaluated for datasets with various imbalanced ratios. These
datasets were formed by randomly removing portions of data belonging
to the PD group. In this way, more realistic datasets with lower ratios of
PD records were created. Table 9 shows the performance of the
CDSS in various imbalanced ratios. In each scenario, models are de-
veloped as the ensemble of RF and GB over SIMO, W-SIMO, SMOTE,
and RUS. This table illustrates the AUC values with and without ap-
plying Missing Care, the difference between the two scenarios, and the
difference significance.
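Creating a dataset with a target imbalance ratio by downsampling the PD group reduces to solving k / (k + n_control) = r for the number of PD patients k to keep; the counts below are illustrative stand-ins, not the paper's exact figures.

```python
def pd_count_for_ratio(n_control, target_ratio):
    """PD patients to keep so that PD / (PD + control) equals target_ratio."""
    # Solve k / (k + n_control) = target_ratio for k.
    return int(round(target_ratio * n_control / (1 - target_ratio)))

# Illustrative counts in the spirit of Master II (2000 PD, 13,000 control).
k = pd_count_for_ratio(n_control=13000, target_ratio=0.10)
```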
Second, to confirm the effectiveness of Missing Care, it was applied
to another disease, Diabetic Retinopathy (DR). DR is the most common
eye complication for diabetic patients, and about 28% of diabetic
patients experience this complication [95]. DR data is acquired from the
same source (Cerner) as PD, and data preparation steps, similar to what
has been done for PD data, are taken to pre-process the data. The data
for DR has different characteristics (number of rows, variables, and
missingness) compared to the data used for PD. Table 10 shows the
characteristics of DR data, and Table 11 depicts the statistically sig-
nificant improvement achieved by applying Missing Care.
7. Discussion
Many researchers have discussed the effectiveness and significance
of information systems (IS) and IT tools in improving the care for
patients and making care provision more efficient [3,12,25]. Widespread
adoption of EHR (as an IS/IT tool) in hospitals and clinics
has led to the emergence of EHR-based healthcare predictive analytics
research that brings about remarkable practical and medical values [3].
While various studies have shown the value of EHR data in analytics,
the challenges associated with analyzing these types of data are rarely
addressed in the literature. The prerequisite of competent data
analytics research is high-quality data [29]. One important aspect
of data quality is data completeness, and EHR data severely suffers from
data incompleteness. In this study, by introducing a new framework
called Missing Care, this critical challenge is addressed. Using Missing
Fig. 2. Data selection steps.
Care, before developing any predictive model, out of numerous features
available in EHR data (many of them having very high missing values),
we can identify the features that are highly associated with the target
variable. And for the rest of the analysis, those selected variables will be
involved in the model building. Without employing Missing Care, many
important features with high predictive powers might be discarded,
only because the majority of their values in the initial data are missing.
In the experimental analysis, the improvement in the prediction accu-
racy of models after using the Missing Care framework is demonstrated.
While Missing Care is introduced in the context of EHR, it can be ap-
plied to other datasets with EHR characteristics (many variables with
high degrees of missing values). It needs to be emphasized that Missing
Care does not deny the importance of imputation techniques, rather it
helps preserve variables that could benefit from imputation in later
stages of analysis, while they could be discarded at early stages due to
very high missingness degrees.
Besides the immense traditional contribution of researchers to
healthcare at the level of management and organizations, personalized
medicine has received attention in recent years, and prestigious
journals have recognized personalized medicine through CDSSs as one
of the promising and impactful streams of research [3,20,25,26,96].
CDSSs can enhance decision-making capabilities in care management at
the patient level rather than the general population. In line with this
recent movement in healthcare analytics literature, and consistent with
the design science research paradigm that focuses on solving practical
problems, a CDSS is developed to detect and monitor for PD, a highly
underdiagnosed neurological disorder. Employing this CDSS, and thereby
earlier diagnosis of PD has multiple benefits. It will provide more in-
formation to patients about their health status, thus enabling them to be
proactive. It also helps clinics and physicians in being prepared for
treatment planning. Finally, it improves the specialty care delivery to
patients.
The contributions of this study can be summarized in two cate-
gories. First, to the best of my knowledge, this is the first study that
formally discusses the data quality issue of incompleteness in EHR data
and introduces a framework to address this issue. The benefits of ap-
plying this framework are demonstrated through empirical experiments
Fig. 3. Data pre-processing steps.
Fig. 4. Using Missing Care vs. not using Missing Care.
Table 3
Models after applying Missing Care (on SIMO-oversampled Master II).
              LR       SVM-L    SVM-RBF  NN       RF       GB
AUC           80.63%   79.87%   79.36%   77.63%   84.30%   84.63%
Sensitivity   73.15%   72.98%   72.85%   69.95%   75.80%   76.25%
Specificity   72.78%   72.46%   72.08%   69.83%   75.28%   75.17%

Table 2
Models without applying Missing Care (on SIMO-oversampled Master III).
              LR       SVM-L    SVM-RBF  NN       RF       GB
AUC           73.63%   73.08%   73.11%   72.24%   78.02%   78.15%
Sensitivity   66.98%   66.35%   66.39%   66.08%   70.12%   70.37%
Specificity   66.08%   66.22%   66.13%   66.00%   70.15%   70.28%
Table 4
Statistical significance of the effectiveness of Missing Care.
                 LR          SVM-L       SVM-RBF     NN          RF          GB
AUC difference   6.998%***   6.79%***    6.25%***    5.39%***    6.25%***    6.49%***
p-value          4.1E-05     1.81E-06    5.46E-05    4.63E-05    1.36E-04    1.18E-04
*** 99% confidence, ** 95% confidence, * 90% confidence; two-sample t-test, unequal variances.
Fig. 5. AUC for models with and without Missing Care.
Table 5
AUC for models built on Master II under various over- and under-sampling techniques.
          LR       SVM-L    SVM-RBF  NN       RF       GB
SIMO      80.63%   79.87%   79.36%   77.63%   84.30%   84.63%
W-SIMO    80.53%   79.68%   79.20%   77.67%   84.34%   84.45%
SMOTE     80.31%   79.36%   79.29%   77.55%   84.21%   84.23%
RUS       80.08%   81.01%   80.26%   77.24%   84.12%   84.15%
Fig. 6. AUC of RF and GB and their ensembles using various imbalanced data
learning techniques.
Table 6
The significance of the difference between SIMO and SMOTE and RUS.
Difference between      RUS       SMOTE
GB model / SIMO         0.48%*    0.40%***
p-value                 0.0943    0.0016
RF model / W-SIMO       0.22%**   0.13%
p-value                 0.0111    0.2991
*** 99% confidence, ** 95% confidence, * 90% confidence; two-sample t-test, unequal variances.
Table 7
Confidence margin ensemble of RF and GB under various over- and under-sampling techniques.
              RUS       SMOTE     W-SIMO    SIMO
AUC           84.62%    84.65%    84.82%    84.97%
Sensitivity   75.70%    75.65%    76.05%    76.45%
Specificity   75.80%    75.57%    75.56%    75.88%
Table 8
Confidence margin ensemble of RF and GB.
Ensemble of    SIMO      SIMO & W-SIMO   …, & SMOTE   …, & RUS
AUC            84.97%    85.06%          85.19%       85.29%
Sensitivity    76.45%    76.55%          76.15%       76.00%
Specificity    75.88%    75.44%          76.04%       76.29%
Fig. 7. AUC for the ensemble of RF and GB by adding more models.
Table 9
Difference between models with and without applying Missing Care (various imbalanced ratios).
Imbalanced ratio                10%        6.50%      4%         1.30%
Missing Care AUC                85.19%     85.11%     84.71%     83.80%
  # of PD patients              1500       975        600        195
No Missing Care AUC             78.69%     78.57%     78.50%     77.09%
  # of PD patients              11,376     7394       4550       1479
Difference (with vs. without)   6.51%***   6.53%***   6.21%***   6.71%***
p-value for difference          0.00016    0.00012    0.00098    0.00478
*** 99% confidence, ** 95% confidence, * 90% confidence; two-sample t-test, unequal variances.
Table 10
DR data characteristics.
# of variables   # of patients   Avg. missing rate   Max missing rate   Min missing rate
91               451,392         74.1%               97.8%              29.8%
in various predictive modeling techniques. Second, in this study, a
CDSS that can aid clinicians in diagnosing PD is developed. The
superiority of this CDSS over existing diagnostic methods is that it can
be applied using only the demographic and lab test information of the
patients, with no need for more advanced equipment such as MRI,
which is scarce in more remote areas.
7.1. Practical implications
The proposed framework, Missing Care, can be used in the devel-
opment of predictive models based on EHR or other similar datasets.
These predictive models can be used to enhance decision making in
various contexts, including healthcare. The CDSS developed in this
study can be employed in different forms and can be beneficial to all
stakeholders in healthcare. It can be integrated into EHR systems and
automatically provide the risk of having PD for patients. It also can be
used as a standalone tool; then, primary care providers and even nurses
can use it as a screening tool. Since this CDSS is very easy to employ, it
can fill the gap of specialist and equipment scarcity in rural and remote
areas, as well as in many developing countries that are similar to rural
areas of the US with regard to healthcare accessibility. This CDSS is easy to
use and accurate at the same time. The ensemble models developed in
this study reached an AUC of 85.3%, which, given the complexity of
diagnosing PD, is very good accuracy for a screening tool that uses
simple blood tests. Applying this CDSS can lead to earlier diagnosis for
patients, and earlier diagnosis generally allows for more effective
treatments and interventions; this could translate to cost savings
both for patients themselves and for the whole healthcare system and
country.
Another benefit of this CDSS is identifying the risk factors for PD.
Among the identified factors, some are known to the medical community,
while others have not been mentioned in the medical
literature (to the best of my knowledge). These new factors could be
potential leads for medical researchers to conduct more controlled
clinical studies and evaluate their relationship with PD. Factors iden-
tified by the CDSS and medical studies that showed their connection to
PD are age, gender [97]; alanine aminotransferase, aspartate amino-
transferase, and, glucose [98]; platelet count and lymphocyte [99];
glomerular filtration rate [100]; blood pressure and heart rate [101];
serum sodium and chloride [102]; blood monocytes [103]; and, white
blood cell count [104]. No medical research was found studying the
following factors that were identified by the CDSS: mean corpuscular
volume, creatinine serum, specific gravity urine, partial thromboplastin
time, blood urea nitrogen, basophils percent.
7.2. Limitations and future research
This work, similar to other analytics research based on EHR data,
has the limitation of the initial labeling of patients as PD and non-PD
before training the models. While most studies rely only on ICD
diagnosis codes, in this study a reliable measure is taken to mitigate the
effect of this limitation through a two-stage initial identification based
both on the ICD diagnosis codes in the data and on the medications
that patients take. Future research can be conducted by collecting data
specific to the purpose of this research and in a more controlled
manner, thereby enabling causal inference.
Acknowledgment
This work was conducted with data from the Cerner Corporation's
Health Facts database of electronic medical records provided by the
Oklahoma State University Center for Health Systems Innovation
(CHSI). Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author(s) and do not ne-
cessarily reflect the views of the Cerner Corporation. I would like to
acknowledge Dr. Yasamin Vahdati, for proofreading the article, and Dr.
Delen, Dr. Paiva, and, Dr. Miao from CHSI for providing the data.
Appendix A. List of PD medications
Carbidopa-Levodopa
Pramipexole
Ropinirole
Benztropine
Amantadine
Selegiline
Carbidopa/Entacapone/Levodopa
Trihexyphenidyl
Rasagiline
Rotigotine
Carbidopa
Tolcapone
Levodopa
Appendix B. Descriptive statistics for the patients in both PD and control groups

Variables                          PD Patients                  Control Patients
                                   Mean    STD    Median        Mean    STD    Median
Alanine Aminotransferase SGPT 21.83 19.46 18.00 32.72 34.40 26.00
Alkaline Phosphatase 85.88 39.72 78.00 93.60 48.24 83.50
Anion Gap 9.01 3.08 9.00 8.99 3.16 9.00
Aspartate Aminotransferase 30.91 33.58 23.00 36.87 59.24 25.00
Table 11
Difference between models with and without applying Missing Care for DR.
                        AUC           # of variables   # of patients   # of DR patients
Missing Care            92.83%        30               64,562          9943
No Missing Care         88.54%        9                136,505         22,415
Difference              4.287%***
p-value for difference  1.80571E-05
*** 99% confidence, ** 95% confidence; two-sample t-test, unequal variances.
Basophils Percent 0.51 0.30 0.50 0.55 0.34 0.55
Bilirubin Total 0.70 0.46 0.60 0.74 0.70 0.60
Blood Pressure, Systolic 131.08 14.66 129.89 129.71 16.58 129.89
Blood Urea Nitrogen 21.75 12.83 19.00 18.92 13.33 15.00
Carbon Dioxide 25.94 3.85 26.00 25.83 3.86 25.85
Chloride Serum 104.78 4.92 105.00 104.09 4.71 104.00
Creatinine Serum 1.07 0.62 0.90 1.08 0.71 0.90
Glomerular Filtration Rate 58.04 16.64 60.00 60.70 21.52 60.00
Glucose Serum 113.87 37.40 104.00 115.39 39.72 104.00
Heart Rate 80.54 10.26 81.86 82.07 11.58 81.86
Height 65.53 3.54 65.47 65.46 3.60 65.47
International Normalized Ratio 1.27 0.42 1.19 1.25 0.41 1.15
Lymphocyte Percent 19.91 7.97 21.39 21.62 10.27 21.39
Mean Corpuscular Volume 91.58 5.89 91.80 89.64 6.74 89.90
Mean Platelet Volume 8.7 1.3 8.7 8.73 1.24 8.7
Monocyte Percent 8.12 2.71 7.90 7.87 2.80 7.90
Partial Thromboplastin Time 32.30 8.48 32.00 32.86 8.96 32.79
Platelet Count 219.90 88.24 208.00 231.45 96.67 222.00
Prothrombin Time 14.17 3.92 13.60 14.07 3.98 13.60
Sodium 139.21 3.86 139.00 138.61 3.67 139.00
Specific Gravity Urine 1.016 0.005 1.015 1.015 0.006 1.015
Weight 167.91 36.40 176.02 178.89 42.28 177.42
White Blood Cell Count 8.27 3.37 7.60 8.57 3.70 7.90
Age 77.12 9.22 78.00 59.49 20.06 61.00
Male Female Male Female
Gender 55% 45% 44% 56%
References
[1] P.B. Goes, Big data and IS research, MIS Q. 38 (3) (2014) iii–viii.
[2] R.H. Chiang, et al., Strategic Value of Big Data and Business Analytics, Taylor & Francis, 2018.
[3] H. Chen, R.H. Chiang, V.C. Storey, Business intelligence and analytics: From big
data to big impact, MIS Q. 36 (4) (2012).
[4] D. Bertsimas, et al., Call for papers—special issue of management science: business
analytics: submission deadline: September 16, 2012 expected publication date:
first quarter 2014, Manag. Sci. 58 (7) (2012) 1422.
[5] A. Seidmann, Y. Jiang, J. Zhang, Introduction to the special issue on analyzing the
impacts of advanced information technologies on business operations, Decis.
Support. Syst. 76 (C) (2015) 1–2.
[6] J.-N. Mazón, et al., Introduction to the special issue of business intelligence and
the web, Decis. Support. Syst. 52 (4) (2012) 851–852.
[7] R.J. Kauffman, D. Ma, B. Yoo, Guest editorial: market transformation to an IT-
enabled services-oriented economy, Decis. Support. Syst. 78 (2015) 65–66.
[8] K. Miller, Big Data Analytics In Biomedical Research, (2012) (cited 2019 June 17).
[9] T.R. Huerta, et al., Electronic health record implementation and hospitals’ total
factor productivity, Decis. Support. Syst. 55 (2) (2013) 450–458.
[10] A.K. Jha, et al., A progress report on electronic health records in US hospitals,
Health Aff. 29 (10) (2010) 1951–1957.
[11] Health-IT-Quick-Stat, Percent of Hospitals, By Type, that Possess Certified Health
IT, (2018) ([cited 2019 June 10th]; Health IT Quick-Stat #52).
[12] C.M. Angst, et al., Social contagion and information technology diffusion: the
adoption of electronic medical records in US hospitals, Manag. Sci. 56 (8) (2010)
1219–1241.
[13] A. Baird, E. Davidson, L. Mathiassen, Reflective technology assimilation: facil-
itating electronic health record assimilation in small physician practices, J. Manag.
Inf. Syst. 34 (3) (2017) 664–694.
[14] H.K. Bhargava, A.N. Mishra, Electronic medical records and physician pro-
ductivity: evidence from panel data analysis, Manag. Sci. 60 (10) (2014)
2543–2562.
[15] K.K. Ganju, H. Atasoy, P.A. Pavlou, 'Where to, Doc?' Electronic Health Record Systems and the Mobility of Chronic Care Patients, Working paper (2017).
[16] H. Atasoy, P.-y. Chen, K. Ganju, The spillover effects of health IT investments on
regional healthcare costs, Manag. Sci. 64 (6) (2017) 2515–2534.
[17] M.Z. Hydari, R. Telang, W.M. Marella, Saving patient Ryan—can advanced elec-
tronic medical records make patient care safer? Manag. Sci. 65 (5) (2018)
2041–2059.
[18] Y.-K. Lin, M. Lin, H. Chen, Do Electronic Health Records Affect Quality of Care?
Evidence from the HITECH Act. Information Systems Research, (2019).
[19] R. Agarwal, et al., Research commentary—the digital transformation of health-
care: current status and the road ahead, Inf. Syst. Res. 21 (4) (2010) 796–809.
[20] A. Gupta, R. Sharda, Improving the science of healthcare delivery and informatics using modeling approaches, Elsevier, 2013.
[21] T.T. Moores, Towards an integrated model of IT acceptance in healthcare, Decis.
Support. Syst. 53 (3) (2012) 507–516.
[22] R. Kohli, S.S.-L. Tan, Electronic health records: how can IS researchers contribute
to transforming healthcare? MIS Q. 40 (3) (2016) 553–573.
[23] J. Glaser, et al., Advancing personalized health care through health information
technology: an update from the American health information Community’s per-
sonalized health care workgroup, J. Am. Med. Inform. Assoc. 15 (4) (2008)
391–396.
[24] M.P. Johnson, K. Zheng, R. Padman, Modeling the longitudinality of user accep-
tance of technology with an evidence-adaptive clinical decision support system,
Decis. Support. Syst. 57 (2014) 444–453.
[25] R.G. Fichman, R. Kohli, R. Krishnan, Editorial overview—the role of information
systems in healthcare: current research and future trends, Inf. Syst. Res. 22 (3)
(2011) 419–428.
[26] Y.-K. Lin, et al., Healthcare predictive analytics for risk profiling in chronic care: a Bayesian multitask learning approach, MIS Q. 41 (2) (2017).
[27] R. Agarwal, V. Dhar, Big data, data science, and analytics: the opportunity and
challenge for IS research, Inf. Syst. Res. 25 (3) (2014) 443–448.
[28] G. Shmueli, O.R. Koppius, Predictive analytics in information systems research, MIS Q. (2011) 553–572.
[29] B. Baesens, et al., Transformational issues of big data and analytics in networked business, MIS Q. 40 (4) (2016).
[30] G. Jetley, H. Zhang, Electronic health records in IS research: quality issues, es-
sential thresholds and remedial actions, Decis. Support. Syst. 126 (2019) 113137.
[31] B. Baesens, Analytics in a Big Data World: The Essential Guide to Data Science and its Applications, John Wiley & Sons, 2014.
[32] H.-T. Moges, et al., A multidimensional analysis of data quality for credit risk
management: new insights and challenges, Inf. Manag. 50 (1) (2013) 43–58.
[33] J. Du, L. Zhou, Improving financial data quality using ontologies, Decis. Support.
Syst. 54 (1) (2012) 76–86.
[34] I. Bardhan, et al., Predictive analytics for readmission of patients with congestive
heart failure, Inf. Syst. Res. 26 (1) (2014) 19–39.
[35] B. Yet, et al., Decision support system for warfarin therapy management using
Bayesian networks, Decis. Support. Syst. 55 (2) (2013) 488–498.
[36] NINDS, Parkinson's Disease: Challenges, Progress, and Promise, National Institutes of Health, 2015.
[37] Parkinson's Foundation, Parkinson's Statistics [cited 2019 June 20].
[38] P. Inacio, Distinct Brain Activity Patterns Captured by EEG May Help in Treating Parkinson's, Study Suggests, 2019 [cited 2019 June 20].
[39] J. Barjis, G. Kolfschoten, J. Maritz, A sustainable and affordable support system for
rural healthcare delivery, Decis. Support. Syst. 56 (2013) 223–233.
[40] Y. Li, et al., Designing utilization-based spatial healthcare accessibility decision
support systems: a case of a regional health plan, Decis. Support. Syst. 99 (2017)
51–63.
[41] U. Varshney, Mobile health: four emerging themes of research, Decis. Support.
Syst. 66 (2014) 20–35.
[42] S. Piri, D. Delen, T. Liu, A synthetic informative minority over-sampling (SIMO)
algorithm leveraging support vector machine to enhance learning from im-
balanced datasets, Decis. Support. Syst. 106 (2018) 15–29.
[43] A.R. Hevner, et al., Design science in information systems research, MIS Q. 28 (1) (2004) 75–105.
[44] B. Padmanabhan, Z. Zheng, S.O. Kimbrough, An empirical analysis of the value of complete information for eCRM models, MIS Q. (2006) 247–267.
[45] L. Ma, R. Krishnan, A.L. Montgomery, Latent homophily or social influence? An
empirical analysis of purchase within a social network, Manag. Sci. 61 (2) (2014)
454–473.
[46] K. Coussement, S. Lessmann, G. Verstraeten, A comparative analysis of data pre-
paration algorithms for customer churn prediction: a case study in the tele-
communication industry, Decis. Support. Syst. 95 (2017) 27–36.
[47] A.R. Saboo, V. Kumar, I. Park, Using big data to model time-varying effects for marketing resource (re)allocation, MIS Q. 40 (4) (2016).
S. Piri Decision Support Systems 136 (2020) 113339
[48] A. Abbasi, et al., Metafraud: a meta-learning framework for detecting financial fraud, MIS Q. 36 (4) (2012).
[49] C. Kuzey, A. Uyar, D. Delen, The impact of multinationality on firm value: a
comparative analysis of machine learning techniques, Decis. Support. Syst. 59
(2014) 127–142.
[50] H.H. Sun Yin, et al., Regulating cryptocurrencies: a supervised machine learning
approach to de-anonymizing the bitcoin blockchain, J. Manag. Inf. Syst. 36 (1)
(2019) 37–73.
[51] D. Breuker, et al., Comprehensible predictive models for business processes, MIS
Q. 40 (4) (2016) 1009–1034.
[52] X. Fang, et al., Predicting adoption probabilities in social networks, Inf. Syst. Res.
24 (1) (2013) 128–145.
[53] R. Aggarwal, H. Singh, Differential influence of blogs across different stages of decision making: the case of venture capitalists, MIS Q. (2013) 1093–1112.
[54] S. Stieglitz, L. Dang-Xuan, Emotions and information diffusion in social med-
ia—sentiment of microblogs and sharing behavior, J. Manag. Inf. Syst. 29 (4)
(2013) 217–248.
[55] Y. Bao, A. Datta, Simultaneously discovering and quantifying risk types from
textual risk disclosures, Manag. Sci. 60 (6) (2014) 1371–1391.
[56] N. Kumar, et al., Detecting review manipulation on online platforms with hier-
archical supervised learning, J. Manag. Inf. Syst. 35 (1) (2018) 350–380.
[57] V.-H. Trieu, Getting value from business intelligence systems: a review and re-
search agenda, Decis. Support. Syst. 93 (2017) 111–124.
[58] J. Xie, et al., Readmission Prediction for Patients with Heterogeneous Hazard: A
Trajectory-Based Deep Learning Approach, (2018).
[59] O. Ben-Assuli, R. Padman, Trajectories of repeated readmissions of chronic disease patients: risk stratification, profiling, and prediction, MIS Q. (2019) (forthcoming).
[60] H.M. Zolbanin, D. Delen, Processing electronic medical records to improve pre-
dictive analytics outcomes for hospital readmissions, Decis. Support. Syst. 112
(2018) 98–110.
[61] L. Yan, Y. Tan, Feeling blue? Go online: an empirical study of social support among
patients, Inf. Syst. Res. 25 (4) (2014) 690–709.
[62] L. Chen, A. Baird, D. Straub, Fostering participant health knowledge and attitudes:
an econometric study of a chronic disease-focused online health community, J.
Manag. Inf. Syst. 36 (1) (2019) 194–229.
[63] X. Wang, et al., Mining user-generated content in an online smoking cessation
community to identify smoking status: a machine learning approach, Decis.
Support. Syst. 116 (2019) 26–34.
[64] T. Wang, et al., Directed disease networks to facilitate multiple-disease risk assessment modeling, Decis. Support. Syst. (2019) 113171.
[65] S. Piri, et al., A data analytics approach to building a clinical decision support
system for diabetic retinopathy: developing and deploying a model ensemble,
Decis. Support. Syst. 101 (2017) 12–27.
[66] W. Zhang, S. Ram, A comprehensive analysis of triggers and risk factors for asthma based on machine learning and large heterogeneous data sources, MIS Q. (2019) (forthcoming).
[67] G. Wang, J. Li, W.J. Hopp, Personalized Health Care Outcome Analysis of
Cardiovascular Surgical Procedures, (2017).
[68] M.E. Ahsen, M.U.S. Ayvaci, S. Raghunathan, When algorithmic predictions use
human-generated data: a bias-aware classification algorithm for breast cancer
diagnosis, Inf. Syst. Res. 30 (1) (2019) 97–116.
[69] W.-Y. Hsu, A decision-making mechanism for assessing risk factor significance in
cardiovascular diseases, Decis. Support. Syst. 115 (2018) 64–77.
[70] G. Meyer, et al., A machine learning approach to improving dynamic decision
making, Inf. Syst. Res. 25 (2) (2014) 239–263.
[71] H. Gómez-Vallejo, et al., A case-based reasoning system for aiding detection and
classification of nosocomial infections, Decis. Support. Syst. 84 (2016) 104–116.
[72] A. Dag, et al., A probabilistic data-driven framework for scoring the preoperative
recipient-donor heart transplant survival, Decis. Support. Syst. 86 (2016) 1–12.
[73] K. Topuz, et al., Predicting graft survival among kidney transplant recipients: a
Bayesian decision support model, Decis. Support. Syst. 106 (2018) 97–109.
[74] S. Somanchi, I. Adjerid, R. Gross, To Predict or Not to Predict: The Case of
Inpatient Admissions from the Emergency Department, Available at SSRN,
3054619 (2017).
[75] J. Jankovic, Parkinson’s disease: clinical features and diagnosis, J. Neurol.
Neurosurg. Psychiatry 79 (4) (2008) 368–376.
[76] A.J. Hughes, S.E. Daniel, A.J. Lees, Improved accuracy of clinical diagnosis of
Lewy body Parkinson’s disease, Neurology 57 (8) (2001) 1497–1499.
[77] M. Fujimaki, et al., Serum caffeine and metabolites are reliable biomarkers of early
Parkinson disease, Neurology 90 (5) (2018) e404–e411.
[78] D. Feigenbaum, et al., Tear Proteins as Possible Biomarkers for Parkinson’s Disease
(S3. 006), AAN Enterprises, 2018.
[79] T. Arroyo-Gallego, et al., Detecting motor impairment in early Parkinson’s disease
via natural typing interaction with keyboards: validation of the neuroQWERTY
approach in an uncontrolled at-home setting, J. Med. Internet Res. 20 (3) (2018)
e89.
[80] S. Rosenblum, et al., Handwriting as an objective tool for Parkinson’s disease di-
agnosis, J. Neurol. 260 (9) (2013) 2357–2361.
[81] A. Zhan, et al., Using smartphones and machine learning to quantify Parkinson
disease severity: the mobile Parkinson disease score, JAMA Neurol. 75 (7) (2018)
876–880.
[82] P. Rizek, N. Kumar, M.S. Jog, An update on the diagnosis and treatment of Parkinson disease, CMAJ 188 (16) (2016) 1157–1165.
[83] P.D. Allison, Missing Data, vol. 136, Sage Publications, 2001.
[84] M. Nakai, W. Ke, Review of the methods for handling missing data in longitudinal
data analysis, Int. J. Math. Anal. 5 (1) (2011) 1–13.
[85] B.J. Wells, et al., Strategies for handling missing data in electronic health record derived data, eGEMs 1 (3) (2013).
[86] S.H. Holan, et al., Bayesian multiscale multiple imputation with implications for
data confidentiality, J. Am. Stat. Assoc. 105 (490) (2010) 564–577.
[87] P. Mozharovskyi, J. Josse, F. Husson, Nonparametric imputation by data depth, J. Am. Stat. Assoc. (2019) 1–24.
[88] E. Acuna, C. Rodriguez, The treatment of missing values and its effect on classifier accuracy, in: Classification, Clustering, and Data Mining Applications, Springer, 2004, pp. 639–647.
[89] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[90] H. Ishwaran, Variable importance in binary regression trees and forests, Electron.
J. Stat. 1 (2007) 519–537.
[91] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser.
B Methodol. 58 (1) (1996) 267–288.
[92] L. Breiman, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.
[93] J.P. DeShazo, M.A. Hoffman, A comparison of a multistate inpatient EHR database
to the HCUP Nationwide inpatient sample, BMC Health Serv. Res. 15 (1) (2015)
384.
[94] N.V. Chawla, et al., SMOTE: synthetic minority over-sampling technique, J. Artif.
Intell. Res. 16 (2002) 321–357.
[95] X. Zhang, et al., Prevalence of diabetic retinopathy in the United States, 2005-
2008, Jama 304 (6) (2010) 649–656.
[96] K.M. Bretthauer, S. Savin, Introduction to the special issue on patient-centric
healthcare Management in the age of analytics, Prod. Oper. Manag. 27 (12) (2018)
2101–2102.
[97] F. Moisan, et al., Parkinson disease male-to-female ratios increase with age: French
nationwide study and meta-analysis, J. Neurol. Neurosurg. Psychiatry 87 (9)
(2016) 952–957.
[98] E. Cereda, et al., Low cardiometabolic risk in Parkinson’s disease is independent of
nutritional status, body composition and fat distribution, Clin. Nutr. 31 (5) (2012)
699–704.
[99] Y.-H. Qin, et al., The role of red cell distribution width in patients with Parkinson’s
disease, Int. J. Clin. Exp. Med. 9 (3) (2016) 6143–6147.
[100] G.E. Nam, et al., Chronic renal dysfunction, proteinuria, and risk of Parkinson’s
disease in the elderly, Mov. Disord. 34 (8) (2019) 1184–1191.
[101] L. Norcliffe-Kaufmann, et al., Orthostatic heart rate changes in patients with au-
tonomic failure caused by neurodegenerative synucleinopathies, Ann. Neurol. 83
(3) (2018) 522–531.
[102] C.J. Mao, et al., Serum sodium and chloride are inversely associated with dyskinesia in Parkinson's disease patients, Brain and Behavior 7 (12) (2017) e00867.
[103] V. Grozdanov, et al., Inflammatory dysregulation of blood monocytes in
Parkinson’s disease patients, Acta Neuropathol. 128 (5) (2014) 651–663.
Saeed Piri (spiri@uoregon.edu) is an Assistant Professor at the Lundquist College of Business, University of Oregon. His research interests lie in data analytics, with a particular focus on healthcare. His research contributes both to data mining methodologies, such as imbalanced data learning, ensemble modeling, and association analysis, and to data analytics applications, such as developing clinical decision support systems and personalized medicine. In addition, Dr. Piri has been studying value-based payment systems in healthcare, online retail platforms, and electronic medical records systems. In his work, he employs advanced machine learning techniques as well as classical empirical methods.