Early Identification of GPC Contacts Among Home Care Clients Through Urgency-Specific Predictions

Early Identification of General Practitioner Cooperative Contacts Among Home Care Clients Through Urgency-Specific Predictions
A Machine Learning Approach
Martijn Logtenberg
thesis submitted in partial fulfillment
of the requirements for the degree of
master of science in data science & society
at the school of humanities and digital sciences
of tilburg university

student number
2067986
committee
dr. Görkem Saygili
dr. Michal Klincewicz
location
Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science &
Artificial Intelligence
Tilburg, The Netherlands
date
June 24, 2022
number of words
8,749
acknowledgments
I would like to thank my supervisor Dr. Görkem Saygili, external
partner Esculine and in particular Dr. Daan Ooms. Without any of
these people, this project would not be what it is now.

preface
At the time of writing, it has been nineteen weeks since I had my first conversation about
this topic. I knew little about it and had only recently become interested in
healthcare systems. It would very soon prove worthy
of being my thesis topic.
During the project, I became aware of the challenges faced in out-of-hours
primary care and of the large impact that can be made. Furthermore, I
became aware of how data science solutions can help, but also
of the pitfalls. I decided this solution should move beyond
model performance because society, and specifically those who will work
with our solutions in practice, demands more.
I am grateful for the opportunity to talk to them, visit their working
environment and see the challenges they face. I hope these nineteen
weeks of dedication will contribute to a solution.
I want to thank the reader for taking the time to read my work. I am
confident it will be worth your time.
Martijn Logtenberg
Tilburg, 20 May 2022
data source/code/ethics statement
Work on this thesis did not involve collecting data from human participants
or animals. The original owner retains ownership of the
data, during and after the completion of this thesis. The author of
this thesis acknowledges having no legal claim to this data. The code
used in this thesis is only accessible by reviewers, due to confidentiality
requirements by the external partner. The review committee
has been invited to the study’s code through GitHub (URL:
https://github.com/MGJLogtenberg/gpc-contact-identification.git). This
repository is only accessible after acceptance of the invitation and when
logged in to GitHub.

contents
1 Introduction
1.1 Out-Of-Hours Primary Care in the Netherlands
1.2 Home Care in the Netherlands
1.3 Problem Statement
1.4 Ethical and Legal Considerations
2 Related Work
2.1 Factors Associated with GPC Variation
2.2 Machine Learning Studies on Adverse Health Outcomes
2.3 Accuracy of the Dependent Variable
3 Methods
3.1 Resampling Technique
3.2 Feature Selection
3.3 Prediction Algorithms
4 Experimental Setup
4.1 Raw Datasets
4.2 Cleaning
4.3 Exploratory Data Analysis
4.4 Experimental Procedure
5 Results
5.1 Experiment 1: Feature Selection
5.2 Experiment 2: Model Evaluation
6 Discussion
6.1 Scientific and Societal Relevance
6.2 Limitations
6.3 Future Work
6.4 Ethical Considerations
7 Conclusion
Appendices
A Experiments With Zero-Inflated Count Models
B Experiments With Feature Extraction and Artificial Neural Networks
C Full Dataset Variable Description
D Merging and Cleaning Pipeline
E Additional Problem Classification Scheme Plots
F Correlation Table
G Python Packages
H Search Spaces and Best-Performing Hyperparameters
I Selected Dataset Features
J Confusion Matrices
K Comprehensive Error Analysis

Early Identification of General Practitioner Cooperative Contacts Among Home Care Clients Through Urgency-Specific Predictions
A Machine Learning Approach
Martijn Logtenberg
Abstract
As pressure on general practitioner cooperatives (GPCs) is increasing,
there is a demand for ways to decrease the number of GPC contacts. Early
identification and intervention can help to prevent home care clients from
contacting the GPC. This study, therefore, focuses on identifying clients who will contact
the GPC in the next twelve months, predicting this for different urgency levels
with pre-known health record data.
This is done by combining datasets from a GPC and a home care organisation
and, subsequently, applying resampling and several prediction model architectures.
Feature selection is, moreover, applied to increase interpretability.
The results reveal that maximum model performance can be achieved with
significantly fewer features (1 to 25 out of 115 features) and that the most
important features relate to preceding GPC contact, home care intensity, frailty
status and some problem domains (e.g. respiration). The best-performing
models achieve an Area Under Precision-Recall Curve of 0.281, 0.340 and
0.431 on very urgent, urgent and non-urgent contacts, respectively, with a
Random Undersampling Booster model. However, one might argue other
models are more interpretable or make better trade-offs.
With these results, this study has shown the advantages of a combining and
collaboration approach between GPCs and home care organisations to identify
GPC contacts up to twelve months in advance with data that would otherwise
remain unseen. On top of that, this study names pre-known factors that are
predictive of GPC contact but were not studied before. Accordingly, this study
contributes to a solution to decrease pressure on GPCs.
1 introduction
The ageing population puts pressure on the Dutch healthcare system. The proportion
of the population above 60 years is projected to grow from 24.6% to
29.1% between 2020 and 2050 (CBS, 2022). Ageing populations create
challenges for housing, the labour market and healthcare services (Tinker, 2002).
Consequently, healthcare domains are required to work towards more efficiency,
while providing the same quality of care. This study focuses on decreasing the
demand on out-of-hours primary care through early identification among clients in
home care, which is why both contexts are discussed in this introduction.
1.1 Out-Of-Hours Primary Care in the Netherlands
Practically every Dutch citizen has a general practitioner (GP), who is available
during office hours on weekdays. If people need care outside of office hours,
they can contact out-of-hours general practitioner cooperatives (GPCs) for health
requests that cannot wait until the next working day. GPCs are often low on
capacity and crowded. Therefore, a system of telephone triage is used to estimate
the urgency of care and the course of action.
1.1.1 Netherlands Triage Standard
The Netherlands Triage Standard (NTS) is a
standardised triage system that helps triagists to estimate the urgency of contacts.
A triagist receives a set of questions to ask a patient. Based on the answers, an
urgency level is estimated ranging from U0 to U5.
Table 1: Definitions of the NTS urgency estimation levels
Level | Urgency              | Course of action
U0    | Loss vital functions | Resuscitation
U1    | Life-threatening     | Immediate action is required to prevent irreparable damages
U2    | Emergency            | Act as soon as possible, delaying treatment will likely result in irreparable damages
U3    | Urgent               | Treat within a few hours for medical or humane reasons
U4    | Non-urgent           | Urgent action not necessary, time and place can be discussed with the patient
U5    | Advice only          | An examination can wait until the next day
As can be seen in Table 1, the course of action is associated with the urgency
estimation. In this study, the urgency estimation levels are grouped into three bins.
Levels U0, U1 and U2 form one bin (U0-2; Very urgent), level U3 forms its own bin
(Urgent) and levels U4 and U5 form the third bin (U4-5; Non-urgent). The rationale
for binning the urgency estimation levels is given in Section
2.3. For the remainder of the thesis, these bins are referred to as urgency bins.
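The binning can be illustrated with a minimal sketch. The dictionary and function names below are hypothetical and not taken from the study's code; the sketch only assumes that each contact carries an NTS level label such as "U2", and that a client counts as positive for a bin when at least one contact falls in it.

```python
# Hypothetical mapping from NTS urgency levels to the three urgency bins.
URGENCY_BINS = {
    "U0": "U0-2", "U1": "U0-2", "U2": "U0-2",  # very urgent
    "U3": "U3",                                 # urgent
    "U4": "U4-5", "U5": "U4-5",                 # non-urgent
}


def to_urgency_bin(level):
    """Return the urgency bin for an NTS level label such as 'U2'."""
    return URGENCY_BINS[level.strip().upper()]


def contacted_per_bin(levels):
    """Flag, per urgency bin, whether at least one contact falls in that bin."""
    seen = {to_urgency_bin(level) for level in levels}
    return {name: name in seen for name in ("U0-2", "U3", "U4-5")}


# A client with three contacts in twelve months (illustrative values only).
print(contacted_per_bin(["U4", "U2", "U5"]))
```

Keeping the mapping in one dictionary makes the bin definition explicit and easy to change if a different grouping were preferred.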
1.1.2 GPC Challenges
GPCs are a cornerstone of the Dutch healthcare system,
specifically for home care clients. Among elderly people in home care, 28.8% contacted
GPCs over a period of twelve months, while only 20.8% of elderly people not in home
care contacted GPCs (Bloemhoff et al., 2020).
GPCs help to prevent an overflowing healthcare system by taking over the gatekeeper
role of a regular GP outside of office hours. Studies also show that GPCs
reduce emergency room (ER) presentations and waiting times significantly (Crawford
et al., 2017; Huibers et al., 2018).
Despite the advantages, GPCs face multiple challenges. The major challenge is
overcrowding. Between 2011 and 2019 the number of GPC contacts increased by
5.4% (InEen, 2021). Although GPCs are reserved for problems that cannot wait
until the next working day, 41.4% of contacts are non-urgent (U4-5; Gijsen et al.,
2019).
Therefore, gatekeeping through triage is an important strategy to preserve the
continuity of care and reduce costs. However, triage itself is also overcrowded, which
leads to significantly longer waiting times and to norms not being met. Decreasing
GPC demand is therefore desired. Among the proposed improvements are
the introduction of co-payment, stricter triage and a larger role for telephone
consultation (Keizer et al., 2016). A narrative review on GPCs is provided by Smits
et al. (2017).
1.2 Home Care in the Netherlands
In 2018, 589 thousand people received home care in the Netherlands (Vektis, 2020).
This number grows each year due to deinstitutionalisation policies, which
encourage people to live independently for longer. Ageing at home leads to more
autonomy and independence, while also serving as one of the strategies to make
healthcare more efficient (Van Den Broek et al., 2019).
Home care in the Netherlands is offered through home care organisations, whose
objective is to improve the autonomy of clients and to delay or prevent institutionalisation.
1.2.1 Omaha System
Electronic health records (EHRs) in home care are structured
through the Omaha system or comparable systems. The Omaha system
helps home care nurses to structure the state and needs of a client. The system
has three components: problem classification scheme, intervention scheme and
problem rating scale. A more detailed description of these components is given in Section
4.1 and in Koster and Harmsen (2016).
1.3 Problem Statement
According to GPCs, home care clients represent a significant portion of contacts,
specifically during peak hours (Post, 2021). While elderly people in nursing homes benefit
from in-house nursing, which filters out mainly non-urgent cases (Scapinello et al.,
2016), this is not the case in home care. Meanwhile, elderly people already tend to receive
higher urgency estimations (Zwaanswijk et al., 2015). Early identification and
intervention may positively impact the health of home care clients and prevent
unnecessary GPC contact (Bloemhoff et al., 2020).
On top of that, comprehensive and structured data, containing potentially relevant
predictors for GPC contact, is available. This is another reason to study the
home care population as opposed to elderly without care, those having family
caregivers available or the general population.
As described in Section 2.1, several factors are associated with GPC contact and
urgency estimation variation, while home care data can also offer novel insight
into associations with factors that have not been studied before.
Beyond novel insights on associated factors, predictions are relevant to home care
organisations and GPCs because they indicate for whom to intervene with early support
or where to increase health literacy. In that way, the purpose of identification is to
intervene and prevent GPC contact. Because interventions differ per urgency bin,
this study creates a prediction model for each of the three urgency bins.
This study uniquely aims to provide insight into the identification possibilities of a
combining and collaboration¹ approach between GPCs and home care organisations.
This study does not aim to estimate GPC workload but rather
focuses on identifying GPC contacts to reduce workload in the long term. Consequently,
the main question is:
To what extent is it possible to predict whether a home care client will contact
a GPC per urgency bin in the next twelve months?
The data used in this study contains numerous variables and little is known about
their association. Furthermore, interpretability is desired (see Section 1.4), as well
as an evaluation of the predictive power of individual features. Feature selection
provides this evaluation and interpretability. Accordingly, the first subquestion is:
1. Which features can serve as accurate predictors for this prediction task?
Secondly, these features are used to build and tune prediction models. Such
predictions must be accurate to have societal relevance, thus the second subquestion
is:
2. What is the best conventional² machine learning model performance for this prediction
task?
Together, these subquestions allow for a complete answer to the main question and,
accordingly, a conclusion of this study.
1.4 Ethical and Legal Considerations
A model predicting whether a client will contact the GPC can provide the desired
early identification of home care clients. This allows for early intervention, hence
improving efficiency and adequacy of care with otherwise unseen data. Although
there are certainly benefits, this study must follow the relevant legal rules and
ethical norms.
¹ A combining and collaboration approach aims to share and combine data, and collaborate on
interventions to decrease GPC demand.
² The definition of conventional is discussed in Section 2.2.

The General Data Protection Regulation (GDPR) is the applicable EU data
protection law, which concerns the data protection of identified or identifiable
natural persons. It, thereby, excludes anonymised data (European Union Agency
for Fundamental Rights et al., 2018). All data used in this study is anonymised and,
thus, the GDPR does not apply to it. In practice, however, interventions require data on
identified natural persons, so a legal basis for sharing such data is needed.
Ethical challenges in data science have moved towards practice instead of
study design. This demands consideration away from issues solely related to
data protection towards the consideration of a study’s impact in practice (Ienca
et al., 2018). Such considerations are mechanised in design principles including
the minimisation of harm to subjects, taking epistemic responsibility through
supportive and interpretable models, and designing fair and balanced models
trained on representative populations (Gerke et al., 2020; Mittelstadt & Floridi,
2016). By considering these topics throughout the study, possible shortcomings can
be either tackled or their impact minimised. In doing so, this study is compliant
with the ethical principles of the categorical imperative according to Kantian ethics
(Keymolen & Taylor, 2021).
A final model should, accordingly, be implementable as a supportive tool for
home care nurses, with the purpose of alerting them to clients who are at risk of
GPC contact.
This requires an interpretable model because additional measures for a client
need to be explained. Other reasons for the interpretability requirement are the
imperfect urgency estimations (see Section 2.3) and the possibility that implementation
of a model is unnecessary if insights into the predictive power of features
already provide sufficient information. This means that this study looks into
predictive power without necessarily requiring model implementation and, also,
provides possibilities for instance-specific explanations (e.g. Local Interpretable
Model-Agnostic Explanations).
2 related work
An extensive literature review is performed to gain up-to-date knowledge and to
create a foundation to build on.
2.1 Factors Associated with GPC Variation
Because this study concerns GPC contact per urgency bin, studies explaining
either demand variation or urgency estimation variation in GPCs are included.
The review is structured by the Phase 4 version of Andersen’s Behavior Model
of Health Services Use (Andersen, 1995). Andersen’s model is prevalent in the
literature (Babitsch et al., 2012) and, moreover, has served well in the context of
primary care use (e.g. Moth et al., 2020; Parslow et al., 2002; Vedsted et al., 2004)
and in studies among home care clients (e.g. Bick & Dowding, 2019; Fortinsky et
al., 2014; Nijmeijer, 2021). Andersen’s model divides factors into three categories:
environmental factors, population characteristics and health behaviour.
Figure 1: Andersen’s Behavior Model of Health Services Use (Andersen, 1995)
2.1.1 Environmental Factors
The healthcare system and external environment
of an individual constitute the environmental factors. Characteristics of a regular
GP are associated with GPC demand (Smits et al., 2015). Interestingly, no studies
have been performed to illustrate differences associated with characteristics of
provided home care, even though home care clients are an interesting subgroup
for GPCs (see Section 1.3).
Some neighbourhood characteristics were found to have a significant effect on
GPC contact (Jansen et al., 2019, 2015; Smits et al., 2015), although this data is not
available in this study’s dataset. There are no studies concerning the association of
environmental factors with urgency estimation variation.
2.1.2 Population Characteristics
Andersen’s model subdivides population characteristics
into predisposing characteristics, enabling resources and needs. Zwaanswijk et al.
(2015) showed that more than 90% of the variation in urgency estimations can be
explained by such characteristics for the three most common GPC diagnoses.
First, predisposing characteristics refer to demographics, where most literature considers
age or gender. Higher age is significantly associated with both a higher urgency
estimation and higher demand (Ramerman et al., 2020; Zwaanswijk et al., 2015).
Gender is associated with both as well: men are more likely to receive higher urgency
estimations, while women contact the GPC more often (Ramerman et al., 2020;
Zwaanswijk et al., 2015).
Factors such as migrant status, education and intellectual disabilities are all associated
with medically unnecessary GPC contact (Heutmekers et al., 2017; Jansen
et al., 2018; Keizer et al., 2017, 2021), but this data is not available in this study’s
dataset.
Secondly, enabling resources refer to factors such as income and living situation.
Home care clients are more likely to be frequent attenders compared to elderly
in residential homes (Buja et al., 2015), due to residential nurses filtering out less
urgent cases (Scapinello et al., 2016). Furthermore, home care clients have more
GPC contacts compared to elderly not receiving home care (Bloemhoff et al., 2020).
There are currently no studies regarding the urgency estimation variation of home
care clients. No differences in urgency estimation between different socioeconomic
groups were found (Jansen et al., 2021), although differences in demand were
found (Jansen et al., 2020).
Thirdly, needs refer to the overall health status or how people perceive their health.
Among elderly people, cardiovascular, neurological, respiratory and digestive diagnoses are
associated with higher urgency estimations (Haraldseide et al., 2020). Comorbidity
and other types of frailty among elderly people are associated with GPC contact
(Bloemhoff et al., 2020).
2.1.3 Health Behaviour
Health behaviour is subdivided into personal health
practices and health service usage. The time of contact, self-perceived
severity and motives are associated with GPC variation (Gamst-Jensen et
al., 2020; Nørøxe et al., 2017; Wouters et al., 2020). Although these may be accurate
predictors, they are not included, because this study focuses on early identification,
which is not possible if the prediction requires data that is not known in advance.
No other studies were found that associate GPC variation with health behavioural
factors. This may be because this data is not structurally collected during triage.
2.1.4 Gaps in the Literature
As discussed, some gaps in the literature
exist: few or no studies concern characteristics of provided home
care, factors in the external environment, pre-known health behaviour or
urgency estimation variation in general. Pre-known behavioural factors, in particular,
are prevalent predictors in this study’s dataset and can provide novel insights.
Insights into associations with urgency estimation variation are, furthermore, also
provided by this study through the design of urgency-specific prediction models.
Therefore, this study offers insight into novel associations.
2.2 Machine Learning Studies on Adverse Health Outcomes
No studies currently concern prediction models on demand or urgency estimation
in primary care, but studies in similar contexts have been performed. These studies
are included to identify plausible machine learning frameworks for this study.
They focus on the prediction of injurious falls, hospitalisation, ER utilisation or
home care intensity, and all mentioned studies use EHR data among home care
clients. Although not the primary metric (see Section 4.4), the Area Under Receiver
Operating Characteristic Curve (AUC-ROC) is reported to compare the existing
literature since the interpretation of the AUC-ROC is the same across datasets.
A random guessing classifier achieves an AUC-ROC of 0.5 and higher AUC-ROC
scores indicate better performance. If AUC-ROC is not reported, accuracy is given instead.
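To make this interpretation concrete, the sketch below computes the AUC-ROC through its rank-based definition: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions and are not taken from any of the cited studies.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example (made-up scores): three of the four pairs are ranked correctly.
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
# A classifier that gives every sample the same score carries no ranking information.
print(auc_roc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))
```

Because only the ordering of scores matters, the value does not depend on the class ratio of a dataset, which is what makes it usable for comparing studies.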
Contacting the ER is one of the contexts studied before which may compare to
the GPC context. Veyron et al. (2019) used a Random Forest model to predict ER
contacts. They achieved an AUC-ROC of 0.70.

A study predicting ER visit risk due to injurious falls and ER visit count showed
an AUC-ROC of 0.679 and 0.655, respectively, can be achieved with a Gradient
Boosted Trees model (Jones et al., 2018).
Prediction of falling among home care clients has been studied multiple times as
well. One study used a Random Forest model achieving an AUC-ROC of 0.67
(Lo et al., 2019). A different study demonstrated an Adaptive Boosting model can
predict falling with an AUC-ROC of 0.751 (Melillo et al., 2017).
Another field of study is the prediction of home care intensity. Teotia et al.
(2021) found Adaptive Boosting, Random Forest and Gradient Boosted Trees models
on a stratified split dataset to perform well in this task, achieving an accuracy
of 0.813, 0.843 and 0.840, respectively.
Hospitalisation is another topic already studied. Unplanned hospitalisation
was predicted with a Gradient Boosted Trees model achieving an AUC-ROC of
0.689 (Jones et al., 2018).
Furthermore, a study by Nijmeijer (2021) showed hospitalisation can be predicted
with an eXtreme Gradient Boosting model trained on features from a
Principal Component Analysis, achieving an AUC-ROC of 0.823.
Although most studies work with Adaptive Boosting, Random Forest or Gradient
Boosted Trees models, one study showed acute hospitalisation is predictable with
an AUC-ROC of 0.99 using a Random Undersampling Booster model (Witt et al.,
2022).
2.3 Accuracy of the Dependent Variable
The urgency of a GPC contact is estimated by a triagist, supported by the Netherlands
Triage Standard (NTS; see Section 1.1).
Some studies were conducted on the accuracy of urgency estimations. Importantly,
after an urgency estimation is assigned, the GP needs to approve it
and can adjust it. This helps increase the accuracy of urgency
estimations in this study’s dataset.
Triagists correctly estimate the urgency in 68.5% of cases. In 12.5% of cases,
the urgency is overestimated and in 19% of cases the urgency is underestimated
(Giesen et al., 2007). Undertriage was found most common throughout various
levels of clinical implications (Montalto et al., 2010).
The problem of imperfect accuracy in the dependent variable is partly tackled by
binning similar urgency estimation levels.

3 methods
This section describes the methodology. The first section explains the use of
resampling. Since the study consists of two subquestions, the second and third
section explain the feature selection and the prediction algorithms, respectively.
3.1 Resampling Technique
The dataset shows a severe right skew and zero-inflation in the number of GPC
contacts per twelve months, as illustrated in Figure 2.
[Figure 2 placeholder: three histogram panels (U0-2, U3 and U4-5) titled "Histogram on the number of GPC contacts per year"; x-axis: number of GPC contacts (0 to 6); y-axis: count (0 to 25,000)]
Figure 2: Histograms of the distribution of yearly GPC contacts per urgency bin
Conducted experiments reveal that zero-inflated regression models have trouble
predicting with these distributions. This makes regression algorithms
unsuitable for this task. Readers interested in these experiments are referred to Appendix A.
The task is therefore transformed into binary classification. Even then, however, the dependent
variable is imbalanced with an underrepresented positive class. Resampling techniques
can help to solve this imbalance problem.
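The transformation to a binary target, and the imbalance that remains afterwards, can be sketched as follows. The function names and the contact counts are made up for illustration and do not reflect the study's data.

```python
def binarise(contact_counts):
    """Turn yearly contact counts into a binary target: 1 if any contact occurred."""
    return [1 if count > 0 else 0 for count in contact_counts]


def positive_rate(target):
    """Share of positive samples in a binary target."""
    return sum(target) / len(target)


# Illustrative, made-up counts for ten clients in one urgency bin.
counts = [0, 0, 0, 2, 0, 1, 0, 0, 0, 0]
target = binarise(counts)
print(target)                 # the zero-inflation becomes a large negative class
print(positive_rate(target))  # the positive class remains underrepresented
```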
Resampling can be done through oversampling or undersampling. With oversampling,
one creates new samples for the minority class. One prevalently used
method is the Synthetic Minority Oversampling TEchnique (SMOTE). SMOTE synthetically
creates new samples by selecting the k nearest neighbours of an existing
minority sample. From these neighbours, one is selected and a new sample is created
at a random point on the line segment between the two samples (Chawla et al., 2002).
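The interpolation step of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used in this study: the function name, the `seed` parameter and the toy points are hypothetical, distances are Euclidean, and only one synthetic sample is generated per call.

```python
import math
import random


def smote_sample(minority, k=2, seed=0):
    """Create one synthetic minority sample by interpolating towards a neighbour."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # The k nearest minority neighbours of the base sample (excluding itself).
    others = [point for point in minority if point is not base]
    neighbours = sorted(others, key=lambda point: math.dist(base, point))[:k]
    neighbour = rng.choice(neighbours)
    # Random position on the line segment between base and neighbour.
    gap = rng.random()
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Toy minority class with two features (made-up values).
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (5.0, 5.0)]
print(smote_sample(minority, k=2, seed=1))
```

Every synthetic sample is a convex combination of two existing minority samples, so it always lies within the range spanned by the minority class.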
Although oversampling retains all original information, a drawback is the tendency
to extend the minority clusters such that it is harder to separate them from
the majority clusters. Therefore, a hybrid of SMOTE and Tomek linking is used, as
proposed by Batista et al. (2003). Tomek linking is a method of finding ambiguous
majority class instances. Two samples form a Tomek link when they are
each other’s nearest neighbour and belong to different classes. The
majority class instance of each link is removed by the algorithm. The combination of SMOTE
and Tomek linking has been shown to increase model performance in medical contexts (e.g. Zeng
et al., 2016).
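A minimal sketch of Tomek link detection and removal is given below. It assumes Euclidean distance and a brute-force nearest-neighbour search, which is only practical for small datasets; the function names and toy data are hypothetical and do not come from the study's code.

```python
import math


def tomek_links(X, y):
    """Return index pairs that are mutual nearest neighbours with different labels."""
    def nearest(i):
        candidates = (j for j in range(len(X)) if j != i)
        return min(candidates, key=lambda j: math.dist(X[i], X[j]))

    links = []
    for i in range(len(X)):
        j = nearest(i)
        # i < j avoids reporting the same pair twice.
        if i < j and nearest(j) == i and y[i] != y[j]:
            links.append((i, j))
    return links


def remove_majority_in_links(X, y, majority_label=0):
    """Drop the majority class member of every Tomek link."""
    drop = {idx for link in tomek_links(X, y) for idx in link
            if y[idx] == majority_label}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made-up): sample 0 (majority) sits right next to sample 1 (minority).
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (3.0, 3.0), (3.2, 3.0)]
y = [0, 1, 0, 0, 0]
print(tomek_links(X, y))
print(remove_majority_in_links(X, y))
```

In this toy example, samples 3 and 4 are also mutual nearest neighbours, but they share a label and therefore do not form a Tomek link.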

3.2 Feature Selection
A well-known problem within machine learning is the curse of dimensionality, which
occurs when the data has too many features, making it hard to discover patterns.
To prevent this, one needs to reduce the number of features. This can be done
through feature selection. Feature selection, as opposed to feature extraction,
provides insight into the predictive power of individual features, which promotes
interpretability (Remeseiro & Bolon-Canedo, 2019). This is crucial in this study, as
explained in Section 1.4.
This study uses Recursive Feature Elimination with Cross-Validation (RFECV),
as proposed by Guyon et al. (2002). Being an example of backward feature
elimination, RFECV starts with all features included in the model and recursively
deletes its weakest feature. It recomputes the validation performance in each
iteration and, thereby, provides insight into the optimal number of features and,
at the same time, names the features to include. The weakest predictors are
determined by the lowest feature importances per iteration, hence requiring models
that can output these importances.
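The backward elimination loop can be sketched as below. This is a simplified illustration rather than the procedure's actual implementation: the callback `fit_and_score` is a hypothetical stand-in for fitting a model with cross-validation and returning its validation score and feature importances, and the feature names and signal values are invented for the toy example.

```python
def rfe_cv(features, fit_and_score):
    """Backward elimination: drop the weakest feature until none remain.

    fit_and_score(subset) must return (validation_score, {feature: importance}).
    """
    remaining = list(features)
    history = []
    while remaining:
        score, importances = fit_and_score(remaining)
        history.append((score, list(remaining)))
        weakest = min(remaining, key=lambda f: importances[f])
        remaining.remove(weakest)
    best_score, best_subset = max(history, key=lambda entry: entry[0])
    return best_subset, best_score, history


# Toy stand-in for model fitting (assumption): each feature adds a fixed amount of
# signal, and every included feature costs a small penalty, mimicking overfitting.
SIGNAL = {"prior_contact": 0.30, "care_hours": 0.15, "frailty": 0.10,
          "noise_a": 0.0, "noise_b": 0.0}


def toy_fit_and_score(subset):
    score = sum(SIGNAL[f] for f in subset) - 0.02 * len(subset)
    return score, {f: SIGNAL[f] for f in subset}


best_subset, best_score, _ = rfe_cv(SIGNAL, toy_fit_and_score)
print(best_subset, round(best_score, 2))
```

The history of scores per subset size is what reveals the optimal number of features, while the surviving subset names the features to include.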
RFECV has been shown to increase the performance of prediction models in healthcare
classification tasks (e.g. Huang et al., 2022; Misra & Yadav, 2020; Wang et al., 2019).
Because the importance of a feature depends on which other features are included,
feature importances give limited objective insight into the best predictors. The RFECV ranking
also depends on feature importances, yet RFECV recalculates them in each iteration.
Therefore, feature importances are reported in Appendix I, while the highest
ranked features, averaged over all conventional models, are reported as the
best predictors.
3.3 Prediction Algorithms
Since there are no comparable studies, no pre-existing performance can serve as a baseline.
Therefore, this study uses Logistic Regression as a baseline because this is one of
the most widespread algorithms for a binary classification task. The algorithms
that have proven to be effective in predicting an adverse health outcome with
EHR data (conventional models; see Section 2.2) are used to outperform the Logistic
Regression model. All models can output feature importances and are, therefore,
compatible with the RFECV procedure.
3.3.1 Logistic Regression
Logistic Regression (LR) is a Generalized Linear
Model that predicts the outcome of a binary variable by fitting the best linear
decision boundary separating the classes. This boundary is the line minimising
the difference between the membership likelihoods and the class labels.
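As an illustration, the difference between membership likelihoods and class labels is commonly measured with the log loss; that this exact loss was used here is an assumption. The sketch below evaluates it for a one-feature model. The function names, weights and data points are hypothetical.

```python
import math


def sigmoid(z):
    """Map a linear score to a membership likelihood between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Likelihood of the positive class for one sample."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def log_loss(weights, bias, X, y):
    """Average log loss between predicted likelihoods and binary labels."""
    eps = 1e-12  # keeps the logarithm finite for extreme likelihoods
    total = 0.0
    for x, label in zip(X, y):
        p = min(max(predict_proba(weights, bias, x), eps), 1 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return total / len(X)


# Toy data (made-up): the classes separate around x = 2.
X = [(0.0,), (1.0,), (3.0,), (4.0,)]
y = [0, 0, 1, 1]
print(log_loss((0.0,), 0.0, X, y))   # uninformative model: every likelihood is 0.5
print(log_loss((2.0,), -4.0, X, y))  # decision boundary at x = 2: lower loss
```

Fitting the model amounts to searching for the weights and bias that make this loss as small as possible.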
LR models are widespread in the literature and intuitive to interpret. Furthermore,
in some cases they match the performance of more complex algorithms
(e.g. Christodoulou et al., 2019; Song et al., 2021), specifically after using RFECV
(e.g. Misra & Yadav, 2020). This makes the LR algorithm appropriate to construct
a baseline.
3.3.2 Random Forest
Random Forest (RF) is an example of an ensemble learner.
Ensemble learners combine different weak learners into one ensemble to make
predictions. RF uses an ensemble method known as bagging. In the case of RF,
this is a combination of decision trees with limited depth, all performing the
same prediction task. Furthermore, RF combines bagging with bootstrapping and
feature randomness, through sampling features and samples with replacement
from the training set. A detailed description of RF is given by Breiman (2001).
RF is known as one of the best general-purpose classifiers of our time because it
performs well on large and high-dimensional datasets (Biau & Scornet, 2016).
3.3.3 EXtreme Gradient Boosting
Another tree-based ensemble learner is Gradient
Boosted Trees (GBT). Instead of a bagging approach, where multiple
learners are trained in parallel, this algorithm takes a boosting approach. With
boosting, the learners are trained sequentially, where each learner is trained to reduce
the errors made by preceding learners. GBT does this by training each learner on
the residuals of the preceding learners. By aggregating the results from each learner,
the model can provide a prediction. A detailed description of GBT is given by
Friedman (2001).
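The idea of training on residuals can be sketched with single-split regression trees (stumps) on one feature. This is a simplified illustration under stated assumptions: it uses squared error on 0/1 labels rather than a classification loss, and the function names, learning rate and data are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree on a single feature (squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # keeps both sides of the split non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Sequentially fit stumps on the residuals of the current ensemble."""
    base = sum(y) / len(y)  # initial prediction: the mean of the target
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


# Toy data (made-up): the target switches from 0 to 1 between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
model = gradient_boost(x, y)
print([round(model(xi), 3) for xi in x])
```

Each round shrinks the remaining residuals, so the aggregated prediction moves step by step towards the target.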
An optimised version of GBT is eXtreme Gradient Boosting (XGB). Although XGB
is comparable to GBT, it is known to scale better to larger datasets, to be more efficient
and to generalise better. Interested readers are referred to Chen and
Guestrin (2016).
3.3.4 Adaptive Boosting
Apart from GBT, other tree-based boosting approaches
exist. One of them is Adaptive Boosting (ADA), proposed by Freund and Schapire
(1997). ADA is not trained on residuals, but on resampled samples. This resampling
is weighted, such that the likelihood of a sample being selected is higher for
hard-to-classify samples.
Furthermore, ADA assigns weights to each learner according to its predictive
power. Although GBT outperforms ADA most of the time, in some cases ADA
performs better (e.g. Melillo et al., 2017; Mujumdar & Vaidehi, 2019).
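One round of the weighting scheme can be sketched as follows, using the classic binary AdaBoost update. The function names and toy predictions are hypothetical; the sketch assumes sample weights that sum to one and a weak learner whose predictions are already available.

```python
import math
import random


def adaboost_round(weights, y_true, y_pred):
    """Return the learner weight and the updated, normalised sample weights."""
    # Weighted error of the weak learner (weights are assumed to sum to 1).
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
    # Learner weight: larger when the learner is more accurate.
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples gain weight, correctly classified ones lose weight.
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


def weighted_resample(weights, seed=0):
    """Draw sample indices with probability proportional to their weight."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=len(weights))


# Toy round (made-up): only the last of five samples is misclassified.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0]
alpha, new_weights = adaboost_round([0.2] * 5, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
print(weighted_resample(new_weights))
```

After the update, the misclassified sample carries half of the total weight, so it is far more likely to be drawn for training the next learner.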
3.3.5 Random Undersampling Booster Lastly, another boosting approach is
considered. Random Undersampling Booster (RUSB) is an extension to ADA. It
performs implicit resampling, which removes the need for resampling earlier in the
pipeline. Instead of SMOTE+Tomek, RUSB uses random undersampling, which is more
efficient than SMOTE+Tomek, although it results in a loss of information.
Because resampling is done iteratively and randomly, the information lost in
one iteration is likely to be included in other iterations.
RUSB is described in more detail by Seiffert et al. (2010).
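The per-iteration undersampling can be sketched as follows. This is only an illustration of the resampling step, not of the full boosting procedure; the 1:1 class ratio, the function name and the toy labels are assumptions.

```python
import random


def random_undersample(labels, rng):
    """Keep every minority (1) sample and an equally sized random draw of majority (0) samples."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept_majority = rng.sample(majority, k=len(minority))
    return sorted(minority + kept_majority)


labels = [1] * 3 + [0] * 17  # toy data: 3 positives, 17 negatives
rng = random.Random(0)

# One fresh draw per boosting iteration: each draw discards most negatives,
# but different iterations see different negatives.
subsets = [random_undersample(labels, rng) for _ in range(10)]
covered = set().union(*subsets)
```

Each subset is balanced, while the union over iterations (covered) contains more majority samples than any single subset, which is the sense in which information lost in one iteration is recovered in others.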
Experiments using feature extraction and Artificial Neural Networks were
conducted but did not increase model performance. Since these methods are also
less interpretable, their inclusion in this thesis was not deemed worthwhile. Readers
interested in these experiments are referred to Appendix B.
4 experimental setup
This section describes the raw datasets, operations performed to obtain the final
dataset and the experimental procedure. The purpose is to introduce the data and
ensure reproducibility.
4.1 Raw Datasets
This study simulates a combining and collaboration approach between GPCs and
home care organisations. Accordingly, the final dataset is a merge of a home care
organisation dataset and a GPC dataset.
4.1.1 Home Care Organisation Dataset The home care organisation dataset
contains 60,167 EHRs from 21,933 unique home care clients, residing in the North-
Western region of the Netherlands between 2015 and January 2022. The final
dataset only includes clients who were in care for longer than 3 months, which
decreases the size to 13,802 clients with 35,277 EHRs. As described in Section 1.2,
EHRs are structured through the Omaha system. Each EHR consists of a problem
classification scheme, problem rating scale and intervention scheme.
The problem classification scheme consists of 42 problem domains, of which multiple
can be selected per EHR. Nurses draw up an EHR when they identify new problems
or when the previous EHR is no longer up-to-date. Nurses score both the
current and the desired level of signals, knowledge and behaviour of a client on
each selected problem domain. This score is expressed on a Likert scale from 1
(extreme problem) to 5 (no problem). Along with problem identification, nurses can
select from a list of 75 interventions in the intervention scheme. One can regard
this data as a snapshot of the client’s state, planned interventions and desired state.
Some problem domains and interventions are excluded in the dummy coding
because they occur in fewer than 0.3% of instances.
Additionally, information regarding the care intensity, type of care, contact with
nurses on alarm service and demographics (i.e. gender, birth date and living
status) is available in the dataset.
4.1.2 GPC Dataset The GPC dataset contains 1,356,603 contacts, of which 23,248
originate from home care clients during their period in care. The dataset contains
features regarding the contact itself, such as date, length of conversation and
urgency estimation. Only the urgency estimations are included in the final dataset
because additional contact characteristics are not known in advance. The final dataset
does contain features regarding the number of contacts a client had in preceding
years.

The dataset contains 5,477 contacts with U0-2, 6,937 contacts with U3 and 10,834
contacts with U4-5.
4.1.3 Final Dataset Combining these two datasets is the major novelty of this
study. As it explores the possibilities of early identification of home care clients
contacting the GPC, the dataset reflects the situation where a prediction is made
for each twelve months in care. This starts for each client as soon as their first EHR
is compiled and, thereafter, is repeated for each twelve months in care. Predictions
are made separately for each urgency bin.
Accordingly, the final dataset consists of client-year combinations designated to
predict whether a client will contact the GPC per urgency bin in that period.
The dataset contains 28,327 such combinations. A description of all 115 available
features can be found in Appendix C.
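The construction of client-year combinations can be sketched as below. The function names are hypothetical, and two details are assumptions rather than statements from the thesis: a window is opened whenever the client is still in care at its start, and a window that would run past the last date with GPC data is dropped.

```python
from datetime import date

URGENCY_BINS = ("U0-2", "U3", "U4-5")


def add_year(d):
    """Shift a date by one calendar year (29 February maps to 28 February)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def client_year_windows(first_ehr, care_end, data_end):
    """Consecutive twelve-month windows starting at the first EHR."""
    windows, start = [], first_ehr
    while start < care_end:
        end = add_year(start)
        if end > data_end:  # the outcome cannot be observed for the full window
            break
        windows.append((start, end))
        start = end
    return windows


def label_window(window, contacts):
    """contacts: list of (date, urgency_bin); returns one binary target per urgency bin."""
    start, end = window
    bins_seen = {urgency for day, urgency in contacts if start <= day < end}
    return {urgency: int(urgency in bins_seen) for urgency in URGENCY_BINS}


# Toy client (hypothetical dates).
windows = client_year_windows(date(2019, 3, 1), date(2021, 6, 1), date(2022, 1, 24))
contacts = [(date(2019, 7, 4), "U3"), (date(2020, 12, 24), "U4-5")]
targets = [label_window(w, contacts) for w in windows]
```

For this toy client two windows remain: the third window would end after 24 January 2022 and is therefore not created.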
4.2 Cleaning
After merging operations, the dataset required cleaning. First, the three current
problem rating scales required cleaning because 7.1% to 19.1% of data is missing.
As first described by Rubin (1976), missing data can follow different patterns: Missing
Not At Random, Missing At Random and Missing Completely At Random.
Since the missingness in the scores is only weakly correlated with the missingness in
other variables, while some observed variables are correlated with it, this data is
considered Missing At Random. Imputing the data is, therefore, appropriate (Little et al.,
2014).
There is disagreement whether Stochastic Regression Imputation (SRI) is sufficient
in such cases (cf. Little et al., 2014; Masconi et al., 2015). After imputing twenty
datasets, the differences between datasets were minor and the results showed
problem rating scales had little effect, thus giving little reason to suspect the
introduction of bias in the results. Therefore, and due to time limitations, it was
decided to perform SRI instead of Multiple Imputation.
Secondly, the dataset contained instances where clients went out of care and
returned shortly after. Care period gaps smaller than three months are merged, to
prevent a substantial number of gaps from appearing in the data. If care period gaps are
larger than three months, the periods are treated as independent care periods.
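A minimal sketch of this merging rule is given below. It approximates three months as 91 days, which is an assumption, and the function name and example dates are hypothetical.

```python
from datetime import date

MAX_GAP_DAYS = 91  # assumption: "three months" approximated as 91 days


def merge_care_periods(periods, max_gap_days=MAX_GAP_DAYS):
    """Merge (start, end) care periods that are separated by fewer than max_gap_days."""
    merged = []
    for start, end in sorted(periods):
        if merged and (start - merged[-1][1]).days < max_gap_days:
            # Small gap: extend the previous care period.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Large gap (or first period): start an independent care period.
            merged.append((start, end))
    return merged


periods = [
    (date(2018, 1, 1), date(2018, 6, 30)),
    (date(2018, 8, 1), date(2019, 2, 1)),  # 32-day gap: merged with the first period
    (date(2020, 1, 1), date(2020, 5, 1)),  # gap of almost a year: kept separate
]
merged = merge_care_periods(periods)
```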
Thirdly, instances that span beyond 24 January 2022 are discarded, because
this is the last date for which GPC data is available. Extrapolating beyond this date,
while controlling for the possibility that clients may contact the GPC afterwards, is not
plausible, since GPC contacts are not independently occurring events. The impact
of these cleaning steps is incorporated in the reported dataset sizes in Section 4.1.
The merging and cleaning pipeline can be found in Appendix D.

4.3 Exploratory Data Analysis
This section can be regarded as an overview of the features that show the most
significant bivariate association, in order to gain a first understanding of the
features. All plots concern the association between a feature and GPC contact per
urgency bin. A correlation table can be found in Appendix F.
[Figure 3 image: three density panels (U0-2, U3, U4-5), each comparing clients with and
without contact in that urgency bin. x-axis: average number of GPC contacts per year;
y-axis: density.]
Figure 3: Density plots on the association of GPC contact per urgency bin and the average number
of GPC contacts in preceding years
The first observation is that GPC contact in preceding years is highly associated
with current contact, as can be seen in Figure 3. This observation is also revealed
in Figure 4, which illustrates the association with the average number of contacts
in the preceding two years.
[Figure 4 image: three bar-plot panels (U0-2, U3, U4-5), each comparing clients with and
without contact in that urgency bin. y-axis: average number of GPC contacts in the
preceding two years.]
Figure 4: Bar plots on the association of GPC contact per urgency bin and the average number of
GPC contacts in the preceding two years
The data, moreover, reveals associations with EHR features. Figure 5 shows an
association with the number of problem domains in the most recent EHR.
[Figure 5 image: three bar-plot panels (U0-2, U3, U4-5). y-axis: number of problem
domains in the most recent care plan.]
Figure 5: Bar plots on the association of GPC contact per urgency bin and the number of problem
domains in the most recent EHR
One can also see associations by looking at particular problem domains. These
associations exist on problem domains regarding bowel function, medication
regimen, respiration and urinary function, as shown in Figure 6. Multiple other
associations exist and can be found in Appendix E.

[Figure 6 image: bar plots for four problem domains (bowel function, medication regimen,
respiration and urinary function), each with panels for U0-2, U3 and U4-5. y-axis:
proportion of most recent care plans in which the problem domain is mentioned.]
Figure 6: Bar plots on the association of GPC contact per urgency bin and the proportion a problem
domain is mentioned in the most recent EHR
Associations with gender and living status are either insignificant or have weak
effect sizes. The same is true for the association with problem rating scales. Yet, as
Figure 7 reveals, the age of a client is associated with GPC demand.
[Figure 7 image: three bar-plot panels (U0-2, U3, U4-5). y-axis: average age at the start
of the combination period (age_startperiod), with the axis shown from 75 to 81.]
Figure 7: Bar plots on the association of GPC contact per urgency bin and the age of a client
4.4 Experimental Procedure
This section describes the experimental pipeline, as visualised in Figure 8. Merging,
cleaning and exploratory data analysis procedures have already been discussed in
Sections 4.1, 4.2 and 4.3, respectively. To answer the subquestions, two experiments
are conducted. The first experiment concerns feature selection and the second
experiment concerns model evaluation. The experiments require data splitting
and model tuning throughout the pipeline. All methods are implemented using
Python. The packages used can be found in Appendix G.
[Figure 8 image: flow diagram. Merging & formatting, cleaning and exploratory data analysis
are followed by a stratified train-test split into training data and test data. Experiment 1:
feature selection (RFECV). Experiment 2: model tuning on the full dataset and on the RFECV
dataset, both using cross-validation with SMOTE+Tomek on the train folds, followed by model
evaluation including error analysis. The pipeline is repeated for each urgency bin.]
Figure 8: Experimental pipeline
4.4.1 Data Splitting Because of the class imbalance, stratified sampling is used
for each urgency bin to set apart 20% of the dataset as test data, creating three
independent splits to build models on. It is not deemed plausible to stratify on
all three urgency bins simultaneously, creating one dataset that is stratified
on all three dependent variables, since they are correlated with each other,
making the split essentially non-random. The test data remains untouched during
tuning and is only used to obtain an unbiased estimate of how well the models generalise.
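A stratified 80/20 split for a single urgency bin can be sketched as follows. The function name, seed and toy labels are hypothetical; the thesis does not state which implementation was used.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with the class ratio preserved in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)  # same fraction taken from every class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 140 + [0] * 860  # toy data with 14% positives
train_idx, test_idx = stratified_split(labels)
test_positive_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Because the split is made per class, the positive rate in the test set equals the positive rate in the full toy dataset. Repeating the call with the labels of another urgency bin gives an independent split for that bin.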
4.4.2 Model Tuning Models are tuned using Bayesian Optimization (BO) with
5-fold cross-validation. BO is an informed optimisation strategy. The algorithm
optimises sequentially and learns from previous iterations. BO does this by updating
a Gaussian Process Regression (GPR) model of the objective function
(Rasmussen & Williams, 2006, pp. 126-129).
Subsequently, an acquisition function picks the next hyperparameters. The acquisition
function makes a trade-off between exploiting promising areas and
exploring new areas. In each iteration, the GPR is updated and new hyperparameters
are sampled (Frazier, 2018; Yang & Shami, 2020). Thereby, the parts of
the hyperparameter space yielding the most promising results receive the most
attention, so that fewer iterations are needed. Table 2 gives an overview of the
hyperparameters tuned per model and the best-performing hyperparameters can
be found in Appendix H.
The AUC-PR metric is used as the objective function and the algorithm is allowed
50 iterations. For each model, the search either converged within these iterations or
showed only small remaining differences between iterations. During model tuning,
SMOTE+Tomek with k=5 is applied exclusively on the training folds to balance the
classes, except with RUSB. The model tuning is done both before RFECV with all features
included (full model) and after RFECV with only the selected features included (selected model).

Table 2: Hyperparameter search spaces per model
Model       Hyperparameter     Search space
LR          Penalty            [l1, l2, elasticnet]
            C                  {0.1, 100}, distribution=log10-uniform
            Class weight       [balanced, None]
            L1 ratio           {0, 1}, distribution=uniform
            Solver             [saga]
RF          N estimators       [1300]
            Max features       [sqrt, log2]
            Max depth          {5, 20}, distribution=count
            Min samples-split  {2, 15}, distribution=count
            Min samples-leaf   {1, 8}, distribution=count
            Class weight       [balanced, None]
XGB         N estimators       {200, 700}, distribution=count
            Max depth          {3, 80}, distribution=count
            Min child weight   {1, 10}, distribution=count
            Gamma              {0, 5}, distribution=uniform
            Learning rate      {0.001, 0.3}, distribution=log10-uniform
            Subsample          {0.5, 0.9}, distribution=uniform
            Colsample bytree   {0.5, 0.9}, distribution=uniform
ADA & RUSB  N estimators       {200, 700}, distribution=count
            Learning rate      {0.001, 0.3}, distribution=log10-uniform
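To make the distributions in Table 2 concrete, the sketch below draws random candidates from the XGB search space. It uses plain random sampling rather than Bayesian Optimization, reads "count" as a uniformly drawn integer (an assumption), and the function names are hypothetical.

```python
import math
import random


def log10_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def sample_xgb_candidate(rng):
    """One random draw from the XGB ranges in Table 2."""
    return {
        "n_estimators": rng.randint(200, 700),
        "max_depth": rng.randint(3, 80),
        "min_child_weight": rng.randint(1, 10),
        "gamma": rng.uniform(0, 5),
        "learning_rate": log10_uniform(0.001, 0.3, rng),
        "subsample": rng.uniform(0.5, 0.9),
        "colsample_bytree": rng.uniform(0.5, 0.9),
    }


rng = random.Random(42)
candidates = [sample_xgb_candidate(rng) for _ in range(1000)]

# Under a log10-uniform distribution, small learning rates are sampled far more
# often than under a plain uniform distribution over the same range.
share_below_0_01 = sum(c["learning_rate"] < 0.01 for c in candidates) / len(candidates)
```

The log10-uniform choice gives each order of magnitude equal attention: roughly 40% of the sampled learning rates fall below 0.01, against about 3% under a plain uniform distribution.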
4.4.3 Experiment 1: Feature Selection RFECV is used to select the best predic-
tive features for each prediction model. Due to time limitations, the two weakest
features are eliminated in each iteration.
5-fold cross-validation is implemented to split the training data into training and
validation folds. Furthermore, AUC-PR is used for evaluating the intermediate
models.
This study’s RFECV implementation looks at the ranking of the Mean Gini De-
crease, which is the average reduction of the Gini impurity brought by that feature
(Louppe, 2014, pp. 135-156). The Mean Gini Decrease was used because it is
available and consistent over all four conventional models.
The features included in the selected model are the features included in the RFECV
iteration of optimal performance.
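The elimination loop can be sketched as follows. The callbacks importance_fn and score_fn are hypothetical stand-ins for the model's feature importances and the cross-validated AUC-PR, and the toy importances and feature names are invented for illustration.

```python
def recursive_feature_elimination(features, importance_fn, score_fn, step=2):
    """Drop the `step` least important features per iteration and keep the best-scoring subset."""
    current = list(features)
    history = []
    while True:
        history.append((list(current), score_fn(current)))
        if len(current) <= 1:
            break
        importances = importance_fn(current)
        ranked = sorted(current, key=lambda f: importances[f])  # weakest first
        n_drop = min(step, len(current) - 1)  # always keep at least one feature
        dropped = set(ranked[:n_drop])
        current = [f for f in current if f not in dropped]
    best_subset, best_score = max(history, key=lambda item: item[1])
    return best_subset, best_score, history


# Toy setting: three informative features and three noise features.
TRUE_IMPORTANCE = {
    "prev_contacts": 0.5, "age": 0.2, "n_domains": 0.15,
    "noise_a": 0.0, "noise_b": 0.0, "noise_c": 0.0,
}


def importance_fn(subset):
    return {f: TRUE_IMPORTANCE[f] for f in subset}


def score_fn(subset):
    # Stand-in for cross-validated AUC-PR: signal minus a small penalty per feature.
    return sum(TRUE_IMPORTANCE[f] for f in subset) - 0.01 * len(subset)


best_subset, best_score, history = recursive_feature_elimination(
    list(TRUE_IMPORTANCE), importance_fn, score_fn, step=2
)
subset_sizes = [len(subset) for subset, _ in history]
```

With a step of two, the evaluated subset sizes are 6, 4, 2 and 1, so the best subset found still contains one noise feature. This illustrates the cost of the coarser step: subsets of odd size in between are never evaluated.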
4.4.4 Experiment 2: Model Evaluation When dealing with class imbalance,
accuracy becomes misleading because it overprioritises the majority class. How-
ever, in most cases with imbalanced data, the minority class is most interesting
(Ali et al., 2013; Thabtah et al., 2020). Therefore, one needs to consider metrics that
are robust to class imbalance. Two metrics are selected to appropriately handle
imbalanced data, while also preventing poor trade-offs.
Although the Area Under Receiver Operating Characteristic Curve (AUC-ROC)
is most prevalent in literature, the Area Under Precision-Recall Curve (AUC-PR)
exposes poor performance more clearly when dealing with imbalanced data (Davis &
Goadrich, 2006; Saito & Rehmsmeier, 2015). AUC-PR summarises the recall and
precision scores and, thereby, ignores the true negatives in its evaluation. This
means this metric is not biased by the overrepresentation of the majority class. For
that reason, AUC-PR is regarded as the primary metric.
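One common step-wise estimator of AUC-PR is average precision, sketched below. Whether the thesis used this estimator or another one (for example trapezoidal integration) is not stated, so the function is only an illustration; the toy labels and scores are hypothetical and the scores are assumed to contain no ties.

```python
def average_precision(y_true, scores):
    """Average of the precision values at each rank where a positive is retrieved."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    true_positives, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ap = average_precision(y_true, scores)

# Adding 100 correctly low-ranked negatives (extra true negatives) leaves the
# metric unchanged, whereas accuracy would rise.
ap_more_negatives = average_precision(y_true + [0] * 100, scores + [0.0] * 100)
```

The second call shows the property described above: true negatives that are ranked below all positives do not enter the calculation.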
However, the AUC-PR is only a summary of positive class metrics, which may
conceal poor trade-offs between precision and recall. To identify and prevent this
situation, the confusion matrix is considered. In the trade-off, recall is favoured,
assuming interventions based on model predictions concern positive actions that are
cheap to execute and, furthermore, that the absence of an intervention does not impose
any harm on the subject.
In addition, interpretability is required (see Section 1.4). Therefore, the most
interpretable model is preferred in the case of similar performance. Interpretable
models are those using fewer features or with an interpretable architecture such
as LR.
A group-level error analysis is conducted on the best-performing models with the
test data to identify differences in model performance between groups. Groups
were drawn from all 115 features by splitting on one feature at a time. Categorical
features were split by level and numerical features were automatically binned
into a maximum of five groups such that each group has an equal number of
samples. Groups were included if their AUC-PR deviated by
more than two standard deviations from the average AUC-PR. Groups smaller
than 385 instances were excluded from the analysis.
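The grouping and flagging steps can be sketched as below. The thesis does not specify how the standard deviation is computed, so taking it over the AUC-PR values of all eligible groups is an assumption, as are the rank-based binning, the function names and the toy scores.

```python
import statistics


def equal_frequency_bins(values, max_bins=5):
    """Assign each value to one of up to max_bins groups of (near-)equal size, by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * max_bins // len(values)
    return bins


def flag_groups(scores, sizes, overall, min_size=385, n_sd=2.0):
    """Return groups whose score deviates more than n_sd standard deviations from overall."""
    eligible = {g: s for g, s in scores.items() if sizes[g] >= min_size}
    sd = statistics.pstdev(list(eligible.values()))
    return sorted(g for g, s in eligible.items() if abs(s - overall) > n_sd * sd)


# Binning: 1000 distinct toy values end up in five groups of 200.
values = [(i * 37) % 1000 for i in range(1000)]
bins = equal_frequency_bins(values)

# Flagging: nine groups near the average, one clear underperformer, one group that is too small.
scores = {f"group_{k}": 0.30 for k in range(9)}
scores["outlier"] = 0.10
scores["tiny"] = 0.90
sizes = {g: 500 for g in scores}
sizes["tiny"] = 200
flagged = flag_groups(scores, sizes, overall=0.28)
```

In this toy case only the group named outlier is flagged; the group named tiny deviates more but is excluded because it has fewer than 385 instances.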
5 results
[Figure 9 image: six line-plot panels showing AUC-PR against the number of features in the
model (1 to 111). For each urgency bin (U0-2, U3, U4-5), one panel shows LR and RUSB and
one panel shows RF, XGB and ADA.]
Figure 9: Model performance (AUC-PR) with respect to the number of features included. Note: The
vertical dashed line denotes the chosen number of features.
5.1 Experiment 1: Feature Selection
As described in Section 3.2, RFECV offers both insights into the optimal number
of features and the specific features to include in the selected dataset.
In Figure 9 one can see how the performance of the prediction models changes
with respect to the number of features included.
The major difference is that LR and RF model performance decreases after
some point while boosting models handle higher dimensionality better. Although
model performance for boosting models does not decrease, the results show an
elbow in the model performances. The elbow is the point where model performance
stabilises. Even if it does not improve model performance, including fewer features
does improve the interpretability. Therefore, the first peak in the elbow is used as
the optimal point. Table 3 displays the number of included features per model.
Table 3: Number of features included in selected models
Model  U0-2  U3  U4-5
LR     15    11  7
RF     9     5   9
XGB    11    5   25
ADA    1     1   3
RUSB   25    9   9
Features are selected per model and this selection can be found in Appendix I.
Noteworthy is that the XGB model on U4-5 contacts profits from considerably more
features and that the features regarding preceding GPC contact are less important
than they are in other models. The RUSB model on U0-2 contacts also profits
from more features, although its model performance curve did not exhibit a
clear elbow characteristic. The ADA model shows constant model performance
over iterations when trained on U0-2 or U3 contacts. This indicates this model
only uses one feature, regardless of whether it has access to more.
Table 4: Averaged top-five ranked most important features per urgency bin
Rank  U0-2                          U3                            U4-5
1     Average number of GPC         Average number of GPC         Number of specific urgency
      contacts                      contacts                      bin GPC contacts in
                                                                  preceding two years
                                                                  (U2-U5) a
2     Number of specific urgency    Number of specific urgency    Average number of GPC
      bin GPC contacts in           bin GPC contacts in           contacts
      preceding two years           preceding two years
      (U1-5)                        (U2-U5)
3     Problem domain:               Start year                    Days in care
      Respiration
4     Number of EHRs                Problem domain:               Problem domain:
                                    Urinary function              Medication regimen
5     Average length of home        Average EHR interval          Number of problem domains
      care visits
a Note: this is a collection of features. Models often use preceding GPC contact with multiple urgency
estimation levels as features, but they are grouped in this overview to increase the readability.
The highest ranked features, averaged over all conventional models, are shown
in Table 4 to provide an overview of the most important features. Features
regarding preceding GPC contact are by far the most important. The RF model on
U3 contacts and all ADA models even use these features alone. Other models
also use features originating from EHRs. Features regarding home care intensity
(e.g. average length of home care visits) or features regarding a client’s overall frailty
status (e.g. number of problem domains) contribute to model performance in multiple
models. Interestingly, problem domains are also included in the selected features,
where specifically respiration, medication regimen and urinary function are useful
predictors.
5.2 Experiment 2: Model Evaluation
In the first phase of this experiment, the full models are tuned. The results after
tuning are displayed in Table 5, except for the confusion matrices, which can be
found in Appendix J.
Table 5: Full model performance results
Urgency             Model   AUC-PR  Precision  Recall  AUC-ROC  Accuracy
U0-2 (Very-urgent)  Random  0.137   -          -       0.500    -
                    LR      0.224   0.183      0.543   0.622    0.604
                    RF      0.255   0.283      0.317   0.701    0.796
                    XGB     0.243   0.225      0.486   0.686    0.700
                    ADA     0.250   0.224      0.601   0.694    0.660
                    RUSB    0.278   0.218      0.690   0.708    0.618
U3 (Urgent)         Random  0.166   -          -       0.500    -
                    LR      0.297   0.238      0.595   0.650    0.617
                    RF      0.312   0.335      0.364   0.700    0.775
                    XGB     0.339   0.563      0.086   0.711    0.838
                    ADA     0.307   0.267      0.577   0.690    0.668
                    RUSB    0.356   0.262      0.721   0.719    0.618
U4-5 (Non-urgent)   Random  0.208   -          -       0.500    -
                    LR      0.345   0.304      0.548   0.659    0.644
                    RF      0.409   0.416      0.435   0.721    0.743
                    XGB     0.403   0.509      0.173   0.714    0.793
                    ADA     0.393   0.336      0.598   0.712    0.700
                    RUSB    0.430   0.328      0.725   0.732    0.633
These results show that model performance is higher for less urgent
contacts, indicating that these contacts are more predictable. On top of that, the models
achieve consistent model performance relative to each other. All other models
outperform LR on AUC-PR for each urgency bin and, apart from that, they have comparable
performance. Consequently, the main differences are found in the trade-offs
between recall and precision. XGB prioritises precision over recall for U3 and U4-5
contacts, leading to few positive predictions, while RF does not appear to prioritise
one metric. LR, ADA and RUSB models favour recall over precision.
For the second phase, the datasets only include the selected features (see Sec-
tion 5.1). The results are given in Table 6.
Table 6: Selected model performance results
Urgency             Model   AUC-PR  Precision  Recall  AUC-ROC  Accuracy
U0-2 (Very-urgent)  Random  0.137   -          -       0.500    -
                    LR      0.265   0.234      0.526   0.667    0.699
                    RF      0.253   0.235      0.492   0.685    0.711
                    XGB     0.249   0.225      0.550   0.689    0.679
                    ADA     0.219   0.184      0.884   0.679    0.446
                    RUSB    0.281   0.216      0.707   0.708    0.609
U3 (Urgent)         Random  0.166   -          -       0.500    -
                    LR      0.313   0.309      0.424   0.667    0.748
                    RF      0.333   0.241      0.759   0.697    0.565
                    XGB     0.324   0.267      0.594   0.699    0.663
                    ADA     0.265   0.220      0.836   0.684    0.482
                    RUSB    0.340   0.263      0.725   0.713    0.617
U4-5 (Non-urgent)   Random  0.208   -          -       0.500    -
                    LR      0.404   0.338      0.592   0.686    0.673
                    RF      0.419   0.364      0.584   0.728    0.701
                    XGB     0.395   0.333      0.585   0.715    0.669
                    ADA     0.405   0.298      0.786   0.712    0.570
                    RUSB    0.431   0.322      0.735   0.726    0.622
The selected model performance is comparable to the full model performance. This
means that similar performance can be achieved with considerably fewer features,
namely 1 to 25 instead of 115 features. Again, model performance increases for
less urgent contacts.
However, LR models are no longer outperformed by all other models, indicating these
models profit most from feature selection. ADA models for U0-2 and U3 contacts tend to
underperform, possibly because they only use one feature. Nevertheless, other models
solely using preceding GPC contact features (i.e. RF on U3 contacts and ADA on
U4-5 contacts) achieve model performance comparable to models that also use
EHR features. Beyond that, the model performance is mostly comparable.
AUC-PR performance is highest for RUSB models, but they consistently favour
recall over precision. The same preference is shown by all ADA models, the RF
model on U3 contacts and, although to a lesser extent, the selected models in general.
The confusion matrices of the selected models in Figure 10 show that the RF model on U3
contacts, and all RUSB and ADA models, tend to predict fewer negatives and more
positives. This explains the lower specificity among these models.
Urgency  Model (selected)  TN    FP    FN   TP
U0-2     LR                3549  1340  368  409
U0-2     RF                3644  1245  395  382
U0-2     XGB               3418  1471  350  427
U0-2     ADA               1842  3047  90   687
U0-2     RUSB              2900  1989  228  549
U3       LR                3839  889   540  398
U3       RF                2488  2240  226  712
U3       XGB               3198  1530  381  557
U3       ADA               1948  2780  154  784
U3       RUSB              2818  1910  258  680
U4-5     LR                3116  1369  482  699
U4-5     RF                3281  1204  491  690
U4-5     XGB               3099  1386  490  691
U4-5     ADA               2302  2183  253  928
U4-5     RUSB              2657  1828  313  868
(TN = truth false, predicted false; FP = truth false, predicted true;
FN = truth true, predicted false; TP = truth true, predicted true)
Figure 10: Confusion matrices selected models
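As a worked check, the reported metrics can be recomputed from the confusion matrix counts in Figure 10. The sketch below does this for the selected LR and ADA models on U0-2 contacts; the function name is hypothetical.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive the threshold-based metrics from the four confusion matrix cells."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
    }


# Counts taken from Figure 10 (selected models, U0-2 contacts).
lr_u02 = confusion_metrics(tn=3549, fp=1340, fn=368, tp=409)
ada_u02 = confusion_metrics(tn=1842, fp=3047, fn=90, tp=687)
```

Rounded to three decimals, the LR counts give a precision of 0.234, a recall of 0.526 and an accuracy of 0.699, and the ADA counts give 0.184, 0.884 and 0.446, in line with Table 6. The specificity, which is not tabulated, is about 0.73 for LR against about 0.38 for ADA.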
5.2.1 Error Analysis An error analysis was conducted on the best-performing
models (see Table 7). The analysis yielded two noteworthy findings. Firstly, as
plotted in Figure 11, the best-performing models show a strong dependency on
preceding GPC contact to make accurate predictions.
[Figure 11 image: three bar-plot panels (U0-2, U3, U4-5) of AUC-PR performance per group of
the average number of GPC contacts per year. Groups: 0.00-0.18, 0.18-0.45, 0.45-0.94 and >0.94
for U0-2; 0.00-0.20, 0.20-0.47, 0.47-0.96 and >0.96 for U3; 0.00-0.19, 0.19-0.45, 0.45-0.92 and
>0.92 for U4-5.]
Figure 11: Average model performance of best-performing models with respect to the average
number of GPC contacts. Note: the horizontal dashed line represents the average model
performance of the two best-performing models in that urgency bin and the error bars denote the
difference in model performance between those models.
Secondly, as shown in Figure 12, the best-performing models on U4-5 contacts
exhibit decreasing predictability for more recent years. Other urgency bins also
display fluctuations but without a clear linear trend.
[Figure 12 image: three bar-plot panels (U0-2, U3, U4-5) of AUC-PR performance per start year
group. Groups as labelled on the x-axis: 2015-2016, 2016-2018, 2018-2019, 2019-2020 and
2020-2021.]
Figure 12: Average model performance of best-performing models with respect to the start year.
Note: the horizontal dashed line represents the average model performance of the two
best-performing models in that urgency bin and the error bars denote the difference in model
performance between those models.
The error analysis did not reveal noteworthy differences in model
performance for other groups. A comprehensive error analysis can be found in
Appendix K.
6 discussion
This section starts with a discussion of the results, after which the study’s rele-
vance, limitations, recommendations for future work and ethical considerations
are discussed.
The first experiment answers the first subquestion: Which features can serve as
accurate predictors? Feature selection shows the model performance to decrease or
stabilise after 1 to 25 features and names the most important features. Primarily,
the features regarding preceding GPC contact are important. Furthermore, feature
selection reveals that features regarding home care intensity, client frailty and the
problem domains respiration, medication regimen and urinary function contribute to
model performance in varying urgency bins.
Feature selection gave three additional insights. As a first point, the results suggest
features which were not deemed interesting during Exploratory Data Analysis,
such as the average EHR interval. This illustrates the relevance of feature selection as
a mechanism to provide more insight into associated features, since these
associations may not be obvious when they are regarded bivariately.
As a second point, feature selection reduces the dimensions in the final models
effectively, thus making them more interpretable. Even without instance-specific
explanations (see Section 6.4), the decision to only include a restricted feature set
does achieve this.
As a third point, this study contributes to the existing literature by exposing the
predictive power of preceding GPC contact. Linking back to Andersen’s mechanisation,
one can see the best predictors are related to health service usage, but
this is barely studied in the existing literature. Apart from that, the features are
consistent with existing literature, where for example respiratory diagnoses or a
higher frailty status are associated with more frequent or more urgent GPC contacts.
The second experiment answers the second subquestion: What is the best conventional
machine learning model performance? Model performance was higher for less
urgent contacts, possibly because the prevalence of positive instances is higher for less
urgent contacts. This indicates that predictability increases when predicted events
are more common.
Since model performance within urgency bins is comparable, the discussion about
best model performance switches towards which model is most interpretable and
which trade-off is best.
According to that logic, selected models’ advantages over full models are evi-
dent. Furthermore, selected models favouring recall are considered to an extent
appropriate. Considering these factors, the best-performing prediction models are
shown in Table 7.
Table 7: Best selected model performances per urgency bin
Urgency             Model  AUC-PR  Precision  Recall  AUC-ROC  Accuracy  Features
U0-2 (Very-urgent)  LR     0.265   0.234      0.526   0.667    0.699     15
                    RUSB   0.281   0.216      0.707   0.708    0.609     25
U3 (Urgent)         XGB    0.324   0.267      0.594   0.699    0.663     5
                    RUSB   0.340   0.263      0.725   0.713    0.617     9
U4-5 (Non-urgent)   RF     0.419   0.364      0.584   0.728    0.701     9
                    RUSB   0.431   0.322      0.735   0.726    0.622     9
For U0-2 contacts, the RUSB model achieves a higher AUC-PR than the LR model,
but the RUSB model uses ten more features and has a less interpretable architecture.
For U3 and U4-5 contacts, the RUSB model also achieves the highest AUC-PR.
However, one can argue that the XGB model on U3 contacts and the RF model
on U4-5 contacts perform better as they predict considerably fewer false positives.
This indicates that there is no such thing as the objectively best model performance.
The error analysis findings emphasise the importance of preceding GPC contact in
making predictions. One could even call this a dependency on this feature since
predicting GPC contact on clients that did not contact the GPC before appears to be
much harder. At the same time, it could be possible this dependency has decreased
for U4-5 contacts in recent years since the introduction of internal alarm services
by this home care organisation in 2019, in which case this may be an intervention
worth considering. Other possible reasons for decreasing model performance on U4-5
contacts in more recent years are the COVID pandemic or structural changes in
underlying mechanisms caused by other factors. In the case of structural changes,
this study should be repeated with more recent data to provide better insights
into current mechanisms. Because the error analysis did not result in any other
noteworthy differences, there is little reason to suspect disparate impact of these
models for different groups.
Finally, the main question, To what extent is it possible to predict whether a home care
client will contact a GPC per urgency bin in the next twelve months?, can be answered.
It is possible to make predictions which are roughly twice as good as random
guessing in terms of AUC-PR, and the best-performing prediction models from Table 7
indicate the extent of predictability. A crucial follow-up question lies in the sufficiency
of this extent. Model performance, specifically for U4-5 contacts, could be sufficient
reason to implement a prediction model in this context. However, this question is
left for future work.
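The comparison with random guessing can be made explicit by dividing the AUC-PR of the best selected models by the random baseline reported in Tables 6 and 7. The short calculation below uses the RUSB values; the variable names are hypothetical.

```python
# Random AUC-PR baseline per urgency bin (Table 6).
baseline = {"U0-2": 0.137, "U3": 0.166, "U4-5": 0.208}
# AUC-PR of the selected RUSB models (Table 7).
best_rusb = {"U0-2": 0.281, "U3": 0.340, "U4-5": 0.431}

# Ratio of model AUC-PR to the random baseline.
lift = {urgency: best_rusb[urgency] / baseline[urgency] for urgency in baseline}
```

For all three urgency bins the ratio lies between 2.0 and 2.1, which is the sense in which the predictions are about twice as good as random guessing.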
6.1 Scientific and Societal Relevance
This study has revealed the possibilities of early identification of home care clients
contacting the GPC. Thereby, it contributes to scientific progress because there are
no studies on this topic yet. Early identification can have a societal impact because
it allows for early intervention and, consequently, helps to reduce the pressure on
GPCs.
On top of that, this study has shown the possibilities for a combining and collaboration
approach. This is done by merging GPC data with home care data which
is already available, mergeable and would otherwise remain unseen by GPCs.
This study, thus, shows that data does not necessarily need to be newly collected
but may be readily available. The major benefit in this context is that home care
organisations collect a large amount of information about each client. This data is
hard for GPCs to collect as they do not have regular contact intervals with a patient
and, even more importantly, do not have data regarding the home care clients
that do not place a demand on GPC services.
6.2 Limitations
Although the advantages of this study are evident, some limitations should be
taken into account. First, the data originates from one geographical area within the
Netherlands. Since it is unknown to what extent the Omaha system is uniformly
applied over time and over organisations, the generalisability could be limited.
Beyond that, EHR data is collected by nurses, making the results susceptible to
human error.
Secondly, the literature describes the association between GPC contact and
sociodemographic neighbourhood characteristics, migrant status, education and
intellectual disabilities (see Section 2.1). This information is not structurally col-
lected in home care and, therefore, could not be included in the final dataset.

Thirdly, computationally cheaper implementations of missing data imputation and
feature selection were used because of time constraints. More extensive methods
might have led to different results, although the features containing missing data
are of little importance and the feature selection was still done appropriately,
only with coarser steps.
Fourthly, this study uses historical data to generate predictions. Since the data
originates from the period 2015 to 2021, it also concerns data collected during the
COVID pandemic. During this period GPCs saw differences — for example, a
decline in patients with respiratory and digestive issues (Morreel et al., 2020). This
history effect may impose problems on the generalisability in the coming years.
Lastly, one needs to consider the meaning of the dependent variable. A client
has multiple alternatives, such as ERs, internal alarm services or refraining from
seeking help, and the definition of a health problem worth contacting the GPC
for is personal. This means the dependent variable does not measure a need for
support directly, but rather a demand for support. The study’s results, hence, do
not extend toward claims about a client’s needs. Furthermore, the urgency bins
are derived from urgency estimations which can be wrong (see Section 2.3). This
imposes a construct validity problem since urgency estimations instead of the
actual urgencies are measured. Because of these limitations, careful interpretation
of the results is required.
6.3 Future Work
Having answered the research questions, this study offers conclusions that future
studies can build upon. Foremost, since the most prominent predictors relate to
preceding GPC contact and barely any studies have been performed on these factors,
they merit closer study.
This study can be reproduced on other datasets to quantify the representative-
ness of the current dataset and generalisability of the results. Future work can
also use novel methods or more data, in which case this study has set the baseline.
Besides that, future work may want to reconsider how to structure the dataset,
for example reconsidering the timeframe of twelve months, or study the reasons
behind group differences in model performance more closely.
For practical implementation, two questions require further consideration.
First, no structural prediction is currently performed in practice, so no
interventions are in place to decrease GPC contacts among home care clients. Therefore,
this study’s results need to give sufficient reason to implement predictions and
interventions on a structural basis. Inherently, this means the discussion about
model performance is broader than beating the baseline, namely moving towards
whether the predictive power and interpretability of the final model are sufficient
to build interventions upon.

Secondly, the effect of interventions on decreasing GPC demand and increasing
client health is a topic for future work. These interventions should be focused on
early support or increasing the health literacy among individuals that are predicted
positively. Even if model performance is insufficient, the insights from selected
features may give reason to study this topic more closely.
Additionally, this study reveals the possibilities of a combining and collab-
oration approach between GPCs and home care organisations. Because data is
prevalent in various institutions, there are reasons to study combining and collabo-
ration approaches between other healthcare actors — such as residential homes,
hospitals or ERs. This can be extended to studying associations through combining
datasets between three or more actors, hence also controlling for the alternatives a
client has in case they need support.
Finally, some models discussed in this study solely rely on preceding GPC contact
features. This indicates it might be possible to make such predictions on the
general population without requiring additional data.
6.4 Ethical Considerations
This study allows for ethical use but does not prevent unethical use. Firstly,
representativeness is not guaranteed by these results alone and requires replication of
this study on other datasets.
Secondly, for minimisation of harm to subjects, the interventions should be solely
focussed on increasing the level of care, instead of decreasing it for those predicted
as negative.
This study takes epistemic responsibility by giving insight into the most pre-
dictive factors and its final models allow for instance-specific explanations, for
example through implementing Local Interpretable Model-Agnostic Explanations
(see Figure 13). As long as the model is interpretable, explainable and implemented as a
supportive tool for home care nurses who can make autonomous decisions, it will
comply with this design principle (Robert, 2019).
[Figure 13: Local Interpretable Model-Agnostic Explanation of a test data prediction using the final
RF model on U4-5 contacts.
Panel "LIME explanation on test case 4004" shows these rules:
  0.32 < Average number of GPC contacts (per year) <= 0.8
  Number of U5 GPC contacts in preceding two years > 0
  Days in care > 365
  Number of U2 GPC contacts in preceding two years == 0
  Number of problem domains > 4
  Average number of home care visits (per year) <= 80.9
  72 < Age <= 81
Panel "Feature values of test case 4004":
  Feature                                             Value
  Average number of GPC contacts per year             0.488
  Number of U5 GPC contacts in preceding two years    1.0
  Number of problem domains                           7.0
  Average number of home care visits (per year)       47.832
  Age                                                 78.0
  Days in care                                        730.0
Final predicted probability of calling the GPC with U4-5 in the upcoming twelve months: 0.67]
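The rules in the LIME explanation of Figure 13 can be read as interval conditions on the feature values of test case 4004. As a small worked check, the sketch below tests each rule against the values reported in the figure. The dictionary keys are hypothetical names chosen for illustration, and the rule on U2 contacts is left out because its value is not listed in the figure's feature table.

```python
# Feature values of test case 4004, copied from Figure 13.
# The key names are hypothetical; only the numbers come from the figure.
case_4004 = {
    "avg_gpc_contacts_per_year": 0.488,
    "u5_contacts_preceding_two_years": 1.0,
    "problem_domains": 7.0,
    "avg_home_care_visits_per_year": 47.832,
    "age": 78.0,
    "days_in_care": 730.0,
}

# Interval rules shown in the LIME explanation (U2 rule omitted, see lead-in).
rules = {
    "0.32 < avg GPC contacts per year <= 0.8":
        lambda c: 0.32 < c["avg_gpc_contacts_per_year"] <= 0.8,
    "U5 contacts in preceding two years > 0":
        lambda c: c["u5_contacts_preceding_two_years"] > 0,
    "days in care > 365":
        lambda c: c["days_in_care"] > 365,
    "problem domains > 4":
        lambda c: c["problem_domains"] > 4,
    "avg home care visits per year <= 80.9":
        lambda c: c["avg_home_care_visits_per_year"] <= 80.9,
    "72 < age <= 81":
        lambda c: 72 < c["age"] <= 81,
}

# Evaluate every rule on the test case.
results = {name: rule(case_4004) for name, rule in rules.items()}
for name, holds in results.items():
    print(f"{name}: {holds}")
```

Every listed rule holds for this client, which is what one would expect: each rule describes the interval in which the client's own feature value falls.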

By naming the considerations ahead and allowing for interpretation and ex-
plainability, this study follows the ethical design principles discussed in Section
1.4. Furthermore, the results can help identify GPC contacts in advance and may,
thereby, prevent harm.
7 conclusion
This study has explored the possibilities for predicting GPC contact up to twelve
months in advance among home care clients. This study did so by combining data
from a home care organisation and a GPC and, subsequently, applying different
machine learning techniques for both feature selection and model tuning.
The most predictive factors relate to preceding GPC contact, frailty status, home
care intensity and the problem domains respiration, medication regimen and urinary
function. The results show that the most predictive factors relate to health service
usage. Interestingly, these factors have barely been studied before.
Since no comparable studies exist, this study has set the baseline for predicting
GPC contacts. Prediction models can achieve AUC-PR scores of 0.281, 0.340 and
0.431 for U0-2, U3 and U4-5 contacts, respectively. This study thus shows that early
identification is possible, primarily for U4-5 contacts, and can consequently help
prevent GPC contacts when home care nurses are able to intervene on time.
This study provides interpretable results, contributes to solving the main
challenge at GPCs and provides novel scientific insights about the topic. In doing
so, this study has revealed the advantages of a combining and collaboration
approach between GPCs and home care organisations with data that would
otherwise remain unseen.

references
Abdi, H., & Williams, L. J. (2010, 7). Principal component analysis. Wiley
Interdisciplinary Reviews: Computational Statistics, 2, 433-459. doi: 10.1002/
wics.101
Ali, A., Shamsuddin, S. M., & Ralescu, A. (2013). Classification with class imbalance
problem: A review. International Journal of Advance Soft Computing and its
Applications, 5.
Andersen, R. M. (1995, 3). Revisiting the behavioral model and access to medical
care: Does it matter? Journal of Health and Social Behavior, 36, 1-10. doi:
https://doi.org/10.2307/2137284
Arora, M., Kalyani, Y., & Shanker, S. (2021). A comparative study on in-
flated and dispersed count data. In (p. 29-38). SciTePress. doi: 10.5220/
0010547700290038
Babitsch, B., Gohl, D., & Von Lengerke, T. (2012). Re-revisiting andersen’s
behavioral model of health services use: a systematic review of studies from
1998-2011. GMS Psycho-Social-Medicine, 9. doi: 10.3205/psm000089
Batista, G. E. A. P. A., Bazzan, A. L. C., & Monard, M. C. (2003). Balancing training
data for automated annotation of keywords: a case study. WOB, 10-18.
Biau, G., & Scornet, E. (2016, 11). A random forest guided tour. TEST, 25, 197-227.
doi: https://doi.org/10.1007/s11749-016-0481-7
Bick, I., & Dowding, D. (2019, 7). Hospitalization risk factors of older cohorts of
home health care patients: A systematic review. Home Health Care Services
Quarterly, 38, 111-152. doi: 10.1080/01621424.2019.1616026
Bloemhoff, A., Schoon, Y., Smulders, K., Akkermans, R., Vloet, L. C., Van Den Berg,
K., & Berben, S. A. (2020, 8). Older persons are frailer after an emergency care
visit to the out-of-hours general practitioner cooperative in the netherlands:
A cross-sectional descriptive topics-mds study. BMC Family Practice, 21. doi:
10.1186/s12875-020-01220-y
Boone, H. N., & Boone, D. A. (2012). Analyzing likert data. Journal of Extension, 50,
1-5.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32. doi: https://
doi.org/10.1023/A:1010933404324
Buja, A., Toffanin, R., Rigon, S., Lion, C., Sandonà, P., Carraro, D., . . . Baldo, V.
(2015, 8). What determines frequent attendance at out-of-hours primary care
services? European Journal of Public Health, 25, 563-568. doi: 10.1093/eurpub/
cku235
CBS. (2022). Bevolkingspiramide [demographic pyramid]. Retrieved from
https://www.cbs.nl/nl-nl/visualisaties/dashboard-bevolking/
bevolkingspiramide
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). Smote:
Synthetic minority over-sampling technique. Journal of Artificial Intelligence
Research, 16, 321-357. doi: https://doi.org/10.1613/jair.953
Chen, T., & Guestrin, C. (2016, 8). Xgboost: A scalable tree boosting system. In
(Vol. 13-17-August-2016, p. 785-794). Association for Computing Machinery.
doi: 10.1145/2939672.2939785
Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., & Van
Calster, B. (2019, 6). A systematic review shows no performance benefit
of machine learning over logistic regression for clinical prediction models.
Journal of Clinical Epidemiology, 110, 12-22. doi: 10.1016/j.jclinepi.2019.02.004
Crawford, J., Cooper, S., Cant, R., & DeSouza, R. (2017, 9). The impact of walk-in
centres and gp co-operatives on emergency department presentations: A
systematic review of the literature. International Emergency Nursing, 34, 36-42.
doi: 10.1016/j.ienj.2017.04.002
Davis, J., & Goadrich, M. (2006). The relationship between precision-recall and roc
curves. In (p. 233-240). doi: 10.1145/1143844.1143874
Deb, P., & Trivedi, P. K. (1997). Demand for medical care by the elderly: A finite
mixture approach. Journal of Applied Econometrics, 12, 313-336. doi: 10.1002/
(SICI)1099-1255(199705)12:3%3C313::AID-JAE440%3E3.0.CO;2-G
European Union Agency for Fundamental Rights, Council of Europe, European
Court of Human Rights, & European Data Protection Supervisor. (2018, 4).
Handbook on european data protection law. doi: 10.2811/58814
Fortinsky, R. H., Madigan, E. A., Sheehan, T. J., Tullai-McGuinness, S., & Klep-
pinger, A. (2014). Risk factors for hospitalization in a national sample of
medicare home health care patients. Journal of Applied Gerontology, 33, 474-493.
doi: 10.1177/0733464812454007
Frazier, P. I. (2018, 7). A tutorial on bayesian optimization.
doi: https://doi.org/10.48550/arXiv.1807.02811
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting
machine. The Annals of Statistics, 29, 1189-1232. Retrieved from https://
www.jstor.org/stable/2699986
Gamst-Jensen, H., Frischknecht Christensen, E., Lippert, F., Folke, F., Egerod, I.,
Huibers, L., . . . Caspar Thygesen, L. (2020, 6). Self-rated worry is associated
with hospital admission in out-of-hours telephone triage - a prospective
cohort study. Scandinavian Journal of Trauma, Resuscitation and Emergency
Medicine, 28. doi: 10.1186/s13049-020-00743-8
Gerke, S., Minssen, T., & Cohen, G. (2020). Ethical and legal challenges of artificial
intelligence-driven healthcare. In Artificial intelligence in healthcare (p. 295-336).
Elsevier. doi: 10.1016/b978-0-12-818438-7.00012-5
Giesen, P., Ferwerda, R., Tijssen, R., Mokkink, H., Drijver, R., Van Den Bosch, W., &
Grol, R. (2007). Safety of telephone triage in general practitioner cooperatives:
do triage nurses correctly estimate urgency? Quality and Safety in Health Care,
16, 181-184. doi: 10.1136/qshc.2006.018846
Gijsen, R., Kommers, G., Deuning, C., & RIVM. (2019). Acute zorg, gebruik hap [acute
care, gpc use]. Retrieved from https://vzinfo.nl/acute-zorg/gebruik/hap
Guyon, I., Weston, J., & Barnhill, S. (2002). Gene selection for cancer classification
using support vector machines. Machine Learning, 46, 389-422. doi: 10.1023/
A:1012487302797

Haraldseide, L. M., Sortland, L. S., Hunskaar, S., & Morken, T. (2020, 4). Contact
characteristics and factors associated with the degree of urgency among older
people in emergency primary health care: A cross-sectional study. BMC
Health Services Research, 20. doi: 10.1186/s12913-020-05219-0
Heutmekers, M., Naaldenberg, J., Verheggen, S. A., Assendelft, W. J., Van Schrojen-
stein Lantman-De Valk, H. M., Tobi, H., & Leusink, G. L. (2017, 11). Does risk
and urgency of requested out-of-hours general practitioners care differ for
people with intellectual disabilities in residential settings compared with the
general population in the netherlands? a cross-sectional routine data-based
study. BMJ Open, 7. doi: 10.1136/bmjopen-2017-019222
Hinton, G. E. (1990). Connectionist learning procedures. In Machine learning (pp.
555–610). Elsevier.
Huang, E.-H., Hu, H.-W., Jheng, W.-L., Chen, K.-Y., Liu, C.-H., Chi, H.-Y., . . .
Wang, J.-F. (2022, 1). Feature selection for intradialytic blood pressure
value prediction using gru-based method under rfecv algorithm. In (p. 1-
4). Institute of Electrical and Electronics Engineers (IEEE). doi: 10.1109/
icot54518.2021.9680645
Huibers, L., Keizer, E., Carlsen, A. H., Moth, G., Smits, M., Senn, O., & Christensen,
M. B. (2018, 10). Help-seeking behaviour outside office hours in denmark,
the netherlands and switzerland: A questionnaire study exploring responses
to hypothetical cases. BMJ Open, 8. doi: 10.1136/bmjopen-2017-019295
Ienca, M., Ferretti, A., Hurst, S., Puhan, M., Lovis, C., & Vayena, E. (2018, 10).
Considerations for ethics review of big data health research: A scoping
review. PLoS ONE, 13. doi: 10.1371/journal.pone.0204937
InEen. (2021, 12). Benchmark huisartsenposten 2020 [benchmark gpcs 2020]. Retrieved
from https://benchmarkhap.ineen.nl/
Jansen, T., Hek, K., Schellevis, F. G., Kunst, A. E., & Verheij, R. A. (2020, 12).
Socioeconomic inequalities in out-of-hours primary care use: An electronic
health records linkage study. European Journal of Public Health, 30, 1049-1055.
doi: 10.1093/eurpub/ckaa116
Jansen, T., Hek, K., Schellevis, F. G., Kunst, A. E., & Verheij, R. A. (2021, 6).
Income-related differences in out-of-hours primary care telephone triage
using national registration data. Emergency Medicine Journal, 38, 460-466. doi:
10.1136/emermed-2020-209649
Jansen, T., Rademakers, J., Waverijn, G., Verheij, R. A., Osborne, R., & Heijmans, M.
(2018, 5). The role of health literacy in explaining the association between
educational attainment and the use of out-of-hours primary care services in
chronically ill people: A survey study. BMC Health Services Research, 18. doi:
10.1186/s12913-018-3197-4
Jansen, T., Verheij, R. A., Schellevis, F. G., & Kunst, A. E. (2019, 3). Use of out-of-
hours primary care in affluent and deprived neighbourhoods during reforms
in long-term care: An observational study from 2013 to 2016. BMJ Open, 9.
doi: 10.1136/bmjopen-2018-026426
Jansen, T., Zwaanswijk, M., Hek, K., & De Bakker, D. (2015, 5). To what extent
does sociodemographic composition of the neighbourhood explain regional
differences in demand of primary out-of-hours care: A multilevel study. BMC
Family Practice, 16. doi: 10.1186/s12875-015-0275-0
Jones, A., Costa, A. P., Pesevski, A., & McNicholas, P. D. (2018, 11). Predicting
hospital and emergency department utilization among community-dwelling
older adults: Statistical and machine learning approaches. PLoS ONE, 13.
doi: 10.1371/journal.pone.0206662
Jović, A., Brkić, K., & Bogunović, N. (2015). A review of feature selection methods
with applications. In (p. 1200-1205). doi: 10.1109/MIPRO.2015.7160458
Keizer, E., Bakker, P., Giesen, P., Wensing, M., Atsma, F., Smits, M., & Van Den
Muijsenbergh, M. (2017, 11). Migrants’ motives and expectations for contact-
ing out-of-hours primary care: A survey study. BMC Family Practice, 18. doi:
10.1186/s12875-017-0664-7
Keizer, E., Maassen, I., Smits, M., Wensing, M., & Giesen, P. (2016, 7). Reducing the
use of out-of-hours primary care services: A survey among dutch general
practitioners. European Journal of General Practice, 22, 189-195. doi: 10.1080/
13814788.2016.1178718
Keizer, E., Senn, O., Christensen, M. B., & Huibers, L. (2021, 12). Use of acute
care services by adults with a migrant background: a secondary analysis of a
euroohnet survey. BMC Family Practice, 22. doi: 10.1186/s12875-021-01460-6
Keymolen, E., & Taylor, L. (2021). Data ethics and data science: an uneasy marriage?
Kingma, D. P., & Ba, J. (2014, 12). Adam: A method for stochastic optimization.
Retrieved from http://arxiv.org/abs/1412.6980
Koster, N., & Harmsen, J. (2016). Het omaha system: Gids voor gebruik [the omaha
system: User guide] (Vol. 1). Perquery.
Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual infor-
mation. Physical Review E - Statistical Physics, Plasmas, Fluids, and Related
Interdisciplinary Topics, 69, 16. doi: 10.1103/PhysRevE.69.066138
Lambert, D. (1992, 2). Zero-inflated poisson regression, with an application to
defects in manufacturing. Technometrics, 34, 1-14. doi: 10.2307/1269547
Little, T. D., Jorgensen, T. D., Lang, K. M., & Moore, E. W. G. (2014, 3). On
the joys of missing data. Journal of Pediatric Psychology, 39, 151-162. doi:
10.1093/jpepsy/jst048
Lo, Y., Lynch, S. F., Urbanowicz, R. J., Olson, R. S., Ritter, A. Z., Whitehouse, C. R.,
. . . Bowles, K. H. (2019, 8). Using machine learning on home health care
assessments to predict fall risk. Studies in Health Technology and Informatics,
264, 684-688. doi: 10.3233/SHTI190310
Loeys, T., Moerkerke, B., de Smet, O., & Buysse, A. (2012, 2). The analysis of
zero-inflated count data: Beyond zero-inflated poisson regression. British
Journal of Mathematical and Statistical Psychology, 65, 163-180. doi: 10.1111/
j.2044-8317.2011.02031.x
Louppe, G. (2014). Understanding random forests. Retrieved from https://
arxiv.org/pdf/1407.7502.pdf
Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier nonlinearities improve
neural network acoustic models.

Masconi, K. L., Matsha, T. E., Echouffo-Tcheugui, J. B., Erasmus, R. T., & Kengne,
A. P. (2015, 3). Reporting and handling of missing data in predictive research
for prevalent undiagnosed type 2 diabetes mellitus: A systematic review.
EPMA Journal, 6. doi: 10.1186/s13167-015-0028-0
Melillo, P., Orrico, A., Chirico, F., Pecchia, L., Rossi, S., Testa, F., & Simonelli, F.
(2017, 3). Identifying fallers among ophthalmic patients using classification
tree methodology. PLoS ONE, 12. doi: 10.1371/journal.pone.0174083
Misra, P., & Yadav, A. S. (2020). Improving the classification accuracy using
recursive feature elimination with cross-validation. International Journal on
Emerging Technologies, 11, 659-665.
Mittelstadt, B. D., & Floridi, L. (2016, 4). The ethics of big data: Current and
foreseeable issues in biomedical contexts. Science and Engineering Ethics, 22,
303-341. doi: 10.1007/s11948-015-9652-2
Montalto, M., Dunt, D. R., Day, S. E., & Kelaher, M. A. (2010, 5). Testing the
safety of after-hours telephone triage: Patient simulations with validated
scenarios. Australasian Emergency Nursing Journal, 13, 7-16. doi: 10.1016/
j.aenj.2009.11.003
Morreel, S., Philips, H., & Verhoeven, V. (2020, 8). Organisation and characteris-
tics of out-of-hours primary care during a covid-19 outbreak: A real-time
observational study. PLoS ONE, 15. doi: 10.1371/journal.pone.0237629
Moth, G., Christensen, M. B., Christensen, H. C., Carlsen, A. H., Riddervold,
I. S., & Huibers, L. (2020). Age-related differences in motives for contacting
out-of-hours primary care: a cross-sectional questionnaire study in denmark.
Scandinavian Journal of Primary Health Care, 38, 272-280. doi: 10.1080/02813432
.2020.1794160
Mujumdar, A., & Vaidehi, V. (2019). Diabetes prediction using machine learning
algorithms. Procedia Computer Science, 165, 292-299. doi: 10.1016/j.procs.2020
.01.047
Nijmeijer, D. (2021). Preserving quality of life: Predicting hospitalization among
community care clients. Retrieved from http://arno.uvt.nl/show.cgi?fid=
157842
Nørøxe, K. B., Huibers, L., Moth, G., & Vedsted, P. (2017, 3). Medical appropriate-
ness of adult calls to danish out-of-hours primary care: A questionnaire-based
survey. BMC Family Practice, 18. doi: 10.1186/s12875-017-0617-1
Parslow, R., Jorm, A., Christensen, H., & Jacomb, P. (2002). Factors associated
with young adults’ obtaining general practitioner services. Australian Health
Review, 25, 109-118. doi: 10.1071/AH020109a
Post, H. (2021). Reducing waiting times at out-of-hours general practitioner departments:
A data-driven simulation modelling and optimization study. Retrieved from
http://repository.tudelft.nl/
Ramerman, L., Rijpkema, C., Verheij, R., & Nederlands instituut voor onderzoek
van de gezondheidszorg. (2020). Zorg op de huisartsenpost : Nivel zorgregistraties
eerste lijn: jaarcijfers 2019 en trendcijfers 2015-2019 [care on the gpc: Nivel primary
care registration: annual numbers 2019 and trends 2015-2019]. NIVEL Nederlands
Inst. voor Onderzoek van de Gezondheidszorg.

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning.
MIT Press. Retrieved from http://gaussianprocess.org/gpml/
Remeseiro, B., & Bolon-Canedo, V. (2019, 9). A review of feature selection
methods in medical applications. Computers in Biology and Medicine, 112. doi:
10.1016/j.compbiomed.2019.103375
Robert, N. (2019, 9). How artificial intelligence is changing nursing. Nursing
Management, 50, 30-39. doi: 10.1097/01.NUMA.0000578988.56622.21
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581-592. doi:
10.2307/2335739
Saito, T., & Rehmsmeier, M. (2015, 3). The precision-recall plot is more informative
than the roc plot when evaluating binary classifiers on imbalanced datasets.
PLoS ONE, 10. doi: 10.1371/journal.pone.0118432
Scapinello, M. P., Posocco, A., De Ronch, I., Castrogiovanni, F., Lollo, G., Sergi,
G., . . . Veronese, N. (2016, 9). Predictors of emergency department referral
in patients using out-of-hours primary care services. Health Policy, 120,
1001-1007. doi: 10.1016/j.healthpol.2016.07.018
Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. (2010, 1). Rusboost:
A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems,
Man, and Cybernetics, 40, 185-197. doi: 10.1109/TSMCA.2009.2029559
Smits, M., Peters, Y., Broers, S., Keizer, E., Wensing, M., & Giesen, P. (2015, 5).
Association between general practice characteristics and use of out-of-hours
gp cooperatives. BMC Family Practice, 16. doi: 10.1186/s12875-015-0266-1
Smits, M., Rutten, M., Keizer, E., Wensing, M., Westert, G., & Giesen, P. (2017,
5). The development and performance of after-hours primary care in the
netherlands: A narrative review. Annals of Internal Medicine, 166, 737-742. doi:
10.7326/M16-2776
Song, J., Woo, K., Shang, J., Ojo, M., & Topaz, M. (2021, 8). Predictive risk models
for wound infection-related hospitalization or ed visits in home health care
using machine-learning algorithms. Advances in Skin and Wound Care, 34,
1-12. doi: 10.1097/01.ASW.0000755928.30524.22
Srivastava, N., Hinton, G., Krizhevsky, A., & Salakhutdinov, R. (2014). Dropout: A
simple way to prevent neural networks from overfitting. Journal of Machine
Learning Research, 15, 1929-1958.
Teotia, R., Freeman, S., & Jackson, P. (2021, 12). Predicting home care use
after assessment using multiple machine learning methods. In (p. 662-
666). Institute of Electrical and Electronics Engineers (IEEE). doi: 10.1109/
iemcon53756.2021.9623216
Thabtah, F., Hammoud, S., Kamalov, F., & Gonsalves, A. (2020, 3). Data imbalance
in classification: Experimental evaluation. Information Sciences, 513, 429-441.
doi: 10.1016/j.ins.2019.11.004
Tinker, A. (2002). The social implications of an ageing population. Mechanisms of
Ageing and Development, 123, 729-735. doi: 10.1016/S0047-6374(01)00418-3
Van Den Broek, T., Dykstra, P. A., & Van Der Veen, R. J. (2019, 1). Adult children
stepping in? long-term care reforms and trends in children’s provision of
household support to impaired parents in the netherlands. Ageing and Society,
39, 112-137. doi: 10.1017/S0144686X17000836
Vedsted, P., Fink, P., Sørensen, H. T., & Olesen, F. (2004, 8). Physical, mental
and social factors associated with frequent attendance in danish general
practice. a population-based cross-sectional study. Social Science and Medicine,
59, 813-823. doi: 10.1016/j.socscimed.2003.11.027
Vektis. (2020, 3). Feiten en cijfers over wijkverpleging [facts and numbers on
home care nursing]. Retrieved from https://www.vektis.nl/intelligence/
publicaties/factsheet-wijkverpleging
Veyron, J. H., Friocourt, P., Jeanjean, O., Luquel, L., Bonifas, N., Denis, F., &
Belmin, J. (2019, 8). Home care aides’ observations and machine learning
algorithms for the prediction of visits to emergency departments by older
community-dwelling individuals receiving home care assistance: A proof of
concept study. PLoS ONE, 14. doi: 10.1371/journal.pone.0220002
Wang, C., Xiao, Z., & Wu, J. (2019, 9). Functional connectivity-based classification
of autism and control using svm-rfecv on rs-fmri data. Physica Medica, 65,
99-105. doi: 10.1016/j.ejmp.2019.08.010
Witt, U. F., Nibe, S. M., Ole, H., & Lebech, C. S. (2022, 4). A novel approach for
predicting acute hospitalizations among elderly recipients of home care? a
model development study. International Journal of Medical Informatics, 160.
doi: 10.1016/j.ijmedinf.2022.104715
Wouters, L. T., Zwart, D. L., Erkelens, D. C., Cheung, N. S., De Groot, E., Damoi-
seaux, R. A., . . . Rutten, F. H. (2020). Chest discomfort at night and risk of
acute coronary syndrome: Cross-sectional study of telephone conversations.
Family Practice, 37, 473-478. doi: 10.1093/FAMPRA/CMAA005
Yang, L., & Shami, A. (2020, 7). On hyperparameter optimization of machine
learning algorithms: Theory and practice. Neurocomputing, 415, 295-316. doi:
10.1016/j.neucom.2020.07.061
Zeng, M., Zou, B., Wei, F., Liu, X., & Wang, L. (2016, 9). Effective prediction of
three common diseases by combining smote with tomek links technique for
imbalanced medical data. In (p. 225-228). IEEE. doi: 10.1109/ICOACS.2016
.7563084
Zwaanswijk, M., Nielen, M. M., Hek, K., & Verheij, R. A. (2015). Factors associated
with variation in urgency of primary out-of-hours contacts in the netherlands:
a cross-sectional study. BMJ Open, 5. doi: 10.1136/bmjopen-2015-008421

Appendices
a experiments with zero-inflated count models
This section is provided for interested readers to elaborate on the ineffectiveness of
zero-inflated count models for this task and accordingly the rationale to transform
the task into binary classification. Because this is the sole purpose of the section,
the granularity provided is significantly coarser compared to the methodology and
experimental procedure of the classification task.
Zero-Inflated Count Models. Count problems are often modelled using the
Poisson distribution, which is well suited for modelling discrete and non-negative
count data. However, it has a significant drawback: as first described by Lambert (1992),
some real-world problems show an inflation of zeros in the distribution. Subsequently,
this concept was introduced in healthcare problems (Deb & Trivedi, 1997).
Zero-inflation occurs in tasks such as manufacturing defect prediction, but also in
the number of GPC contacts per twelve months. It means that most observations
have a count of zero. The standard Poisson distribution cannot accommodate this
excess of zeros. Therefore, the literature advises the use of zero-inflated Poisson (ZIP)
models for such data (Loeys et al., 2012).
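To illustrate why a plain Poisson distribution underestimates the number of zeros, the sketch below compares the Poisson probability mass function with the standard zero-inflated Poisson formulation, in which an observation is a structural zero with probability pi and otherwise follows a Poisson distribution with rate lam. The parameter values are hypothetical and were not estimated from the thesis data.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson distribution with rate lam."""
    return lam ** k * exp(-lam) / factorial(k)


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson.

    With probability pi the outcome is a structural zero;
    with probability 1 - pi it is drawn from Poisson(lam).
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


# Hypothetical parameters, chosen only for illustration.
lam, pi = 0.5, 0.6

p0_poisson = poisson_pmf(0, lam)
p0_zip = zip_pmf(0, lam, pi)
total = sum(zip_pmf(k, lam, pi) for k in range(50))  # should be close to 1

print(f"P(Y=0) Poisson: {p0_poisson:.3f}")
print(f"P(Y=0) ZIP:     {p0_zip:.3f}")
print(f"Sum of ZIP pmf over k=0..49: {total:.6f}")
```

With the same rate, the ZIP model assigns more mass to zero than the Poisson model, which is the property that makes it a candidate for data in which most clients have no contact at all.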
Experimental Pipeline. Feature selection is performed using an estimate of the
Mutual Information, which indicates how much information a feature reveals
about the dependent variable. This estimator was proposed by Kraskov et
al. (2004). As it is based on information theory, it is among the most informative
of filter methods (Jović et al., 2015). For each model, the 20 highest-scoring features
were included.
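As a rough illustration of filter-based selection by Mutual Information, the sketch below ranks features by a plug-in MI estimate for discrete variables and keeps the top k. This is a simpler stand-in, not the nearest-neighbour estimator of Kraskov et al. (2004) used in the analysis, and the feature names and toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) )
        mi += (c / n) * log(c * n / (count_x[a] * count_y[b]))
    return mi


def top_k_features(features, target, k):
    """Return the names of the k features with the highest MI with the target."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: the target and three candidate features.
target = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],               # independent of the target
    "partly_informative": [0, 0, 1, 1, 0, 1, 1, 0],  # agrees in 6 of 8 cases
    "prior_contact": [0, 0, 1, 1, 0, 0, 1, 1],       # identical to the target
}

selected = top_k_features(features, target, k=2)
print(selected)
```

Because filter methods score each feature separately, they are cheap to run but do not account for redundancy between the selected features.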
Conventional machine learning approaches, such as eXtreme Gradient Boosting
(XGB) and Random Forest (RF), have been shown to do well on zero-inflated count
tasks (Arora et al., 2021), which is why they are used in this analysis. Optimisation
was done using the Bayesian Optimization procedure described in the thesis (see
thesis, Section 4.4).
For evaluation, the R-squared (R2) and mean squared error (MSE) are reported, as they
quantify the difference between the predicted and true values of the
dependent variable.
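For reference, the two metrics can be computed as in the sketch below. It also shows a useful anchor for interpreting R-squared: a model that always predicts the mean of the observed counts scores exactly zero, and a model that always predicts zero scores below zero. The counts are hypothetical toy data with many zeros and a long right tail.

```python
def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical zero-inflated, right-skewed contact counts.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 2, 7]

mean_value = sum(y_true) / len(y_true)
always_mean = [mean_value] * len(y_true)  # constant prediction at the mean
always_zero = [0] * len(y_true)           # constant prediction of "no contact"

r2_mean = r_squared(y_true, always_mean)
r2_zero = r_squared(y_true, always_zero)
mse_mean = mse(y_true, always_mean)

print(f"always mean: MSE={mse_mean:.2f}, R2={r2_mean:.3f}")
print(f"always zero: MSE={mse(y_true, always_zero):.2f}, R2={r2_zero:.3f}")
```

An R-squared close to zero thus indicates that a model explains little more than a constant prediction would.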
Experimental Results. Table A.1 shows a brief overview of the achieved
model performances.

Table A.1: Regression model results
Urgency              Model   MSE     R2
U0-2 (very urgent)   ZIP     0.326   0.144
                     RF      0.342   0.103
                     XGB     0.329   0.137
U3 (urgent)          ZIP     0.379   0.124
                     RF      0.381   0.119
                     XGB     0.375   0.132
U4-5 (non-urgent)    ZIP     1.044   0.204
                     RF      0.998   0.209
                     XGB     1.037   0.239
As Table A.1 shows, the R2 values of the regression models are very low (at most 0.239),
which precludes meaningful conclusions. More importantly, the models are unable to
accurately predict contact counts larger than one, probably because of the excess of
zeros in combination with a severe right skewness of the distributions.
[Figure A.1: Visualisations of model error. Panels: (a) ZIP U0-2, (b) RF U0-2, (c) XGB U0-2,
(d) ZIP U3, (e) RF U3, (f) XGB U3, (g) ZIP U4-5, (h) RF U4-5, (i) XGB U4-5. Each panel
contains two plots: "True vs predicted values" (x-axis: true number of contacts; y-axis:
predicted number of contacts) and "Mean squared error per count" (y-axis: mean squared
error per count).]
The figures above show that the models fall short in predicting counts larger
than one. Since combinations with more than one contact occur only sporadically in
the dataset, the expected practical and scientific advantages of predicting the
GPC contact count are not deemed to outweigh the unsatisfactory model performance.
