The study focused on developing and validating a multi-state model to predict multimorbidity of cardiovascular disease, type 2 diabetes, and chronic kidney disease. The presentation walks through the complete process: acquiring, filtering, and splitting the data, developing the prediction model on the training data, validating the fitted model on the testing data, and comparing its accuracy.
The following GitHub repository contains the R scripts required to complete this investigation.
https://github.com/jmanali1996/Multimorbidity-Multistate-Model.git
Multimorbidity Multistate Model
1. Research aim
Develop a multi-state model to predict multimorbidity of
Cardiovascular disease (CVD),
Type 2 diabetes (T2D), and
Chronic kidney disease (CKD)
By:
Manali Ajay Jain
MSc Health Data Science (2022-23)
2. Many thanks to Dr Glen Martin for guiding me throughout the research
3. Key elements to understand
01 Prediction models
Inform people about the prognosis of their ailment
02 Multistate models
A stochastic process with multiple discrete states, one of which is occupied at any given time
03 Multimorbidity
The co-existence of two or more chronic illnesses, where one is not necessarily more significant than the others
04 Risk predictions of multimorbidity
A newly emerging area of research
4. ! The missing pieces !
Multimorbidity studies were mainly performed on groups of individuals
The validation sample sizes for risk predictions of multimorbidity tend to be small
5. An attempt to fill the gap
This multistate model analysed every single patient
The sample size for validation was relatively large
The validation approach was to generate stacked transition probability graphs
6. Data Source
Clinical Practice Research Datalink (CPRD)
*Advisory note:
Access to the CPRD dataset was gained only after the
completion of the CPRD resource module training test
The UoM also provided a Data Protection and Cyber
Security course
7. Foundation of the dataset
Data was obtained from the Clinical Practice Research Datalink (CPRD)
The full dataset comprised 2,714,535 observations with 91 variables
Only the healthy population and the required variables were kept to start the analysis
This reduced the data to 2,001,735 observations and 74 variables, of which 30 were various comorbidities and 8 were demographic variables
This data was then split in a 70:30 ratio for training and testing purposes respectively
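A random 70:30 split of this kind can be sketched as follows. This is only an illustration in Python (the study itself used R); the function name `split_70_30`, the seed, and the toy identifiers are assumptions, not taken from the original scripts.

```python
import random


def split_70_30(ids, seed=2023):
    """Randomly split record identifiers into 70% training and 30% testing."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)   # fixed seed only for reproducibility
    cut = len(shuffled) * 7 // 10           # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Toy example with 1,000 hypothetical patient ids
train_ids, test_ids = split_70_30(range(1000))
print(len(train_ids), len(test_ids))
```

Splitting on patient identifiers rather than on rows keeps every record of a patient on the same side of the split.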
8. Data at a glance
Demographic variables
Age, BMI, cholesterol ratio, and SBP were continuous variables
Gender: female, male
Ethnicity: White, Mixed ethnic group, Asian/Asian British, Black/African/Caribbean/Black British, Other ethnic group
Smoking: never, ex, current
Index of Multiple Deprivation: 5 levels, from most to least deprived
11. Technical Requirements
XQuartz 2.8.5
R 4.1.3-gcc830
RStudio 1.3.1073
R libraries
dplyr (Hadley Wickham et al. 2014)
ggplot (Hadley Wickham et al. 2007)
mstate (de Wreede et al. 2011)
calibmsm (Alexander Pate et al. unreleased)
12. Before we proceed
No standardization was needed
No normalization was needed
Missing values were handled by single stochastic imputation
Splitting was done randomly, without any criteria
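The slides do not say which imputation model was used. As a minimal sketch, assuming stochastic regression imputation with a single predictor, a missing value is replaced by its regression prediction plus a random draw from the residual spread. The variable names and the toy age/SBP values below are hypothetical.

```python
import random
import statistics


def stochastic_impute(x, y, seed=1):
    """Fill None entries of y with a linear prediction from x plus random noise."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_x = statistics.fmean(xi for xi, _ in observed)
    mean_y = statistics.fmean(yi for _, yi in observed)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed) / sxx
    intercept = mean_y - slope * mean_x
    # Sample standard deviation of the residuals on the complete cases
    residuals = [yi - (intercept + slope * xi) for xi, yi in observed]
    resid_sd = statistics.stdev(residuals)
    return [
        yi if yi is not None else intercept + slope * xi + rng.gauss(0, resid_sd)
        for xi, yi in zip(x, y)
    ]


# Hypothetical data: age predicts systolic blood pressure (SBP), one SBP missing
age = [30, 40, 50, 60, 70, 45]
sbp = [115.0, 121.0, 128.0, 137.0, 141.0, None]
sbp_completed = stochastic_impute(age, sbp)
```

The added noise is what distinguishes this from plain regression imputation, which would place every imputed value exactly on the fitted line.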
16. Let’s begin!
• Training dataset was used for model construction
• Population distribution was studied across transitions
• Dataset was converted into a dataset of class msdata (long format) with the columns:
  • id
  • from
  • to
  • trans
  • Tstart
  • Tstop
  • time
  • status
  • covariates (the 8 demographic variables)
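To make the long format concrete, the sketch below builds these columns for one patient: every time a patient occupies a state, one row is written for each transition that could leave that state, with `status` marking the transition actually taken. It is a Python illustration only (the study used the mstate package in R), and the three-state structure and the helper `to_long` are hypothetical simplifications, not the model used in the study.

```python
# Hypothetical 3-state structure: 1 = healthy, 2 = disease, 3 = death
TRANSITIONS = {1: (1, 2), 2: (1, 3), 3: (2, 3)}  # trans id -> (from, to)


def to_long(patient_id, path, end_time):
    """Convert one patient's state history into long-format rows.

    path: chronological list of (state, entry_time), starting with (1, 0).
    end_time: end of follow-up (time of censoring or of the last event).
    """
    rows = []
    for i, (state, t_in) in enumerate(path):
        if i + 1 < len(path):
            next_state, t_out = path[i + 1]
        else:
            next_state, t_out = None, end_time   # no further move observed
        for trans, (frm, to) in TRANSITIONS.items():
            if frm != state:
                continue                         # transition not at risk here
            rows.append({
                "id": patient_id, "from": frm, "to": to, "trans": trans,
                "Tstart": t_in, "Tstop": t_out, "time": t_out - t_in,
                "status": int(to == next_state),
            })
    return rows


# Patient 1 falls ill on day 400 and dies on day 1000; patient 2 is censored on day 730
rows_p1 = to_long(1, [(1, 0), (2, 400), (3, 1000)], end_time=1000)
rows_p2 = to_long(2, [(1, 0)], end_time=730)
```

A censored patient still contributes rows, all with `status` 0, which is how the model learns from people who never leave the healthy state.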
20. Let’s begin!
• Training dataset was used for model construction
• Population distribution was studied across transitions
• Dataset was converted into a dataset of class msdata (long format)
• Covariates were expanded: dummy variables were generated based on the number of transitions
• Time was converted from days to years
• A Cox model was fitted with separate baseline hazards for each of the transitions and no covariates
• Transition hazard estimates and their associated covariances were calculated for each stage
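The covariate expansion and the time conversion can be illustrated on a single long-format row. This is a Python sketch under stated assumptions: the helper `expand_and_rescale`, the `name.k` column naming, and the divisor of 365.25 days per year are illustrative choices, since the slides do not state what was used.

```python
DAYS_PER_YEAR = 365.25  # assumption: the divisor is not given in the slides


def expand_and_rescale(row, covariates, n_trans):
    """Return a copy of a long-format row with times in years and
    transition-specific covariates (value on the row's own transition, else 0)."""
    out = dict(row)
    for key in ("Tstart", "Tstop", "time"):
        out[key] = row[key] / DAYS_PER_YEAR
    for name in covariates:
        for k in range(1, n_trans + 1):
            out[f"{name}.{k}"] = row[name] if row["trans"] == k else 0
    return out


# Hypothetical row for transition 1 of a 54-year-old patient
row = {"id": 1, "from": 1, "to": 2, "trans": 1,
       "Tstart": 0, "Tstop": 730.5, "time": 730.5, "status": 1, "age": 54}
expanded = expand_and_rescale(row, covariates=["age"], n_trans=3)
print(expanded["Tstop"], expanded["age.1"], expanded["age.2"])
```

Giving each transition its own copy of a covariate lets the model estimate a separate effect of that covariate on each transition.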
33. Diving deeper
Transition probability estimates were also produced
These findings demonstrate how a patient's prognosis is influenced both by their initial condition and by the time point used as the starting point for prediction
The reduced rank (RR) model was helpful for creating a lower-dimensional representation of the regression coefficients of the whole model
The RR model generates three items
• Alpha
• Gamma
• Beta
34. Alpha output of RR model
Mixed ethnic group
Ex-smokers
More deprived
35. Gamma output of RR model
• Most of the values are negative
• The Gamma coefficients for this risk score are negative and of substantial size for all transitions into death, and likewise for transitions starting from healthy
36. Beta output of RR model
• Combines the analysis of alpha and gamma
• Lower values of Alpha (for instance mixed ethnicity, ex-smoker, more deprived) correspond to higher death rates
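Why a lower Alpha goes with a higher death rate can be seen from a small numeric sketch. It assumes the usual rank-one factorisation, in which each Beta coefficient is the product of a covariate's Alpha score and a transition's Gamma loading. All numbers and labels below are hypothetical and are not results from the study.

```python
def rank_one_beta(alpha, gamma):
    """Beta[covariate][transition] = alpha[covariate] * gamma[transition]."""
    return {
        cov: {trans: a * g for trans, g in gamma.items()}
        for cov, a in alpha.items()
    }


# Hypothetical values: a negative Alpha score and negative Gamma loadings
alpha = {"ex_smoker": -0.5, "reference": 0.0}
gamma = {"healthy->death": -1.2, "healthy->CVD": -0.4}

beta = rank_one_beta(alpha, gamma)
print(beta["ex_smoker"]["healthy->death"])
```

A negative Alpha multiplied by a negative Gamma gives a positive Beta, that is, a raised hazard for that transition, and the larger the Gamma loading in size, the stronger the effect.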
38. It’s TESTING TIME!
• The fitted model is now applied to the test dataset
• Transition probabilities were evaluated at two follow-up durations
  • 5 years
  • 10 years
• They could then be compared with the stacked transition probabilities plot of the training dataset
39. Testing time
Stacked transition probabilities after 5 years
The distance between two adjacent curves represents the probability of being in the corresponding state
40. Testing time
Stacked transition probabilities after 10 years
The distance between two adjacent curves represents the probability of being in the corresponding state
For the most part, transitions resulting in death become fewer over the years
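How a stacked plot encodes state probabilities can be shown with a short sketch: each curve is a running total of the state probabilities, so the gap between neighbouring curves recovers the probability of one state. The state order and the numbers are hypothetical, not values read from the plots.

```python
from itertools import accumulate


def stack(probabilities):
    """Upper boundary curve of each state when state probabilities are stacked."""
    return list(accumulate(probabilities))


# Hypothetical state probabilities at one time point:
# healthy, CVD, T2D, CKD, death
probs = [0.70, 0.12, 0.08, 0.04, 0.06]

curves = stack(probs)
# Distance between adjacent curves gives back the individual probabilities
gaps = [curves[0]] + [upper - lower for lower, upper in zip(curves, curves[1:])]
print(curves)
print(gaps)
```

Because the state probabilities sum to one at every time point, the topmost curve always sits at 1.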
41. Was it worth it??
01 Calibration plots could not be produced, but future prediction probabilities were obtained
02 The work resulted in a meaningful analysis which can potentially be beneficial for the health sector