Ron Kremer, Syed Mohib Raza, Paolo Missier et al.
Newcastle University, UK
School of Computing and Faculty of Medical Sciences
IEEE Big Data
Big Data Analytics for Health and Medicine (BDA4HM 2022)
Dec. 2022, Osaka, Japan
Tracking trajectories of multiple long-term
conditions using dynamic patient-cluster associations
IEEE BigData 2022
Motivation
Multiple Long-Term Conditions, defined as [1,2]:
• Four or more long-term (chronic) conditions
A long term condition (LTC) is a condition that cannot, at present, be cured
but is controlled by medication and/or other treatment/therapies (*)
(*) NHS and UK Dept. of Health, Long Term Conditions Compendium of Information Third Edition,
https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/216528/dh_134486.pdf
[1] M. C. Johnston, M. Crilly, C. Black, G. J. Prescott, and S. W. Mercer, “Defining and measuring multimorbidity: a systematic review of systematic reviews,”
European journal of public health, vol. 29, no. 1, pp. 182–189, 2019.
[2] B. P. Nunes, T. R. Flores, G. I. Mielke, E. Thumé, and L. A. Facchini, “Multimorbidity and mortality in older adults: A systematic review and meta-analysis.”
Archives of gerontology and geriatrics, vol. 67, pp. 130–138, Dec. 2016, place: Netherlands.
Significant research investment by NIHR, the core
translational medicine funder in the UK
The number of people with multiple LTCs in the UK was projected to
rise from 1.9 million in 2008 to 2.9 million in 2018.
Can we model the likelihood of next disease?
Experimental models exist for:
- Modelling disease progression [3]
- Discovering clinical pathway patterns [4]
- Predicting next disease(s) [5]
However, these models are not very robust and are not actually deployed in practice
[3] Wang, Xiang, David Sontag, and Fei Wang. ‘Unsupervised Learning of Disease Progression Models’. In Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 85–94. KDD ’14. New York, NY, USA, 2014 https://doi.org/10.1145/2623330.2623754.
[4] Huang, Zhengxing, Wei Dong, Lei Ji, Chenxi Gan, Xudong Lu, and Huilong Duan. ‘Discovery of Clinical Pathway Patterns from Event Logs Using Probabilistic Topic
Models’. Journal of Biomedical Informatics 47 (1 February 2014): 39–57. https://doi.org/10.1016/j.jbi.2013.09.003.
[5] Men, Lu, Noyan Ilk, Xinlin Tang, and Yuan Liu. ‘Multi-Disease Prediction Using LSTM Recurrent Neural Networks’. Expert Systems with Applications 177 (1
September 2021): 114905. https://doi.org/10.1016/j.eswa.2021.114905.
Our goal:
use Electronic Health Records (diagnosis event logs) to predict patients’
long-term associations with a specific disease cluster
Research hypothesis
It is possible to identify clusters of diseases that:
1. Are described using disease terms that are familiar to health domain experts
2. Are clinically significant, based on expert validation
3. Admit a quantitative association of individual patients with each of the clusters
What we hope to find:
1. A significant majority of patients are stable relative to the clustering
2. Stability emerges early in their medical history (*)
(*) limited to LTCs
Overview of the approach
1. Generate clusters of conditions based on disease co-occurrence in patients’ timelines
2. Associate patients with each of the clusters, at each stage of their disease progression
Stable patients:
- Exhibit increasingly strong associations with a specific cluster over time
- We can more easily predict they will be affected by diseases within that cluster
- Can we quantify such likelihood?
Unstable patients:
- Weakly associated with any given cluster
- Their association changes cluster over time
- Can we identify the causes for instability? Are these unanticipated traumatic events?
Contributions
• We use Topic Modelling as a form of semantic clustering
• Topics are defined by ranked lists of disease terms
• We define a cluster’s gravitational pull: patients are differently attracted by each
cluster at different points in time
• We propose a quantitative measure of stability with respect to clusters over time
• We study how stability increases as timelines progress
Dataset: UK Biobank Linked Electronic Health Records
- Irregularly spaced
- Healthy → fewer records!
- 143,000 MLTC individuals
hypertension → unspecified_rare_diabetes → type_2_diabetes → cerebrovascular_dz → asthma → diab_neuro →
obesity → CKD → ESRD → NAFLD_NASH → cholelithiasis → spinal_stenosis
allergic_rhinitis → hypertension → enthesopathy → thyroid
UK Biobank:
500,000 volunteer participants,
aged between 40 and 69 and living
in the UK
Raw UK Biobank datasets (figure)
Datasets: baseline assessment, GP events, prescriptions, HESIN diagnoses, hospital events (used to determine admission/re-admission patterns), operations
Identifier: eid
Counts shown in the figure: N = 240,000; N = 500,000; 57,698,505; 123,644,445
MLTC-M cohort: 143,000, with up to 20 years of records
Detailed approach
Patients’ timelines (143,000) → bag-of-words representation
1. Topic Modelling
- Optimal number of topics
- Correcting term rankings within topics
2. Dynamic Patient-cluster Association
- Association vectors
3. Patient stability analysis
- Per-stage association changes
Example bag of words:
hypertension, unspecified_rare_diabetes, type_2_diabetes, cerebrovascular_dz, asthma, diab_neuro, obesity, CKD, ESRD, NAFLD_NASH, cholelithiasis, spinal_stenosis, allergic_rhinitis, hypertension, enthesopathy, thyroid
Implementation: LDA and Gensim
https://radimrehurek.com/gensim/models/ldamodel.html
https://pyldavis.readthedocs.io/en/latest/readme.html
Řehůřek, Radim, and Petr Sojka. ‘Software Framework for Topic Modelling with Large Corpora’. In Proceedings of the
LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. Valletta, Malta: ELRA, 2010.
LDA: 4-clusters configuration
Patients association vectors
Patient: disease terms sequence T = [t1, t2, …, tn]
“Raw” relevance of term t in cluster Cj:
Relative relevance of term t in cluster Cj:
Idf-weighted relative relevance of term t in cluster Cj:
Inverse document frequency of term t in corpus D:
Association of terms set T with cluster Cj
Association vector for terms set T
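The formulas for these quantities did not survive in the slide text, so the sketch below is only one plausible reading of the labels, under stated assumptions: raw relevance is taken to be a per-cluster weight for each term, relative relevance normalises that weight across clusters, idf is log10(|D| / df(t)), and the association of a term set with a cluster is the sum of the idf-weighted relative relevances of its terms. The toy `corpus` and `topic_term` values are hypothetical.

```python
import math

def idf(term, corpus):
    """Inverse document frequency of a term (assumed form: log10(|D| / df))."""
    df = sum(1 for doc in corpus if term in doc)  # assumes the term occurs in corpus
    return math.log10(len(corpus) / df)

def relative_relevance(term, topic_term):
    """Raw relevance of a term in each cluster, normalised across clusters."""
    raw = [cluster.get(term, 0.0) for cluster in topic_term]
    total = sum(raw)
    return [r / total if total else 0.0 for r in raw]

def association_vector(terms, topic_term, corpus):
    """Sum of idf-weighted relative relevances, one component per cluster."""
    vec = [0.0] * len(topic_term)
    for t in terms:
        weight = idf(t, corpus)
        for j, rel in enumerate(relative_relevance(t, topic_term)):
            vec[j] += weight * rel
    return vec

# Hypothetical corpus of four patients and two clusters with raw term relevances.
corpus = [
    {"hypertension", "asthma"},
    {"hypertension", "OA"},
    {"hypertension"},
    {"asthma", "OA"},
]
topic_term = [
    {"hypertension": 0.1, "asthma": 0.6, "OA": 0.3},
    {"hypertension": 0.3, "asthma": 0.0, "OA": 0.1},
]

vec = association_vector(["asthma", "hypertension"], topic_term, corpus)
print([round(x, 3) for x in vec])
```

Under these assumptions a common term such as hypertension contributes little to any cluster, while a rarer term concentrated in one cluster pulls the patient strongly towards it.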
Dynamic patient-cluster associations: examples
Patient 1                                 topic 1  topic 2  topic 3  topic 4
OA                                           0.20     0.21     0.07     0.04
skin ulcer                                   1.29     0.21     0.07     0.05
dermatitis                                   1.76     0.21     0.07     0.05
erectile dysfunction                         1.76     0.68     0.07     0.51
Primary malignancy skin                      2.14     1.46     0.07     0.51

Patient 2                                 topic 1  topic 2  topic 3  topic 4
spondylosis                                  0.31     0.37     0.26     0.00
obesity                                      0.45     0.50     0.34     0.50
urine incontinence                           0.75     0.50     1.06     0.51
female genital prolapse                      1.22     0.51     1.75     0.51
type 2 diabetes                              1.22     0.51     1.75     1.43
unspecified rare diabetes                    1.22     0.51     1.75     2.56
fracture of the hip                          1.23     2.62     1.75     2.57

Patient 3                                 topic 1  topic 2  topic 3  topic 4
dermatitis                                   0.47     0.00     0.00     0.00
hypertension                                 0.55     0.16     0.00     0.16
atrial fibrillation                          0.55     1.41     0.00     0.16
OA                                           0.75     1.62     0.07     0.20
tinnitus                                     1.24     2.02     0.33     0.20

Patient 4                                 topic 1  topic 2  topic 3  topic 4
PTSD                                         0.22     0.00     0.68     0.00
COPD                                         0.54     0.94     0.69     0.08
Neuromuscular dysfunction of bladder         0.54     1.26     1.87     0.09
female genital prolapse                      1.01     1.26     2.55     0.09
OA                                           1.21     1.47     2.63     0.12

Patient 5                                 topic 1  topic 2  topic 3  topic 4
Peripheral venous and lymphatic disease      0.59     0.24     0.07     0.00
psoriasis                                    1.07     0.45     0.07     0.55
female genital prolapse                      1.54     0.45     0.76     0.55
CHD                                          1.54     1.11     0.76     0.84
Alcohol dependence                           1.74     1.59     0.76     1.25
obesity                                      1.88     1.72     0.84     1.75
hearing loss                                 2.21     2.12     0.85     1.76
urine incontinence                           2.51     2.12     1.57     1.76

Patient 6                                 topic 1  topic 2  topic 3  topic 4
asthma                                       0.80     0.00     0.00     0.00
hypertension                                 0.88     0.16     0.00     0.16
hearing loss                                 1.21     0.56     0.01     0.17
Alcohol dependence                           1.41     1.04     0.01     0.58
Patient: disease terms sequence T = [t1, t2, …, tn]
Patient stages: T1 = [t1], T2 = [t1, t2], …, Tn = T = [t1, t2, …, tn]
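The rows in the examples are cumulative: each new diagnosis adds a fixed per-term contribution to every topic (for instance, OA adds about 0.20, 0.21, 0.07 and 0.04 wherever it appears in Patients 1 and 3). A minimal sketch of the per-stage computation follows; the per-term weights are obtained here by differencing consecutive rows of the Patient 1 table, so they carry rounding error and are illustrative only.

```python
from itertools import accumulate

# Per-term contributions to topics 1..4, derived by differencing the rounded
# Patient 1 rows in the examples (illustrative values, not the model's weights).
term_weights = {
    "OA": (0.20, 0.21, 0.07, 0.04),
    "skin ulcer": (1.09, 0.00, 0.00, 0.01),
    "dermatitis": (0.47, 0.00, 0.00, 0.00),
}

def add(u, v):
    """Component-wise sum of two association vectors."""
    return tuple(a + b for a, b in zip(u, v))

def stage_vectors(timeline, weights):
    """Association vector for each stage T1, T2, ..., Tn of a timeline."""
    return list(accumulate((weights[t] for t in timeline), add))

stages = stage_vectors(["OA", "skin ulcer", "dermatitis"], term_weights)
for k, vec in enumerate(stages, start=1):
    print(f"T{k}:", [f"{x:.2f}" for x in vec])
```

The three printed stages reproduce the first three rows of the Patient 1 example.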
Pipeline implementation
Prototyping Architecture:
8 cores / 16 GB
Python / Pandas, LDA gensim / PyLDAvis
Execution times:
- LDA clustering: 5 min
- Generating “gravitational pull” vectors for all 143K patients: 30 min
Data preparation:
- Raw EHRs from UK Biobank were pre-processed beforehand
- Common data engineering tasks are shared across projects
Summary and ongoing work
Initial investigation into an experimental study pipeline aimed at
- identifying disease clusters (topics) based on the medical timelines of an
entire population
- mapping those patients to the clusters based on their diagnoses
- studying the dynamics of how such associations change over time
Questions and work in progress:
1. Are there useful definitions of:
- Patients’ attraction to a constellation of clusters at any given time
- Stability of attraction over time
2. What causes instability?
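As a starting point for the first question, one candidate (and deliberately naive) definition of stability is to track which topic has the strongest association at each stage and count how often that dominant topic changes. This is an assumption made for illustration, not the measure proposed by the authors; the stage vectors are the Patient 1 and Patient 2 rows from the examples, and ties are broken in favour of the lower-numbered topic.

```python
def dominant_topics(stages):
    """1-based index of the strongest topic at each stage (ties: lowest index)."""
    return [max(range(len(v)), key=v.__getitem__) + 1 for v in stages]

def count_switches(stages):
    """Number of times the dominant topic changes between consecutive stages."""
    dom = dominant_topics(stages)
    return sum(1 for a, b in zip(dom, dom[1:]) if a != b)

# Per-stage association vectors copied from the Patient 1 and Patient 2 examples.
patient_1 = [
    (0.20, 0.21, 0.07, 0.04),
    (1.29, 0.21, 0.07, 0.05),
    (1.76, 0.21, 0.07, 0.05),
    (1.76, 0.68, 0.07, 0.51),
    (2.14, 1.46, 0.07, 0.51),
]
patient_2 = [
    (0.31, 0.37, 0.26, 0.00),
    (0.45, 0.50, 0.34, 0.50),
    (0.75, 0.50, 1.06, 0.51),
    (1.22, 0.51, 1.75, 0.51),
    (1.22, 0.51, 1.75, 1.43),
    (1.22, 0.51, 1.75, 2.56),
    (1.23, 2.62, 1.75, 2.57),
]

print(dominant_topics(patient_1), count_switches(patient_1))
print(dominant_topics(patient_2), count_switches(patient_2))
```

Under this definition Patient 1 settles on topic 1 after the first diagnosis, while Patient 2 keeps moving between topics 2, 3 and 4, which matches the informal notions of stable and unstable patients used earlier in the talk.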
Data:
about 143,000 MLTC participants from UK Biobank
