1. Trace Clustering on Very Large Event Data in Healthcare Using Frequent Sequence Patterns
Xixi Lu (x.lu@uu.nl)
Hajo A. Reijers
Seyed A. Tabatabaei
Mark Hoogendoorn
2. Background
• An increasing number of case studies apply process mining (PM) to healthcare data
• Emergency process
• Sepsis
• Oncology
• …
• Patient grouping/classification
• Patient pathway planning
• Resource planning, allocation, reallocation
5. An Example
[Figure: patient traces built from the activities Registration, Lab test, Doctor’s appointment and Surgery, annotated with the diagnoses Renal insufficiency, Renal insufficiency < 30ml, Kidney transplant and Diabetes]
Same cluster?
6. Related Work – Trace Clustering
However…
• Feature vector based
• Trace sequence based
• Model based
[Figure: clustering patients]
• Greco, G., Guzzo, A., Pontieri, L., Saccà, D.: Discovering expressive process models by clustering log traces. IEEE Trans. Knowl. Data Eng. (2006)
• Song, M., Günther, C.W., van der Aalst, W.M.P.: Trace clustering in process mining. BPM workshop (2008)
• De Weerdt, J., vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Active trace clustering for improved process discovery. IEEE Trans. Knowl. Data Eng. (2013)
• Bose, R.P.J.C., van der Aalst, W.M.P.: Trace clustering based on conserved patterns: towards achieving better process models. BPM workshop (2010)
• Bose, R.P.J.C., van der Aalst, W.M.P.: Context aware trace clustering: towards improving process mining results. In: Proceedings of the SDM (2009)
• Chatain, T., Carmona, J., van Dongen, B.: Alignment-based trace clustering. In: ER 2017. LNCS
• …
7. Challenge 1 - Complexity of Healthcare Data
For year 2017
• 130k patients
• 90k distinct traces
• 4m events
• 5k activities
• 1k diagnostic codes
8. Challenge 2 - Unknown Number of Clusters
[Figure: clustering patients]
Number of clusters? 4? 5? >1000?
9. Challenge 3 - Quality of Clusters
• Highly dependent on domain/medical knowledge
• Difficult to convince domain experts, or to get them to use the clusters
“We found this cluster because we used features X, Jaccard distance, PCA, weights, hierarchical clustering with average linkage…”
“… this cannot be a patient group because *&^@%$%^#*)#(&@#&^%@...”
10. Research Question
• Handle very complex event data
• Handle unknown number of clusters
• Incorporate and leverage domain knowledge
[Figure: trace clustering]
11. Sample Based Trace Clustering
[Figure: sample set 1, sample set 2, …, sample set k are each used for clustering, yielding e.g. a cluster of diabetes patients and a cluster of kidney patients]
12. … Using Frequent Sequence Pattern Mining
[Figure: pipeline with inputs Event Log and Sample Set; steps 1. Mine FSP, 2. Train Parameters, 3. Classify / Cluster; intermediate results Frequent sequence patterns and Support/Thresholds; output Cluster C]
13. Frequent Sequence Patterns and Support
Sample set (Patients 1 to 3):
E A C Y
X A C U E
Z A C E
Pattern  Support
A        3/3
A C      3/3
A E      2/3
C E      2/3
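The support values on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes the three sample traces read E A C Y, X A C U E and Z A C E, and that a pattern is supported by a trace when its activities occur in the trace in the same order, with gaps allowed. The function names `is_subsequence` and `support` are illustrative.

```python
from fractions import Fraction

def is_subsequence(pattern, trace):
    """True if the items of `pattern` occur in `trace` in order (gaps allowed)."""
    remaining = iter(trace)
    # `item in remaining` advances the iterator, so order is respected.
    return all(item in remaining for item in pattern)

def support(pattern, traces):
    """Fraction of traces that contain `pattern` as a subsequence."""
    hits = sum(is_subsequence(pattern, t) for t in traces)
    return Fraction(hits, len(traces))

# Assumed reading of the three sample traces shown on the slide.
sample_set = [list("EACY"), list("XACUE"), list("ZACE")]

for pattern in (["A"], ["A", "C"], ["A", "E"], ["C", "E"]):
    print(" ".join(pattern), support(pattern, sample_set))
```

Under these assumptions, A and A C are supported by all three traces, while A E and C E are supported by two of them, because in E A C Y the activity E occurs before A and C.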
14. Mining Frequent Sequence Patterns
1. Mine FSP
Sample set (Patients 1 to 3):
E A C Y
X A C U E
Z A C E
Support threshold = 0.5
Frequent sequence patterns: A, C, E, A C, A E, C E
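The mining step can be illustrated with a brute-force enumeration. This is only a sketch under stated assumptions: the sample traces are read as E A C Y, X A C U E and Z A C E, patterns are limited to length 2 (the `max_len` parameter is an assumption chosen to match the patterns listed on the slide), and `mine_fsp` is a hypothetical helper, not the ProM implementation. Practical miners prune the candidate space rather than enumerating it.

```python
from itertools import product

def is_subsequence(pattern, trace):
    """True if the items of `pattern` occur in `trace` in order (gaps allowed)."""
    remaining = iter(trace)
    return all(item in remaining for item in pattern)

def mine_fsp(traces, min_support, max_len=2):
    """Enumerate all candidate patterns up to `max_len` and keep the frequent ones."""
    alphabet = sorted({activity for trace in traces for activity in trace})
    frequent = {}
    for length in range(1, max_len + 1):
        for candidate in product(alphabet, repeat=length):
            hits = sum(is_subsequence(candidate, t) for t in traces)
            sup = hits / len(traces)
            if sup >= min_support:
                frequent[candidate] = sup
    return frequent

# Assumed reading of the three sample traces shown on the slide.
sample_set = [list("EACY"), list("XACUE"), list("ZACE")]
patterns = mine_fsp(sample_set, min_support=0.5)

for pattern, sup in patterns.items():
    print(" ".join(pattern), round(sup, 2))
```

With a support threshold of 0.5 this yields A, C, E, A C, A E and C E. Raising `max_len` to 3 would also admit A C E, which two of the three traces contain.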
15. Training Parameter
2. Train Parameters
Sample set (Patients 1 to 3): E A C Y; X A C U E; Z A C E
Frequent sequence patterns: A, C, E, A C, A E, C E
Further traces (Patients 4 and 5): B A C Z; Y A C E
Trace      Sc1  Sc2  ScClo
Patient 1  3    3    3
Patient 2  3    2    2
Patient 3  3    3    3
Patient 4  2    1    1
Patient 5  3    3    3
Patient 6  …    …    …
Patient 7  …    …    …
…          …    …    …
Thresholds: φ_1 = 3, φ_2 = 3, φ_clo = 3
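One plausible reading of the Sc1 and Sc2 columns is that they count how many frequent patterns of length 1 and length 2 a trace contains. The sketch below works under that assumption, and further assumes that Patient 4 is the trace B A C Z and Patient 5 the trace Y A C E. ScClo is left out because the closed pattern set is not listed on the slide. All names are illustrative.

```python
def is_subsequence(pattern, trace):
    """True if the items of `pattern` occur in `trace` in order (gaps allowed)."""
    remaining = iter(trace)
    return all(item in remaining for item in pattern)

# Frequent patterns as listed on the slides, grouped by length.
PATTERNS_LEN1 = [("A",), ("C",), ("E",)]
PATTERNS_LEN2 = [("A", "C"), ("A", "E"), ("C", "E")]

def scores(trace):
    """Assumed scoring: number of matched patterns per pattern length."""
    sc1 = sum(is_subsequence(p, trace) for p in PATTERNS_LEN1)
    sc2 = sum(is_subsequence(p, trace) for p in PATTERNS_LEN2)
    return sc1, sc2

print("Patient 4", scores(list("BACZ")))  # assumed trace of Patient 4
print("Patient 5", scores(list("YACE")))  # assumed trace of Patient 5
```

Under these assumptions the sketch gives (2, 1) for Patient 4 and (3, 3) for Patient 5, which agrees with the Sc1 and Sc2 values in the table for those two patients.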
16. Applying Thresholds
3. Classify / Cluster
Thresholds: φ_1 = 3, φ_2 = 3, φ_clo = 3
Trace      Sc1  Sc2  ScClo  In cluster
Patient 1  3    3    3      ✔
Patient 2  3    2    2      ✘
Patient 3  3    3    3      ✔
Patient 4  2    1    1      ✘
Patient 5  3    3    3      ✔
Patient 6  …    …    …      ✔
Patient 7  …    …    …      ✘
…          …    …    …      ✘
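The thresholding step can be sketched as a simple comparison. The decision rule used here, that a trace joins the cluster only if every score reaches its threshold, is an assumption that happens to agree with the check marks on the slide; the score values are copied from the table, and the names `Scores`, `THRESHOLDS` and `in_cluster` are illustrative.

```python
from typing import NamedTuple

class Scores(NamedTuple):
    sc1: int
    sc2: int
    sc_clo: int

# Threshold values as shown on the slide (phi_1, phi_2, phi_clo).
THRESHOLDS = Scores(sc1=3, sc2=3, sc_clo=3)

def in_cluster(scores, thresholds=THRESHOLDS):
    """Assumed rule: accept a trace only if every score meets its threshold."""
    return all(s >= t for s, t in zip(scores, thresholds))

# Score rows copied from the slide's table (Patients 1 to 5).
table = {
    "Patient 1": Scores(3, 3, 3),
    "Patient 2": Scores(3, 2, 2),
    "Patient 3": Scores(3, 3, 3),
    "Patient 4": Scores(2, 1, 1),
    "Patient 5": Scores(3, 3, 3),
}

for name, sc in table.items():
    print(name, "✔" if in_cluster(sc) else "✘")
```

Because the rule only compares a trace's own scores against fixed thresholds, each trace can be classified independently of all others, so no pairwise distances between traces are needed.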
17. Evaluation
• Implemented in the ProM framework, TraceClusteringFSM package
• Evaluation using a real-life data set from VUMC
• 5 years of patient pathway records
• 3 patient groups, 15 clusters
25. Evaluation – Comparing F1-scores of FSP to Frequent Item Set
[Figure: F1-scores of FSP compared to Frequent Item Set; values shown: 0.75, 0.94, 0.67]
[1] Seyed Amin Tabatabaei, Xixi Lu, Mark Hoogendoorn, Hajo A. Reijers: Identifying Patient Groups based on Frequent Patterns of Patient Samples. CoRR abs/1904.01863 (2019)
26. Process Maps of Frequent Sequence Patterns
A data scientist was able to validate that “kalium” (potassium), “kreatinine” (creatinine), “calcium” (calcium), “fosfaat” (phosphate), “albumine” (albumin), “natrium” (sodium), “ureum bloed” (blood urea), etc. are important activities (e.g., lab activities) in the clinical pathway of the kidney groups.
27. Conclusion & Future Work
• Handles complex data and an unknown number of clusters
• Clusters found are relatively in line with domain knowledge
• Reasonably high accuracy
• Behavioral criteria that can be used to communicate the clusters to domain experts
• Evaluated on healthcare data
• Future work
• Strategies to select patterns
• The effect of sample size
• Validation with medical experts
• Evaluation using other data sets
28. THANK YOU!
Dr. ir. Xixi Lu
Utrecht University
The Netherlands
x.lu@uu.nl
Editor's Notes
Resource planning, usages, and reallocation
One concrete scenario of patient classification we encountered is the merger of VUMC and AMC hospitals => Picture!!
We can use patient groups to improve resource planning, allocation, improve process …
……
diabetes is the main cause of chronic kidney failure, corresponding to approximately 40% of the patients undergoing renal replacement therapy and outgrowing the number of cases of nephritis and systemic arterial
Feature-vector-based trace clustering
# cases ^2 * number of features…
Trace Sequence based trace clustering
Avg. 40 events, Max > 3000 events for one patient.
Model-based trace clustering
Kidney, 140 patients:
> 600 activities, > 200 dbc, > 4000 combinations
> 400 million comparisons between the feature vectors.
> 1000 DBC… do we have 1000 clusters? Are we trying to find them all???
“Difference to supervised classification???”
We would like to use behavioral features, not just static features…
Chronical “attribute” first high, then low => yes a kidney patient.
First low, then high => not a kidney patient
Currently, in Process Mining, only model based classification => conformance checking.
Advantages…
Independently
iteratively
Lucky for us, the clusters are defined.
Phi_s : Support threshold
K: Training sample size & Sample size…?
=> Create a set of FSP as our criteria
Phi1, phi2, phi _clo are thresholds.
=> automated approximation using sample…
In the paper you can read the full results.
Less overfitting
What is worth mentioning is that
Significantly higher results than FIS.
FIS was not able to obtain / report any results on NHTumor group.
Do we need to explain the concept “Sequence Patterns” closed set SPclo??
Start with background information. … in 2016…
What is 8.0 trillion… Germany’s GDP in 2016 is 3,68 trillion…
The point is: improving healthcare processes could have a huge impact.
“Patient classification” is one specific topic to be improved. Improving patient classification/clustering can help ….
-----------------------------
Health spending globally reached $8·0 trillion (7·8–8·1) in 2016 (comprising 8·6% of the global economy) … Globally, health spending is projected to increase to $15·0 trillion (14·0–16·0) by 2050 (reaching 9·4% of the global economy)
– Lancet 2019