BPM19 - trace clustering on very large event data

Trace Clustering on Very Large Event Data
in Healthcare
Using Frequent Sequence Patterns
Xixi Lu (x.lu@uu.nl)
Hajo A. Reijers
Seyed A. Tabatabaei
Mark Hoogendoorn

Background
• Increasing number of case studies of process mining (PM) applied to healthcare data
• Emergency process
• Sepsis
• Oncology
• …
• Patient grouping/classification
• Patient’s pathway planning
• Resource planning, allocation, reallocation

Research Problem
Finding meaningful patient groups based on their pathways
to obtain valuable insights into healthcare processes

An Example
[Figure: example pathways of several patients. The traces consist of the activities Registration, Lab test, Doctor’s appointment, and, for some patients, Surgery. The patients carry diagnoses such as Renal insufficiency, Renal insufficiency < 30ml, Kidney transplant, and Diabetes. The slide asks whether such patients belong in the same cluster.]

Related Work – Trace Clustering
• Feature vector based
• Trace sequence based
• Model based
[Figure: clustering patients]
• Greco, G., Guzzo, A., Pontieri, L., Saccà, D.: Discovering expressive process models by clustering log traces. IEEE Trans. Knowl. Data Eng. (2006)
• Song, M., Günther, C.W., van der Aalst, W.M.P.: Trace clustering in process mining. BPM Workshops (2008)
• De Weerdt, J., vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Active trace clustering for improved process discovery. IEEE Trans. Knowl. Data Eng. (2013)
• Bose, R.P.J.C., van der Aalst, W.M.P.: Trace clustering based on conserved patterns: towards achieving better process models. BPM Workshops (2010)
• Bose, R.P.J.C., van der Aalst, W.M.P.: Context aware trace clustering: towards improving process mining results. In: Proceedings of the SDM (2009)
• Chatain, T., Carmona, J., van Dongen, B.: Alignment-based trace clustering. In: ER 2017. LNCS
• …
However…

Challenge 1 - Complexity of Healthcare Data
For the year 2017
• 130k patients
• 90k distinct traces
• 4m events
• 5k activities
• 1k diagnostic codes

Challenge 2 - Unknown Number of Clusters
[Figure: clustering patients into an unknown number of groups]
Number of clusters? 4? 5? >1000?

Challenge 3 - Quality of Clusters
• Highly dependent on domain/medical knowledge
• Difficult to convince domain experts of the results, or to get them to use the clusters
[Figure: two speech bubbles]
“We found this cluster because we used features X, Jaccard distance, PCA, weights, hierarchical clustering with average linkage…”
“… this cannot be a patient group because *&^@%$%^#*)#(&@#&^%@...”

Research Question
[Figure: trace clustering]
• Handle very complex event data
• Handle unknown number of clusters
• Incorporate and leverage domain knowledge

Sample Based Trace Clustering
[Figure: each sample set (sample set 1, sample set 2, …, sample set k) is passed to the clustering step, which returns one cluster per sample set, for example a cluster of diabetes patients or a cluster of kidney patients.]

… Using Frequent Sequence Pattern Mining
[Figure: pipeline. Inputs: Event Log and Sample Set. Steps: 1. Mine FSP, 2. Train Parameters, 3. Classify / Cluster. Intermediate artifacts: frequent sequence patterns, support, thresholds. Output: Cluster C.]

Frequent Sequence Patterns and Support
Patient 1: E A C Y
Patient 2: X A C U E
Patient 3: Z A C E

Pattern   Support
A         3/3
A C       3/3
A E       2/3
C E       2/3
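
The support values on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation. It assumes the three example traces read E A C Y, X A C U E, and Z A C E, and that a pattern matches a trace when its activities occur in order with gaps allowed, which is what the support of 2/3 for "C E" suggests. The function names `contains` and `support` are illustrative.

```python
from fractions import Fraction

def contains(trace, pattern):
    """True if `pattern` occurs in `trace` as a subsequence (gaps allowed)."""
    it = iter(trace)
    # `act in it` advances the iterator, so activities must appear in order.
    return all(act in it for act in pattern)

def support(pattern, traces):
    """Fraction of traces that contain the pattern."""
    hits = sum(contains(t, pattern) for t in traces)
    return Fraction(hits, len(traces))

# Assumed reading of the three sample traces on the slide.
sample = [
    ["E", "A", "C", "Y"],       # Patient 1
    ["X", "A", "C", "U", "E"],  # Patient 2
    ["Z", "A", "C", "E"],       # Patient 3
]

for p in (["A"], ["A", "C"], ["A", "E"], ["C", "E"]):
    print(" ".join(p), support(p, sample))
```

Under these assumptions, "A" and "A C" are contained in all three traces, while "A E" and "C E" are missing from the first trace because E occurs before A and C there.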

Mining Frequent Sequence Patterns
Step 1: Mine FSP, with support threshold = 0.5
Patient 1: E A C Y
Patient 2: X A C U E
Patient 3: Z A C E
Frequent sequence patterns shown: A, C, E, A C, A E, C E
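
The mining step can be illustrated with a naive level-wise enumeration. This is a sketch under stated assumptions, not the miner used in the ProM package: it assumes the sample traces read E A C Y, X A C U E, and Z A C E, that patterns match as subsequences with gaps, and it limits pattern length with a hypothetical `max_len` parameter. The name `mine_fsp` is illustrative.

```python
def contains(trace, pattern):
    """True if `pattern` occurs in `trace` as a subsequence (gaps allowed)."""
    it = iter(trace)
    return all(act in it for act in pattern)

def support(pattern, traces):
    """Fraction of traces that contain the pattern."""
    return sum(contains(t, pattern) for t in traces) / len(traces)

def mine_fsp(traces, min_support, max_len=2):
    """Naive level-wise enumeration of frequent sequence patterns."""
    alphabet = sorted({a for t in traces for a in t})
    # Level 1: single activities that reach the support threshold.
    level = [(a,) for a in alphabet if support((a,), traces) >= min_support]
    items = [p[0] for p in level]
    frequent = []
    while level:
        frequent.extend(level)
        if len(level[0]) >= max_len:
            break
        # Extend each frequent pattern by one frequent activity and
        # keep only the extensions that are still frequent.
        level = [p + (a,) for p in level for a in items
                 if support(p + (a,), traces) >= min_support]
    return frequent

# Assumed reading of the three sample traces on the slide.
sample = [
    ["E", "A", "C", "Y"],
    ["X", "A", "C", "U", "E"],
    ["Z", "A", "C", "E"],
]

patterns = mine_fsp(sample, min_support=0.5, max_len=2)
print(patterns)
```

With a support threshold of 0.5 and patterns up to length 2, this enumeration yields A, C, E, A C, A E, and C E, the same patterns listed on the slide. Practical miners prune the search far more aggressively; the brute-force extension here is only meant to make the definition concrete.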

Training Parameters
Step 2: Train Parameters
Patient 1: E A C Y
Patient 2: X A C U E
Patient 3: Z A C E
Patient 4: B A C Z
Patient 5: Y A C E
Frequent sequence patterns: A, C, E, A C, A E, C E

Trace      Sc1  Sc2  ScClo
Patient 1  3    3    3
Patient 2  3    2    2
Patient 3  3    3    3
Patient 4  2    1    1
Patient 5  3    3    3
Patient 6  …    …    …
Patient 7  …    …    …
…          …    …    …

Thresholds: 𝜙1 = 3, 𝜙2 = 3, 𝜙clo = 3
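
One way to read the Sc1 and Sc2 columns is as counts of matched patterns. The sketch below assumes, and the slide does not state this explicitly, that Sc1 is the number of length-1 patterns contained in a trace and Sc2 the number of length-2 patterns. It also assumes the two additional traces read B A C Z (Patient 4) and Y A C E (Patient 5). ScClo is left out because the slide does not show which patterns it counts. The function name `score` is illustrative.

```python
def contains(trace, pattern):
    """True if `pattern` occurs in `trace` as a subsequence (gaps allowed)."""
    it = iter(trace)
    return all(act in it for act in pattern)

# Frequent sequence patterns as listed on the slides.
PATTERNS = [("A",), ("C",), ("E",), ("A", "C"), ("A", "E"), ("C", "E")]

def score(trace, patterns, size):
    """Assumed scoring rule: count the patterns of the given size found in the trace."""
    return sum(1 for p in patterns if len(p) == size and contains(trace, p))

# Assumed reading of the two extra traces on the slide.
patient_4 = ["B", "A", "C", "Z"]
patient_5 = ["Y", "A", "C", "E"]

for name, trace in (("Patient 4", patient_4), ("Patient 5", patient_5)):
    print(name, score(trace, PATTERNS, 1), score(trace, PATTERNS, 2))
```

Under this reading, Patient 4 scores 2 and 1 and Patient 5 scores 3 and 3, which agrees with the Sc1 and Sc2 entries for these two patients in the table.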

Applying Thresholds
Step 3: Classify / Cluster
Patient 1: E A C Y
Patient 2: X A C U E
Patient 3: Z A C E
Patient 4: B A C Z
Patient 5: Y A C E
Frequent sequence patterns: A, C, E, A C, A E, C E
Thresholds: 𝜙1 = 3, 𝜙2 = 3, 𝜙clo = 3

Trace      Sc1  Sc2  ScClo  Meets thresholds
Patient 1  3    3    3      ✔
Patient 2  3    2    2      ✘
Patient 3  3    3    3      ✔
Patient 4  2    1    1      ✘
Patient 5  3    3    3      ✔
Patient 6  …    …    …      ✔
Patient 7  …    …    …      ✘
…          …    …    …      ✘
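
The thresholding step can be sketched as a simple filter over the score table. The rule assumed here is that a trace joins the cluster only if every score reaches its threshold. That rule is an inference from the check marks shown for Patients 1 to 5, not something the slide states. The names `Scores` and `in_cluster` are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scores:
    sc1: int
    sc2: int
    sc_clo: int

def in_cluster(s, phi1=3, phi2=3, phi_clo=3):
    """Assumed rule: all three scores must reach their thresholds."""
    return s.sc1 >= phi1 and s.sc2 >= phi2 and s.sc_clo >= phi_clo

# Scores for Patients 1 to 5 as given in the table on the slide.
table = {
    "Patient 1": Scores(3, 3, 3),
    "Patient 2": Scores(3, 2, 2),
    "Patient 3": Scores(3, 3, 3),
    "Patient 4": Scores(2, 1, 1),
    "Patient 5": Scores(3, 3, 3),
}

selected = [name for name, s in table.items() if in_cluster(s)]
print(selected)
```

With 𝜙1 = 𝜙2 = 𝜙clo = 3, the filter keeps Patients 1, 3, and 5 and rejects Patients 2 and 4, matching the marks in the table.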

Evaluation
• Implemented in the ProM framework, TraceClusteringFSM package
• Evaluation using a real-life data set from VUMC
• 5 years of patient pathway records
• 3 patient groups, 15 clusters

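Evaluation – Data – Number of cases
[Figure: bar chart on a logarithmic axis (1 to 10,000,000) comparing the logs All_17, All_16, All_15, All_14, and All_13 on #cases, #dpi, #evts, #avg. e/c, #max. e/c, #acts, and #dbcs.]
Values labeled on the chart:
#cases  #dpi  #evts  #avg. e/c  #max. e/c  #acts  #dbcs
130k    99k   3.9m   30         2.6k       5.1k   1.9k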

Evaluation – Data – Number of cases / Cluster
[Figure: bar chart of the number of cases per cluster, five values per series, axis from 0 to 160,000.]
Data labels per series:
• 128,505; 124,966; 128,430; 133,811; 133,438
• 140; 145; 118; 88; 81
• 1,521; 1,468; 1,492; 1,562; 1,573
• 1,050; 1,225; 1,298; 1,325; 1,350

Evaluation - Effect of “Configurations” on F1
[Figure: a sample set (30) is used to cluster the event log into Cluster C, which is then compared with the ground truth.]
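
For reference, the F1-score of a found cluster against a ground-truth patient group can be computed as follows. This is the standard definition of F1 as the harmonic mean of precision and recall. The patient identifiers in the example are hypothetical and not taken from the data set.

```python
def f1_score(predicted, truth):
    """F1 of a predicted cluster against a ground-truth group (both sets of case ids)."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)          # cases correctly placed in the cluster
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)      # share of the cluster that is correct
    recall = tp / len(truth)             # share of the group that was found
    return 2 * precision * recall / (precision + recall)

# Hypothetical case ids, for illustration only.
cluster_c = {"p1", "p2", "p3"}
ground_truth = {"p2", "p3", "p4", "p5"}
print(f1_score(cluster_c, ground_truth))
```

In this toy example, precision is 2/3 and recall is 1/2, so F1 is 4/7.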

Effect of Support Threshold?

Effect of Training Sample Size?

Evaluation – Comparing F1-scores of FSP to Frequent Item Set
[Figure: F1-score comparison; values shown: 0.75, 0.94, 0.67]
[1] Seyed Amin Tabatabaei, Xixi Lu, Mark Hoogendoorn, Hajo A. Reijers: Identifying Patient Groups based on Frequent Patterns of Patient Samples. CoRR abs/1904.01863 (2019)

Process Maps of Frequent Sequence Patterns
A data scientist was able to validate that “kalium” (potassium), “kreatinine” (creatinine), “calcium” (calcium), “fosfaat” (phosphate), “albumine” (albumin), “natrium” (sodium), “ureum bloed” (blood urea), etc. are important activities (e.g., lab activities) in the clinical pathway of the kidney groups.

Conclusion & Future Work
• Handle complex data and unknown number of clusters
• Clusters found are relatively in line with domain knowledge
• Reasonably high accuracy
• Behavior criteria which can be used to communicate
• Evaluated on the healthcare data
• Future work
• Strategies to select patterns
• The effect of sample size
• Validation with medical experts
• Evaluation using other data sets

THANK YOU!
Dr. ir. Xixi Lu
Utrecht University
The Netherlands
x.lu@uu.nl