Trace Clustering on Very Large Event Data in Healthcare Using Frequent Sequence Patterns
Xixi Lu (x.lu@uu.nl)
Hajo A. Reijers
Seyed A. Tabatabaei
Mark Hoogendoorn
Background
• Increasing number of case studies of process mining (PM) applied to healthcare data
• Emergency process
• Sepsis
• Oncology
• …
• Patient grouping/classification
• Patient’s pathway planning
• Resource planning, allocation, reallocation
Research Problem
Finding meaningful patient groups based on their pathways to obtain valuable insights into healthcare processes
An Example
[Figure: example patient pathways built from the activities Registration, Lab test, Doctor’s appointment and Surgery, annotated with the diagnoses Renal insufficiency, Renal insufficiency < 30ml, Kidney transplant and Diabetes. Question posed for pairs of pathways: same cluster?]
Related Work – Trace Clustering
• Feature vector based
• Trace sequence based
• Model based
• Greco, G., Guzzo, A., Pontieri, L., Saccà, D.: Discovering expressive process models by clustering log traces. IEEE Trans. Knowl. Data Eng. (2006)
• Song, M., Günther, C.W., van der Aalst, W.M.P.: Trace clustering in process mining. BPM workshop (2008)
• De Weerdt, J., vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Active trace clustering for improved process discovery. IEEE Trans. Knowl. Data Eng. (2013)
• Bose, R.P.J.C., van der Aalst, W.M.P.: Trace clustering based on conserved patterns: towards achieving better process models. BPM workshop (2010)
• Bose, R.P.J.C., van der Aalst, W.M.P.: Context aware trace clustering: towards improving process mining results. In: Proceedings of the SDM (2009)
• Chatain, T., Carmona, J., van Dongen, B.: Alignment-based trace clustering. In: ER 2017. LNCS
• …
However…
Challenge 1 - Complexity of Healthcare Data
For the year 2017:
• 130k patients
• 90k distinct traces
• 4m events
• 5k activities
• 1k diagnostic codes
Challenge 2 - Unknown Number of Clusters
Clustering patients: number of clusters? 4? 5? >1000? …
Challenge 3 - Quality of Clusters
• Highly dependent on domain/medical knowledge
• Difficult to convince domain experts, or to get them to use the results
“We found this cluster because we used features X, Jaccard distance, PCA, weights, hierarchical clustering with average linkage…”
“… this cannot be a patient group because *&^@%$%^#*)#(&@#&^%@...”
Research Question
Trace clustering that can:
• Handle very complex event data
• Handle unknown number of clusters
• Incorporate and leverage domain knowledge
Sample Based Trace Clustering
[Figure: clustering is applied once per sample set (Sample set 1, Sample set 2, …, Sample set k), each yielding one cluster, e.g. a cluster of diabetes patients or a cluster of kidney patients.]
… Using Frequent Sequence Pattern Mining
1. Mine FSP
2. Train Parameters
3. Classify / Cluster
[Figure: pipeline with the inputs Event Log and Sample Set, the intermediate results “Frequent sequence patterns” and “Support/Thresholds”, and the output Cluster C.]
Frequent Sequence Patterns and Support
Sample traces (Patients 1–3):
• E A C Y
• X A C U E
• Z A C E
Patterns and their support:
• A: Support = 3/3
• A C: Support = 3/3
• A E: Support = 2/3
• C E: Support = 2/3
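The support values on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the three sample traces read E A C Y, X A C U E and Z A C E, and that a trace supports a pattern when the pattern occurs in it as a subsequence (order preserved, gaps allowed). The function names are illustrative and are not taken from the TraceClusteringFSM package.

```python
from fractions import Fraction


def contains(trace, pattern):
    """Return True if `pattern` occurs in `trace` as a subsequence (gaps allowed)."""
    remaining = iter(trace)
    # `item in remaining` advances the iterator, so order is respected.
    return all(item in remaining for item in pattern)


def support(pattern, traces):
    """Fraction of traces that contain `pattern`."""
    return Fraction(sum(contains(t, pattern) for t in traces), len(traces))


# Sample traces as read from the slide (assumed reading of the diagram).
traces = [list("EACY"), list("XACUE"), list("ZACE")]

for pattern in ["A", "AC", "AE", "CE"]:
    print(pattern, support(list(pattern), traces))
```

Under this reading, A E and C E are not supported by the first trace because E occurs before A and C there, which gives the 2/3 shown on the slide.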
Mining Frequent Sequence Patterns
1. Mine FSP (support threshold = 0.5)
Sample traces (Patients 1–3):
• E A C Y
• X A C U E
• Z A C E
Frequent sequence patterns: A, C, E, A C, A E, C E, A C E
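The mining step can be sketched as a naive level-wise search. This is an illustration only, under the same assumptions as the slide example (sample traces E A C Y, X A C U E, Z A C E, support threshold 0.5, patterns matched as subsequences); the function `mine_fsp` is hypothetical and says nothing about how the ProM package implements the step.

```python
def is_subsequence(pattern, trace):
    """Return True if `pattern` occurs in `trace` in order, gaps allowed."""
    remaining = iter(trace)
    return all(item in remaining for item in pattern)


def mine_fsp(traces, min_support):
    """Naive level-wise mining of frequent sequence patterns.

    Returns a dict mapping each frequent pattern (a tuple) to its support.
    """
    if not 0 < min_support <= 1:
        raise ValueError("min_support must be in (0, 1]")
    n = len(traces)

    def sup(pattern):
        return sum(is_subsequence(pattern, t) for t in traces) / n

    alphabet = sorted({a for t in traces for a in t})
    level = [(a,) for a in alphabet if sup((a,)) >= min_support]
    items = [p[0] for p in level]
    frequent = {}
    while level:
        for p in level:
            frequent[p] = sup(p)
        # Extend every frequent pattern by one frequent activity and re-check.
        candidates = [p + (a,) for p in level for a in items]
        level = [c for c in candidates if sup(c) >= min_support]
    return frequent


traces = [list("EACY"), list("XACUE"), list("ZACE")]
patterns = mine_fsp(traces, min_support=0.5)
for p in sorted(patterns, key=lambda q: (len(q), q)):
    print(" ".join(p), round(patterns[p], 2))
```

The search stops on its own because a pattern longer than every trace has support 0. For realistic logs a dedicated sequence mining algorithm would be needed; the exhaustive candidate generation here is only practical for toy input.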
Training Parameter
2. Train Parameters
Sample traces (Patients 1–3): E A C Y; X A C U E; Z A C E
Frequent sequence patterns: A, C, E, A C, A E, C E, A C E
Further traces (Patients 4 and 5): B A C Z; Y A C E

Trace      Sc1  Sc2  ScClo
Patient 1  3    3    3
Patient 2  3    2    2
Patient 3  3    3    3
Patient 4  2    1    1
Patient 5  3    3    3
Patient 6  …    …    …
Patient 7  …    …    …
…          …    …    …

𝜙1 = 3, 𝜙2 = 3, 𝜙clo = 3
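One possible reading of the three scores is sketched below: Sc1 counts the frequent patterns of length 1 that a trace contains, Sc2 those of length 2, and ScClo the closed frequent patterns (patterns with no longer frequent pattern of equal support). This reading is an assumption, not a definition given on the slide. Under it, the traces B A C Z and Y A C E score (2, 1, 1) and (3, 3, 3), matching the table rows for Patients 4 and 5. The pattern counts are copied by hand from the mining example.

```python
def is_subsequence(pattern, trace):
    """Return True if `pattern` occurs in `trace` in order, gaps allowed."""
    remaining = iter(trace)
    return all(item in remaining for item in pattern)


# Frequent sequence patterns with the number of supporting sample traces
# (hand-copied from the example with threshold 0.5 on three traces).
FSP = {
    ("A",): 3, ("C",): 3, ("E",): 3,
    ("A", "C"): 3, ("A", "E"): 2, ("C", "E"): 2,
    ("A", "C", "E"): 2,
}


def closed_patterns(fsp):
    """Patterns that have no longer super-pattern with the same support."""
    return [
        p for p, count in fsp.items()
        if not any(
            len(q) > len(p) and fsp[q] == count and is_subsequence(p, q)
            for q in fsp
        )
    ]


def scores(trace, fsp):
    """Return (Sc1, Sc2, ScClo) for one trace under the assumed definitions."""
    sc1 = sum(1 for p in fsp if len(p) == 1 and is_subsequence(p, trace))
    sc2 = sum(1 for p in fsp if len(p) == 2 and is_subsequence(p, trace))
    sc_clo = sum(1 for p in closed_patterns(fsp) if is_subsequence(p, trace))
    return sc1, sc2, sc_clo


print(sorted(closed_patterns(FSP)))
print("B A C Z:", scores(list("BACZ"), FSP))
print("Y A C E:", scores(list("YACE"), FSP))
```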
Applying Thresholds
3. Classify / Cluster
Sample traces (Patients 1–3): E A C Y; X A C U E; Z A C E
Frequent sequence patterns: A, C, E, A C, A E, C E, A C E
Further traces (Patients 4 and 5): B A C Z; Y A C E
𝜙1 = 3, 𝜙2 = 3, 𝜙clo = 3

Trace      Sc1  Sc2  ScClo  In cluster
Patient 1  3    3    3      ✔
Patient 2  3    2    2      ✘
Patient 3  3    3    3      ✔
Patient 4  2    1    1      ✘
Patient 5  3    3    3      ✔
Patient 6  …    …    …      ✔
Patient 7  …    …    …      ✘
…          …    …    …      ✘
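The classification step can be illustrated by applying the three thresholds to the score table. The sketch assumes that a trace is assigned to the cluster only when every score reaches its threshold; with 𝜙1 = 𝜙2 = 𝜙clo = 3 this reproduces the ✔/✘ marks for Patients 1 to 5. The names `Thresholds` and `in_cluster` are hypothetical.

```python
from typing import NamedTuple


class Thresholds(NamedTuple):
    phi1: int
    phi2: int
    phi_clo: int


def in_cluster(score, thresholds):
    """Assumed rule: all three scores must reach their thresholds."""
    return all(s >= t for s, t in zip(score, thresholds))


# Score rows (Sc1, Sc2, ScClo) as shown in the table on the slide.
table = {
    "Patient 1": (3, 3, 3),
    "Patient 2": (3, 2, 2),
    "Patient 3": (3, 3, 3),
    "Patient 4": (2, 1, 1),
    "Patient 5": (3, 3, 3),
}

thresholds = Thresholds(phi1=3, phi2=3, phi_clo=3)
cluster = [name for name, score in table.items() if in_cluster(score, thresholds)]
print(cluster)
```

Lowering a threshold admits more traces into the cluster, so the thresholds act as the tunable criteria that decide cluster membership.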
Evaluation
• Implemented in the ProM framework, TraceClusteringFSM package
• Evaluation using a real-life data set from VUMC
• 5 years of patient pathway records
• 3 patient groups, 15 clusters
Evaluation – Data – Number of cases
[Bar chart on a logarithmic axis (1 to 10,000,000) with the series All_13, All_14, All_15, All_16 and All_17. Data labels shown:]
#cases   #dpi   #evts   #avg. e/c   #max. e/c   #acts   #dbcs
130k     99k    3.9m    30          2.6k        5.1k    1.9k
Evaluation – Data – Number of cases / Cluster
[Bar chart, axis from 0 to 160,000. Data labels shown, five per series:]
128,505   124,966   128,430   133,811   133,438
140       145       118       88        81
1,521     1,468     1,492     1,562     1,573
1,050     1,225     1,298     1,325     1,350
Evaluation - Effect of “Configurations” on F1
[Figure: the Event Log is clustered using a sample set (30); the resulting Cluster C is compared with the ground truth.]
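Comparing a discovered cluster with the ground truth via F1 can be sketched as follows. The patient identifiers are made up for illustration and do not come from the VUMC data; the function computes the standard F1 score (harmonic mean of precision and recall) over sets of case identifiers.

```python
def f1_score(cluster, ground_truth):
    """F1 of a discovered cluster against a ground-truth patient group.

    Both arguments are sets of case (patient) identifiers.
    """
    true_positives = len(cluster & ground_truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cluster)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 clustered patients are correct,
# and 2 of the 4 patients in the true group were found.
cluster_c = {"p1", "p2", "p3"}
truth = {"p2", "p3", "p4", "p5"}
print(round(f1_score(cluster_c, truth), 3))
```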
Effect of Support Threshold?
Effect of Training Sample Size?
Using Automated Approach?
Evaluation – Comparing F1-scores of FSP to Frequent Item Set
[Chart; data labels shown: 0.75, 0.94, 0.67]
[1] Seyed Amin Tabatabaei, Xixi Lu, Mark Hoogendoorn, Hajo A. Reijers: Identifying Patient Groups based on Frequent Patterns of Patient Samples. CoRR abs/1904.01863 (2019)
Process Maps of Frequent Sequence Patterns
A data scientist was able to validate that “kalium” (potassium), “kreatinine” (creatinine), “calcium” (calcium), “fosfaat” (phosphate), “albumine” (albumin), “natrium” (sodium), “ureum bloed” (blood urea), etc. are important activities (e.g., lab activities) in the clinical pathway of the kidney groups.
Conclusion & Future Work
• Handles complex data and an unknown number of clusters
• Clusters found are relatively in line with domain knowledge
• Reasonably high accuracy
• Behavioral criteria that can be used to communicate
• Evaluated on healthcare data
• Future work
  • Strategies to select patterns
  • The effect of sample size
  • Validation with medical experts
  • Evaluation using other data sets
THANK YOU!
Dr. ir. Xixi Lu
Utrecht University
The Netherlands
x.lu@uu.nl



BPM19 - trace clustering on very large event data


Editor's Notes

  1. Resource planning, usages, and reallocation One concrete scenario of patient classification we encountered is the merger of VUMC and AMC hospitals => Picture!!
  2. We can use patient groups to improve resource planning, allocation, improve process … ……
  3. diabetes is the main cause of chronic kidney failure, corresponding to approximately 40% of the patients undergoing renal replacement therapy and outgrowing the number of cases of nephritis and systemic arterial 
  4. Feature-vector-based trace clustering: # cases ^2 * number of features… Trace-sequence-based trace clustering: avg. 40 events, max > 3000 events for one patient. Model-based trace clustering: Kidney, 140 patients: > 600 activities, > 200 dbc, > 4000 combinations
  5. > 400 million comparisons between the feature vectors.
  6. > 1000 DBC… do we have 1000 clusters? Are we trying to find them all???
  7. Feature-vector-based trace clustering: # cases ^2 * number of features… Trace-sequence-based trace clustering: avg. 40 events, max > 3000 events for one patient. Model-based trace clustering: Kidney, 140 patients: > 600 activities, > 200 dbc, > 4000 combinations
  8. “Difference to supervised classification???” We would like to use behavioral features, not just static features… Chronical “attribute” first high, then low => yes a kidney patient. First low, then high => not a kidney patient Currently, in Process Mining, only model based classification => conformance checking. Advantages… Independently iteratively
  9. “Difference to supervised classification???” We would like to use behavioral features, not just static features… Chronical “attribute” first high, then low => yes a kidney patient. First low, then high => not a kidney patient Currently, in Process Mining, only model based classification => conformance checking. Advantages… Independently iteratively
  10. Lucky for us, the clusters are defined.
  11. Phi_s : Support threshold K: Training sample size & Sample size…? => Create a set of FSP as our criteria Phi1, phi2, phi _clo are thresholds. => automated approximation using sample…
  12. In the paper you can read the full results.
  13. Less overfitting
  14. Worth mentioning: significantly higher results than FIS. FIS was not able to obtain / report any results on the NHTumor group.
  15. Phi_s : Support threshold K: Training sample size & Sample size…? => Create a set of FSP as our criteria Phi1, phi2, phi _clo are thresholds. => automated approximation using sample…
  16. Do we need to explain the concept “Sequence Patterns” closed set SPclo??
  17. Start with background information. … in 2016… What is 8.0 trillion… Germany’s GDP in 2016 is 3,68 trillion… The point is: improving healthcare processes could have a huge impact. “Patient classification” is one specific topic to be improved. Improving patient classification/clustering can help …. ----------------------------- Health spending globally reached $8·0 trillion (7·8–8·1) in 2016 (comprising 8·6% of the global economy) … Globally, health spending is projected to increase to $15·0 trillion (14·0–16·0) by 2050 (reaching 9·4% of the global economy) – Lancet 2019