Investigating the use of novel data mining and machine learning methods in healthcare data sources of multiple natures.
1. Thesis Defense
Investigating the Use of Novel Data Mining and Machine Learning Methods in Healthcare Data Sources of Multiple Nature
Roberto Batista
2. Content
• Introduction
• Part I - Survey Data
  • Overview
  • Literature review
  • Methods
  • Exploratory Data Analysis
  • Data Transformation
  • Conclusions
• Part II - Electronic Health Record
  • Overview
  • Literature review
  • Methods
  • Exploratory Data Analysis
  • Data Transformation
  • Conclusions
4. Introduction
• Non-federal hospitals with basic systems: 9.4% in 2008, 84% in 2015.
Source: The Office of the National Coordinator for Health Information Technology (ONC)
gender, ethnicity, religion, age, social, finance, exams
5. Health Data Sources
National Library of Medicine (NIH)
• Survey
• Electronic Medical Record (EMR)
• Claim Data
• Vital Signs Data (e.g., SpO2)
• Electronic Health Record (EHR)
6. Thesis Components
• Part I: Survey
• Part II: Electronic Health Record (EHR)
7. Part I
Survey Data
- How to identify personality trait groups in the Health and Retirement Study (HRS) survey data?
9. HRS Overview
[Figure: HRS overview diagram. Recoverable labels: 3 surveys; 6 aspects; 5 aspects; 5 aspects; 22,000; >50 yo; 9 aspects; 4 derived datasets; 58.54%; Medical Ethics Training]
10. Literature Review
Gould et al., 2015:
Examines symptoms of anxiety and depression in veterans and non-veterans using CES-D and BAI.
Seligman et al., 2018:
Machine learning improves the understanding of social determinants of health.
Hülür et al., 2015:
Investigates the association between subjective memory, subjective age, and personality traits.
Fehrman et al., 2015:
Correlates personality with the consumption of eight psychoactive drugs.
Aschwanden et al., 2019:
Associates personality traits with the probability of having a preventive cancer screening.
Five personality Traits
(OCEAN):
• Openness
• Conscientiousness
• Extraversion
• Agreeableness
• Neuroticism
Machine Learning Studies
16. Methods
• Unsupervised Machine Learning
• Multiple Correspondence Analysis (MCA)
• Clustering
Data table (N individuals; categorical variables A, B, C):
     A    B    C
1    a1   b2   c1
⋮    ⋮    ⋮    ⋮
i    a2   b2   c3
i’   a1   b1   c1
⋮    ⋮    ⋮    ⋮
N    a4   b2   c2
Cloud of Individuals: stars represent individuals.
Cloud of Variables: points represent variables.
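As a rough illustration of the MCA step, the sketch below one-hot encodes a small categorical table into an indicator matrix and runs correspondence analysis on it through an SVD. It is a minimal NumPy sketch of the standard textbook construction, not the thesis code. The four rows are the ones visible in the slide's data table, and the function names `indicator_matrix` and `mca` are hypothetical.

```python
import numpy as np

def indicator_matrix(rows, names):
    """One-hot encode categorical rows; columns are labeled variable_category."""
    columns = []
    for j, name in enumerate(names):
        for cat in sorted({row[j] for row in rows}):
            columns.append((j, cat, f"{name}_{cat}"))
    Z = np.array([[1.0 if row[j] == cat else 0.0 for j, cat, _ in columns]
                  for row in rows])
    return Z, [label for _, _, label in columns]

def mca(Z, n_dims=2):
    """Correspondence analysis of the indicator matrix Z via SVD."""
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row (individual) masses
    c = P.sum(axis=0)                    # column (category) masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates: the cloud of individuals and the cloud of categories.
    row_coords = (U * sigma) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    inertia = sigma ** 2
    explained = inertia / inertia.sum()  # share of inertia per dimension
    return row_coords[:, :n_dims], col_coords[:, :n_dims], explained[:n_dims]

# Rows taken from the slide's example table (individuals 1, i, i', N).
rows = [("a1", "b2", "c1"), ("a2", "b2", "c3"),
        ("a1", "b1", "c1"), ("a4", "b2", "c2")]
Z, labels = indicator_matrix(rows, ["A", "B", "C"])
ind_coords, cat_coords, explained = mca(Z)

for label, (d1, d2) in zip(labels, cat_coords):
    print(f"{label:6s} Dim1={d1: .3f} Dim2={d2: .3f}")
```

The `variable_category` column labels mirror the naming pattern of labels such as `sophist_A lot` in the category plots, and `explained` corresponds to the percentages printed on the Dim1 and Dim2 axes.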
17. [Figure: category plots on Dim1 (8.1%) vs. Dim2 (4.7%), shown in four regions. Labels follow the pattern trait_response.]
Region 1 (Dim1 −1.00 to 0.00, Dim2 −0.5 to 1.5): sophist_A lot, sophist_Some, bminded_A lot, curious_A lot, intellig_A lot, imagina_A lot, creative_A lot, sympath_A lot, softheart_A lot, caring_A lot, warm_A lot, helpful_A lot, talkactive_A lot, active_A lot, lively_A lot, friendly_A lot, outgoing_A lot, careless_A lot, careless_Not at all, thorough_A lot, hardworker_A lot, responsible_A lot, organized_A lot, calm_A lot, nervous_Not at all, worry_Not at all, moody_Not at all
Region 2 (Dim1 0.0 to 0.5, Dim2 −0.50 to 0.50): sophist_A little, bminded_Some, curious_Some, intellig_Some, imagina_A little, creative_A little, creative_Some, sympath_Some, softheart_Some, caring_Some, warm_Some, helpful_Some, talkactive_A little, talkactive_Some, active_Some, lively_Some, friendly_Some, outgoing_Some, careless_A little, careless_Some, thorough_Some, hardworker_Some, organized_Some, calm_Some, nervous_A little, nervous_Some, worry_A little, worry_Some, moody_A little, moody_Some
Region 3 (Dim1 −0.5 to 2.0, Dim2 −1 to 2): sophist_A little, sophist_Not at all, bminded_A little, bminded_Not at all, curious_A little, intellig_A little, imagina_A little, imagina_Not at all, creative_Not at all, sympath_A little, softheart_A little, caring_A little, caring_Some, warm_A little, helpful_A little, talkactive_A little, talkactive_Not at all, active_A little, lively_A little, friendly_A little, friendly_Some, outgoing_A little, outgoing_Not at all, thorough_A little, hardworker_A little, responsible_A little, responsible_Some, organized_A little, organized_Not at all, calm_A little, calm_Some, nervous_A lot, worry_A lot, moody_A lot
Region 4 (Dim1 0.5 to 2.0, Dim2 1 to 5): bminded_Not at all, curious_Not at all, intellig_Not at all, imagina_Not at all, sympath_Not at all, softheart_Not at all, warm_Not at all, helpful_Not at all, active_Not at all, lively_Not at all, friendly_Not at all, thorough_Not at all, hardworker_Not at all, calm_Not at all
19. Conclusions
• Hierarchical clustering, applied to the low-dimensional representation of participants produced by MCA, suggested a reasonable separation of respondent profiles characterized by a personality scale.
• This approach can be applied to survey design and sampling procedures.
• It can also support correlation studies with other physical and mental health indicators.
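The first conclusion describes hierarchical clustering on low-dimensional coordinates. The sketch below shows one way that step could look, using a naive agglomerative procedure with Ward's criterion in NumPy. Several things here are assumptions: the slides do not state the linkage method or the number of clusters, the coordinates are synthetic stand-ins for MCA output, and `ward_clusters` is a hypothetical helper.

```python
import numpy as np

def ward_clusters(X, k):
    """Naive agglomerative clustering with Ward's criterion; returns labels 0..k-1."""
    clusters = [[i] for i in range(len(X))]       # start with singletons
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                gap = X[clusters[a]].mean(axis=0) - X[clusters[b]].mean(axis=0)
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * float(gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Synthetic 2-D coordinates standing in for participants' Dim1/Dim2 scores.
rng = np.random.default_rng(0)
centers = np.array([[-0.75, 0.5], [0.25, 0.0], [1.5, 3.0]])   # hypothetical groups
coords = np.vstack([c + 0.05 * rng.standard_normal((10, 2)) for c in centers])

cluster_labels = ward_clusters(coords, k=3)
print(np.bincount(cluster_labels))   # number of participants per cluster
```

The pairwise search makes this version suitable only for small examples; a survey-sized dataset would need an optimized library implementation.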
20. Paper Presented and Published
18th IEEE International Conference on Machine Learning and Applications (ICMLA 2019)
December 16-19, Boca Raton, Florida, USA
21. Part II
Electronic Health Record
- How to predict Intensive Care Unit (ICU) Length
of Stay (LOS) using Machine Learning models?
25. Literature Review
Azari et al., 2012:
Approached LOS prediction by identifying similar groups; reached an accuracy of 74.3%.
Van Houdenhoven et al., 2007:
LOS prediction for elective esophagectomy with reconstruction for carcinoma, with presence of gastroesophageal reflux disease and respiratory minute volume transthoracic. R² of 45%.
Clark & Ryan, 2002:
Tested by age group: patients younger than 55 reached the highest accuracy (69%), those aged 55 to 70 reached 13%, and those older than 70 reached 17%.
Gustafson, 1968:
Uses five different methodologies for predicting the LOS of inguinal herniotomy patients.
Afrin et al., 2019:
Predicts LOS using three classifications, focusing on patient age and death outcome. Accuracy of 54.8% (RF and LR).
Intensive Care Unit (ICU) Length of Stay (LOS)
• Wait time for ICU admission
• ICU management
• Important predictor for death rate
• ICU cost
26. Data Access
Data or Specimens Only Research training (CITI Program):
1. Belmont Report and Its Principles (ID 1127)
2. History and Ethics of Human Subjects Research (ID 498)
3. Basic Institutional Review Board (IRB) Regulations and Review Process (ID 2)
4. Records-Based Research (ID 5)
5. Genetic Research in Human Populations (ID 6)
6. Populations in Research Requiring Additional Considerations and/or Protections (ID 16680)
7. Conflicts of Interest in Human Subjects Research (ID 17464)
30. Methods
Tidymodels framework:
• rsample (data sampling)
• recipes (data preprocess)
• parsnip (machine learning modeling)
• yardstick (performance evaluation)
• Algorithm families: Decision Trees,
Random Forest, Boosted Trees, SVM,
and Linear Regression
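The four stages listed above (sampling, preprocessing, modeling, evaluation) can be sketched end to end. The thesis used the R tidymodels packages; the example below is only a Python/NumPy analogue of the same workflow, with synthetic data and a plain least-squares model standing in for the real predictors and algorithm families. All variable names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: 200 stays, 3 numeric predictors, LOS in days.
X = rng.normal(size=(200, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# 1. Data sampling (the role rsample plays): shuffled 75/25 train/test split.
order = rng.permutation(len(X))
cut = int(0.75 * len(X))
train, test = order[:cut], order[cut:]

# 2. Preprocessing (the role of recipes): standardize with training statistics only,
#    so no information from the test set leaks into the model.
mean, std = X[train].mean(axis=0), X[train].std(axis=0)
X_train = (X[train] - mean) / std
X_test = (X[test] - mean) / std

# 3. Modeling (the role of parsnip): ordinary least squares with an intercept.
design = np.column_stack([np.ones(len(X_train)), X_train])
coefs, *_ = np.linalg.lstsq(design, y[train], rcond=None)

# 4. Evaluation (the role of yardstick): RMSE on the held-out set.
pred = np.column_stack([np.ones(len(X_test)), X_test]) @ coefs
rmse = float(np.sqrt(np.mean((y[test] - pred) ** 2)))
print(f"held-out RMSE: {rmse:.3f}")
```

Keeping the four stages separate is what makes it easy to swap one model family for another while holding the split, preprocessing, and metric fixed.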
31. Methods
Predictors
• Ethnicity
• Respiratory diagnosis
Subset
• ICU: SICU
• Admission Type: Urgency
Linear Regression
• Adjusted R²: 63.75%
• RMSE: 9.56
Classifier
• Accuracy: 92.7%
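For reference, the three reported metrics can be computed as in the sketch below. It uses the standard definitions of RMSE, adjusted R², and accuracy in plain Python. The toy values, the class names, and the predictor count are hypothetical and are not the thesis data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (e.g., days)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R² for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def accuracy(labels_true, labels_pred):
    """Share of predicted classes that match the true classes."""
    hits = sum(t == p for t, p in zip(labels_true, labels_pred))
    return hits / len(labels_true)

# Toy regression example: LOS in days.
los_true = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
los_pred = [2.5, 3.5, 6.5, 7.5, 10.5, 11.5]
r2 = r_squared(los_true, los_pred)
adj_r2 = adjusted_r_squared(r2, n_obs=len(los_true), n_predictors=2)
error = rmse(los_true, los_pred)

# Toy classification example with hypothetical LOS classes.
acc = accuracy(["short", "long", "short", "long"],
               ["short", "long", "long", "long"])
print(f"adjusted R2={adj_r2:.4f} RMSE={error:.2f} accuracy={acc:.2f}")
```

Note that categorical predictors expand into several dummy columns, so `n_predictors` should count model terms rather than source variables.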
32. Conclusions
• LOS prediction is a highly case-specific task; a single model is unlikely to generalize to every case.
• It was possible to create a specific prediction model for:
  • Surgical Intensive Care Unit (SICU)
  • Admitted from Emergency
  • Patients diagnosed with respiratory disease
• The novel R library tidymodels enables the use of multiple ML libraries under a unifying collection of packages for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse.
33. Next Steps
• Format a paper and submit it to machine learning conferences and journals.
  • ACM-BCB ’20: 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
• Apply the unsupervised learning techniques used in Part I to the MIMIC-III dataset.
• Create MIMIC-III subsets with lab exams for further investigation.