Investigating the use of novel data mining and machine learning methods in healthcare data sources of multiple natures.

Thesis Defense
Roberto Batista

Content
• Introduction
• Part I - Survey Data
    • Overview
    • Literature review
    • Methods
    • Exploratory Data Analysis
    • Data Transformation
    • Conclusions
• Part II - Electronic Health Record
    • Overview
    • Literature review
    • Methods
    • Exploratory Data Analysis
    • Data Transformation
    • Conclusions

Introduction
• Non-federal hospitals with basic systems: 9.4% in 2008, 84% in 2015.
Source: The Office of the National Coordinator for Health Information Technology (ONC)

Health Data Sources
National Library of Medicine (NIH)
• Survey
• Electronic Medical Record (EMR)
• Claim Data
• Vital Signs Data (e.g., SpO2)
• Electronic Health Record (EHR)
Data types: gender, ethnicity, religion, age, social, finance, exams

Thesis Components
• Part I: Survey
• Part II: Electronic Health Record (EHR)

Part I
Survey Data
How to identify personality trait groups in the Health and Retirement Study survey data?

Health and Retirement Study (HRS)

HRS Overview
[Infographic. Recoverable labels: 3 surveys; 6 aspects; 5 aspects; 5 aspects; 9 aspects; 22,000; >50 yo; 4 derived datasets; 58.54%; Medical Ethics Training]

Literature Review
Gould et al., 2015:
Assesses symptoms of anxiety and depression in veterans and non-veterans using the CES-D and BAI.
Seligman et al., 2018:
Machine learning improves the understanding of social determinants of health.
Hülür et al., 2015:
Investigates the association between subjective memory, subjective age, and personality traits.
Fehrman et al., 2015:
Correlates personality traits with individuals' consumption of eight psychoactive drugs.
Aschwanden et al., 2019:
Associates personality traits with the probability of having a preventive cancer screening.

Five Personality Traits (OCEAN):
• Openness
• Conscientiousness
• Extraversion
• Agreeableness
• Neuroticism
Machine Learning Studies

HRS Datasets Overview
Datasets (each covering 1992-2016): HRS - RAND, HRS Core, HRS Exit, HRS Post-Exit

• Adult ADHD
• Financial
• Material Hardship
• Long-term Care
• Medication Non-Adherence
• Religious

• Proxy informant
• Health
• Family
• Finance

• Proxy informant
• Unresolved financial situations

HRS Datasets of Interest
Waves: 2006, 2008, 2010, 2012
• HRS - RAND
• HRS Core - Section LB - Leave-Behind
  Subjective well-being, lifestyle and experience of stress, quality of social ties, personality traits, work-related beliefs, and self-related beliefs.
• HRS Core - Section D - Cognition
  Immediate and delayed free recall, working memory and mental processing, vocabulary, mental status, and self-rated memory.

Data Transformation
HRS:
• RAND
• Core D
• Core LB

Methods
Cloud of Individuals: stars represent individuals.
Cloud of Variables: points represent variables.

Data table (N individuals; categorical variables A, B, C):
      A    B    C
1     a1   b2   c1
⋮     ⋮    ⋮    ⋮
i     a2   b2   c3
i’    a1   b1   c1
⋮     ⋮    ⋮    ⋮
N     a4   b2   c2

• Unsupervised Machine Learning
    • Multiple Correspondence Analysis (MCA)
    • Clustering
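
MCA does not work on the raw categorical table directly; it decomposes an indicator (one-hot) matrix with one column per variable/category pair. The sketch below shows only that encoding step in plain Python, as an illustration. The function name `indicator_matrix`, the toy rows, and the `variable_category` label format are assumptions made here, not the thesis code.

```python
def indicator_matrix(rows, variables):
    """One-hot encode categorical answers into an indicator matrix.

    rows      -- list of dicts, one per individual (variable -> category)
    variables -- ordered list of the variable names to encode
    Returns (labels, matrix); labels look like "A_a1".
    """
    # One column per observed (variable, category) pair, categories sorted.
    labels = [f"{var}_{cat}"
              for var in variables
              for cat in sorted({row[var] for row in rows})]
    index = {label: j for j, label in enumerate(labels)}

    matrix = []
    for row in rows:
        encoded = [0] * len(labels)
        for var in variables:
            # Each individual gets exactly one 1 per variable.
            encoded[index[f"{var}_{row[var]}"]] = 1
        matrix.append(encoded)
    return labels, matrix


# Toy table shaped like the A/B/C example (values are illustrative).
rows = [
    {"A": "a1", "B": "b2", "C": "c1"},
    {"A": "a2", "B": "b2", "C": "c3"},
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a4", "B": "b2", "C": "c2"},
]
labels, Z = indicator_matrix(rows, ["A", "B", "C"])

# Column masses (share of all 1s per category) are the weights MCA uses.
column_mass = [sum(col) / (len(rows) * 3) for col in zip(*Z)]
```

Every row of `Z` sums to the number of variables, and the column masses sum to 1. Both are quick sanity checks before the matrix is passed to a decomposition routine.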

[Figure: MCA factor map of the personality-adjective response categories, Dim1 (8.1%) vs. Dim2 (4.7%), split into four regions]
• Region 1: mostly "A lot" responses (e.g., curious, creative, caring, warm, outgoing, organized), together with "Not at all" for nervous, worry, and moody.
• Region 2: "Some" and "A little" responses across the adjectives.
• Region 3: mostly "A little" and "Not at all" responses, together with "A lot" for nervous, worry, and moody.
• Region 4: "Not at all" responses only (e.g., curious, warm, helpful, friendly, hardworker, calm).

Conclusions
• Hierarchical clustering, applied to the low-dimensional representation of participants produced by MCA, suggested a reasonable separation of respondent profiles characterized by a personality scale.

• This can be applied to survey design and sampling procedures.

• This can support correlation studies with other physical and mental health indicators.
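
As a rough illustration of the first conclusion, the sketch below runs agglomerative (bottom-up hierarchical) clustering on a handful of two-dimensional coordinates of the kind MCA produces. The slides do not state which linkage was used, so average linkage is an assumption here, as are the function name `agglomerative` and the made-up coordinates.

```python
from itertools import combinations
from math import dist


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Uses average linkage (an assumption; the slides do not specify one).
    Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between members of a and b.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # combinations() yields index pairs with a < b, so deleting b is safe.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical (Dim1, Dim2) coordinates for six respondents.
coords = [(-0.9, 0.1), (-0.8, 0.0), (0.1, -0.2),
          (0.3, -0.3), (1.8, 3.9), (1.9, 4.2)]

three_groups = agglomerative(coords, 3)
two_groups = agglomerative(coords, 2)
```

Cutting the same merge sequence at different depths gives coarser or finer respondent groups, which is why a hierarchical method is convenient when the number of profiles is not known in advance.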

Paper Presented and Published
18th IEEE International Conference on Machine Learning and Applications (ICMLA 2019)
December 16-19, Boca Raton, Florida, USA

Part II
Electronic Health Record
How to predict Intensive Care Unit (ICU) Length of Stay (LOS) using Machine Learning models?

Medical Information Mart for Intensive Care III (MIMIC-III)

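MIMIC-III Overview
• Population: newborns (NB) and patients older than 15
• 53,423 admissions (adult patients)
• 38,597 adult patients; 7,870 neonates
• Median ICU length of stay: 2.1 days
• Median hospital length of stay: 6.9 days
• In-hospital mortality: 11.5%
• 380 laboratory measurements per admission, on average
• Other values shown: 44.1%, 7.76%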

Beth Israel Deaconess Medical Center
CareVue DB
MetaVision DB
MIMIC-III

Literature Review
Azari et al., 2012:
Approached LOS prediction by identifying similar groups. Reached an accuracy of 74.3%.
Van Houdenhoven et al., 2007:
LOS prediction for elective transthoracic esophagectomy with reconstruction for carcinoma, considering the presence of gastroesophageal reflux disease and respiratory minute volume. R² of 45%.
Clark & Ryan, 2002:
Accuracy varied by age group: 69% (the highest) for patients younger than 55, 13% for those aged 55 to 70, and 17% for those older than 70.
Gustafson, 1968:
Uses five different methodologies to predict the LOS of inguinal herniotomy patients.
Afrin et al., 2019:
Predicts LOS using three classifications, focused on the age and death outcome of the patients. Accuracy: 54.8% (RF and LR).

Intensive Care Unit (ICU) Length of Stay (LOS):
• Wait time for ICU admission
• ICU management
• Important predictor for death rate
• ICU cost

Data Access
Data or Specimens Only Research training (CITI Program):
1. Belmont Report and Its Principles (ID 1127)
2. History and Ethics of Human Subjects Research (ID 498)
3. Basic Institutional Review Board (IRB) Regulations and Review Process (ID 2)
4. Records-Based Research (ID 5)
5. Genetic Research in Human Populations (ID 6)
6. Populations in Research Requiring Additional Considerations and/or Protections (ID 16680)
7. Conflicts of Interest in Human Subjects Research (ID 17464)

Exploratory Data Analysis
26 CSV files → SQLite (CSV to SQLite conversion)
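
A conversion of this kind can be sketched with Python's standard `csv` and `sqlite3` modules. This is a minimal illustration, not the tooling used in the thesis: the helper names `load_csv_into_sqlite` and `convert_directory` are hypothetical, every column is stored untyped, and the demo file contains made-up rows.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def load_csv_into_sqlite(conn, csv_path, table_name):
    """Create table_name from the CSV header and insert every data row."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({columns})')
        # The reader is consumed lazily, so large files are not held in memory.
        conn.executemany(
            f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
        )
    conn.commit()


def convert_directory(csv_dir, db_path):
    """Load every *.csv in csv_dir into one database, one table per file."""
    conn = sqlite3.connect(db_path)
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        load_csv_into_sqlite(conn, csv_path, csv_path.stem)
    return conn


# Demo on a throwaway directory holding one tiny, made-up CSV file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "PATIENTS.csv").write_text(
        "ROW_ID,SUBJECT_ID,GENDER\n1,1,F\n2,2,M\n"
    )
    conn = convert_directory(tmp, ":memory:")
    n_rows = conn.execute('SELECT COUNT(*) FROM "PATIENTS"').fetchone()[0]
    genders = [row[0] for row in conn.execute(
        'SELECT GENDER FROM "PATIENTS" ORDER BY ROW_ID')]
    conn.close()
```

Once the files live in a single database, exploratory questions become SQL queries instead of repeated full-file scans.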

Data Transformation
CSV: PATIENTS
1. ROW_ID
2. SUBJECT_ID
3. GENDER
4. DOB
5. DOD
6. DOD_HOSP
7. DOD_SSN
8. EXPIRE_FLAG

CSV: STAYS
1. ROW_ID
2. SUBJECT_ID
3. HADM_ID
4. ICUSTAY_ID
5. DBSOURCE
6. FIRST_CAREUNIT
7. LAST_CAREUNIT
8. FIRST_WARDID
9. LAST_WARDID
10. INTIME
11. OUTTIME
12. LOS

CSV: DIAGNOSIS
1. ROW_ID
2. SUBJECT_ID
3. HADM_ID
4. SEQ_NUM
5. ICD9_CODE

CSV: ADMISSIONS
1. ROW_ID
2. SUBJECT_ID
3. HADM_ID
4. ADMITTIME
5. DISCHTIME
6. DEATHTIME
7. ADMISSION_TYPE
8. ADMISSION_LOCATION
9. DISCHARGE_LOCATION
10. INSURANCE
11. LANGUAGE
12. RELIGION
13. MARITAL_STATUS
14. ETHNICITY
15. EDREGTIME
16. EDOUTTIME
17. DIAGNOSIS
18. HOSPITAL_EXPIRE_FLAG
19. HAS_CHARTEVENTS_DATA
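
The four tables share the keys SUBJECT_ID and HADM_ID, so one plausible way to build a modeling table is a SQL join with one row per ICU stay. The sketch below is an assumption about how such a join could look, not the thesis pipeline. It uses the table names as they appear on the slide, keeps only the primary diagnosis (SEQ_NUM = 1, a choice made here to avoid duplicating stays), recomputes the stay length in days from INTIME and OUTTIME, and runs on invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PATIENTS   (SUBJECT_ID INTEGER, GENDER TEXT);
CREATE TABLE ADMISSIONS (SUBJECT_ID INTEGER, HADM_ID INTEGER,
                         ADMISSION_TYPE TEXT, ETHNICITY TEXT);
CREATE TABLE STAYS      (SUBJECT_ID INTEGER, HADM_ID INTEGER,
                         ICUSTAY_ID INTEGER, FIRST_CAREUNIT TEXT,
                         INTIME TEXT, OUTTIME TEXT);
CREATE TABLE DIAGNOSIS  (SUBJECT_ID INTEGER, HADM_ID INTEGER,
                         SEQ_NUM INTEGER, ICD9_CODE TEXT);

-- Toy rows, invented for illustration only.
INSERT INTO PATIENTS   VALUES (1, 'F'), (2, 'M');
INSERT INTO ADMISSIONS VALUES (1, 10, 'EMERGENCY', 'WHITE'),
                              (2, 20, 'ELECTIVE',  'ASIAN');
INSERT INTO STAYS      VALUES
    (1, 10, 100, 'SICU', '2100-01-01 00:00:00', '2100-01-03 12:00:00'),
    (2, 20, 200, 'MICU', '2100-02-01 06:00:00', '2100-02-02 06:00:00');
INSERT INTO DIAGNOSIS  VALUES (1, 10, 1, '486'), (1, 10, 2, '4019'),
                              (2, 20, 1, '41401');
""")

# One row per ICU stay: demographics, admission context, primary diagnosis,
# and length of stay in fractional days.
QUERY = """
SELECT s.ICUSTAY_ID, p.GENDER, a.ADMISSION_TYPE, a.ETHNICITY,
       s.FIRST_CAREUNIT, d.ICD9_CODE,
       julianday(s.OUTTIME) - julianday(s.INTIME) AS LOS_DAYS
FROM STAYS s
JOIN PATIENTS   p ON p.SUBJECT_ID = s.SUBJECT_ID
JOIN ADMISSIONS a ON a.HADM_ID = s.HADM_ID
JOIN DIAGNOSIS  d ON d.HADM_ID = s.HADM_ID AND d.SEQ_NUM = 1
ORDER BY s.ICUSTAY_ID
"""
rows = conn.execute(QUERY).fetchall()
conn.close()
```

Filtering on FIRST_CAREUNIT, ADMISSION_TYPE, or ICD9_CODE in the same query is then enough to carve out a subset such as a single care unit.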

Data Transformation
[Diagram: data transformation flow with seven numbered steps]

Methods
Tidymodels framework:
• rsample (data sampling)
• recipes (data preprocessing)
• parsnip (machine learning modeling)
• yardstick (performance evaluation)
• Algorithm families: Decision Trees, Random Forest, Boosted Trees, SVM, and Linear Regression
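
The four stages listed above (sampling, preprocessing, modeling, evaluation) form a generic loop that can be mirrored outside R. The sketch below reproduces that loop in plain Python purely for illustration. It does not use tidymodels, and the function names (`split`, `fit_scaler`, `scale`, `fit_linear`, `rmse`) and the synthetic data are assumptions introduced here.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def split(rows, train_fraction=0.75, seed=42):
    """Sampling stage: shuffle reproducibly, then split into train/test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(train):
    """Preprocessing stage: learn centering and scaling from training rows only."""
    xs = [x for x, _ in train]
    return mean(xs), pstdev(xs)


def scale(rows, scaler):
    """Apply the learned centering and scaling to any set of rows."""
    center, spread = scaler
    return [((x - center) / spread, y) for x, y in rows]


def fit_linear(train):
    """Modeling stage: ordinary least squares with a single predictor."""
    x_bar = mean(x for x, _ in train)
    y_bar = mean(y for _, y in train)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in train)
             / sum((x - x_bar) ** 2 for x, _ in train))
    return slope, y_bar - slope * x_bar


def rmse(model, rows):
    """Evaluation stage: root mean squared error on the given rows."""
    slope, intercept = model
    return sqrt(mean((y - (slope * x + intercept)) ** 2 for x, y in rows))


# Synthetic (predictor, outcome) pairs lying exactly on y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]

train, test = split(data)
scaler = fit_scaler(train)
model = fit_linear(scale(train, scaler))
test_rmse = rmse(model, scale(test, scaler))
```

The scaler is fitted on the training rows and only applied to the test rows, which is the same leakage-avoiding separation that a split-then-preprocess workflow is meant to enforce.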

Methods
Predictors
• Ethnicity
• Respiratory diagnosis

Subset
• ICU: SICU
• Admission Type: Urgency

Linear Regression
• Adjusted R²: 63.75%
• RMSE: 9.56

Classifier
• Accuracy: 92.7%
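
For reference, the two regression metrics reported here have the following standard definitions. The symbols are notation introduced for this note rather than taken from the slides: $n$ is the number of observations, $p$ the number of predictors, $y_i$ the observed value, $\hat{y}_i$ the prediction, and $\bar{y}$ the mean of the observed values.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
               {\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\qquad
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

RMSE is expressed in the units of the outcome variable, while adjusted $R^2$ penalizes $R^2$ for each added predictor, so the two figures answer different questions about the same fit.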

Conclusions
• LOS prediction is a very specific, case-oriented task, and it is unlikely that one model can generalize to every case.

• It was possible to create a specific prediction model for:
    • Surgical Intensive Care Unit
    • Admitted from Emergency
    • Patients diagnosed with respiratory disease

• The novel R library tidymodels enables the use of multiple ML libraries under a unifying collection of packages for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the modern data science tools in the tidyverse.

Next Steps
• Format a paper and submit it to machine learning conferences/journals.
    • ACM-BCB ’20: 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

• Apply the unsupervised learning techniques used in Part I to the MIMIC-III dataset.

• Create MIMIC-III subsets with Lab Exams for further investigation.

Thanks to
Icons: http://www.flaticons.com
Friends at
