Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019

Scaling is caring
Building scalable feature engineering pipelines for
machine learning in healthcare
April 3 2019
Amsterdam 2019

Introductions
• Michele Tonutti
• Data Scientist at Pacmed
• Intensive Care Team
• Background in Biomedical Engineering and Robotics

Introductions
• Developing machine-learning-driven decision support tools to make healthcare more personal, personalised and precise.
• Patients only get care that has the highest probability of success for them.
• Focus on oncology, emergency care, chronic diseases, and intensive care.

Pacmed focuses on four applications
Emergency care:  
What is the urgency level of a patient (how quickly should someone see a doctor)?
Intensive Care:  
Predicting risk of ICU and post-ICU complications to support decision-making
Chronic diseases:  
What is the best treatment (combination) for patients with hypertension, diabetes and/or
chronic kidney failure?
Oncology:  
What are the optimal treatments for the individual patient with colon, prostate or breast cancer?

Intensive care is most promising and furthest developed

Pacmed is currently working on four prediction problems in intensive care
• Discharge decision: predicting the readmission and mortality risk of patients on discharge
• Extubation decision: predicting the risk of re-intubation of patients if they are extubated
• Capacity management: predicting the number of full/available beds
• Predicting complications: e.g. predicting kidney function
[Figure: timelines running from t-3 to today, with predictions up to t+7; panels pair vital signs with readmission/mortality, respiratory parameters with re-intubation, patient inflow & outflow with bed capacity, and creatinine with kidney function]

Machine-learning based decision support software

Explainable prediction of eligibility for discharge from the ICU
| Feature | Value | Interpretation of value |
|---|---|---|
| Saturation: max value of the admission | 98% | A max value of 98% is lower than 95% of all discharged patients |
| Serum creatinine: trend in last 24 hours | Increase of 20 ml (from 100 to 120) | The average patient had a stable serum creatinine during the last 24 hours. The increase of +20 is higher than 99% of discharged patients |
| ALAT: variation in values last 24 hours | Variation of 7 ml (between 5 and 12) | The average patient had a variation of ALAT of 2 in the last 24 hours. A variation of 7 is higher than 76% of all patients |
| Urine output: average last 24 hours | 240 ml | An average value of last 24 hours. The average discharged patient has a urine output of 250 |
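The interpretations above compare one patient's feature value with a reference population of discharged patients. A minimal sketch of that comparison follows; the helper name `share_below` and the reference values are made up for illustration and do not come from the talk.

```python
def share_below(reference, value):
    """Fraction of reference values strictly below `value`."""
    if not reference:
        raise ValueError("reference population is empty")
    return sum(v < value for v in reference) / len(reference)


# Hypothetical 24-hour creatinine increases of previously discharged patients.
discharged_increase = [0, 0, 1, 2, 3, 4, 5, 5, 8, 25]
share = share_below(discharged_increase, 20)
print(f"An increase of +20 is higher than {share:.0%} of discharged patients")
```

A strict comparison is used here; whether ties count as "lower" is a design choice the slides do not specify.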

A pipeline for ICUs that works for both development and production
[Diagram: data from Hospital 1, Hospital 2 and Hospital 3 passes through a shared feature engineering step that serves both development and production]

Feature engineering for medical data is an iterative process
Medical knowledge
Feature engineering
Modelling
Validation

The issue of variety in medical data
1. High number of unique parameters
2. Differing feature structure for different problems
3. Different parameter distributions between populations
4. Variability of measurements over time

High number of parameters measured in the ICU
Patient and admission characteristics
• Age, sex
• Height and weight at admission
• Department of origin
• Length of stay
• Number of prior admissions
• Time in the hospital before admission
• CPR code
Clinical observations
• Peripheral temperature
• CAM, DOS, RASS, NAS
• GCS
• Pupil size and reaction
• Cough stimulant
• Urine output
• Number of bronchial toilets
Vital signs & device data: Respiration
• Respiratory rate
• Mechanical ventilation
• Tidal volume
• Expiratory minute volume
• Respiration mode
• PEEP
• Peak pressure
• Supplemental O2
• Fraction of inspired O2
• Type of O2 administration
• Peripheral O2 saturation
Vital signs & device data: Circulation
• Blood pressure (diastolic and systolic, arterial and non-invasive)
• Pulmonary artery pressure (diastolic and systolic)
• CVP
• PCWP wedge
• Heart rate
• Cardiac output
• Tidal volume (inspiratory and expiratory)
• Heart rhythm & ectopic
• Shock index
Lab values: Blood gas analysis
• Base excess
• O2 content in blood
• Arterial O2 saturation
• pH
• Partial pressure (O2 & CO2)
• Actual bicarbonate
Lab values: Haematology
• Hb, Ht
• White blood cell count
• MCH, MCV
• Erythrocytes
• Thrombocytes
• Lymphocytes
• Leucocytes
• Baso, eo and neutro
• Reticulocytes
• PT, APTT
Lab values: Cardiac enzymes
• CK-MB
• Troponin-T
Lab values: Chemistry
• Sodium, potassium
• Chloride
• Calcium, ion. calcium
• Magnesium
• Phosphate
• Creatinine
• CK
• EST and CRP
• Blood glucose
• Blood lactate
• Amylase
• Serum albumin
• BUN_creatinine
• NT-ProBNP
Lab values: Liver tests
• ALAT and ASAT
• GGT, alkaline phosphatase (AF)
• LDH
• Bilirubin
Lab values: Urinalysis
• Sodium, potassium
• Urea
Medication categories
• Alimentary tract and metabolism
• Antibiotics
• Blood and blood-forming organs
• Cardiovascular
• Musculoskeletal system
• Nervous system
• General (tube feeding)
Other
• CVVH
• Lines and drains

Measurements can vary widely between hospitals
[Figure: number of measurements and mean value of activated partial thromboplastin time (aPTT), compared between Hospital 1 and Hospital 2]

Parameters are measured at different time scales, with highly varying
values and measurement frequencies

What do we need?
• A feature engineering pipeline that:
1. is scalable
2. can be used efficiently for both development and production
3. can be used for multiple outcome measures
4. produces features that are interpretable and useful for both machine learning models and doctors

Challenge: how to turn time series into information relevant for a
model (and doctors)?

• Recurrent Neural Networks, e.g. (Phased) LSTMs
• Frequency domain transforms, e.g. Fourier transform
• Embedded representations, e.g. patient2vec

• Scalable?
• Reusable across models?
• Interpretable?

Extracting interpretable aggregated values from vital parameters
[Figure: heart rate (bpm) time series annotated with the extracted aggregates: first, last, minimum, maximum, average, standard deviation, slope, counts, …]
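The aggregates named on this slide can be computed from a single time-indexed signal with pandas and NumPy. The sketch below is an illustration under assumptions, not the pipeline from the talk: the function name `aggregate_signal`, the least-squares definition of the slope, and the sample heart-rate values are all hypothetical.

```python
import numpy as np
import pandas as pd


def aggregate_signal(series: pd.Series) -> dict:
    """Summarise one time-indexed signal with simple, interpretable statistics."""
    series = series.dropna().sort_index()
    if series.empty:
        raise ValueError("signal has no measurements")
    values = series.to_numpy(dtype=float)
    # Elapsed time in hours since the first measurement, used for the slope.
    hours = np.asarray((series.index - series.index[0]).total_seconds()) / 3600.0
    several = len(values) > 1
    return {
        "first": values[0],
        "last": values[-1],
        "minimum": values.min(),
        "maximum": values.max(),
        "average": values.mean(),
        # Sample standard deviation; undefined for a single measurement.
        "standard_deviation": values.std(ddof=1) if several else np.nan,
        # Least-squares slope in units per hour (assumed definition).
        "slope": np.polyfit(hours, values, 1)[0] if several else np.nan,
        "count": len(values),
    }


# Hypothetical hourly heart-rate measurements (bpm).
times = pd.to_datetime(
    ["2019-04-03 00:00", "2019-04-03 01:00", "2019-04-03 02:00", "2019-04-03 03:00"]
)
heart_rate = pd.Series([80.0, 82.0, 84.0, 86.0], index=times)
summary = aggregate_signal(heart_rate)
```

Each returned key maps directly to a quantity a clinician can read, which is the point of preferring these aggregates over learned representations.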

We use these aggregated features to capture short-term effects as well as longer-term trends
[Figure: aggregation windows over the first 24h, first 48h, first 72h, …]
[Figure: aggregation windows over the whole stay, day averages, and the first and last day]
Multiple patients, multiple parameters, continuous time scale

Split - apply - combine
1) Splitting the data into groups based on some criteria.
2) Applying a function to each group independently.
3) Combining the results into a data structure.
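The three steps map onto a pandas `groupby` chain. The sketch below uses a made-up long-format table; the column names `patient_id`, `parameter` and `value` and the `parameter__aggregate` naming of the output columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical long-format measurements: one row per observation.
measurements = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "parameter": ["heart_rate", "heart_rate", "creatinine", "heart_rate", "heart_rate"],
    "value": [80.0, 90.0, 100.0, 60.0, 70.0],
})

features = (
    measurements
    .groupby(["patient_id", "parameter"])["value"]  # 1) split
    .agg(["min", "max", "mean", "count"])           # 2) apply
    .unstack("parameter")                           # 3) combine: one row per patient
)
# Flatten the (aggregate, parameter) column MultiIndex into names
# such as "heart_rate__mean".
features.columns = [f"{param}__{agg}" for agg, param in features.columns]
```

Patients without a given parameter end up with missing values in the corresponding columns, which a downstream model or imputation step has to handle.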

Creating features grouped in custom time windows
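One way to build such windowed features in pandas is to filter on hours since the start of the stay and aggregate per patient. This is a sketch under assumptions: the function `window_features`, the column names, and the use of each patient's first measurement as a stand-in for the admission time are illustrative, not taken from the talk.

```python
import pandas as pd


def window_features(df: pd.DataFrame, windows_h=(24, 48, 72)) -> pd.DataFrame:
    """Mean/min/max of `value` per patient over the first N hours of the stay.

    Assumes columns patient_id, time (datetime64) and value; the first
    measurement of each patient stands in for the admission time.
    """
    start = df.groupby("patient_id")["time"].transform("min")
    hours_in = (df["time"] - start).dt.total_seconds() / 3600.0
    parts = []
    for w in windows_h:
        in_window = df[hours_in < w]
        agg = in_window.groupby("patient_id")["value"].agg(["mean", "min", "max"])
        parts.append(agg.add_suffix(f"__first_{w}h"))
    # One row per patient, one column per (aggregate, window) pair.
    return pd.concat(parts, axis=1)


# Hypothetical measurements at 0h, 30h and 60h (patient 1) and 0h, 10h (patient 2).
t0 = pd.Timestamp("2019-04-03 00:00")
data = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "time": t0 + pd.to_timedelta([0, 30, 60, 0, 10], unit="h"),
    "value": [80.0, 90.0, 100.0, 60.0, 70.0],
})
windowed = window_features(data)
```

Because every window starts at admission, the windows grow over time and later ones include the data of earlier ones.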

Why not stick to Pandas then?
• Interpretable, easy, reliable
• Works very well with datetime
formats
• Most simple aggregations available

• No out-of-the-box parallelisation
• Everything in memory
• Custom aggregations can be
extremely computationally heavy

Heavy computational load for custom functions

Dask: scalable Pandas
• An abstraction over NumPy, pandas and scikit-learn that lets you run their operations in parallel, using multicore processing
• Manipulating large datasets, even when those datasets don't fit in memory
• Distributed computing on large datasets with standard pandas operations like groupby, join, and time series computations
• Scales up to multiple machines auto-magically
• Scales down: low-memory and fast even on local machines

Reminder: our goal of scalability
• Develop and test on any machine
• Re-use the same pipeline for production
• For both large and small datasets

Problems with Dask
• Not all pandas aggregations available 
(e.g. apply custom functions on expanding windows)
• Complex to optimise on each machine
• Need to manually select the number of workers, partitions, etc.
• Performance highly dependent on settings
• Slower for small datasets and certain transformations

TSFRESH
• "Time Series Feature extraction based on scalable hypothesis tests".
• Same split-apply-combine concept, but feature calculations are done on NumPy arrays (vectorized), in parallel

Dealing with time-varying signals
[Diagram: pandas Series → NumPy array → calculate aggregates in parallel (min(), max(), std(), …) → pandas DataFrame]
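The flow on this slide can be imitated in a few lines: convert each group's Series to a NumPy array, compute the aggregates concurrently, and collect the results in a DataFrame. This is a conceptual sketch, not tsfresh's implementation; the names `AGGREGATES`, `_aggregate` and `extract` are hypothetical, and a thread pool is used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# Aggregates applied to plain NumPy arrays (population std, ddof=0).
AGGREGATES = {"min": np.min, "max": np.max, "std": np.std}


def _aggregate(item):
    """Compute all aggregates for one (patient_id, values) chunk."""
    patient_id, values = item
    return patient_id, {name: float(func(values)) for name, func in AGGREGATES.items()}


def extract(df: pd.DataFrame, max_workers: int = 2) -> pd.DataFrame:
    # pandas Series -> NumPy array, one chunk per patient.
    chunks = [(pid, group["value"].to_numpy()) for pid, group in df.groupby("patient_id")]
    # Calculate the aggregates for the chunks concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = dict(pool.map(_aggregate, chunks))
    # Combine into a pandas DataFrame with one row per patient.
    return pd.DataFrame.from_dict(rows, orient="index")


signals = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "value": [80.0, 90.0, 100.0, 60.0, 70.0, 80.0],
})
result = extract(signals)
```

For CPU-bound feature calculators a process pool or a cluster scheduler would be the more realistic choice.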

Huge list of aggregates available out of the box

Result: clean, interpretable dataframe ready for modelling

Scaling up and down
• (Local) multiprocessing
• Cluster with Dask

Dealing with time-varying signals
• Problem: using numpy arrays means losing the datetime dimension
• Solution: custom fork of TSFRESH
• The DatetimeIndex of the input pandas dataframe is used only when
calculating time-dependent aggregations
• Medication data can also be taken into account by exploiting multi-indices
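The idea of touching the DatetimeIndex only for time-dependent aggregations can be sketched as a small dispatcher. Everything here is hypothetical and is not code from the fork: the `needs_index` attribute, the function names, and the per-hour slope definition are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def maximum(values):
    """Time-independent aggregate: works on the bare NumPy array."""
    return float(np.max(values))


def slope_per_hour(values, index):
    """Time-dependent aggregate: needs the timestamps of the measurements."""
    hours = np.asarray((index - index[0]).total_seconds()) / 3600.0
    return float(np.polyfit(hours, values, 1)[0])


# Hypothetical marker telling the dispatcher to pass the DatetimeIndex along.
slope_per_hour.needs_index = True


def compute(series: pd.Series, functions) -> dict:
    values = series.to_numpy(dtype=float)
    out = {}
    for func in functions:
        if getattr(func, "needs_index", False):
            out[func.__name__] = func(values, series.index)
        else:
            out[func.__name__] = func(values)
    return out


# Irregularly sampled signal: measurements at 0h, 1h and 4h.
index = pd.to_datetime(["2019-04-03 00:00", "2019-04-03 01:00", "2019-04-03 04:00"])
signal = pd.Series([80.0, 82.0, 88.0], index=index)
aggregates = compute(signal, [maximum, slope_per_hour])
```

With irregular sampling, a slope computed against the row position would differ from the slope against real time, which is why the index matters for this kind of aggregate.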

Dealing with medications
Aggregates:
- Total amount
- Time since last dose
- Time under treatment
- Time without treatment
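The four aggregates can be sketched for a single patient and medication as below. The column names, the reference time `now`, and the exact definitions (for example, measuring time since the last dose from the end of the last administration) are assumptions; the sketch also uses a flat table rather than the multi-indices mentioned in the talk.

```python
import pandas as pd


def _hours(delta: pd.Timedelta) -> float:
    return delta.total_seconds() / 3600.0


def medication_aggregates(doses: pd.DataFrame, now: pd.Timestamp) -> dict:
    """Aggregates for one patient and one medication up to time `now`.

    Assumes columns start, stop (datetime64) and dose, with
    non-overlapping administrations that all end before `now`.
    """
    under_treatment = _hours((doses["stop"] - doses["start"]).sum())
    # Observation period assumed to run from the first administration to `now`.
    observed = _hours(now - doses["start"].min())
    return {
        "total_amount": float(doses["dose"].sum()),
        "hours_since_last_dose": _hours(now - doses["stop"].max()),
        "hours_under_treatment": under_treatment,
        "hours_without_treatment": observed - under_treatment,
    }


# Hypothetical administrations: 00:00-02:00 (dose 5) and 06:00-07:00 (dose 10).
doses = pd.DataFrame({
    "start": pd.to_datetime(["2019-04-03 00:00", "2019-04-03 06:00"]),
    "stop": pd.to_datetime(["2019-04-03 02:00", "2019-04-03 07:00"]),
    "dose": [5.0, 10.0],
})
med = medication_aggregates(doses, pd.Timestamp("2019-04-03 12:00"))
```

Overlapping administrations would need their intervals merged before summing durations.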

Summary
• Creating features for medical data entails dealing with variety and
variability
• Quick processing and interpretable features are top priorities
• No single tool offers a complete solution

Summary
• Pandas works well for quick processing of relatively small datasets
• Split-apply-combine
• Parallelizing (e.g. through Dask) allows quick computation of aggregates
both locally and distributed
• Vectorizing the split-apply-combine approach (e.g. with TSFRESH) speeds
up computation both for small and large datasets.
• Native support for Dask and custom distributors enables scaling

Conclusions
• Approach not limited to Python or specific packages
• Can be extended to any application that involves time series
• Scaling horizontally: we adapted the ICU pipeline for various other
projects (e.g. treatment decision based on patients’ clinical history)
• No need to re-invent the wheel every time

Key takeaway
[Image with the labels “FEATURE ENGINEERING”, PANDAS and DATA SCIENTIST]

Questions or feedback?
Michele Tonutti
michele.tonutti@pacmed.nl
