A key challenge we face at Pacmed is quickly calibrating and deploying our tools for clinical decision support in different hospitals, where data formats may vary greatly. Using Intensive Care Units as a case study, I’ll delve into our scalable Python pipeline, which leverages Pandas’ split-apply-combine approach to perform complex feature engineering and automatic quality checks on large time-varying data, e.g. vital signs. I’ll show how we use the resulting flexible and interpretable dataframes to quickly (re)train our models to predict mortality, discharge, and medical complications.
Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019
1. Scaling is caring
Building scalable feature engineering pipelines for
machine learning in healthcare
April 3 2019
Amsterdam 2019
2. Introductions
• Michele Tonutti
• Data Scientist at Pacmed
• Intensive Care Team
• Background in Biomedical Engineering and Robotics
3. Introductions
• Developing machine-learning-driven decision support tools to make healthcare more personal, personalised and precise.
• Patients only get care that has the highest probability of success for them.
• Focus on oncology, emergency care, chronic diseases, and intensive care.
4. Pacmed focuses on four applications
Emergency care:
What is the urgency level of a patient (how quickly should someone see a doctor)?
Intensive Care:
Predicting risk of ICU and post-ICU complications to support decision-making
Chronic diseases:
What is the best treatment (combination) for patients with hypertension, diabetes and/or chronic kidney failure?
Oncology:
What are the optimal treatments for the individual patient with colon, prostate or breast cancer?
5. Intensive care is most promising and furthest developed
7. Pacmed is currently working on four prediction problems in the intensive care unit
• Discharge decision: predicting the readmission and mortality risk of patients on discharge. Input: vital signs (t-3 to today). Target: readmission/mortality (t+7).
• Extubation decision: predicting the risk of re-intubation of patients if they are extubated. Input: respiratory parameters (t-3 to today). Target: re-intubation (t+2).
• Capacity management: predicting the number of full/available beds. Input: patient inflow & outflow (t-3 to today). Target: bed capacity (t+1, t+2).
• Predicting complications: e.g. predicting kidney function. Input: creatinine (t-3 to today). Target: kidney function (t+1).
10. Explainable prediction of eligibility for discharge from the ICU
Feature | Value | Interpretation of value
SATURATION (max value of the admission) | 98% | A max value of 98% is lower than 95% of all discharged patients
SERUM CREATININE (trend in last 24 hours) | Increase of 20 ml, from 100 to 120 | The average patient had a stable serum creatinine during the last 24 hours. The increase of +20 is higher than 99% of discharged patients
ALAT (variation in values last 24 hours) | Variation of 7 ml, between 5 and 12 | The average patient had a variation of ALAT of 2 in the last 24 hours. A variation of 7 is higher than 76% of all patients.
URINE OUTPUT (average last 24 hours) | 240 ml | An average value of last 24 hours. The average discharged patient has a urine output of 250.
11. A pipeline for ICUs that works for both development and production
Hospital 1
Hospital 2
Hospital 3
14. Feature engineering for medical data is an iterative process
Medical knowledge
Feature engineering
Modelling
Validation
16. The issue of variety in medical data
1. High number of unique parameters
2. Differing feature structure for different problems
3. Different parameter distributions between populations
4. Variability of measurements over time
17. High number of parameters measured in the ICU
Patient and admission characteristics
• Age, sex
• Height and weight at admission
• Department of origin
• Length of stay
• Number of prior admissions
• Time in the hospital before admission
• CPR code
Clinical observations
• CAM, DOS, RASS, NAS
• GCS
• Pupil size and reaction
• Cough stimulant
• Urine output
• Number of bronchial toilets
Vital signs & device data
Respiration
• Respiratory rate
• Mechanical ventilation
• Tidal volume
• Expiratory minute volume
• Respiration mode
• PEEP
• Peak pressure
• Supplemental O2
• Fraction of inspired O2
• Type of O2 administration
• Peripheral O2 saturation
Circulation
• Blood pressure (diastolic and systolic, arterial and non-invasive)
• Pulmonary artery pressure (diastolic and systolic)
• CVP
• PCWP wedge
• Heart rate
• Cardiac output
• Tidal volume (inspiratory and expiratory)
• Heart rhythm & ectopic
• Shock index
• Temperature peripheral
Lab values
Blood gas analysis
• Base excess
• O2 content in blood
• Arterial O2 saturation
• pH
• Partial pressure (O2 & CO2)
• Actual bicarbonate
Haematology
• Hb, Ht
• White blood cell count
• MCH, MCV
• Erythrocytes
• Thrombocytes
• Lymphocytes
• Leucocytes
• Baso, eo and neutro
• Reticulocytes
• PT, APTT
Cardiac enzymes
• CK-MB
• Troponin-T
Chemistry
• Sodium, potassium
• Chloride
• Calcium, ionised calcium
• Magnesium
• Phosphate
• Creatinine
• CK
• EST and CRP
• Blood glucose
• Blood lactate
• Amylase
• Serum albumin
• BUN_creatinine
• NT-ProBNP
Liver tests
• ALAT and ASAT
• GGT, AF
• LDH
• Bilirubin
Urinalysis
• Sodium, potassium
• Urea
Medication categories
• Alimentary tract and metabolism
• Antibiotics
• Blood and blood-forming organs
• Cardiovascular
• Musculoskeletal system
• Nervous system
• General (tube feeding)
Other
• CVVH
• Lines and drains
18. Measurements can vary widely between hospitals
[Figure: number of measurements and mean value of activated partial thromboplastin time (aPTT), compared between Hospital 1 and Hospital 2]
19. Parameters are measured at different time scales, with highly varying
values and measurement frequencies
20. What do we need?
• A feature engineering pipeline that:
1. is scalable
2. can be used efficiently for both development and production
3. can be used for multiple outcome measures
4. produces features that are interpretable and useful for both machine
learning models and doctors
23. Challenge: how to turn time series into information relevant for a
model (and doctors)?
๏ Recurrent Neural Networks
e.g. (Phased) LSTMs
๏ Frequency domain transforms
e.g. Fourier transform
๏ Embedded representations
e.g. patient2vec
• Scalable?
• Reusable across models?
• Interpretable?
25. Extracting interpretable aggregated values from vital parameters
[Figure: a heart rate (bpm) time series annotated with its aggregated values: first, last, minimum, maximum, average, standard deviation, slope, counts, {…}]
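A minimal sketch of how the aggregated values named on this slide could be computed for a single signal with pandas. The function name `aggregate_signal`, the feature names, and the heart rate numbers are made up for illustration; this is not the talk's actual pipeline code. The slope is assumed here to be a least-squares fit against elapsed hours.

```python
import numpy as np
import pandas as pd


def aggregate_signal(series: pd.Series) -> pd.Series:
    """Summarise one non-empty, time-indexed signal with simple statistics."""
    series = series.dropna().sort_index()
    # Hours elapsed since the first measurement, so the slope is in units per hour.
    elapsed = series.index - series.index[0]
    hours = np.asarray(elapsed.total_seconds()) / 3600.0
    # Least-squares slope; undefined with fewer than two measurements.
    if len(series) > 1:
        slope = np.polyfit(hours, series.to_numpy(), deg=1)[0]
    else:
        slope = np.nan
    return pd.Series(
        {
            "first": series.iloc[0],
            "last": series.iloc[-1],
            "minimum": series.min(),
            "maximum": series.max(),
            "average": series.mean(),
            "standard_deviation": series.std(),
            "slope_per_hour": slope,
            "count": series.count(),
        }
    )


# Made-up heart rate measurements (bpm) taken at irregular times.
heart_rate = pd.Series(
    [80.0, 84.0, 90.0, 100.0],
    index=pd.to_datetime(
        ["2019-04-03 08:00", "2019-04-03 09:00", "2019-04-03 10:30", "2019-04-03 13:00"]
    ),
    name="heart_rate",
)
features = aggregate_signal(heart_rate)
```

Each resulting value maps back to something a clinician can read directly off the chart, which is the point of preferring these aggregates over learned representations.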
27. We use these aggregated features to capture short-term effects as well as longer-term trends
A set of aggregated features {…} is computed over each of three time windows:
1. Whole stay
2. Day averages
3. First and last day
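The three windows (whole stay, day averages, first and last day) could be sketched as follows. This is an illustration under assumptions: the function `window_features`, the feature naming scheme, the choice of min/max/mean, and the definition of "first/last day" as the first and last 24 hours of the signal are all hypothetical, and the data is made up.

```python
import pandas as pd

# Made-up hourly heart rate (bpm) for one three-day admission.
index = pd.date_range("2019-04-01 00:00", periods=72, freq=pd.Timedelta(hours=1))
values = [70.0] * 24 + [80.0] * 24 + [90.0] * 24
heart_rate = pd.Series(values, index=index, name="heart_rate")


def window_features(series: pd.Series) -> pd.Series:
    """Aggregate one signal over the whole stay, the first/last day, and per day."""
    start, end = series.index.min(), series.index.max()
    one_day = pd.Timedelta(hours=24)
    windows = {
        "whole_stay": series,
        "first_day": series[series.index < start + one_day],
        "last_day": series[series.index > end - one_day],
    }
    features = {}
    for window_name, window in windows.items():
        for agg in ("min", "max", "mean"):
            features[f"{series.name}_{window_name}_{agg}"] = window.agg(agg)
    # Day averages: one mean per calendar day of the stay.
    day_means = series.resample("D").mean()
    for day_number, value in enumerate(day_means, start=1):
        features[f"{series.name}_day{day_number}_mean"] = value
    return pd.Series(features)


features = window_features(heart_rate)
```

The whole-stay and per-day features describe longer-term trends, while the last-day features describe the patient's most recent state.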
30. Split - apply - combine
1) Splitting the data into groups based on some criteria.
2) Applying a function to each group independently.
3) Combining the results into a data structure.
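The three steps above can be sketched with a pandas `groupby`. The column names (`admission_id`, `parameter`, `value`) and the numbers are assumptions for illustration, not the schema used in the talk.

```python
import pandas as pd

# Made-up long-format measurements: one row per (admission, parameter, value).
measurements = pd.DataFrame(
    {
        "admission_id": [1, 1, 1, 1, 2, 2, 2],
        "parameter": [
            "heart_rate", "heart_rate", "creatinine", "creatinine",
            "heart_rate", "heart_rate", "heart_rate",
        ],
        "value": [80.0, 90.0, 100.0, 120.0, 60.0, 70.0, 65.0],
    }
)

# 1) Split by admission and parameter, 2) apply aggregations to each group.
aggregated = measurements.groupby(["admission_id", "parameter"])["value"].agg(
    ["min", "max", "mean", "count"]
)

# 3) Combine into one row per admission, one column per parameter/aggregate pair.
features = aggregated.unstack("parameter")
features.columns = [f"{parameter}_{agg}" for agg, parameter in features.columns]
```

Admission 2 has no creatinine measurements in this toy data, so its creatinine columns are NaN; a real pipeline would need an explicit policy for such gaps.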
35. Why not stick to Pandas then?
• Interpretable, easy, reliable
• Works very well with datetime formats
• Most simple aggregations available
• No out-of-the-box parallelisation
• Everything in memory
• Custom aggregations can be extremely computationally heavy
37. Dask: scalable Pandas
• Abstraction over numpy, pandas and scikit-learn allowing you to run
operations on them in parallel, using multicore processing
40. Dask: scalable Pandas
• Manipulating large datasets, even when those datasets don’t fit in memory
• Distributed computing on large datasets with standard Pandas operations
like groupby, join, and time series computations
• Scales up to multiple machines auto-magically.
Scales down: low-memory and fast even on local machines.
41. Reminder: our goal of scalability
๏ Develop and test on any machine
๏ Re-use the same pipeline for production
๏ For both large and small datasets
42. Problems with Dask
• Not all pandas aggregations available
(e.g. apply custom functions on expanding windows)
• Complex to optimise on each machine
• Need to manually select the number of workers, partitions, etc.
• Performance highly dependent on settings
• Slower for small datasets and certain transformations
50. Scaling up and down
• (Local) multiprocessing
• Cluster with Dask
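As a rough illustration of the local option, the sketch below splits admissions into chunks, aggregates each chunk with plain pandas in a worker, and concatenates the results. The function names, the column names, and the `executor_cls` parameter are hypothetical and not taken from the talk; the Dask cluster option is not shown.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import numpy as np
import pandas as pd


def aggregate_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Plain pandas split-apply-combine on a subset of admissions."""
    return chunk.groupby("admission_id")["value"].agg(["min", "max", "mean"])


def aggregate_in_parallel(measurements, n_workers=2, executor_cls=ProcessPoolExecutor):
    """Aggregate per admission, spreading whole admissions over workers."""
    # Assign whole admissions to chunks so that no group is split across workers.
    admission_ids = measurements["admission_id"].unique()
    id_chunks = np.array_split(admission_ids, n_workers)
    chunks = [
        measurements[measurements["admission_id"].isin(ids)]
        for ids in id_chunks
        if len(ids) > 0
    ]
    with executor_cls(max_workers=n_workers) as executor:
        results = list(executor.map(aggregate_chunk, chunks))
    return pd.concat(results).sort_index()


# Made-up measurements for three admissions.
measurements = pd.DataFrame(
    {
        "admission_id": [1, 1, 2, 2, 3, 3],
        "value": [80.0, 90.0, 60.0, 70.0, 100.0, 110.0],
    }
)

# Threads are used here only to keep the demo safe to run anywhere. For real
# multiprocessing, use the default ProcessPoolExecutor and call the function
# from under an `if __name__ == "__main__":` guard.
features = aggregate_in_parallel(measurements, n_workers=2, executor_cls=ThreadPoolExecutor)
```

Because each admission is processed independently, the same chunking idea carries over from a laptop to a cluster; only the executor changes.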
51. Dealing with time-varying signals
• Problem: using numpy arrays means losing the datetime dimension
• Solution: custom fork of TSFRESH
• The DatetimeIndex of the input pandas dataframe is used only when
calculating time-dependent aggregations
• Medication data can also be taken into account by exploiting multi-indices
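To illustrate the distinction between time-independent and time-dependent aggregations, here is a hedged sketch. It does not use the tsfresh API or the custom fork mentioned above; the two functions and the data are hypothetical. The plain mean only needs the values as a NumPy array, while the time-weighted mean also needs the timestamps from the DatetimeIndex.

```python
import numpy as np
import pandas as pd


def plain_mean(values: np.ndarray) -> float:
    """Time-independent aggregation: only the values are needed."""
    return float(np.mean(values))


def time_weighted_mean(series: pd.Series) -> float:
    """Time-dependent aggregation: weight each value by how long it was held."""
    series = series.sort_index()
    # Seconds between consecutive measurements, taken from the DatetimeIndex.
    durations = np.diff(series.index.values) / np.timedelta64(1, "s")
    # The last value has no following interval, so it gets no weight.
    values = series.to_numpy()[:-1]
    return float(np.sum(values * durations) / np.sum(durations))


# Made-up, irregularly sampled heart rate (bpm).
heart_rate = pd.Series(
    [60.0, 120.0, 90.0],
    index=pd.to_datetime(["2019-04-03 08:00", "2019-04-03 09:00", "2019-04-03 12:00"]),
)

unweighted = plain_mean(heart_rate.to_numpy())  # ignores the timestamps
weighted = time_weighted_mean(heart_rate)  # uses the timestamps
```

With irregular sampling the two results differ, which is why dropping the datetime dimension is a problem for some features but not for others.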
53. Summary
• Creating features for medical data entails dealing with variety and
variability
• Quick processing and interpretable features are top priorities
• No single tool offers a unique solution
54. Summary
• Pandas works well for quick processing of relatively small datasets
• Split-apply-combine
• Parallelizing (e.g. through Dask) allows quick computation of aggregates
both locally and distributed
• Vectorizing the split-apply-combine approach (e.g. with TSFRESH) speeds
up computation both for small and large datasets.
• Native support for Dask and custom distributors enables scaling
55. Conclusions
• Approach not limited to Python or specific packages
• Can be extended to any application that involves time series
• Scaling horizontally: we adapted the ICU pipeline for various other
projects (e.g. treatment decision based on patients’ clinical history)
• No need to re-invent the wheel every time