Enhancing Diagnostics for Invasive Aspergillosis using Machine Learning

HISA Big Data 2014 – April 3rd 2014 (#BD14)
Simone Romano
simone.romano@unimelb.edu.au
@ialuronico
James Bailey (1)
Lawrence Cavedon (1,2,3)
Orla Morrissey (4,5)
Monica Slavin (6,7)
Karin Verspoor (1,2)
(1) The University of Melbourne, Dept. of Computing and Information Systems
(2) NICTA (National ICT Aust.) VRL
(3) School of Computer Science and IT, RMIT University
(4) Alfred Health
(5) Monash University
(6) Peter MacCallum Cancer Centre
(7) Melbourne Health

Introduction
Invasive Aspergillosis
Challenging Big Data Task
Results
Diagnostic Model
Description
Machine Learning for Diagnosis
Diagnosis of Invasive Aspergillosis
Conclusions
Summary
Future Work

Invasive Aspergillosis (IA)
Serious fungal infection and major cause of
mortality in patients undergoing allogeneic
stem cell transplantation or chemotherapy
for acute leukaemia.
Figure: Pulmonary IA (source: http://en.wikipedia.org/wiki/Aspergillosis).
Facts
34–43% mortality rate;
culture methods have low sensitivity: only 40–50% of IA cases are identified;
each IA patient incurs +7 days of hospital stay and +$30,957 in costs.

Diagnosis and Treatment
Cases are classified as Proven IA, Probable IA, or Possible IA.
Current criteria for diagnosing IA are:
1. microbiology, risk factors, and CT scan findings;
2. improved biomarkers such as Aspergillus PCR and Galactomannan
(GM), tested twice a week.
positive biopsy OR (positive CT scan AND single positive PCR/GM)
⇒ Proven IA
≥ 2 consecutive positive PCR/GM in a 2-week time frame
⇒ Probable IA
Problem
A single positive biomarker might be a false positive
⇒ Unnecessary harmful treatment.
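The two labelling rules on this slide can be written down as a small function. This is only a sketch of the rules as stated here, not a clinical implementation: the function name, argument names, and the fallback label are hypothetical, and the slide gives no rule for Possible IA, so that label is not produced.

```python
from typing import Sequence


def classify_case(positive_biopsy: bool,
                  positive_ct: bool,
                  biomarker_results: Sequence[bool]) -> str:
    """Apply the two rules from the slide to one patient.

    biomarker_results holds the PCR/GM results (True = positive) in
    chronological order for one 2-week time frame (assumed input format).
    """
    any_positive = any(biomarker_results)
    # Rule 1: positive biopsy OR (positive CT scan AND a single positive PCR/GM).
    if positive_biopsy or (positive_ct and any_positive):
        return "Proven IA"
    # Rule 2: at least two consecutive positive PCR/GM results.
    if any(a and b for a, b in zip(biomarker_results, biomarker_results[1:])):
        return "Probable IA"
    # Hypothetical fallback: the slide does not define this case.
    return "No Proven/Probable IA label"
```

Note that a single positive biomarker without a positive CT scan or biopsy falls through to the fallback, which is exactly the ambiguous situation the Problem above describes.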

Big Data task
In a randomised controlled trial comparing two different strategies for
diagnosing IA, a large amount of data was collected from 240 patients
between Sept. 2005 and Nov. 2009 at six Australian centres.
Objective: leverage this data to produce more accurate predictions of
IA with machine learning techniques.
Are we really dealing with Big Data?

All patients were tracked for 26 weeks, providing rich longitudinal data on
daily and weekly tests for each patient.
240 × 26 × 7 = 43,680 records.
Bed-side interpretation is a challenging task!
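A quick check of the record count, assuming one record per patient per day over the full 26 weeks (the constant names are illustrative):

```python
PATIENTS = 240
WEEKS_TRACKED = 26
DAYS_PER_WEEK = 7

# One daily record per patient over the whole follow-up period.
patient_days = PATIENTS * WEEKS_TRACKED * DAYS_PER_WEEK
print(patient_days)  # 43680
```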


Diagnostic Model
Model
Our training set is a collection of 358 single positive biomarker tests that
precede the earliest label of IA.
Figure: timeline from the start of transplant/chemotherapy through the 1st to 5th months, marking positive biomarkers and infection.
Just 29 of the positive biomarkers were associated with a Proven IA or
Probable IA label within a week (329 false positives);
We built a model that outputs the probability of infection within a
week;
Validated by a patient-level cross-validation framework.
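Patient-level cross-validation means that all tests from one patient land on the same side of every train/test split, so the model is never evaluated on a patient it has already seen. A minimal sketch of such a splitter follows; the function name, the number of folds, and the seed are assumptions, since the slides do not state how the folds were built.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def patient_level_folds(patient_ids: Sequence[Hashable],
                        n_folds: int = 5,
                        seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with no patient on both sides.

    patient_ids[i] is the patient that test record i belongs to.
    """
    by_patient: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients (not records) so folds are assigned per patient.
    patients = sorted(by_patient, key=str)
    random.Random(seed).shuffle(patients)

    for fold in range(n_folds):
        held_out = set(patients[fold::n_folds])
        train = [i for i, pid in enumerate(patient_ids) if pid not in held_out]
        test = [i for i, pid in enumerate(patient_ids) if pid in held_out]
        yield train, test
```

Splitting by record instead of by patient would let correlated tests from the same patient leak between training and test sets and inflate the measured performance.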

Diagnostic Model
Figure: ROC curve (x-axis: 1 − TNR, y-axis: TPR, both from 0.0 to 1.0).
AUC = 0.63
The AUC is not too good, but the model is good at classifying negatives!
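For reference, the AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it that way from labels and model scores; the function name and the toy data in the usage line are illustrative, not the study's data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a single AUC value says little about how well the lowest scores alone separate out negatives.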

Diagnostic Model
Result
Setting a low threshold on the model's output probability to achieve a high
NPV (100%), we were able to identify 95 tests (26.5%) that do not
lead to an IA infection within a week (TNR = 28.9%).
⇒ Doctors can avoid starting treatment in 26.5% of cases!
avoid over-treatment;
reduce drug toxicity;
reduce antifungal drug costs
(e.g. Amphotericin B: $8,260 per patient per week).
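One way to obtain such a rule-out threshold is to take the lowest score given to any true IA case and treat every test scoring strictly below it as negative. The sketch below does this and then checks that the percentages on this slide follow from the counts reported earlier (358 tests, 29 associated with IA, 95 ruled out). The function and constant names are hypothetical, and in practice the threshold would have to be chosen inside the cross-validation loop rather than on the evaluation data.

```python
from typing import Sequence, Tuple


def rule_out_threshold(labels: Sequence[int],
                       scores: Sequence[float]) -> Tuple[float, int, float]:
    """Return (threshold, n_ruled_out, tnr) for the largest threshold with NPV = 100%.

    A test is ruled out (predicted negative) when its score is strictly
    below the threshold, so no positive case is ever ruled out.
    """
    threshold = min(s for y, s in zip(labels, scores) if y == 1)
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    ruled_out = sum(s < threshold for s in negatives)
    return threshold, ruled_out, ruled_out / len(negatives)


# Consistency check of the figures reported on the slides.
TOTAL_TESTS, IA_ASSOCIATED, RULED_OUT = 358, 29, 95
false_positives = TOTAL_TESTS - IA_ASSOCIATED   # 329
share_of_tests = RULED_OUT / TOTAL_TESTS        # about 0.265 (26.5% of tests)
tnr = RULED_OUT / false_positives               # about 0.289 (TNR = 28.9%)
```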


Classiﬁcation Models
Logistic regression;
Decision trees;
Random forest
Figure: random forest. The training set is resampled five times, a random tree is grown on each resample, and the trees are combined by voting.
Random forest because:
It has the capability to work with heterogeneous features
(categorical/continuous);
It can work with many features.
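The resample-then-vote idea behind a random forest can be illustrated with a deliberately simplified sketch. Each "tree" here is only a one-split stump on a randomly chosen feature, which is a stand-in for a full random tree and not the model used in this work; all names and default values are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Row = Sequence[float]
Classifier = Callable[[Row], int]


def fit_random_stump(rows: Sequence[Row], labels: Sequence[int],
                     rng: random.Random) -> Classifier:
    """Fit a one-split tree on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None  # (errors, threshold, label_if_below, label_if_above)
    for threshold in sorted({row[feature] for row in rows}):
        for below, above in ((0, 1), (1, 0)):
            errors = sum((below if row[feature] <= threshold else above) != y
                         for row, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return lambda row: below if row[feature] <= threshold else above


def fit_forest(rows: Sequence[Row], labels: Sequence[int],
               n_trees: int = 25, seed: int = 0) -> List[Classifier]:
    """Grow each stump on its own bootstrap resample of the training set."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap resampling: draw len(rows) examples with replacement.
        sample = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_random_stump([rows[i] for i in sample],
                                       [labels[i] for i in sample], rng))
    return forest


def predict_proba(forest: Sequence[Classifier], row: Row) -> float:
    """Voting: the fraction of trees that predict class 1."""
    return sum(tree(row) for tree in forest) / len(forest)
```

The vote fraction returned by `predict_proba` plays the role of the model's output probability, which is the quantity a threshold can later be applied to.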

Features to use
Known at baseline: gender, age, BMI, smoking status, etc.
Daily tested: neutrophil count, body temperature, amount of
administered steroids, haemoglobin, platelets, white cell count, urea,
creatinine, ALT, AST, GGT, bilirubin, LDH, etc.

Very heterogeneous features!!!

Heterogeneous Features
Features constant along the treatment: Age, Gender, etc.
Features that varied over time: neutrophil count, temperature,
corticosteroid doses, etc.
When we have a positive biomarker test, we can use information from the recent
past to predict IA. We define the recent past as the values in the 3-week
window prior to a single positive test result.
Figure: temperature (y-axis, 36.5 to 38.5) against date (x-axis, May to Jul), with the window marked.
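Extracting the recent past for one time-varying parameter could look like the sketch below. It assumes daily measurements keyed by date and assumes that the day of the positive test itself is excluded from the window; the slides do not specify either detail, and the function name is hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

WINDOW_DAYS = 21  # the 3-week window


def recent_past(measurements: Dict[date, float],
                positive_test_day: date,
                window_days: int = WINDOW_DAYS) -> List[Tuple[date, float]]:
    """Return (day, value) pairs from the window before the test, oldest first."""
    start = positive_test_day - timedelta(days=window_days)
    return sorted((d, v) for d, v in measurements.items()
                  if start <= d < positive_test_day)
```

The same function can be applied to each daily-tested parameter in turn, giving one window of values per parameter for every single positive test.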

Features that varied over time
Duration features: we count the number of days on which the value of each
parameter lay within a particular range. For example, we divide the
temperature measurements into the intervals [36,37],
(37,38], (38,39], (39,40], and greater than 40 (>40) degrees
Celsius, and count the number of days on which the temperature fell in
each interval;
Trajectories: we select two days in the 3-week window preceding a
positive test and compute the mean value, the standard
deviation, and the relative difference between those values. We
do this for all possible intervals in the window.
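Both feature types can be sketched for a single parameter as follows. The temperature bins are the ones listed above. For trajectories the wording leaves some room for interpretation, so this sketch assumes that the mean and standard deviation are taken over all values between the two selected days and that the relative difference compares the two endpoint values; function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple

# Bins from the slide: [36,37], (37,38], (38,39], (39,40], >40 degrees Celsius.
TEMPERATURE_BINS = [("[36,37]", 36.0, 37.0), ("(37,38]", 37.0, 38.0),
                    ("(38,39]", 38.0, 39.0), ("(39,40]", 39.0, 40.0),
                    (">40", 40.0, float("inf"))]


def duration_features(values: Sequence[float]) -> Dict[str, int]:
    """Count the days on which the value fell in each bin."""
    counts = {name: 0 for name, _, _ in TEMPERATURE_BINS}
    for v in values:
        for i, (name, low, high) in enumerate(TEMPERATURE_BINS):
            # Only the first bin is closed on the left, as on the slide.
            if (low <= v if i == 0 else low < v) and v <= high:
                counts[name] += 1
                break
    return counts  # values below 36 are not counted: no bin is defined for them


def trajectory_features(values: Sequence[float]
                        ) -> List[Tuple[int, int, float, float, float]]:
    """For every pair of days i < j: (i, j, mean, std, relative difference)."""
    features = []
    for i, j in combinations(range(len(values)), 2):
        segment = values[i:j + 1]
        # Assumption: relative change from day i to day j; undefined if values[i] is 0.
        rel = (values[j] - values[i]) / values[i] if values[i] != 0 else float("nan")
        features.append((i, j, mean(segment), pstdev(segment), rel))
    return features
```

With one value per day, a 21-day window yields 210 day pairs per parameter, which is one reason a model that copes with many features is useful here.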


Summary
Target: enhance biomarker-based diagnostics for Invasive
Aspergillosis;
Method: random forest on heterogeneous features, with
duration features and trajectory features;
Validation: patient-level cross-validation;
Results: setting a low threshold on the output probability gives NPV =
100% and TNR = 28.9%. Antifungal therapy can be safely avoided
in 26.5% of cases, saving around $8K per patient
per week.

Future Work
make the model more accurate in predicting when a positive test is
associated with an immediate infection, so that antifungal
treatment can be triggered earlier;
search for alternative diagnoses when the outcomes are equally
probable according to the model;
make the model output more interpretable to clinical practitioners,
e.g. by identifying the trajectories in the data that generate a low
or high probability of IA.

Thank you.
Questions?
