Improving Clinical Machine Learning Evaluation

Improving
Clinical
Machine
Learning
Evaluation
DecisionCurve Analysis
andrelatedapproaches
withconsiderationsfor
uncertaintycalibration
andhuman-in-the loop
supervision
Petteri Teikari, PhD
PhD in Neuroscience
M.Sc Electrical Engineering
https://www.linkedin.com/in/petteriteikari/
Version “September 15, 2019“

The Motivation
forthe Slideshow
Whyare westill using
ROCandrelated
metricsforclinical
evaluationof machine
learningmodels?

https://twitter.com/alan_karthi/status/1159733666320408576
http://doi.org/10.1001/jama.2019.10306
Machine learning can identify patterns in the expanding,
heterogeneous data sets to create models that accurately
classify a patient’s diagnosis or predict what a patient may
experience in the future. However, realizing the potential benefit
of machine learning for patients in the form of better care
requires rethinking how model performance during
machine learning is assessed. A framework for rigorously
evaluating the performance of a model in the context of the
subsequent actions it triggers is necessary to identify models
that areclinically useful.

i.e. Doesyourmodel generalize well when deployed toreal-worldpopulation
Did the authors just “tortured data enough” to get the results that please journal editor/reviewers? Is it really so that your
test set performance predicts well clinical usability of your model?
http://phdcomics.com/comics/archive.php?comicid=405
https://doi.org/10.1016/j.ophtha.2017.08.046
Researchers developing machine learning algorith1ms
generally split the available data into separate sets to be
used for training, testing, and validation. However, all
data may be drawn from a single site or a narrow range
of sites, and the results may not generalize to other
populations or to all imaging devices and protocols.
We argue that before clinical deployment is
considered, an independent study, or one with
independent oversight (like clinical drug trials), should
be required of sufficient size to be confident about
detecting clinically important but less common events.
Also, should learning algorithms be allowed to continue
to learn after independent approval when exposed to
real-life data, which may alter their performance
possibly for the better, but in an unchecked way?
Regulatory authorities will need to adapt to these new
challenges.
Note!Not like all your “old
school” clinical trials were done
with sub-populations presenting
the whole population

Whatisthe datapipeline tofeed all your“AImodels”in the end
Anddoesthatconnectto clinicalpracticeandsystemsseamlessly?
https://twitter.com/EricTopol/status/1162441150801670145
Published:16August2019The“inconvenienttruth”
aboutAIinhealthcare TrishanPanch,HeatherMattie&
LeoAnthonyCeli npjDigitalMedicinevolume2,Article
number:77(2019)|https://doi.org/10.1038/s41746-019-0155-4
The rapid explosion in AI has introduced the possibility of using aggregated healthcare data to produce powerful
models that can automate diagnosis and also enable an increasingly precision approach to medicine by
tailoring treatments and targeting resources with maximum effectiveness in a timely and dynamic manner.
However, “the inconvenient truth”is thatatpresent thealgorithms thatfeatureprominentlyin researchliterature
arein factnot,forthemostpart, executableat thefrontlinesof clinicalpractice.
A complex web of ingrained political and economic factors as well as the proximal influence of medical
practice norms and commercial interests determine the way healthcare is delivered. Simply adding AI
applicationstoa fragmentedsystem willnot createsustainablechange.
Healthcare, with its abundance of data, is in theory well-poised to benefit from growth in cloud computing. The
largest and arguably most valuable store of data in healthcare rests in EMR/EHRs. However, clinician
satisfaction with EMRs remains low, resulting in variable completeness and quality of data entry, and
interoperability between different providers remains elusive. The typical lament of a harried clinician is still “why
does my EMR still suck and why don’t all these systems just talk to each other?” Policy imperatives have
attempted to address these dilemmas, however progress has been minimal. In spite of the widely touted
benefits of “data liberation”,15 a sufficiently compelling use case has not been presented to overcome the
vested interests maintaining the status quo and justify the significant upfront investment necessary to build
datainfrastructure.
To realize this vision and to realize the potential of AI across health systems, more fundamental issues have to be
addressed: who owns health data, who is responsible for it, and who can use it? Cloud computing alone will
not answer these questions—public discourse and policy intervention will be needed. The specific path
forward will depend on the degree of a social compact around healthcare itself as a public good, the tolerance to
public private partnership, and crucially, the public’s trust in both governments and the private sector to treat their
healthcaredatawithduecareandattention in thefaceofbothcommercialandpoliticalperverseincentives.

Uncertainty inClinicalPractice veryimportant!
WhatCliniciansWant:ContextualizingExplainableMachineLearningforClinicalEndUse
SanaTonekaboni,Shalmali Joshi,Melissa D.McCradden,Anna Goldenberg https://arxiv.org/abs/1905.05134

Uncertaintyquantification OnecaneasilydedicateawholePhDonthis
There has been growing interest in making deep neural networks robust for real-world
applications. Challenges arise when models receive inputs drawn from outside the
training distribution (OOD). For example, a neural network tasked with classifying
handwritten digits may assign high confidence predictions to cat images. Anomalies are
frequently encountered when deploying ML models in the real world. Generalization to
unseen and worst-case inputs is also essential for robustness to distributional shift.
Well-calibrated predictive uncertainty estimates are indispensable for many
machine learning applications, such as self-driving cars and medical diagnosis
systems. In order to have ML models reliably predict in open environment, wemust deepen
technicalunderstandingin thefollowing areas:
1) Learningalgorithmsthat arerobusttochangesin inputdatadistribution(e.g.,detect
out-of-distribution examples),
2) Mechanismstoestimateand calibrateconfidence produced byneuralnetworks,
3) Methodstoimproverobustnesstoadversarialandnon-adversarialcorruptions,and
4) Key applicationsforuncertainty(e.g.,computervision,robotics, self-drivingcars,
medicalimaging)aswellasbroadermachinelearningtasks.
Through the workshop we hope to help identify fundamentally important directions on
robust and reliabledeep learning, and fosterfuturecollaborations. We invite the submission
ofpaperson topics including,butnot limitedto:
●
Out-of-distribution detection andanomalydetection
●
Robustnesstocorruptions,adversarialperturbations,and distribution shift
●
Calibration
●
Probabilistic (Bayesian andnon-Bayesian)neuralnetworks
●
Openworldrecognition andopen setlearning
●
Security
●
Quantifyingdifferent typesof uncertainty(known unknownsand unknown
unknowns)andtypesofrobustness
●
Applicationsofrobust and uncertainty-awaredeeplearning
ICML2019 Workshopon
Uncertainty &RobustnessinDeep Learning
https://sites.google.com/view/udlworkshop2019/

UncertaintyofyourUncertainty Estimate? Can youtrust it?
CanYouTrustYourModel's
Uncertainty?Evaluating
PredictiveUncertainty Under
DatasetShift
Yaniv Ovadia, EmilyFertig, Jie Ren, ZacharyNado, DSculley,
SebastianNowozin, JoshuaV. Dillon, Balaji
Lakshminarayanan, Jasper Snoek Google Research, DeepMind
(Submittedon 6 Jun 2019)
https://arxiv.org/abs/1906.02530
Using Distributional Shift to Evaluate
Predictive Uncertainty While previous work has
evaluated the quality of predictive uncertainty on OOD
inputs (Lakshminarayanan et al., 2017), there has not to
our knowledge been a comprehensive evaluation of
uncertainty estimates from different methods under
dataset shift. Indeed, we suggest that effective
evaluation of predictive uncertainty is most
meaningful under conditions of distributional shift.
One reason for this is that post-hoc calibration gives
good results in independent and identically distributed
(i.i.d.) regimes, but can fail under even a mild shift in the
input data. And in real world applications,
distributional shift is widely prevalent.
Understanding questions of risk, uncertainty, and trust
in a model’s output becomes increasingly critical as
shiftfromtheoriginaltrainingdatagrowslarger.
(SVI) Stochastic Variational Bayesian Inference e.g.
Wu et al. 2019
(Ensembles, M = 10) Ensembles of M networks trained
independently on the entire dataset using random
initialization (Lakshminarayanan et al. 2016 Cited by 245

i.e. Areyouruncertaintiesproperlycalibrated?
On CalibrationofModernNeuralNetworks
Chuan Guo, GeoffPleiss,YuSun,Kilian Q. Weinberger CornellUniversity
https://arxiv.org/abs/1706.04599 Cited by 194
https://github.com/gpleiss/temperature_scaling
→ https://arxiv.org/abs/1810.11586 (2018)
→ → https://arxiv.org/abs/1901.06852 (2019)
→ → https://arxiv.org/abs/1905.00174 (2019)
Confidence calibration – the problem of predicting
probability estimates representative of the true correctness
likelihood – is important for classification models in many
applications. In automated health care, control should be
passed on to human doctors when the confidence of a disease
diagnosis network islow (Jianget al., 2012 Cited by 47
).Wediscover that
modern neural networks, unlike those from a decade ago, are
poorlycalibrated.
Expected Calibration Error (ECE). While reliability diagrams are useful visual tools, it
is more convenient to have a scalar summary statistic of calibration. Since statistics
comparing two distributions cannot be comprehensive, previous works have proposed
variants, each with a unique emphasis. One notion of miscalibration is the
difference in expectation between confidence and accuracy. Expected
Calibration Error (Naeini et al., 2015)
– or ECE – approximates this by partitioning predictions into
M equally-spaced bins (similar to the reliability diagrams) and taking a weighted average
ofthebins’accuracy/confidence difference
Maximum Calibration Error (MCE). In high-risk applications where reliable
confidence measures are absolutely necessary, we may wish to minimize the worst-
case deviation between confidence and accuracy. The Maximum Calibration Error (
Naeini et al., 2015 )
– or MCE – estimatesthisdeviation.
Reliabilitydiagrams(e.g.Niculescu-MizilandCaruana2005)for CIFAR-100(110-layer
ResNet)before(far left)andafter calibrationwithdifferentmethods(middleleft,middleright,
far right).
Note, see the “Adaptive
Binning Strategy” by
Ding et al. (2019)
These diagrams plot expected sample accuracy
asafunction ofconfidence.Ifthemodelisperfectly
calibrated – then the diagram should plot the
identity function. Any deviation from a perfect
diagonal represents miscalibration. Note
that reliability diagrams do not display the
proportion of samples in a given bin, and thus
cannot be used to estimate how many samples
arecalibrated.

PracticalexamplewithClinicalDecisionSupportSystems
Calibrationofmedical diagnosticclassifierscores
totheprobabilityofdisease
WeijieChen,Berkman Sahiner,FrankSamuelson,AriaPezeshk,NicholasPetrick
Statistical Methodsin Medical Research(Firstpublished online August 8, 2016)
https://doi.org/10.1177%2F0962280216661371 Cited by 2
Scores produced by statistical classifiers in many clinical decision support
systems and other medical diagnostic devices are generally on an arbitrary
scale, sothe clinical meaning ofthesescores is unclear. Calibrationof
classifier scores to a meaningful scale such as the probability of
disease ispotentiallyusefulwhensuch scoresareusedbyaphysician.
In this work, we investigated three methods (parametric, semi-parametric, and
non-parametric) for calibrating classifier scores to the probability of disease
scale and developed uncertainty estimation techniques for these
methods. We showed that classifier scores on arbitrary scales can be
calibrated to the probability of disease scale without affecting their
discrimination performance. With a finite dataset to train the calibration
function, it is important to accompany the probability estimate with its
confidence interval. Our simulations indicate that, when a dataset used for
finding the transformation for calibration is also used for estimating the
performance of calibration, the resubstitution bias exists for a performance
metric involving the truth states in evaluating the calibration performance.
However, the bias is small for the parametric and semi-parametric methods
whenthesamplesizeismoderatetolarge(>100perclass).

Newtoolsemerging foruncertaintyquantification #1
DetectingOut-of-DistributionInputstoDeep
GenerativeModelsUsingaTest for Typicality
Eric Nalisnick, AkihiroMatsukawa, Yee Whye Teh, BalajiLakshminarayanan DeepMind
(Submitted on 7Jun 2019) https://arxiv.org/abs/1906.02994
Recent work has shown that deep generative models can assign higher likelihood to
out-of-distribution data setsthan to their training data. We posit that this phenomenon
is caused by a mismatch between the model's typical set and its areas of high
probability density. In-distribution inputs should reside in the former but not
necessarily in the latter, as previous work has presumed. To determine whether or not
inputs reside in the typical set, we propose a statistically principled, easy-to-
implement test using the empirical distribution of model likelihoods. The test is model
agnostic and widely applicable, only requiring that the likelihood can be computed or
closely approximated. We report experiments showing that our procedure can
successfully detect the out-of-distribution sets in several of the challenging cases
reported byNalisnick et al. (2019).
In the experiments we showed that the proposed test is especially well-suited to
deep generative models, identifying the OOD set for SVHN vs CIFAR-10 vs
ImageNet Nalisnick et al.(2019)
with high accuracy (while maintaining ≤ 1% type-I error). In this
work we used the null hypothesis H0
: X ∈ Ae
M
, which was necessary since we
assumed access to only one training data set. One avenue for future work is to use
auxiliary data sets Hendrycks et al. 2019
to construct a test statistic for the null H0
: X ∈ Ae
M
,
aswould be proper for safety-criticalapplications.
In our experiments we also noticed two cases—PixelCNN trained on FashionMNIST,
tested on NotMNIST and Glow trained on CelebA, tested on CIFAR—in which the
empirical distributions of in- and out-of-distribution likelihoods matched
near perfectly. Thus use of the likelihood distribution produced by deep generative
models has a fundamental limitation that is seemingly worse than what was reported
by Nalisnicket al. (2019). Aitchison et al. (2016) showed that power-law datacan give rise
to a dispersed distribution of likelihoods, and thus examining connections between
long-taileddataand typicality might explainthisphenomenon.

Newtoolsemerging foruncertaintyquantification #2
OpenSetRecognitionThroughDeepNeuralNetworkUncertainty:DoesOut-of-
DistributionDetectionRequireGenerativeClassifiers?
MartinMundt,IuliiaPliushch,Sagnik Majumder andVisvanathanRamesh
GoetheUniversity, Frankfurt,Germany
August26,2019https://arxiv.org/abs/1908.09625
We present an analysis of predictive uncertainty based out-of-distribution detection for different
approaches to estimate various models’ epistemic uncertainty and contrast it with extreme value
theory based open set recognition. While the former alone does not seem to be enough to
overcome this challenge, we demonstrate that uncertainty goes hand in hand with the latter method. This
seemstobeparticularly reflectedinagenerative modelapproach,whereweshowthatposterior
based open set recognition outperforms discriminative models and predictive uncertainty based outlier
rejection, raising the question of whether classifiersneed to be generative in order to know what they have
notseen.
We have provided an analysis of prediction uncertainty and EVT based out-of-distribution
detection approaches for different model types and ways to estimate a model’s epistemic uncertainty.
While further larger scale evaluation is necessary, our results allow for two observations. First, whereas
OOD detection is difficult based on prediction values even when epistemic uncertainty is captured, EVT
based open set recognition based on a latent model’s approximate posterior can offer a
solution to a large degree. Second, we might require generative models for open set detection in
classification, even if previous work has shown that generative approaches that only model the data
distributionseemtofailtodistinguishunseenfromseendata[17].
[17] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan. Do Deep Generative
Models Know What They Don’t Know? International Conference on Learning Representations (ICLR),
2019.
Classification confidence and entropy for deep neural network
classifiers with and without approximate variational inference.
Models have been trained on FashionMNIST and are evaluated
on out-of-distributiondatasets.

ImprovingBack-TranslationwithUncertainty-based
ConfidenceEstimation
ShuoWang,,YangLiu,ChaoWang,HuanboLuan,Maosong
Sunhttps://arxiv.org/abs/1909.00157 (31Aug2019)
While back-translation is simple and effective in exploiting
abundant monolingual corpora to improve low-resource neural
machine translation (NMT), the synthetic bilingual corpora
generatedbyNMTmodelstrainedonlimitedauthenticbilingual
dataare inevitably noisy.Inthiswork,weproposetoquantify
the confidence of NMT model predictions based on model
uncertainty. With word- and sentence-level confidence
measures based on uncertainty, it is possible for back-
translation to better cope with noise in synthetic bilingual
corpora. Experiments on Chinese-English and English-
German translation tasks show that uncertainty-based
confidence estimation significantly improves the performance
ofbacktranslation
The key idea is to use Monte Carlo Dropout to sample
translation probabilities to calculate model uncertainty,
without the need for manually labeled data. As our
approach is transparent to model architectures, we plan to
further verify the effectiveness of our approach on other
downstream applications of NMT such as post-editing and
interactiveMT inthefuture.
Uncertaintyforback-translation i.e.foryourmedicalCycleGANs

Uncertainty ofUncertaintyestimatesmatter
n-MeRCI:A new MetrictoEvaluate
theCorrelationBetweenPredictive
Uncertainty andTrueError
MichelMoukari, LoïcSimon,SylvainePicard, Fredéric Jurie
(Submittedon20Aug2019)
As deep learning applications are becoming more and more
pervasive in robotics, the question ofevaluating the reliability
of inferences becomes a central question in the robotics
community.Thisdomain,knownas predictive uncertainty,
has come under the scrutiny of research groups developing
Bayesian approaches adapted to deep learning such as
Monte Carlo Dropout. Unfortunately, for the time being, the
real goal of predictive uncertainty has been swept
under the rug. Indeed, these approaches are solely
evaluated in terms of raw performance of the network
prediction, while the quality of their estimated
uncertainty is not assessed. Evaluating such uncertainty
prediction quality is especially important in robotics, as actions
shalldependontheconfidenceinperceivedinformation.In this
context, the main contribution of this article is to propose a
novel metric that is adapted to the evaluation of relative
uncertainty assessment and directly applicable to regression
with deep neural networks. To experimentally validate this
metric, we evaluate it on a toy dataset and then apply it to the
taskofmonocular depthestimation.

RememberNotlike your clinicians areperfect and unbiased
Trytofactorthe“groundtruthuncertainty”intothe models
EricTopol@EricTopol Thenext chapter of documenting physician bias in heart attack
diagnosis
1.Age BehavingDiscretelyHeuristicThinkingin the EmergencyDepartment by
@stephencoussens
2. Raceand Sex (+ day ofthe week, site) ="Behavioral Hazard"
WhoisTested for Heart Attackand WhoShould Be:PredictingPatient Riskand Physician Error.
@nberpubs by@m_sendhil @oziadias and potential for #AItolessen
64292097
Dealingwithinter-expertvariability inretinopathyofprematurity:A
machinelearningapproachV.Bolón-Canedo,E.Ataer-Cansizoglu,D.Erdogmus,
J.Kalpathy-Cramer,O.Fontenla-Romero,A.Alonso-Betanzosa,M.F.Chiang
https://doi.org/10.1016/j.cmpb.2015.06.004 (2015)
AHierarchicalProbabilisticU-NetforModeling
Multi-ScaleAmbiguitiesSimonA.A.Kohl, BernardinoRomera-Paredes
, KlausH.Maier-Hein, DaniloJimenez Rezende, S. M.Ali Eslami, PushmeetKohli,
AndrewZisserman, OlafRonneberger (Submittedon30May2019)

Labelnoisefromhospitalpractices
Strategiestoreducediagnosticerrors:asystematicreview
JulieAbimanyi-Ochom,ShalikaBohingamuMudiyanselage,Max Catchpool,MarnieFiripis,
SitharaWanni ArachchigeDona&JenniferJ.Watts BMCMedicalInformaticsandDecision
Makingvolume19,Articlenumber: 174(2019)https://doi.org/10.1186/s12911-019-0901-1
Despite numerous studies on interventions
targeting diagnostic errors, our analyses
revealed limited evidence on interventions being
practically used in clinical settings and a bias of
studies originating from the US (n = 19, 73% of
included studies). There is some evidence that
trigger algorithms, including computer based and
alert systems, may reduce delayed diagnosis and
improve diagnostic accuracy. In trauma settings,
strategies such as additional patient review (e.g.
trauma teams) reduced missed diagnosis and in
radiology departments review strategies such as
team meetings and error documentation
mayreduce diagnosticerrorratesovertime.

Notonlyyoursubjectsarebiased butalsotheresearchstaff?
High-performingphysiciansaremorelikelytoparticipatein
aresearchstudy:findingsfromaqualityimprovement
study
SimoneDahrouge,CatherineDeri Armstrong, WilliamHogg,JatinderpreetSingh &ClareLiddy
BMCMedical Research Methodologyvolume19, Articlenumber: 171(2019)
https://doi.org/10.1186/s12874-019-0809-6
Participants in voluntary research present a different demographic
profile than those who choose not to participate, affecting the
generalizability of many studies. Efforts to evaluate these
differences have faced challenges, as little information is available
from non-participants. Leveraging data from a recent randomized
controlled trial that used health administrative databases in a
jurisdiction with universal medical coverage, we sought to
compare the quality of care provided by participating and non-
participating physicians prior to the program’s
implementation in order to assess whether participating
physiciansprovidedahigher baselinequalityofcare.
Our study demonstrated a participation bias for several quality
indicators. Physician characteristics can explain some of
these differences. Other underlying physician or practice
attributes also influence interest in participating in quality
improvement initiativesand existingqualitylevels. Thestandardfor
addressingparticipation bias by controllingfor basic physician and
practice level variables is inadequate for ensuring that results are
generalizabletoprimarycareprovidersandpractices.
The participants who voluntarily agreed to participate in the IDOCC study
differed in several ways from those who refused. The family physicians
who agree to participate have a better performance quality level at
baseline compared to those who do not. While characteristics of participating
physicians can explain some of these differences, other underlying physician
or practice attributes are also likely to be influencing both interest in QI
initiatives and existing quality levels. QI initiatives are not adequately reaching
their target population. Our results may help inform policymakers to
develop effective recruitment and engagement strategies for
similar programs, with the hope of enhancing their efficiency and cost
effectiveness.
It is important to understand the differences between the
participating and non-participating physicians in order to evaluate
the generalizability and relevance of the conclusions from studies that rely on
voluntary participation. Comparisons of the groups are typically very limited
because very little information is available from those who choose not to
participate. In this particular circumstance where we were able to compare
participants and non-participants much more extensively than usual, we
found important differences demonstrating that the physicians that need the
help most were the least likely to participate. Further research is warranted to
determine how widespread is the participation bias caused by the
voluntarynatureofresearch participation.

ExclusionCriteriaasMeasurements
ExclusionCriteriaasMeasurementsI:IdentifyingInvalidResponses
BarryDewitt,BaruchFischhoff,Alexander L.Davis,StephenB.Broomell,MarkS.
Roberts,JanelHanmer
FirstPublishedAugust28,2019
https://doi.org/10.1177%2F0272989X19856617
In a systematic review, Engel et al. found large variation in the exclusion
criteria used to remove responses held not to represent genuine
preferences in health state valuation studies. We offer an empirical
approach to characterizing the similarities and differences among such
criteria.
Results. We find that the effects of exclusion criteria do not always
match the reasons advanced for applying them. For example, excluding
very high and very low values has been justified as removing aberrant
responses. However, people who give very high and very low
values prove to be systematically different in ways suggesting
that suchresponsesmay reflect different processes.
Conclusions. Exclusion criteria intended to remove low-quality
responses from health state valuation studies may actually remove
deliberate but unusual ones. A companion article examines the
effectsof the exclusion criteria on societal utilityestimates.
ExclusionCriteriaasMeasurementsII:EffectsonUtilityFunctions
BarryDewitt,BaruchFischhoff,Alexander L.Davis,StephenB.Broomell,MarkS.
Roberts,JanelHanmer
FirstPublishedAugust28,2019
https://doi.org/10.1177%2F0272989X19862542
Researchers often justify excluding some responses in studies eliciting
valuations of health states as not representing respondents’ true preferences.
Here,weexamine the effectsofapplying 8common exclusioncriteriaon societal
utilityestimates
Results. Exclusion criteria have varied effects on the utility functions for the
different PROMIS health domains. As a result, applying those criteria would have
varied effects on the value of treatments (and side effects) that change health
status on those domains. Limitations. Although our method could be applied to
any health utility judgments, the present estimates reflect the features of the study
that produced them. Those features include the selected health domains,
standard gamble method, and an online format that excluded some groups (e.g.,
visually impaired and illiterate individuals). We also examined only a subset of all
possible exclusion criteria, selected to represent the space of possibilities, as
characterized in a companion article. Conclusions. Exclusion criteria can
affect estimates of the societal utility of health states. We use those
effects, in conjunction with the results of the companion article, to make
suggestionsforselectingexclusioncriteriainfuturestudies.

Continuous(self-)trackingandburdenforthepatient
“Youhavetoknowwhyyou'redoingthis”:amixedmethods
studyofthebenefitsandburdensofself-trackingin
Parkinson'sdisease
SaraRiggare,ThereseScottDuncan, HelenaHvitfeldt&MariaHägglund
BMCMedicalInformaticsandDecision Makingvolume19,Articlenumber:
175(2019) https://doi.org/10.1186/s12911-019-0896-7
This study explores opinions and experiences of people with
Parkinson’s disease (PwP) in Sweden of using self-tracking.
Parkinson’s disease (PD) is a neurodegenerative condition entailing
varied and changing symptoms and side effects that can be a
challenge to manage optimally. Patients’ self-tracking has
demonstrated potential in other diseases, but we know little
about PD self-tracking. The aim of this study was therefore to explore
the opinionsand experiencesofPwP in Sweden ofusing self-tracking
for PD.
The main identified benefits are that self-tracking gives PwP a deeper
understanding of their own specific manifestations of PD and
contributes to a more effective decision making regarding their own
selfcare. The process of self-tracking also enables PwP to be more
active in communicating with healthcare. Tracking takes a lot of work
and there is a need to find the right balance between burdens
andbenefits.
Futureresearch
Our study suggests that there is potentially a lot to be learned from PwP self-tracking on their own
initiative and that the tools needed at least partly have distinctly different characteristics from tools
used by and in healthcare. In this field, we have identified possible future work in the design and
implementation of tools for measuring the “right” thing as well as for storing, analysing,
visualising, and sharing data. We have also identified a number of other strategies that self-
tracking patients apply to reduce the burden of tracking, e.g. focusing on tracking positive aspects
rather than negative, or clearly limiting their tracking in both time and focus. It would be of interest to
further explore how widely spread these strategies are and how effective they are in reducing the
burden of self-tracking. We believe that the PDSA methodology could be a useful tool in exploring
theseissuesfurther.
Another topic for further research is looking into the group that do not track. What can we learn from
them?Whataretheirreasonsfornot tracking?
We have also identified a neglected area in education related to self-tracking, both for PwP and
healthcare professionals. With a better understanding of the needs for knowledge, both theoretical
and practical, the benefits of self-tracking can be realised in a better way. Future work in this area
include for example identifying appropriate methods and actors foreducation as wellas organisational
andfundingissues.
Data from self-tracking efforts by individuals can also potentially be used for systematically
improving healthcare and research, ultimately enabling personalised medicine. This
would lead to a clearer focus on secondary prevention, which has the potential of improving health.
This potential warrants further studies relating to, for example, how self-tracking could influence health
economicalaspects,bothin healthcareaswellaswithin thesociety.

Burden/adherencehowtooptimizeattendancetotreatment?
Applyingmachinelearningtopredictfuture
adherencetophysicalactivityprograms
MoZhou,YoshimiFukuoka,KenGoldberg,EricVittinghoff
&AnilAswani BMCMedicalInformaticsandDecision
Makingvolume19,Articlenumber:169(2019)
https://doi.org/10.1186/s12911-019-0890-0
Identifying individuals who are unlikely to adhere to a
physical exercise regime has potential to improve
physical activity interventions. The aim of this paper is to
develop and test adherence prediction models using
objectively measured physical activity data in the Mobile
Phone-Based Physical Activity Education
program(mPED)trial.Tothe bestofour knowledge,this
is the first to apply Machine Learning methods to predict
exercise relapse using accelerometer-recorded
physical activity data (triaxial accelerometer (HJA-350IT,
ActivestylePro,OmronHealthcareCo.,Ltd).
DiPS is capable of making accurate and robust predictions
for future weeks. The most predictive features are
steps and physical activity intensity. Furthermore,
the use of DiPS scores can be a promising approach to
determine when or if to provide just-in-time messages
and step goal adjustments to improve compliance. Further
studies on the use of DiPS in the design of physical
activitypromotionprogramsarewarranted.
DiPS is a machine learning-based score that uses
logisticregression or SVM on objectivelymeasuredstep
and goal data, and it was able to accurately predict
exercise relapse with a sensitivity of 85% and a
specificity of 67%. In addition, simulation results suggest
the potential benefit of DiPS as a score to allocate
resources in order to hopefully provide more cost-
effective interventions for increasing adherence.
However, DiPS will need to be validated in larger and
different populations, and its efficacy will need to be
examinedinafull-scaleRCTinthenear future.

‘BiasedEHRfilling’frompoorattendancefromspecific‘phenotypes’?
Predictingscheduledhospitalattendancewithartificial
intelligence
AmyNelson,DanielHerron,GeraintRees&ParashkevNachev npjDigital
Medicinevolume2,Articlenumber:26(2019)
https://doi.org/10.1038/s41746-019-0103-3
Failure to attend scheduled hospital appointments disrupts clinical
management and consumes resource estimated at £1 billion
annually in the United Kingdom National Health Service alone. Accurate
stratification of absence risk can maximize the yield of preventative
interventions. The wide multiplicity of potential causes, and the poor
performance of systems based on simple, linear, low-dimensional models,
suggests complex predictive models of attendance are needed. Here, we
quantify the effect of using complex, non-linear, high-dimensional models
enabledbymachinelearning.
Impact modelling
The value of a predictive system depends on the relative cost of a lost
appointment, and the cost and efficacy of the intervention. The mean
‘reference’ cost of an MRI in the UK National Health Service for the latest
available reporting period (2015–2016) is £147.25,22 rounded to £150. The
cost of reminding a patient by telephone—which often requires more than
one call—is conservatively estimated at £6 within our institution, in broad
agreement with commercial rates. The reported intervention efficacy ranges
from33to39%:11,12,13 hereweconservativelychoosethelowervalue.

DatasharingbythepatientsforclinicalbenefitinEMR/EHR
Patients’ willingness to share digital
healthandnon-health datafor
research: across-sectional study
EmilySeltzer, Jesse Goldshear, Sharath ChandraGuntuku,
Dave Grande, David A. Asch, ElissaV. Klinger & RainaM.
Merchant
BMCMedicalInformaticsandDecision
Makingvolume19,Articlenumber:157(2019)
https://doi.org/10.1186/s12911-019-0886-9
Patients generate large amounts of digital data
through devices, social media applications, and
other online activities. Little is known about
patients’ perception of the data they
generate online and its relatedness to health,
their willingness to share data for research,
andtheirpreferencesregardingdatause.
Patients in this study were willing to share a
considerable amount of personal digital
data with health researchers. They also
recognize that digital data from many sources
reveal information about their health. This study
opens up a discussion around reconsidering US
privacy protections for health information to
reflect current opinions and to include their
relatednesstohealth.
While providing data back to patients would be a first step, future work would also focus on the utility of this
data being provided to healthcare providers via anEMR. Lessdefined is howthisdatawould be interpreted,
or used, or if it would even be welcomed. Regular reports of patients’ steps walked, calories consumed, Facebook
status updates, and online footprints might create overwhelming expectations of regular surveillance of
questionable value and frustratingly limited opportunities to intervene even if strong signals of
abnormal patterns were detected [30]. This future work could assess healthcare providers use of digital data
incorporated in an EMR and focus on issues related to the accuracy, interpretability, meaning, and
actionabilityofthedata [31,32,33,34,35].
This study has limitations. The findings are exploratory and represent a small sample size from a non-
representative population. Response rate may have been influenced by patients being queried in a medical
environment and could vary if patients were asked in non-hospital settings. This study also has strengths. Because
we told patients that we would immediately access their data should they be willing to share it, their
willingness toshare more likely representstrue preferences,rather than merelythe expressed preferencesofa
typicalhypotheticalsetting.

Physicianswith“God-complex”everyonethinktheyareright?
Collectiveintelligenceinmedicaldecision-making:a
systematicscopingreviewKateRadcliffe,HelenaC.Lyson,Jill
Barr-Walker &UrmimalaSarkar BMCMedicalInformaticsand
DecisionMakingvolume19,Articlenumber: 158(2019)
https://doi.org/10.1186/s12911-019-0882-0
Collective intelligence, facilitated by information
technology or manual techniques, refers to the collective
insight of groups working on a task and has the potential to
generate more accurate information or decisions than
individuals can make alone. This concept is gaining traction in
healthcare and has potential in enhancing diagnostic accuracy.
We aim to characterize the current state of research with
respect to collective intelligence in medical decision-making
and describe a framework fordiverse studiesin thistopic.
Collective intelligence in medical decision-making is gaining
popularity to advance medical decision-making and holds
promise to improve patient outcomes. However,
heterogeneous methods and outcomes make it difficult
to assess the utility of collective intelligence approaches
across settings and studies. A better understanding of
collective intelligence and its applications to medicine may
improve medical decision-making.
This systematic scoping review is the first to our knowledge to characterize collective intelligence in medical decision-making. Our review describes collective
intelligence that is generated by medical experts and distinct from terms such as “crowdsourcing” that do not use experts to make medical judgments. All
included studies examine collective intelligence as it pertains to specific cases, rather than simply describing collaborative decision-making or other decision
aids. In this review we present a novel framework to describe investigations into collective intelligence. Studies examined two distinct forms ofthe initial decision
task in collective intelligence: individual processes that were subsequentlyaggregated, versus group synthesis in which the diagnostic thinking was initiated in a
group setting. The initial decision task is followed by aggregation or synthesis of opinions to generate the collective decision-making output. When a group
jointlydevelops their initial decision,synthesis occurs as part of the initial input, whereas inindividual processes, manual orIT methods arerequired to generatea
collective output from the individual inputs that experts contribute. The final collective output can then be routed back to the decision-makers to potentially
influence patient care. The impact of these approaches on patient outcomes remains unclear and merits further study. Similarly, further research is needed to
determine how tobest incorporate these approaches into clinical practice.

ActiveLearning
andDatasetDistillation

This iteratedself-improvement illustrated #1
Deeplearningforcellular
imageanalysis
Erick Moen, Dylan Bannon, Takamasa Kudo, WilliamGraf, Markus
Covert and David VanValen CaliforniaInstituteofTechnology /StanfordUniversity
Nature Methods(2019)
https://doi.org/10.1038/s41592-019-0403-1
Here we review the intersection between
deep learning and cellular image analysis
and provide an overview of both the
mathematical mechanics and the
programming frameworks of deep
learning that are pertinent to life scientists.
We survey the field’s progress in four key
applications: image classification, image
segmentation, object tracking, and
augmentedmicroscopy.
Our prior work has shown that it is important to match a
model’s receptive field size with the relevant feature size
in order to produce a well-performing model for biological
images. The Python package Talos is a convenient tool for
Keras users that helps to automate hyperparameter
optimization through grid searches.
Wehave found thatmodern sofwared evelopmentpractices
havesubstantially improved theprogramming experience, as
well as the stability of the underlying hardware. Our groups
routinely use Git and Docker to develop and deploy
deep learning models. Git is a version-control sofware, and
the associated web platform GitHub allows code to be jointly
developed by team members. Docker is a containerization
tool that enables the production of reproducible
programming environments.
Deeplearningisadata
science,andfewknow
databetterthan those
whoacquireit. Inour
experience,better
toolsand better
insightsarisewhen
benchscientistsand
computational
scientistsworkside
byside—even
exchangingtasks—to
drivediscovery

This iteratedself-improvement illustrated #2
ImprovingDataset
VolumesandModel
Accuracy withSemi-
SupervisedIterativeSelf-
Learning
Robert Dupre; JiriFajtl ;Vasileios Argyriou ;Paolo Remagnino
IEEE Transactions onImage Processing ( EarlyAccess, May2019 )
https://doi.org/10.1109/TIP.2019.2913986
Within this work a novel semi-
supervised learning technique is
introduced based on a simple iterative
learning cycle together with learned
thresholding techniques and an
ensemble decision support system.
State-of-the-art model performance
and increased training data volume are
demonstrated, through the use of
unlabelled data when training deeply
learned classification models. The
methods presented work independently
from the model architectures or loss
functions, making this approach
applicable to a wide range of machine
learningandclassificationtasks.
v

ActiveLearning for datareduction (or labelverification)
LessisMore:AnExplorationofData
RedundancywithActiveDatasetSubsampling
Kashyap Chitta JoseM.AlvarezElmar Haussmann ClementFarabet NVIDIA
(Submitted on 29May2019) https://arxiv.org/abs/1905.12737
Given the large size of such datasets, it is conceivable that they contain certain
samples that either do not contribute or negatively impact the DNN’s
performance. If there is a large number of such samples, subsampling the
training dataset in a way that removes them could provide an effective solution to
both improve performance and reduce training time. In this paper, we propose an
approach called Active Dataset Subsampling (ADS), to identify favorable
subsets within a dataset for training using ensemble based uncertainty
estimation. In our work, we present an ensemble approach that allows users to
draw alarge number of samples using the catastrophic forgetting property
in DNNs [42]. Specifically, we exploit the disagreement between different
checkpoints stored during successive training epochs to efficiently construct
large and diverse ensembles. We collect several training checkpoints over
multiple training runs with different random seeds. This allows us to maximize the
number of samples drawn, efficiently generating ensembles with up to hundreds
ofmembers.
When applied to three image classification benchmarks (CIFAR-10, CIFAR-100
and ImageNet) we find that there are low uncertainty subsets, which can
be as large as 50% of the full dataset, that negatively impact
performance. These subsets are identified and removed with ADS. We
demonstrate that datasets obtained using ADS with a lightweight ResNet18
ensemble remain effective when used to train deeper models like ResNet-101.
Our results provide strong empirical evidence that using all the available data
fortrainingcanhurtperformance onlargescalevisiontasks.

Integrateintodoctor’sworkflow clinicalUX
ANew FrameworktoReduce
Doctor'sWorkloadforMedical
ImageAnnotation
YangDeng etal.(2019) Tsinghua University
http://doi.org/10.1109/ACCESS.2019.2917932
In order to effectively reduce the
workload of doctors, we developed a
new framework for medical image
annotation. First, by combining active
learning and U-shape network, we
employed a suggestive annotation
strategy to select the most effective
annotation candidates. We then
exploited a fine annotation platform
to alleviate annotating efforts on each
candidate and utilized a new criterion to
quantitatively calculate the efforts from
doctors.

Remember that
ConfusionMatrix
interpretation depend
on thepathology and the
health systemcost structure

FalseNegativetendstobemoreexpensivethanFalsePositive
Well at least in termsof quality-of-life (QALY) of the patient. Insurer/public healthcare might disagree
https://aws.amazon.com/blogs/machine-learning/training-models-with-unequal-econo
mic-error-costs-using-amazon-sagemaker/
[3] Wu, Yirong,Craig K. Abbey, Xianqiao Chen,JieLiu, David C.
Page, OguzhanAlagoz, Peggy Peissig,AdedayoA. Onitilo, and
ElizabethS. Burnside. “DevelopingaUtility Decision
Frameworkto Evaluate Predictive Modelsin Breast
Cancer RiskEstimation.”Journal of Medical Imaging 2,no.
4(October 2015). https://doi.org/10.1117/1.JMI.2.4.041005.
[4] Abbey, Craig K., YirongWu, Elizabeth S. Burnside, Adam
Wunderlich,Frank W. Samuelson, and John M. Boone. “A
Utility/CostAnalysisofBreastCancer RiskPrediction
Algorithms.”Proceedingsof SPIE–theInternational Society
forOptical Engineering 9787 (February27,2016).
https://www.ncbi.nlm.nih.gov/pubmed/27335532

Problemnotuniqueto Healthcare “Keyaccountmodeling”
Anovelcost-sensitiveframeworkforcustomer
churnpredictivemodeling
Alejandro CorreaBahnsen,DjamilaAouada&Björn Ottersten
DecisionAnalyticsvolume2,Articlenumber: 5(2015)|
https://doi.org/10.1186/s40165-015-0014-6 - Citedby16
In this paper a new framework for a cost-sensitive churn
predictive modeling was presented. First we show the importance
of using the actual financial costs of the churn modeling process, since
there are significant differences in the results when evaluating a churn
campaign using a traditional such as the F1-Score, than when using a
measure that incorporates the actual financial costs such as the
savings. Moreover, we also show the importance of having a measure
that differentiates the costs within customers, since different customers
have quite different financial impact as measured by their
lifetime value. Also, this framework can be expanded by using an
additional classifier to predict the offer response probability by
customer.
Furthermore, our evaluations confirmed that including the costs of
each example and using an example-dependent cost-sensitive
methods leads to better results in the sense of higher savings. In
particular, by using the cost-sensitive decision tree algorithm, the
financial savings are increased by 153,237 Euros, as compared to the
savings of the cost-insensitive random forest algorithm which amount
to just 24,629 Euros.

Uncertainty and
Medicaldecision making

MCDropout combinewith task-specificUtilityFunction
Loss-CalibratedApproximateInferencein
BayesianNeural Networks
AdamD.Cobb,Stephen J. Roberts, Yarin Gal
(Submittedon10May 2018)
https://arxiv.org/abs/1805.03901 Cited by 5 - Related articles
https://github.com/AdamCobb/LCBNN
Current approaches in approximate inference for Bayesian neural networks
minimise the Kullback-Leibler divergence to approximate the true posterior
over the weights. However, this approximation is without knowledge of
the final application, and therefore cannot guarantee optimal
predictions for a given task. To make more suitable task-specific
approximations, we introduce a new loss-calibrated evidence lower
bound for Bayesian neural networks in the context of supervised learning,
informed by Bayesian decision theory. By introducing a lower bound that
depends on a utility function, we ensure that our approximation achieves
higher utility than traditional methods for applications that have asymmetric
utilityfunctions.
Calibrating the network to take into account the utility leads to a
smoother transition from diagnosing a patient as healthy to diagnosing
them as having moderate diabetes. In comparison, weighting the cross
entropy to avoid false negatives by making errors on the healthy class
pushes it to ‘moderate’ more often. This cautiousness, leads to an
undesirable transition as shown in Figure 4a. The weighted cross entropy
model only diagnoses a patient as definitely being disease-free for
extremelyobvioustestresults, which isnot adesirablecharacteristic.
Left: Standard NN model. Middle: Weighted
cross entropy model. Right: Loss-
calibrated model. Each confusion matrix
displays the resulting diagnosis when
averaging the utility function with respect to
the dropout samples of each network. We
highlight that our utility function captures
our preferences by avoiding false
negatives of the ‘Healthy’ class. In addition,
there is a clear performance gain from the
loss-calibrated model, despite the label
noise in the training. This compares to both
the standard and weighted cross entropy
models, where there isacommon failure mode
of predicting a patient as being ‘Moderate’
when they are ‘Healthy’.

BrierScorebetterthanROCAUCforclinicalutility? Yes,but...
…Stillsensitivetodiseaseprevalence “class imbalance problem ”
TheBrierscoredoesnotevaluatetheclinicalutilityof
diagnostictestsorpredictionmodels
Melissa Assel, Daniel D.Sjoberg and AndrewJ. Vickers
MemorialSloan KetteringCancerCenter,NewYork,USA
Diagnostic and PrognosticResearch 2017 1:19 https://doi.org/10.1186/s41512-017-0020-3
The Brier score is an improvement over other statistical performance measures, such as AUC,
because it is influenced by both discrimination and calibration simultaneously, with smaller
values indicating superior model performance. The Brier score also estimates a well-defined
parameter in the population, the mean squared distance between the observed and expected
outcomes. Thesquare root of theBrierscore is thusthe expecteddistance between the observed and
predictedvalueon theprobabilityscale.
However, the Brier score is prevalence dependent i.e. sensitive to class imbalance in machine learning jargon
in such a way
that the rank ordering of tests or models may inappropriately vary by prevalence [Wuand Lee2014].
For instance, if a disease was rare (low prevalence), but very serious and easily cured by an innocuous
treatment (strong benefit to detection), the Brier score may inappropriately favor a specific test
compared to one of greater sensitivity. Indeed, this is approximately what was seen in the Zika virus
paper[Bragaetal.2017]
We advocate, as an alternative, the use of decision-analytic measures such as net benefit. Net
benefit always gave a rank ordering that was consistent with any reasonable evaluation of the
preferable test or model in a given clinical situation. For instance, a sensitive test had a higher net
benefit than a specific test where sensitivity was clinically important. It is perhaps not
surprising that a decision-analytic technique gives results that are in accord with clinical judgment
because clinical judgment is “hardwired” into the decision-analytic statistic. That said, this measure is
not without its own limitations, in particular, the assumption that the benefit and harms of
treatmentdonot varyimportantlybetweenpatients independentlyofpreference.
Howshouldweevaluatepredictiontools?Comparisonof
threedifferenttools forpredictionofseminalvesicle
invasionat radicalprostatectomyas atestcase
Giovanni Lughezzani etal. Eur Urol. 2012 Oct; 62(4):590–596.
https://dx.doi.org/10.1016%2Fj.eururo.2012.04.022
Traditional (area-under-the-receiver-operating-characteristic-
curve (AUC), calibration plots, the Brier score, sensitivity and
specificity, positive and negative predictive value) and novel (risk
stratification tables, the net reclassification index, decision curve
analysis and predictiveness curves) statistical methods quantified the
predictiveabilitiesofthethreetestedmodels.
Traditional statistical methods (receiver operating characteristic
(ROC) plots and Brier scores), as well as two of the novel statistical
methods (risk stratification tables and the net reclassification index)
could not provide clear distinction between the SVI
prediction tools. For example, receiver operating characteristic
(ROC) plots and Brier scores seemed biased against the binary
decision tool (ESUO criteria) and gave discordant results for the
continuous predictions of the Partin tables and the Gallina nomogram.
The results of the calibration plots were discordant with those of the
ROC plots. Conversely, the decision curve clearly indicated that the
Partin tables (Zornetal.2009) represent the ideal strategy for
stratifyingtheriskofseminalvesicleinvasion (SVI).

ROC(AUROC) not very usefulin the
end for clinicaluse?

“ROCographers”obsessiongaboutsuboptimalmetrics?
DecisionCurve Analysis:ANovelMethod
forEvaluatingPredictionModels
AndrewJ.Vickers,ElenaB. Elkin
MedicalDecisionMaking(November1,2006)
https://doi.org/10.1177/0272989X06295361
“Hilden (2004) has written of the schism between what he describes as
“ROCographers,” those who are interested solely in accuracy, and
“VOIographers,” who are interested in the clinical value of information
(VOI). He notes that although the former ignore the fact that their methods
have no clinical interpretation, the latter have not agreed on an appropriate
mathematical approach. We feel that decision curve analysis may help
bridge this schism by combining the direct clinical applicability of decision-
analytic methods with the mathematical simplicity of accuracy metrics.”
Jørgen Hilden. Evaluation of diagnostic tests: the schism. Soc Med Decis Making Newsletter. 2004;16:5–6. http://biostat.ku.dk/~jh/the-schism(hilden).doc

“Simplifying Medicaldecision
making,andmakingbiostatistician
morepowerful”
Decision CurveAnalysis (DCA)

DecisionCurveAnalysis OriginalPaper#1
DecisionCurve Analysis:ANovelMethod
forEvaluatingPredictionModels
AndrewJ.Vickers,ElenaB. Elkin
https://doi.org/10.1177/0272989X06295361
BACKGROUND
“The AUC metric focuses solely on the predictive accuracy of a model. As such, it cannot tell us
whether the model is worth using at all or which of 2 or more models is preferable. This is
because metrics that concern accuracy do not incorporate information on consequences.
Take the case where a false-negative result is much more harmful than a false-positive result. A
model that had a much greater specificity but slightly lower sensitivity than another would have a
higher AUC but would be a poorer choiceforclinicaluse.
Decision-analytic methods incorporate consequences and, in theory, can tell us whether a model
is worth using at all or which of several alternative models should be used Huningetal.2001
. In a typical
decision analysis, possible consequences of a clinical decision are identified and the expected
outcomes of alternative clinical management strategies then simulated using estimates of the
probabilityand sequelae of events ina hypothetical cohortofpatients
There are 2 general problems associated with applying traditional decision-analytic methods to
prediction models. First, they require data, such as on costs or quality-adjusted life-years, not
found in the validation data set—that is, the result of the model and the true disease state or outcome.
This means that a prediction model cannot be evaluated in a decision analysis without further
information being obtained. Moreover, decision-analytic methods often require explicit valuation of
health states or risk-benefit ratios for a range of outcomes. Health state utilities, used in the
quality adjustment of expected survival, are prone to a variety of systematic biases and may be
burdensome to elicit from subjects. The secondgeneral problem is that decision analysis typically
requires that the test or prediction model being evaluated give a binary result so that the rate of true-
and false-positive and negative results can be estimated. Prediction models often provide a result in
continuous form, such as the probability of an event from 0% to 100%. To evaluate such a
model using decision-analytic methods, the analyst must dichotomize the continuous result at a given
threshold and potentiallyevaluate awiderange ofsuch thresholds.”
Nowwehavetheconfusion matrix fromyourusedclassifier
model, butwhatistheprobability p ?
Thisiseitherclinician’s “gutfeeling” ofpatient’srisk forSVIorsome
riskthresholdbasedon patient’slab results/ EHRhistoryin fancier
modeling
seminal vesicles. The presence of seminal vesicle invasion (SVI) can be observed prior
to or during surgery only in rare cases of widespread disease. SVI is therefore typically
diagnosed after surgery by pathologic examination of the surgical sample. It has
recently been suggested that the likelihood of SVI can be predicted on the basis of
information available before surgery, such as cancer stage, tumor grade, and prostate-
specific antigen (PSA). A lthough some surgeons will remove the seminal vesicles
regardless of the predicted probability of SVI, others have argued that patients with a
low predicted probability of SVI might be spared total removal of the seminal vesicles:
most of the seminal vesicles would be dissected, but the tip, which is in close proximity
to several important nerves and blood vessels, would be preserved. According to this
viewpoint, sparing the seminal vesicle tip might therefore reduce the risk of common
side effects of prostatectomy such as incontinence and impotence.
When the probability is high
(>50%) for SVI, no point doing the
surgery

https://doi.org/10.1177/0272989X06295361
If the prediction model required
obtaining data from medical tests
that were invasive or dangerous
or involved expenditure of time,
effort, and money, we can use a
slightly different formulation of net
benefit:
The harm from the test is a “holistic” estimate of the negative
consequence of having to take the test (cost, inconvenience, medical
harms, etc.) in the units of a true-positive result. For example, if a clinician or a
patient thought that missing a case of disease was 50 times worse than
having to undergo testing, the test harm would be rated as 0.02. Test
harm can also be thought of in terms of the number of patients a clinician
would subject to the test to find 1 case of disease if the test were perfectly
accurate.
If the test were harmful in any way, it is possible that the net benefit of testing
wouldbe veryclosetoorlessthanthe netbenefitofthe“treatall”strategy for
some pt
. In such cases, we would recommend that the clinician have a
careful discussion with the patient and perhaps, if appropriate, implement a
formal decision analysis. In this sense, interpretation of a decision
curve is comparable to interpretation of a clinical trial: if an
intervention is of clear benefit, it should be used; if it is clearly ineffective, it
should notbe used; if its benefitislikely sufficient forsomebutnotall
patients,acarefuldiscussionwithpatientsisindicated.

MedicalDecisionMaking(November1,2006) https://doi.org/10.1177/0272989X06295361
The benefit of using a prediction model can be quantified in simple,
clinically applicable terms. Table 2 gives the results of our analysis for
threshold probability pt
s between 1% and 10%. The net benefit of
0.062 at a pt
of 5% can be interpreted in terms that use of the model,
compared with assuming that all patients are negative, leads to the
equivalent of a net 6.2 true positive results per 100 patients without
anincreaseinthenumberoffalse-positiveresults.
In terms of our specific example, we can state that if we perform surgeries
based on the prediction model, compared to tip preservation in all
patients, the net consequence is equivalent to removing the tip of affected
seminal vesicles in 6.2 patients per 100 and treating no unaffected
patients. Moreover, at a pt
of 5%, the net benefit for the prediction model is
0.013 greater than assuming all patients are positive. We can use the net
benefit formula to calculate that this is the equivalent of a net 0.013 ×
100/(0.05/0.95) = 25 fewer false positive results per 100 patients. In other
words, use of the prediction model would lead to the equivalent
of 25% fewer tip surgeries in patients without SVI, with no increase
inthe numberofpatientswith anaffectedseminalvesicleleftuntreated.

MedicalDecisionMaking(November1,2006) https://doi.org/10.1177/0272989X06295361
A second advantage of decision curve analysis is that it can be used to
compare several different models. To illustrate this, we compare the basic
prediction model with an expanded model and with a simple clinical decision rule.
The expanded model includes all of the variables in the basic model as well as
some additional biomarkers. The clinical decision rule separates patients into
2 riskgroups basedon Gleasongrade and tumorstage: thosewith grade greater
than 6orstagegreaterthan1 areconsideredhigh risk.To calculateadecisioncurve
for this rule, we used the methodology outlined above except that the proportions
oftrue- andfalse-positive resultsremainedconstantforall levelsof pt
.
Figure 3 shows the decision curve for these 3 models in the key range of pt
from 1% to
10%. There are 3 important features to note. First, although the expanded prediction
model has a better AUC than the basic model (0.82 v. 0.80), this makes no practical
difference: the 2 curves are essentially overlapping. Second, the basic model has a
considerably larger AUC than the simple clinical rule, yet for pt
s above 2%, there is
essentially no difference between the 2 models. Third, at some low values of pt
, using
the simple clinical rule actually leads to a poorer outcome than simply treating
everyone, despite a reasonably high AUC (0.72). In addition to illustrating the use of
decision curves to compare multiple prediction models, Figure 3 also demonstrates that
the methodology can easily be applied to a test or model with an inherently binary
outcome, suchasthe simple clinical decision rule.

Addingconfidenceintervalsandcross-validationin2008
Extensionsto decisioncurve
analysis, anovel method for
evaluating diagnostictests,
predictionmodels and
molecularmarkers
AndrewJ. Vickers, Angel M Cronin, Elena BElkin &Mithat
Gonen.BMCMedicalInformaticsand DecisionMaking
(November2008)https://doi.org/10.1186/1472-6947-8-53
In this paper we present several
extensions to decision curve
analysis including correction for
overfit, confidence intervals,
application to censored data
(including competing risk) and
calculation of decision curves
directly from predicted
probabilities. All of these
extensions are based on
straightforward methods that have
previously been described in the
literature for application to
analogousstatisticaltechniques.
BACKGROUNDDecision-analyticmethodscanexplicitlyconsider the clinicalconsequencesofdecisions.
Theythereforeprovidedataabouttheclinicalvalueoftests,modelsandmarkers,andcanthusdeterminewhether or
nottheseshouldbeusedinpatientcare.Yettraditionaldecision-analyticmethodshave severalimportant
disadvantagesthathavelimitedtheir adoptionintheclinicalliterature.
First, the mathematical methods can be complex and difficult to explain to a clinical audience. Second, many
predictors in medicine are continuous, such as a probability from a prognostic model or a serum level of a molecular
marker, and such predictors can be difficult to incorporate into decision analysis. Third, and perhaps most critically, a
comprehensive decision analysis usually requires information not found in the data set of a validation
study, that is, the test outcomes, marker values or model predictions on a group of patients matched with their true
outcome.
In the principal example used in this paper, blood was taken immediately before a biopsy for prostate cancer and
various molecular markers measured. The data set for the study consisted of the levels of the various markers and an
indicator for whether the biopsy was positive or negative for cancer. A biostatistician could immediately analyze
these data and provide an investigator with sensitivities, specificities and AUCs; a decision analyst would have to
obtain additional data on the costs and harms of biopsy and the consequences of failing to undertake a biopsy in a
patient with prostate cancer. Perhaps as a result, the number of papers that evaluate models and tests in terms of
accuracydwarfsthosewithadecision-analyticorientation.

Decisioncurveanalysis(DCA)Analyzingthe“opt-outextension”
AssessingtheClinicalImpact ofRisk
ModelsforOptingOutofTreatment
KathleenF.Kerr, MarshallD.Brown,Tracey L. Marsh, HollyJanes
MedicalDecisionMaking(January16,2019)
https://doi.org/10.1177%2F0272989X18819479
Decision curves are a tool for evaluating the population
impact of using a risk model for deciding whether to
undergo some intervention, which might be a
treatment to help prevent an unwanted clinical event or
invasive diagnostic testing such as biopsy. The common
formulation of decision curves is based on an opt-in
framework. That is,arisk modelisevaluatedbasedon the
population impact of using the model to opt high-risk
patients into treatment in a setting where the standard of
care is not to treat. Opt-in decision curves display the
population net benefit of the risk model in comparison
to the reference policy of treating no patients. In some
contexts, however, the standard of care in the absence
of a risk model is to treat everyone, and the potential use
of the risk model would be to opt low-risk patients out
of treatment. Although opt-out settings were discussed in
the original decision curve paper, opt-out decision
curves are underused. We review the formulation of
opt-out decision curves and discuss their advantages
for interpretation and inference when treat-all is the
standard.
Opt-in decision curve analysis of the risk model
in our simulated data. For comparison, the net
benefit values reported in the original article are also
shown (see Table 6 in Slankamenacetal.2013). The
standardized net benefit of each treatment policy is
displayed compared to the treat-none policy, which
has net benefit 0. The 95% confidence intervals
shown in the plot are useful for comparing either the
risk model or treat-all with treat-none. However,
these confidence intervals cannot be used to
compare the risk model with treat-all. For this
context, where treat-all is current policy, it is more
appropriate to display the standardized net benefit
of risk-based treatment compared to this reference
(seebelow).
Opt-out decision curve analysis corresponding
to above. Treat-all is the reference policy, and a risk
model could be used to opt low-risk patients out of
treatment. The analysis shows that the risk model
offers an estimated 20% to 55% of the maximum
possible net benefit to the patient population for R
between 5% and 20% compared to perfect
prediction. For a prespecified risk threshold, this opt-
out decision curve allows an assessment of the
evidence in favor of using the risk model over treat-
all because the confidence intervals displayed are
for risk-based treatment relative to treat-all. This is
not possible using the opt-in decision curves in
above.

DCAandUncertainty
TheImportance of UncertaintyandOpt-Inv.Opt-
Out: Best PracticesforDecisionCurveAnalysis
KathleenF.KerrTraceyL.Marsh,Holly Janes DepartmentofBiostatistics,UniversityofWashington,Seattle,
Editorial,MedicalDecision Making (May20,2019)
https://doi.org/10.1177/0272989X19849436
Decision curve analysis (DCA) is a breakthrough methodology,
and we are happy to see its growing adoption. DCA evaluates a risk model
when its intended application is to choose or forego an intervention based
on estimatedriskofhaving anunwantedoutcomewithouttheintervention.
The risk model performance metric at the heart of DCA is net
benefit. Itiswellknown thatthere isan optimistic biaswhen arisk modelis
evaluated on the same data that were used to build and/or fit the model
unlessspecialmethods are employed.Thisissue pertainsto estimating net
benefit just as it pertains to estimating any other risk model performance
metric.
The common type of DCA evaluates risk models for identifying high-
risk patients whoshouldberecommendedthe intervention.These
“opt-in” decision curves assess risk-based assignment to the
intervention relative to the treat-none policy (Kerretal.2019)
. “Opt-out” decision
curves are better suited when current policy is treat-all, and the potential
use of a risk model is to opt low-risk individuals out of the
intervention. In our opinion, a list of DCA best practices should advocate
choosing the type of decision curve that is most appropriate for the
application—depending on whether currentpractice istreat-none or treat-
all. This proposed item is in the spirit of CapogrossoandVickers(2019)’s
overarchinggoalofthoughtfulandconscientiousapplicationofDCA.
First, investigators should summarize the uncertainty in DCA results. The standard statistical
tool for quantifying uncertainty is the confidence interval. Decision curves are estimated
using a sample of data from a target population (such as a patient population) to infer risk model
performance in that population. A large sample size enables reliable and precise inference to the
target population, reflected in narrow confidence intervals. With a small sample size, spurious results
can arise by chance. In this situation, wide confidence intervals communicate the larger degree of
uncertainty.
While there is largely consensus on the importance of quantifying uncertainty in quantitative
scientific biomedical research, there appears to be some disagreement on this point for DCA
Vickersetal.2008; Bakeretal.2009
. We suspect this disagreement relates to another issue. It has been proposed
that an individual can use DCA results together with his or her personal risk threshold
to decide whether to choose the intervention, forego the intervention, or use the risk modelto
decide VickersandElkin2006; vanCalster2018
. If one accepts this proposal, one can then argue that the individual
should choose the option that appears superior based on point estimates of net benefit, regardless
of statistical significance. Under this proposal, measures of uncertainty such as confidence intervals
do not affectan individual’s decision, arguablyrenderingthemirrelevant.
Our view is that DCA is not appropriately used for such individual decision
making. The components of net benefitare the fraction ofthe target population that will go
on to have the unwanted clinical event and the true- and false-positive rates for the risk
modelattherisk threshold.Allofthesequantitiesare populationquantities,andsotoois
net benefit. Our proposal that decision curves should be published with confidence
intervalscan be viewedasanalternative tothe proposalthat decisioncurvesshould be
smoothed Capogrossoand Vickers (2019)
. To our knowledge, the statistical properties of smoothed
estimates of net benefit have not been investigated, so we think it is premature to
recommend them. Confidence intervals around a bumpy decision curve prevent
overinterpretation of bumps in the curve. In contrast, smoothing a bumpy decision
curve might make results appear more definitive than they really are, which could invite,
ratherthanprevent,overinterpretationofDCAresults.

DecisionCurves vs.RelativeUtility(RU)Curves
DecisionCurvesand RelativeUtilityCurves
StuartG.Baker DivisionofCancerPrevention,NationalCancerInstitute, Bethesda,MD,USA
MedicalDecisionMaking,May20,2019
https://doi.org/10.1177/0272989X19850762
The challenge with using standard decision-analytic methods to evaluate risk prediction has
been specifying benefits and costs. Decision curve analysis circumvents this difficulty by
using a sensitivity analysis based on a range of risk thresholds. A useful alternative to decision
curves is relative utility curves, which uses the risk in the target population for the risk
threshold and has a rigorous theoretical basis. Relative utility is the maximum net benefit of risk
prediction (excluding the cost of data collection) at a given risk threshold divided by the maximum
net benefit of perfect prediction A relative utility curve plots relative utility versus risk threshold for
prediction versus treat-all, on the left side where treat-all is preferred to treat-none, and
predictionversustreat-none on the right side, where treat-none ispreferred to treat-all.
https://doi.org/10.1515/1557-4679.1395
A relative utility curve analysis begins with a risk prediction model obtained from training data. The
goal is to evaluate the risk prediction model in an independent test sample, which is ideally a random
sample from a target population, possibility stratified by event and no event. Unlike a decision curve
analysis, a relative utility analysis requires a concave receiver-operating characteristic (ROC) curve (or
the concave envelope of an ROC curve). A relative utility curve plots relative utility as a function
of the probability of the event in the target population and the test’s true-positive and false-
positive ratesversusa riskthreshold.
Importantly, the risk threshold in a relative utility curve corresponds to the risk in the target
population. In contrast, in decision curves, the risk threshold corresponds to the predicted risk in
the training sample. Consequently, calibration, adjusting the risk from the training sample to match the
risk in the test sample (and ideally to match the risk in the target population), is important for proper
application of decision curves—a criterion not listed in Capogrosso and Vickers. Poor
calibration can yield very misleading decision curves Van Calsterand Vickers 2014; Kerret al.2016
particularly
when the trainingsample isartificiallycreated byseparate samplingof casesand controls.
Making a prediction requires data collection, which often involves a cost or harm. The test
tradeoff is a statistic, developed for relative utility curves but computable with decision curves, yielding a
meaningful decision-analytic statement about the cost or harm of data collection without
explicitly specifying these costs or harms. The test tradeoff is the minimum number of persons
receiving a test that would be traded for a true positive so that the net benefit of risk prediction is positive.
Investigators need only decide if the test tradeoff is acceptable given the type of data
collection. For example, a test tradeoff of data collection in 3000 persons for every true prediction of
breast cancer risk may be acceptable if data collection involves a questionnaire or an inexpensive
genetic test but likely unacceptable if it involves an invasive test. Capogrosso and Vickers do not
mention test tradeoffs, but they are startingtobe used with decision curves van Calster 2018
.
In summary, Capogrosso and Vickers have made a valuable contribution by recognizing the
growing use of decision curve analyses satisfying their criteria for good practice. However, the
criteria should be expanded to include calibration of the predicted risk from the training sample
to the risk in the test sample and ideally the target population, so that the risk threshold is on the
appropriate scale. In addition, when using either decision curves or relative utility curves, it is worthwhile
to report the test tradeoff when there are nonnegligible data collection costs.

Decisioncurveanalysis(DCA)Reviewofusage
ASystematicReviewofthe
Literature DemonstratesSome
Errorsinthe Use ofDecisionCurve
AnalysisbutGenerallyCorrect
Interpretationof Findings
PaoloCapogrosso,AndrewJ.Vickers
MedicalDecisionMaking(February28, 2019)
https://doi.org/10.1177%2F0272989X19832881
We performed a literature review to
identify common errors in the
application of DCA and provide
practical suggestions for
appropriate use of DCA. Despite
some common errors in the
application of DCA, our finding that
almost all studies correctly
interpreted the DCA results
demonstrates that it is a clear and
intuitive method to assess
clinical utility.
A common task in medical research is to
assess the value of a diagnostic
test, molecular marker, or prediction
model. The statistical methods typically
used to do so include metrics such as
sensitivity, specificity, and area under
the curve (AUC Hanley and MacNeil 1982
)
However, it is difficult to translate
these metrics into clinical
practice: for instance,itis notatallclear
how high AUC needs to be to justify use
of a prediction model or whether, when
comparing 2 diagnostic tests, a given
increase in sensitivity is worth a given
decrease in specificity (Greenland 2008;
Vickers and Cronin 2010)
. It has been generally
argued that because traditional
statistical metrics do not incorporate
clinical consequences—for
instance, the AUC weights sensitivity
and specificity as equally important—
they cannot be used to guide
clinicaldecisions.
In brief, DCA is a plot of net benefit against
threshold probability. Net benefit is a
weighted sum of true and false positives, the
weighting accounting for differential
consequences of each. For instance, it is much
more valuable to find a cancer (true positive)
than it is harmful to conduct an unnecessary
biopsy (false negative) and so it is appropriate
togiveahigher weighttotrue positivesthan false
positives. Threshold probability is the minimum
risk at which a patient or doctor would
accept a treatment and is considered across
arangetoreflectvariationinpreferences.
In the case of a cancer biopsy, for example, we
might imagine that a patient would refuse a
biopsy for acancer risk of1%,acceptabiopsyfor
a risk of 99%, but somewhere in between, such
as a 10% risk, be unsure one way or the other.
The threshold probability is used to determine
positive (risk from the model under evaluation of
10% of more) v. negative (risk less than 10%) and
as the weighting factor in net benefit. Net
benefit for a model, test, or marker is
compared to 2 default strategies of ‘‘treat all’’
(assuming all patients are positive) and ‘‘treat
none’’(assumeallpatientsarenegative)

DCAInAction Whento doProstateMRI
Populationnetbenefitof prostate MRIwithhigh
spatiotemporalresolutioncontrast enhanced imaging:‐enhanced imaging:
Adecisioncurve analysis
VinayPrabhu,AndrewB.Rosenkrantz,RicardoOtazo, DanielK. Sodickson,StellaK.
Kang Department ofRadiology,NYUSchool ofMedicine,NewYork,NewYork,USADepartment ofPopulationHealth,NYUSchool ofMedicine,NewYork,NewYork,USA
JMRI Volume49,Issue5,May2019
https://doi.org/10.1002/jmri.26318
The value of dynamic contrast enhanced (DCE) sequences in prostate‐enhanced (DCE) sequences in prostate
MRI compared with noncontrast MRI is controversial. To evaluate the
population net benefit of risk stratification using DCE MRI for detection‐enhanced (DCE) sequences in prostate
of high grade prostate cancer (HGPCA), with or without high‐enhanced (DCE) sequences in prostate
spatiotemporal resolution DCE imaging.
Noncontrast MRI characterization is likely sufficient to inform the
decision for biopsy only at low personal risk thresholds for
detection of HGPCA (5–11%), where the risks of biopsy are given
minimal consideration. Thus, prospective study is warranted to
further assess the avoidance of unnecessary prostate biopsies and
oncologic outcomes of using GRASP DCE-MRI to guide prostate
biopsydecisions.
Net benefit (a) and standardized net benefit (b) of various biopsy strategies based on study data
using GRASP DCE-MRI, with average HGPCA prevalence. Over probability thresholds (≥11%),
GRASP DCE-MRI with biopsy for lesions scored PI-RADS ≥4 (blue line) was most beneficial. Biopsy
of no lesions did not provide highest net benefit within a clinically relevant range of risk thresholds

DCAIn Action ExampleformicroRNA utilityin bladdercancerprognosis
Developmentofa21-miRNASignature Associated Withthe Prognosisof
PatientsWithBladderCancer
Xiao-HongYinnetal.Center for Evidence-Basedand TranslationalMedicine/DepartmentofEvidence-Based Medicineand ClinicalEpidemiology,,Zhongnan
HospitalofWuhanUniversity
FrontOncol.2019;9:729.https://dx.doi.org/10.3389%2Ffonc.2019.00729
Meanwhile, the prognostication value of the nomogram was verified internally using 1,000 bootstrap
samples, R package “rms” was applied to draw the nomogram and to perform internal validation.
Subsequently, we performed decision curve analysis (DCA) to verify the clinical role of the
nomogramfor the21-miRNAsignature.
To translate the conclusions of the present into clinical applications, we built a nomogram
containing the 21-miRNA signature and other clinical features of BC patients. Users could detect the
expression levels of these miRNAs and calculate the risk score of each BC patients based on the
expression levels of these 21-miRNAs and their corresponding coefficients in the Cox proportional
hazards model (Figure 6), then BC patients could be stratified into high risk group and low risk
group based on the 21-miRNA signature and physicians could estimate the 3 and 5 year survival
probabilities of BC patients. Meanwhile, the result of DCA suggested that the 21-miRNA signature
containing nomogram showed better prediction ability across the threshold probabilities
rangingfrom31to82%(Figure7).
Nevertheless, the most critical limitation of the present study is that the conclusions of the present
study derived from retrospective analysis of public data and the lack of external validation of the
model using in vivo, in vitro and prospective studies. Thus, we should keep cautious when we
translate the 21-miRNA signature into clinical practice. Further large-scale and multi-center in
vivo, in vitro and prospective clinical trials are needed in the future to confirm our new findings. In
conclusion, we introduced a 21-miRNA signature associated with the prognosis of BC patients, and it
mightbeusedasaprognosticmarker inBC.

not like your metric spaceends here
Variousmetricsforclinical
usefulnessexistbeyondthe
overusedROC

NetReclassificationIndex andStandardizedNetBenefit nice?
Measuresforevaluationof prognostic
improvementundermultivariatenormality
fornestedandnonnested models
DanielleM.Enserro OlgaV.Demler MichaelJ.Pencina RalphB.
D'AgostinoSr.
StatisticsinMedicine(June,2019) https://doi.org/10.1002/sim.8204
When comparing performances of two risk prediction
models, several metrics exist to quantify prognostic
improvement, including the change in the area under
the Receiver Operating Characteristic (ROC) curve, the
Integrated Discrimination Improvement, the Net
Reclassification Index (NRI) at event rate, the change in
Standardized Net Benefit (SNB), the change in Brier
score, and the change in scaled Brier score. We explore
the behavior and inter-relationships between
these metrics under multivariate normality in nested
and nonnested model comparisons. We demonstrate
that, within the framework of linear discriminant analysis,
all six statistics are functions of squared
Mahalanobis distance, a robust metric that properly
measures discrimination by quantifying the
separation between the risk scores of events
and nonevents. These relationships are important for
overall interpretability and clinical usefulness.
By extending the theoretical formulas for ΔAUC, IDI, AUC, IDI,
and NRI(y) from nested to nonnested model
comparisons, we increased their usability in
practice. These formulas, combined with the
theoretical derivations for ΔAUC, IDI, SNB(t), ΔAUC, IDI, SBS, and ΔAUC, IDI, BS,
provide additional estimation methods for
investigators. Due to increased variability in the
empirical estimation of NRI(y) and ΔSNB(t)SNB(t),
the theoretical estimators for these particular
metrics may be the superior estimation
method. As seen in the practical example,
nonnormality in the predictor variables
does not drastically affect the estimation of
the prognostic improvement measures;
however, the investigator may consider using
transformations if it is a biologically natural step to
take in analysis.
The difference between nested and
nonnested models. See:
● Cox, D. R.: Tests of separate fam
ilies of hypotheses, Proceedings 4
th Berkeley Symposium in Mathe
matical Statistics and Probability,
1, 105–123 (1961). University of
California Press.
● Cox, D. R.: Further results on test
of separate families of hypothese
s. Journal of the Royal Statistical
Society B, 406–424 (1962).
https://stats.stackexchange.com/questions
/4717/what-is-the-difference-between-a-
nested-and-a-non-nested-model

ExampleofNRIUse
Dynamicriskpredictionfordiabetesusingbiomarkerchange
measurements
LaylaParast,MeganMathews&Mark W.Friedberg
BMCMedicalResearchMethodologyvolume19,Articlenumber:175(2019)
https://doi.org/10.1186/s12874-019-0812-y
Dynamic risk models, which incorporate disease-free survival and repeated
measurements over time, might yield more accurate predictions of future health
status compared to static models. The objective of this study was to develop and
apply a dynamic prediction model to estimate the risk of developing type 2 diabetes
mellitus. Dynamic prediction models based on longitudinal, repeated risk
factormeasurements have the potential to improve the accuracyoffuturehealth
statuspredictions.
Our study applying these methods to the DPP data has some limitations. First,
since these data are from a clinical trial that was specifically focused on high-
risk adults, these results may not be representative of individuals at lower risk for
diabetes. Second, our data lacked precise information on patient
characteristics (exact age and BMI, for example) and was limited to the biological
information available in the DPP data release. This may have contributed to our
observed overall moderate prediction accuracy even using the dynamic model in
the 0.6–0.7 range for the AUC. Future work examining the utility of dynamic
models is warranted within studies that have more patient characteristics available
for prediction. However, even with this limitation, this illustration showsthe potential
advantagesofsuchadynamicapproachoverastaticapproach.

MeanRiskStratification(MSR) andNBI earlywork
Quantifyingriskstratificationprovidedbydiagnostic
testsandriskpredictions:ComparisontoAUCand
decisioncurve analysis
HormuzdA. Katki USNationalCancerInstitute,NIH,DHHS,Division of CancerEpidemiologyandGenetics
StatisticsinMedicine(March2019) https://doi.org/10.1002/sim.8163
A property of diagnostic tests and risk models deserving more attention is risk stratification, defined as
the ability of a test or model to separate those at high absolute risk of disease from those at low absolute risk.
Risk stratification fills a gap between measures of classification (ie, area under the curve (AUC)) that do
not require absolute risks and decision analysis that requires not only absolute risks but also
subjectivespecificationofcostsandutilities.
We introduce mean risk stratification (MRS) as the average change in risk of disease (posttest pretest)‐enhanced (DCE) sequences in prostate
revealed by a diagnostic test or risk model dichotomized at a risk threshold. Mean risk stratification is
particularly valuable for rare conditions, where AUC can be high but MRS can be low, identifying
situations that temper overenthusiasm for screening with the new test/model. We apply MRS to the
controversy over who should get testing for mutations in BRCA1/2 that cause high risks of breast and
ovarian cancers. To reveal different properties of risk thresholds to refer women for BRCA1/2 testing, we
propose an eclectic approach considering MRS and other metrics. The value of MRS is to interpret
AUCin thecontextof BRCA1/2mutation prevalence, providing a rangeof risk thresholds atwhicha risk model is
“optimally informative,” and to provide insight into why net benefit arrives to its conclusion. Herein also, we
introduce a linked metric, the net benefit of information (NBI) (derived from DCA), ie, the increase in
expected utility fromusing the marker/model to select peoplefor intervention versusrandomly selecting people
forintervention. NBI quantifiesthe“informativeness”of amarker/model.
Work is needed on the effect of model miscalibration, small sample sizes, or correcting for
overoptimism in both the range and location of the sweetspot of risk thresholds with maximal MRS and net
benefit of information (NBI). Much work and empirical experience, is needed to make MRS and NBI usable
in practice. Because MRS is on the scale of the outcome, no single MRS could be considered as
usefully informative across outcomes. For example, the MRS = 1.7% for BRCAPRO represents a 1.7%
average change in risk of carrying a cancer-causing mutation. However, carrying a mutation is not as severe as
a 1.7% average change in yearly risk of developing cancer itself, much less a 1.7% average change in yearly risk
of death. Understanding how to define a clinically significant MRS is a key issue for future research.
In addition, more work needs to be done to understand how to use MRS to decide on how to use tests to rule-
inor rule-outpeopleforintervention

ClinicalReinforcement
Learning

DynamicTreatmentRecommendation withRL:Background
SupervisedReinforcementLearningwithRecurrentNeuralNetwork
forDynamic TreatmentRecommendation
LuWang,Wei Zhang,XiaofengHe,Hongyuan Zha
KDD'18 Proceedingsof the24thACMSIGKDDInternationalConferenceon KnowledgeDiscovery&Data
Mining https://doi.org/10.1145/3219819.3219961
The data-driven research on treatment recommendation involves two main branches: supervised
learning (SL) and reinforcement learning (RL) for prescription. SL based prescription tries to
minimize the difference between the recommended prescriptions and indicator signal which
denotes doctor prescriptions. Several pattern-based methods generate recommendations by utilizing the
similarity of patients [Huetal.2016, Sun etal.2016]
, but they are challenging to directly learn the relation between
patients and medications. Recently, some deep models achieve significant improvements by learning a
nonlinear mapping from multiple diseases to multiple drug categories [BajorandLasko2017, Wangetal.2018,
Wangetal.2017
. Unfortunately, a key concern for these SL based models still remains unresolved, i.e, the ground
truth of “good” treatment strategy being unclear in the medical literature [Marik 2015]. More
importantly, the original goal of clinical decision also considers the outcome of patients instead of only
matching theindicatorsignal.
The above issues can be addressed by reinforcement learning for dynamic treatment regime (DTR) [
Murphy2003, Robins1986]. DTR is a sequence of tailored treatments according to the dynamic states of patients, which
conforms to the clinical practice. As a real example shown in Figure 1, treatments for the patient vary dynamically
over time with the accruing observations. The optimal DTR is determined by maximizing the evaluation signal which
indicates the long-term outcome of patients, due to the delayed effect of the current treatment and the influence of future
treatment choices [Chakraborty and Moodie 2013]. With the desired properties of dealing with delayed reward and
inferring optimal policy based on non-optimal prescription behaviors, a set of reinforcement learning methods have
been adapted to generate optimal DTR for life-threatening diseases, such as schizophrenia, non-small cell lung cancer, and
sepsis [e.g. Nemati et al. 2016]. Recently, some studies employ deep RL to solve the DTR problem based on large scale
EHRs [Peng et al.2019, Raghu et al. 2017, Weng et al.2016
. Nevertheless, these methods may recommend treatments that are obviously
different from doctors’ prescriptions due to the lack of the supervision from doctors, which may cause high risk [
Shen et al. 2013] in clinical practice. In addition, the existing methods are challenging for analyzing multiple diseases and the
complexmedication space.
In fact, the evaluation signal and indicator signal play complementary roles, where the
indicator signal gives a basic effectiveness and the evaluation signal helps optimize policy.
Imitation learning (e.g. Finnetal. 2016) utilizes the indicator signal to estimate a reward
function for training robots by supposing the indicator signal is optimal, which is not in line
with the clinical reality. Supervised actor-critic (e.g. Zhuetal.2017) uses the indicator
signal to pre-train a “guardian” and then combines “actor” output and “guardian” output
to send low-risk actions for robots. However, the two types of signals are trained separately
and cannot learn from each other. Inspired by these studies, we propose a novel deep
architecture to generate recommendations for more general DTR involving multiple
diseases and medications, called Supervised Reinforcement Learning with
Recurrent Neural Network (SRL-RNN). The main novelty of SRL-RNN is to combine
the evaluation signal and indicator signal at the same time to learn an integrated policy.
More specifically, SRL-RNN consists of an off-policy actor-criticframework to learn complex
relations among medications, diseases, and individual characteristics. The “actor” in the
framework is not only influenced by the evaluation signal like traditional RL but also
adjusted by the doctors’ behaviors to ensure safe actions. RNN is further adopted to
capture the dependence of the longitudinal and temporal records of patients for the
POMDP problem. Note that treatment and prescription are used interchangeably in this
paper.
!

RewardPoliciesforHealthcare #1
DefiningAdmissibleRewardsforHighConfidencePolicy
EvaluationNiranjaniPrasad,BarbaraEEngelhardt,FinaleDoshi-Velez Princeton / Harvard University
(Submitted on 30 May 2019) https://arxiv.org/abs/1905.13167
One fundamental challenge of reinforcement learning (RL) in practice is specifying the agent’s reward.
Reward functions implicitly define policy, and misspecified rewards can introduce severe, unexpected behaviour.
However, it can be difficult for domain experts to distill multiple (and often implicit) requisites for desired
behaviour into a scalar feedback signal. Much work in reward design or inference using inverse reinforcement
learningfocuseson online, interactive settings in which the agent hasaccessto human feedback[Cristiano et al.2017,
Loftin et al. 2014]
or to a simulator with which to evaluate policies andcompare against human performance. Here, we focus
on reward design for batch RL: we assume access only to a set of past trajectories collected from sub-
optimal experts, with which to train our policies. This is common in many real-world scenarios where the
risks of deploying an agent are high but logging current practice is relatively easy, as in healthcare, education, or
finance.
Batch RL is distinguished by two key preconditions when performing reward design. First, as we assume that
data are expensive to acquire, we must ensure that policies found using the reward function can be evaluated
given existing data. Regardless of the true objectives of the designer, there exist fundamental limitations on reward
functions that can be optimized and that also provide guarantees on performance. There have been a number of
methods presented in the literature for safe, high-confidence policy improvement from batch data given some
reward function, treating behaviour seen in the data as a baseline. In this work, we turn this question around to ask:
Whatisthe classofreward functionsfor which high-confidence policyimprovementispossible?
Second, we typically assume that batch data are not random but produced by domain experts pursuing
biased but reasonable policies. Thus if an expert-specified reward function results in behaviour that deviates
substantially from past trajectories, we must ask whether that deviation was intentional or, as is more likely,
simply because the designer omitted an important constraint, causing the agent to learn unintentional
behaviour. This assumption can be formalized by treating the batch data as ε-optimal with respect to the true reward
function, and searching for rewards that are consistent with this assumption [Huangetal. 2018]. Here, we extend
these ideas to incorporate the uncertainty present when evaluating a policy in the batch setting, where
trajectoriesfrom theestimated policycannot be collected.
We can see that these two constraints are not
equivalent. The extent of overlap in reward functions
satisfying these criteria depends, for example, on the
homogeneity of behaviour in the batch data: if
consistency is measured with respect to average
behaviour in the data, and agents deviate substantially
from this average—e.g., across clinical care providers—
then the space of policies that can be evaluated given the
batch data may be larger than the policy space consistent
with theaverageexpert.
In this paper, we combine these two conditions to
construct tests for admissible functions in reward design
using available data. This yields a novel approach to
the challenge of high-confidence policy
evaluation given high variance importance
sampling-basedvalueestimatesoverextendeddecision
horizons, typical of batch RL problems, and encourages
safe, incremental policy improvement. We illustrate our
approach on several benchmark control tasks, and in
reward design for a health care domain, namely,
weaning apatientfromamechanical ventilator.

Improving Clinical Machine Learning Evaluation

Improving Clinical Machine Learning Evaluation

Recommended

Recommended

More Related Content

More from PetteriTeikariPhD

More from PetteriTeikariPhD (20)

Recently uploaded

Recently uploaded (20)

Improving Clinical Machine Learning Evaluation