Quantitative uncertainty in QSAR predictions - Bayesian predictive inference and the magic of bootstrap
1. Uncertainty in QSAR Predictions – Bayesian Inference and the Magic of Bootstrap
Ullrika Sahlin PhD
Centre for Environmental and
Climate Research (CEC)
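3. Uncertainty in hazard assessment – does it matter?
Number of compounds (of 386) in each class under each treatment of QSAR uncertainty:
Treatment | Not toxic* | Very toxic
0. No HA | ?: 386 |
1. QSAR predictions without uncertainty | 281 | 105
2. Median toxicity | 265 | 121 (+16)
3. Expected toxicity | 262 | 124 (+3)
4. Conservative value of toxicity | 153 | 233 (+109)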
Sahlin et al. 2013. Arguments for Considering Uncertainty in QSAR Predictions
in Hazard and Risk Assessments. ATLA
4. QSAR integrated hazard assessment and the applicability domain (AD) problem
[Figure: Predicted No Effect Concentration of 386 Triazoles. x-axis: log min{EC50}; y-axis: molecular weight. Legend: relative toxicity potential; low confidence in prediction.]
5. Modes of statistical inference
• Parametric inference
– Explain
– Hypothesis-driven
• Predictive inference
– Predict to support decision making
– Generate hypothesis
• Evidence synthesis
– Consider quality
Geisser. Introduction to predictive inference 1993. Sutton and Abrams 2001. Bayesian
methods in meta-analysis and evidence synthesis. Statistical Methods in Medical Research.
6. To predict…
• is to make a statement about something we have not yet observed
• is always made with uncertainty
• is made using at least one model
7. How can I…
• Assess uncertainty in a prediction?
• Take my judgement of confidence in the model into account?
• Validate the assessment?
Principle for QSAR modelling
Principle to judge confidence in predictions
Principle to assess uncertainty
8. Uncertainty in a prediction
Predictive error: discrepancy between model and reality
Predictive reliability: our confidence in using a model to predict what we want to predict
[Figure, two panels: predictive mean against hat value; logEC50 against nC]
9. Different kinds of errors
[Figure: predicted y against nC]
11. Different measures of predictive reliability
• Similarity to points in the training data set
• Distance from the centre of training data
• Density of training data around the item to be predicted
• Sensitivity analysis, e.g. standard deviation in perturbed predictions
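Two of the measures listed above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: the Euclidean metric, the function names and the toy descriptor data are all assumptions.

```python
import math

def centroid(points):
    """Mean of each descriptor over the training set."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_to_centre(query, training):
    """Distance from the centre of the training data (larger = less reliable)."""
    return euclidean(query, centroid(training))

def knn_mean_distance(query, training, k=3):
    """Mean distance to the k nearest training points.

    A small value means the training data are dense around the query.
    """
    dists = sorted(euclidean(query, p) for p in training)
    return sum(dists[:k]) / k

# Hypothetical two-descriptor training set and two query compounds.
training = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
inside = [0.5, 0.5]
outside = [3.0, 3.0]

for name, q in (("inside", inside), ("outside", outside)):
    print(name, distance_to_centre(q, training), knn_mean_distance(q, training, k=2))
```

In practice the descriptors would usually be scaled first, since a raw Euclidean distance is dominated by descriptors with large numeric ranges.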
17. I. Bayesian modelling
Assessment of predictive distribution
• Frequentist framework
– Frequentist analytical
– Sampling "external data"
– Re-sampling: Jackknifing "without replacement", Bootstrapping "with replacement"
• Bayesian framework
– Bayesian analytical
– Bayesian sampling
18. I. Bayesian modelling
• Model parameters are uncertain
• Uncertainty is described by probability
• Prior information is subjective
• Data enters through Bayesian updating
[Figure: MCMC sampling; axes: parameter 1, parameter 2]
19. I. Bayesian modelling
Pros
• Uncertainty is measured by probability
• Links to decision theory
• Motivated under small data
Cons
• Treatment of high-dimensional descriptor space?
• Limitation to specific models?
• Re-modelling of QSARs needed
20. Validation
Fathead Minnow QSARdata R-package
Park and Casella (2008) Journal of the American Statistical Association, Gramacy and Pantaleo (2010) Bayesian Analysis.
[Figure, two panels with axes observed and predicted. Training data: R2_Blasso = 0.79. Test data: R2_Blasso = 0.75.]
21. Validation
Empirical coverage
[Figure, two panels (training data, test data): hit rate against confidence]
23. 3. Assessment considering judgment in predictive reliability
Inspired by Denham 1997 and Clark 2009
Type of distribution: Gaussian
Mean: Point prediction y_q
Variance: Local Predictive Error Sum of Squares divided by denominator
24. 3. Assessment considering judgment in predictive reliability
Inspired by Denham 1997 and Clark 2009
Type of distribution: Gaussian
Mean: Point prediction y_q
Variance: Local Predictive Error Sum of Squares divided by denominator
Observed prediction errors: y_j − ŷ_j
Measure of predictive reliability
Sampling from distribution of modified residuals
25. 3. Assessment considering judgment in predictive reliability
\[
\mathrm{W.PRESS}_q = \frac{\sum_{j=1}^{n} w_{q,j}\,(y_j - \hat{y}_j)^2}{\sum_{j=1}^{n} w_{q,j}}
\]
\[
\mathrm{kNN.PRESS}_q = \sum_{j \in \mathrm{kNN}(w_{q,j})} (y_j - \hat{y}_j)^2
\]
\[
\mathrm{PRESS} = \sum_{j=1}^{n} (y_j - \hat{y}_j)^2
\]
Inspired by Denham 1997 and Clark 2009
Type of distribution: Gaussian
Mean: Point prediction y_q
Variance: Local Predictive Error Sum of Squares divided by denominator
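The global, weighted and k-nearest-neighbour PRESS variants on this slide can be sketched as follows. This is a hedged illustration only: the slides do not say how the weights w_{q,j} are constructed or what the denominator is, so the Gaussian kernel, the bandwidth, the denominators (n and k) and the toy numbers are all assumptions.

```python
import math

def press(errors):
    """Global PRESS: sum of squared prediction errors (the 'equal' null model)."""
    return sum(e ** 2 for e in errors)

def weighted_press(rel_q, rels, errors, bandwidth=0.2):
    """W.PRESS_q: weighted mean of squared errors with weights w_{q,j}.

    ASSUMPTION: Gaussian kernel on the difference in reliability measure;
    the weight function is not specified in the slides.
    """
    weights = [math.exp(-0.5 * ((rel_q - r) / bandwidth) ** 2) for r in rels]
    return sum(w * e ** 2 for w, e in zip(weights, errors)) / sum(weights)

def knn_press(rel_q, rels, errors, k=3):
    """kNN.PRESS_q: squared errors summed over the k most similar compounds."""
    nearest = sorted(range(len(rels)), key=lambda j: abs(rel_q - rels[j]))[:k]
    return sum(errors[j] ** 2 for j in nearest)

def gaussian_interval(y_q, variance, z=1.96):
    """Central interval of a Gaussian predictive distribution (z = 1.96 ~ 95%)."""
    half_width = z * math.sqrt(variance)
    return y_q - half_width, y_q + half_width

# Hypothetical assessment data: a reliability measure per compound
# (e.g. leverage) and the observed prediction errors y_j - yhat_j.
rels = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
errors = [0.1, -0.1, 0.2, 0.8, -1.0, 0.9]

var_equal = press(errors) / len(errors)          # ASSUMPTION: denominator = n
var_central = knn_press(0.2, rels, errors) / 3   # ASSUMPTION: denominator = k
var_outlying = knn_press(0.9, rels, errors) / 3
print(var_equal, var_central, var_outlying)
print(gaussian_interval(2.0, var_central), gaussian_interval(2.0, var_outlying))
```

With these toy numbers a query with a reliability value near the well-predicted compounds gets a narrow interval and a query near the poorly predicted ones a wide interval, whereas the null model assigns the same variance to both.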
26. Validate the assessment
Evaluation on External data
[Figure: log likelihood score per assessment of predictive error (equal, W euclidean, W leverage, W ADdens, kNN euclidean, kNN leverage, kNN ADdens)]
[Figure: Empirical coverage (External data), hit rate against confidence level with 1:1 line, for the same seven assessments]
27. So – which approach is the best?
[Figure, two panels with axes observed and predicted]
Training data: R2_pls = 0.77, R2_boot = 0.83, R2_Blasso = 0.79
Test data: R2_pls = 0.77, R2_boot = 0.78, R2_Blasso = 0.75
28. So – which approach is the best?
[Figure, two panels: hit rate against confidence with 1:1 line. Training data: Blasso, Bootstrap, kNN leverage, equal. Test data: Blasso, Bootstrap, W euclidean, equal.]
29. So – which approach is the best?
[Figure, two panels: hit rate against confidence with 1:1 line. Training data: Blasso, Bootstrap, kNN leverage, equal. Test data: Blasso, Bootstrap, W euclidean, equal.]
Evaluation on training data
[Figure: log likelihood score per assessment of predictive error (Blasso, Bootstrap, kNN leverage, equal)]
30. Take home messages
• A prediction is complete when given with uncertainty specified by probability
• Assessment of uncertainty needs to be both theoretically motivated and proved honest in empirical evaluation of performance measures
• Three useful approaches are to assess uncertainty through modelling (Bayesian), sampling (e.g. bootstrapping), or post-modelling of predictive error
• Use appropriate measures to validate the assessment of uncertainty
31. Thank you for your attention
Drive safely in the statistical jungle!
Editor's Notes
The number of BTAZ compounds classified as very toxic or as not toxic (including potentially toxic*) under the different treatments of QSAR uncertainty, both in the input and in the output of the assessment. Uncertainty in QSAR predictions is considered in alternatives 2 to 4.
Here is an example from a case study in the EU project CADASTER where QSAR predictions were used to inform input parameters to an environmental hazard assessment. Assessments of relative toxicity are shown here for 386 triazoles. The larger the red dot, the more toxic the compound. The compounds are plotted against the minimum toxicity value, which is the alternative if uncertainty in QSAR predictions had not been considered, and against molecular weight, which has a high influence on toxicity. If consideration of uncertainty in QSAR predictions had no effect, the compounds would follow a straight line. They do not, which means that uncertainty has an influence, although a small one, on the outcome of the assessment. Some of these compounds fell outside the applicability domain of one or several of the QSAR models used in the assessment. These compounds are assessed with lower confidence and are marked with a blue triangle. Why uncertainty analysis: using point estimates of predictions, e.g. best guesses or expectations (the plug-in approach), does not guarantee that the assessment produces the best guess or expected value of the output.
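A short worked illustration of the last point, under assumptions that are not in the slides: suppose the predicted log10 toxicity X of a compound is Gaussian with mean \mu and standard deviation \sigma, and the assessment needs the concentration 10^X. Then the expected output differs from the output at the expected input:

\[
\mathbb{E}\left[10^{X}\right] = 10^{\,\mu + \sigma^{2}\ln(10)/2} \;>\; 10^{\mu} = 10^{\,\mathbb{E}[X]} \qquad (\sigma > 0)
\]

For example, with \sigma = 0.5 the exponent shifts by 0.25 \ln(10)/2 \approx 0.29, so the expected concentration is roughly 1.9 times the plug-in value 10^{\mu}.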
There is always at least one model behind a prediction. It can be a mental model. It can be a mathematically well-defined model consisting of a set of equations. The modelling may involve a statistical model, which is often assumed to hold under certain assumptions. It can be the process of modelling, which more or less transparently describes how the model has been generated, its parameters estimated and its performance validated. A reason to dwell upon which models are included is that this sets the possibilities for assessing uncertainty in predictions. This year's plethora of prognosticators comes thanks to Paul the octopus, who correctly predicted the outcomes of all seven of Germany's World Cup matches in 2010 in addition to the final between Spain and Holland.
The title here is uncertainty in a prediction. I would like to emphasize that uncertainty differs from prediction to prediction. I need to specify uncertainty in an individual prediction when I use predictive models such as QSARs to inform decision analysis in some way or another. I have noticed that a qualitative judgement of predictive reliability may lead either to the model prediction not being used (but what is the alternative?) or to the model prediction being used but flagged as possibly not good. This has led to the idea of letting the judgement of predictive reliability influence the quantitative part of the uncertainty in a prediction. Information requirements: first we note that uncertainty in a prediction cannot be reported with a model in the same way as general measures of predictive performance are reported; it depends on what is to be predicted. Later I will show how assessments of uncertainty in predictions can be used to evaluate ways to judge confidence in predictions. Note that different individual predictions may have different uncertainty. Error is not equal for all compounds to which a model is applied. This seems rather obvious, but in practice error is often specified as equal for any prediction, while predictive reliability can be very different. Reliability is a qualitative aspect of uncertainty related to the question: is this a trustworthy piece of information, can I use this prediction in my risk or decision model (and the follow-up question: if I can't, what is the alternative)? Error, being a quantitative characterization of uncertainty, can be dealt with in the risk or decision analysis; it still provides us with an alternative. There is a need to consider error and predictive reliability jointly. Here is a simple model based on one descriptor. The model predicts a line, and predictive error can be assessed. Here I have used a Bayesian model to quantify error in predictions. Error increases the further out of the scatter of data points we get; also, what can we say about items falling outside the scatter of points?
Bayesian modelling. [Description of a Bayesian regression] [Exemplified by the Bayesian Lasso] The spread of the predictive distribution increases with the distance to the training data set (hat value).
So predictive error is characterized by a probability distribution – the predictive distribution. Note that different individual predictions may have different uncertainty: error is not equal for all compounds to which a model is applied, although in practice it is often specified as equal for any prediction while predictive reliability can be very different. Reliability is the qualitative aspect (is this a trustworthy piece of information that I can use in my risk or decision model?), whereas error is a quantitative characterization of uncertainty that can be dealt with in the risk or decision analysis. There is a need to consider error and reliability jointly. Here is a simple model based on one descriptor. The model predicts a line, and predictive error can be assessed. Here I have used a Bayesian model to quantify error in predictions. The dashed lines mark the bounds of prediction intervals with 95% confidence of covering the actual value. Error increases the further out of the scatter of data points we get; also, what can we say about items falling outside the scatter of points? The spread of the predictive distribution increases with the distance to the training data set.
Here is another example of distance from the model versus point prediction. This model has a high-dimensional descriptor space, hence the scatter of black dots (the training data) and red crosses (external predictions). Here we clearly see that some compounds become severe extrapolations from the AD when predicted by this model. As an alternative to disregarding these predictions we could ask: yes, these predictions are bad, but how bad, and does it matter for our decision?
Judging the reliability of using a model to predict is based on several criteria. First, one can look at general qualitative criteria: whether the compound fulfils certain characteristics that the QSAR is modelling. When the predefined criteria are met, different measures of a model's domain of applicability can be used to evaluate predictive reliability. Image 0: two-dimensional distance. Image 1: distance. Image 2: density (3 dimensions). Image 3: show in some way.
Predictive distribution. Uncertainty described by a probability distribution. Describes the error with a probability distribution.
If we believe the assessment of uncertainty to be true, we would expect the true value to fall somewhere under the predictive distribution, and more often close to its centre.
Here is an attempt to show an overview of approaches to assess predictive errors (or the predictive distribution). It does not cover all approaches, but the most common ones, and I am happy to discuss this further with anyone interested. It has two main branches: the frequentist (or classical) statistical framework and the Bayesian framework. I will now pick and demonstrate one example from each of these two branches.
The first is a Bayesian approach to assess uncertainty. Bayesian modelling is designed from the beginning to model uncertainty in parameters using probabilities and is therefore ideal for assessing predictive errors.
Bayesian modelling can quickly be summarized as the activity of modelling where parameters are assigned uncertainty using probabilities. A model consists of a model structure whose parameters are, to begin with, assigned an uncertainty distribution that expresses our prior (that is, before looking at data) understanding of their values and the character of their uncertainty. Data enters through Bayesian updating, and this so-called likelihood principle can be more or less strict. A Bayesian model is usually fitted by Markov chain Monte Carlo sampling, which means that a simulation algorithm explores the distributions of the parameters, given the information in data, and spends most of its time where the posterior probability is high. Priors tell us where to look and the data tell us what is a good place to be. In the figure we see a simulation which took us to a good spot for the values of two parameters. When the algorithm seems to stay at the same place, we say that it has converged. We then throw away the values from the beginning of the simulation and use the remaining ones (here red dots) to generate predictions from the model. Since the parameters are uncertain, the predictions will also be uncertain and – voilà – we have a predictive distribution. Bayesian modelling is THE framework to quantify uncertainty. It provides uncertainty with a fairly easy interpretation, i.e. our uncertainty in a value stemming from our expert knowledge and justified by information in empirical observations. At least in theory it is … Gaussian processes can deal with high-dimensional descriptor spaces, but the mechanistic understanding of the model is lost.
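The fitting procedure described above (sample, discard the early values, predict from the rest) can be sketched with a random-walk Metropolis sampler. This is a toy sketch, not the Bayesian Lasso used in the talk: the one-descriptor linear model, the known error standard deviation `sigma`, the wide Gaussian priors, the proposal step sizes and the data are all assumptions.

```python
import math
import random
import statistics

def log_posterior(a, b, xs, ys, x_bar, sigma, prior_sd):
    """Log prior plus log likelihood (up to a constant) for y = a + b*(x - x_bar)."""
    log_prior = -0.5 * (a / prior_sd) ** 2 - 0.5 * (b / prior_sd) ** 2
    log_lik = sum(-0.5 * ((y - (a + b * (x - x_bar))) / sigma) ** 2
                  for x, y in zip(xs, ys))
    return log_prior + log_lik

def metropolis(xs, ys, sigma=0.5, prior_sd=10.0, n_iter=6000, burn_in=1000, seed=1):
    """Random-walk Metropolis; returns post-burn-in (a, b) draws and x_bar."""
    rng = random.Random(seed)
    x_bar = sum(xs) / len(xs)
    a, b = 0.0, 0.0
    current = log_posterior(a, b, xs, ys, x_bar, sigma, prior_sd)
    chain = []
    for i in range(n_iter):
        a_new = a + rng.gauss(0.0, 0.15)   # ASSUMPTION: hand-picked step sizes
        b_new = b + rng.gauss(0.0, 0.05)
        proposed = log_posterior(a_new, b_new, xs, ys, x_bar, sigma, prior_sd)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            a, b, current = a_new, b_new, proposed
        if i >= burn_in:                   # throw away the early values
            chain.append((a, b))
    return chain, x_bar

def predictive_draws(chain, x_bar, x_new, sigma=0.5, seed=2):
    """One draw from the predictive distribution per retained parameter draw."""
    rng = random.Random(seed)
    return [a + b * (x_new - x_bar) + rng.gauss(0.0, sigma) for a, b in chain]

# Hypothetical data: one descriptor, ten compounds.
xs = [float(i) for i in range(10)]
ys = [1.2, 1.1, 1.9, 1.8, 2.4, 2.3, 3.1, 2.9, 3.6, 3.5]

chain, x_bar = metropolis(xs, ys)
near = predictive_draws(chain, x_bar, x_new=4.5)    # centre of the training data
far = predictive_draws(chain, x_bar, x_new=20.0)    # severe extrapolation
print(statistics.pstdev(near), statistics.pstdev(far))
```

The predictive draws for the extrapolated compound spread more widely than those for the central compound, because uncertainty in the slope matters more the further the query lies from the training data.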
The advantages of Bayesian modelling are that it results in uncertainty being assessed by a probability distribution, and that its interpretation of uncertainty links directly to a decision-theoretic framework, which is useful when optimising testing strategies for experimental design or (as in the applications I have been working with) when QSAR predictions inform input to risk assessment models for chemical regulation. Also, it has a theoretical motivation even under small data sizes (-> Bayesian meta-regression). A problem is that it does not always work in practice. It works best for parametric models, since specifying priors can be difficult if we do not know what is in need of priors. It is not clear how to treat a high-dimensional descriptor space; the selection of descriptors puzzles me: from where should descriptors be part of the model? It is limited to Bayesian modelling. Finally, it requires QSARs, if they already exist, to be re-modelled as Bayesian models. Should the original set of descriptors be considered, or the final selection? Parameter values differ: from a point estimate with estimated variance in a frequentist framework to a posterior distribution that depends on the choice of priors in a Bayesian framework. Is it the same QSAR?
Let us look at the overview of methods to assess predictive distributions. From the frequentist branch of the tree I consider re-sampling. Re-sampling, since we often have a limited amount of data. Re-sampling with replacement means that the same data point can be drawn several times. This can create inbreeding, i.e. some results appear that are artefacts of the particular data, and one has to be cautious under small sample sizes. A recipe for bootstrapping can simply be: specify a quantity whose uncertainty we are interested in. It can be a test statistic, an estimated parameter value or a predictive error (i.e. the discrepancy between a prediction and reality). Then we specify how to derive this quantity based on the observations that we have. Then repeatedly sample from the observations and let the quantity be derived many times. This results in a distribution for the quantity which expresses its uncertainty. Bootstrapping occurs when we allow observations to be sampled several times. A classical application is to fit a model to data, generate predictions and derive residuals, sample from the distribution of residuals to generate new data, fit a new model and save the estimated parameters. Repeat this several times. What we get is something similar to the Bayesian model, with uncertainty in the parameters which results in uncertainty in the predictions. The interpretation of uncertainty is different, though. The use of bootstrap solves some of the problems with Bayesian modelling. I will not show any results from bootstrapping here. I will quickly turn to my third approach to assess uncertainty in a prediction, which is the approach that does not refit the underlying QSAR model, but uses the notion of predictive reliability in the assessment of predictive errors.
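The classical application described above (fit, derive residuals, resample them with replacement, refit, repeat) can be sketched for a one-descriptor least-squares line. The model, the data and the extra resampled residual added to each prediction (so that the draws reflect predictive error rather than parameter uncertainty alone) are assumptions made for illustration.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope

def residual_bootstrap(xs, ys, x_new, n_boot=2000, seed=1):
    """Bootstrap draws of the prediction at x_new."""
    rng = random.Random(seed)
    intercept, slope = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    draws = []
    for _ in range(n_boot):
        # New data: fitted values plus residuals sampled WITH replacement.
        y_star = [f + rng.choice(residuals) for f in fitted]
        a, b = fit_line(xs, y_star)
        # ASSUMPTION: add one more resampled residual for observation noise.
        draws.append(a + b * x_new + rng.choice(residuals))
    return draws

# Hypothetical data: one descriptor, ten compounds.
xs = [float(i) for i in range(10)]
ys = [1.2, 1.1, 1.9, 1.8, 2.4, 2.3, 3.1, 2.9, 3.6, 3.5]

near = residual_bootstrap(xs, ys, x_new=4.5)
far = residual_bootstrap(xs, ys, x_new=20.0)
print(statistics.pstdev(near), statistics.pstdev(far))
```

As with the Bayesian approach, the spread of the draws grows for a query far from the training data, but here it expresses sampling variability rather than a degree of belief.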
Given are observations of prediction error, i.e. the difference between a model prediction of a compound not part of the training data set and the actual value. For each observation we know the corresponding measure of predictive reliability. Using one of the PRESSes described in the previous slide we can derive the local PRESS for a certain query compound by comparing its predictive reliability to those of the assessment data set. The general algorithm to assess predictive uncertainty samples from the distribution of so-called modified residuals. A modified residual is found by dividing prediction residuals from an assessment data set, y_j – ŷ_-j, by each item's specific standard error SDEP_j. If the standard error is properly estimated, and if we assume observed and not yet observed compounds to be exchangeable, the sample of modified residuals provides input for the predictive distribution of individual predictions of new compounds. In this way we do not have to specify what to divide the PRESS value by for the PRESS to be a variance of the predictive distribution. This assessment is quick to run. What takes time is to derive the measures of predictive reliability and perhaps LOO prediction errors for a training data set (if no external data set is used). It also becomes necessary to ask what measure of predictive reliability to use. In the beginning I mentioned four different kinds: similarity in descriptor space, distance to the centre of the AD, density of the AD close to the predicted compound, and sensitivity analysis, which can be the standard deviation in a prediction when a model is generated several times with different outcomes every time. The nice thing about having a predictive distribution is that we can actually validate how good both the model and the uncertainty in its predictions are. It is very common to compare measures of predictive reliability through the correlation between observed errors and the measure, but we know that errors can be both small and large for the same value of the measure, since they are drawn from a distribution.
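The modified-residual idea can be sketched as below. The notes say only that the sample of modified residuals "provides input" for the predictive distribution; rescaling a resampled modified residual by the query compound's own standard error and adding it to the point prediction is one reading of that, stated here as an assumption, as are the function names and numbers.

```python
import random

def modified_residuals(y_obs, y_pred, sdep):
    """Prediction residuals divided by each item's standard error SDEP_j."""
    return [(y - p) / s for y, p, s in zip(y_obs, y_pred, sdep)]

def predictive_sample(y_q, sdep_q, mod_res, n_draws=1000, seed=1):
    """Draws from the predictive distribution of a query compound.

    ASSUMPTION: draw = point prediction + SDEP_q * resampled modified residual.
    """
    rng = random.Random(seed)
    return [y_q + sdep_q * rng.choice(mod_res) for _ in range(n_draws)]

# Hypothetical assessment data set (observed values, predictions, SDEP_j).
y_obs = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.8, 3.3, 3.6]
sdep = [0.1, 0.2, 0.3, 0.4]

mod_res = modified_residuals(y_obs, y_pred, sdep)
draws = predictive_sample(y_q=2.0, sdep_q=0.5, mod_res=mod_res)
print(mod_res)
print(min(draws), max(draws))
```

Because the residuals are standardized before resampling, compounds with different standard errors can share one pool of modified residuals, which is where the exchangeability assumption enters.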
We know that uncertainty in predictions may vary from compound to compound, but since we have assessed individual uncertainty in predictions, we can easily place each prediction under its corresponding predictive distribution.
This approach aims to model the error directly based on the judgement of predictive reliability. For this I need a model for the predictive distribution. Still tampering with regressions, the predictive distribution is assigned to be Gaussian (a bell-shaped distribution, symmetric around its mean). The mean value is the point prediction from the QSAR model. Information on predictive error is then contained in the variance of this predictive distribution. I let the variance be assessed by a local PRESS divided by a denominator. A reason for this choice is that it should be easy to apply and, at best, to document with models. The Gaussian distribution is a simplification. Other distribution types, even non-parametric ones, could have been chosen. Also, the use of both Bayesian and bootstrap as shown before requires running code (which is possible but perhaps not always appreciated). PRESS is a commonly reported performance measure of QSARs, so why not use that as a null model for the assessment of predictive uncertainty? This means that the null model states that all predictive errors are equal and can be derived from the PRESS value. We have tried two variants of local PRESS. The first is a weighted PRESS, which weights according to a measure of similarity in predictive reliability between the query compound and the compounds for which I have observed prediction errors. The weight is constructed such that observed errors with relatively more similar predictive reliability are given higher influence in the assessed variance. As a consequence, the variance for a compound that lies in the centre of the AD is mostly based upon errors observed for compounds in the centre, and vice versa. A more direct variant of this theme is to let the PRESS value be more local by summing over the k nearest neighbours, where what is near is judged based on similarity in predictive reliability. A problem with sampling-based approaches is that the error in the outskirts of the AD is less reliably assessed, since there are by definition fewer values there and we do not, as in the Bayesian case, provide any other information.
Thus, the locally assessed predictive error can be seen as a conditional predictive error, i.e. the expected error given a compound’s position in the domain of applicability or prior information on uncertainty.
Here are two ways to validate assessments of uncertainty using an external data set (at best not part of the modelling leading to the assessments). First we have summed the log-likelihood values for each point in the external data set. A high score means a better (well-balanced) assessment of uncertainty. It means that most compounds fell inside the predictive distribution and few were very far out. I have noticed that the likelihood score can be a bit tricky sometimes, and I always prefer to also look at the graphical display of empirical coverage. Empirical coverage plots are generated by counting, for different confidence levels, the proportion of compounds in the data set that fell inside their corresponding prediction intervals. A good and well-balanced assessment should generate a straight one-to-one line. It is important to keep in mind that the underlying QSAR model should be properly validated before doing this exercise.
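Both validation measures can be sketched for Gaussian predictive distributions. This is a minimal sketch under assumptions: each external compound has a predictive mean and standard deviation, prediction intervals are central intervals, and the data and function names are made up for illustration.

```python
import math
from statistics import NormalDist

def log_likelihood_score(y_obs, means, sds):
    """Sum of Gaussian log densities of the observed values (higher is better)."""
    return sum(-0.5 * math.log(2 * math.pi * s ** 2) - 0.5 * ((y - m) / s) ** 2
               for y, m, s in zip(y_obs, means, sds))

def empirical_coverage(y_obs, means, sds, levels):
    """Hit rate per confidence level: the share of observations that fall
    inside their own central prediction interval. Levels must lie in (0, 1)."""
    hit_rates = []
    for level in levels:
        z = NormalDist().inv_cdf(0.5 + level / 2)   # interval half-width in sds
        hits = sum(abs(y - m) <= z * s for y, m, s in zip(y_obs, means, sds))
        hit_rates.append(hits / len(y_obs))
    return hit_rates

# Hypothetical external data: observed values and two competing assessments.
y_obs = [0.0, 0.5, -1.0, 2.5]
means = [0.0, 0.0, 0.0, 0.0]
sds_a = [1.0, 1.0, 1.0, 1.0]    # assessment A
sds_b = [0.1, 0.1, 0.1, 0.1]    # assessment B: overconfident

print(log_likelihood_score(y_obs, means, sds_a))
print(log_likelihood_score(y_obs, means, sds_b))
print(empirical_coverage(y_obs, means, sds_a, [0.5, 0.9]))
print(empirical_coverage(y_obs, means, sds_b, [0.5, 0.9]))
```

Plotting the hit rates against the confidence levels gives the coverage plot; an assessment whose points lie well below the 1:1 line is overconfident, and one whose points lie well above it is too cautious.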
Bayesian vs bootstrap. Log-likelihood vs coverage: while the likelihood provides a relative comparison, the empirical coverage provides an evaluation that stands for itself. This is because we use the uncertainty in predictions as a probabilistically formulated hypothesis about the observed value in the external data set.