These are the slides from a web seminar given by Jack Wiedrick from the Biostatistics and Design Program at Oregon Health & Science University. He discussed a method for finding candidate biomarkers of disease from RNA datasets. The web seminar was presented as part of the Extracellular RNA Communication Consortium (ERCC) seminar series on 07 December 2017.
Robust biomarker selection from RT-qPCR data using statistical consensus criteria
1. Robust Biomarker Selection from RT-qPCR Data
using Statistical Consensus Criteria
Jodi Lapidus, OHSU Biostatistics & Design Program, Director
Jack Wiedrick, OHSU Biostatistics & Design Program, Staff Biostatistician
NIH/NCATS ExRNA Seminar Series, 07-Dec-2017
2. BACKGROUND
• Are micro-RNAs (miRNAs) expressed in human
body fluids, and if so, can they be used to
distinguish healthy from diseased patients?
• Quantification of miRNA expression can be done
using off-the-shelf quantitative reverse transcription
polymerase chain reaction (RT-qPCR) arrays
• NIH funding divides biomarker experiments into
UH2 (discovery) and UH3 (validation) phases
3. BACKGROUND
• UH2
Discover miRNA-based biomarkers in human body fluids
Experimentation to identify plausible and robust
candidates from a large panel of potential markers
Need to select a small set of markers while retaining and
prioritizing candidates with promising clinical utility
• UH3
Validation using independent sample sets
Targeted testing on the reduced list of candidates
Elucidate associations with clinical characteristics
4. BACKGROUND
• UH2
Discover miRNA-based biomarkers in human body fluids
Experimentation to identify plausible and robust
candidates from a large panel of potential markers
Need to select a small set of markers while retaining and
prioritizing candidates with promising clinical utility
• UH3
Validation using independent sample sets
Targeted testing on the reduced list of candidates
Elucidate associations with clinical characteristics
What do we mean by this?
Markers are robust if they’re
not sensitive to irrelevant
features in the samples
5. BACKGROUND
• UH2
Discover miRNA-based biomarkers in human body fluids
Experimentation to identify plausible and robust
candidates from a large panel of potential markers
Need to select a small set of markers while retaining and
prioritizing candidates with promising clinical utility
• UH3
Validation using independent sample sets
Targeted testing on the reduced list of candidates
Elucidate associations with clinical characteristics
This is the real problem.
We need to make selection
decisions based on small
numbers of samples, where
irrelevant features will often
be very prominent
6. A MOTIVATING SCENARIO
• Two clinical populations: Alzheimer's Disease (AD)
patients versus age/sex-matched non-AD controls
Are there biofluid markers that distinguish them?
If so, how do we discover which ones?
7. A MOTIVATING SCENARIO
• Two clinical populations: Alzheimer's Disease (AD)
patients versus age/sex-matched non-AD controls
Are there biofluid markers that distinguish them?
If so, how do we discover which ones?
Measure all samples using a
standard RT-qPCR panel
assay and look for miRNAs
showing group differences
8. A MOTIVATING SCENARIO
PHASE 1: DISCOVERY
Large set of candidate markers
(Which ones are promising?
How do they seem to perform?)
PHASE 2: VERIFICATION/VALIDATION
Small set of promising markers
(Do they predict as well as hoped?
Are they feasible for screening?)
9. A MOTIVATING SCENARIO
TaqMan® TLDA Cards for miRNA:
1 sample × 377 probes/card × 2 cards,
× number of subjects
3 internal standards per
card are used to align
the cards for a subject
CONSIDERATIONS
1. Only one measured value
per probe per sample
2. Some yield no value…
why? (No expression?
Really low expression?
Assay failure?)
3. Some are untrustworthy
(weak amplification, etc.)
4. Many probes, but most are
likely to be unimportant
10. CONSIDERATIONS
1. Only one measured value
per probe per sample
2. Some yield no value…
why? (No expression?
Really low expression?
Assay failure?)
3. Some are untrustworthy
(weak amplification, etc.)
4. Many probes, but most are
likely to be unimportant
11. A MOTIVATING SCENARIO
A given experiment may vary on some or all
of these considerations, but…
the statistical pipeline we describe can
be applied to any RT-qPCR experiment
designed to select biomarkers.
12. THE PROBLEM
• We want a robust selection methodology that:
emphasizes the predictors that matter and discounts the
ones that don't
13. THE PROBLEM
• We want a robust selection methodology that:
emphasizes the predictors that matter and discounts the
ones that don't
characterizes the associations within the larger context of
uncertainty about prediction model validity
14. THE PROBLEM
• We want a robust selection methodology that:
emphasizes the predictors that matter and discounts the
ones that don't
characterizes the associations within the larger context of
uncertainty about prediction model validity
generates realistic and testable expectations for how well
the predictors can actually predict
15. THE PROBLEM
• We want a robust selection methodology that:
emphasizes the predictors that matter and discounts the
ones that don't
characterizes the associations within the larger context of
uncertainty about prediction model validity
generates realistic and testable expectations for how well
the predictors can actually predict
Our selection pipeline uses statistical
consensus methodology to mitigate the
risk of false discoveries by focusing
attention on reliable and robust signaling
16. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
17. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
"statistically significant" doesn't mean "predictive"
p < 0.01
18. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
"statistically significant" doesn't mean "predictive"
p < 0.01
Are the means significantly different?
Sure.
If you take a single measurement,
can you guess whether it's a case?
No.
19. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
a t-test assumes the values are correct, but some aren't
CENSORING
20. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
a t-test assumes the values are correct, but some aren't
At some point we stop counting
cycles because noise dominates
the signal at high resolutions
CENSORING
21. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
a t-test assumes the values are correct, but some aren't
If amplification hasn't
occurred yet, there may
still be expression, but
all we know about the
cycle time is that it has
to be longer than the
maximum number of
cycles we attempted
CENSORING
22. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
one-at-a-time tests can miss important parts of the story
INTERACTION
(miRNA#2 is strongly correlated
with the outcome, but also
correlated with miRNA#1)
23. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
one-at-a-time tests can miss important parts of the story
Ignoring miRNA#2 leads to
the conclusion that
miRNA#1 is uninformative
about the outcome
(or maybe just a hint of
negative association)
INTERACTION
24. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
one-at-a-time tests can miss important parts of the story
But at any given level of miRNA#2,
increased miRNA#1 is linked to a
significant increase in the outcome
INTERACTION
25. STEPS FOR ROBUST SELECTION
1. VISUALIZE RAW DATA — be on the lookout for batch artifacts
and process noise and filter appropriately
2. NORMALIZE & TRANSFORM — encode sources of technical
noise and model their effects before beginning selection
3. FILTER UNSUITABLE TARGETS — if they don't assay well on
the technology, we can't use them as biomarkers anyway
4. SELECT USING MULTIPLE STATISTICAL METHODS —
different looks give a robust assessment of biomarker validity
5. CROSS-VALIDATE AND RANK — get expectations for
independent validation and prioritize markers accordingly
6. VALIDATE! — verify that the markers behave as expected in an
independent sample set and look for covariate influences
26. STEPS FOR ROBUST SELECTION
1. VISUALIZE RAW DATA — be on the lookout for batch artifacts
and process noise and filter appropriately
2. NORMALIZE & TRANSFORM — encode sources of technical
noise and model their effects before beginning selection
3. FILTER UNSUITABLE TARGETS — if they don't assay well on
the technology, we can't use them as biomarkers anyway
4. SELECT USING MULTIPLE STATISTICAL METHODS —
different looks give a robust assessment of biomarker validity
5. CROSS-VALIDATE AND RANK — get expectations for
independent validation and prioritize markers accordingly
6. VALIDATE! — verify that the markers behave as expected in an
independent sample set and look for covariate influences
UH2
27. STEPS FOR ROBUST SELECTION
1. VISUALIZE RAW DATA — be on the lookout for batch artifacts
and process noise and filter appropriately
2. NORMALIZE & TRANSFORM — encode sources of technical
noise and model their effects before beginning selection
3. FILTER UNSUITABLE TARGETS — if they don't assay well on
the technology, we can't use them as biomarkers anyway
4. SELECT USING MULTIPLE STATISTICAL METHODS —
different looks give a robust assessment of biomarker validity
5. CROSS-VALIDATE AND RANK — get expectations for
independent validation and prioritize markers accordingly
6. VALIDATE! — verify that the markers behave as expected in an
independent sample set and look for covariate influences
UH3
28. 1. VISUALIZE RAW DATA
• Find process heterogeneities and failures
Be on the lookout for batch artifacts and process noise
29. 1. VISUALIZE RAW DATA
• Find process heterogeneities and failures
• Remove candidates with poor assay performance
Be on the lookout for batch artifacts and process noise
30. 1. VISUALIZE RAW DATA
• Find process heterogeneities and failures
• Remove candidates with poor assay performance
Be on the lookout for batch artifacts and process noise
Our UH2 study considered 754
candidate miRNAs, and 343 (45%)
of those targets could be excluded
on assay quality grounds alone
31. 1. VISUALIZE RAW DATA
• Find process heterogeneities and failures
• Remove candidates with poor assay performance
• Determine assay quality/detection limits
(e.g. cycle time censoring threshold)
Be on the lookout for batch artifacts and process noise
32. 2. NORMALIZE & TRANSFORM
• Negative controls should be uniform
Encode sources of technical noise and model their effects
33. 2. NORMALIZE & TRANSFORM
• Negative controls should be uniform within:
processing batch (e.g. reagent lot)
Encode sources of technical noise and model their effects
34. 2. NORMALIZE & TRANSFORM
• Negative controls should be uniform within:
processing batch (e.g. reagent lot)
measurement batch (e.g. assay plate)
Encode sources of technical noise and model their effects
35. 2. NORMALIZE & TRANSFORM
• Negative controls should be uniform within:
processing batch (e.g. reagent lot)
measurement batch (e.g. assay plate)
fixed instrument settings
Encode sources of technical noise and model their effects
36. 2. NORMALIZE & TRANSFORM
• Negative controls should be uniform within:
processing batch (e.g. reagent lot)
measurement batch (e.g. assay plate)
fixed instrument settings
• Model and remove these effects if they differ
Encode sources of technical noise and model their effects
[Figure: U6 control distributions across batches]
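If the batch differences turn out to be simple additive shifts, alignment can be as simple as centering every batch on its control. A minimal pandas sketch, assuming a long-format table with 'batch', 'target', and 'ct' columns (the column names and the use of U6 as the control are illustrative, not the presenters' specification):

```python
import pandas as pd

def align_batches(df: pd.DataFrame, control: str = "U6") -> pd.DataFrame:
    """Remove additive batch shifts by aligning each batch's control
    median to the overall control median."""
    ctrl = df[df["target"] == control]
    overall = ctrl["ct"].median()
    # Per-batch shift = batch control median minus overall control median
    shift = ctrl.groupby("batch")["ct"].median() - overall
    out = df.copy()
    out["ct_adj"] = out["ct"] - out["batch"].map(shift)
    return out
```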
37. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
Encode sources of technical noise and model their effects
38. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
Encode sources of technical noise and model their effects
39. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
higher numbers mean more expression
Encode sources of technical noise and model their effects
40. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
higher numbers mean more expression
censored values become 0
Encode sources of technical noise and model their effects
41. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
Encode sources of technical noise and model their effects
Low cycle time values
on this axis…
42. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
Encode sources of technical noise and model their effects
…map to high expression
values on this axis
Low cycle time values
on this axis…
43. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
Encode sources of technical noise and model their effects
…map to high expression
values on this axis
Low cycle time values
on this axis…
censored
values here
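A minimal sketch of the transform just described: expression = censoring threshold minus cycle time, with censored or missing readings mapped to 0. The 40-cycle threshold below is a common convention but is an assumption, not a value from the talk.

```python
import numpy as np

def to_expression(ct, max_cycles=40.0):
    """Map cycle times (Ct) to an expression scale.

    Low Ct (early amplification) maps to high expression; censored
    values (Ct at/beyond the threshold, or missing) map to 0.
    """
    ct = np.asarray(ct, dtype=float)
    expr = max_cycles - ct
    expr[np.isnan(ct) | (expr < 0)] = 0.0
    return expr

# e.g. to_expression([18.2, 33.5, np.nan, 40.0]) -> [21.8, 6.5, 0.0, 0.0]
```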
44. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
If they don't assay well on the technology, we can't use them
45. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
If they don't assay well on the technology, we can't use them
75% censoring with a
1:1 case:control ratio
means specificity can
never exceed 50%
(only 25% of samples yield a value,
so at most half the members of one
group can be flagged by expression)
46. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
• Cycle time accuracy should be mostly high
If they don't assay well on the technology, we can't use them
47. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
• Cycle time accuracy should be mostly high
If they don't assay well on the technology, we can't use them
Otherwise rankings
become unreliable
because the cycle
times are unreliable
48. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
• Cycle time accuracy should be mostly high
• Censoring should be unrelated to accuracy
If they don't assay well on the technology, we can't use them
49. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
• Cycle time accuracy should be mostly high
• Censoring should be unrelated to accuracy
If they don't assay well on the technology, we can't use them
A correlation here would mean
that measurement error is
blurring the distinction between
'expressed' and 'not expressed'
50. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
• Cycle time accuracy should be mostly high
• Censoring should be unrelated to accuracy
If they don't assay well on the technology, we can't use them
In our UH2 study, out of 411
well-measured targets we
were able to filter 260 (63%)
as unlikely to be viable
biomarkers in the technology
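A hypothetical sketch of these filters in pandas and scipy. The 'accuracy' column stands in for whatever per-well quality score the platform reports, and the cutoffs are placeholders rather than the study's actual criteria:

```python
import pandas as pd
from scipy.stats import pointbiserialr

def filter_targets(df: pd.DataFrame, max_censoring=0.75, min_accuracy=0.9):
    """df has one row per well: 'target', 'censored' (bool), 'accuracy'."""
    stats = df.groupby("target").agg(
        cens_frac=("censored", "mean"),   # attestation: not mostly censored
        med_acc=("accuracy", "median"),   # cycle time accuracy mostly high
    )
    keep = stats[(stats["cens_frac"] <= max_censoring) &
                 (stats["med_acc"] >= min_accuracy)].index
    # Censoring should be unrelated to accuracy: a strong correlation here
    # would mean measurement error is blurring 'expressed' vs 'not expressed'
    r, p = pointbiserialr(df["censored"].astype(int), df["accuracy"])
    return df[df["target"].isin(keep)], (r, p)
```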
51. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Different looks give a robust assessment of validity
52. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for individual markers
Different looks give a robust assessment of validity
53. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for individual markers:
o Are cycle time counts equal? (LOG-RANK TEST)
Different looks give a robust assessment of validity
strong expression-disease association
54. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for individual markers:
o Are cycle time counts equal? (LOG-RANK TEST)
Different looks give a robust assessment of validity
strong expression-disease association
Log-rank tests properly
account for censoring
in the cycle times
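A sketch of the log-rank comparison using the Python lifelines package (our tool choice for illustration; the talk does not name software). Cycle time is treated like a survival time, and a well that never amplifies is right-censored at the maximum cycle attempted:

```python
from lifelines.statistics import logrank_test

# Made-up cycle times; 40.0 with event flag 0 means censored at 40 cycles
ct_cases  = [22.1, 25.4, 40.0, 30.2]
obs_cases = [1, 1, 0, 1]          # 1 = amplified, 0 = censored
ct_ctrls  = [28.3, 40.0, 40.0, 33.7]
obs_ctrls = [1, 0, 0, 1]

result = logrank_test(ct_cases, ct_ctrls,
                      event_observed_A=obs_cases,
                      event_observed_B=obs_ctrls)
print(result.p_value)  # small p suggests an expression-disease association
```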
55. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for individual markers:
o Do cycle times cluster by group? (ROC ANALYSIS)
Different looks give a robust assessment of validity
large group separation in expression
56. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for individual markers:
o Do cycle times cluster by group? (ROC ANALYSIS)
Different looks give a robust assessment of validity
large group separation in expression
ROC analysis is designed
to compare entire
distributions of values
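A minimal ROC sketch with scikit-learn on made-up expression values; an AUC of 0.5 means no group separation and 1.0 means complete separation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

expression = np.array([6.5, 0.0, 12.1, 3.3, 9.8, 0.0, 14.2, 1.1])
is_case    = np.array([0,   0,   1,    0,   1,   0,   1,    1])
print(roc_auc_score(is_case, expression))  # compares whole distributions
```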
57. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for groups of markers
Different looks give a robust assessment of validity
58. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for groups of markers:
o Do target signals overlap? (RANDOM FOREST)
Different looks give a robust assessment of validity
robust classification across many random trees
59. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for groups of markers:
o Do target signals overlap? (RANDOM FOREST)
Different looks give a robust assessment of validity
robust classification across many random trees
Random forests can
capture complex cross-
marker interactions
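A random-forest sketch with scikit-learn, assuming an expression matrix X (samples × markers) and case labels y prepared upstream; the hyperparameters are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)                    # X, y assumed prepared upstream
print(rf.oob_score_)            # out-of-bag classification accuracy
print(rf.feature_importances_)  # relative contribution of each marker
```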
60. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for groups of markers:
o Do signals transcend models? (ALTERNATE CLASSIFIERS)
Different looks give a robust assessment of validity
61. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for groups of markers:
o Do signals transcend models? (ALTERNATE CLASSIFIERS)
Different looks give a robust assessment of validity
More than one way to grow a random forest
A random forest is a resampling-based aggregate of decision
trees that attempts to average over the many possible
trees that could be formed from a set of predictor variables.
But the decision trees could use different rules, e.g.:
• CART (Classification And Regression Trees)
• CFOREST (Conditional inference tree FORESTs)
• CHAID (CHi-squared Automatic Interaction Detection)
• BOOST (BOOSTed classification trees)
All are kinds of random forests, but their component trees
decide differently.
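CFOREST and CHAID implementations live mostly in the R ecosystem, so as a stand-in here is a scikit-learn sketch that makes the same point with the ensembles it does ship: if the signal is real, differently grown tree ensembles should agree.

```python
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score

models = {
    "CART-style forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "Extremely randomized trees": ExtraTreesClassifier(n_estimators=500, random_state=0),
    "Boosted trees": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():  # X, y assumed prepared upstream
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(name, auc)
```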
62. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
• Consensus in selection suggests signal validity
Different looks give a robust assessment of validity
selected combinations of markers classify well
63. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
• Consensus in selection suggests signal validity
Different looks give a robust assessment of validity
selected combinations of markers classify well
In our UH2 study, we were
able to reduce a set of
hundreds of candidate
miRNAs down to just a few
dozen demonstrating good
classification performance.
64. 5. CROSS-VALIDATE AND RANK
• Use multiple imputation to fill in missing data and
simulate population-plausible datasets
Get expectations for independent validation and prioritize accordingly
65. 5. CROSS-VALIDATE AND RANK
• Use multiple imputation to fill in missing data and
simulate population-plausible datasets
• Leave-one-out resampling gives estimates of
prediction ability in a new independent cohort
Get expectations for independent validation and prioritize accordingly
66. 5. CROSS-VALIDATE AND RANK
• Use multiple imputation to fill in missing data and
simulate population-plausible datasets
• Leave-one-out resampling gives estimates of
prediction ability in a new independent cohort
Get expectations for independent validation and prioritize accordingly
Take the average of
a bunch of informed
guesses…
67. 5. CROSS-VALIDATE AND RANK
• Use multiple imputation to fill in missing data and
simulate population-plausible datasets
• Leave-one-out resampling gives estimates of
prediction ability in a new independent cohort
Get expectations for independent validation and prioritize accordingly
Take the average of
a bunch of informed
guesses…
…and make an
informed guess about
future performance
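A simplified scikit-learn sketch of the imputation-plus-leave-one-out idea. For brevity it imputes each dataset once over the full matrix, which is looser than a fully nested procedure, and the imputation model is our assumption rather than the presenters':

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

accs = []
for seed in range(10):  # 10 stochastic imputations of the missing values
    X_imp = IterativeImputer(sample_posterior=True,
                             random_state=seed).fit_transform(X)
    clf = LogisticRegression(max_iter=1000)
    accs.append(cross_val_score(clf, X_imp, y, cv=LeaveOneOut()).mean())
print(np.mean(accs))  # averaged informed guess at out-of-cohort accuracy
```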
68. 5. CROSS-VALIDATE AND RANK
• Assess multimarker classification performance in all
possible groupings of top candidates
Get expectations for independent validation and prioritize accordingly
69. 5. CROSS-VALIDATE AND RANK
• Assess multimarker classification performance in all
possible groupings of top candidates
Bayesian model averaging accounts for uncertainty about
"the right model"
Get expectations for independent validation and prioritize accordingly
70. 5. CROSS-VALIDATE AND RANK
• Assess multimarker classification performance in all
possible groupings of top candidates
Bayesian model averaging accounts for uncertainty about
"the right model"
targets ranked by frequency
of inclusion in models,
weighted by goodness of fit
Get expectations for independent validation and prioritize accordingly
71. 5. CROSS-VALIDATE AND RANK
• Assess multimarker classification performance in all
possible groupings of top candidates:
Bayesian model averaging accounts for uncertainty about
"the right model"
targets ranked by frequency
of inclusion in models,
weighted by goodness of fit
compare to any existing
biomarkers and look for
independent signaling
Get expectations for independent validation and prioritize accordingly
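A sketch of the principle behind this ranking (not the presenters' implementation): enumerate small logistic models, weight each by goodness of fit using BIC as a stand-in for the model posterior, and sum the weights of the models in which each marker appears.

```python
import itertools
import numpy as np
import statsmodels.api as sm

def inclusion_probs(X, y, max_size=3):
    """X: DataFrame of candidate markers; y: 0/1 outcome."""
    bics, included = [], []
    for k in range(1, max_size + 1):
        for cols in itertools.combinations(X.columns, k):
            fit = sm.Logit(y, sm.add_constant(X[list(cols)])).fit(disp=0)
            bics.append(fit.bic)
            included.append(set(cols))
    bics = np.array(bics)
    w = np.exp(-0.5 * (bics - bics.min()))  # BIC-based posterior weights
    w /= w.sum()
    # Weighted frequency of inclusion, per marker
    return {c: w[[c in s for s in included]].sum() for c in X.columns}
```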
72. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Before starting the validation, compare the shape of
response distributions in the old and new cohorts
73. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Before starting the validation, compare the shape of
response distributions in the old and new cohorts
Differences in skew (whether the distribution leans one
way or the other) and kurtosis (how centered vs diffuse
the distribution is) could indicate poorly matched cohorts
74. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Before starting the validation, compare the shape of
response distributions in the old and new cohorts
Differences in skew (whether the distribution leans one
way or the other) and kurtosis (how centered vs diffuse
the distribution is) could indicate poorly matched cohorts
The marker means
approximately line
up in both cohorts
75. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Before starting the validation, compare the shape of
response distributions in the old and new cohorts
Differences in skew (whether the distribution leans one
way or the other) and kurtosis (how centered vs diffuse
the distribution is) could indicate poorly matched cohorts
But some markers in
the old cohort fell on the
extreme edges of the
distribution in the new
cohort — evidence of
potentially large skew in
the discovery cohort
76. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Before starting the validation, compare the shape of
response distributions in the old and new cohorts
Differences in skew (whether the distribution leans one
way or the other) and kurtosis (how centered vs diffuse
the distribution is) could indicate poorly matched cohorts
Central thinness in some of the
distributions is an indication of
low kurtosis — pointing to a
possible admixture of dissimilar
subjects in the population
sampled by the new cohort
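A quick shape check with scipy, assuming per-marker arrays of values from the two cohorts; scipy's kurtosis is the excess (Fisher) kurtosis, so 0 corresponds to a normal distribution:

```python
from scipy.stats import skew, kurtosis

for marker in markers:  # marker names, defined upstream
    old_vals = old_cohort[marker]   # discovery-cohort values
    new_vals = new_cohort[marker]   # validation-cohort values
    print(marker,
          skew(old_vals), skew(new_vals),
          kurtosis(old_vals), kurtosis(new_vals))
```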
77. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Verify that marker relevance assumptions hold
78. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Verify that marker relevance assumptions hold
These were chosen
as nondiscriminating
miRNAs…are they?
79. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Verify that marker relevance assumptions hold
80. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Verify that marker relevance assumptions hold
These were chosen
as discriminating
miRNAs…are they?
81. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Verify that marker relevance assumptions hold
82. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
83. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
84. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
Based on UH2 patterns, we
expect the miRNA-only curve
to pass through this region
85. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
86. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
87. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
Happily, the miRNA-only curve
behaves exactly as expected
88. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
Similarly, we expect the bump
in performance from adding a
genetic marker to only kick in
at relatively low specificities
89. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
90. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
91. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
Equally happily, the curve with the
genetic marker doesn't behave as
expected — it's even better
92. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
Why would the genetic marker behave
differently in the two cohorts?
One explanation is that the discovery cohort
was less healthy — the symptoms were already
so strong that the genetic factor no longer
added much new information. When the
disease is less severe and cases are more
similar to controls, the genetic information
boosts sensitivity for the borderline cases.
These kinds of nuances can be very valuable
for deciding not only who the biomarker
screening should be applied to, but also when.
93. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
94. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
= low rank numbers for stronger markers
= middling rank numbers for mediocre markers
= high rank numbers for weaker markers
Apply several different methods of ranking
(i.e. "judges") to the set of markers —
these are the same statistical tests used
for biomarker discovery, but now the goal
is to prioritize rather than exclude
95. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
= low rank numbers for stronger markers
= middling rank numbers for mediocre markers
= high rank numbers for weaker markers
Note that one of the judges is the
ranking created in the discovery phase,
prior to seeing the validation data
96. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
= low rank numbers for stronger markers
= middling rank numbers for mediocre markers
= high rank numbers for weaker markers
Each judge independently
ranks the candidate markers
in order (1=best, 26=worst)
97. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
= low rank numbers for stronger markers
= middling rank numbers for mediocre markers
= high rank numbers for weaker markers
Then ranks for each marker
are summed across judges
98. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
= low rank numbers for stronger markers
= middling rank numbers for mediocre markers
= high rank numbers for weaker markers
We color-code the table to
visually assess the
consistency of rankings
99. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
The rank sums define an
ordering of markers — this is
our consensus opinion
across evaluation methods
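The aggregation itself is simple; a pandas sketch, assuming a scores table with one row per marker and one column per judge, where higher scores mean stronger markers:

```python
import pandas as pd

def consensus_ranks(scores: pd.DataFrame) -> pd.Series:
    ranks = scores.rank(ascending=False)  # 1 = best, within each judge
    rank_sums = ranks.sum(axis=1)         # sum ranks across judges
    return rank_sums.sort_values()        # low sum = consensus favorite
```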
100. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
This means that rank sums
within 56 of each other
could be randomly assigned
with high probability, but
gaps larger than that are
likely to be qualitative
101. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
[Figure: rank-sum distribution with spans of 56 ranks marked]
102. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
We could think of the rank
sum distribution as a
roughly even mixture of two
kinds of markers: "hot"
markers and "cool" markers
(where in the middle it's
hard to tell which is which)
103. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
"Hot"
markers
"Lukewarm"
markers
"Cool"
markers
104. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
105. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
If the markers we think are important really are, then they
should contribute the most to multimarker models
106. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
If the markers we think are important really are, then they
should contribute the most to multimarker models
Nearly all possible
parsimonious models,
color-coded by number of
markers in the model
107. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
If the markers we think are important really are, then they
should contribute the most to multimarker models
Best models are here —
high AUC (strong),
low AIC (parsimonious)
108. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
If the markers we think are important really are, then they
should contribute the most to multimarker models
Which markers contribute
most to the best models?
109. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
If the markers we think are important really are, then they
should contribute the most to multimarker models
The highest ranked ones!
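A sketch of such a model scan with statsmodels and scikit-learn, recording AIC (parsimony) and in-sample AUC (strength) for every small logistic model; marker subsets up to size 3 are purely illustrative:

```python
import itertools
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

results = []  # X: DataFrame of markers, y: 0/1 labels, assumed upstream
for k in range(1, 4):
    for cols in itertools.combinations(X.columns, k):
        fit = sm.Logit(y, sm.add_constant(X[list(cols)])).fit(disp=0)
        auc = roc_auc_score(y, fit.predict())  # in-sample probabilities
        results.append((cols, fit.aic, auc))
# Best models: high AUC, low AIC; tally which markers recur among them
```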
110. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Assess the overall quality of group separation
111. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Assess the overall quality of group separation
We may never be able to
screen these kinds of
cases with our markers
112. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Assess the overall quality of group separation
But many of these could
be latent cases…!
113. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Evaluate performance ranges for model classes
114. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Evaluate performance ranges for model classes
These regions fare no
better than existing
clinical markers
115. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Evaluate performance ranges for model classes
High-performance
regions can only be
reached with a
sufficient number of
markers to allow
clear discrimination
116. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for differential performance of the classifiers
within clinically relevant covariate subgroups
117. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for differential performance of the classifiers
within clinically relevant covariate subgroups
A clever way to do this is to cluster the subjects and then
examine covariates in the tightest clusters
118. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for differential performance of the classifiers
within clinically relevant covariate subgroups
A clever way to do this is to cluster the subjects and then
examine covariates in the tightest clusters
The markers are more
sensitive in Cluster 2 than
in Cluster 1, and Cluster 2
also has more males
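A clustering sketch with scikit-learn; KMeans, two clusters, and the 'male' covariate column are all illustrative assumptions, since the talk does not name the clustering algorithm or the covariate coding:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Xs = StandardScaler().fit_transform(X)  # X: subjects x markers
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xs)
covariates["cluster"] = labels          # covariates: per-subject table
print(covariates.groupby("cluster")["male"].mean())  # sex mix per cluster
```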
119. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for trending of discrimination performance
across covariate spectra
120. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for trending of discrimination performance
across covariate spectra
The relationships may be marker-specific! Or nonlinear!
121. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for trending of discrimination performance
across covariate spectra
The relationships may be marker-specific! Or nonlinear!
Higher marker ranks seem to
correlate with marker
associations to the covariate…
122. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for trending of discrimination performance
across covariate spectra
The relationships may be marker-specific! Or nonlinear!
…but only for the strongest
markers. Weaker markers
don't show such a clear pattern
123. NEXT STEPS
Co-authors:
Theresa Lusardi
Jay Phillips
Jack Wiedrick
Chris Harrington
Babette Lind
Jodi Lapidus
Joe Quinn
Julie Saugstad
"MicroRNAs in Human Cerebrospinal Fluid
as Biomarkers for Alzheimer's Disease,"
Journal of Alzheimer's Disease,
vol. 55, no. 3, pp. 1223-1233, 2017.
DOI: 10.3233/JAD-160835
• An abbreviated discussion
of this pipeline appeared in
our recent UH2 paper in
the Journal of Alzheimer's
Disease
• Publication of a follow-up
UH3 paper on validation
results is in progress
• Also currently working on a
standalone methods paper
– Robust Statistical Analysis
Pipeline for Selecting Promising
Biomarkers from RT-qPCR
Experiments
Wiedrick, Lusardi, Saugstad, Lapidus