These are the slides from a web seminar given by Jack Wiedrick from the Biostatistics and Design Program at Oregon Health & Science University. He discussed a method for finding candidate biomarkers of disease from RNA datasets. The web seminar was presented as part of the Extracellular RNA Communication Consortium (ERCC) seminar series on 07 December 2017.
Robust biomarker selection from RT-qPCR data using statistical consensus criteria
1. Robust Biomarker Selection from RT-qPCR Data
using Statistical Consensus Criteria
Jodi Lapidus, OHSU Biostatistics & Design Program, Director
Jack Wiedrick, OHSU Biostatistics & Design Program, Staff Biostatistician
NIH/NCATS ExRNA Seminar Series, 07-Dec-2017
2. BACKGROUND
• Are micro-RNAs (miRNAs) expressed in human
body fluids, and if so, can they be used to
distinguish healthy from diseased patients?
• Quantification of miRNA expression can be done
using off-the-shelf quantitative reverse transcription
polymerase chain reaction (RT-qPCR) arrays
• NIH funding divides biomarker experiments into
UH2 (discovery) and UH3 (validation) phases
3. BACKGROUND
• UH2
Discover miRNA-based biomarkers in human body fluids
Experimentation to identify plausible and robust
candidates from a large panel of potential markers
Need to select a small set of markers while retaining and
prioritizing candidates with promising clinical utility
• UH3
Validation using independent sample sets
Targeted testing on the reduced list of candidates
Elucidate associations with clinical characteristics
4. BACKGROUND
• UH2
Discover miRNA-based biomarkers in human body fluids
Experimentation to identify plausible and robust
candidates from a large panel of potential markers
Need to select a small set of markers while retaining and
prioritizing candidates with promising clinical utility
• UH3
Validation using independent sample sets
Targeted testing on the reduced list of candidates
Elucidate associations with clinical characteristics
What do we mean by this?
Markers are robust if they’re
not sensitive to irrelevant
features in the samples
5. BACKGROUND
• UH2
Discover miRNA-based biomarkers in human body fluids
Experimentation to identify plausible and robust
candidates from a large panel of potential markers
Need to select a small set of markers while retaining and
prioritizing candidates with promising clinical utility
• UH3
Validation using independent sample sets
Targeted testing on the reduced list of candidates
Elucidate associations with clinical characteristics
This is the real problem.
We need to make selection
decisions based on small
numbers of samples, where
irrelevant features will often
be very prominent
6. A MOTIVATING SCENARIO
• Two clinical populations: Alzheimer's Disease (AD)
patients versus age/sex-matched non-AD controls
Are there biofluid markers that distinguish them?
If so, how do we discover which ones?
7. A MOTIVATING SCENARIO
• Two clinical populations: Alzheimer's Disease (AD)
patients versus age/sex-matched non-AD controls
Are there biofluid markers that distinguish them?
If so, how do we discover which ones?
Measure all samples using a
standard RT-qPCR panel
assay and look for miRNAs
showing group differences
8. A MOTIVATING SCENARIO
PHASE 1: DISCOVERY
Large set of candidate markers
(Which ones are promising?
How do they seem to perform?)
PHASE 2: VERIFICATION/VALIDATION
Small set of promising markers
(Do they predict as well as hoped?
Are they feasible for screening?)
9. A MOTIVATING SCENARIO
TaqMan® TLDA Cards for miRNA:
1 sample × 377 probes/card × 2 cards,
× number of subjects
3 internal standards per
card are used to align
the cards for a subject
CONSIDERATIONS
1. Only one measured value
per probe per sample
2. Some yield no value…
why? (No expression?
Really low expression?
Assay failure?)
3. Some are untrustworthy
(weak amplification, etc.)
4. Many probes, but most are
likely to be unimportant
10. CONSIDERATIONS
1. Only one measured value
per probe per sample
2. Some yield no value…
why? (No expression?
Really low expression?
Assay failure?)
3. Some are untrustworthy
(weak amplification, etc.)
4. Many probes, but most are
likely to be unimportant
11. A MOTIVATING SCENARIO
A given experiment may vary on some or all
of these considerations, but…
the statistical pipeline we describe can
be applied to any RT-qPCR experiment
designed to select biomarkers.
12. THE PROBLEM
• We want a robust selection methodology that:
emphasizes the predictors that matter and discounts the
ones that don't
13. THE PROBLEM
• We want a robust selection methodology that:
emphasizes the predictors that matter and discounts the
ones that don't
characterizes the associations within the larger context of
uncertainty about prediction model validity
14. THE PROBLEM
• We want a robust selection methodology that:
emphasizes the predictors that matter and discounts the
ones that don't
characterizes the associations within the larger context of
uncertainty about prediction model validity
generates realistic and testable expectations for how well
the predictors can actually predict
15. THE PROBLEM
• We want a robust selection methodology that:
emphasizes the predictors that matter and discounts the
ones that don't
characterizes the associations within the larger context of
uncertainty about prediction model validity
generates realistic and testable expectations for how well
the predictors can actually predict
Our selection pipeline uses statistical
consensus methodology to mitigate the
risk of false discoveries by focusing
attention on reliable and robust signaling
16. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
17. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
"statistically significant" doesn't mean "predictive"
p < 0.01
18. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
"statistically significant" doesn't mean "predictive"
p < 0.01
Are the means significantly different?
Sure.
If you take a single measurement,
can you guess whether it's a case?
No.
19. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
a t-test assumes the values are correct, but some aren't
CENSORING
20. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
a t-test assumes the values are correct, but some aren't
At some point we stop counting
cycles because noise dominates
the signal at high resolutions
CENSORING
21. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
a t-test assumes the values are correct, but some aren't
If amplification hasn't
occurred yet, there may
still be expression, but
all we know about the
cycle time is that it has
to be longer than the
maximum number of
cycles we attempted
CENSORING
22. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
one-at-a-time tests can miss important parts of the story
INTERACTION
(miRNA#2 is strongly correlated
with the outcome, but also
correlated with miRNA#1)
23. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
one-at-a-time tests can miss important parts of the story
Ignoring miRNA#2 leads to
the conclusion that
miRNA#1 is uninformative
about the outcome
(or maybe just a hint of
negative association)
INTERACTION
24. WHY DO WE NEED A "ROBUST" PIPELINE?
• Standard methods (e.g. p-values from t-tests) give
good answers, but to the wrong questions
one-at-a-time tests can miss important parts of the story
But at any given level of miRNA#2,
increased miRNA#1 is linked to a
significant increase in the outcome
INTERACTION
25. STEPS FOR ROBUST SELECTION
1. VISUALIZE RAW DATA — be on the lookout for batch artifacts
and process noise and filter appropriately
2. NORMALIZE & TRANSFORM — encode sources of technical
noise and model their effects before beginning selection
3. FILTER UNSUITABLE TARGETS — if they don't assay well on
the technology, we can't use them as biomarkers anyway
4. SELECT USING MULTIPLE STATISTICAL METHODS —
different looks give a robust assessment of biomarker validity
5. CROSS-VALIDATE AND RANK — get expectations for
independent validation and prioritize markers accordingly
6. VALIDATE! — verify that the markers behave as expected in an
independent sample set and look for covariate influences
26. STEPS FOR ROBUST SELECTION
1. VISUALIZE RAW DATA — be on the lookout for batch artifacts
and process noise and filter appropriately
2. NORMALIZE & TRANSFORM — encode sources of technical
noise and model their effects before beginning selection
3. FILTER UNSUITABLE TARGETS — if they don't assay well on
the technology, we can't use them as biomarkers anyway
4. SELECT USING MULTIPLE STATISTICAL METHODS —
different looks give a robust assessment of biomarker validity
5. CROSS-VALIDATE AND RANK — get expectations for
independent validation and prioritize markers accordingly
6. VALIDATE! — verify that the markers behave as expected in an
independent sample set and look for covariate influences
UH2
27. STEPS FOR ROBUST SELECTION
1. VISUALIZE RAW DATA — be on the lookout for batch artifacts
and process noise and filter appropriately
2. NORMALIZE & TRANSFORM — encode sources of technical
noise and model their effects before beginning selection
3. FILTER UNSUITABLE TARGETS — if they don't assay well on
the technology, we can't use them as biomarkers anyway
4. SELECT USING MULTIPLE STATISTICAL METHODS —
different looks give a robust assessment of biomarker validity
5. CROSS-VALIDATE AND RANK — get expectations for
independent validation and prioritize markers accordingly
6. VALIDATE! — verify that the markers behave as expected in an
independent sample set and look for covariate influences
UH3
28. 1. VISUALIZE RAW DATA
• Find process heterogeneities and failures
Be on the lookout for batch artifacts and process noise
29. 1. VISUALIZE RAW DATA
• Find process heterogeneities and failures
• Remove candidates with poor assay performance
Be on the lookout for batch artifacts and process noise
30. 1. VISUALIZE RAW DATA
• Find process heterogeneities and failures
• Remove candidates with poor assay performance
Be on the lookout for batch artifacts and process noise
Our UH2 study considered 754
candidate miRNAs, and 343 (45%)
of those targets could be excluded
on assay quality grounds alone
31. 1. VISUALIZE RAW DATA
• Find process heterogeneities and failures
• Remove candidates with poor assay performance
• Determine assay quality/detection limits
(e.g. cycle time censoring threshold)
Be on the lookout for batch artifacts and process noise
32. 2. NORMALIZE & TRANSFORM
• Negative controls should be uniform
Encode sources of technical noise and model their effects
33. 2. NORMALIZE & TRANSFORM
• Negative controls should be uniform within:
processing batch (e.g. reagent lot)
Encode sources of technical noise and model their effects
34. 2. NORMALIZE & TRANSFORM
• Negative controls should be uniform within:
processing batch (e.g. reagent lot)
measurement batch (e.g. assay plate)
Encode sources of technical noise and model their effects
35. 2. NORMALIZE & TRANSFORM
• Negative controls should be uniform within:
processing batch (e.g. reagent lot)
measurement batch (e.g. assay plate)
fixed instrument settings
Encode sources of technical noise and model their effects
36. 2. NORMALIZE & TRANSFORM
• Negative controls should be uniform within:
processing batch (e.g. reagent lot)
measurement batch (e.g. assay plate)
fixed instrument settings
• Model and remove these effects if they differ
Encode sources of technical noise and model their effects
[Figure: U6 control distributions across batches]
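If the batch differences turn out to be simple additive shifts, alignment can be as simple as centering every batch on its control. A minimal pandas sketch, assuming a long-format table with 'batch', 'target', and 'ct' columns (the column names and the use of U6 as the control are illustrative, not the presenters' specification):

```python
import pandas as pd

def align_batches(df: pd.DataFrame, control: str = "U6") -> pd.DataFrame:
    """Remove additive batch shifts by aligning each batch's control
    median to the overall control median."""
    ctrl = df[df["target"] == control]
    overall = ctrl["ct"].median()
    # Per-batch shift = batch control median minus overall control median
    shift = ctrl.groupby("batch")["ct"].median() - overall
    out = df.copy()
    out["ct_adj"] = out["ct"] - out["batch"].map(shift)
    return out
```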
37. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
Encode sources of technical noise and model their effects
38. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
Encode sources of technical noise and model their effects
39. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
higher numbers mean more expression
Encode sources of technical noise and model their effects
40. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
higher numbers mean more expression
censored values become 0
Encode sources of technical noise and model their effects
41. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
Encode sources of technical noise and model their effects
Low cycle time values
on this axis…
42. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
Encode sources of technical noise and model their effects
…map to high expression
values on this axis
Low cycle time values
on this axis…
43. 2. NORMALIZE & TRANSFORM
• Summarize replicates by median (robust center)
• Transform cycle times to an expression scale:
Encode sources of technical noise and model their effects
…map to high expression
values on this axis
Low cycle time values
on this axis…
censored
values here
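A minimal sketch of the transform just described: expression = censoring threshold minus cycle time, with censored or missing readings mapped to 0. The 40-cycle threshold below is a common convention but is an assumption, not a value from the talk.

```python
import numpy as np

def to_expression(ct, max_cycles=40.0):
    """Map cycle times (Ct) to an expression scale.

    Low Ct (early amplification) maps to high expression; censored
    values (Ct at/beyond the threshold, or missing) map to 0.
    """
    ct = np.asarray(ct, dtype=float)
    expr = max_cycles - ct
    expr[np.isnan(ct) | (expr < 0)] = 0.0
    return expr

# e.g. to_expression([18.2, 33.5, np.nan, 40.0]) -> [21.8, 6.5, 0.0, 0.0]
```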
44. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
If they don't assay well on the technology, we can't use them
45. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
If they don't assay well on the technology, we can't use them
75% censoring with a
1:1 case:control ratio
means specificity can
never exceed 50%
(only 25% of samples yield a value,
so at most half the members of one
group can be flagged by expression)
46. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
• Cycle time accuracy should be mostly high
If they don't assay well on the technology, we can't use them
47. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
• Cycle time accuracy should be mostly high
If they don't assay well on the technology, we can't use them
Otherwise rankings
become unreliable
because the cycle
times are unreliable
48. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
• Cycle time accuracy should be mostly high
• Censoring should be unrelated to accuracy
If they don't assay well on the technology, we can't use them
49. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
• Cycle time accuracy should be mostly high
• Censoring should be unrelated to accuracy
If they don't assay well on the technology, we can't use them
A correlation here would mean
that measurement error is
blurring the distinction between
'expressed' and 'not expressed'
50. 3. FILTER UNSUITABLE TARGETS
• Targets should be reasonably attested
• Cycle time accuracy should be mostly high
• Censoring should be unrelated to accuracy
If they don't assay well on the technology, we can't use them
In our UH2 study, out of 411
well-measured targets we
were able to filter 260 (63%)
as unlikely to be viable
biomarkers in the technology
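A hypothetical sketch of these filters in pandas and scipy. The 'accuracy' column stands in for whatever per-well quality score the platform reports, and the cutoffs are placeholders rather than the study's actual criteria:

```python
import pandas as pd
from scipy.stats import pointbiserialr

def filter_targets(df: pd.DataFrame, max_censoring=0.75, min_accuracy=0.9):
    """df has one row per well: 'target', 'censored' (bool), 'accuracy'."""
    stats = df.groupby("target").agg(
        cens_frac=("censored", "mean"),   # attestation: not mostly censored
        med_acc=("accuracy", "median"),   # cycle time accuracy mostly high
    )
    keep = stats[(stats["cens_frac"] <= max_censoring) &
                 (stats["med_acc"] >= min_accuracy)].index
    # Censoring should be unrelated to accuracy: a strong correlation here
    # would mean measurement error is blurring 'expressed' vs 'not expressed'
    r, p = pointbiserialr(df["censored"].astype(int), df["accuracy"])
    return df[df["target"].isin(keep)], (r, p)
```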
51. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Different looks give a robust assessment of validity
52. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for individual markers
Different looks give a robust assessment of validity
53. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for individual markers:
o Are cycle time counts equal? (LOG-RANK TEST)
Different looks give a robust assessment of validity
strong expression-disease association
54. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for individual markers:
o Are cycle time counts equal? (LOG-RANK TEST)
Different looks give a robust assessment of validity
strong expression-disease association
Log-rank tests properly
account for censoring
in the cycle times
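A sketch of the log-rank comparison using the Python lifelines package (our tool choice for illustration; the talk does not name software). Cycle time is treated like a survival time, and a well that never amplifies is right-censored at the maximum cycle attempted:

```python
from lifelines.statistics import logrank_test

# Made-up cycle times; 40.0 with event flag 0 means censored at 40 cycles
ct_cases  = [22.1, 25.4, 40.0, 30.2]
obs_cases = [1, 1, 0, 1]          # 1 = amplified, 0 = censored
ct_ctrls  = [28.3, 40.0, 40.0, 33.7]
obs_ctrls = [1, 0, 0, 1]

result = logrank_test(ct_cases, ct_ctrls,
                      event_observed_A=obs_cases,
                      event_observed_B=obs_ctrls)
print(result.p_value)  # small p suggests an expression-disease association
```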
55. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for individual markers:
o Do cycle times cluster by group? (ROC ANALYSIS)
Different looks give a robust assessment of validity
large group separation in expression
56. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for individual markers:
o Do cycle times cluster by group? (ROC ANALYSIS)
Different looks give a robust assessment of validity
large group separation in expression
ROC analysis is designed
to compare entire
distributions of values
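A minimal ROC sketch with scikit-learn on made-up expression values; an AUC of 0.5 means no group separation and 1.0 means complete separation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

expression = np.array([6.5, 0.0, 12.1, 3.3, 9.8, 0.0, 14.2, 1.1])
is_case    = np.array([0,   0,   1,    0,   1,   0,   1,    1])
print(roc_auc_score(is_case, expression))  # compares whole distributions
```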
57. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for groups of markers
Different looks give a robust assessment of validity
58. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for groups of markers:
o Do target signals overlap? (RANDOM FOREST)
Different looks give a robust assessment of validity
robust classification across many random trees
59. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for groups of markers:
o Do target signals overlap? (RANDOM FOREST)
Different looks give a robust assessment of validity
robust classification across many random trees
Random forests can
capture complex cross-
marker interactions
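A random-forest sketch with scikit-learn, assuming an expression matrix X (samples × markers) and case labels y prepared upstream; the hyperparameters are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)                    # X, y assumed prepared upstream
print(rf.oob_score_)            # out-of-bag classification accuracy
print(rf.feature_importances_)  # relative contribution of each marker
```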
60. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for groups of markers:
o Do signals transcend models? (ALTERNATE CLASSIFIERS)
Different looks give a robust assessment of validity
61. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
Questions for groups of markers:
o Do signals transcend models? (ALTERNATE CLASSIFIERS)
Different looks give a robust assessment of validity
More than one way to grow a random forest
A random forest is a resampling-based aggregate of decision
trees that attempts to average over the many possible
trees that could be formed from a set of predictor variables.
But the decision trees could use different rules, e.g.:
• CART (Classification And Regression Trees)
• CFOREST (Conditional inference tree FORESTs)
• CHAID (CHi-squared Automatic Interaction Detection)
• BOOST (BOOSTed classification trees)
All are kinds of random forests, but their component trees
decide differently.
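CFOREST and CHAID implementations live mostly in the R ecosystem, so as a stand-in here is a scikit-learn sketch that makes the same point with the ensembles it does ship: if the signal is real, differently grown tree ensembles should agree.

```python
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score

models = {
    "CART-style forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "Extremely randomized trees": ExtraTreesClassifier(n_estimators=500, random_state=0),
    "Boosted trees": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():  # X, y assumed prepared upstream
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(name, auc)
```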
62. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
• Consensus in selection suggests signal validity
Different looks give a robust assessment of validity
selected combinations of markers classify well
63. 4. USE MULTIPLE STATISTICAL METHODS
• Different tests offer different views of classification
• Consensus in selection suggests signal validity
Different looks give a robust assessment of validity
selected combinations of markers classify well
In our UH2 study, we were
able to reduce a set of
hundreds of candidate
miRNAs down to just a few
dozen demonstrating good
classification performance.
64. 5. CROSS-VALIDATE AND RANK
• Use multiple imputation to fill in missing data and
simulate population-plausible datasets
Get expectations for independent validation and prioritize accordingly
65. 5. CROSS-VALIDATE AND RANK
• Use multiple imputation to fill in missing data and
simulate population-plausible datasets
• Leave-one-out resampling gives estimates of
prediction ability in a new independent cohort
Get expectations for independent validation and prioritize accordingly
66. 5. CROSS-VALIDATE AND RANK
• Use multiple imputation to fill in missing data and
simulate population-plausible datasets
• Leave-one-out resampling gives estimates of
prediction ability in a new independent cohort
Get expectations for independent validation and prioritize accordingly
Take the average of
a bunch of informed
guesses…
67. 5. CROSS-VALIDATE AND RANK
• Use multiple imputation to fill in missing data and
simulate population-plausible datasets
• Leave-one-out resampling gives estimates of
prediction ability in a new independent cohort
Get expectations for independent validation and prioritize accordingly
Take the average of
a bunch of informed
guesses…
…and make an
informed guess about
future performance
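A simplified scikit-learn sketch of the imputation-plus-leave-one-out idea. For brevity it imputes each dataset once over the full matrix, which is looser than a fully nested procedure, and the imputation model is our assumption rather than the presenters':

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

accs = []
for seed in range(10):  # 10 stochastic imputations of the missing values
    X_imp = IterativeImputer(sample_posterior=True,
                             random_state=seed).fit_transform(X)
    clf = LogisticRegression(max_iter=1000)
    accs.append(cross_val_score(clf, X_imp, y, cv=LeaveOneOut()).mean())
print(np.mean(accs))  # averaged informed guess at out-of-cohort accuracy
```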
68. 5. CROSS-VALIDATE AND RANK
• Assess multimarker classification performance in all
possible groupings of top candidates
Get expectations for independent validation and prioritize accordingly
69. 5. CROSS-VALIDATE AND RANK
• Assess multimarker classification performance in all
possible groupings of top candidates
Bayesian model averaging accounts for uncertainty about
"the right model"
Get expectations for independent validation and prioritize accordingly
70. 5. CROSS-VALIDATE AND RANK
• Assess multimarker classification performance in all
possible groupings of top candidates
Bayesian model averaging accounts for uncertainty about
"the right model"
targets ranked by frequency
of inclusion in models,
weighted by goodness of fit
Get expectations for independent validation and prioritize accordingly
71. 5. CROSS-VALIDATE AND RANK
• Assess multimarker classification performance in all
possible groupings of top candidates:
Bayesian model averaging accounts for uncertainty about
"the right model"
targets ranked by frequency
of inclusion in models,
weighted by goodness of fit
compare to any existing
biomarkers and look for
independent signaling
Get expectations for independent validation and prioritize accordingly
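A sketch of the principle behind this ranking (not the presenters' implementation): enumerate small logistic models, weight each by goodness of fit using BIC as a stand-in for the model posterior, and sum the weights of the models in which each marker appears.

```python
import itertools
import numpy as np
import statsmodels.api as sm

def inclusion_probs(X, y, max_size=3):
    """X: DataFrame of candidate markers; y: 0/1 outcome."""
    bics, included = [], []
    for k in range(1, max_size + 1):
        for cols in itertools.combinations(X.columns, k):
            fit = sm.Logit(y, sm.add_constant(X[list(cols)])).fit(disp=0)
            bics.append(fit.bic)
            included.append(set(cols))
    bics = np.array(bics)
    w = np.exp(-0.5 * (bics - bics.min()))  # BIC-based posterior weights
    w /= w.sum()
    # Weighted frequency of inclusion, per marker
    return {c: w[[c in s for s in included]].sum() for c in X.columns}
```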
72. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Before starting the validation, compare the shape of
response distributions in the old and new cohorts
73. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Before starting the validation, compare the shape of
response distributions in the old and new cohorts
Differences in skew (whether the distribution leans one
way or the other) and kurtosis (how centered vs diffuse
the distribution is) could indicate poorly matched cohorts
74. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Before starting the validation, compare the shape of
response distributions in the old and new cohorts
Differences in skew (whether the distribution leans one
way or the other) and kurtosis (how centered vs diffuse
the distribution is) could indicate poorly matched cohorts
The marker means
approximately line
up in both cohorts
75. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Before starting the validation, compare the shape of
response distributions in the old and new cohorts
Differences in skew (whether the distribution leans one
way or the other) and kurtosis (how centered vs diffuse
the distribution is) could indicate poorly matched cohorts
But some markers in
the old cohort fell on the
extreme edges of the
distribution in the new
cohort — evidence of
potentially large skew in
the discovery cohort
76. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Before starting the validation, compare the shape of
response distributions in the old and new cohorts
Differences in skew (whether the distribution leans one
way or the other) and kurtosis (how centered vs diffuse
the distribution is) could indicate poorly matched cohorts
Central thinness in some of the
distributions is an indication of
low kurtosis — pointing to a
possible admixture of dissimilar
subjects in the population
sampled by the new cohort
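A quick shape check with scipy, assuming per-marker arrays of values from the two cohorts; scipy's kurtosis is the excess (Fisher) kurtosis, so 0 corresponds to a normal distribution:

```python
from scipy.stats import skew, kurtosis

for marker in markers:  # marker names, defined upstream
    old_vals = old_cohort[marker]   # discovery-cohort values
    new_vals = new_cohort[marker]   # validation-cohort values
    print(marker,
          skew(old_vals), skew(new_vals),
          kurtosis(old_vals), kurtosis(new_vals))
```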
77. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Verify that marker relevance assumptions hold
78. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Verify that marker relevance assumptions hold
These were chosen
as nondiscriminating
miRNAs…are they?
79. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Verify that marker relevance assumptions hold
80. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Verify that marker relevance assumptions hold
These were chosen
as discriminating
miRNAs…are they?
81. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Verify that marker relevance assumptions hold
82. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
83. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
84. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
Based on UH2 patterns, we
expect the miRNA-only curve
to pass through this region
85. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
86. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
87. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
Happily, the miRNA-only curve
behaves exactly as expected
88. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
Similarly, we expect the bump
in performance from adding a
genetic marker to only kick in
at relatively low specificities
89. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
90. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
91. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
Equally happily, the curve with the
genetic marker doesn't behave as
expected — it's even better
92. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for discrepancies in classification performance
patterns between the cohorts
Why would the genetic marker behave
differently in the two cohorts?
One explanation is that the discovery cohort
was less healthy — the symptoms were already
so strong that the genetic factor no longer
added much new information. When the
disease is less severe and cases are more
similar to controls, the genetic information
boosts sensitivity for the borderline cases.
These kinds of nuances can be very valuable
for deciding not only who the biomarker
screening should be applied to, but also when.
93. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
94. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
= low rank numbers for stronger markers
= middling rank numbers for mediocre markers
= high rank numbers for weaker markers
Apply several different methods of ranking
(i.e. "judges") to the set of markers —
these are the same statistical tests used
for biomarker discovery, but now the goal
is to prioritize rather than exclude
95. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
= low rank numbers for stronger markers
= middling rank numbers for mediocre markers
= high rank numbers for weaker markers
Note that one of the judges is the
ranking created in the discovery phase,
prior to seeing the validation data
96. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
= low rank numbers for stronger markers
= middling rank numbers for mediocre markers
= high rank numbers for weaker markers
Each judge independently
ranks the candidate markers
in order (1=best, 26=worst)
97. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
= low rank numbers for stronger markers
= middling rank numbers for mediocre markers
= high rank numbers for weaker markers
Then ranks for each marker
are summed across judges
98. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
= low rank numbers for stronger markers
= middling rank numbers for mediocre markers
= high rank numbers for weaker markers
We color-code the table to
visually assess the
consistency of rankings
99. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
The rank sums define an
ordering of markers — this is
our consensus opinion
across evaluation methods
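The aggregation itself is simple; a pandas sketch, assuming a scores table with one row per marker and one column per judge, where higher scores mean stronger markers:

```python
import pandas as pd

def consensus_ranks(scores: pd.DataFrame) -> pd.Series:
    ranks = scores.rank(ascending=False)  # 1 = best, within each judge
    rank_sums = ranks.sum(axis=1)         # sum ranks across judges
    return rank_sums.sort_values()        # low sum = consensus favorite
```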
100. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
This means that rank sums
within 56 of each other
could be randomly assigned
with high probability, but
gaps larger than that are
likely to be qualitative
101. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
[Figure: rank-sum distribution with spans of 56 ranks marked]
102. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
We could think of the rank
sum distribution as a
roughly even mixture of two
kinds of markers: "hot"
markers and "cool" markers
(where in the middle it's
hard to tell which is which)
103. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Reprioritize markers based on how well they held
up as predictive in the new cohort
"Hot"
markers
"Lukewarm"
markers
"Cool"
markers
104. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
105. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
If the markers we think are important really are, then they
should contribute the most to multimarker models
106. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
If the markers we think are important really are, then they
should contribute the most to multimarker models
Nearly all possible
parsimonious models,
color-coded by number of
markers in the model
107. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
If the markers we think are important really are, then they
should contribute the most to multimarker models
Best models are here —
high AUC (strong),
low AIC (parsimonious)
108. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
If the markers we think are important really are, then they
should contribute the most to multimarker models
Which markers contribute
most to the best models?
109. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Seek internal validation of the marker prioritization
If the markers we think are important really are, then they
should contribute the most to multimarker models
The highest ranked ones!
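A sketch of such a model scan with statsmodels and scikit-learn, recording AIC (parsimony) and in-sample AUC (strength) for every small logistic model; marker subsets up to size 3 are purely illustrative:

```python
import itertools
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

results = []  # X: DataFrame of markers, y: 0/1 labels, assumed upstream
for k in range(1, 4):
    for cols in itertools.combinations(X.columns, k):
        fit = sm.Logit(y, sm.add_constant(X[list(cols)])).fit(disp=0)
        auc = roc_auc_score(y, fit.predict())  # in-sample probabilities
        results.append((cols, fit.aic, auc))
# Best models: high AUC, low AIC; tally which markers recur among them
```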
110. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Assess the overall quality of group separation
111. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Assess the overall quality of group separation
We may never be able to
screen these kinds of
cases with our markers
112. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Assess the overall quality of group separation
But many of these could
be latent cases…!
113. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Evaluate performance ranges for model classes
114. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Evaluate performance ranges for model classes
These regions fare no
better than existing
clinical markers
115. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Evaluate performance ranges for model classes
High-performance
regions can only be
reached with a
sufficient number of
markers to allow
clear discrimination
116. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for differential performance of the classifiers
within clinically relevant covariate subgroups
117. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for differential performance of the classifiers
within clinically relevant covariate subgroups
A clever way to do this is to cluster the subjects and then
examine covariates in the tightest clusters
118. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for differential performance of the classifiers
within clinically relevant covariate subgroups
A clever way to do this is to cluster the subjects and then
examine covariates in the tightest clusters
The markers are more
sensitive in Cluster 2 than
in Cluster 1, and Cluster 2
also has more males
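A clustering sketch with scikit-learn; KMeans, two clusters, and the 'male' covariate column are all illustrative assumptions, since the talk does not name the clustering algorithm or the covariate coding:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Xs = StandardScaler().fit_transform(X)  # X: subjects x markers
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xs)
covariates["cluster"] = labels          # covariates: per-subject table
print(covariates.groupby("cluster")["male"].mean())  # sex mix per cluster
```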
119. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for trending of discrimination performance
across covariate spectra
120. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for trending of discrimination performance
across covariate spectra
The relationships may be marker-specific! Or nonlinear!
121. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for trending of discrimination performance
across covariate spectra
The relationships may be marker-specific! Or nonlinear!
Higher marker ranks seem to
correlate with marker
associations to the covariate…
122. 6. VALIDATE!
Verify that the markers behave as expected in an independent sample set
• Look for trending of discrimination performance
across covariate spectra
The relationships may be marker-specific! Or nonlinear!
…but only for the strongest
markers. Weaker markers
don't show such a clear pattern
123. NEXT STEPS
Co-authors:
Theresa Lusardi
Jay Phillips
Jack Wiedrick
Chris Harrington
Babette Lind
Jodi Lapidus
Joe Quinn
Julie Saugstad
"MicroRNAs in Human Cerebrospinal Fluid
as Biomarkers for Alzheimer's Disease,"
Journal of Alzheimer's Disease,
vol. 55, no. 3, pp. 1223-1233, 2017.
DOI: 10.3233/JAD-160835
• An abbreviated discussion
of this pipeline appeared in
our recent UH2 paper in
the Journal of Alzheimer's
Disease
• Publication of a follow-up
UH3 paper on validation
results is in progress
• Also currently working on a
standalone methods paper
– Robust Statistical Analysis
Pipeline for Selecting Promising
Biomarkers from RT-qPCR
Experiments
Wiedrick, Lusardi, Saugstad, Lapidus