Correcting bias and variation in small RNA sequencing for optimal (microRNA) biomarker discovery and validation in cardio-metabolic (and renal) disease
Presentation given about the Generalized Additive Model Location, Scale and Shape (GAMLSS) methodology for the analysis of small RNA sequencing data and the potential of microRNAs as biomarkers for kidney and cardiometabolic diseases
1. Correcting bias and variation in
small RNA sequencing for optimal
(microRNA) biomarker discovery
and validation in cardio-metabolic
(and renal) disease
Christos Argyropoulos MD, PhD, FASN
Department of Internal Medicine
Division of Nephrology
University of New Mexico Health Sciences Center
2. Overview
• Models of sequence counts in short RNA-seq
experiments
• Estimating and controlling for bias in small RNA-seq
experiments
• Statistical approaches to analyzing differential
expression
• MicroRNA regulation – a control theory perspective
• MicroRNAs as biomarkers in diabetes, renal and
cardiometabolic disease
• Leveraging our approach for optimal biomarker
discovery
3. Signals in short RNA-seq
data
Building a model from first principles
4. Background
• Short RNA-seq data are becoming more and more
abundant
• There is poor reproducibility of findings between
and within research groups
• Systematic measurement biases confound findings
• Systematic variation relatively stable within protocols
• Systematic variation unpredictable between different
protocols and platforms
• Statistical methods may be used to explore and
address such biases
• Existing approaches are phenomenological descriptions
• what do model parameters stand for?
• how can one best use these models?
5. Building a model from first
principles
• Establish testable predictions that may be verified
in existing datasets
• Establish correspondence between model
parameters and experimental steps
• Use this model to understand and correct
systematic and random bias in short RNA-seq
• Embed the model into more general frameworks
for applications:
• Epidemiological
• Biomarker discovery and validation
• Medical diagnostics
6. The short RNA-seq experiment
The vendor’s view The biochemist’s view
https://doi.org/10.1093/nar/gkt1021
http://www.genomics.hk/SamllRna.htm
http://www.geospiza.com/Products/SmallRNA.shtml
7. Conceptual model of the short RNA-seq
experiment (this is what we will talk about)
X1 , X2 , … , Xn: abundance in original preparation
  ↓ ligation efficiency (fi)
Λ1 , Λ2 , … , Λn: abundance in adapted (ligated) sample
  ↓ number of PCR cycles (N), PCR efficiency (qi)
L1^N , L2^N , … , Ln^N: abundance in PCR amplified library
  ↓ probability of capture (si), number of probes (K), library dilution factor (d)
B1 , B2 , … , Bn: abundance in capture probes
  ↓ probability of signal generation (r), probability of sequence generation (pi)
Y1 , Y2 , … , Yn: abundance of counts in fastq files
8. Modeling the qPCR amplification reaction
• Statistics of PCR amplification
• Branching (Galton-Watson) process
• GW distribution only available implicitly i.e.
through simulation
• Large scale simulations to derive
approximation to the GW process
• PCR literature, GW theory, martingale arguments →
candidate distributions
• Information theory arguments used to compute
distance between GW samples and the
approximate distributions
• A (truncated) Normal distribution derived at the
end
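The branching-process view of PCR above can be sketched in a few lines of Python. This is a minimal illustration, not the simulation used in the talk: it assumes every molecule is copied independently with a fixed per-cycle efficiency q, and the function name and parameter values are hypothetical.

```python
import random
import statistics

def simulate_pcr(n0, q, n_cycles, rng):
    """Galton-Watson sketch: each molecule is copied with probability q per cycle."""
    n = n0
    for _ in range(n_cycles):
        # Each of the n molecules independently yields one extra copy with probability q.
        n += sum(1 for _ in range(n) if rng.random() < q)
    return n

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical settings: 10 starting molecules, 80% efficiency, 8 cycles.
    yields = [simulate_pcr(n0=10, q=0.8, n_cycles=8, rng=rng) for _ in range(200)]
    expected = 10 * (1 + 0.8) ** 8  # theoretical mean of the branching process
    print(f"simulated mean {statistics.mean(yields):.0f}, theoretical mean {expected:.0f}")
```

The spread of the simulated yields around the theoretical mean is the random PCR variation that the approximating distribution has to capture.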
9. Flattening the hierarchy through
marginalization
Integrate sources of variations out of the
model:
1. library sequence depth variation
2. PCR amplification
Final statistical model is about absolute
counts
• Direct modeling ≠ % of counts
• Limit of approximation encompasses all possible
sample compositions
• The result is a truncated Normal-Poisson mixture
distribution (approximated via a Negative
Binomial or Linear Quadratic Gaussian family)
Model implements a Linear-Quadratic (LQ)
mean-variance relationship
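A linear-quadratic mean-variance relationship arises for any Poisson mixture through the law of total variance. The short derivation below is a sketch under an illustrative assumption: the latent (amplified) abundance, written here as λ, has mean μ and variance φμ². The symbol λ and the form of its variance are labels chosen for this example, not notation taken from the slides.

```latex
\begin{aligned}
\operatorname{E}[Y] &= \operatorname{E}\bigl[\operatorname{E}[Y \mid \lambda]\bigr] = \operatorname{E}[\lambda] = \mu, \\
\operatorname{Var}[Y] &= \operatorname{E}\bigl[\operatorname{Var}[Y \mid \lambda]\bigr]
  + \operatorname{Var}\bigl[\operatorname{E}[Y \mid \lambda]\bigr]
  = \mu + \phi\mu^{2} = \mu\,(1 + \phi\mu).
\end{aligned}
```

The linear term is the Poisson sampling noise and the quadratic term is the variation carried over from the steps that were integrated out.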
10. Distributional Regression for RNA-
seq data
LQ relationship between mean ($\mu$) and variance ($\sigma_{LQ}^2$):
$\sigma_{LQ}^2 = \mu(1 + \phi\mu)$
• The variance and the mean
have to be modelled
concurrently
• Unless the variance is modelled,
the statistics are inconsistent and
p values are too small (overoptimistic)
• Realm of distributional
regression models (GAMLSS –
Generalized Additive Models
for Location, Scale and Shape)
• One can re-use existing software
frameworks to fit such models
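The LQ relationship can be checked numerically. The sketch below draws Negative Binomial counts as a Gamma-Poisson mixture and compares the sample variance with μ(1 + φμ). It is an illustration only: the Gamma mixing distribution, the function names and the parameter values are assumptions of this example, and no GAMLSS software is involved.

```python
import math
import random
import statistics

def lq_variance(mu, phi):
    """Linear-quadratic variance: sigma^2 = mu * (1 + phi * mu)."""
    return mu * (1.0 + phi * mu)

def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the moderate rates used here."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def nb_draw(mu, phi, rng):
    """Negative Binomial draw as a Gamma-Poisson mixture with mean mu and dispersion phi."""
    lam = rng.gammavariate(1.0 / phi, mu * phi)  # shape 1/phi, scale mu*phi
    return poisson_draw(lam, rng)

if __name__ == "__main__":
    rng = random.Random(42)
    mu, phi = 20.0, 0.5  # hypothetical mean and dispersion
    counts = [nb_draw(mu, phi, rng) for _ in range(20000)]
    print("sample mean    :", round(statistics.mean(counts), 2))
    print("sample variance:", round(statistics.variance(counts), 1))
    print("LQ prediction  :", lq_variance(mu, phi))
```

A model that fixes the variance at the Poisson value μ would understate the spread here by a factor of 1 + φμ, which is where the overoptimistic p values come from.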
11. Validating model(s) with synthetic
mixes of known composition
• Allow one to test the “backbone” of the model
without worrying about the adequacy of the
modeling of biology
• Sequencing of equimolar mixes:
• Explore and model systematic bias in the same protocol
• Sequencing of dilution series or non-equimolar
mixes:
• “Dose-response” curve of the bias
• Examination of “debiasing” approaches for the ability to
uncover the truth
• Model may also be used to analyze the
performance of differential expression algorithms
15. Enzymatic mechanism of RNA ligation
• The kinetics of RNA ligation were investigated thoroughly
in the 1970s and early 1980s
• The intermolecular reaction is relevant to RNA-seq
• The mechanism involves three, fully reversible, steps that
obey ping-pong ordered kinetics and are subject to
substrate inhibition
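$E + ATP \underset{k_{-1}}{\overset{k_{1}}{\rightleftharpoons}} E \cdot ATP \underset{k_{-1a}}{\overset{k_{1a}}{\rightleftharpoons}} E\text{-}AMP + PP_i$
$E\text{-}AMP + D \underset{k_{-2}}{\overset{k_{2}}{\rightleftharpoons}} E \cdot App\text{-}D \underset{k_{-2a}}{\overset{k_{2a}}{\rightleftharpoons}} E + App\text{-}D$
$E \cdot App\text{-}D + A \underset{k_{-3}}{\overset{k_{3}}{\rightleftharpoons}} E \cdot App\text{-}D \cdot A \underset{k_{-3a}}{\overset{k_{3a}}{\rightleftharpoons}} AMP + E + AD$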
Bias in RNA ligation was noted in these early investigations, and the enzyme was
never used as a tool in synthetic chemistry, as solid phase methods took off in the 1980s
16. Kinetic analysis of ligase reaction
velocity in RNA-seq protocols
• Existing protocols include abundant cofactors (sharp
contrast to the experiments in 1970s)
Drive reaction to the right
Rate limiting single step reaction instead of tri-step one
Substrate preference (bias in reaction yields) is not eliminated
• Multi-substrate inhibition from all biosample sequences
available for ligation
Analytical series approximation for ratios of random variables
• Ligase operates at the 1st order domain of Michaelis-
Menten kinetics
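$V_i = \dfrac{V_i^{max}\,\frac{X_i}{K_M^i}}{1 + \sum_j \frac{X_j}{K_M^j}} \approx \dfrac{V_i^{max}\,\frac{X_i}{K_M^i}}{1 + n\,\frac{E[X]}{E[K_M]}} = \dfrac{V_i^{max}\,\frac{X_i}{K_M^i}}{1 + \frac{C_{Total}(0)}{E[K_M]}} \approx V_i^{max}\,\frac{X_i}{K_M^i}$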
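A quick numerical check of the first-order approximation: when the total substrate concentration is far below the Michaelis constants, the competitive velocity and the first-order velocity nearly coincide. The concentrations, K_M and V_max values below are hypothetical, in arbitrary units, and chosen only to illustrate the limit.

```python
def competitive_velocity(i, x, vmax, km):
    """Michaelis-Menten velocity of substrate i competing with all other substrates."""
    competition = sum(xj / kj for xj, kj in zip(x, km))
    return vmax[i] * (x[i] / km[i]) / (1.0 + competition)

def first_order_velocity(i, x, vmax, km):
    """First-order approximation: V_i ~ V_i^max * X_i / K_M^i."""
    return vmax[i] * x[i] / km[i]

if __name__ == "__main__":
    # Hypothetical values: every concentration is far below its K_M.
    x = [0.001, 0.002, 0.0005]
    km = [1.0, 2.0, 0.5]
    vmax = [10.0, 5.0, 8.0]
    for i in range(len(x)):
        exact = competitive_velocity(i, x, vmax, km)
        approx = first_order_velocity(i, x, vmax, km)
        print(f"substrate {i}: exact {exact:.6f}, first-order {approx:.6f}")
```

In this regime the denominator is close to 1, so each sequence is ligated at a rate that depends only on its own abundance and its own V_max/K_M ratio.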
17. Testable model predictions about
ligase bias in RNA-seq experiments
Mathematical expression
• $X_i\left(1 - \exp\left(-\frac{V_i^{max}}{K_M^i} T_R\right)\right) = X_i f_i^{5'}$
• $\Lambda_i = X_i f_i^{5'} f_i^{3'} = X_i f_i$
Implications for ligase bias
• Concentration
independence
• Sample composition
independence
• Transferable within
experiments done with the
same protocol
• Protocol dependent
(reaction velocity
incorporates concentration
of cofactors and enzyme)
• Sequence equimolar mixes to
derive empirical correction
factors for ligase bias
• Apply those to biological
samples (“offsets” in
distributional regression) to
eliminate bias
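The idea of empirical correction factors can be illustrated with a deliberately simplified sketch. In an equimolar mix every sequence has the same input, so differences in log counts reflect bias; those differences can then be subtracted from the log counts of another sample. In the approach described above the factors enter a distributional regression as offsets. The plain subtraction, the function names and the counts below are assumptions made for this example.

```python
import math

def log_bias_factors(equimolar_counts):
    """Log-scale bias of each sequence relative to the mean log count of an equimolar run."""
    logs = [math.log(c) for c in equimolar_counts]
    centre = sum(logs) / len(logs)
    return [value - centre for value in logs]

def corrected_log_counts(sample_counts, offsets):
    """Subtract the calibration offsets from the log counts of another sample."""
    return [math.log(c) - o for c, o in zip(sample_counts, offsets)]

if __name__ == "__main__":
    calibration = [400, 100, 25, 100]  # hypothetical equimolar run: equal input, unequal counts
    offsets = log_bias_factors(calibration)
    sample = [800, 200, 50, 200]       # hypothetical second equimolar run at twice the input
    print([round(v, 3) for v in corrected_log_counts(sample, offsets)])
```

After correction all four sequences report the same log abundance, which is what concentration-independent, protocol-specific bias factors predict for a second equimolar run.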
19. Application of bias factors virtually
eliminates ligase bias
Monte Carlo Cross Validation in 3 equimolar datasets: randomly split the
dataset into learning and testing subsets, learn the correction factor on the
learning subset and apply it to correct the estimates of the testing subset. Repeat N times
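The cross-validation loop described above can be sketched as follows. The data are synthetic (a fixed per-sequence bias plus small noise on the log scale) and the correction is a plain subtraction of learned offsets. Both are stand-ins chosen for illustration, not the datasets or the regression model used in the talk, and all names are hypothetical.

```python
import math
import random

def rmse_from_row_mean(rows):
    """RMSE of each log count around the mean of its own sample (0 for an unbiased equimolar run)."""
    squared, n = 0.0, 0
    for row in rows:
        centre = sum(row) / len(row)
        squared += sum((v - centre) ** 2 for v in row)
        n += len(row)
    return math.sqrt(squared / n)

def learn_offsets(rows):
    """Average, over the learning samples, of each sequence's deviation from its sample mean."""
    n_seq = len(rows[0])
    centred = []
    for row in rows:
        centre = sum(row) / n_seq
        centred.append([v - centre for v in row])
    return [sum(r[i] for r in centred) / len(rows) for i in range(n_seq)]

def monte_carlo_cv(rows, n_repeats, rng):
    """Repeatedly split samples at random, learn on one half and correct the held-out half."""
    results = []
    for _ in range(n_repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        learn, test = shuffled[:half], shuffled[half:]
        offsets = learn_offsets(learn)
        corrected = [[v - o for v, o in zip(row, offsets)] for row in test]
        results.append((rmse_from_row_mean(test), rmse_from_row_mean(corrected)))
    return results

def synthetic_equimolar(n_samples, n_seq, rng):
    """Synthetic log counts: a fixed per-sequence bias plus small sample-level noise."""
    bias = [rng.gauss(0.0, 1.0) for _ in range(n_seq)]
    return [[5.0 + b + rng.gauss(0.0, 0.1) for b in bias] for _ in range(n_samples)]

if __name__ == "__main__":
    rng = random.Random(3)
    data = synthetic_equimolar(n_samples=12, n_seq=50, rng=rng)
    for before, after in monte_carlo_cv(data, n_repeats=5, rng=rng):
        print(f"RMSE before {before:.3f} -> after {after:.3f}")
```

Because the offsets are learned on samples that are never used for evaluation, the reduction in RMSE measures how well the bias transfers between runs rather than how well it fits a single run.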
20. Empirical factors nearly eliminate bias
between equimolar datasets with 10x
different input (Galas Lab legacy datasets)
22. Design of Validation Experiments
What has been established?
• Moderate
concentration
independence
• Ability to nearly
eliminate bias over at
least two orders of
magnitude
• Legacy
platforms/experiments
What needs to be proven?
• Concentration
independence over >2
orders of magnitude
• Sample composition
independence
• Recovery of differential
expression measures
• Any value relative to
existing approaches?
23. Validation Experiments
Collaboration between PNRI (Galas Lab) and UNM (DoIM)
The largest, single protocol, technical series to date (GSE93399)
Experimental Group | Dilution | N
miRExplore (972 short RNAs) | 1:10 | 10
286 miRNAs | 1:1 | 8
286 miRNAs | 1:10 | 8
286 miRNAs | 1:100 | 8
286 miRNAs | 1:1000 | 8
Ratiometric Series A (descending) | Mix of 286 subpool A (1:1), 286 subpool B (1:10), 286 subpool C (1:100), 286 subpool D (1:1000) | 8
Ratiometric Series B (ascending) | Mix of 286 subpool A (1:1000), 286 subpool B (1:100), 286 subpool C (1:10), 286 subpool D (1:1) | 8
Total: 7 groups (58 sequenced x 2 = 116)
24. Empirical bias correction over 3 orders
of magnitude in equimolar datasets
RMSE reduction: 77%-90% (input in calibration run differs by up to x10 from
target), 54%-67% otherwise
27. Bias Correction in Heterogeneous
Samples
• Correction factors remove
~55% of bias between
equimolar samples
• ~ 70% of RNAs have
expression within two fold
from the mean (from 23%)
• Bias reduction is ~40% in
ratiometric series
• ~63% of RNAs have
expression within x2 from
the mean (from 33%)
29. Our proposal for a model of
differential expression (DE) changes
Statistical formulation and
assumptions
$\log \mu_{i,j,k} = \alpha + \Delta_k + m_{i,0} + \delta_{i,k}$
$m_{i,0} \sim Normal(0, \sigma_{\mu_0}^2)$
$\delta_{i,k} \sim Normal(0, \sigma_k^2)$
(similar model for variance)
1. Expression in reference state is not
of prime scientific interest (can
omit correction for bias)
2. Technical sources of variation (PCR
efficiency, library sampling) of
much smaller magnitude than
biological variability
Parameter interpretation
and context of use
• Accommodates global
and sequence specific DE
changes
• Flexible modeling of
referent (global level and
variation around it)
• Still models counts
• No incorporation of
library specific factors
(model is un-normalized)
30. Existing Models for RNA-seq experiments
• Number of reads in sample j, assigned to species i ($K_{i,j}$)
• Assumed to follow a negative binomial distribution:
$K_{i,j} \sim NB(\mu_{i,j}, \sigma_{i,j}^2)$
• Mean = $m_{i,j} \times s_j$, where $s_j$ is a common scale
(coverage of the library, sequence depth)
• Variance = $\mu_{i,j} + a\mu_{i,j}^2$ (edgeR [1])
• Variance = $\mu_{i,j} + s_j^2 f(m_{i,j})$ (DESeq [2])
• Model for differential expression analysis (experimental effects):
$\log(m_{i,j})$ is modelled through a term indexed $(i,0)$, the miRNA
expression in the control group, and a term indexed $(i,1)$, the miRNA
expression in the experimental group
[1] Biostatistics 2008, 9:321-32
[2] Genome Biology 2010, 11:R106
31. Comparison of proposed approach
against existing methods
“We” (gamlss)
• Uses the NB or the LQNO
• LQ relation between mean and
variance
• Variance and mean parameters
are estimated simultaneously
• Explicit count based modeling
• Un-normalized
• Shrinkage via random effects
modeling
• Derived from first principles (a
generative probability model)
“They” (edgeR/DESeq2 etc)
• NB or the linear model
• LQ or flexible relation between
mean & variance
• Two stage procedure to
estimate parameters
• Models counts as % of a given
library depth
• Normalized (% sum to one)
• Shrinkage via random effects
modeling
• Ad hoc, phenomenological
probability model
32. Scenarios of differential expression
to assess method performance
• Clustered, symmetric differential expression
1. fraction of overexpressed sequences is equal to that of the
underexpressed
2. no change in global expression
over and underexpressed RNAs are present in equal numbers
and exhibit same degree of DE
• Asymmetric, clustered differential expression
1. Fraction of overexpressed sequences ≠ underexpressed
Drives global expression change to one direction
• Global Change: all RNAs exhibit a variable but consistent
directional change of expression
• No change
All scenarios implemented through the validation datasets
33. The GAMLSS has smaller RMSE than 10
popular workflows for DE analysis
• Performance
benefit seen under
scenarios of
asymmetric,
clustered
differential
expression changes
• When DE are
(nearly) symmetric,
many other
methods have
similar
performance
36. GAMLSS demonstrates the optimal
balance between False Omission and False
Discovery Rates
ROC Curve Analysis FDR and FOR
37. What did we just find out about
algorithms for DE analysis?
• Proposed method (GAMLSS) is the top performer:
• Symmetric, clustered, DE changes
• Asymmetric clustered, DE changes
• Asymmetric global, DE changes
• No DE change
Optimal balance between FDR and FOR
• Existing methods introduce moderate – to – severe bias
• force the overall DE to sum to zero (what goes up must be
accompanied by something that goes down)
• Voom/limma somewhat more resilient, near identical
performance to GAMLSS under symmetric DE
These patterns have not been seen before, because no one to
date has generated datasets with known composition/DE
38. Why do existing methods fail to deliver?
• Existing models for RNA-seq analysis, e.g.
DESeq and edgeR, can be derived from 1st
principles as approximations
• RNA-seq counts as % of library depth
• Valid for dilute samples, not dominated by a
few RNA species
• Library size (depth) and modeling counts as %
(a relic of the SAGE era) may be a disastrous
distraction
• The parameterization constrains DE over all RNAs
included in the analysis to sum to zero
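A toy example shows why modeling counts as a fraction of library depth can distort differential expression. It compares log2 fold changes computed on absolute abundances with those computed on within-library proportions. The abundances are hypothetical, and the proportion-based calculation is a bare-bones stand-in that does not reproduce the normalization of any specific package.

```python
import math

def log2_fc_absolute(control, treated):
    """Log2 fold change computed directly on absolute abundances."""
    return [math.log2(t / c) for c, t in zip(control, treated)]

def log2_fc_proportions(control, treated):
    """Log2 fold change after converting each library to fractions of its total."""
    total_c, total_t = sum(control), sum(treated)
    return [math.log2((t / total_t) / (c / total_c)) for c, t in zip(control, treated)]

if __name__ == "__main__":
    control = [100.0, 200.0, 400.0, 800.0]      # hypothetical absolute abundances
    global_up = [2.0 * c for c in control]      # every RNA doubles
    asymmetric = [800.0, 200.0, 400.0, 800.0]   # only the first RNA goes up (8-fold)
    print("global, absolute   :", log2_fc_absolute(control, global_up))
    print("global, proportions:", [round(v, 3) for v in log2_fc_proportions(control, global_up)])
    print("asym., absolute    :", log2_fc_absolute(control, asymmetric))
    print("asym., proportions :", [round(v, 3) for v in log2_fc_proportions(control, asymmetric)])
```

Under proportions, a global doubling disappears entirely, and in the asymmetric case the three unchanged RNAs appear under-expressed while the true 8-fold change is understated.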
39. Practical implications for experimentalists
(not using GAMLSS)
• Any change to the population of RNAs modelled (e.g.
filtering)→ different DE values from the same dataset
• Both type M (degree of DE changes) and type S (label
an over-expressed sequence to be under-expressed &
vice-versa) errors
• Up to 25% of estimated DE changes may be of the wrong direction
• Up to 100% of estimated DE changes may be of the wrong
magnitude
• RNA-seq findings will fail to validate against qPCR
• Reputation of RNA-seq as a semiquantitative technique of
poor reproducibility is due to statistical methodology
42. Control In Biological Systems Is Many-
To-Many, Cooperative And Patterned
Feala JD, et al. PLoS ONE 7(1): e29374. (2012)
Riba A et al PLoS Comput Biol 10(2): e1003490.
(2014)
Bipartite Control Network Topologies miRNA – Transcription Factor circuits
Feed Forward Loop: master
control layout in many natural
and artificial control systems
43. How do we control things?
Predictably simple
(open loop)
Error Correcting
(feedback)
Model based
(feed forward)
44. Feed forward control
• Control element responds to a change in the
environment in a predefined manner
• Based on prediction of plant (“what is being
controlled”) behavior (requires model of the
system)
• Can react before error actually occurs (stabilizing
the system, e.g. cerebellum control of balance)
• Benefits: reduced hysteresis, increased accuracy,
cost-efficiency, lower “wear-tear”
45. Practical implications
• miRNAs function as master controllers in FFLs
• biology is intrinsically NOT model free
• miRNA profiling reveals the “plant” dynamics of
complex biological processes
• Emerging data suggest that sequence variation may underlie
(dys-)regulation
• miRNA associations are by definition causal to some
aspects of a particular phenotype
• “a priori plausible” biomarkers
• direct therapeutic implications
• Examination of the “plant” (targets) may have
implications for microRNA research
• Context for the interpretation of microRNA changes
• “Stronger” biomarker signatures
46. microRNAs are rational candidates for
exploring paradigm shifts in biology
• Ubiquity-conservation
• Breadth & width of regulation (>60% of genes)
• Context-specificity (“meta-controller”)
• Master Controllers in Feed Forward Loops
These arguments are not disease area specific (e.g. they apply
equally well to cancer or even psychiatric disease)
48. Facts, figures and the natural history of
cardiometabolic and renal disease in diabetes
• 8-10% of the population suffer from diabetes
• 20-30% of patients with diabetes will develop evidence of
diabetic chronic kidney disease (DKD/CKD)
• DKD progresses in stages of increasing proteinuria
• 50% of patients with overt nephropathy will develop End
Stage Renal Disease (ESRD) within 10 years
• The end result: diabetic nephropathy is the leading cause
of ESRD requiring dialysis or kidney transplantation,
accounting for 40% of cases
49. There is a significant unmet need for therapies that
stabilize progression and reduce death rates in
patients with diabetic kidney disease
• DKD is costly:
• 40-50% of the $44B Medicare expenditures for CKD
• 40-50% of the $50B total healthcare costs for ESRD
• DKD is lethal (>50% of these deaths are cardiac)
• Current therapies reduce risk by 30%
• Many of the things we tried to stabilize renal function AND
improve cardiovascular disease failed miserably in trials
• A paradigm change in our understanding of DKD is
warranted => We posit that miRNAs will trigger this shift
• This improvement will likely spread to other areas given the biology
of cardiovascular disease (“extreme phenotype”)
US population [1] | No Diabetes | Diabetes
No CKD | 7.7% | 11.5%
CKD | 17.2% | 31.1%
[1] Afkarian et al. J Am Soc Nephrol. 2013 Feb;24(2):302-8
[Figure: Dialysis Mortality. x-axis: Time (months), 0 to 60; y-axis: % Surviving, 40 to 100; curves: GN, DM]
50. Why bother with microRNAs in DKD?
Heart & Vessels
• Angiogenesis
• Vascular inflammation
• Atherosclerosis
• LVH
• Vascular tone
• Endothelial dysfunction
Kidney
• Water homeostasis
• Osmoregulation
• Calcium sensing
• Sodium, potassium,
acid base handling
• Renin production
• Renal development
• Renal senescence
• EMT
• Collagen production
Diabetes
• Insulin synthesis and
secretion
• Peripheral tissue
sensitivity
• Hepatic glucose
production
• Inflammatory gene
expression
51. microRNAs as Minimally Invasive
Biomarkers : a metrological argument
Advantages of microRNAs
Circulating microRNAs
•More stable in circulation than
mRNAs
•High expression level and low
complexity compared to mRNA
•Tissue specific expression
•Availability of analytical platforms
Keep getting cheaper over time
•Sequence conservation
Allows translation of clinical
associations to animal models
Allows translation of animal
models to clinical applications
Cortez et al Nat Rev Clin Oncol. Jun 7, 2011; 8(8): 467–477.
52.
53.
54. Targets of differentially expressed miRNAs in
early and late stages of DN map to overlapping
pathways
Columns: Pathway; MA vs. NA (P-value, Fraction); Overt vs. Normal (P-value, Fraction)
Signal Transduction
Signaling by SCF-KIT 0.006 18/76 0.001 41/76
Signaling by Insulin receptor 0.009 23/109 <0.001 65/109
Signaling by NGF 0.016 38/212 <0.001 119/212
Signaling by Rho GTPases 0.024 24/125 <0.001 71/125
Signaling by ERBB4 0.027 16/76 <0.001 45/76
Signaling by ERBB2 0.035 19/97 <0.001 59/97
Signaling by PDGF 0.040 22/118 <0.001 67/118
Signaling by VEGF 0.041 4/11
Signaling by EGFR 0.044 20/106 <0.001 64/106
Downstream signaling of activated FGFR 0.038 19/98 <0.001 61/98
Signaling by BMP 0.001 16/23
Signaling by TGFβ 0.004 11/15
DAG and IP3 signaling 0.010 20/31
PIP3 activates AKT signaling 0.020 15/26
RAF/MAP kinase cascade 0.031 7/10
Signaling by Notch 0.036 13/23
Interaction of integrin α5β3 with fibrillin 0.044 2/3
Interaction of integrin α5β3 with von Willebrand factor 0.044 2/3
Integrin cell surface interactions 0.024 40/85
Cell-Cell Communication 0.009 57/122
Cell Cycle
G0 and early G1 0.040 12/21
56. Goals of a microRNA research
program in cardiometabolic, renal and
diabetes diseases
• Use carefully designed case-control, before-
after, randomized controlled trials, and n-of-
1 trials for the following goals:
1. Personalized medicine applications
(diagnosis/prognosis/precision medicine)
2. Biomarker discovery (e.g. to aid trials)
3. Novel Therapeutics
58. Ingredients for success of a microRNA
regulation discovery program
Requires open-ended platforms (RNA-seq)
o Especially for kidney disease due to intrarenal RNA editing
Requires unbiased quantification between groups of
patients (differential expression analysis)
Requires unbiased and accurate quantification in
the absence of a controlled comparison (diagnostics
– bias correction)
Proposed approach: GAMLSS for RNA-seq satisfies
requirements better than all currently used methods
59. Measurement in clinical diagnostics
What we want to happen
[Figure: paired measurements (January, June) per patient. Condition A: Patient 1 (10, 10), Patient 2 (10, 10). Condition B: Patient 3 (15, 15), Patient 4 (15, 15)]
• Measurement is reproducible
• Measurement shows minimal inter-individual variation
• Measurement shows minimal intra-individual variation
What actually happens
[Figure: paired measurements (January, June) per patient. First example: Patient 1 (10, 18), Patient 2 (13, 10), Patient 3 (15, 10), Patient 4 (15, 18). Second example: Patient 1 (10, 12), Patient 2 (15, 14), Patient 3 (18, 11), Patient 4 (14, 19). Several patients can no longer be assigned to Condition A or Condition B (“? Condition”)]
• Measurement is non-reproducible
• Measurement shows high inter-individual variation
• Measurement shows high intra-individual variation
60. Lessons from clinical chemistry labs
• Understand and control for the sources of variation
• Use calibration sets as references
• A measurement is instrument specific
• Global reference standards (role for highly competent
labs that maintain the standards)
• Context of use:
• Detector (“out-of-limits” readings)
• Control (“track the course”)
• Use GAMLSS as the prime analytical tool to analyze short
RNA-seq data as it correctly represents all sources of
variation and can use calibration (equimolar) runs
• Combine this with a protocol that experimentally controls
variation (e.g. 4N protocol of the Galas Lab)
61. Measurement in experimental samples
What we want to happen
[Figure: RUN 1 and RUN 2 both give Condition A: 10, 10, … , 10 and Condition B: 15, 15, … , 15, so B > A in both runs]
• Certain of the difference
• Measurement is reproducible
• Measurement shows no variation
What actually happens
[Figure: RUN 1 gives Condition A: 11, 7, … , 10 and Condition B: 8, 19, … , 26 (B > A); RUN 2 gives Condition A: 120, 90, … , 130 and Condition B: 150, 60, … , 20 (B < A)]
• Uncertain of the difference
• Measurement is non-reproducible
• Measurement shows high variation
• Use GAMLSS as the prime analytical tool to analyze short RNA-seq data
as it optimizes discovery/omission rates & exhibits the least bias
• BUT what do these correctly/unbiasedly assessed DE changes mean?
62. Understanding the context for
differential expression changes
• A list of de-regulated targets will not by itself
support the microRNA discovery process
• Need some context to interpret changes and guide
further research
• This context is provided by analysis of microRNA
targets
• We have proposed and applied a formal target
analysis methodology in our early diabetic
nephropathy investigations
63. Formal Target Analysis: A Biochemical
Primer
1. Hill plot: $\mathrm{logit}(\theta) = \log L - \log K_d$
2. Fold change between two states: $\log_2 FC_{RE} = \log_2 \frac{L_2}{L_1}$
3. Change in binding between the two states: $\mathrm{logit}(\theta_2) - \mathrm{logit}(\theta_1) = \log(OR_{RE}) = \log 2 \cdot \log_2 FC_{RE}$
4. Means and standard errors for the fold changes can be synthesized
using random effects meta-analysis
5. Integration of fold changes from different experiments
• Use GAMLSS as the prime analytical tool to analyze differential expression in short
RNA-seq data as it achieves the smallest error among algorithms
http://www.pdg.cnb.uam.es/cursos/BioInfo2002/pages/F
armac/Comput_Lab/Guia_Glaxo/chap3b.html
64. The 1st grade approach to target analysis
Heuristic Argument: count the number of miRNAs with small p values
• Total Score (TS)= # of differentially expressed miRNAs predicted to
bind to a given target
• Regulation Score (RS)= # over-expressed- # under-expressed
miRNAs predicted to bind to a given target
TS Low (any RS): Low Signal To Noise Ratio
TS High, RS negative (−): Target likely disinhibited
TS High, RS zero (0): Target likely neutrally modulated
TS High, RS positive (+): Target likely inhibited
• Use GAMLSS as the prime analytical tool to analyze
putative targets of differentially expressed microRNAs as it
achieves the optimal balance between FDR/FOR
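The two scores defined above are simple tallies and can be sketched directly. The miRNA names, gene names and target predictions below are hypothetical placeholders; in practice the predictions would come from a target-prediction resource and the directions from the differential expression analysis.

```python
def target_scores(de_direction, predicted_targets):
    """
    de_direction: miRNA -> +1 (over-expressed) or -1 (under-expressed), DE miRNAs only.
    predicted_targets: miRNA -> set of predicted target genes.
    Returns target -> (TS, RS), where TS counts DE miRNAs predicted to bind the target
    and RS is the number over-expressed minus the number under-expressed.
    """
    scores = {}
    for mirna, direction in de_direction.items():
        for target in predicted_targets.get(mirna, set()):
            ts, rs = scores.get(target, (0, 0))
            scores[target] = (ts + 1, rs + direction)
    return scores

if __name__ == "__main__":
    # Hypothetical miRNA and gene names, for illustration only.
    de = {"miR-A": +1, "miR-B": +1, "miR-C": -1}
    targets = {
        "miR-A": {"GENE1", "GENE2"},
        "miR-B": {"GENE1"},
        "miR-C": {"GENE2", "GENE3"},
    }
    for gene, (ts, rs) in sorted(target_scores(de, targets).items()):
        print(gene, "TS =", ts, "RS =", rs)
```

In this toy input GENE1 is bound by two over-expressed miRNAs (likely inhibited), GENE2 by one of each (likely neutral), and GENE3 by a single under-expressed miRNA (low TS, so low signal to noise).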
68. To boldly go where no one has gone before….
Methodological
• Extend the model to account
for abundance dependent
variations in PCR efficiency
• Incorporate target analysis
into count analysis
• Estimate ligase bias from the
sequence (computationally
derived correction factors)
microRNA biomarkers projects
• COMPASS: a community disease
detection program focusing on
diabetes and CKD in rural New
Mexico
• MIRROR-Transplant: metabolic
and immunological factors
contributing to kidney transplant
failure
• DIDIT: randomized controlled
trial to preserve urine
production in patients starting
dialysis
• Potential areas for collaboration
in the NIH biorepository?
69. Summary
• A generative, probability, model for the counts of
short RNA-seq measurements was developed
• This model may be used to estimate and substantially
correct for the presence of ligase bias
• It achieves superior performance (smaller error,
optimal balance of false discoveries and omissions)
than other competing methodologies
• Can be used to power “personalized” medicine
applications or experimental state comparisons
• Formal target analysis to guide further research
(“reverse-translation”)
70. Acknowledgements
• This work could not have been completed without
the collaboration of the Galas Lab at PNRI
David Galas: provided a friendly ear that had the
patience to listen, comment and risk time and
funds for the experiments
Alton Etheridge: pushed for extensive sequencing
and resequencing and carried out all the validation
experiments
Nikita Sakhanenko: had the patience to be our
software tester, validator and GEO submitter
• This work would not have started without John P
(Nick) Johnson (University of Pittsburgh) who
kicked me into the area about 8 years ago
https://bitbucket.org/chrisarg/rnaseqgamlss
73. Building the model from first principles
• Establish statistical distributions OR deterministic
relationships that “bind” together the quantities in
successive steps
• There is a “competitive qPCR” experiment beating inside
each RNA-seq dataset → random
• Ligase bias is reproducible → deterministic/systematic
• Apply marginalization (integration) operations to
“flatten” the hierarchy
• Derive the exact distributions (or the limits of
approximation) for a statistical model that directly
represents the quantity of interest
• Relate model parameters to quantities of interest
(absolute/relative quantification)
74. Facts about the distribution of
RNA-seq data
• Established relationships between distributions
that were first explored in the 1920-1930s
• Rare biomedical applications in the 1940s
• Theoretical work in the early 1960s
• Lead goes cold due to failure to conceptualize
practical applications after the 1960s
• Extremely involved expressions involving special
functions of mathematical physics (parabolic
cylinder functions); numerical complexities will
hinder attempts to use them as-is in applications
75. Rediscovering a Negative Binomial
parameterization and introducing a new
Gaussian Generalized Linear Normal Family
• Large scale numerical
simulations (>500,000) to
establish approximations for
the RNA-seq distribution
• Arbitrary precision libraries in
python in multicore machines
• Low precision – but
acceptable for statistical
computations
• Both approximations
implement a LQ relationship
between the mean and
variance
• Inferences are largely the
same (shown in synthetic
mixes)
76. Two equivalent views of measures of differential expression:
Fold Change and Probability of Over-Expression
• The GLM approach (limma,
DESeq/DESeq2, gamlss) yields
measures of differential
expression for microarrays,
RNA-Seq or qPCR experiments
• These are estimates of fold
changes (signal) and their
associated standard errors
(noise)
• They can be converted to
probability estimates (= $p$)
about the signal being >0
(overexpressed) vs. <0
(underexpressed)
• The standard error of $p$ is given
by $\sqrt{p(1 - p)}$
[Figure: Computing probability of differential expression (pDE) in R. Normal density of the estimated fold change; x-axis: Fold Change, -2 to 4; y-axis: density, 0.0 to 0.4. For Fold Change = 1.0 and SE = 1.0, the shaded area, 1.0 - pnorm(0, FC, SE) in R, yields the probability of overexpression]
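For readers without R at hand, the same probability can be computed with the Python standard library. This is a small sketch that mirrors the R expression 1.0 - pnorm(0, FC, SE); the function name is hypothetical and the Normal approximation for the estimate is taken as given.

```python
import math

def prob_overexpression(fc, se):
    """P(signal > 0) for a Normal(fc, se) estimate; mirrors 1 - pnorm(0, fc, se) in R."""
    z = (0.0 - fc) / se
    cdf_at_zero = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 1.0 - cdf_at_zero

if __name__ == "__main__":
    # The example from the slide: fold change 1.0 with standard error 1.0.
    print(round(prob_overexpression(1.0, 1.0), 4))  # 0.8413
```

An estimate of exactly zero gives a probability of 0.5, i.e. no evidence for either direction of change.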
77. Why do we need two views of the
same data?
The FC View
• Absolute, relative
quantification is possible
• Fold changes in one
miRNA are directly
comparable against each
other
• Fold changes are
comparable between and
within techniques
• Type I and II statistical
errors
The pDE View
• Only relative
quantification is possible
• Platforms provide
evidence for directional
changes in expression
• Type M and S errors
• Provides input to Systems
Biology tools (e.g
Boolean Networks)
78. Solid measurements for thinning one’s blood:
the history of the PT test
• Experimental work in late 19th century to discover the physiological
basis of coagulation (“prothrombin”)
• Development of different versions of the “Prothrombin Time”:
investigations in hemophilia, post-op bleeding & liver disease
(1930s-1950s): derived the normal range and ranges associated
with specific deficits
• Pre-analytical considerations throughout the 1950s (and even
today)
• In the 1970s PT was used to monitor and dose warfarin in the clinic
• Classical studies in the 1970s-1980s demonstrate high inter, intra and
analytic variability (despite > 30 years of standardization)
• WHO proposed to standardize the test in the mid 1980s through
the use of the INR (international normalized ratio)
http://www.clinchem.org/content/51/3/553.full
http://circ.ahajournals.org/content/19/1/92.full.pdf
Thromb Haemost. 1985 Feb 18;53(1):155-6.
79. The cautious story of the INR
Normalization procedure
• $INR = \left(\frac{PT_{patient}}{PT_{normal}}\right)^{ISI}$
• $PT_{normal}$: geometric mean of 20
patients
• $ISI = \frac{\log(\text{calibrator } INR)}{\log(PT_{calibrator} / PT_{normal})}$
Sources of variation
• Different methods to measure
the PT
• Different instruments that
implement each method
• Different calibrator sets for
each instrument!
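The normalization can be written out as a short calculation. In this sketch the ISI is chosen so that the local instrument reproduces the INR assigned to a single calibrator; actual calibration procedures use sets of calibrators, and the prothrombin times below are hypothetical values in seconds.

```python
import math

def inr(pt_patient, pt_normal, isi):
    """International Normalized Ratio: (PT_patient / PT_normal) ** ISI."""
    return (pt_patient / pt_normal) ** isi

def isi_from_calibrator(calibrator_inr, pt_calibrator, pt_normal):
    """ISI that makes the local instrument reproduce the calibrator's assigned INR."""
    return math.log(calibrator_inr) / math.log(pt_calibrator / pt_normal)

if __name__ == "__main__":
    # Hypothetical values in seconds, for illustration only.
    isi = isi_from_calibrator(calibrator_inr=3.0, pt_calibrator=30.0, pt_normal=12.0)
    print("ISI:", round(isi, 3))
    print("INR:", round(inr(pt_patient=24.0, pt_normal=12.0, isi=isi), 3))
```

The parallel with the sequencing setting is that the reported value depends on instrument-specific calibration, so a raw measurement is not comparable across instruments until it is normalized.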
http://www.who.int/bloodproducts/publications/WHO_TRS_889_A3.pdf
http://www.clinchem.org/content/56/10/1618.full
http://www.clinchem.org/content/51/3/553.full
81. Pathophysiology of the cardiorenal syndrome
http://www.kdigo.org/meetings_events/pdf/KDIGO%20CVD%20Controversy%20Rpt.pdf
Editor's Notes
FFL = Feed Forward Loop (=cerebellar circuits)
Common Statistical Properties in many gene regulatory pathways
“Understanding biology by reverse engineering the control”
Comment that among those with ESRD one may find an approximately equal proportion of patients with T1D and T2D, even though T2D is more frequent than T1D. The reason for this discrepancy is that overt nephropathy develops less frequently in patients with T2D, many of whom will die from macrovascular complications before their kidney disease progresses
Ref for this slide: Diabetes Care January 2004 vol. 27 no. suppl 1 s79-s83 http://care.diabetesjournals.org/content/27/suppl_1/s79.full
To put things in context: five-year survival by stage for
- All cancers: 68%
Breast cancer: 72% (stage III), 22% (stage IV)
Colon Cancer: 53% (IIIC), 11% (IV)
Prostate Cancer: 28% (distant)
Analysis of enriched terms in REACTOME (Table) suggests that the predicted miRNA targets map to distinct pathways involving growth factor signaling, apoptosis, immunity, substrate metabolism, transmembrane transport and certain non-kidney related terms. Furthermore, the identified pathways overlapped considerably between the comparisons of patients with overt nephropathy and normals, and follow-up vs. baseline samples from MA patients. In the comparisons within baseline and follow-up MA samples we found only a few (<80) targets mapping to annotated REACTOME pathways, thus precluding a meaningful assessment with this structured vocabulary.
This slide shows a clinical diagnostics scenario in which we would like to use a microRNA as a biomarker for a clinical condition. In contrast to experimental samples, clinical diagnostic samples work in an “inverse” mode and are used in regulated environments. In particular, clinical biomarkers are used to infer the presence of a clinical condition based (conditional) on the actual measurement obtained. Threshold for diagnosis: 14 to distinguish condition A from B
Shows hypothetical repeated measurements in two experimental or clinical conditions. In the left panel, there is no variation and the underlying pattern is visible to the naked eye without any “statistical” assistance.
In the right panel, noise makes it difficult to discern the patterns: the measurement is imperfect (in this case, measurements from condition B are generated from distributions with higher means than those of A).
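The two panels can be mimicked with a small simulation. This is only an illustrative sketch: the threshold of 14 comes from the diagnostic scenario above, while the condition means (12 and 16), the Gaussian noise model and the standard deviation of 2 are hypothetical choices made for the example, not values from the slides.

```python
import random

def misclassification_rate(mean_a, mean_b, sd, threshold, n=10_000, seed=0):
    """Fraction of simulated measurements that land on the wrong side of the threshold.

    Assumes (hypothetically) Gaussian measurement noise with the same sd in both conditions.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        a = rng.gauss(mean_a, sd)  # condition A should read below the threshold
        b = rng.gauss(mean_b, sd)  # condition B should read at or above the threshold
        errors += (a >= threshold) + (b < threshold)
    return errors / (2 * n)

# Left panel: no variation, so the threshold separates A from B perfectly.
noiseless = misclassification_rate(12.0, 16.0, sd=0.0, threshold=14.0)

# Right panel: with noise, some measurements cross the threshold.
noisy = misclassification_rate(12.0, 16.0, sd=2.0, threshold=14.0)

print(f"misclassified without noise: {noiseless:.3f}")
print(f"misclassified with noise:    {noisy:.3f}")
```

With no noise every measurement is classified correctly; once noise is added, a hard threshold misclassifies a non-trivial share of samples even though the underlying means have not changed.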
When the TS is low, the targeted gene receives input from only a small number of DE miRNAs on the basis of the expression data, so it would seem that such targets are not high priority for validation, or rather that the effects of miRNA on mRNA would be difficult to sort out (low signal-to-noise ratio).
On the other hand, if a target has a high TS this suggests that miRNAs potentially play an important role in modulating the expression of that target. In such a case the RS may allow a semi-quantitative assessment of the direction and the magnitude of the modulation:
Negative: target is likely “dis-inhibited” (expression may go up)
Positive: target is likely inhibited (expression may go down)
Zero: the modulation is likely neutral
Caveat1: In a biological fluid such as urine which integrates the microRNA signatures from diverse cellular populations in the kidney and the genitourinary tract, one may get a zero RS as a result of a positive RS in one cellular population and a negative one in another.
Caveat2: In the context of transcriptional/translational regulation, the mRNA response anticipated from the miRNA expression profile may be in the same direction as the measured mRNA levels, or even in the opposite direction. In the former case the miRNA pattern enhances the mRNA response, while in the latter case it opposes (counter-balances) it.
This figure shows the meta-analysis for the target PDGF Beta. This factor has been shown to be expressed in biopsies from human patients with overt diabetic nephropathy, so it can be used as a benchmark of the proposed methodology.
Out of the 68 predicted miRNAs targeting PDGFB, 37 were represented in our urinary profiles. It can be seen that the majority of these miRNAs do not exhibit a directional change (they are scattered around the vertical line of no effect, OR=1), and only two stand out. When the evidence is synthesized, using techniques and software for clinical trial analysis, the odds ratio is >1, suggesting that miRNAs are working towards inhibiting the gene. As the evidence from clinical biopsy material is that this particular factor is upregulated in diabetic kidneys, the miRNA pattern is probably providing a counter-balancing influence to a (?transcriptional) response.
Reference for the PDGF-beta: Langman et al. Over-expression of platelet-derived growth factor in human diabetic nephropathy. Nephrol Dial Transplant (2003) 18: 1392–1396 http://ndt.oxfordjournals.org/content/18/7/1392.full.pdf
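The idea of synthesizing per-miRNA odds ratios into one pooled estimate can be sketched as follows. The notes do not say which pooling model was used, so this example assumes a fixed-effect, inverse-variance meta-analysis on the log odds ratio scale, which is one common technique from clinical trial analysis; the function name and the five input values are hypothetical and are not the PDGFB data.

```python
import math

def pool_odds_ratios(odds_ratios, standard_errors):
    """Fixed-effect inverse-variance pooling of odds ratios.

    standard_errors are the standard errors of the log odds ratios.
    Returns the pooled OR and an approximate 95% confidence interval.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    log_ors = [math.log(o) for o in odds_ratios]
    pooled_log = sum(w * l for w, l in zip(weights, log_ors)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    lower = math.exp(pooled_log - 1.96 * pooled_se)
    upper = math.exp(pooled_log + 1.96 * pooled_se)
    return math.exp(pooled_log), (lower, upper)

# Hypothetical miRNAs: three scattered around OR=1 and two that stand out.
example_ors = [0.9, 1.1, 1.0, 2.5, 3.0]
example_ses = [0.5, 0.5, 0.5, 0.3, 0.3]

pooled_or, (ci_low, ci_high) = pool_odds_ratios(example_ors, example_ses)
print(f"pooled OR = {pooled_or:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

In this made-up example the pooled OR sits above 1 even though most individual miRNAs show no directional change, because the precisely estimated outliers carry more weight.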
The fold change (FC) view makes some strong assumptions about our ability to quantify relative changes in miRNA expression, e.g. equivalent to assuming the same efficiency in a qPCR experiment. In particular, it assumes that absolute quantification is indeed possible (e.g. platforms are like car speedometers, so the acceleration (difference in velocity) in mph/sec read from one is directly comparable to the same reading given by another speedometer). Hence microRNA changes are comparable within experiments (for different miRNAs) and between experiments (for the same miRNA). The statistical errors implicit in this view are of Type I (calling a signal different from zero when in fact it is not) and Type II (calling a signal equal to zero when it is not).
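The qPCR analogy can be made concrete with a short sketch. It assumes the standard exponential amplification model, in which the fold change is the per-cycle amplification factor raised to minus the Delta-Delta-Ct; the function name, the sign convention (negative values mean higher expression) and the two efficiency values are illustrative assumptions, not numbers from the presentation.

```python
def fold_change(delta_delta_ct, efficiency=2.0):
    """Fold change implied by a Delta-Delta-Ct value.

    efficiency is the per-cycle amplification factor (2.0 = perfect doubling).
    Convention assumed here: a negative Delta-Delta-Ct means higher expression.
    """
    return efficiency ** (-delta_delta_ct)

# The same Ct shift read through two probes with different efficiencies.
fc_perfect = fold_change(-3.0, efficiency=2.0)    # ideal doubling per cycle
fc_imperfect = fold_change(-3.0, efficiency=1.8)  # hypothetical less efficient probe

print(f"FC at efficiency 2.0: {fc_perfect:.2f}")
print(f"FC at efficiency 1.8: {fc_imperfect:.2f}")
```

Both readings agree on the direction of the change (FC > 1), but the magnitudes differ, which is why treating fold changes as directly comparable requires the equal-efficiency assumption.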
The pDE view only admits the possibility of a relative, relative quantification, i.e. we can only infer directions of changes but not their absolute magnitude. When comparing data between and within platforms, only the direction of the change is important. This is a weaker assumption than the one made by the adopters of the FC view. To make the difference explicit, consider the Delta-Ct values in PCR: unless the probes have the same efficiency (so that we can convert Delta-Cts to FC), the only thing we can say is that a larger Delta-Ct corresponds to a larger change in expression than a smaller one. It follows that the types of errors in this view are type M (larger vs. smaller) and type S (calling an over-expressed miRNA under-expressed and vice versa). Though one may consider the pDE limiting, it is in fact
In this section we will examine a cautionary tale about a test widely used to monitor anti-coagulation. This test (called INR: International Normalized Ratio) came about as an attempt to standardize measurements of the Prothrombin Time (PT), a test used to monitor the vitamin K dependent coagulation factors. The PT was initially developed as a way to study the coagulation pathways (“research tool”), in particular post-operative thrombotic events and bleeding, hemophilia and liver disease.
Normalization for the INR is based on subtracting (in logarithmic scale) the geometric mean of a “normal” sample and then scaling the result with a “fudge” factor that represents the technical variability/bias of the assay used. This factor (ISI) is determined by calibration against an adopted standard maintained by the World Health Organization. The ISI is derived from calibrator sets with known (certified) INRs and normal plasma.
The formula and the process are directly analogous to various normalization approaches that have been applied to expression profiles so far.
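The INR normalization described above can be written out in a few lines. This sketch uses the standard definition, INR = (PT / MNPT)^ISI, where MNPT is the geometric mean of normal prothrombin times, expressed in log scale to mirror the "subtract, then scale" description; the function name and the example values are hypothetical.

```python
import math

def inr(pt_patient, normal_pts, isi):
    """International Normalized Ratio computed in log scale.

    pt_patient: patient prothrombin time (seconds)
    normal_pts: prothrombin times of the "normal" reference samples (seconds)
    isi:        International Sensitivity Index of the assay
    """
    # Log of the geometric mean of the normal sample.
    log_mnpt = sum(math.log(pt) for pt in normal_pts) / len(normal_pts)
    # Subtract in log scale, scale by the ISI, then return to the original scale.
    return math.exp(isi * (math.log(pt_patient) - log_mnpt))

# Hypothetical example: the same patient PT read with two different assay ISIs.
normals = [11.5, 12.0, 12.5]
print(f"INR with ISI 1.0: {inr(24.0, normals, 1.0):.2f}")
print(f"INR with ISI 1.3: {inr(24.0, normals, 1.3):.2f}")
```

Structurally this is a reference-centering step followed by a platform-specific scale factor, which is the same shape as many expression normalization schemes.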
Despite the rigorous international calibration effort, assaying the same sample by different methods (lines in the graph) still yields different values. Considering the “hard” thresholds and tight ranges required in clinical practice, this performance is not adequate, necessitating frequent measurements to ensure that the patient is maintained within range.
There are many sources of variation which can be investigated with repeated measurements of the same individual. For the INR, a normalization procedure developed 30 years ago for a test that has been widely used for the last 80 years and is performed in CLIA-certified environments, the analytical imprecision is still of the same magnitude as the between- and within-individual sources of variability.