Correcting bias and variation in small RNA sequencing for optimal (microRNA) biomarker discovery and validation in cardio-metabolic (and renal) disease

Christos Argyropoulos MD, PhD, FASN
Department of Internal Medicine
Division of Nephrology
University of New Mexico Health Sciences Center

Overview
• Models of sequence counts in short RNA-seq
experiments
• Estimating and controlling for bias in small RNA-seq
experiments
• Statistical approaches to analyzing differential
expression
• MicroRNA regulation – a control theory perspective
• MicroRNAs as biomarkers in diabetes, renal and
cardiometabolic disease
• Leveraging our approach for optimal biomarker
discovery

Signals in short RNA-seq
data
Building a model from first principles

Background
• Short RNA-seq data are becoming more and more
abundant
• There is poor reproducibility of findings between
and within research groups
• Systematic measurement biases confound findings
• Systematic variation is relatively stable within protocols
• Systematic variation is unpredictable between different
protocols and platforms
• Statistical methods may be used to explore and
address such biases
• Existing approaches are phenomenological descriptions
• What do model parameters stand for?
• How can one best use these models?

Building a model from first
principles
• Establish testable predictions that may be verified
in existing datasets
• Establish correspondence between model
parameters and experimental steps
• Use this model to understand and correct
systematic and random bias in short RNA-seq
• Embed the model into more general frameworks
for applications:
• Epidemiological
• Biomarker discovery and validation
• Medical diagnostics

The short RNA-seq experiment
The vendor’s view The biochemist’s view
https://doi.org/10.1093/nar/gkt1021
http://www.genomics.hk/SamllRna.htm
http://www.geospiza.com/Products/SmallRNA.shtml

Conceptual model of the short RNA-seq
experiment (this is what we will talk about)
Quantities tracked for each sequence i = 1, … , n:
• $X_i$: abundance in original preparation
• $\Lambda_i$: abundance in adapted (ligated) sample
• $L_i^N$: abundance in PCR amplified library
• $B_i$: abundance in capture probes
• $Y_i$: abundance of counts in fastq files
Parameters linking successive steps:
• $X \to \Lambda$: ligation efficiency $f_i$
• $\Lambda \to L^N$: number of PCR cycles $N$, PCR efficiency $q_i$
• $L^N \to B$: probability of capture $s_i$, number of probes $K$, library dilution factor $d$
• $B \to Y$: probability of signal generation $r$, probability of sequence generation $p_i$

Modeling the qPCR amplification reaction
• Statistics of PCR amplification
• Branching (Galton-Watson) process
• GW distribution only available implicitly i.e.
through simulation
• Large scale simulations to derive
approximation to the GW process
• PCR literature, GW theory, martingale arguments
→ candidate distributions
• Information theory arguments used to compute
distance between GW samples and the
approximate distributions
• A (truncated) Normal distribution derived at the
end
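The branching-process view of PCR can be illustrated with a small simulation. This is a minimal sketch, not the large-scale simulation described on the slide: it assumes a constant per-cycle efficiency q, and the function names (`simulate_pcr`, `expected_yield`) and the numbers used are illustrative only.

```python
import random

def simulate_pcr(x0, q, n_cycles, rng):
    """Simulate PCR as a Galton-Watson branching process.

    Each molecule present at the start of a cycle is copied with
    probability q (assumed constant), independently of the others.
    """
    molecules = x0
    for _ in range(n_cycles):
        copies = sum(1 for _ in range(molecules) if rng.random() < q)
        molecules += copies
    return molecules

def expected_yield(x0, q, n_cycles):
    """Mean of the process after n_cycles: x0 * (1 + q) ** n_cycles."""
    return x0 * (1.0 + q) ** n_cycles

# Hypothetical settings: 5 starting molecules, efficiency 0.8, 8 cycles.
rng = random.Random(2017)
draws = [simulate_pcr(x0=5, q=0.8, n_cycles=8, rng=rng) for _ in range(2000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean, expected_yield(5, 0.8, 8))
```

The spread of `draws` around the mean is the random component of amplification that the approximating distribution has to capture.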

Flattening the hierarchy through
marginalization
Integrate sources of variation out of the
model:
1. library sequence depth variation
2. PCR amplification
Final statistical model is about absolute
counts
• Direct modeling ≠ % of counts
• Limit of approximation encompasses all possible
sample compositions
• This is a truncated Normal-Poisson mixture
distribution (approximated via a Negative
Binomial or Linear Quadratic Gaussian family)
Model implements a Linear-Quadratic (LQ)
mean-variance relationship
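Why a Poisson mixture gives a linear-quadratic mean-variance relationship can be seen from the law of total variance. The sketch below assumes that counts are Poisson given a random rate $\lambda$ with mean $\mu$ and variance $\phi\mu^2$ (a constant coefficient of variation); this parameterization is an assumption chosen for illustration, not a statement of the slide's exact derivation.

```latex
\begin{aligned}
Y \mid \lambda &\sim \mathrm{Poisson}(\lambda), \qquad
E[\lambda] = \mu, \qquad \mathrm{Var}(\lambda) = \phi\mu^{2} \\
E[Y] &= E\big[E[Y \mid \lambda]\big] = \mu \\
\mathrm{Var}(Y) &= E\big[\mathrm{Var}(Y \mid \lambda)\big]
  + \mathrm{Var}\big(E[Y \mid \lambda]\big)
  = \mu + \phi\mu^{2} = \mu(1 + \phi\mu)
\end{aligned}
```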

Distributional Regression for RNA-
seq data
LQ relationship between mean ($\mu$) and variance ($\sigma_{LQ}^2$):
$\sigma_{LQ}^2 = \mu(1 + \phi\mu)$
• The variance and the mean
have to be modelled
concurrently
• Unless variance is modelled →
inconsistent statistics → small
(overoptimistic) p values
• Realm of distributional
regression models (GAMLSS –
Generalized Additive Models
for Location, Scale and Shape)
• One can re-use existing software
frameworks to fit such models
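The LQ relationship can be checked numerically. The sketch below computes the variance implied by a mean and a dispersion parameter, and converts the pair to the common (size, prob) form of the Negative Binomial; the function names are illustrative and the numbers are arbitrary.

```python
def lq_variance(mu, phi):
    """Linear-quadratic variance: mu * (1 + phi * mu)."""
    return mu * (1.0 + phi * mu)

def nb_size_prob(mu, phi):
    """Map (mu, phi) to the (size, prob) Negative Binomial form.

    With size = 1 / phi and prob = size / (size + mu), the NB that counts
    failures before `size` successes has mean mu and variance mu * (1 + phi * mu).
    """
    size = 1.0 / phi
    prob = size / (size + mu)
    return size, prob

mu, phi = 10.0, 0.1
size, prob = nb_size_prob(mu, phi)
nb_mean = size * (1.0 - prob) / prob
nb_var = size * (1.0 - prob) / prob ** 2
print(lq_variance(mu, phi), nb_mean, nb_var)
```

When phi approaches zero the variance approaches the mean, which is the Poisson limit.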

Validating model(s) with synthetic
mixes of known composition
• Allow one to test the “backbone” of the model
without worrying about the adequacy of the
modeling of biology
• Sequencing of equimolar mixes:
• Explore and model systematic bias in the same protocol
• Sequencing of dilution series or non-equimolar
mixes:
• “Dose-response” curve of the bias
• Examination of “debiasing” approaches for the ability to
uncover the truth
• Model may also be used to analyze the
performance of differential expression algorithms

Testable predictions: mean and variance linear
quadratic relationships in public RNA-seq data

Linear Quadratic Relationship in the
legacy datasets of the Galas group

Estimating and
Correcting for Ligase Bias
At the corner of Biochemistry and Mathematics

Enzymatic mechanism of RNA ligation
• The kinetics of RNA ligation were investigated thoroughly
in the 1970s and early 1980s
• The intermolecular reaction is relevant to RNA-seq
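• The mechanism involves three, fully reversible, steps that
obey ping-pong ordered kinetics and are subject to
substrate inhibition
$$E + ATP \underset{k_{-1}}{\overset{k_{1}}{\rightleftharpoons}} E \cdot ATP \underset{k_{-1a}}{\overset{k_{1a}}{\rightleftharpoons}} E\text{-}AMP + PP_i$$
$$E\text{-}AMP + D \underset{k_{-2}}{\overset{k_{2}}{\rightleftharpoons}} E \cdot App\text{-}D \underset{k_{-2a}}{\overset{k_{2a}}{\rightleftharpoons}} E + App\text{-}D$$
$$E \cdot App\text{-}D + A \underset{k_{-3}}{\overset{k_{3}}{\rightleftharpoons}} E \cdot App\text{-}D \cdot A \underset{k_{-3a}}{\overset{k_{3a}}{\rightleftharpoons}} AMP + E + AD$$
• Bias in RNA ligation was noted in these early investigations, and the enzyme was
never used as a tool in synthetic chemistry, as solid-phase methods took off in the 1980s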

Kinetic analysis of ligase reaction
velocity in RNA-seq protocols
• Existing protocols include abundant cofactors (in sharp
contrast to the experiments of the 1970s)
• Drive the reaction to the right
• Rate-limiting single-step reaction instead of a three-step one
• Substrate preference (bias in reaction yields) is not eliminated
• Multi-substrate inhibition from all biosample sequences
available for ligation
• Analytical series approximation for ratios of random variables
• Ligase operates in the first-order domain of Michaelis-
Menten kinetics
$$V_i = \frac{V_i^{max}\, X_i / K_M^i}{1 + \sum_j X_j / K_M^j} \approx \frac{V_i^{max}\, X_i / K_M^i}{1 + n\, E[X] / E[K_M]} = \frac{V_i^{max}\, X_i / K_M^i}{1 + C_{Total}(0) / E[K_M]} \approx \frac{V_i^{max}\, X_i}{K_M^i}$$
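The step from the competitive expression to the first-order approximation can be checked numerically. This is a minimal sketch under the assumption that total substrate is far below the Michaelis constants; the function names and all numerical values are hypothetical.

```python
def ligation_velocity(i, x, vmax, km):
    """Velocity for substrate i with all substrates competing for the enzyme."""
    competition = sum(xj / kj for xj, kj in zip(x, km))
    return vmax[i] * (x[i] / km[i]) / (1.0 + competition)

def first_order_velocity(i, x, vmax, km):
    """First-order limit: Vmax_i * X_i / K_M_i."""
    return vmax[i] * x[i] / km[i]

# Hypothetical dilute sample: concentrations far below every K_M.
x = [1e-6, 2e-6, 5e-7]
vmax = [1.0, 0.4, 2.5]
km = [1.0, 0.5, 2.0]

ratio_to_limit = ligation_velocity(0, x, vmax, km) / first_order_velocity(0, x, vmax, km)

# The ratio of two velocities does not depend on the shared denominator,
# so relative ligation preference is independent of sample composition.
full_ratio = ligation_velocity(0, x, vmax, km) / ligation_velocity(1, x, vmax, km)
limit_ratio = first_order_velocity(0, x, vmax, km) / first_order_velocity(1, x, vmax, km)
print(ratio_to_limit, full_ratio, limit_ratio)
```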

Testable model predictions about
ligase bias in RNA-seq experiments
Mathematical expression
• $X_i \left(1 - \exp\left(-\frac{V_i^{max}}{K_M^i} T_R\right)\right) = X_i f_i^{5'}$
• $\Lambda_i = X_i f_i^{5'} f_i^{3'} = X_i f_i$
Implications for ligase bias
• Concentration
independence
• Sample composition
independence
• Transferable within
experiments done with the
same protocol
• Protocol dependent
(reaction velocity
incorporates concentration
of cofactors and enzyme)
• Sequence equimolar mixes to
derive empirical correction
factors for ligase bias
• Apply those to biological
samples (“offsets” in
distributional regression) to
eliminate bias
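The calibration idea can be sketched in a few lines. The code below assumes strictly positive counts, estimates each sequence's bias on the log scale relative to the average sequence in an equimolar run, and subtracts that offset from another run. The centring choice, the function names, and the toy counts are assumptions for illustration, not the fitted distributional regression.

```python
import math

def ligation_fraction(vmax, km, reaction_time):
    """Fraction ligated under first-order kinetics: 1 - exp(-(Vmax / K_M) * T)."""
    return 1.0 - math.exp(-(vmax / km) * reaction_time)

def bias_factors(equimolar_counts):
    """Log-scale bias of each sequence relative to the mean log count."""
    logs = {s: math.log(c) for s, c in equimolar_counts.items()}
    centre = sum(logs.values()) / len(logs)
    return {s: v - centre for s, v in logs.items()}

def corrected_log_counts(counts, factors):
    """Apply the factors as offsets on the log scale."""
    return {s: math.log(c) - factors[s] for s, c in counts.items()}

# Hypothetical equimolar calibration run: true abundances are equal,
# yet raw counts differ 16-fold because of ligation preference.
calibration = {"seqA": 100, "seqB": 400, "seqC": 25}
factors = bias_factors(calibration)

# A second equimolar run at ten times the input, same protocol.
sample = {"seqA": 1000, "seqB": 4000, "seqC": 250}
corrected = corrected_log_counts(sample, factors)
print(corrected)
```

In this noise-free toy case the corrected log counts coincide across sequences, which is what concentration-independent bias factors would predict.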

There is substantial
variation in raw
sequence counts
from equimolar
mixes

Application of bias factors virtually
eliminates ligase bias
Monte Carlo Cross Validation in 3 equimolar datasets: randomly split the
dataset into learning and testing subsets, learn the correction factor on the
learning subset and apply it to correct the estimates of the testing subset. Repeat N times

Empirical factors nearly eliminate bias
between equimolar datasets with 10x
different input (Galas Lab legacy datasets)

Bias factors in public non-equimolar
short-RNA seq datasets

Design of Validation Experiments
What has been established?
• Moderate
concentration
independence
• Ability to nearly
eliminate bias over at
least two orders of
magnitude
• Legacy
platforms/experiments
What needs to be proven?
• Concentration
independence over >2
orders of magnitude
• Sample composition
independence
• Recovery of differential
expression measures
• Any value relative to
existing approaches?

Validation Experiments
Collaboration between PNRI (Galas Lab) and UNM (DoIM)
The largest, single protocol, technical series to date (GSE93399)
Experimental Group | Dilution | N
miRExplore (972 short RNAs) | 1:10 | 10
286 miRNAs | 1:1 | 8
286 miRNAs | 1:10 | 8
286 miRNAs | 1:100 | 8
286 miRNAs | 1:1000 | 8
Ratiometric Series A (descending) | Mix of 286 subpool A (1:1), 286 subpool B (1:10), 286 subpool C (1:100), 286 subpool D (1:1000) | 8
Ratiometric Series B (ascending) | Mix of 286 subpool A (1:1000), 286 subpool B (1:100), 286 subpool C (1:10), 286 subpool D (1:1) | 8
Total: 7 groups (58 sequenced x 2 = 116)

Empirical bias correction over 3 orders
of magnitude in equimolar datasets
RMSE reduction: 77%-90% (input in calibration run differs by up to x10 from
target), 54%-67% otherwise

Empirical factors reduce bias by
nearly 60% in non-equimolar series

Bias correction recovers
expression profile patterns

Bias Correction in Heterogeneous
Samples
• Correction factors remove
~55% of bias between
equimolar samples
• ~ 70% of RNAs have
expression within two fold
from the mean (from 23%)
• Bias reduction is ~40% in
ratiometric series
• ~63% of RNAs have
expression within x2 from
the mean (from 33%)

Differential Expression
When more is less, and simplest is the best

Our proposal for a model of
differential expression (DE) changes
Statistical formulation and
assumptions
$\log \mu_{i,j,k} = \alpha + \Delta_k + m_{i,0} + \delta_{i,k}$
$m_{i,0} \sim Normal(0, \sigma_{\mu_0}^2)$
$\delta_{i,k} \sim Normal(0, \sigma_k^2)$
(similar model for variance)
1. Expression in reference state is not
of prime scientific interest (can
omit correction for bias)
2. Technical sources of variation (PCR
efficiency, library sampling) of
much smaller magnitude than
biological variability
Parameter interpretation
and context of use
• Accommodates global
and sequence specific DE
changes
• Flexible modeling of
referent (global level and
variation around it)
• Still models counts
• No incorporation of
library specific factors
(model is un-normalized)
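A short worked step may help in reading the parameters. Assuming (this is not stated on the slide) that the reference state is coded with $\Delta_0 = 0$ and $\delta_{i,0} = 0$, the log fold change of sequence $i$ in state $k$ relative to the reference would split into a global part and a sequence-specific part:

```latex
\log\frac{\mu_{i,j,k}}{\mu_{i,j,0}}
  = (\alpha + \Delta_k + m_{i,0} + \delta_{i,k}) - (\alpha + m_{i,0})
  = \underbrace{\Delta_k}_{\text{global DE}}
  + \underbrace{\delta_{i,k}}_{\text{sequence-specific DE}}
```

Under that coding a non-zero $\Delta_k$ shifts every sequence in the same direction, which is the kind of global change the model is meant to accommodate.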

Existing Models for RNA-seq experiments
• Number of reads in sample j, assigned to species i ($K_{i,j}$)
• Assumed to follow a negative binomial distribution:
• $K_{i,j} \sim NB(\mu_{i,j}, \sigma_{i,j}^2)$
• Mean: $\mu_{i,j} = m_{i,j} \times s_j$, where $s_j$ is a common scale (coverage of the library, sequence depth)
• Variance: $\sigma_{i,j}^2 = \mu_{i,j} + a\mu_{i,j}^2$ (edgeR [1])
• Variance: $\sigma_{i,j}^2 = \mu_{i,j} + s_j^2 f(m_{i,j})$ (DESeq [2])
• Model for differential expression analysis: $\log(m_{i,j})$ equals $\beta_{i,0}$ (miRNA expression in the control group) or $\beta_{i,0} + \beta_{i,1}$ (miRNA expression in the experimental group), where $\beta_{i,1}$ captures the experimental effects
[1] Biostatistics 2008, 9:321-32
[2] Genome Biology 2010, 11:R106

Comparison of proposed approach
against existing methods
“We” (gamlss)
• Uses the NB or the LQNO
• LQ relation between mean and
variance
• Variance and mean parameters
are estimated simultaneously
• Explicit count based modeling
• Un-normalized
• Shrinkage via random effects
modeling
• Derived from first principles (a
generative probability model)
“They” (edgeR/DESeq2 etc)
• NB or the linear model
• LQ or flexible relation between
mean & variance
• Two stage procedure to
estimate parameters
• Models counts as % of a given
library depth
• Normalized (% sum to one)
• Shrinkage via random effects
modeling
• Ad hoc, phenomenological
probability model

Scenarios of differential expression
to assess method performance
• Clustered, symmetric differential expression
1. Fraction of overexpressed sequences is equal to that of the
underexpressed
2. No change in global expression
• Over- and underexpressed RNAs are present in equal numbers
and exhibit the same degree of DE
• Asymmetric, clustered differential expression
1. Fraction of overexpressed sequences ≠ underexpressed
• Drives global expression change in one direction
• Global Change: all RNAs exhibit a variable but consistent
directional change of expression
• No change
All scenarios implemented through the validation datasets

The GAMLSS has smaller RMSE than 10
popular workflows for DE analysis
• Performance
benefit seen under
scenarios of
asymmetric,
clustered
differential
expression changes
• When DE are
(nearly) symmetric,
many other
methods have
similar
performance

Existing methods cannot detect global,
directional differential expression

Algorithm
performance
in the
absence of
differential
expression

GAMLSS demonstrates the optimal
balance between False Omission and False
Discovery Rates
ROC Curve Analysis FDR and FOR

What did we just find out about
algorithms for DE analysis?
• Proposed method (GAMLSS) is the top performer:
• Symmetric, clustered, DE changes
• Asymmetric clustered, DE changes
• Asymmetric global, DE changes
• No DE change
Optimal balance between FDR and FOR
• Existing methods introduce moderate-to-severe bias
• They force the overall DE to sum to zero (what goes up must be
accompanied by something that goes down)
• Voom/limma is somewhat more resilient, with near-identical
performance to GAMLSS under symmetric DE
These patterns have not been seen before, because no one to
date has generated datasets with known composition/DE

Why do existing methods fail to deliver?
• Existing models for RNA-seq analysis, e.g.
DESeq and edgeR, can be derived from first
principles as approximations
• RNA-seq counts as % of library depth
• Valid for dilute samples, not dominated by a
few RNA species
• Library depth and modeling counts as %
(a relic of the SAGE era) may be a disastrous
distraction
• Parameterization constrains DE over all RNAs
included in the analysis to sum to zero
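The effect of modeling counts as a percentage of library depth can be seen in a toy calculation. The sketch below uses plain total-count scaling, which is a deliberate simplification and not the normalization actually implemented in edgeR or DESeq; the abundances are hypothetical.

```python
import math

def log2_fc_absolute(before, after):
    """Log2 fold change computed on absolute abundances."""
    return [math.log2(a / b) for b, a in zip(before, after)]

def log2_fc_proportions(before, after):
    """Log2 fold change computed on fractions of the library total."""
    total_before, total_after = sum(before), sum(after)
    return [math.log2((a / total_after) / (b / total_before))
            for b, a in zip(before, after)]

reference = [100.0, 100.0, 100.0, 100.0]
global_up = [200.0, 200.0, 200.0, 200.0]    # every RNA doubles
asymmetric = [400.0, 400.0, 100.0, 100.0]   # two RNAs rise, two are unchanged

print(log2_fc_absolute(reference, global_up))     # one doubling for every RNA
print(log2_fc_proportions(reference, global_up))  # the global change vanishes
print(log2_fc_absolute(reference, asymmetric))
print(log2_fc_proportions(reference, asymmetric)) # unchanged RNAs appear down
```

On the proportion scale a uniform doubling is invisible, and in the asymmetric case the unchanged RNAs acquire a negative fold change, which mirrors the sum-to-zero constraint described above.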

Practical implications for experimentalists
(not using GAMLSS)
• Any change to the population of RNAs modelled (e.g.
filtering)→ different DE values from the same dataset
• Both type M (degree of DE changes) and type S (label
an over-expressed sequence to be under-expressed &
vice-versa) errors
• Up to 25% of estimated DE changes may be of the wrong direction
• Up to 100% of estimated DE changes may be of the wrong
magnitude
• RNA-seq findings will fail to validate against qPCR
• Reputation of RNA-seq as a semiquantitative technique of
poor reproducibility is due to statistical methodology

MicroRNA regulation
A control theory perspective

microRNA biology & therapeutic applications
http://www.nature.com/nature/journal/v469/n7330/fig_tab/nature09783_F1.html
http://www.nature.com/nature/journal/v469/n7330/full/nature09783.html

Control In Biological Systems Is Many-
To-Many, Cooperative And Patterned
Feala JD, et al. PLoS ONE 7(1): e29374. (2012)
Riba A et al PLoS Comput Biol 10(2): e1003490.
(2014)
Bipartite Control Network Topologies miRNA – Transcription Factor circuits
Feed Forward Loop: master
control layout in many natural
and artificial control systems

How do we control things?
Predictably simple
(open loop)
Error Correcting
(feedback)
Model based
(feed forward)

Feed forward control
• Control element responds to a change in the
environment in a predefined manner
• Based on prediction of plant (“what is being
controlled”) behavior (requires model of the
system)
• Can react before error actually occurs (stabilizing
the system, e.g. cerebellum control of balance)
• Benefits: reduced hysteresis, increased accuracy,
cost-efficiency, lower “wear-tear”

Practical implications
• miRNAs function as master controllers in FFLs
• biology is intrinsically NOT model free
• miRNA profiling reveals the “plant” dynamics of
complex biological processes
• Emerging data suggest that sequence variation may underlie
(dys-)regulation
• miRNA associations are by definition causal to some
aspects of a particular phenotype
• “a priori plausible” biomarkers
• direct therapeutic implications
• Examination of the “plant” (targets) may have
implications for microRNA research
• Context for the interpretation of microRNA changes
• “Stronger” biomarker signatures

microRNAs are rational candidates for
exploring paradigm shifts in biology
• Ubiquity-conservation
• Breadth & width of regulation (>60% of genes)
• Context-specificity (“meta-controller”)
• Master Controllers in Feed Forward Loops
These arguments are not disease area specific (e.g. they apply
equally well to cancer or even psychiatric disease)

MicroRNAs as
biomarkers
Renal, Diabetes and Cardiometabolic Disease

Facts, figures and the natural history of
cardiometabolic and renal disease in diabetes
• 8-10% of the population suffer from diabetes
• 20-30% of patients with diabetes will develop evidence of
diabetic chronic kidney disease (DKD/CKD)
• DKD progresses in stages of increasing proteinuria
• 50% of patients with overt nephropathy will develop End
Stage Renal Disease (ESRD) within 10 years
• The end result: Diabetic nephropathy is the leading cause
of ESRD requiring dialysis or kidney transplantation,
accounting for 40% of cases

There is a significant unmet need for therapies that
stabilize progression and reduce death rates in
patients with diabetic kidney disease
• DKD is costly:
• 40-50% of the $44B Medicare expenditures for CKD
• 40-50% of the $50B total healthcare costs for ESRD
• DKD is lethal (>50% of these deaths are cardiac)
• Current therapies reduce risk by 30%
• Many of the things we tried to stabilize renal function AND
improve cardiovascular disease failed miserably in trials
• A paradigm change in our understanding of DKD is
warranted => We posit that miRNAs will trigger this shift
• This improvement will likely spread to other areas given the biology
of cardiovascular disease (“extreme phenotype”)
US population [1] | No Diabetes | Diabetes
No CKD | 7.7% | 11.5%
CKD | 17.2% | 31.1%
[1] Afkarian et al J Am Soc Nephrol. 2013 Feb;24(2):302-8
Figure: Dialysis Mortality (x-axis: Time (months), 0 to 60; y-axis: % Surviving, 40 to 100; curves: GN, DM)

Why bother with microRNAs in DKD?
Heart & Vessels
• Angiogenesis
• Vascular inflammation
• Atherosclerosis
• LVH
• Vascular tone
• Endothelial dysfunction
Kidney
• Water homeostasis
• Osmoregulation
• Calcium sensing
• Sodium, potassium,
acid base handling
• Renin production
• Renal development
• Renal senescence
• EMT
• Collagen production
Diabetes
• Insulin synthesis and
secretion
• Peripheral tissue
sensitivity
• Hepatic glucose
production
• Inflammatory gene
expression

microRNAs as Minimally Invasive
Biomarkers : a metrological argument
Advantages of microRNAs
Circulating microRNAs
•More stable in circulation than
mRNAs
•High expression level and low
complexity compared to mRNA
•Tissue specific expression
•Availability of analytical platforms
Keep getting cheaper over time
•Sequence conservation
Allows translation of clinical
associations to animal models
Allows translation of animal
models to clinical applications
Cortez et al Nat Rev Clin Oncol. Jun 7, 2011; 8(8): 467–477.

Targets of differentially expressed miRNAs in
early and late stages of DN map to overlapping
pathways
Columns: Pathway; MA vs. NA (P-value, Fraction); Overt vs. Normal (P-value, Fraction)
Signal Transduction
Signaling by SCF-KIT 0.006 18/76 0.001 41/76
Signaling by Insulin receptor 0.009 23/109 <0.001 65/109
Signaling by NGF 0.016 38/212 <0.001 119/212
Signaling by Rho GTPases 0.024 24/125 <0.001 71/125
Signaling by ERBB4 0.027 16/76 <0.001 45/76
Signaling by ERBB2 0.035 19/97 <0.001 59/97
Signaling by PDGF 0.040 22/118 <0.001 67/118
Signaling by VEGF 0.041 4/11
Signaling by EGFR 0.044 20/106 <0.001 64/106
Downstream signaling of activated FGFR 0.038 19/98 <0.001 61/98
Signaling by BMP 0.001 16/23
Signaling by TGFβ 0.004 11/15
DAG and IP3 signaling 0.010 20/31
PIP3 activates AKT signaling 0.020 15/26
RAF/MAP kinase cascade 0.031 7/10
Signaling by Notch 0.036 13/23
Interaction of integrin α5β3 with fibrillin 0.044 2/3
Interaction of integrin α5β3 with von Willebrand factor 0.044 2/3
Integrin cell surface interactions 0.024 40/85
Cell-Cell Communication 0.009 57/122
Cell Cycle
G0 and early G1 0.040 12/21

Leveraging the RNA-seq
analytical methodology
To boldly go where no one has gone before
(but many have tried)

Goals of a microRNA research
program in cardiometabolic, renal and
diabetes diseases
• Use carefully designed case-control, before-
after, randomized controlled trials, and n-of-
1 trials for the following goals:
1. Personalized medicine applications
(diagnosis/prognosis/precision medicine)
2. Biomarker discovery (e.g. to aid trials)
3. Novel Therapeutics

Animal
Models
Clinical
Associations
Clinical
Interventions
A microRNA driven discovery process
Biomarker
Discovery
Mechanistic
Insights
Therapeutics
Clinical Science, Bioinformatics, Systems
Biology Driven “Reverse Translation”
Translational
Science
Evidence
Based
Medicine
Basic Science

Ingredients for success of a microRNA
regulation discovery program
Requires open-ended platforms (RNA-seq)
o Especially for kidney disease due to intrarenal RNA editing
Requires unbiased quantification between groups of
patients (differential expression analysis)
Requires unbiased and accurate quantification in
the absence of a controlled comparison (diagnostics
– bias correction)
Proposed approach: GAMLSS for RNA-seq satisfies
requirements better than all currently used methods

Measurement in clinical diagnostics
What we want to happen
• Measurement is reproducible
• Measurement shows minimal inter-individual variation
• Measurement shows minimal intra-individual variation
• Example (paired readings, January and June): Patients 1 and 2 read 10, 10 (Condition A); Patients 3 and 4 read 15, 15 (Condition B)
What actually happens
• Measurement is non-reproducible
• Measurement shows high inter-individual variation
• Measurement shows high intra-individual variation
• Example (paired readings, January and June): Patient 1 reads 10, 18; Patient 2 reads 13, 10; Patient 3 reads 15, 10; Patient 4 reads 15, 18. In a second panel the readings are 10, 12; 15, 14; 18, 11; 14, 19. Several patients can no longer be assigned to Condition A or Condition B (“? Condition”)

Lessons from clinical chemistry labs
• Understand and control for the sources of variation
• Use calibration sets as references
• A measurement is instrument specific
• Global reference standards (role for highly competent
labs that maintain the standards)
• Context of use:
• Detector (“out-of-limits” readings)
• Control (“track the course”)
• Use GAMLSS as the prime analytical tool to analyze short
RNA-seq data as it correctly represents all sources of
variation and can use calibration (equimolar) runs
• Combine this with a protocol that experimentally controls
variation (e.g. 4N protocol of the Galas Lab)

Measurement in experimental samples
What we want to happen
• Measurement is reproducible
• Measurement shows no variation
• Run 1: Condition A = 10, 10, … , 10; Condition B = 15, 15, … , 15 (B > A)
• Run 2: Condition A = 10, 10, … , 10; Condition B = 15, 15, … , 15 (B > A)
• Certain of the difference
What actually happens
• Measurement is non-reproducible
• Measurement shows high variation
• Run 1: Condition A = 11, 7, … , 10; Condition B = 8, 19, … , 26 (B > A)
• Run 2: Condition A = 120, 90, … , 130; Condition B = 150, 60, … , 20 (B < A)
• Uncertain of the difference
• Use GAMLSS as the prime analytical tool to analyze short RNA-seq data
as it optimizes discovery/omission rates & exhibits the least bias
• BUT what do these correctly/unbiasedly assessed DE changes mean?

Understanding the context for
differential expression changes
• A list of de-regulated targets will not by itself
support the microRNA discovery process
• Need some context to interpret changes and guide
further research
• This context is provided by analysis of microRNA
targets
• We have proposed and applied a formal target
analysis methodology in our early diabetic
nephropathy investigations

Formal Target Analysis: A Biochemical
Primer
1. Hill plot: $\log\frac{\theta}{1-\theta} = \operatorname{logit}(\theta) = \log L - \log K_d$
2. Fold change between two states: $\log_2\frac{L_E}{L_R} = \log_2 FC$
3. Change in binding between the two states: $\operatorname{logit}(\theta_E) - \operatorname{logit}(\theta_R) = \log(OR_{E:R}) = \log_2 FC \times \log 2$
4. Means and standard errors for the fold changes can be synthesized
using random effects meta-analysis
5. Integration of fold changes from different experiments
• Use GAMLSS as the prime analytical tool to analyze differential expression in short
RNA-seq data as it achieves the smallest error among algorithms
http://www.pdg.cnb.uam.es/cursos/BioInfo2002/pages/F
armac/Comput_Lab/Guia_Glaxo/chap3b.html
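Steps 3 to 5 can be sketched with a standard inverse-variance meta-analysis. The code assumes that each miRNA contributes a log odds ratio and its standard error, that a log2 fold change is converted to a natural-log odds ratio by multiplying with ln 2, and that between-miRNA heterogeneity is estimated with the DerSimonian-Laird moment estimator. That estimator and the function names are illustrative choices, not necessarily the ones used in the original analysis.

```python
import math

def log2fc_to_log_or(log2_fc):
    """Convert a log2 fold change to a natural-log odds ratio."""
    return log2_fc * math.log(2.0)

def fixed_effect(te, se):
    """Inverse-variance pooled estimate and its standard error."""
    weights = [1.0 / s ** 2 for s in se]
    pooled = sum(w * t for w, t in zip(weights, te)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))

def dersimonian_laird_tau2(te, se):
    """Moment estimate of the between-study variance, truncated at zero."""
    weights = [1.0 / s ** 2 for s in se]
    pooled, _ = fixed_effect(te, se)
    q = sum(w * (t - pooled) ** 2 for w, t in zip(weights, te))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    return max(0.0, (q - (len(te) - 1)) / c)

def random_effects(te, se):
    """Random-effects pooled estimate using the DerSimonian-Laird tau^2."""
    tau2 = dersimonian_laird_tau2(te, se)
    return fixed_effect(te, [math.sqrt(s ** 2 + tau2) for s in se])

# Hypothetical log odds ratios for two miRNAs predicted to bind one target.
pooled, pooled_se = random_effects([0.2, 0.4], [0.5, 0.5])
odds_ratio = math.exp(pooled)
ci = (math.exp(pooled - 1.96 * pooled_se), math.exp(pooled + 1.96 * pooled_se))
print(odds_ratio, ci)
```

When the estimated heterogeneity is zero the fixed and random effects summaries coincide.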

The 1st grade approach to target analysis
Heuristic Argument: count the number of miRNAs with small p values
• Total Score (TS)= # of differentially expressed miRNAs predicted to
bind to a given target
• Regulation Score (RS)= # over-expressed- # under-expressed
miRNAs predicted to bind to a given target
RS | TS Low | TS High
- | Low Signal To Noise Ratio | Target likely disinhibited
0 | Low Signal To Noise Ratio | Target likely neutrally modulated
+ | Low Signal To Noise Ratio | Target likely inhibited
• Use GAMLSS as the prime analytical tool to analyze
putative targets of differentially expressed microRNAs as it
achieves the optimal balance between FDR/FOR
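The two heuristic scores are simple tallies, sketched below. The miRNA names, gene names, directions of change, and target predictions are all hypothetical placeholders, and `target_scores` is an illustrative helper rather than part of any published tool.

```python
def target_scores(de_direction, predicted_targets):
    """Return {gene: (total_score, regulation_score)}.

    de_direction maps each differentially expressed miRNA to +1
    (over-expressed) or -1 (under-expressed); predicted_targets maps
    each miRNA to the set of genes it is predicted to bind.
    """
    scores = {}
    for mirna, direction in de_direction.items():
        for gene in predicted_targets.get(mirna, ()):
            total, regulation = scores.get(gene, (0, 0))
            scores[gene] = (total + 1, regulation + direction)
    return scores

# Hypothetical inputs for illustration only.
de_direction = {"miR-A": +1, "miR-B": +1, "miR-C": -1}
predicted_targets = {
    "miR-A": {"GENE1", "GENE2"},
    "miR-B": {"GENE1"},
    "miR-C": {"GENE2"},
}
scores = target_scores(de_direction, predicted_targets)
print(scores)
```

Here GENE1 is bound by two over-expressed miRNAs (positive regulation score, likely inhibited), while GENE2 receives opposing signals that cancel.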

Target Analysis for PDGF-Beta in patients with overt diabetic kidney disease (DKD)
Target Gene: PDGFB
Forest plot of expression ratios (x-axis: Odds Ratio, log scale from 0.2 to 150)
Study | TE | seTE | OR | 95%-CI | W(fixed) | W(random)
hsa-let-7a-5p | 0.80 | 0.5893 | 2.23 | [0.70; 7.09] | 2.8% | 2.8%
hsa-let-7b-5p | -0.46 | 0.5709 | 0.63 | [0.21; 1.93] | 3.0% | 3.0%
hsa-let-7c | -0.30 | 0.5681 | 0.74 | [0.24; 2.26] | 3.1% | 3.1%
hsa-let-7d-5p | 0.61 | 0.6348 | 1.85 | [0.53; 6.42] | 2.5% | 2.5%
hsa-let-7e-5p | 0.22 | 0.5636 | 1.24 | [0.41; 3.75] | 3.1% | 3.1%
hsa-let-7f-5p | 0.32 | 0.5604 | 1.38 | [0.46; 4.14] | 3.2% | 3.2%
hsa-let-7g-5p | 0.71 | 0.6051 | 2.04 | [0.62; 6.66] | 2.7% | 2.7%
hsa-let-7i-5p | 0.45 | 0.6479 | 1.56 | [0.44; 5.56] | 2.4% | 2.4%
hsa-miR-106a-5p | 0.37 | 0.5721 | 1.45 | [0.47; 4.46] | 3.0% | 3.0%
hsa-miR-106b-5p | 0.37 | 0.6578 | 1.45 | [0.40; 5.26] | 2.3% | 2.3%
hsa-miR-122-5p | -0.06 | 0.5752 | 0.94 | [0.31; 2.91] | 3.0% | 3.0%
hsa-miR-1224-3p | 1.52 | 0.7148 | 4.56 | [1.12; 18.50] | 1.9% | 1.9%
hsa-miR-134 | 0.44 | 0.5414 | 1.56 | [0.54; 4.50] | 3.4% | 3.4%
hsa-miR-140-3p | 0.08 | 0.6300 | 1.09 | [0.32; 3.73] | 2.5% | 2.5%
hsa-miR-17-5p | 0.51 | 0.6882 | 1.67 | [0.43; 6.42] | 2.1% | 2.1%
hsa-miR-1909-3p | 0.32 | 0.5286 | 1.38 | [0.49; 3.88] | 3.5% | 3.5%
hsa-miR-1913 | 0.83 | 0.5430 | 2.28 | [0.79; 6.62] | 3.4% | 3.4%
hsa-miR-204-5p | 0.43 | 0.5450 | 1.54 | [0.53; 4.48] | 3.3% | 3.3%
hsa-miR-20a-5p | -0.12 | 0.5736 | 0.89 | [0.29; 2.73] | 3.0% | 3.0%
hsa-miR-20b-5p | 0.33 | 0.7984 | 1.39 | [0.29; 6.63] | 1.6% | 1.6%
hsa-miR-2110 | 0.09 | 0.5451 | 1.09 | [0.38; 3.18] | 3.3% | 3.3%
hsa-miR-2113 | 0.55 | 0.7309 | 1.74 | [0.41; 7.28] | 1.9% | 1.9%
hsa-miR-324-3p | 0.07 | 0.5503 | 1.07 | [0.36; 3.14] | 3.3% | 3.3%
hsa-miR-329 | 0.14 | 0.5424 | 1.15 | [0.40; 3.34] | 3.4% | 3.4%
hsa-miR-335-5p | 1.78 | 0.6324 | 5.90 | [1.71; 20.39] | 2.5% | 2.5%
hsa-miR-342-3p | -0.10 | 0.5810 | 0.90 | [0.29; 2.81] | 2.9% | 2.9%
hsa-miR-361-3p | 0.05 | 0.5991 | 1.05 | [0.32; 3.39] | 2.8% | 2.8%
hsa-miR-450b-3p | 0.74 | 0.6166 | 2.10 | [0.63; 7.03] | 2.6% | 2.6%
hsa-miR-491-5p | 0.60 | 0.6992 | 1.81 | [0.46; 7.14] | 2.0% | 2.0%
hsa-miR-501-5p | -0.08 | 0.7341 | 0.93 | [0.22; 3.90] | 1.8% | 1.8%
hsa-miR-545-3p | -0.01 | 0.7830 | 0.99 | [0.21; 4.60] | 1.6% | 1.6%
hsa-miR-558 | 0.27 | 0.5398 | 1.31 | [0.45; 3.76] | 3.4% | 3.4%
hsa-miR-603 | -0.64 | 0.5310 | 0.53 | [0.19; 1.50] | 3.5% | 3.5%
hsa-miR-608 | 0.11 | 0.7424 | 1.12 | [0.26; 4.79] | 1.8% | 1.8%
hsa-miR-663b | -0.41 | 0.8823 | 0.66 | [0.12; 3.74] | 1.3% | 1.3%
hsa-miR-765 | 0.72 | 0.5878 | 2.06 | [0.65; 6.50] | 2.9% | 2.9%
hsa-miR-93-5p | -0.12 | 0.5416 | 0.89 | [0.31; 2.57] | 3.4% | 3.4%
Fixed effect model | | | 1.33 | [1.09; 1.61] | 100% | --
Random effects model | | | 1.33 | [1.09; 1.61] | -- | 100%
Heterogeneity: I-squared=0%, tau-squared=0, p=0.9656

Target
analysis
of the
NFE2L2/
Nrf2
pathway
in DKD

Target
analysis
of the
TGF-beta
pathway
in DKD

To boldly go where no one has gone before….
Methodological
• Extend the model to account
for abundance dependent
variations in PCR efficiency
• Incorporate target analysis
into count analysis
• Estimate ligase bias from the
sequence (computationally
derived correction factors)
microRNA biomarkers projects
• COMPASS: a community disease
detection program focusing on
diabetes and CKD in rural New
Mexico
• MIRROR-Transplant: metabolic
and immunological factors
contributing to kidney transplant
failure
• DIDIT: randomized controlled
trial to preserve urine
production in patients starting
dialysis
• Potential areas for collaboration
in the NIH biorepository?

Summary
• A generative probability model for the counts of
short RNA-seq measurements was developed
• This model may be used to estimate and substantially
correct for the presence of ligase bias
• It achieves better performance (smaller error,
optimal balance of false discoveries and omissions)
than competing methodologies
• Can be used to power “personalized” medicine
applications or experimental state comparisons
• Formal target analysis to guide further research
(“reverse-translation”)

Acknowledgements
• This work could not have been completed without
the collaboration of the Galas Lab at PNRI
David Galas: provided a friendly ear that had the
patience to listen, comment and risk time and
funds for the experiments
Alton Etheridge: pushed for extensive sequencing
and resequencing and carried out all the validation
experiments
Nikita Sakhanenko: had the patience to be our
software tester, validator and GEO submitter
• This work would not have started without John P
(Nick) Johnson (University of Pittsburgh) who
kicked me into the area about 8 years ago
https://bitbucket.org/chrisarg/rnaseqgamlss

Building the model from first principles
• Establish statistical distributions OR deterministic
relationships that “bind” together the quantities in
successive steps
• There is a “competitive qPCR” experiment beating inside
each RNA-seq dataset → random
• Ligase bias is reproducible → deterministic/systematic
• Apply marginalization (integration) operations to
“flatten” the hierarchy
• Derive the exact distributions (or the limits of
approximation) for a statistical model that directly
represents the quantity of interest
• Relate model parameters to quantities of interest
(absolute/relative quantification)

Facts about the distribution of
RNA-seq data
• Established relationships between distributions
that were first explored in the 1920-1930s
• Rare biomedical applications in the 1940s
• Theoretical work in the early 1960s
• Lead goes cold due to failure to conceptualize
practical applications after the 1960s
• Extremely involved expressions involving special
functions of mathematical physics (parabolic
cylinder functions) → numerical complexities will
hinder attempts to use them as-is in applications

Rediscovering a Negative Binomial
parameterization and introducing a new
Gaussian Generalized Linear Normal Family
• Large scale numerical
simulations (>500,000) to
establish approximations for
the RNA-seq distribution
• Arbitrary precision libraries in
python in multicore machines
• Low precision – but
acceptable for statistical
computations
• Both approximations
implement a LQ relationship
between the mean and
variance
• Inferences are largely the
same (shown in synthetic
mixes)

Two equivalent views of measures of differential expression:
Fold Change and Probability of Over-Expression
• GLM approaches (limma,
DESeq/DESeq2, gamlss) yield
measures of differential
expression for microarrays,
RNA-Seq or qPCR experiments
• These are estimates of fold
changes (signal) and their
associated standard errors
(noise)
• They can be converted to
probability estimates(= 𝒑)
about the signal being >0
(overexpressed) v.s. <0
• The standard error of $p$ is given
by $\sqrt{p(1 - p)}$
Computing probability of differential
expression (pDE) in R
Figure: normal density of the estimated fold change (x-axis: Fold Change, from -2 to 4). With Fold Change = 1.0 and SE = 1.0, the shaded area
(= 1.0 - pnorm(0, FC, SE) in R) yields the
probability of overexpression
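The same calculation can be written with the Python standard library. The sketch below mirrors the R expression 1.0 - pnorm(0, FC, SE), treating the estimated fold change (on the log scale) as normally distributed; the function name is illustrative.

```python
from statistics import NormalDist

def prob_overexpression(fold_change, se):
    """Probability that the true (log) fold change exceeds zero."""
    return 1.0 - NormalDist(mu=fold_change, sigma=se).cdf(0.0)

p_de = prob_overexpression(1.0, 1.0)
print(round(p_de, 4))
```

An estimate of zero gives a probability of one half, and the probability moves toward 0 or 1 as the estimate grows relative to its standard error.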

Why do we need two views of the
same data?
The FC View
• Absolute, relative
quantification is possible
• Fold changes in one
miRNA are directly
comparable against each
other
• Fold changes are
comparable between and
within techniques
• Type I and II statistical
errors
The pDE View
• Only relative
quantification is possible
• Platforms provide
evidence for directional
changes in expression
• Type M and S errors
• Provides input to Systems
Biology tools (e.g.
Boolean Networks)

Solid measurements for thinning one’s blood:
the history of the PT test
• Experimental work in the late 19th century to discover the physiological
basis of coagulation (“prothrombin”)
• Development of different versions of the “Prothrombin Time”:
investigations in hemophilia, post-op bleeding & liver disease
(1930s-1950s): derived the normal range and ranges associated
with specific deficits
• Pre-analytical considerations throughout the 1950s (and even
today)
• In the 1970s PT was used to monitor and dose warfarin in the clinic
• Classical studies in the 1970s-1980s demonstrate high inter-individual, intra-individual and
analytic variability (despite > 30 years of standardization)
• WHO proposed to standardize the test in the mid 1980s through
the use of the INR (international normalized ratio)
http://www.clinchem.org/content/51/3/553.full
http://circ.ahajournals.org/content/19/1/92.full.pdf
Thromb Haemost. 1985 Feb 18;53(1):155-6.

The cautious story of the INR
Normalization procedure
• $INR = \left(\frac{PT_{patient}}{PT_{normal}}\right)^{ISI}$
• $PT_{normal}$: geometric mean of 20
patients
• $ISI = \frac{\log(\text{calibrator } INR)}{\log(PT_{calibrator} / PT_{normal})}$
Sources of variation
• Different methods to measure
the PT
• Different instruments that
implement each method
• Different calibrator sets for
each instrument!
http://www.who.int/bloodproducts/publications/WHO_TRS_889_A3.pdf
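The normalization can be made concrete with a short calculation. The sketch assumes the usual reading of the two formulas (the INR is the PT ratio raised to the ISI, and the ISI is solved from a calibrator of known INR); the prothrombin times used are hypothetical.

```python
import math

def inr(pt_patient, pt_normal, isi):
    """International normalized ratio: (PT_patient / PT_normal) ** ISI."""
    return (pt_patient / pt_normal) ** isi

def isi_from_calibrator(calibrator_inr, pt_calibrator, pt_normal):
    """Solve the INR definition for the ISI using a calibrator of known INR."""
    return math.log(calibrator_inr) / math.log(pt_calibrator / pt_normal)

# Hypothetical readings in seconds.
pt_normal = 12.0
isi = isi_from_calibrator(calibrator_inr=4.0, pt_calibrator=24.0, pt_normal=pt_normal)
print(isi, inr(18.0, pt_normal, isi))
```

The same patient sample can therefore yield different raw PT values on different instruments while mapping to a comparable INR, which is the role calibration runs play for RNA-seq in this talk.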

Statistics Of Biological Regulatory Networks
Feala JD, et al. PLoS ONE 7(1): e29374. (2012)

Pathophysiology of the cardiorenal syndrome
http://www.kdigo.org/meetings_events/pdf/KDIGO%20CVD%20Controversy%20Rpt.pdf
