Exploiting technical replicate variance analysis

Enrico Glaab
Luxembourg Centre for Systems Biomedicine
Exploiting technical replicate variance in omics data analysis (RepExplore)

Introduction: Variance and replicates
• Biomedical omics datasets may contain different types of variance:
Variance across samples from different individuals:
- Between-group variance
- Within-group variance
Variance across samples from the same individual:
- Intra-individual variance
- Technical variance
• Between- and within-group variance are covered via biological replicates,
intra-individual variance via time-series measurements, and technical
variance via technical replicates

Motivation: Loss of technical variance information
• Typically, combinations of biological and summarized technical replicates are
used for case-control analyses:
• Summarization of technical replicates into average measurements can
provide more robust input data, but information on the variance across
technical replicates is lost!
[Figure: workflow diagram; two sets of technical replicates each undergo mean/median summarization, followed by statistical analysis]
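
To make the information loss concrete, here is a minimal Python sketch. The sample names and values are made up for illustration and are not taken from any dataset in these slides. Both hypothetical samples have (practically) the same mean, so they look identical after summarization, although one is measured far less reliably than the other.

```python
from statistics import mean, median, variance

# Hypothetical log-scale measurements of one biomolecule.
# Each sample has three technical replicates (illustrative values only).
replicates = {
    "sample_A": [10.1, 10.2, 10.0],  # tightly clustered replicates
    "sample_B": [9.1, 10.1, 11.1],   # widely scattered replicates
}


def summarize(values):
    """Return the usual summaries plus the replicate variance
    that is discarded when only the mean or median is kept."""
    return {
        "mean": mean(values),
        "median": median(values),
        "tech_variance": variance(values),  # sample variance (n - 1 denominator)
    }


summary = {name: summarize(vals) for name, vals in replicates.items()}
for name, stats in summary.items():
    print(name, stats)
```

A downstream test that only sees the "mean" column cannot distinguish the two samples; the "tech_variance" column is the part that the methods on the following slides try to exploit.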

Example: Relevance of technical variance information
Abundance of the top differential metabolite (L-valine) in the Arabidopsis data by
Andersen et al. (2014) without (a) and with technical error (b, see whiskers)
[Figure panels: (a) without technical error, (b) with technical error; annotation: mean-summarized technical replicates]

Probability of positive likelihood ratio (PPLR)
• We assume that omics data on a logarithmic scale are approximately normally
distributed
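• Using mean (μ) and variance (s²) estimates derived from technical replicate
measurements, differential expression can be scored using the probability of
positive likelihood ratio (PPLR):
$$\mathrm{PPLR} = P\left(\frac{\mu_1 - \mu_2}{\sqrt{s_1^2 + s_2^2}}\right)$$
where P is the cumulative distribution function of the standard normal distribution
and the subscripts 1 and 2 denote the two sample groups.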
• For probe replicates on DNA microarray chips, Liu et al. propose a dedicated
parameter estimation approach to incorporate the probe-level measurement
error into variance estimates s² (Bioinformatics, 2006)
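
The PPLR score can be sketched in a few lines of Python. This is a minimal illustration, not the RepExplore implementation: it assumes the commonly used form PPLR = P((μ1 − μ2) / sqrt(s1² + s2²)), where the subscripts denote the two groups, and it takes the group-level means and variances as given. The function names and input values are hypothetical.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def pplr(mu1, var1, mu2, var2):
    """PPLR under the assumed form P((mu1 - mu2) / sqrt(var1 + var2)).

    mu1, mu2   : estimated log-scale means of group 1 and group 2
    var1, var2 : variances attached to those mean estimates
    """
    total_var = var1 + var2
    if total_var <= 0:
        raise ValueError("the summed variance must be positive")
    return standard_normal_cdf((mu1 - mu2) / sqrt(total_var))


# Same mean difference, different measurement uncertainty (made-up values).
confident = pplr(11.0, 0.05, 10.0, 0.05)  # small variances
uncertain = pplr(11.0, 2.0, 10.0, 2.0)    # large variances
print(confident, uncertain)
```

Under this form, values close to 1 or 0 point to higher or lower abundance in group 1, and values near 0.5 indicate no clear difference. The two calls show that the same mean difference yields a less extreme score when the variances are large.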

Comparison: Classical vs. PPLR top-ranked result
Whisker plots of the top differential metabolite in the Arabidopsis data by
Andersen et al. (2014): (a) classical approach (L-valine), (b) PPLR (L-proline)

Application to Parkinson’s disease transcriptome data
a) Whisker plot of the top differential gene (NDUFB2) in the Parkinson’s disease
dataset by Simunovic et al. (2009); b) Heat map of the top 10 differentially
expressed genes according to the PPLR statistic.

Improved robustness of rankings across studies
• Evaluation: compare the similarity of gene rankings across two Parkinson’s
disease datasets (Simunovic et al., 2009, and Zheng et al., 2010)
• Two gene ranking methods are compared: PPLR and limma/eBayes (note:
limma's duplicateCorrelation function is not generally applicable)
• Kendall’s tau is used as the similarity measure for rankings
→ Significantly higher similarity of rankings with the PPLR statistic (p < 2.2E-16
for both up- and down-regulated genes)
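
As an illustration of the similarity measure, the sketch below computes Kendall's tau for two rankings of the same genes. It is a plain tau-a implementation that assumes there are no tied ranks; the gene labels and ranks are hypothetical, and the significance test behind the reported p-value is not reproduced here.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings given as {item: rank} dictionaries.

    Both rankings must cover the same items; ties are assumed to be absent.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must contain the same items")
    items = list(rank_a)
    n = len(items)
    if n < 2:
        raise ValueError("at least two items are required")

    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # Positive product: the pair is ordered the same way in both rankings.
        direction = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical ranks from two studies (1 = most differential).
study_1 = {"geneA": 1, "geneB": 2, "geneC": 3, "geneD": 4}
study_2 = {"geneA": 1, "geneB": 3, "geneC": 2, "geneD": 4}
tau = kendall_tau(study_1, study_2)
print(tau)
```

A tau of 1 means the two studies rank the genes identically, −1 means the order is fully reversed, and values near 0 mean the rankings are essentially unrelated.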

Rankings improve with increasing numbers of replicates
• Apply the method to simulated normally distributed data with standard deviation 1,
100 samples (two groups, 50 per group) and 1000 features/biomolecules (900
uncorrelated, irrelevant features and 100 differential features with a fixed effect size of D = 1)
• Add simulated technical replicates with measurement noise (R function jitter)
• Test the enrichment of the 100 differential features in the final PPLR ranking
→ Enrichment score increases with increasing numbers of replicates
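
The simulation setup can be approximated with the following Python sketch. It rests on several simplifying assumptions that are not from the slides: uniform noise stands in for R's jitter, the group-level variance is taken as the squared standard error of the per-sample replicate means, the assumed PPLR form is P((μ1 − μ2) / sqrt(s1² + s2²)), and "enrichment" is defined as the fraction of truly differential features among the top-ranked ones. The default sizes are reduced to keep the run short.

```python
import random
from math import erf, sqrt
from statistics import mean, variance


def pplr(mu1, var1, mu2, var2):
    """Assumed PPLR form: standard normal CDF of the scaled mean difference."""
    z = (mu1 - mu2) / sqrt(var1 + var2)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def simulate_enrichment(n_features=200, n_diff=20, n_per_group=20,
                        n_reps=3, effect=1.0, noise=0.5, seed=0):
    """Fraction of truly differential features among the top n_diff features
    when ranking by |PPLR - 0.5| (an illustrative enrichment measure)."""
    rng = random.Random(seed)
    scores = []
    for feature in range(n_features):
        # The first n_diff features are differential, the rest are irrelevant.
        shift = effect if feature < n_diff else 0.0
        group_stats = []
        for group_shift in (shift, 0.0):
            sample_means = []
            for _ in range(n_per_group):
                true_value = rng.gauss(group_shift, 1.0)
                # Technical replicates: true value plus uniform noise.
                reps = [true_value + rng.uniform(-noise, noise)
                        for _ in range(n_reps)]
                sample_means.append(mean(reps))
            # Simplification: group mean and its squared standard error.
            group_stats.append((mean(sample_means),
                                variance(sample_means) / n_per_group))
        (mu1, var1), (mu2, var2) = group_stats
        scores.append((abs(pplr(mu1, var1, mu2, var2) - 0.5), feature))

    top = sorted(scores, reverse=True)[:n_diff]
    return sum(1 for _, feature in top if feature < n_diff) / n_diff


enrichment = simulate_enrichment()
print(enrichment)
```

Calling simulate_enrichment with different n_reps values (and the slide's sizes, n_features=1000, n_diff=100, n_per_group=50) gives a rough way to explore how the ranking responds to the number of replicates; this simplified estimator is not guaranteed to reproduce the reported trend.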

Using replicate variance information in PCA
• Principal Component Analysis (PCA) can be cast as a probabilistic model
(Tipping & Bishop, 1999) where d-dimensional data points y_n can be
reconstructed from a q-dimensional latent point x_n via a linear transformation
(W) and a noise vector ε_n:
• The corresponding data distribution is:
• If each dimension is allowed to have a different noise variance:
• While the maximum likelihood solution for the original model is equivalent to
PCA, Sanguinetti et al. presented an expectation maximization (EM) method
for the model with feature-specific noise variance (Bioinformatics, 2005)
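
The equations for this slide are not present in the extracted text. As a hedged reconstruction, the standard probabilistic PCA formulation of Tipping & Bishop can be written as follows, assuming centered data (no mean term), a standard normal latent variable, and an isotropic noise variance σ². The third line is one way to express feature-specific noise variances σ_j²; the exact parameterization used by Sanguinetti et al. may differ.

```latex
% Assumptions: centered data, x_n ~ N(0, I_q), symbols sigma^2 and sigma_j^2
% are introduced here for illustration.
\begin{align}
  y_n &= W x_n + \varepsilon_n,
        \qquad x_n \sim \mathcal{N}(0, I_q),
        \quad \varepsilon_n \sim \mathcal{N}(0, \sigma^2 I_d) \\
  y_n &\sim \mathcal{N}\!\left(0,\; W W^{\top} + \sigma^2 I_d\right) \\
  y_n &\sim \mathcal{N}\!\left(0,\; W W^{\top}
        + \operatorname{diag}(\sigma_1^2, \ldots, \sigma_d^2)\right)
\end{align}
```

In the isotropic case every feature contributes the same noise term, which is why the maximum likelihood solution coincides with ordinary PCA; with a diagonal noise covariance, features with larger noise variance have less influence on the estimated W.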

Denoising property & results on Parkinson’s disease data
• In the new probabilistic PCA, the noise estimate enables an evaluation of
the significance of principal components (PCs) → automated computation
of the maximum number of retainable PCs
• By accounting for measurement error in the model, noisy values are
down-weighted → pre-processing data via the modified PCA tends to provide
tighter sample clusters (confirmed on Parkinson’s disease transcriptome
data)

Summary
• Technical replicate variance in omics data is usually not constant across
different biomolecules → summarizing replicates only via averaging will result
in the loss of valuable information
• On Parkinson’s disease transcriptomics data, accounting for technical
replicate variance provided improved robustness of differential expression
rankings across studies and gave tighter clusters in PCA visualizations
• All methods presented have been implemented on a public web-server
(RepExplore: www.repexplore.tk, Bioinformatics, 2015), providing automated
analyses with interactive visualizations and ranking tables

References
1. E. Glaab, R. Schneider, RepExplore: Addressing technical replicate variance in proteomics and metabolomics data analysis, Bioinformatics (2015), 31(13), pp. 2235
2. E. Glaab, Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification, Briefings in Bioinformatics (2015), in press (doi: 10.1093/bib/bbv044)
3. E. Glaab, R. Schneider, Comparative pathway and network analysis of brain transcriptome changes during adult aging and in Parkinson's disease, Neurobiology of Disease (2015), 74, 1-13
4. N. Vlassis, E. Glaab, GenePEN: analysis of network activity alterations in complex diseases via the pairwise elastic net, Statistical Applications in Genetics and Molecular Biology (2015), 14(2), pp. 221
5. E. Glaab, Building a virtual ligand screening pipeline using free software: a survey, Briefings in Bioinformatics (2015), in press (doi: 10.1093/bib/bbv037)
6. E. Glaab, A. Baudot, N. Krasnogor, R. Schneider, A. Valencia, EnrichNet: network-based gene set enrichment analysis, Bioinformatics, 28(18):i451-i457, 2012
7. E. Glaab, R. Schneider, PathVar: analysis of gene and protein expression variance in cellular pathways using microarray data, Bioinformatics, 28(3):446-447, 2012
8. E. Glaab, J. Bacardit, J. M. Garibaldi, N. Krasnogor, Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data, PLoS ONE, 7(7):e39932, 2012
9. E. Glaab, A. Baudot, N. Krasnogor, A. Valencia, TopoGSA: network topological gene set analysis, Bioinformatics, 26(9):1271-1272, 2010
10. E. Glaab, A. Baudot, N. Krasnogor, A. Valencia, Extending pathways and processes using molecular interaction networks to analyse cancer genome data, BMC Bioinformatics, 11(1):597, 2010
11. H. O. Habashy, D. G. Powe, E. Glaab, N. Krasnogor, J. M. Garibaldi, E. A. Rakha, G. Ball, A. R. Green, C. Caldas, I. O. Ellis, RERG (Ras-related and oestrogen-regulated growth-inhibitor) expression in breast cancer: A marker of ER-positive luminal-like subtype, Breast Cancer Research and Treatment, 128(2):315-326, 2011
12. E. Glaab, J. M. Garibaldi, N. Krasnogor, ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization, BMC Bioinformatics, 10:358, 2009
13. E. Glaab, J. M. Garibaldi, N. Krasnogor, Learning pathway-based decision rules to classify microarray cancer samples, German Conference on Bioinformatics 2010, Lecture Notes in Informatics (LNI), 173, 123-134
14. E. Glaab, J. M. Garibaldi, N. Krasnogor, VRMLGen: An R-package for 3D Data Visualization on the Web, Journal of Statistical Software, 36(8), 1-18, 2010
