High-throughput omics datasets often contain technical replicates included to account for technical sources of noise in the measurement process. Although summarizing these replicate measurements by using robust averages may help to reduce the influence of noise on downstream data analysis, the information on the variance across the replicate measurements is lost in the averaging process and therefore typically disregarded in subsequent statistical analyses.
We introduce RepExplore, a web-service dedicated to exploit the information captured in the technical replicate variance to provide more reliable and informative differential expression and abundance statistics for omics datasets. The software builds on previously published statistical methods, which have been applied successfully to biomedical omics data but are difficult to use without prior experience in programming or scripting. RepExplore facilitates the analysis by providing a fully automated data processing and interactive ranking tables, whisker plot, heat map and principal component analysis visualizations to interpret omics data and derived statistics.
Availability and implementation: Freely available at http://www.repexplore.tk
Journal publication: http://bioinformatics.oxfordjournals.org/content/31/13/2235.long (Glaab, E., & Schneider, R. (2015). RepExplore: Addressing technical replicate variance in proteomics and metabolomics data analysis. Bioinformatics, 31 (13): 2235-2237)
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Exploiting technical replicate variance analysis
1. Enrico Glaab
Luxembourg Centre for Systems Biomedicine
Exploiting technical
replicate variance in omics
data analysis (RepExplore)
2. 2
Introduction: Variance and replicates
• Biomedical omics datasets may contain different types of variance:
Variance across samples from different individuals:
- Between-group variance
- Within-group variance
Variance across samples from the same individual:
- Intra-individual variance
- Technical variance
• Between- and within-group variance are covered via biological replicates,
intra-individual variance via time-series measurements, and technical
variance via technical replicates
3. 3
Motivation: Loss of technical variance information
• Typically, combinations of biological and summarized technical replicates are
used for case-control analyses:
• Summarization of technical replicates into average measurements can
provide more robust input data, but information on the variance across
technical replicates is lost!
Mean/median
summarization
Mean/median
summarization
Statistical analysis
4. 4
Example: Relevance of technical variance information
Abundance of the top differential metabolite (L-valine) in the Arabidopsis data by
Andersen et al. (2014) without (a) and with technical error (b, see whiskers)
without technical error with technical error
a) b)
mean-summarized
technical replicates
5. 5
Probability of positive likelihood ratio (PPLR)
• We assume the omics data on logarithmic scale has an approximate normal
distribution
• Using mean (μ) and variance (s²) estimates derived from technical replicate
measurements, differential expression can be scored using the probability of
positive likelihood ratio (PPLR):
where P is the cumulative distribution function for the standard normal curve.
• For probe replicates on DNA microarray chips, Liu et al. propose a dedicated
parameter estimation approach to incorporate the probe-level measurement
error into variance estimates s² (Bioinformatics, 2006)
6. 6
Comparison: Classical vs. PPLR top-ranked result
Whisker plots of the top differential metabolite in the Arabidopsis data by
Andersen et al. (2014): (a) classical approach (L-valine), (b) PPLR (L-proline)
a) b)
7. 7
Application to Parkinson‘s disease transcriptome data
a) b)
a) Whisker plot of the top differential gene (NDUFB2) in the Parkinson’s disease
dataset by Simunovic et al. (2009); b) Heat map of the top 10 differentially
expressed genes according to the PPLR statistic.
8. 8
Improved robustness of rankings across studies
• Evaluation: Compare the similarity of gene rankings across two Parkinson‘s
disease datasets (Simunovic et al., 2009, and Zheng et al., 2010)
• Two gene ranking methods are compared: PPLR and limma/eBayes (note:
duplicateCorrelation-function not generally applicable)
• Kendall‘s tau used as similarity measure for rankings
Significantly higher similarity of rankings with PPLR statistic (p < 2.2E-16
for both up- and down-regulated genes)
9. 9
Rankings improve with increasing numbers of replicates
• Apply method to simulated normal data with stddev. 1, 100 samples
(two groups, 50 per group) and 1000 features/biomolecules (900
uncorrelated, irrelevant and 100 differential, fixed effect size of D = 1)
• Add simulated technical replicates with measurement noise (R function jitter)
• Test enrichment of 100 differential features in the final PPLR ranking
Enrichment score increases with increasing numbers of replicates
10. 10
Using replicate variance information in PCA
• Principal Component Analysis (PCA) can be cast as a probabilistic model
(Tipping & Bishop, 1999) where d-dimensional data points yn can be
reconstructed from a q-dimensional latent point xn via a linear transformation
(W) and a noise vector εn:
• The corresponding data distribution is:
• If each dimension is allowed to have a different noise variance:
• While the maximum likelihood solution for the original model is equivalent to
PCA, Sanguinetti et al. presented an expectation maximization (EM) method
for the model with feature-specific noise variance (Bioinformatics, 2005)
11. 11
Denoising property & results on Parkinson‘s disease data
• In the new probabilistic PCA, the noise estimate enables an evaluation of
the significance of principal components (PCs) automated computation
of the maximum number of retainable PCs
• By accounting for measurement error in the model, noisy values are down-
weighted pre-processing data via the modified PCA tends to provide
tighter sample clusters (confirmed on Parkinson‘s disease transcriptome
data)
12. 12
Summary
• Technical replicate variance in omics data is usally not constant across
different biomolecules only summarizing replicates via averaging will result
in loss of valuable information
• On Parkinson‘s disease transcriptomics data, accounting for technical
replicate variance provided improved robustness of differential expression
rankings across studies and gave tighter clusters in PCA visualizations
• All methods presented have been implemented on a public web-server
(RepExplore: www.repexplore.tk, Bioinformatics, 2015), providing automated
analyses with interactive visualizations and ranking tables
13. 13
References
1. E. Glaab, R. Schneider, RepExplore: Addressing technical replicate variance in proteomics and metabolomics data analysis, Bioinformatics (2015),
31(13), pp. 2235
2. E. Glaab, Using prior knowledge from cellular pathways and molecular networks for diagnostic specimen classification, Briefings in Bioinformatics
(2015), in press (doi: 10.1093/bib/bbv044)
3. E. Glaab, R. Schneider, Comparative pathway and network analysis of brain transcriptome changes during adult aging and in Parkinson's disease,
Neurobiology of Disease (2015), 74, 1-13
4. N. Vlassis, E. Glaab, GenePEN: analysis of network activity alterations in complex diseases via the pairwise elastic net, Statistical Applications in
Genetics and Molecular Biology (2015), 14(2), pp. 221
5. E. Glaab, Building a virtual ligand screening pipeline using free software: a survey, Briefings in Bioinformatics (2015), in press (doi:
10.1093/bib/bbv037)
6. E. Glaab, A. Baudot, N. Krasnogor, R. Schneider, A. Valencia. EnrichNet: network-based gene set enrichment analysis, Bioinformatics, 28(18):i451-
i457, 2012
7. E. Glaab, R. Schneider, PathVar: analysis of gene and protein expression variance in cellular pathways using microarray data, Bioinformatics,
28(3):446-447, 2012
8. E. Glaab, J. Bacardit, J. M. Garibaldi, N. Krasnogor, Using rule-based machine learning for candidate disease gene prioritization and sample
classification of cancer gene expression data, PLoS ONE, 7(7):e39932, 2012
9. E. Glaab, A. Baudot, N. Krasnogor, A. Valencia. TopoGSA: network topological gene set analysis,
Bioinformatics, 26(9):1271-1272, 2010
10. E. Glaab, A. Baudot, N. Krasnogor, A. Valencia. Extending pathways and processes using molecular interaction networks to analyse cancer genome
data, BMC Bioinformatics, 11(1):597, 2010
11. H. O. Habashy, D. G. Powe, E. Glaab, N. Krasnogor, J. M. Garibaldi, E. A. Rakha, G. Ball, A. R Green, C. Caldas, I. O. Ellis, RERG (Ras-related and
oestrogen-regulated growth-inhibitor) expression in breast cancer: A marker of ER-positive luminal-like subtype, Breast Cancer Research and
Treatment, 128(2):315-326, 2011
12. E. Glaab, J. M. Garibaldi and N. Krasnogor. ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus
methods with cross-study normalization, BMC Bioinformatics,10:358, 2009
13. E. Glaab, J. M. Garibaldi, N. Krasnogor. Learning pathway-based decision rules to classify microarray cancer samples, German Conference on
Bioinformatics 2010, Lecture Notes in Informatics (LNI), 173, 123-134
14. E. Glaab, J. M. Garibaldi and N. Krasnogor. VRMLGen: An R-package for 3D Data Visualization on the Web, Journal of Statistical Software, 36(8),1-
18, 2010