1.
Metabolomics Data Analysis
Johan A. Westerhuis
Swammerdam Institute for Life Sciences, University of Amsterdam
Business Mathematics and Information, North-West University, Potchefstroom, South Africa
egraSeqAhead, Barcelona, February 2013
2.
Metabolomics pipeline: Issues for biostatistics
Pipeline stages: Biological question, Experimental design, Data acquisition, Metabolite identification, Data pre-processing, Statistical data analysis, Biological interpretation
Issues:
• Power analysis, Treatment design, Measurement design
• Normalisation, Quantification, QC strategy
• Spectral matching, De novo identification
• Explorative, Predictive, Hypothetical biomarkers
• Network inference, MSEA, Pathway analysis
3.
Data Analysis special issue, Metabolomics
• Data preprocessing methods (make samples more comparable)
• How to treat non-detects
• Variable importance in multivariate models
• Metabolic network analysis
• Data fusion methods
• Individual responses
• Between metabolite ratios
Guest Editors: Jeroen J. Jansen, Johan A. Westerhuis
5.
Multivariate Metabolomics Data analysis
• Explorative
– Find groups, clusters structure / outliers in metabolites and in samples
• Supervised
– Discriminate two or more groups to make predictive model and to find biomarkers.
• Biological Interpretation
– Metabolite set enrichment
– Pathway analysis
– Metabolic network inference
• Special topics
– Between metabolite ratios
– Metabolomics Data Fusion
6.
Metabolomics Data preprocessing
• Optimize biological content of data
• Correct for incorrect sampling, sample workup issues, batch effects
• What is the noise level in the data? Generalized log transform. Variance stabilization.
• High peaks more important than low peaks?
• Multivariate methods love large values!
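The generalized log transform mentioned above can be sketched as follows. This is a minimal illustration, assuming one common form of the transform, glog(x) = ln((x + sqrt(x^2 + lam)) / 2); the function name `glog`, the parameter `lam` and the toy peak intensities are assumptions for illustration, not values from the slides.

```python
import numpy as np

def glog(x, lam=1.0):
    """Generalized log transform (one common form, assumed here).

    Behaves like ln(x) for large x, but stays finite at x = 0, so low peaks
    and non-detects do not blow up. `lam` is a tuning parameter that is
    usually chosen from the noise level of the data.
    """
    x = np.asarray(x, dtype=float)
    return np.log((x + np.sqrt(x ** 2 + lam)) / 2.0)

# Hypothetical peak intensities spanning several orders of magnitude.
peaks = np.array([0.0, 5.0, 50.0, 5000.0])
transformed = glog(peaks, lam=4.0)
```

After the transform, the very large peak no longer dominates the scale, which is the point behind the remark that multivariate methods love large values.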
7.
Metabolic changes during E. coli culture growth using k-means clustering
(A) Growth curve (optical density) of unperturbed E. coli culture. Numbers of the respective sampling time points are marked on the curve. Time point 0 minutes marks the application of the respective stress condition.
(B) Relative changes of metabolite pools, normalized to time point 1 (axes: time, metabolites). Fold change is presented on a log10 scale. To reveal the main trends of metabolic changes, 10 k-means clusters are color coded.
Szymanski, Jedrzej et al. PLoS ONE (2009), vol. 4, issue 10
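The two steps described in panel (B), fold change relative to the first time point on a log10 scale followed by k-means clustering of the metabolite profiles, can be sketched as below. This is a plain Lloyd-style k-means written with numpy, not the cited paper's pipeline; the function names and the toy pool sizes are hypothetical.

```python
import numpy as np

def log10_fold_change(pools, ref_col=0):
    """log10 of each metabolite pool relative to its value at a reference time point."""
    pools = np.asarray(pools, dtype=float)
    return np.log10(pools / pools[:, [ref_col]])

def kmeans(profiles, k, n_iter=100, seed=0):
    """Basic Lloyd k-means on the rows of `profiles`; returns labels and centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centers = profiles[rng.choice(len(profiles), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance of every profile to every center.
        dist = ((profiles[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = profiles[labels == j]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: rows are metabolites, columns are time points (hypothetical values).
pools = np.array([[1.0, 2.0, 4.0],
                  [2.0, 5.0, 9.0],
                  [8.0, 4.0, 2.0],
                  [4.0, 2.0, 0.9]])
profiles = log10_fold_change(pools)
labels, centers = kmeans(profiles, k=2)
```

The slide uses 10 clusters; k = 2 is used here only because the toy data has two obvious trends (increasing and decreasing pools).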
8.
Self Organising Map of Metabolites in serum
1H NMR spectra of 613 patients with type I diabetes and a diverse spread of complications
Nonlinear mapping method for large number of samples.
Relate position on the map to diagnostic responses.
Can be made supervised.
"1H NMR metabonomics approach to the disease continuum of diabetic complications and premature death", VP Mäkinen et al, Molecular Systems Biology 4:167, 2008
9.
Multivariate Metabolomics Data analysis
• Explorative
– Find groups, clusters structure / outliers in metabolites and in samples
• Supervised (Differentially expressed)
– Discriminate two or more groups to make predictive model and to find biomarkers.
• Biological Interpretation
– Metabolite set enrichment
– Pathway analysis
– Metabolic network inference
• Special topics
– Between metabolite ratios
– Metabolomics Data Fusion
10.
Supervised Metabolomics Data analysis
Case – Control (PLSDA)
[Figure: score plot (PC1 versus PC2) for Men and Women with class vector Y (0/1); PLS regression coefficients b versus chemical shift (ppm)]
• Is there really a difference between the groups? Statistical validation issues
• Which are the most important peaks for discrimination? Variable importance
11.
• Explain the Psyhogios example using examples from the paper and from MetaboAnalyst
Proton NMR spectra of the urine samples were obtained on a 500 MHz 1H NMR spectrometer.
13.
Nonsupervised vs. Supervised
14.
Experimental Design Example
Experiment: Rats are given bromobenzene (BB), which affects the liver
Measurements: NMR spectroscopy of urine
Experimental Design:
• Time: 6, 24 and 48 hours
• Groups: 3 doses of BB, Vehicle group, Control group
• Animals: 3 rats per dose per time point
[Figure: NMR spectra of rat urine at 6 hours, 24 hours and 48 hours, with labelled peaks; axis: chemical shift (ppm)]
15.
Different contributions
Experimental Design: Time, Dose, Animal
[Figure: metabolite concentration versus time; panels: Time, Dose, Animal, Trajectories]
16.
ANOVA decomposition of each variable
$$x_{hki_{hk}} = \mu + \alpha_k + (\alpha\beta)_{hk} + (\alpha\beta\gamma)_{hki_{hk}}$$
MATRICES:
$$X = 1m^T + X_\alpha + X_{\alpha\beta} + X_{\alpha\beta\gamma}$$
17.
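ANOVA and PCA: ASCA
$$X = 1m^T + X_\alpha + X_{\alpha\beta} + X_{\alpha\beta\gamma}$$
PCA of each effect matrix gives scores $T$ and loadings $P$:
$$X = 1m^T + T_\alpha P_\alpha^T + T_{\alpha\beta} P_{\alpha\beta}^T + T_{\alpha\beta\gamma} P_{\alpha\beta\gamma}^T + E$$
E: parts of the data not explained by the component models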
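The combination of ANOVA and PCA can be sketched in a few lines of numpy. This is a minimal illustration, assuming a design with a time factor (effect α), a dose-within-time effect (αβ) and an individual animal effect (αβγ); the function names and the random toy data (3 time points, 2 dose levels, 3 animals per cell) are hypothetical and not the bromobenzene data.

```python
import numpy as np

def level_means(X, labels):
    """Replace every row by the mean of all rows that share its label."""
    out = np.empty_like(X)
    for lab in np.unique(labels):
        mask = labels == lab
        out[mask] = X[mask].mean(axis=0)
    return out

def asca_decompose(X, time, dose):
    """Split X into overall mean, time, dose-within-time and individual parts."""
    X = np.asarray(X, dtype=float)
    time = np.asarray(time)
    dose = np.asarray(dose)
    mean_part = np.tile(X.mean(axis=0), (X.shape[0], 1))   # 1 m^T
    Xc = X - mean_part
    X_a = level_means(Xc, time)                            # time effect
    cell = np.array([f"{t}|{d}" for t, d in zip(time, dose)])
    X_ab = level_means(Xc, cell) - X_a                     # dose x time effect
    X_abg = Xc - X_a - X_ab                                # individual deviations
    return mean_part, X_a, X_ab, X_abg

def pca(M, n_comp=2):
    """PCA through the SVD; returns scores T and loadings P with M ~ T P^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :n_comp] * s[:n_comp], Vt[:n_comp].T

# Hypothetical toy design: 18 samples, 5 variables.
rng = np.random.default_rng(0)
time = np.repeat([6, 24, 48], 6)
dose = np.tile(np.repeat(["low", "high"], 3), 3)
X = rng.normal(size=(18, 5))

mean_part, X_a, X_ab, X_abg = asca_decompose(X, time, dose)
T_a, P_a = pca(X_a)   # submodel for the time effect
```

Each effect matrix gets its own PCA submodel; whatever the retained components of the submodels do not reproduce ends up in E.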
18.
Results
[Figure: scores of the Xα, Xαβ and Xαβγ submodels versus Time (Hours: 6, 24, 48); groups: control, vehicle, low, medium, high; 40 %]
19.
Results biomarkers
[Figure: loadings of the α, αβ and αβγ submodels versus chemical shift (ppm), with labelled peaks]
• Unique to the α submodel
• Differences between submodels
• Interesting for Biology
• Interesting for Statistics / Diagnostics
20.
Multivariate Metabolomics Data analysis
• Explorative
– Find groups, clusters structure / outliers in metabolites and in samples
• Supervised
– Discriminate two or more groups to make predictive model and to find biomarkers.
– Method comparison
• Biological Interpretation
– Metabolite set enrichment
– Pathway analysis
– Metabolic network inference
• Special topics
– Between metabolite ratios
– Metabolomics Data Fusion
21.
Nontargeted SELDI measurements of serum samples from 20 Gaucher patients and 20 healthy controls. Gaucher disease is a genetic disease in which a fatty substance (lipid) accumulates in cells and certain organs.
22.
• Human urine and porcine cerebrospinal fluid samples spiked with a range of peptides.
• Variation in number of samples, within-group and between-group variation
24.
Feature selection methods: RESULTS
• Complex nontargeted Gaucher profiling data with highly variable background and varying difference between case and control: multivariate methods perform best.
• Spiked LCMS targeted data with less variation in effect size: univariate and semi-univariate methods are best in selecting biomarkers.
26.
Biomarkers:
A: Univariate
B: Multivariate
C: Change in group correlation
27.
Between metabolite ratios (BMR) of a green tea intervention study
186 human subjects with abdominal obesity
Validation shows significant changes in BMR between placebo and green tea treatment, together with the most important triacylglycerols, TG28-29 and TG41-42.
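Between metabolite ratios can be computed for every pair of metabolites before a group comparison is made. The sketch below is a minimal illustration working on the log scale, where a ratio becomes a difference; the function name `log_ratios`, the metabolite names and the concentrations are hypothetical, not data from the green tea study.

```python
import numpy as np
from itertools import combinations

def log_ratios(X, names):
    """All pairwise log ratios ln(x_i / x_j), i < j, for each sample (row) of X."""
    logX = np.log(np.asarray(X, dtype=float))
    pairs = list(combinations(range(len(names)), 2))
    R = np.column_stack([logX[:, i] - logX[:, j] for i, j in pairs])
    labels = [f"{names[i]}/{names[j]}" for i, j in pairs]
    return R, labels

# Hypothetical concentrations: 2 samples (rows) x 3 metabolites (columns).
names = ["m1", "m2", "m3"]
X = np.array([[1.0, 2.0, 4.0],
              [2.0, 4.0, 4.0]])
R, labels = log_ratios(X, names)
```

In this toy example m1 and m2 both double from the first to the second sample, so neither would look stable on its own, while their ratio m1/m2 is unchanged. A ratio can therefore carry information that the individual metabolites do not.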
33.
Special topic: Metabolic networks
Biochemical Network vs Association Network
Figure 7: Marginal correlation network for a set of metabolites in tomato. Volatiles in red, derivatized metabolites in yellow. Solid lines represent positive correlations, dashed lines negative ones. Thickness of line corresponds to magnitude of ...
Margriet M.W.B. Hendriks, Data-processing strategies for metabolomics studies, Trends in Analytical Chemistry, 2011
34.
Metabolomics, 2005
Data from potato tubers
• Metabolic neighbors
• Do not participate in common reactions
• High correlation due to e.g. chemical equilibrium, mass conservation, ...
"a systematic relationship between observed correlation networks and the underlying biochemical pathways."
Ralf Steuer: Observing and interpreting correlations in metabolomic networks, Bioinformatics, 2003
35.
Metabolic Network Inference
Search for the link between metabolome data and underlying metabolic networks.
As an example: can we distinguish healthy from diseased networks?
[Figure: example networks with nodes Glucose and A to G, HEALTHY versus DISEASE]
36.
From data to network
Goal:
• NETWORK TOPOLOGY
• DIRECTIONS
Problems:
• NOISE
• MISSING METABOLITES
• HUGE AMOUNT OF POSSIBLE NETWORK STRUCTURES
37.
Inference from static data
1. DATA COLLECTION
A. Enzymatic Variability
B. Intrinsic Variability
C. Environmental Variability
2. SIMILARITY SCORE CALCULATION (all possible pairwise interactions)
2a. Relevance Networks
– Pearson Correlation (PC) (linear)
– Mutual Information (MI) (non-linear)
2b. Conditioned Networks
– Partial Pearson Correlation (PPC) (linear)
– Conditional Mutual Information (CMI) (non-linear)
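The difference between a relevance network (Pearson correlation) and a conditioned network (partial Pearson correlation) can be illustrated with a small simulation. This is a minimal sketch, assuming full-order partial correlations computed from the inverse covariance matrix, which needs more samples than metabolites; the three-metabolite chain A, B, C and all names are hypothetical.

```python
import numpy as np

def pearson(X):
    """Pearson correlation between the columns (metabolites) of X."""
    return np.corrcoef(X, rowvar=False)

def partial_corr(X):
    """Partial correlation of each pair of columns given all other columns."""
    prec = np.linalg.inv(np.cov(X, rowvar=False))   # precision matrix
    d = np.sqrt(np.diag(prec))
    pc = -prec / np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc

# Hypothetical chain A -> B -> C: A and C are linked only through B.
rng = np.random.default_rng(1)
n = 5000
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = b + 0.5 * rng.normal(size=n)
X = np.column_stack([a, b, c])

r = pearson(X)        # relevance network: A and C appear strongly linked
p = partial_corr(X)   # conditioned network: the A-C link nearly vanishes
```

Thresholding `r` would draw an edge between A and C, while thresholding `p` keeps only the direct links A-B and B-C, which is why conditioning helps to separate direct from indirect associations.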
40.
Metabolomics data fusion
• Account for between-block difference in quality of measurements to improve data fusion
• For example, multi-platform data fusion, with differences in quantification, (non) targeted, error structure
[Figure: Amino acids block and Lipids block combined into Fused data]
• How to quantify the quality of measurements with many metabolites, and many samples?
41.
Error model for 1 metabolite
QC sample -> RSD
[Figure: Standard Deviation (St.D) versus Mean Intensity I]
• Error models:
– RSD using 1 QC sample
– 2-component using study samples
• Good error description:
– sufficient number of samples
– large intensity range of study samples
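Both error models can be sketched in a few lines. This is a minimal illustration, assuming that the 2-component model has an additive and a multiplicative part, sd^2 = a^2 + (m * I)^2, which is one common choice and is not confirmed by the slide; the function names and the numbers are hypothetical.

```python
import numpy as np

def rsd(qc_replicates):
    """Relative standard deviation of repeated measurements of one QC sample."""
    q = np.asarray(qc_replicates, dtype=float)
    return q.std(ddof=1) / q.mean()

def fit_two_component(mean_intensity, sd):
    """Least-squares fit of sd^2 = a^2 + (m * I)^2 (assumed model form).

    Returns (a, m): the additive and the multiplicative error component.
    """
    I = np.asarray(mean_intensity, dtype=float)
    s = np.asarray(sd, dtype=float)
    A = np.column_stack([np.ones_like(I), I ** 2])
    coef, *_ = np.linalg.lstsq(A, s ** 2, rcond=None)
    add_var, mult_var = np.clip(coef, 0.0, None)   # variances cannot be negative
    return np.sqrt(add_var), np.sqrt(mult_var)

# Hypothetical QC replicates of one metabolite.
qc_rsd = rsd([9.0, 10.0, 11.0])

# Hypothetical study samples covering a wide intensity range.
I = np.array([1.0, 10.0, 100.0, 1000.0])
sd = np.sqrt(2.0 ** 2 + (0.1 * I) ** 2)
a_hat, m_hat = fit_two_component(I, sd)
```

The fit only separates the two components when the samples cover a wide intensity range, which matches the requirement listed on the slide.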
42.
Figure of merit for data from 1 platform
• Median: F-50 = 0.1
• 90th-percentile: F-90 = 0.35
[Figure: St.D versus I for Var. 15, Var. 365, Var. 118 and Var. 213; histogram of number of peaks with F-50 and F-90 marked]
(Van Batenburg et al. Analytical Chemistry, 2011)
43.
Two-step data fusion
GC/MS: J1 = 82 peaks
LC/MS: J2 = 49 peaks
• Step 1: Compute figures of merit for each platform
44.
Two-step data fusion: MB-MLPCA
• Step 2: Multi-block PCA with weighting by figures of merit
[Figure: blocks X1 (Amino acids) and X2 (Lipids) with fused error covariance, error variance $\hat{\sigma}_{js}^2$]
• Method needs good estimation of error variance by
– Repeats
– QC samples
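The idea of weighting blocks by their measurement quality can be illustrated with a much simpler stand-in than MB-MLPCA itself. The sketch below assumes one error standard deviation per variable and uncorrelated errors; under that assumption, scaling each variable by its error standard deviation before an ordinary PCA plays the role of the weighting. The function name, the number of samples and the error values are hypothetical; only the block sizes (82 and 49 peaks) come from the slides.

```python
import numpy as np

def fuse_blocks(blocks, error_sds, n_comp=2):
    """Error-weighted fusion sketch: scale by error sd, concatenate, PCA.

    blocks    : list of (samples x variables) arrays, one per platform
    error_sds : list of 1-D arrays with the error sd of every variable
    """
    scaled = []
    for X, sd in zip(blocks, error_sds):
        X = np.asarray(X, dtype=float)
        # Column-center, then down-weight variables with a large error sd.
        scaled.append((X - X.mean(axis=0)) / np.asarray(sd, dtype=float))
    Z = np.hstack(scaled)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_comp] * s[:n_comp]
    loadings = Vt[:n_comp].T
    return scores, loadings

# Hypothetical data: 20 samples on two platforms with 82 and 49 peaks.
rng = np.random.default_rng(2)
X1 = rng.normal(size=(20, 82))
X2 = rng.normal(size=(20, 49))
sd1 = rng.uniform(0.1, 1.0, size=82)   # assumed error sd per GC/MS peak
sd2 = rng.uniform(0.1, 1.0, size=49)   # assumed error sd per LC/MS peak
scores, loadings = fuse_blocks([X1, X2], [sd1, sd2])
```

A full maximum likelihood PCA would also handle error variances that differ between samples, which this scaling shortcut does not.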
45.
Realistic simulations using GCMS and LCMS data
• Error variance estimated from duplicates
• True error variance
• Estimating variance from duplicates is problematic.
• Use a mix of QC samples and repeats.