Metabolomics Data Analysis

  1. Metabolomics Data Analysis. Johan A. Westerhuis, Swammerdam Institute for Life Sciences, University of Amsterdam; Business Mathematics and Information, North-West University, Potchefstroom, South Africa. SeqAhead, Barcelona, February 2013.
  2. Metabolomics pipeline: issues for biostatistics
     • Pipeline stages shown: biological question, experimental design, data acquisition, data pre-processing, metabolite identification, statistical data analysis, biological interpretation
     • Design issues: power analysis, treatment design, measurement design
     • Processing and identification issues: normalisation, quantification, QC strategy, spectral matching, de novo identification
     • Analysis and interpretation issues: explorative, predictive, hypothetical biomarkers, network inference, MSEA, pathway analysis
  3. Data Analysis special issue, Metabolomics
     • Data preprocessing methods (make samples more comparable)
     • How to treat non-detects
     • Variable importance in multivariate models
     • Metabolic network analysis
     • Data fusion methods
     • Individual responses
     • Between-metabolite ratios
     Guest editors: Jeroen J. Jansen, Johan A. Westerhuis
  4. Multivariate metabolomics data
     • Nontargeted profiling: technical correlations, biological correlations
     • Targeted analysis: biological correlations
     [Example data table for targeted analysis with columns hipp, fum, urea, allant, TMAO, citrat; values as extracted: 1 67 45 6 3 31 10 44 32 10 3 1 8 7 13 43 24 12 4 33 23 0 0 99 76 5 2 12 6 15 2]
  5. Multivariate metabolomics data analysis
     • Explorative
       – Find groups, cluster structure and outliers in metabolites and in samples
     • Supervised
       – Discriminate two or more groups to make a predictive model and to find biomarkers
     • Biological interpretation
       – Metabolite set enrichment, pathway analysis
       – Metabolic network inference
     • Special topics
       – Between-metabolite ratios
       – Metabolomics data fusion
  6. Metabolomics data preprocessing
     • Optimize the biological content of the data
     • Correct for incorrect sampling, sample workup issues and batch effects
     • What is the noise level in the data? Generalized log transform, variance stabilization
     • Are high peaks more important than low peaks?
     • Multivariate methods love large values!
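The generalized log transform mentioned on this slide can be sketched as follows. This is a minimal illustration, assuming the common form ln((x + sqrt(x^2 + lambda)) / 2); the slide does not state which variant or which lambda is meant, so the function name `glog`, the parameter `lam` and the example intensities are all assumptions.

```python
import math

def glog(x, lam=1.0):
    """Generalized log transform: ln((x + sqrt(x^2 + lam)) / 2).

    Behaves like ln(x) for large x but stays finite near zero,
    which damps the dominance of very high peaks.
    """
    return math.log((x + math.sqrt(x * x + lam)) / 2.0)

# Hypothetical peak intensities spanning several orders of magnitude.
intensities = [0.0, 0.5, 5.0, 50.0, 5000.0]
transformed = [glog(v, lam=1.0) for v in intensities]

for raw, t in zip(intensities, transformed):
    print(f"{raw:>8.1f} -> {t:7.3f}")
```

In practice lambda would be tuned to the noise level of the data (for example from replicate or QC measurements); the value 1.0 here is only a placeholder.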
  7. Metabolic changes during E. coli culture growth using k-means clustering. [Figure axes: time, metabolites.]
     (A) Growth curve (optical density) of unperturbed E. coli culture. Numbers of the respective sampling time points are marked on the curve. Time point 0 minutes marks the application of the respective stress condition.
     (B) Relative changes of metabolite pools normalized to time point 1. Fold change is presented on a log10 scale. To reveal the main trends of metabolic changes, 10 k-means clusters are color coded.
     Szymanski, Jedrzej et al. PLoS ONE (2009), vol. 4, issue 10.
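As a rough illustration of how metabolite time profiles can be grouped with k-means, here is a plain-Python sketch of Lloyd's algorithm. It is not the implementation used in the cited study; the function name `kmeans`, the random initialisation and the toy log10 fold-change profiles are assumptions made for the example.

```python
import random

def kmeans(profiles, k, n_iter=50, seed=0):
    """Cluster equal-length profiles with Lloyd's algorithm (squared Euclidean distance)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(profiles, k)]
    prev_labels = None
    for _ in range(n_iter):
        # Assignment step: each profile goes to its nearest center.
        labels = []
        for p in profiles:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels.append(dists.index(min(dists)))
        if labels == prev_labels:
            break  # assignments stopped changing
        prev_labels = labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical log10 fold changes over three time points: three rising, two falling metabolites.
profiles = [
    [0.0, 0.5, 1.0],
    [0.0, 0.6, 1.1],
    [0.1, 0.5, 0.9],
    [0.0, -0.5, -1.0],
    [0.0, -0.4, -0.9],
]
labels, centers = kmeans(profiles, k=2)
print(labels)
```

The number of clusters (10 on the slide, 2 here) has to be chosen by the analyst, and k-means can end in different solutions for different starting centers, so several restarts are usually advisable.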
  8. Self-organising map of metabolites in serum
     • 1H NMR spectra of 613 patients with type I diabetes and a diverse spread of complications
     • Nonlinear mapping method for a large number of samples
     • Relate position on the map to diagnostic responses
     • Can be made supervised
     "1H NMR metabonomics approach to the disease continuum of diabetic complications and premature death", VP Mäkinen et al., Molecular Systems Biology 4:167, 2008
  9. Multivariate metabolomics data analysis
     • Explorative
       – Find groups, cluster structure and outliers in metabolites and in samples
     • Supervised (differentially expressed)
       – Discriminate two or more groups to make a predictive model and to find biomarkers
     • Biological interpretation
       – Metabolite set enrichment, pathway analysis
       – Metabolic network inference
     • Special topics
       – Between-metabolite ratios
       – Metabolomics data fusion
  10. Supervised metabolomics data analysis: case-control (PLSDA)
     • Is there really a difference between the groups? Statistical validation issues
     • Which are the most important peaks for discrimination? Variable importance
     [Figure: class vector Y coded 0/1 for men and women; score plot of PC1 against PC2; PLS regression coefficients b against chemical shift (ppm).]
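The validation question on this slide ("is there really a difference between the groups?") is often approached with a permutation test. The sketch below shows only the permutation logic: the statistic is a simple difference of class means on one variable, used as a stand-in, whereas in a PLSDA setting one would refit the model and recompute a cross-validated quality measure for every permutation. All names and data here are hypothetical.

```python
import random
from statistics import mean

def group_separation(values, labels):
    """Stand-in statistic: absolute difference between the two class means."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    return abs(mean(a) - mean(b))

def permutation_p_value(values, labels, statistic, n_perm=999, seed=1):
    """Fraction of label permutations whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between samples and class labels
        if statistic(values, shuffled) >= observed - 1e-12:
            hits += 1
    # The +1 counts the observed labelling itself as one permutation.
    return (hits + 1) / (n_perm + 1)

# Hypothetical peak intensities for five controls (0) and five cases (1).
values = [1.0, 2.0, 3.0, 4.0, 5.0, 11.0, 12.0, 13.0, 14.0, 15.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p = permutation_p_value(values, labels, group_separation)
print(f"permutation p-value: {p:.3f}")
```

Passing the statistic as a function keeps the permutation loop independent of the model, so the same loop could wrap a classifier's cross-validated error instead.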
  11. • Explain the Psyhogios example using examples from the paper and MetaboAnalyst examples. Proton NMR spectra of the urine samples were obtained on a 500 MHz 1H NMR machine.
  12. NMR spectra of urine samples
  13. [Figure with two panels: Nonsupervised, Supervised.]
  14. Experimental design example
     • Experiment: rats are given bromobenzene (BB), which affects the liver
     • Measurements: NMR spectroscopy of urine
     • Experimental design:
       – Time: 6, 24 and 48 hours
       – Groups: 3 doses of BB, vehicle group, control group
       – Animals: 3 rats per dose per time point
     [Figure: urine NMR spectra at 6, 24 and 48 hours with annotated peaks; x-axis: chemical shift (ppm).]
  15. Different contributions. [Figure: panels for the experimental design, time, dose, animal and the resulting trajectories; metabolite concentration plotted against time.]
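  16. ANOVA decomposition of each variable: $x_{hki_{hk}} = \mu + \alpha_k + (\alpha\beta)_{hk} + (\alpha\beta\gamma)_{hki_{hk}}$, with $k$ the time point, $h$ the dose group and $i_{hk}$ the individual animal within dose group $h$ at time $k$. In matrix form: $X = \mathbf{1}m^{T} + X_{\alpha} + X_{\alpha\beta} + X_{\alpha\beta\gamma}$. [Figure: the four contributions plotted against time.]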
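The decomposition of a single variable into grand mean, time effect, dose-within-time effect and animal deviation can be sketched as below. This is a minimal illustration assuming a balanced design and the nested model of the slide (no separate dose main effect); the function name `anova_decompose`, the data layout and the numbers are hypothetical.

```python
from statistics import mean

def anova_decompose(data):
    """Split one variable into mu + alpha_k + (alpha beta)_hk + (alpha beta gamma)_hki.

    data maps (dose, time) -> list of measurements, one per animal.
    A balanced design (same number of animals per cell) is assumed.
    """
    mu = mean([x for cell in data.values() for x in cell])
    times = sorted({k for (_, k) in data})
    # alpha_k: mean over all doses and animals at time k, relative to the grand mean.
    alpha = {
        k: mean([x for (h, kk), cell in data.items() if kk == k for x in cell]) - mu
        for k in times
    }
    # (alpha beta)_hk: cell mean relative to the mean of its time point.
    alpha_beta = {(h, k): mean(cell) - mu - alpha[k] for (h, k), cell in data.items()}
    # (alpha beta gamma)_hki: each animal relative to its own cell mean.
    alpha_beta_gamma = {
        (h, k): [x - mean(cell) for x in cell] for (h, k), cell in data.items()
    }
    return mu, alpha, alpha_beta, alpha_beta_gamma

# Hypothetical intensities of one NMR peak: two dose groups, two time points, two rats each.
data = {
    ("control", 6): [1.0, 1.2],
    ("high", 6): [1.4, 1.6],
    ("control", 24): [1.1, 1.3],
    ("high", 24): [2.4, 2.6],
}
mu, alpha, alpha_beta, alpha_beta_gamma = anova_decompose(data)
print(mu, alpha)
```

Applying the same split to every variable gives the effect matrices, whose rows repeat the estimated effect for each sample.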
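  17. ANOVA and PCA → ASCA. The ANOVA step gives $X = \mathbf{1}m^{T} + X_{\alpha} + X_{\alpha\beta} + X_{\alpha\beta\gamma}$. Each effect matrix is then modelled by PCA: $X = \mathbf{1}m^{T} + T_{\alpha}P_{\alpha}^{T} + T_{\alpha\beta}P_{\alpha\beta}^{T} + T_{\alpha\beta\gamma}P_{\alpha\beta\gamma}^{T} + E$, where $E$ contains the parts of the data not explained by the component models.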
  18. Results. [Figure: αβ-scores against time (6, 24, 48 hours) for the control, vehicle, low, medium and high dose groups; submodels $X_{\alpha}$, $X_{\alpha\beta}$ and $X_{\alpha\beta\gamma}$ indicated; annotation: 40 %.]
  19. Results → biomarkers. [Figure: loadings of the α, αβ and αβγ submodels against chemical shift (ppm), with annotated peak positions. Annotations: "Unique to the α submodel", "Differences between submodels", "Interesting for Biology", "Interesting for Statistics / Diagnostics".]
  20. Multivariate metabolomics data analysis
     • Explorative
       – Find groups, cluster structure and outliers in metabolites and in samples
     • Supervised
       – Discriminate two or more groups to make a predictive model and to find biomarkers
       – Method comparison
     • Biological interpretation
       – Metabolite set enrichment
       – Pathway analysis
       – Metabolic network inference
     • Special topics
       – Between-metabolite ratios
       – Metabolomics data fusion
  21. Nontargeted: SELDI measurements of serum samples of 20 Gaucher patients and 20 healthy controls. Gaucher is a genetic disease in which a fatty substance (lipid) accumulates in cells and certain organs.
  22. • Human urine and porcine cerebrospinal fluid samples spiked with a range of peptides.
     • Variation in the number of samples and in the within-group and between-group variation.
  23. [Figure with two panels: Gaucher, Spiked.]
  24. Feature selection methods: results
     • Complex nontargeted Gaucher profiling data with highly variable background and varying difference between case and control: multivariate methods perform best.
     • Spiked LCMS targeted data with less variation in effect size: univariate and semi-univariate methods are best at selecting biomarkers.
  26. Biomarkers:
     A: Univariate
     B: Multivariate
     C: Change in group correlation
  27. Between-metabolite ratios (BMR) of a green tea intervention study: 186 human subjects with abdominal obesity. Validation shows significant changes in BMR between placebo and green tea treatment, together with the most important triacylglycerols, TG28-29 and TG41-42.
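To make the idea of between-metabolite ratios concrete, the sketch below computes the log2 ratio for every metabolite pair within one sample. It assumes a BMR is simply the ratio of two metabolite levels measured in the same sample; the log transform, the function name `log_ratios` and the metabolite names and values are assumptions for illustration, not details from the study.

```python
import math
from itertools import combinations

def log_ratios(sample):
    """Return the log2 ratio for every metabolite pair in one sample.

    sample maps metabolite name -> strictly positive concentration.
    """
    return {
        (a, b): math.log2(sample[a] / sample[b])
        for a, b in combinations(sorted(sample), 2)
    }

# Hypothetical concentrations of three metabolites in a single sample.
sample = {"m1": 8.0, "m2": 2.0, "m3": 4.0}
ratios = log_ratios(sample)

for pair, value in ratios.items():
    print(pair, round(value, 3))
```

With p metabolites this produces p(p-1)/2 ratios per sample, so the number of candidate biomarkers grows quadratically and the validation step mentioned on the slide becomes all the more important.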
  29. Plasma
  30. Differences in blood metabolites due to aging
  31. Aging biomarker metabolites in liver
  33. Special topic: metabolic networks. Biochemical network vs association network.
     "Figure 7: Marginal correlation network for a set of metabolites in tomato. Volatiles in red, derivatized metabolites in yellow. Solid lines represent positive correlations, dashed lines negative ones. Thickness of line corresponds to magnitude of ..."
     Margriet M.W.B. Hendriks, Data-processing strategies for metabolomics studies, Trends in Analytical Chemistry, 20212
  34. Metabolomics, 2005: data from potato tubers
     • Metabolic neighbors
     • Do not participate in common reactions
     • High correlation due to e.g. chemical equilibrium, mass conservation, ...
     "a systematic relationship between observed correlation networks and the underlying biochemical pathways."
     Ralf Steuer: Observing and interpreting correlations in metabolomic networks, Bioinformatics, 2003
  35. Metabolic network inference. Search for the link between metabolome data and the underlying metabolic networks. As an example: can we distinguish healthy from diseased networks? [Figure: candidate network structures over metabolites A to F; a healthy and a disease network linking glucose and metabolites A to G.]
  36. From data to network
     • Goal: network topology? Directions?
     • Problems: noise, missing metabolites, huge number of possible network structures
  37. Inference from static data
     1. Data collection
        A. Enzymatic variability
        B. Intrinsic variability
        C. Environmental variability
     2. Similarity score calculation for all possible pairwise interactions
        2a. Relevance networks: Pearson correlation (PC, linear); mutual information (MI, non-linear)
        2b. Conditioned networks: partial Pearson correlation (PPC, linear); conditional mutual information (CMI, non-linear)
     [Figure: example data and the networks inferred by each score over metabolites A to F.]
  38. Estimation of correlation networks
     • Four cases: 1. ASPP, 2. ASA, 3. HS, 4. HSP, each compared with the real pathway
     • Sources of variability: Vmax variability, intrinsic variability, environmental variability
     • Scores compared per case: PC, MI, PPC1, CMI1, PPCn
     • Legend levels: 100%, > 90%, 10% … 90%, < 10%
     • PC: Pearson correlation (linear measure)
     • MI: entropy-based mutual information (non-linear measure)
     • PPC: partial Pearson correlation (linear conditioning measure)
     • CMI: conditional mutual information (non-linear conditioning measure)
     Cakir, Metabolomics 2009
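The two linear similarity scores named in the network inference slides, Pearson correlation and first-order partial Pearson correlation, can be sketched in a few lines. The example data are constructed (not taken from the cited work) so that two metabolites x and y correlate only because both depend on a third metabolite z; conditioning on z then removes the link.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_pearson(x, y, z):
    """First-order partial correlation of x and y, conditioned on z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical metabolite z plus two mutually uncorrelated noise patterns.
z = [-1, -1, -1, -1, 1, 1, 1, 1]
e1 = [-1, -1, 1, 1, -1, -1, 1, 1]
e2 = [-1, 1, -1, 1, -1, 1, -1, 1]
x = [zi + 0.5 * a for zi, a in zip(z, e1)]  # x depends on z only
y = [zi + 0.5 * b for zi, b in zip(z, e2)]  # y depends on z only

r_xy = pearson(x, y)
r_xy_given_z = partial_pearson(x, y, z)
print(round(r_xy, 3), round(r_xy_given_z, 3))
```

A relevance network would draw an edge between x and y from the marginal correlation, while a conditioned network would drop it; higher-order partial correlations (conditioning on several metabolites) follow the same idea but need a matrix inverse.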
  40. Metabolomics data fusion
     • Account for between-block differences in the quality of measurements to improve data fusion
     • For example, multi-platform data fusion, with differences in quantification, (non)targeted, error structure
     • How to quantify the quality of measurements with many metabolites and many samples?
     [Figure: amino acids block and lipids block combined into fused data.]
  41. Error model for 1 metabolite
     • QC sample → RSD
     • Error models:
       – RSD using 1 QC sample
       – 2-component using study samples
     • Good error description requires:
       – a sufficient number of samples
       – a large intensity range of the study samples
     [Figure: standard deviation (St.D) against mean intensity I.]
  42. Figure of merit for data from 1 platform
     • Median: F-50 = 0.1
     • 90th percentile: F-90 = 0.35
     [Figure: histogram of the number of peaks against St.D with F-50 and F-90 marked; example variables Var. 15, Var. 365, Var. 118, Var. 213.]
     (Van Batenburg et al., Analytical Chemistry, 2011)
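The route from repeated QC measurements to platform-level summary numbers can be sketched as follows. The sketch assumes that the per-variable error is expressed as a relative standard deviation (RSD) of the QC replicates and that the platform summaries are the median and the 90th percentile of those values; the exact definition used in the cited paper is not given on the slides, and all names and numbers below are hypothetical.

```python
from statistics import mean, median, quantiles, stdev

def rsd(values):
    """Relative standard deviation of repeated measurements of one variable."""
    return stdev(values) / mean(values)

# Hypothetical repeated QC-sample intensities for five peaks on one platform.
qc_measurements = {
    "peak_1": [100.0, 102.0, 98.0, 101.0],
    "peak_2": [50.0, 55.0, 45.0, 52.0],
    "peak_3": [10.0, 14.0, 7.0, 11.0],
    "peak_4": [200.0, 198.0, 203.0, 201.0],
    "peak_5": [5.0, 8.0, 3.0, 6.0],
}

rsds = {name: rsd(vals) for name, vals in qc_measurements.items()}

# Platform-level summaries: median and 90th percentile of the per-peak RSDs.
f50 = median(rsds.values())
f90 = quantiles(list(rsds.values()), n=10)[-1]  # last of the nine decile cut points

print(f"F-50 = {f50:.3f}, F-90 = {f90:.3f}")
```

Summaries like these give one or two numbers per platform, which is what makes them usable as block weights in a subsequent fusion step.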
  43. Two-step data fusion
     • Blocks: GC/MS with J1 = 82 peaks, LC/MS with J2 = 49 peaks
     • Step 1: compute figures of merit for each platform
  44. Two-step data fusion: MB-MLPCA
     • Step 2: multi-block PCA with weighting by figures of merit
     • Fused error covariance for the blocks X1 (amino acids) and X2 (lipids), built from the estimated error variances $\hat{\sigma}_{js}^{2}$
     • The method needs a good estimate of the error variance, obtained from:
       – repeats
       – QC samples
  45. Realistic simulations using GCMS and LCMS data
     • Error variance estimated from duplicates
     • True error variance
     • Estimating variance from duplicates is problematic.
     • Use a mix of QC samples and repeats.