Metabolomics: data acquisition, pre-processing and quality control

  1. Metabolomics: data acquisition, preprocessing & quality control
     Theo Reijmers, Analytical BioSciences, Leiden University
     Barcelona, 14-02-2013
     [Figure labels: coenzymes (vitamins), amino acids, carbohydrates, hormones, nucleotides, lipids]
  2. The metabolome
     • Metabolites: chemical compounds with low molecular weight (mass < 1500 Da)
     • Many chemical classes, with different chemical properties (different from proteomics): polarity log P –6 to 14
     • Large differences in abundance: concentration dynamic range 10^9
     [Figure: the metabolome along the axes concentration (dynamic range 10^9), polarity (log P –6 to 14) and mass (< 1500 Da); labels: global screen, NMR, LC-MS, custom, targeted]
  3. Analytical strategies: 1H NMR
     Advantages
     • Straightforward sample preparation
     • High sample throughput (robotic control)
     • Chemical shifts stable (if pH kept constant)
     • Quantification without standards
     • Highly repeatable and reproducible
     • Very valuable for identification of isolated metabolites
     Disadvantages
     • Limited sensitivity
     • Identification in complex mixtures rather difficult

     Analytical strategies: LC-MS and GC-MS
     • Chromatography: separation of compounds in sample
     • Mass spectrometry: detection of ions based on mass-to-charge ratio (m/z)
  4. Chromatography
     Separation of chemical compounds based on chemical properties. [Figure: chromatogram]
     Types of interaction:
     A. Surface adsorption
     B. Solvent partitioning
     C. Ion exchange

     Mass spectrometer
     • Separation of charged particles in the gas phase
     • Separation based on mass-to-charge ratio (m/z)
     [Figure labels: ionisation, mass analyser, mass analyser, detector]
  5. LC-MS vs GC-MS
     Liquid C-MS
     Advantages:
     • Fast
     • Efficient
     • Sensitive
     • Wide range of compounds
     Disadvantages:
     • Unstable*
     • Sensitivity compound dependent
     • Ion suppression gives rubbish data
     • Relative quantification (if no authentic standard is available)
     *About as stable as a chocolate teapot in a heatwave. (Wilson 2009)

     Gas C-MS
     Advantages:
     • Highly reproducible retention times
     • Sensitive detection for all metabolites
     • Characteristic mass fingerprint (identification!)
     Disadvantages:
     • Derivatization is needed to include polar analytes

     Demonstration & Competence Lab
     • Applying technology developed in core in associate projects with industry, academia, clinics, knowledge institutes
     • Validation and implementation of metabolomics platforms
     • QA/QC system/error model per metabolite
     • Clinical & preclinical studies (projects with partners)
     • > 15 000 samples/year
     • > 2000 metabolites
     • Identification pipeline
     • Training & hands-on workshops
  6. Platforms
     • Lipid analysis by LC-MS (ca. 300 individual compounds)
     • Amine analysis by LC-MS/MS (ca. 120 compounds)
     • Oxylipin analysis (ca. 140 compounds)
     • Global profiling by RP-LC-MS (ca. 450 compounds identified)
     • Global profiling by GC-MS (ca. 150 compounds)
     • Global profiling by CE-MS (ca. 300 compounds)
     • And more under development

     Large metabolomics measurement series, DCL
     • IOP biomarkers for healthy aging
       – ±2500 samples, 28 batches
       – Measurement time ±28 weeks
       – Matching project LUMC and NCHA (Netherlands Centre for Healthy Aging)
     • Dutch Twin Register (Nederlands Tweeling Register, NTR)
       – ±3000 samples, 31 batches
       – Measurement time ±30 weeks
     • DiOGenes (Diet, Obesity and Genes)
       – ±2000 samples, 27 batches
       – Measurement time ±14 weeks
       – NMC associate project, N & H cluster
  7. Measurement design
     • Randomization, replication & blocking of measurements
     • Inclusion of compounds & samples to monitor (& eventually correct for) quality
       – Internal standards
       – Calibration samples
       – Quality control (QC) samples
       – Replicate samples (technical & analytical)
       – Blanks
       – System suitability samples
       – Transfer samples

     Typical sample sequence list
     [Table: injection sequence of 182 runs, all in batch 5. Columns: Order, Name, Id, Level, Batch, Preparation, Injection, isSample, isSST, isQC, isdQC, isBlank, isCal, isOutlier, isSuspect, Comment. First rows:]

     Order  Name      Id               Level  Batch  Preparation  Injection
     1      Blank     Blank            0      5      1            1
     2      Blank     Blank            0      5      1            1
     3      Blank     Blank            0      5      1            1
     4      Blank     Blank            0      5      1            1
     5      dSST.C2   dSST.C2          2      5      1            1
     6      SST.C2    SST.C2           2      5      1            1
     7      dQC       dQC              4      5      1            1
     8      QC        QC               4      5      1            1
     9      P5.C6.a   C6               6      5      1            1
     10     P5.C7.a   C7               7      5      1            1
     11     P5.C0.a   C0               0      5      1            1
     12     P5.C1.a   C1               1      5      1            1
     13     P5.C4.a   C4               4      5      1            1
     14     P5.C5.a   C5               5      5      1            1
     15     P5.C2.a   C2               2      5      1            1
     16     P5.C3.a   C3               3      5      1            1
     17     P5.C1     0543_090.3.01.0  4      5      1            1
     18     P5.D1     0546_094.3.01.0  4      5      1            1
     19     P5.E1     0550_076.3.01.0  4      5      1            1
     20     QC        QC               4      5      1            1
     21     Blank     Blank            0      5      1            1

     [Rows 22–182 continue the same pattern.]

     Annotations on the table:
     • Technical samples: system cleaning, testing and equilibrating (runs 1–8)
     • Calibration blocks at regular intervals (levels C0–C7 in runs 9–16, 88–95 and 170–177)
     • Running samples (study samples such as P5.C1 … P5.F12; names prefixed "b", e.g. P5.bE1, repeat an earlier sample Id with Preparation = 2)
     • QC-blank-(dummy) QC sequence at regular intervals (e.g. runs 20–22, 33–35, 46–48)
     • Possible outliers are flagged and, if confirmed, ignored (comments: run 25 "There might be something wrong here"; run 27 "Something wrong here"; run 148 "Only integrated for TGs")
  8. Data acquisition, LC-MS & GC-MS
     For one chemical compound, the pattern is approximately the multiplication of a component-specific mass profile and the abundance at a certain retention time.
     [Figure: intensity vs m/z; intensity vs retention time]
     Component-specific mass profile:
     • LC-MS: natural isotopes + adducts (soft ionization)
     • GC-MS: fragments (hard ionization)
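The "multiplication" on this slide can be read as an outer product of an elution profile and a mass profile. The following is a minimal sketch under that reading; the mass profile, the Gaussian peak shape and all numbers are hypothetical and not taken from the slides.

```python
import numpy as np

# Hypothetical mass profile of one compound over 5 m/z channels
# (e.g. main ion plus isotopes/adducts), normalised to sum to 1.
mass_profile = np.array([0.70, 0.20, 0.07, 0.02, 0.01])

# Hypothetical elution profile: a Gaussian peak over 10 retention-time scans.
scans = np.arange(10)
elution = 100.0 * np.exp(-0.5 * ((scans - 4.0) / 1.2) ** 2)

# Signal of the single compound as a (scans x m/z channels) matrix:
# every scan shows the same mass profile, scaled by the abundance at that scan.
signal = np.outer(elution, mass_profile)
```

A real sample would then be approximately a sum of such rank-one matrices, one per compound, plus noise and background.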
  9. Raw data, LC-MS
     [Figure: number of mass channels selected for processing vs scan number]
     • Huge amount of data, for each sample!
       – ~1000s of mass spectra (retention time scans)
       – ~10 000s of ion chromatograms
       – ~1 000 000s of (m/z – retention time) pairs
     • Complex data
       – Noise (detector noise and chemical noise), spikes, background
       – Concentration differences between the compounds are rather large, and therefore so are the intensity differences
  10. Preprocessing, LC-MS
      • Targeted platforms: vendor preprocessing software
        – Expert knowledge => optimized settings
      • Untargeted platforms: in-house developed preprocessing software
        – Conversion of manufacturer formats to common formats (e.g. ‘netcdf’ & ‘mzxml’)
        – Centroiding and binning
        – Baseline correction
        – Alignment
        – Peak extraction (requires an estimate of the noise level)
        – Matching of peaks over samples
      • Result: feature/peak/compound list
        – m/z & rt: peak area

      Centroiding
      [Figure: raw vs centroided spectrum]
  11. m/z shifts within a sample
      Small m/z shifts, probably due to the centroid sampling mode of the MS spectra and mass fluctuations during recording.

      Binning
      • Binning algorithm: sum intensities within predefined bins (= mass ranges)
      • Definition of bins is a challenge, mostly related to the mass resolution (e.g. for resolution = 10 000, define bin 100.00 – 100.01)
      • When done incorrectly, binning has a large influence on the peak extraction steps
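The binning step described above (summing intensities inside fixed mass ranges) can be sketched as follows. This is an illustration only: the function name `bin_spectrum`, the fixed bin width and the example peaks are assumptions, not the in-house implementation.

```python
import numpy as np

def bin_spectrum(mz, intensity, mz_min, mz_max, bin_width):
    """Sum centroid intensities within equally wide m/z bins."""
    n_bins = int(round((mz_max - mz_min) / bin_width))
    edges = np.linspace(mz_min, mz_max, n_bins + 1)
    # np.histogram with weights sums the intensities that fall in each bin.
    binned, _ = np.histogram(mz, bins=edges, weights=intensity)
    return edges, binned

# Hypothetical centroids around m/z 100 with bins of 0.01 (cf. resolution 10 000)
mz = np.array([100.002, 100.004, 100.013, 100.027])
intensity = np.array([10.0, 5.0, 7.0, 3.0])
edges, binned = bin_spectrum(mz, intensity, 100.00, 100.03, 0.01)
```

With fixed bins, a peak whose centroids drift across a bin edge is split over two bins, which is one way an unlucky bin definition can disturb later peak extraction.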
  12. Background correction
      [Figure: TIC and background-corrected TIC]

      Retention time alignment
      [Figure: chromatograms, full range and detail panel]
  13. Alignment algorithms
      • Dynamic Time Warping (DTW)
        – Time point by time point mapping (dynamic programming)
      • Correlation Optimized Warping (COW)
        – Piecewise linear, segments instead of individual time points (dyn. progr.)
        – Optimization of correlation between the two pieces of each dataset
        – Does not allow large retention time variation (determined by the slack parameter t)
      • (Semi-)Parametric Warping (PTW, Eilers)
        – Global, nonlinear (parametric transfer function estimation)
      [Figure: target dataset and dataset to align]

      Alignment algorithms (continued)
      [Figure: chromatogram detail for each of the three algorithms; panel title: "Warped, detail"]
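As an illustration of the "time point by time point mapping (dynamic programming)" idea, here is a minimal textbook DTW sketch. It uses an absolute-difference local cost and no slope or window constraints; these choices and the toy signals are assumptions, not the settings used in the presentation.

```python
import numpy as np

def dtw_path(target, signal):
    """Return the total DTW cost and the warping path as (target, signal) index pairs."""
    n, m = len(target), len(signal)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(target[i - 1] - signal[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n, m], path

# Toy chromatograms: the same peak, eluting one scan earlier in `shifted`.
target = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 1, 2, 1, 0, 0, 0]
total_cost, path = dtw_path(target, shifted)
```

Because DTW may map every point individually, it is very flexible; COW and parametric warping restrict that freedom, as the list above indicates.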
  14. Peak/feature extraction and peak integration
      • XCMS http://metlin.scripps.edu/xcms/index.php
      • MetAlign http://www.wageningenur.nl/en/show/MetAlign-1.htm
      • TNO-DECO: Jellema et al., Chemom. Intell. Lab. Syst. 104 (2010) 132
      • MZExtract: van der Kloet et al., submitted

      TNO-DECO
      • Works with GC-MS and not too complex LC-MS
      • Decomposes experimental data into the product of pure mass spectra and concentration profiles of all compounds in the sample
      Advantages:
      – Result is a combined mass spectrum (identification!!)
      – All samples analyzed at once
      Problems / issues:
      – Least squares (abundant compounds have large influence on result)
      – Noise level estimation
      – Correct binning essential
      Jellema, Chemom. Intell. Lab. Syst. (2010) 104, 132–139.
  15. Deconvolution
      Deconvolution of LC-MS data
      [Figure panels: baseline corrected data; extracted mass spectra (rt 13.9818, 14.3868, 14.5777 and 14.769); extracted chromatographic profiles; reconstructed signal]
  16. MZExtract
      Per sample:
      • Feature extraction of recalibrated and centroided data (in-house)
      • Integration of features (areas)
      • Grouping of features to feature-sets (enrichment step, knowledge based: isotopes, adducts)
      Over samples:
      • Match feature-sets
      Advantage of two-step approach: fully scalable solution (parallel implementation)
      van der Kloet, submitted.

      Grouping related features within a single sample
      No retention time window necessary to match features (only isotopic patterns or other known relations, e.g. adducts)
  17. Validation
      Target list from MassHunter (Agilent) used to locate 174 known targets.
      – Mass window -> resolution 10 000
      – RT window -> +/- 10 seconds
      – 171 were found
      – 3 missing targets: no isotopic patterns were detected (they were found in the list of ‘single’ features)

      How to validate unknown feature-sets?
      Here: selection based on QC presence
      • About 3200 unknown feature-sets
      • Comparable: 1175 feature-sets
      • Low abundant: 366 feature-sets
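Locating known targets with a mass window and a retention time window can be sketched as below. The slide only states "resolution 10 000" and "+/- 10 seconds"; turning the resolution into a tolerance of m/z divided by resolution, the function name `match_targets` and the example values are assumptions made for illustration.

```python
def match_targets(targets, features, resolution=10_000, rt_window=10.0):
    """Map each target index to the index of the closest-in-time matching feature, or None.

    targets, features: lists of (mz, rt_seconds) tuples.
    """
    matches = {}
    for t_idx, (t_mz, t_rt) in enumerate(targets):
        mz_tol = t_mz / resolution  # assumed mass window derived from the resolution
        candidates = [
            (abs(f_rt - t_rt), f_idx)
            for f_idx, (f_mz, f_rt) in enumerate(features)
            if abs(f_mz - t_mz) <= mz_tol and abs(f_rt - t_rt) <= rt_window
        ]
        matches[t_idx] = min(candidates)[1] if candidates else None
    return matches

# Hypothetical target list and extracted features
targets = [(496.34, 300.0), (760.585, 620.0), (184.07, 50.0)]
features = [(496.345, 304.0), (760.70, 621.0), (496.34, 350.0)]
matches = match_targets(targets, features)
```

In this toy example only the first target is found: the second fails the mass window and the third has no candidate feature at all.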
  18. PLS-DA, selectivity ratio*, to quantify the variables' discriminatory ability
      • The low abundant feature-sets are biologically relevant!
      • The most important feature-set is an unknown!
      *Anal. Chem. 2009, 81, 2581–2590

      Quality assessment
      • Make use of all additional measured compounds and samples
        – Internal standards
        – Replicates
        – Blanks
        – Quality control samples
      • Quality assessment => QC report (in-house)
  19. Part of a measurement run
      [Figure: response vs measurement order for QC samples, study samples and replicate study samples]

      QC report overview table
      RSD values for
      • QC samples
      • Replicate samples (independent validation)
      ANOVA for batch variation (p-value, diffs)

      Compound  N   mean    std     RSDqc   RSDreps  p-value  diffs
      CholE02   58  0.0298  0.0079  26.4%   21.4%    0.000    (2-1,3-1,3-2,4-2,4-3)
      CholE04   46  0.0240  0.0124  51.9%   40.6%
      CholE05   58  0.0120  0.0024  20.4%   19.1%    0.000    (2-1,3-1,4-1,3-2,4-3)
      CholE06   58  0.0085  0.0021  24.7%   19.5%    0.000    (3-1,3-2,4-3)
      DG02      58  0.0049  0.0011  23.4%   22.7%    0.000    (2-1,3-1,4-1,3-2,4-2,4-3)
      LPC01     58  0.0183  0.0009  4.7%    4.8%     0.000    (4-1,4-2,4-3)
      LPC02     58  0.0130  0.0015  11.7%   11.5%    0.000    (2-1,3-1,4-1)
      LPC03     58  0.0101  0.0010  9.5%    12.1%    0.360
      LPC04     58  0.0436  0.0019  4.4%    5.4%     0.000    (2-1,4-1,3-2,4-3)
      LPC05     58  1.8684  0.1259  6.7%    6.8%     0.000    (2-1,3-1,4-1,3-2,4-2,4-3)
      LPC07     58  0.0109  0.0007  6.1%    6.4%     0.004    (4-2)
      LPC08     58  0.6096  0.0141  2.3%    3.2%     0.000    (2-1,3-1,4-1,3-2,4-2,4-3)
      LPC09     58  0.4170  0.0200  4.8%    4.8%     0.000    (3-1,4-1,3-2,4-2,4-3)
      LPC10     58  0.6625  0.0976  14.7%   13.8%    0.000    (2-1,3-1,4-1,3-2,4-2,4-3)
      LPC11     58  0.0394  0.0446  113.1%  57.6%    0.000    (2-1,3-2,4-2,4-3)
      LPC12     58  0.1126  0.0024  2.1%    3.6%     0.000    (2-1,3-1,3-2,4-2,4-3)
      LPC13     58  0.0425  0.0049  11.5%   9.8%     0.000    (3-1,4-1,3-2,4-2)
      LPC14     58  0.0311  0.0010  3.3%    3.7%     0.000    (2-1,3-1,4-2,4-3)
      LPC16     58  0.0064  0.0016  24.9%   28.7%    0.000    (4-1,3-2,4-2,4-3)
      LPC17     58  0.0033  0.0010  32.0%   36.4%    0.000    (3-1,4-1,3-2,4-2,4-3)
      LPE02     58  0.0303  0.0056  18.6%   19.4%    0.000    (2-1,4-1,3-2,4-2,4-3)
      LPE04     43  0.0034  0.0011  33.1%   21.9%
      PC01      58  0.0832  0.0105  12.6%   12.5%    0.000    (4-1,4-2,4-3)
      PC02      58  0.3333  0.0151  4.5%    4.6%     0.000    (2-1,4-1,4-2,4-3)
      PC03      58  0.2238  0.0077  3.4%    3.7%     0.000    (2-1,3-1,4-1,4-2,4-3)
      PC04      58  0.1257  0.0040  3.1%    4.8%     0.000    (3-1,4-1,3-2,4-3)
      PC05      58  0.0674  0.0248  36.8%   35.9%    0.000    (2-1,3-1,4-1,3-2,4-3)
      PC06      58  0.0667  0.0084  12.7%   10.1%    0.000    (2-1,4-1,3-2,4-3)
      PC07      58  0.0225  0.0026  11.5%   14.2%    0.000    (2-1,3-1,4-1,4-2,4-3)
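The RSD columns in the QC report are relative standard deviations. A minimal sketch of that calculation for the QC samples of one metabolite follows; the use of the sample standard deviation (`ddof=1`) and the example responses are assumptions, and how the replicate RSD is pooled in the in-house report is not stated on the slide.

```python
import numpy as np

def rsd(values):
    """Relative standard deviation: sample standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return np.std(values, ddof=1) / np.mean(values)

qc_responses = [90.0, 100.0, 110.0]  # hypothetical QC responses for one metabolite
rsd_qc = rsd(qc_responses)
print(f"RSDqc = {rsd_qc:.1%}")
```

Computed per metabolite over all QC injections, this gives one RSDqc value per row of a report like the one above.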
  20. Uncorrected peak areas
  21. QC samples only
      [Figure labels: ratio, (unc)area; RSD QC = 25.8%]
  22. Internal standard: RSDQC = 25.8%

      Internal standard corrected data: RSDQC = 20.6%
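Internal standard correction is commonly done by taking the ratio of the analyte peak area to the peak area of its internal standard in the same injection. The sketch below assumes that form and uses made-up areas in which analyte and standard share exactly the same sensitivity variation, which is the idealised case.

```python
import numpy as np

def rsd(values):
    """Relative standard deviation (sample standard deviation / mean)."""
    values = np.asarray(values, dtype=float)
    return np.std(values, ddof=1) / np.mean(values)

# Hypothetical QC injections: instrument sensitivity varies between injections.
sensitivity = np.array([1.00, 0.80, 1.20, 0.90, 1.10])
analyte_area = 200.0 * sensitivity   # analyte peak area per injection
istd_area = 50.0 * sensitivity       # internal standard peak area, same injections

ratio = analyte_area / istd_area     # internal standard corrected response

rsd_before = rsd(analyte_area)
rsd_after = rsd(ratio)
```

In this idealised example the ratio removes the variation completely. In real data the standard only tracks part of the variation, consistent with the modest improvement reported on the slide (25.8% to 20.6%).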
  23. Intra and inter batch variation
      • Analytical column ‘aging’
      • Analytical column replacement
      • Eluent ‘refills’ and small variations
      • Instrument malfunction/breakdown
      • Etc.

      Intra and inter batch correction
      • Instead of just monitoring QC sample responses, use them to correct variation
  24. QC correction
      [Figure: response vs measurement order for QC samples and study samples, with a penalized smoother through the QC samples]
      Van der Kloet et al., Journal of Proteome Research 2009

      QC correction
      [Figure: response vs measurement order, before and after correction]
      Van der Kloet et al., Journal of Proteome Research 2009
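The idea of fitting a penalized smoother through the QC responses and using it to correct all samples can be sketched as follows. This is not the published implementation: it uses a Whittaker-type smoother with a second-order difference penalty, assumes equally spaced injections, and rescales to the median QC response; the function name `qc_correct`, the penalty `lam` and the data are hypothetical.

```python
import numpy as np

def qc_correct(response, is_qc, lam=10.0):
    """Divide out a smooth trend estimated from the QC injections only."""
    response = np.asarray(response, dtype=float)
    is_qc = np.asarray(is_qc, dtype=bool)
    n = len(response)
    w = is_qc.astype(float)               # weight 1 for QC, 0 for study samples
    D = np.diff(np.eye(n), n=2, axis=0)   # second-order difference operator
    # Penalized least squares: minimise sum w*(y - z)^2 + lam * |D z|^2
    trend = np.linalg.solve(np.diag(w) + lam * D.T @ D, w * response)
    # Rescale to the median QC response (an assumption, not taken from the slide)
    return response / trend * np.median(response[is_qc])

order = np.arange(20)
is_qc = np.isin(order, [0, 5, 10, 15, 19])
drift = 1.0 + 0.02 * order                       # hypothetical linear sensitivity drift
response = np.where(is_qc, 100.0, 50.0) * drift  # QC at 100, study samples at 50
corrected = qc_correct(response, is_qc)
```

Because the trend is estimated at every injection position, study samples between two QC injections are corrected with an interpolated value; a larger `lam` gives a stiffer trend.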
  25. QC correction
      [Two figure-only slides]
      van der Kloet et al., Journal of Proteome Research 2009
  26. ISTD/QC corrected data, all samples: RSDQC = 4.1%, RSDreplicates = 10.0%
  27. All batches: correction charts
      [Figure: RSDQC and RSDreplicates]
  28. Scores plot based upon 93 lipids: uncorrected area
      [Figure: scores plot based on 93 components (peak area); PC 1 (39.3%) vs PC 2 (14%); batches 1–4 and QC samples]
      • Differences between batches.
      • Clear trends in QC samples.

      Scores plot based upon 93 lipids: ISTD correction
      [Figure: scores plot based on 93 components (ISTD correction); PC 1 (21.3%) vs PC 2 (14.8%); batches 1–4 and QC samples]
      • Smaller differences between batches.
      • Spread in QC samples greatly reduced.
      • However, batch to batch differences remain present.
  29. Scores plot based upon 93 lipids
      [Figure: scores plot based on 93 components, RSDqc < 0.15 and RSDreps < 0.15; PC 1 (22.9%) vs PC 2 (14.7%); batches 1–4 and QC samples]

      Combining data in systems biology
      1. Comprehensive view of patient, animal, …: e.g. combine genomics, proteomics & metabolomics data
         Data integration / fusion: joining data from different measurement approaches, same objects
      2. Increase power of statistical analyses: combine e.g. metabolomics batch datasets
         ‘Equating’*: make comparable data from same measurement approach, different objects
      *Equating is a psychometric term
      [Figure labels: variables, objects]
  30. Why not just concatenate datasets?
      • ‘Omics data are typically batch data
      • Metabolomics is often not quantitative: datasets not comparable
      • Calibration model transfer would be a solution, but often no full calibration models can be made!*
      *Sangster et al., The Analyst 2006 (131): 1075–1078
      [Figure: datasets 1 and 2 (objects × variables)]

      A proposed approach: QC samples
      Correction for structural differences between series using quality control (QC) samples (pooled samples or representative samples)*
      [Picture from the reference below]
      *van der Greef et al., J Proteome Res 2007 (6): 1540–1559
  31. Problem with QC sample approach
      • Rationale: make medians of QC data equal for all series
      • Unwanted side-effect: inflation of variation in rest of data
      [Figure: inflation of MAD in series 2 relative to series 1, per lipid compound; series 1, series 2 uncorrected, series 2 QC-corrected]
      MAD: median absolute deviation (robust SD)

      Alternative solution: equating
      • Combination of data from different measurement series
      • …in studies with limited number of internal standards (typically metabolomics!)
      • …or even from different studies
      • General: enables maximal flexibility in subsequent data analysis on combined datasets
      [Figure: datasets 1 and 2 (objects × variables)]
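The MAD used on this slide as a robust measure of spread, and its ratio between two series, can be computed as in the sketch below. The data are made up, and the doubled scale of the second series is only there to give the ratio a known value; it is not a claim about how the inflation arose in the study.

```python
import numpy as np

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values)))

series_1 = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical, with one outlier
series_2 = 2.0 * series_1                          # same data on a doubled scale

inflation = mad(series_2) / mad(series_1)          # MAD of series 2 relative to series 1
```

The outlier at 100 barely affects the MAD, which is why it is preferred here over the ordinary standard deviation.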
  32. Illustration: LC–MS data
      • 182 (54 + 128) healthy participants (Netherlands Twin Register)*
      • Blood samples (overnight fasting)
      • Plasma analyzed with liquid chromatography–MS method for lipids
      • Target list for 59 lipids: LPC / PC / SPM / ChE / TG
      • Data per lipid corrected for class-specific internal standard
      • Measured in two series: year 1 (Y1), N = 54; year 2 (Y2), N = 128
      *Draisma et al., OMICS 2008: 17–31

      PCA scores before equating
      [Figure: PCA scores of Y1 and Y2; data mean-centered prior to PCA]
  33. Univariate quantile equating
      • Quantiles: values marking boundaries between regular intervals of the cumulative distribution function (CDF)
      • Example: 54 data values and associated CDF
      [Figure: CDF with steps of 1/54; 0.48 quantile, 0.50 quantile (= median), 0.52 quantile]

      Univariate quantile equating
      Average values of corresponding quantiles
      [Figure: at CDF(x) = 0.50, x = 1.81 for CDF Y1 and x = 2.64 for CDF Y2]
      Data from: Frisby & Clatworthy, Perception 1975: 173–178
  34. Quantile equating
      Algorithm:
      1. Number of quantiles = min {N1, N2, …}
      2. Average values of corresponding quantiles by projection onto unit vector (1/√n, …, 1/√n)
      3. Substitute averaged values for original values belonging to each quantile
      Often applied for quantile normalization* of gene arrays, between arrays (objects) over probes (variables)
      *Bolstad et al., Bioinformatics 2003: 185–193

      Example univariate quantile equating
      [Figure: Q-Q plot of Y1 vs Y2 with projection onto the unit vector (averaging); CDFs of Y1 and Y2 before and after]
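The three algorithm steps above can be sketched for a single variable as follows. This is an interpretation, not the authors' code: quantiles are taken at mid-point probabilities (k + 0.5)/n and estimated by linear interpolation when a series has more values than there are quantiles; the function name `quantile_equate` and the data are hypothetical.

```python
import numpy as np

def quantile_equate(series):
    """Equate one variable across several measurement series (list of 1-D arrays)."""
    series = [np.asarray(s, dtype=float) for s in series]
    n_q = min(len(s) for s in series)                  # step 1: number of quantiles
    probs = (np.arange(n_q) + 0.5) / n_q
    # Quantile values per series, interpolated from the sorted data
    q = np.array([
        np.interp(probs, (np.arange(len(s)) + 0.5) / len(s), np.sort(s))
        for s in series
    ])
    q_mean = q.mean(axis=0)                            # step 2: average corresponding quantiles
    equated = []
    for s in series:
        ranks = s.argsort().argsort()                  # rank 0..N-1 of each value
        p = (ranks + 0.5) / len(s)
        which = np.minimum((p * n_q).astype(int), n_q - 1)  # quantile each value belongs to
        equated.append(q_mean[which])                  # step 3: substitute averaged values
    return equated

y1 = np.array([1.0, 3.0, 2.0])   # hypothetical series 1
y2 = np.array([5.0, 7.0, 6.0])   # hypothetical series 2
e1, e2 = quantile_equate([y1, y2])
```

With equally sized series this reduces to sorting both series, averaging the sorted values pairwise, and writing the averages back in each sample's original order, so both series end up with the same distribution.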
  35. PCA scores after equating LC–MS data
      [Figure: PCA scores before and after equating; red: Y1, black: Y2; data mean-centered prior to PCA]

      Y1–Y2 similarity in PCA score space*
      • direction: PCA loadings
      • location: Mahalanobis’ D²
      • variance: Box’s M statistic
      [Figure labels: Y1, Y2, PC3]
      *Jouan-Rimbaud et al., Chemom Intell Lab Syst 1998: 129–144
  36. Y1–Y2 similarity in PCA score space
      [Figure: direction, variance and location similarity, before and after equating]
      All parameters: 0 = ‘dissimilar’, 1 = ‘similar’
      Jouan-Rimbaud et al., Chemom Intell Lab Syst (1998) 129–144

      Effects on clustering results
      No equating, Y1–Y2 datasets combined: obvious between-series effect
      [Figure: clustering result with Y1 and Y2 labels]
      Draisma et al., Anal Chem (2010) 82, 1039–1046
  37. Effects on clustering results
      After quantile equating, Y1–Y2 datasets combined:
      • Y1–Y2 effect removed
      • Biological information extractable from combined dataset
      [Figure: clustering result with ♂ and ♀ labels]
      Draisma et al., Anal Chem (2010) 82, 1039–1046

      Conclusions
      • ‘Garbage in = garbage out’, so try to control data quality as much as possible
      • Proper measurement design allows separation of unwanted experimental variation from biological variation (IS, QCs, replicates)
      • Preprocessing: trade-off between data quality, speed (automation) and completeness (number of features)
      • The road to high quality data is a balanced mix of data acquisition and data processing
  38. Acknowledgements
      DCL and LACDR: Jorne Troost, Evelyne Steenvoorden, Frans van der Kloet, Shanna Shi, Katrin Strassbourgh, Faisa Galud, Vanessa Gonzalez, Rob Vreeken, Margriet Hendriks, Amy Harms, Harmen Draisma, Raymond Ramakers, Thomas Hankemeier, Irina Paliukovich, Adrie Dane
