1 statistical analysis

Biology

Chemistry
Informatics

Evaluation of sample processing
protocols for the analysis of
pumpkin leaf metabolites

Statistics

Goals: Compare different extraction and drying
protocols to identify the “optimal” sample processing
approach
Topics:
1. Data quality overview
2. Statistical comparisons
3. Power analysis

Data Quality Overview

Goal: Calculate and visualize the summary statistics for each
metabolite/treatment (Use DATA: Pumpkin data 1.csv)
Calculate:
1. Mean and standard deviation (sd)
2. The percent relative standard deviation: %RSD = (sd/mean)*100
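
The two calculations above can be sketched in a few lines. This is a minimal illustration, not the workflow used for the slides: the replicate values are made up, `summarize` is a hypothetical helper name, and the sample (n - 1) standard deviation is assumed.

```python
from statistics import mean, stdev

def summarize(values):
    """Return mean, sample sd, and %RSD for one metabolite/treatment."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    rsd = (s / m) * 100 if m != 0 else float("nan")
    return {"mean": m, "sd": s, "rsd": rsd}

# Hypothetical peak intensities for one metabolite under one treatment
replicates = [980.0, 1020.0, 1000.0, 1040.0, 960.0]
stats = summarize(replicates)
print(stats)
```

Applied to the real data, the same function would be run once per metabolite and treatment, giving the table of means, sds, and %RSDs that the plots are built from.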

Visualize:
1. The relationship between mean and sd, and between mean and %RSD
2. Compare mean metabolite values for all treatments
Exercises:
1. Describe the relationship between analyte mean and sd, and between mean and %RSD.
2. What constitutes an “optimal” method?
3. Which extraction/treatment should be chosen to process further samples?

Summary statistics

Mean vs. SD

• Mean and sd are highly correlated
• Larger means have larger sd
• This effect is also called heteroscedasticity

[Figure: SD plotted against mean]

Mean vs. %RSD

• %RSD is minimally correlated with the mean
Can be used as a criterion for:
• Comparing method reproducibility
• Assessing data quality

[Figure: %RSD plotted against mean]

Qualities of %RSD

• %RSD (also called the coefficient of variation or CV) is the sd (variation)
scaled by the mean (magnitude).
• Removes the relationship between variation and magnitude
• Provides a single value which can be used to compare the variation of a
measurement among different treatments/samples

[Figure: mean and sd of the %RSD for all metabolites for a given treatment]

Data quality

[Figure: %RSD plotted against mean, with regions labeled Below LOQ (sensitivity), Bad, Moderate, and Good; reference marks at ~40% RSD and a mean of ~10,000]

Selecting the “optimal” method

Optimal can be:
1. Lowest average %RSD for all measurements
2. Lowest %RSD for measurements of interest
3. Largest number of metabolites passing %RSD cutoff
4. Lowest average %RSD for all measurements passing %RSD cutoff
Using strategy #4 for metabolites with %RSD ≤ 40
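
Strategy #4 can be illustrated with a short sketch. The %RSD values and method names below are hypothetical (not the pumpkin data), and `score_method` is an illustrative helper; only the ≤ 40 cutoff comes from the slide.

```python
from statistics import mean

RSD_CUTOFF = 40.0  # cutoff from the slide: %RSD <= 40

def score_method(rsd_values, cutoff=RSD_CUTOFF):
    """Count metabolites passing the cutoff and average their %RSD."""
    passing = [r for r in rsd_values if r <= cutoff]
    avg = mean(passing) if passing else float("inf")
    return {"count": len(passing), "mean_rsd": avg}

# Hypothetical %RSD values per metabolite for two methods
rsd_by_method = {
    "method_A": [12.0, 35.0, 55.0, 20.0],
    "method_B": [10.0, 18.0, 42.0, 14.0],
}
scores = {name: score_method(v) for name, v in rsd_by_method.items()}

# Strategy #4: lowest average %RSD among metabolites passing the cutoff
best = min(scores, key=lambda name: scores[name]["mean_rsd"])
print(scores, best)
```

Because strategy #4 ignores how many metabolites pass, it is worth reporting the count alongside the average, as strategy #3 does, so that a method passing very few metabolites is not chosen by accident.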

Method #2 (ACN/IPA/water 3:3:2) looks optimal…

[Figure: Count and %RSD (mean ± sd) by method]

Based on Method #2

[Figure: panels with axes Mean, %RSD, and Log Mean; the %RSD ≤ 40 cutoff is marked]

Analytes with high signal and high %RSD should be further interrogated for
explanations of low reproducibility

Statistical comparison of the
effects of sample drying

Goal: Identify the effect of treatment (fresh vs. lyophilized) on the performance of
Methods #3-4 (Use DATA: Pumpkin data 2.csv)

[Figure: Count and %RSD (mean ± sd)]

Steps:
1. Use a t-test to compare metabolite means between treatments
2. Correct for multiple testing using the false discovery rate (FDR) adjusted p-value
3. Estimate FDR (q-value)
Visualize:
1. Relationship between p-value and FDR adjusted p-value
2. Relationship between FDR adjusted p-value and q-value
3. Box plots for highest and lowest p-value metabolites
Questions:
1. When should you use a one-sample, two-sample, or paired t-test, or ANOVA?
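
The t-test step can be sketched without external libraries. The slides do not say which two-sample variant was used, so Welch's unequal-variance test is assumed here. The p-value is obtained by numerically integrating the t density with Simpson's rule, an approximation chosen only to keep the example self-contained; all function names and data are illustrative.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (numeric approximation)."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)

def welch_t_test(x, y):
    """Welch's two-sample t-test (assumed variant): returns (t, df, p)."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df, two_sided_p(t, df)

# Hypothetical intensities for one metabolite: fresh vs. lyophilized
fresh = [1.0, 2.0, 3.0, 4.0, 5.0]
lyophilized = [6.0, 7.0, 8.0, 9.0, 10.0]
t_stat, dof, p_value = welch_t_test(fresh, lyophilized)
print(t_stat, dof, p_value)
```

In practice the test would be run once per metabolite, and the resulting list of p-values passed on to the FDR adjustment.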

*return to 0-introduction

Hypothesis Testing Strategies

• A one-sample t-test compares a sample mean to a single reference value
(a hypothesized population mean)
• A two-sample t-test compares the means of 2 independent populations
• A paired t-test compares the same population measured more than once
(intervention, repeated measures)
• One-way ANOVA (analysis of variance) compares n populations for
one factor
• Two-way ANOVA compares n populations for 2 factors
• ANCOVA (analysis of covariance) adjusts n populations for a
covariate (typically continuous) prior to testing for n factors
• Mixed effects models are a versatile analogue of the linear model or
ANOVA/ANCOVA and are typically used to adjust for covariates or variance due
to repeated measures
*All of the above are parametric tests, some of which have non-parametric analogues
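
To make the difference between the paired and two-sample designs concrete, the sketch below computes only the t statistics (no p-values) on made-up before/after measurements. The function names are illustrative, and the two-sample version assumes equal variances (Student's pooled form).

```python
import math
from statistics import mean, stdev

def one_sample_t(values, mu0=0.0):
    """t statistic for H0: population mean equals mu0 (df = n - 1)."""
    n = len(values)
    return (mean(values) - mu0) / (stdev(values) / math.sqrt(n))

def paired_t(before, after):
    """Paired t statistic: a one-sample test on the within-pair differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return one_sample_t(diffs, 0.0)

def pooled_two_sample_t(x, y):
    """Student's two-sample t statistic (equal variances assumed)."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

# Hypothetical measurements on the same five samples before/after a treatment
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [11.0, 13.5, 10.0, 12.5, 14.0]

t_paired = paired_t(before, after)
t_unpaired = pooled_two_sample_t(after, before)
print(t_paired, t_unpaired)
```

With these numbers the paired statistic is much larger than the unpaired one, because pairing removes the sample-to-sample variation that the two-sample test treats as noise.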

p-value vs. FDR adjusted p-value

Benjamini & Hochberg (1995) (“BH”)
• Accepted standard

Bonferroni
• Very conservative
• adjusted p-value = p-value × number of tests
(e.g. 0.005 × 148 = 0.74)

[Figure: FDR adjusted p-value plotted against p-value]
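
Both adjustments are short enough to write out directly. The sketch below is a plain-Python illustration with made-up p-values; the function names are hypothetical. The BH version uses the standard step-up form: each sorted p-value is multiplied by m/rank, then monotonicity is enforced from the largest p-value down.

```python
def bonferroni(pvalues):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """BH adjusted p-values, returned in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for five metabolites
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
p_bonf = bonferroni(pvals)
p_bh = benjamini_hochberg(pvals)
print(p_bonf)
print(p_bh)
```

The BH values are never larger than the Bonferroni values for the same input, which is why Bonferroni is described as the more conservative of the two.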

p-value vs. q-value

• The q-value can be used to select an appropriate p-value cutoff for an
acceptable FDR across the multiple hypotheses tested
• q=0.05 nicely matches the assumptions of p=0.05 for multiple hypotheses tested
• q-value≤0.2 can be acceptable

[Figure: FDR adjusted p-value plotted against q-value]

Change in metabolites due to
treatment

[Figure legend: effect size, from small to large]
Effect of drying is minimal

[Figure: -log p-value plotted against fold change (relative to fresh), with the FDR p-value = 0.05 threshold marked]

7 significantly different metabolites out of 148 (5%)
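
The two quantities plotted for each metabolite can be computed as below. This is a sketch on invented data: the slide does not state the log base, so base 10 is assumed for the p-value axis, and `volcano_point` is an illustrative name.

```python
import math
from statistics import mean

def volcano_point(fresh, treated, p_value):
    """Return (fold change relative to fresh, -log10 p-value)."""
    fold_change = mean(treated) / mean(fresh)
    return fold_change, -math.log10(p_value)  # log base 10 is an assumption

# Hypothetical intensities and p-value for one metabolite
fresh = [100.0, 110.0, 90.0]
lyophilized = [200.0, 190.0, 210.0]
fc, neg_log_p = volcano_point(fresh, lyophilized, 0.01)
print(fc, neg_log_p)
```

Metabolites with a large fold change in either direction and a large -log p-value sit in the upper corners of such a plot, which is where the significantly different metabolites appear.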

Power analysis

Goal: Use power analysis to plan a follow-up experiment to detect
differences in metabolites due to treatment

Steps:
1. Calculate effect size and power for three metabolites
2. Given the observed effect size calculate the number of samples needed to
reach 80% power
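
The two steps can be approximated with the standard library alone. The sketch below assumes a two-sided, two-sample comparison with equal group sizes and uses the normal approximation to the t-test, which slightly underestimates the required n for small samples. The function names and the choice of Cohen's d as the effect size are assumptions, not necessarily what was used for the slides.

```python
import math
from statistics import NormalDist, mean, stdev

def cohens_d(x, y):
    """Effect size: absolute difference in means scaled by the pooled sd."""
    nx, ny = len(x), len(y)
    sp = math.sqrt(((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2)
                   / (nx + ny - 2))
    return abs(mean(x) - mean(y)) / sp

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate two-sided, two-sample power (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

def n_for_power(d, power=0.8, alpha=0.05):
    """Approximate samples per group needed to reach the target power."""
    z = NormalDist()
    n = 2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2
    return math.ceil(n)

# Hypothetical effect size, for illustration only
d = 0.5
n_needed = n_for_power(d)
print(n_needed, approx_power(d, n_needed))
```

One common way to account for multiple testing at the planning stage is to pass a stricter `alpha` than 0.05, which raises the number of samples needed for the same power.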

Questions:
1. How would you take FDR into account?

Power analysis

Effect size: scaled difference in means between treatments

Power: ability to detect a difference when it exists
(controls the false negative rate)

Significance level (α): probability of being wrong when spotting
a difference (controls the false positive rate)

Power analysis

The minimum fold change (FC) in means observable by the study can be
calculated using the RSD and the estimated effect size to reach 0.8 (80%) power
given the sample size

RSD = 0.21 and effect size (EF) = 1.2

We can observe a minimum of a 38% change in means at 0.8 power (p = 0.05).
