Statistical Analysis of cDNA Microarray Genomics Data
Yuehua Cui, Graduate student, Department of Statistics
December 4th, 2002
Outline of the topics
Post Hoc Analysis
cDNA microarrays: a new technique introduced in 1995 by Schena et al.
Quantitatively monitor expression level for thousands of genes at a time.
All the methods and applications are based on Nylon membrane microarrays and can be extended to other DNA microarray analysis using other platforms.
A number of systematic variations can occur during experiments. For example, the different samples being compared are hybridized on different nylon membranes. Normalization is needed to remove these sources of variation.
Well normalized data are the foundation of good analysis results.
AtlasImage Data Preprocessing
Alignment: each gene is represented by two spots. Match these two spots to a schematic representation of the array. The final intensity for the gene is the average of the intensities of these two spots.
external (global): median intensity of the black space between different panels.
user-defined external: median intensity of a user-defined area.
local: median intensity of the space surrounding the gene spot.
Adjusted intensity = raw intensity - background value
Log transformation: log2, log10 or natural log.
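The preprocessing steps above (average the two spots, subtract background, log-transform) can be sketched as follows. This is a minimal illustration, not the AtlasImage implementation: the function name, the lookup table and the handling of non-positive adjusted intensities are assumptions.

```python
import math

# Hypothetical helper; names are illustrative and not part of AtlasImage.
LOGS = {"log2": math.log2, "log10": math.log10, "ln": math.log}

def preprocess_gene(spot_a, spot_b, background, log="log2"):
    """Average the two duplicate spots, subtract background, log-transform."""
    raw = (spot_a + spot_b) / 2.0      # final intensity = mean of the two spots
    adjusted = raw - background        # adjusted intensity = raw - background
    if adjusted <= 0:
        return None                    # assumption: non-positive values are treated as missing
    return LOGS[log](adjusted)
```

Returning `None` for spots at or below background is one possible convention; flooring to a small positive value is another.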
part of Atlas nylon membrane array
*Note: the two spots above or below the white bar represent one gene, i.e. one gene has two spots.
An example RL95 cell line data set
Each Clontech Stress array contains 234 sequences expressed in response to stress.
Each insert cDNA is denatured and UV cross-linked to a positively charged membrane
Samples are treated with DMSO and BaP (Benzo(a)pyrene) dissolved in DMSO. So DMSO is the control and BaP is the treatment.
DMSO and BaP treated samples are hybridized under the same condition each time. Two membranes are used three times for DMSO and BaP treated samples, respectively.
Three biological replicates done with the same membrane(s) (correlation occurs)
Experimental workflow (figure):
Control RNA sample and test RNA sample → reverse transcription with 33P-dCTP → radio-labelled cDNA probes → hybridization to microarray filters.
Use a Phosphor Imager laser scanner to obtain the density of each spot on the filter.
Compare densities at each spot to determine whether the treatment changes gene expression.
Compile the subset of differentially expressed genes.
Gene | Control | Test
A | 1X | 3X
: | : | :
Z | 1X | 0.5X
Scatter plots of adjusted log intensities for paired experiments of DMSO vs BaP
Global normalization (AtlasImage™)
Assumption: given a large enough sample size, the average signal intensity (gene expression level) does not change.
Sum method: normalization coefficient $k_j = \dfrac{\sum_{i=1}^{n} (I_{1i} - B_1)}{\sum_{i=1}^{n} (I_{2i} - B_2)}$
where $I_{mi}$ = intensity of gene $i$ on array $m$, $m = 1, 2$
$B_m$ = background intensity on array $m$, $m = 1, 2$
$n$ = number of genes on the array
Problem: the assumption may not be valid; stronger signals dominate the summation.
Median method (robust with respect to outliers):
Normalization coefficient $k_j = \dfrac{\mathrm{median}_i\,(I_{1i} - B_1)}{\mathrm{median}_i\,(I_{2i} - B_2)}$
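A minimal sketch of the two global coefficients, to show why the median is less sensitive to a few strong signals. It assumes the coefficient is the ratio of array 1 to array 2 of the background-subtracted intensities (the direction of the ratio is an assumption), and the function name is hypothetical.

```python
from statistics import median

def global_norm_coef(array1, array2, bg1, bg2, method="sum"):
    """Global normalization coefficient from background-subtracted intensities."""
    adj1 = [x - bg1 for x in array1]   # I_1i - B_1
    adj2 = [x - bg2 for x in array2]   # I_2i - B_2
    if method == "sum":
        return sum(adj1) / sum(adj2)
    if method == "median":
        return median(adj1) / median(adj2)
    raise ValueError("method must be 'sum' or 'median'")

# Toy data: the last gene is very strong on array 1 only.
a1 = [110.0, 210.0, 410.0, 10010.0]
a2 = [60.0, 110.0, 210.0, 1010.0]
k_sum = global_norm_coef(a1, a2, 10.0, 10.0, "sum")
k_med = global_norm_coef(a1, a2, 10.0, 10.0, "median")
```

On this toy data the sum coefficient is about 7.9 while the median coefficient is 2.0, because one strong spot dominates the summation.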
Housekeeping gene normalization
Housekeeping genes are a set of genes whose expression levels are not affected by the treatment.
The normalization coefficient is the ratio $m_C / m_T$, where $m_C$ and $m_T$ are the means of the selected housekeeping genes for control and treatment respectively.
Problem: housekeeping genes sometimes change their expression level, in which case the assumption does not hold.
Trimmed mean normalization (adjusted global method)
Trim off the 5% highest and lowest extreme values, then globally normalize the data. The normalization coefficient is $k_i = \tilde{m}_{C_i} / \tilde{m}_{T_i}$,
where $\tilde{m}_{T_i}$ and $\tilde{m}_{C_i}$ are the trimmed means for the $i$th treatment and control respectively.
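The trimmed-mean coefficient can be sketched as below. Two assumptions are made here: 5% is trimmed from each end of the sorted values, and the coefficient is the control trimmed mean divided by the treatment trimmed mean, by analogy with the housekeeping ratio. Function names are hypothetical.

```python
def trimmed_mean(values, trim=0.05):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    xs = sorted(values)
    k = int(len(xs) * trim)            # number of values dropped from each end
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

def trimmed_norm_coef(control, treatment, trim=0.05):
    """Assumed form: trimmed mean of control / trimmed mean of treatment."""
    return trimmed_mean(control, trim) / trimmed_mean(treatment, trim)
```

Multiplying the treatment intensities by this coefficient brings the two trimmed means into agreement.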
Regression normalization:
Fit a linear regression model relating the treatment and control log intensities.
Assumption: all the genes on the array have the same variance (homogeneity).
Test the significance of the intercept. If it is not significant, refit the regression without the intercept.
Transform the treatment data using the fitted line.
Problems: the homogeneity assumption may not hold;
nonlinear trend (the third replicate of the RL95 data has a slight quadratic trend).
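A minimal sketch of regression normalization using ordinary least squares. The direction of the regression is an assumption here: control log intensity is regressed on treatment log intensity, and the treatment values are replaced by their fitted values. The significance test for the intercept is left out; the `intercept` flag only selects which of the two models is fitted.

```python
def fit_line(x, y, intercept=True):
    """Least-squares fit of y = b0 + b1 * x (or y = b1 * x if intercept=False)."""
    n = len(x)
    if not intercept:
        b1 = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        return 0.0, b1
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def regression_normalize(treatment, control, intercept=True):
    """Map treatment log intensities onto the control scale via the fitted line."""
    b0, b1 = fit_line(treatment, control, intercept)
    return [b0 + b1 * t for t in treatment]
```

A curved scatter plot after this step, as noted for the third RL95 replicate, suggests that a straight line is too rigid for the data.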
Scatter plot of log intensity before and after regression normalization
Rank normalization (this method assumes only a small number of genes will be differentially expressed):
Criterion: $R_{Cj} \pm c$, $j = 1, \dots, g$, where $c = g \times 10\%$, $g$ is the total number of genes and $R_{Cj}$ is the rank of gene $j$ in the control.
Choose a set of genes that have a similar expression pattern, i.e. $R_{Tj} \in (R_{Cj} \pm c)$.
Normalization coefficient: $k_i = \bar{m}_{C_i} / \bar{m}_{T_i}$, where $\bar{m}_{T_i}$ and $\bar{m}_{C_i}$ are the means of the selected genes for the $i$th treatment and control respectively.
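The rank-based selection can be sketched as follows. The sketch assumes that a gene is kept when its treatment rank is within c = 10% of g of its control rank, and that the coefficient is the control mean over the treatment mean of the kept genes. Ties in intensity are ranked in input order, which is a simplification.

```python
def ranks(values):
    """Rank 1 = smallest value; ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def rank_norm_coef(control, treatment, frac=0.10):
    """Return (coefficient, indices of rank-invariant genes)."""
    g = len(control)
    c = g * frac                                   # allowed rank shift
    rc, rt = ranks(control), ranks(treatment)
    keep = [j for j in range(g) if abs(rt[j] - rc[j]) <= c]
    mean_c = sum(control[j] for j in keep) / len(keep)
    mean_t = sum(treatment[j] for j in keep) / len(keep)
    return mean_c / mean_t, keep
```

Genes whose rank changes by more than c are treated as candidates for differential expression and do not influence the coefficient.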
Intensity-dependent normalization: a locally nonparametric method that is robust to a small number of differentially expressed genes.
M-A plot of DMSO vs BaP (Before and after intensity-dependent normalization, f=0.3)
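The idea behind the M-A plot and the intensity-dependent correction can be sketched with a simplified local linear smoother. This is not a full lowess: it uses tricube weights over the nearest fraction f of points but omits the robustness iterations. The definitions M = log2(T) - log2(C) and A = (log2(T) + log2(C)) / 2 are the usual convention and are assumed here; all function names are hypothetical.

```python
import math

def ma_values(control, treatment):
    """M = log ratio, A = average log intensity, per gene (log base 2)."""
    m = [math.log2(t) - math.log2(c) for c, t in zip(control, treatment)]
    a = [(math.log2(t) + math.log2(c)) / 2.0 for c, t in zip(control, treatment)]
    return m, a

def local_linear_fit(a, m, f=0.3):
    """Tricube-weighted local linear fit of M on A, evaluated at each A."""
    n = len(a)
    k = min(n, max(2, math.ceil(f * n)))           # points in each neighbourhood
    fitted = []
    for a0 in a:
        dists = sorted(abs(ai - a0) for ai in a)
        h = dists[k - 1] or 1e-12                  # bandwidth = k-th smallest distance
        w = [(1 - min(abs(ai - a0) / h, 1.0) ** 3) ** 3 for ai in a]
        sw = sum(w)
        ma = sum(wi * ai for wi, ai in zip(w, a)) / sw
        mm = sum(wi * mi for wi, mi in zip(w, m)) / sw
        sxx = sum(wi * (ai - ma) ** 2 for wi, ai in zip(w, a))
        sxy = sum(wi * (ai - ma) * (mi - mm) for wi, ai, mi in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mm + slope * (a0 - ma))
    return fitted

def intensity_dependent_normalize(control, treatment, f=0.3):
    """Subtract the fitted trend from M so that log ratios centre on 0."""
    m, a = ma_values(control, treatment)
    trend = local_linear_fit(a, m, f)
    return [mi - ti for mi, ti in zip(m, trend)]
```

A smaller f follows the data more closely; f = 0.3 matches the span quoted in the figure caption.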
Global or local, parametric or nonparametric method
No unique normalization method for the same data. It depends on what kind of experiment you have and what the data look like.
No absolute criteria for normalization. Basically, the normalized log ratio should be centered around 0. Combine with post hoc analysis to choose the best method.
Post Hoc Analysis
Data adjustment: for paired normalization, truncate extreme ratios first, using quantile criteria (1% or 5%, 95% or 99% quantiles).
Parametric tests assume that data follow a certain distribution
Non-parametric tests do not make such assumptions
Check the validity of the assumptions made for a parametric test and make sure the right test is used.
AtlasImage™ 2-fold criterion
AtlasImage™ software reports genes with a 2-fold change as up- or down-regulated genes.
Fails to account for sample variation.
Low-intensity spots tend to have higher ratios.
Ignores the fact that a difference less than 2-fold can also elicit meaningful biological effects.
One sample t-test
Obtain normalized log ratio for each pair (control vs treatment). Calculate the mean and SD for each gene.
Under the null hypothesis, i.e. there is no expression difference, the mean of the log ratio for gene $i$ is 0: $H_0: \mu_i = 0$
The test statistic is $t_i = \dfrac{m_i}{sd_i / \sqrt{n}}$,
where $m_i$ and $sd_i$ are the mean and standard deviation of the log ratio for gene $i$, and $n$ is the number of pairs.
Problem: small sample size; normality assumption; multiple test adjustment.
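A minimal sketch of the one-sample t statistic on the replicate log ratios of a single gene. Only the statistic is computed; the p-value would come from a t distribution with n - 1 degrees of freedom, which is not in the Python standard library. The function name is hypothetical.

```python
import math
from statistics import mean, stdev

def one_sample_t(log_ratios):
    """t statistic for H0: mean log ratio = 0; returns (t, degrees of freedom)."""
    n = len(log_ratios)
    t = mean(log_ratios) / (stdev(log_ratios) / math.sqrt(n))
    return t, n - 1
```

With three replicate pairs there are only 2 degrees of freedom, so the two-sided 5% critical value is about 4.30. This illustrates the small-sample problem: the per-gene sd estimate is very unstable.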
Two sample t-test
Obtain normalized log intensity.
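Let the sample means and variances of the $Y_{ij}$'s for gene $i$ under the two conditions be $\bar{Y}_{i1}, \bar{Y}_{i2}$ and $s_{i1}^2, s_{i2}^2$. With $n$ samples per condition, the test statistic is
$Z_i = \dfrac{\bar{Y}_{i1} - \bar{Y}_{i2}}{\sqrt{s_{i1}^2/n + s_{i2}^2/n}}$
if unequal variance is assumed, and
$Z_i = \dfrac{\bar{Y}_{i1} - \bar{Y}_{i2}}{s_{p,i}\sqrt{2/n}}$, where $s_{p,i}^2 = (s_{i1}^2 + s_{i2}^2)/2$,
with df $d_i = 2(n-1)$ if equal variance is assumed.
Under the normality assumption for $Y_{ij}$, $Z_i$ approximately has a t-distribution with $d_i$ degrees of freedom: $Z_i \sim t_{d_i}$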
Problem: small sample size; normality assumption; multiple test adjustment.
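The two variants of the two-sample statistic can be sketched as below. The Welch-Satterthwaite formula is used for the degrees of freedom in the unequal-variance case; that choice is an assumption, since only the equal-variance df is stated on the slide. The function name is hypothetical.

```python
import math
from statistics import mean, variance

def two_sample_t(y1, y2, equal_var=False):
    """Return (statistic, degrees of freedom) for one gene under two conditions."""
    n1, n2 = len(y1), len(y2)
    v1, v2 = variance(y1), variance(y2)
    if equal_var:
        sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance
        se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
        df = n1 + n2 - 2
    else:
        se = math.sqrt(v1 / n1 + v2 / n2)
        # Welch-Satterthwaite approximation (assumed here)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return (mean(y1) - mean(y2)) / se, df
```

When both conditions have the same number of samples, the two statistics are numerically equal and only the degrees of freedom differ.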
Multiple test adjustment
Hundreds of genes are tested at the same time. Assume 1000 genes are not differentially expressed. A significance level of 0.01 (false positive rate) means that around 10 genes will nevertheless be called significant.
Bonferroni correction: want to make sure that P[≥ 1 gene significant out of 1000] ≤ 0.05. Consequently, the p-value for a single gene to be declared significant is P[single gene] ≤ 0.05/1000 = 0.00005.
Conservative, with low power.
Alternative: keep the family-wise error rate (FWER) manageable and try a fixed p-value, say 0.001, as the significance level.
Westfall and Young’s step-down adjusted P-value.
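The Bonferroni arithmetic above can be sketched in a few lines. The function name and return layout are illustrative only.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (per-gene threshold, adjusted p-values, significance flags)."""
    m = len(p_values)
    threshold = alpha / m                          # e.g. 0.05 / 1000 = 0.00005
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p <= threshold for p in p_values]
    return threshold, adjusted, significant
```

With 1000 genes, a gene with raw p = 0.0004 is not declared significant even though it would pass an unadjusted 0.01 cutoff, which is the loss of power noted above.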
Predictive Interval (PI) method
Use the normalization method discussed above to normalize data.
Obtain the average log ratio(ALR) which is centered around zero.
Using normal approximation method.
Step I: treat the maximum or minimum value of ALR as an outlier if it is greater than mean+3*sd or less than mean-3*sd; delete it from ALR and take it as a differentially expressed gene.
Step II: calculate the mean and sd for the remaining genes and repeat step I.
Repeat the above steps until no more outliers exist. Then calculate the 95% predictive interval for the remaining genes. Values outside the PI are significant.
The final set of differentially expressed genes includes the outliers detected in steps I and II and the genes outside the PI.
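The iterative procedure above can be sketched as follows. The form of the 95% predictive interval, mean ± 1.96 · sd · sqrt(1 + 1/n), is an assumption based on the normal approximation; the slide does not give the formula. The function name is hypothetical.

```python
from statistics import mean, stdev

def predictive_interval_calls(alr, z=1.96, k=3.0):
    """Return (sorted indices of called genes, (lower, upper) of the PI)."""
    remaining = dict(enumerate(alr))
    outliers = []
    while len(remaining) > 2:
        vals = list(remaining.values())
        m, s = mean(vals), stdev(vals)
        # Step I: look at the most extreme remaining value (max or min)
        idx = max(remaining, key=lambda i: abs(remaining[i] - m))
        if abs(remaining[idx] - m) <= k * s:
            break                                  # no more outliers
        outliers.append(idx)                       # call it differentially expressed
        del remaining[idx]                         # Step II: recompute on the rest
    vals = list(remaining.values())
    m, s, n = mean(vals), stdev(vals), len(vals)
    half = z * s * (1.0 + 1.0 / n) ** 0.5          # assumed normal-theory PI
    lo, hi = m - half, m + half
    outside = [i for i, v in remaining.items() if v < lo or v > hi]
    return sorted(outliers + outside), (lo, hi)
```

Removing outliers before estimating the sd keeps a few strongly changed genes from inflating the interval and hiding moderate changes.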
Assume there is a constant coefficient of variation c for the entire gene set.
Then the observed differential expression, $R_k = T_k / C_k$ (ratio of treatment to control intensity at gene k), has a sampling distribution that depends only on c. $R_k$ is approximately normally distributed.
The density function of R becomes:
Use the Maximum likelihood method to estimate the constant c, and use the EM algorithm to get the final estimate of c and m.
Use the polynomial: to get the CI.
Measurement error depends on signal strength.
Significant genes list of BaP/DMSO
For gene i in each paired experiment, permute the data within the pair to get the permuted sample. Under the assumption that genes do not change their expression pattern under the two conditions of study, the control and treatment values within a pair can be exchanged.
Permutation test continued
Get the normalized average log ratio for the original (ALR) and permuted data (ALR*).
Calculate the p-value for gene $i$: $p_i = \#\{\, j : |ALR^*_j| \ge |ALR_i| \,\} / g$,
where $g$ is the total number of genes.
Permute the data n times and obtain n p-values for each gene. Then get the mean and sd of the p-values for each gene and calculate the 95% CI.
If the lower bound is less than 0.05, claim this gene as significant.
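A sketch of the paired permutation test under several stated assumptions: swapping control and treatment within a pair flips the sign of that pair's log ratio; the p-value of gene i in one permutation is the fraction of genes whose permuted |ALR*| is at least |ALR_i|; and the 95% CI of the mean p-value is mean ± 1.96 · sd / sqrt(n). None of these details is spelled out on the slides, and the function name is hypothetical.

```python
import random
from statistics import mean, stdev

def permutation_test(log_ratios, n_perm=100, seed=0):
    """log_ratios[i][r] = normalized log ratio of gene i in replicate pair r.

    Returns one (mean p-value, CI lower bound, significant?) tuple per gene.
    """
    rng = random.Random(seed)
    g = len(log_ratios)
    alr = [abs(mean(row)) for row in log_ratios]
    pvals = [[] for _ in range(g)]
    for _ in range(n_perm):
        # within-pair swap of control and treatment = sign flip of the log ratio
        alr_star = [abs(mean([x * rng.choice((-1, 1)) for x in row]))
                    for row in log_ratios]
        for i in range(g):
            count = sum(1 for v in alr_star if v >= alr[i])
            pvals[i].append(count / g)
    results = []
    for i in range(g):
        m, s = mean(pvals[i]), stdev(pvals[i])
        lower = m - 1.96 * s / n_perm ** 0.5       # assumed CI for the mean p-value
        results.append((m, lower, lower < 0.05))
    return results
```

The cost grows with the square of the number of genes per permutation, which is acceptable for an array of a few hundred genes.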
List of significant genes picked up by permutation test
Significance Analysis of Microarrays (SAM)
Limitations of parametric tests:
Estimation of variance: limited sample size (= few replicates)
Normal distribution assumption: error model still not clear
SAM: Excel add-in performing a robust method for differential analysis of microarray data. Method developed and implemented by the Tibshirani group at Stanford (free for academic use).
Permutation technique: assuming no difference between conditions, all genes are from the same population.
False Discovery Rate: number of falsely called genes divided by the number of genes called differential in the original data.
Needs a large number of replicates.
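SAM test statistic: $d_i = \dfrac{\bar{x}_{T,i} - \bar{x}_{C,i}}{s_i + s_0}$
$d_i$ = score
$\bar{x}_{T,i}$, $\bar{x}_{C,i}$ = mean expression of gene $i$ in treatment and control
$s_i$ = standard deviation
$s_0$ = fudge factor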
The SAM process
Perform permutations and compute the test statistic for each permutation
Rank the test statistics in ascending order
Compute the mean test statistic for each “rank” over all permutations
Plot the original “ranked” test statistic versus the mean test statistic from the permutations
Define the distance from the mean permuted value beyond which a gene is called significant
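The steps above can be sketched as follows. This is a simplified illustration, not the SAM add-in: it assumes an unpaired design, uses the pooled standard error for $s_i$, takes a fixed fudge factor s0 rather than estimating it, and calls a gene significant when its observed score differs from the mean permuted score at the same rank by more than a threshold delta. All function names are hypothetical.

```python
import random
from statistics import mean

def sam_scores(control, treatment, s0=0.1):
    """d_i = (mean_T - mean_C) / (s_i + s0) for each gene (rows = genes)."""
    scores = []
    for c, t in zip(control, treatment):
        n1, n2 = len(c), len(t)
        mc, mt = mean(c), mean(t)
        ss = sum((x - mc) ** 2 for x in c) + sum((x - mt) ** 2 for x in t)
        s = ((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2)) ** 0.5
        scores.append((mt - mc) / (s + s0))
    return scores

def sam_expected(control, treatment, n_perm=100, s0=0.1, seed=0):
    """Mean of the sorted permuted scores at each rank."""
    rng = random.Random(seed)
    n1 = len(control[0])
    totals = [0.0] * len(control)
    for _ in range(n_perm):
        idx = list(range(n1 + len(treatment[0])))
        rng.shuffle(idx)                 # one relabelling of samples for all genes
        perm_c = [[(c + t)[j] for j in idx[:n1]] for c, t in zip(control, treatment)]
        perm_t = [[(c + t)[j] for j in idx[n1:]] for c, t in zip(control, treatment)]
        for rank, d in enumerate(sorted(sam_scores(perm_c, perm_t, s0))):
            totals[rank] += d
    return [x / n_perm for x in totals]

def sam_call(control, treatment, delta, n_perm=100, s0=0.1, seed=0):
    """Indices of genes whose ranked score is more than delta from expectation."""
    d = sam_scores(control, treatment, s0)
    expected = sam_expected(control, treatment, n_perm, s0, seed)
    order = sorted(range(len(d)), key=lambda i: d[i])
    return [i for rank, i in enumerate(order) if abs(d[i] - expected[rank]) > delta]
```

With only three replicates per condition there are few distinct relabellings, which is one reason the method needs a large number of replicates.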
Cutoff point determination: set a critical intensity and eliminate genes whose intensity is below it.
Statistically significant? There is no unique method to analyze the data. Some methods work better for one data set but may not work well for others. In practice, we have to try different approaches to see which methods work well.
Biologically significant? For the genes picked up by the statistics, we have to be careful in drawing conclusions. Some genes shown to be significant may not be functionally meaningful. Conversely, genes that do not show up as significant may still be important, especially genes at the borderline of the statistical test.