
Microarray Statistics



Published in: Education, Technology


  1. 1. Statistical Analysis of cDNA Microarray Genomics Data Yuehua Cui Graduate student Department of Statistics December 4th, 2002
  2. 2. Outline of the topics <ul><li>Introduction </li></ul><ul><li>Data preprocessing </li></ul><ul><ul><li>Alignment </li></ul></ul><ul><ul><li>Background calculation </li></ul></ul><ul><ul><li>Data transformation </li></ul></ul><ul><li>An example </li></ul><ul><li>Normalization Comparison </li></ul><ul><li>Post Hoc Analysis </li></ul>
  3. 3. Introduction <ul><li>New technique introduced in 1995 by Schena. </li></ul><ul><li>Quantitatively monitor expression level for thousands of genes at a time. </li></ul><ul><li>All the methods and applications are based on Nylon membrane microarrays and can be extended to other DNA microarray analysis using other platforms. </li></ul><ul><li>Why normalization: </li></ul><ul><ul><li>A number of systematic variations can occur during experiments. For example, different samples being compared are hybridized on different nylon membranes. Need normalization to remove these sources of variation. </li></ul></ul><ul><ul><li>Well normalized data are the foundation of good analysis results. </li></ul></ul><ul><li>Statistical analysis </li></ul>
  4. 4. AtlasImage Data Preprocessing <ul><li>Alignment: each gene is represented by two spots. Match these two spots to a schematic representation of the array. The final intensity for the gene is the average of the intensities of these two spots. </li></ul><ul><li>Background calculation </li></ul><ul><ul><li>External (global): median intensity of the black space between different panels. </li></ul></ul><ul><ul><li>User-defined external: median intensity of a user-defined area. </li></ul></ul><ul><ul><li>Local: median intensity of the space surrounding the gene spot. </li></ul></ul><ul><li>Data transformation: </li></ul><ul><ul><li>Adjusted intensity = raw intensity - background value </li></ul></ul><ul><ul><li>Log transformation: log2, log10 or natural log. </li></ul></ul>
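The preprocessing steps on this slide can be sketched in a few lines of Python. This is only an illustration, not AtlasImage's implementation: the function name, the `transform` keys and the `floor` argument (which keeps the adjusted intensity positive so the logarithm is defined) are assumptions of the sketch.

```python
import math
from statistics import mean

# Log transformations named on the slide: base 2, base 10, natural.
LOG_FUNCS = {"log2": math.log2, "log10": math.log10, "ln": math.log}

def adjusted_log_intensity(spot_a, spot_b, background, transform="log2", floor=1.0):
    """Average the two spots of a gene, subtract background, log-transform.

    `floor` is an assumption of this sketch, not part of the slide.
    """
    raw = mean([spot_a, spot_b])               # final intensity = average of the two spots
    adjusted = max(raw - background, floor)    # adjusted intensity = raw - background
    return LOG_FUNCS[transform](adjusted)

# Two spots averaging 512 with a background of 256 give log2(256).
print(adjusted_log_intensity(520.0, 504.0, 256.0))
```

The same function can be applied with an external, user-defined or local background value; only the `background` argument changes.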
  5. 5. Part of an Atlas nylon membrane array <ul><li>*Note: the two spots above or below the white bar represent one gene, i.e. one gene has two spots. </li></ul>
  6. 6. An example: RL95 cell line data set <ul><li>Each Clontech Stress array contains 234 sequences expressed in response to stress. </li></ul><ul><li>Each insert cDNA is denatured and UV cross-linked to a positively charged membrane. </li></ul><ul><li>Samples are treated with DMSO or with BaP (benzo(a)pyrene) dissolved in DMSO, so DMSO is the control and BaP is the treatment. </li></ul><ul><li>DMSO and BaP treated samples are hybridized under the same conditions each time. Two membranes are used three times for DMSO and BaP treated samples, respectively. </li></ul><ul><li>Three biological replicates are done with the same membrane(s), so the replicates are correlated. </li></ul>
  7. 7. [Workflow figure] Control RNA sample and test RNA sample → reverse transcription with 33P-dCTP → radio-labelled cDNA probes → hybridization to microarray filters → use Phosphor Imager laser scanner to obtain densities of each spot on filter → compare densities at each spot to determine if treatment changes gene expression → compile subset of differentially expressed genes. Example table (Gene: Control, Test): A: 1X, 3X; … ; Z: 1X, 0.5X.
  8. 8. Scatter plots of adjusted log intensities for paired experiments of DMSO vs BaP
  9. 9. Normalization <ul><li>Global normalization (AtlasImage™) </li></ul><ul><ul><li>Assumption: given a large enough sample size, the average signal intensity (gene expression level) does not change. </li></ul></ul><ul><ul><li>Sum method: the normalization coefficient ($k_j$) is the ratio, between the two arrays, of the background-adjusted intensities $I_{mi} - B_m$ summed over all n genes, </li></ul></ul><ul><ul><ul><li>where $I_{mi}$ = intensity of gene i on array m, m = 1, 2 </li></ul></ul></ul><ul><ul><ul><li>$B_m$ = background intensity on array m, m = 1, 2 </li></ul></ul></ul><ul><ul><ul><li>n = number of genes on the array </li></ul></ul></ul><ul><ul><li>Problem: validity of the assumption; stronger signals dominate the summation. </li></ul></ul><ul><ul><li>Median method (robust with respect to outliers): the normalization coefficient ($k_j$) is computed from medians instead of sums. </li></ul></ul>
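A minimal sketch of the two global coefficients, under stated assumptions: the transcript does not preserve the formulas, so the direction of the ratio (array 1 over array 2) and the use of a ratio of medians for the median method are assumptions of this example, as is the function name.

```python
from statistics import median

def global_coefficient(array1, array2, bg1, bg2, method="sum"):
    """Global normalization coefficient between two arrays.

    Assumed form: summary(array 1) / summary(array 2), computed on
    background-adjusted intensities I_mi - B_m.
    """
    adj1 = [x - bg1 for x in array1]
    adj2 = [x - bg2 for x in array2]
    if method == "sum":
        return sum(adj1) / sum(adj2)
    if method == "median":
        return median(adj1) / median(adj2)
    raise ValueError("method must be 'sum' or 'median'")

# One very strong spot on array 1 dominates the sum but not the median.
array1 = [110.0, 210.0, 410.0, 5010.0]   # background 10
array2 = [55.0, 105.0, 205.0, 1005.0]    # background 5
print(global_coefficient(array1, array2, 10.0, 5.0, "sum"))
print(global_coefficient(array1, array2, 10.0, 5.0, "median"))
```

In this toy data three of the four genes differ by a factor of 2 between arrays; the median-based coefficient recovers that factor, while the sum-based one is pulled upward by the single strong signal, which is the problem the slide points out.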
  10. 10. Normalization continued <ul><li>Housekeeping gene normalization </li></ul><ul><ul><li>Housekeeping genes are a set of genes whose expression levels are not affected by the treatment. </li></ul></ul><ul><ul><li>The normalization coefficient is the ratio $m_C/m_T$, where $m_C$ and $m_T$ are the means of the selected housekeeping genes for control and treatment respectively. </li></ul></ul><ul><ul><li>Problem: housekeeping genes sometimes change their expression level, in which case the assumption does not hold. </li></ul></ul><ul><li>Trimmed mean normalization (adjusted global method) </li></ul><ul><ul><li>Trim off the 5% highest and lowest extreme values, then globally normalize the data. The normalization coefficient is a ratio of the trimmed means for the i-th treatment and the control. </li></ul></ul>
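The trimmed mean coefficient can be sketched as follows. The control-over-treatment direction mirrors the housekeeping ratio $m_C/m_T$ on this slide and is an assumption for the trimmed case; the function names are illustrative.

```python
import math

def trimmed_mean(values, trim=0.05):
    """Mean after dropping the `trim` fraction of lowest and highest values."""
    ordered = sorted(values)
    k = math.floor(len(ordered) * trim + 1e-9)   # number of values cut at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def trimmed_mean_coefficient(control, treatment, trim=0.05):
    """Assumed form: trimmed mean of control / trimmed mean of treatment."""
    return trimmed_mean(control, trim) / trimmed_mean(treatment, trim)

# 20 genes; one extreme value (1000) is trimmed away before averaging.
control = [float(v) for v in range(1, 20)] + [1000.0]
treatment = [2.0 * v for v in control]
print(trimmed_mean(control))                        # mean of 2..19
print(trimmed_mean_coefficient(control, treatment))
```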
  11. 11. Normalization continued <ul><li>Regression normalization: </li></ul><ul><ul><li>Fit a linear regression model relating the treatment and control data. </li></ul></ul><ul><ul><li>Assumption: all the genes on the array have the same variance (homogeneity). </li></ul></ul><ul><ul><li>Test the significance of the intercept. Refit the linear regression without the intercept if it is not significant. </li></ul></ul><ul><ul><li>Transform the treatment data using the fitted model. </li></ul></ul><ul><ul><li>Problems: </li></ul></ul><ul><ul><ul><li>The assumption may not hold. </li></ul></ul></ul><ul><ul><ul><li>Nonlinear trend (the third replicate of the RL95 data has a slight quadratic trend). </li></ul></ul></ul>
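One way to read the regression approach is sketched below. The model and the transformation are not preserved in the transcript, so this example assumes the treatment log intensity is regressed on the control log intensity and that the treatment data are mapped back onto the control scale with the fitted line. The intercept significance test from the slide is omitted here.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def regression_normalize(control, treatment):
    """Assumed transformation: (treatment - intercept) / slope."""
    intercept, slope = fit_line(control, treatment)
    return [(t - intercept) / slope for t in treatment]

# Treatment log intensities with a systematic shift and scale relative to control.
control = [6.0, 7.0, 8.0, 9.0, 10.0]
treatment = [0.5 + 1.2 * c for c in control]
print(regression_normalize(control, treatment))
```

With a purely linear distortion the normalized treatment values coincide with the control values; a quadratic trend such as the one noted for the third RL95 replicate would leave a residual pattern.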
  12. 12. Scatter plot of log intensity before and after regression normalization
  13. 13. Normalization continued <ul><li>Rank normalization (this method assumes only a small number of genes will be differentially expressed): </li></ul><ul><ul><li>$R_{Cj} \pm c$ criterion, j = 1, …, g, where c = g × 10%, g is the total number of genes and $R_{Cj}$ is the rank of gene j in the control. </li></ul></ul><ul><ul><li>Choose a set of genes which have a similar expression pattern, i.e. $R_{Tj} \in (R_{Cj} \pm c)$. </li></ul></ul><ul><ul><li>Normalization coefficient: a ratio of the means of the selected genes for the i-th treatment and the control. </li></ul></ul><ul><ul><li>Question: how to choose c? </li></ul></ul><ul><ul><li>Rank invariant genes (Eric Schadt, 2001, Journal of Cellular Biochemistry (supplement) 37:120-125) </li></ul></ul>
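The rank criterion can be illustrated with a short sketch. It assumes distinct intensity values (no tie handling), takes c as 10% of the number of genes as on the slide, and assumes a control-over-treatment ratio of means for the coefficient; all names are illustrative.

```python
def ranks(values):
    """Rank from 1 (smallest) to g (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def rank_invariant_genes(control, treatment, fraction=0.10):
    """Indices j whose treatment rank lies within R_Cj +/- c, with c = g * fraction."""
    g = len(control)
    c = g * fraction
    r_control, r_treatment = ranks(control), ranks(treatment)
    return [j for j in range(g) if abs(r_treatment[j] - r_control[j]) <= c]

def rank_coefficient(control, treatment, fraction=0.10):
    """Assumed form: mean(control) / mean(treatment) over the selected genes."""
    selected = rank_invariant_genes(control, treatment, fraction)
    mean_c = sum(control[j] for j in selected) / len(selected)
    mean_t = sum(treatment[j] for j in selected) / len(selected)
    return mean_c / mean_t

control = [float(v) for v in range(1, 11)]
treatment = [2.0 * v for v in control]
treatment[0] = 100.0          # gene 0 is strongly induced and changes rank
print(rank_invariant_genes(control, treatment))
print(rank_coefficient(control, treatment))
```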
  14. 14. Normalization continued <ul><li>Intensity-dependent normalization (Yang, YH, 2002) </li></ul><ul><ul><li>Draw an M-A plot to check the data distribution, where M is the log ratio of the two intensities and A is their average log intensity. </li></ul></ul><ul><ul><li>Use the lowess function in R to perform the normalization, where c(A) is the lowess fit to the M-A plot. </li></ul></ul><ul><ul><li>Transform the data by M' = M - c(A). </li></ul></ul><ul><ul><li>This is a locally fitted nonparametric method and is robust to a small number of differentially expressed genes. </li></ul></ul>
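The transformation M' = M - c(A) can be sketched without R. The smoother below is a simplified, non-robust local linear fit with tricube weights that stands in for R's lowess; it is not the same algorithm (it has no robustness iterations). The base-2 logarithm and the treatment-over-control direction of M are assumptions of this example.

```python
import math

def ma_values(treatment, control):
    """M = log2(T / C), A = average of log2(T) and log2(C) (assumed definitions)."""
    m = [math.log2(t / c) for t, c in zip(treatment, control)]
    a = [0.5 * math.log2(t * c) for t, c in zip(treatment, control)]
    return m, a

def local_linear_fit(a, m, f=0.3):
    """Tricube-weighted local linear fit c(A), evaluated at every observed A."""
    n = len(a)
    k = max(3, math.ceil(f * n))              # points in each local window
    fitted = []
    for a0 in a:
        dists = sorted(abs(ai - a0) for ai in a)
        h = dists[min(k, n) - 1] or 1e-12     # bandwidth: distance to the k-th nearest point
        w = [(1.0 - min(abs(ai - a0) / h, 1.0) ** 3) ** 3 for ai in a]
        sw = sum(w)
        mean_a = sum(wi * ai for wi, ai in zip(w, a)) / sw
        mean_m = sum(wi * mi for wi, mi in zip(w, m)) / sw
        sxx = sum(wi * (ai - mean_a) ** 2 for wi, ai in zip(w, a))
        sxy = sum(wi * (ai - mean_a) * (mi - mean_m) for wi, ai, mi in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

def intensity_dependent_normalize(treatment, control, f=0.3):
    """Return M' = M - c(A) for each gene."""
    m, a = ma_values(treatment, control)
    return [mi - ci for mi, ci in zip(m, local_linear_fit(a, m, f))]
```

The span argument `f` plays the role of the f = 0.3 setting quoted for the M-A plot of DMSO vs BaP; larger values give a smoother c(A).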
  15. 15. M-A plot of DMSO vs BaP (Before and after intensity-dependent normalization, f=0.3)
  16. 16. Conclusion <ul><li>Global or local, parametric or nonparametric method </li></ul><ul><li>No unique normalization method for the same data. It depends on what kind of experiment you have and what the data look like. </li></ul><ul><li>No absolute criteria for normalization. Basically, the normalized log ratio should be centered around 0. Combine with post hoc analysis to choose the best one. </li></ul>
  17. 17. Post Hoc Analysis <ul><li>Before analysis </li></ul><ul><ul><li>Data adjustment: for paired normalization, truncate big ratios first. Quantile criteria (1% or 5%, 95% or 99% quantiles) </li></ul></ul><ul><ul><li>Parametric tests assume that data follow a certain distribution. </li></ul></ul><ul><ul><li>Non-parametric tests do not make such assumptions. </li></ul></ul><ul><ul><li>Check the validity of the assumptions made for a parametric test and make sure the right test is used. </li></ul></ul>
  18. 18. AtlasImage™ 2-fold criterion <ul><li>AtlasImage™ software reports genes with a 2-fold change as up- or down-regulated genes. </li></ul><ul><li>Fails to account for sample variation. </li></ul><ul><li>Low-intensity spots tend to have higher ratios. </li></ul><ul><li>Ignores the fact that a difference of less than 2-fold can also elicit meaningful biological effects. </li></ul>
  19. 19. One sample t-test <ul><li>Obtain the normalized log ratio for each pair (control vs treatment). Calculate the mean and SD for each gene. </li></ul><ul><li>Hypothesis: under the null hypothesis, i.e. there is no expression difference, the mean of the log ratio for gene i is 0: $H_0: \mu_i = 0$. </li></ul><ul><li>The test statistic is $t_i = m_i / (sd_i / \sqrt{n})$, </li></ul><ul><li>where $m_i$ and $sd_i$ are the mean and standard deviation of the log ratio for gene i and n is the number of pairs. </li></ul><ul><li>Problems: small sample size; normality assumption; multiple test adjustment. </li></ul>
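As a small illustration, the one-sample statistic for a single gene can be computed from its normalized log ratios. The function name and the example values are hypothetical; with three replicate pairs the statistic has only 2 degrees of freedom, which is the small sample size problem noted on the slide.

```python
import math
from statistics import mean, stdev

def one_sample_t(log_ratios):
    """t statistic for H0: mean log ratio = 0, with n - 1 degrees of freedom."""
    n = len(log_ratios)
    m_i = mean(log_ratios)
    sd_i = stdev(log_ratios)            # sample standard deviation
    return m_i / (sd_i / math.sqrt(n)), n - 1

# Hypothetical log ratios of one gene in three paired experiments.
t_stat, df = one_sample_t([0.9, 1.0, 1.1])
print(t_stat, df)
```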
  20. 20. Two sample t-test <ul><li>Obtain the normalized log intensity. </li></ul><ul><li>Let $\bar{Y}_{i1}, \bar{Y}_{i2}$ and $s_{i1}^2, s_{i2}^2$ be the sample means and variances of the $Y_{ij}$'s for gene i under the two conditions, with n replicates per condition. The test statistic is $Z_i = (\bar{Y}_{i1} - \bar{Y}_{i2}) / \sqrt{(s_{i1}^2 + s_{i2}^2)/n}$, </li></ul><ul><li>with df $d_i = (n-1)(s_{i1}^2 + s_{i2}^2)^2 / (s_{i1}^4 + s_{i2}^4)$ if unequal variance is assumed, and </li></ul><ul><li>with df $d_i = 2(n-1)$ if equal variance is assumed. </li></ul><ul><li>Under the normality assumption for $Y_{ij}$, $Z_i$ approximately has a t-distribution with $d_i$ degrees of freedom. </li></ul><ul><li>Problems: small sample size; normality assumption; multiple test adjustment. </li></ul>
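A sketch of the two-sample statistic for one gene, using the standard textbook forms: the pooled-variance t-test with n1 + n2 - 2 degrees of freedom, and the unequal-variance (Welch) version with Satterthwaite degrees of freedom. The function name is illustrative.

```python
import math
from statistics import mean, variance

def two_sample_t(y1, y2, equal_var=False):
    """Return (t statistic, degrees of freedom) for one gene."""
    n1, n2 = len(y1), len(y2)
    v1, v2 = variance(y1), variance(y2)      # sample variances
    if equal_var:
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = math.sqrt(pooled * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        se = math.sqrt(v1 / n1 + v2 / n2)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
    return (mean(y1) - mean(y2)) / se, df

print(two_sample_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], equal_var=True))
print(two_sample_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], equal_var=False))
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ, so the choice mainly affects how conservative the resulting p-value is.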
  21. 21. Multiple test adjustment <ul><li>Hundreds of genes are tested at the same time. Assume 1000 genes are not differentially expressed. A p-value cutoff of 0.01 (false positive rate) means that around 10 genes will nevertheless be called significant. </li></ul><ul><li>Bonferroni correction: we want to make sure that P[≥ 1 gene significant from 1000] ≤ 0.05. Consequently, the p-value required for a single gene to be declared significant is P[single gene] ≤ 0.05/1000 = 0.00005. </li></ul><ul><li>Conservative, with lower power. </li></ul><ul><li>Alternatively, keep the family-wise error rate (FWR) manageable and try some p-value, say 0.001, as the significance level. </li></ul><ul><li>Westfall and Young’s step-down adjusted p-value. </li></ul>
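The arithmetic on this slide is easy to reproduce. The helper names below are illustrative; the adjusted p-value min(1, m × p) is the usual way of reporting a Bonferroni correction per gene.

```python
def expected_false_positives(p_cutoff, n_null_genes):
    """Expected number of null genes called significant at a per-gene cutoff."""
    return p_cutoff * n_null_genes

def bonferroni_threshold(alpha, n_tests):
    """Per-gene cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: min(1, m * p)."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

print(expected_false_positives(0.01, 1000))   # about 10 genes
print(bonferroni_threshold(0.05, 1000))       # 0.00005
print(bonferroni_adjust([0.00001, 0.01, 0.5]))
```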
  22. 22. Predictive Interval (PI) method <ul><li>Use the normalization methods discussed above to normalize the data. </li></ul><ul><li>Obtain the average log ratio (ALR), which is centered around zero. </li></ul><ul><li>Use the normal approximation method. </li></ul><ul><ul><li>Step I: treat the maximum or minimum value of ALR as an outlier if it is greater than mean+3*sd or less than mean-3*sd; delete it from ALR and take it as a differentially expressed gene. </li></ul></ul><ul><ul><li>Step II: calculate the mean and sd for the remaining genes and repeat step I. </li></ul></ul><ul><ul><li>Repeat the above steps iteratively until no more outliers exist. Then calculate the 95% predictive interval for the remaining genes. Values outside of the PI are significant. </li></ul></ul><ul><ul><li>The final set of differentially expressed genes includes the outliers detected in steps I and II and the genes outside of the PI. </li></ul></ul>
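The iterative procedure above can be sketched as follows. The slide does not give the interval formula, so this example assumes the normal approximation mean ± 1.96 × sd for the 95% predictive interval; the function name and the example ALR values are hypothetical.

```python
from statistics import mean, stdev

def predictive_interval_calls(alr, z=1.96):
    """Return (sorted indices of called genes, (low, high) predictive interval)."""
    remaining = dict(enumerate(alr))
    outliers = []
    # Steps I and II: remove the most extreme value while it lies beyond mean +/- 3*sd.
    while len(remaining) > 2:
        values = list(remaining.values())
        mu, sd = mean(values), stdev(values)
        worst = max(remaining, key=lambda i: abs(remaining[i] - mu))
        if sd == 0 or abs(remaining[worst] - mu) <= 3 * sd:
            break
        outliers.append(worst)
        del remaining[worst]
    # Assumed 95% predictive interval on the remaining genes.
    values = list(remaining.values())
    mu, sd = mean(values), stdev(values)
    low, high = mu - z * sd, mu + z * sd
    outside = [i for i, v in remaining.items() if v < low or v > high]
    return sorted(outliers + outside), (low, high)

alr = [-0.1, 0.05, 0.0, 0.1, -0.05, 0.02, -0.02, 0.08, -0.08, 0.03, -0.03, 0.01, 5.0]
called, interval = predictive_interval_calls(alr)
print(called, interval)
```

Recomputing the mean and sd after each removal matters: a single extreme value inflates the sd and can mask other genes that would otherwise fall outside the interval.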
  23. 23. Yidong’s algorithm <ul><li>Assumptions: </li></ul><ul><ul><li>There is a constant coefficient of variation c for the entire gene set. </li></ul></ul><ul><ul><li>The observed differential expression, R k =T k /C k (ratio of treatment and control intensity at gene k), has a sampling distribution dependent only on c. R k is approximately normally distributed. </li></ul></ul><ul><ul><li>Assume </li></ul></ul><ul><ul><li>The density function of R becomes: </li></ul></ul><ul><li>Use the maximum likelihood method to estimate the constant c, and use the EM algorithm to get the final estimate of c and m. </li></ul><ul><li>Use the polynomial: to get the CI. </li></ul><ul><li>Measurement error depends on signal strength. </li></ul>
  24. 24. Significant genes list of BaP/DMSO
  25. 25. Permutation test <ul><li>For gene i in each paired experiment, permute data within the pair to get the permuted sample. Under the assumption that genes do not change their expression pattern under the two conditions of study, we can permute the data within each pair. </li></ul>
  26. 26. Permutation test continued <ul><li>Get the normalized average log ratio for the original data (ALR) and the permuted data (ALR*). </li></ul><ul><li>Calculate the p-value for gene i: </li></ul><ul><li>where g is the total number of genes. </li></ul><ul><li>Permute the data n times and obtain n p-values for each gene. Then get the mean and sd of these p-values for each gene and calculate the 95% CI. </li></ul><ul><li>If the lower bound is less than 0.05, claim this gene as significant. </li></ul>
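A simplified sketch of the within-pair permutation idea follows. The p-value formula is not preserved in the transcript, so this example assumes that the p-value of gene i in one permutation is the fraction of the g permuted values with |ALR*| at least as large as |ALR_i|, and it reports only the mean p-value over permutations rather than the 95% CI. All names are hypothetical.

```python
import random
from statistics import mean

def average_log_ratios(pairs_by_gene):
    """ALR per gene from (control, treatment) log intensity pairs."""
    return [mean(t - c for c, t in pairs) for pairs in pairs_by_gene]

def permuted_alr(pairs_by_gene, rng):
    """Swap control and treatment within each pair with probability 1/2."""
    result = []
    for pairs in pairs_by_gene:
        diffs = [(t - c) if rng.random() < 0.5 else (c - t) for c, t in pairs]
        result.append(mean(diffs))
    return result

def permutation_p_values(pairs_by_gene, n_perm=200, seed=0):
    """Mean over permutations of the assumed per-permutation p-value."""
    rng = random.Random(seed)
    observed = average_log_ratios(pairs_by_gene)
    g = len(observed)
    totals = [0.0] * g
    for _ in range(n_perm):
        null = permuted_alr(pairs_by_gene, rng)
        for i, obs in enumerate(observed):
            totals[i] += sum(abs(x) >= abs(obs) for x in null) / g
    return [total / n_perm for total in totals]
```

Each gene is passed as a list of (control, treatment) pairs, one pair per replicate experiment, for example `[(8.0, 10.0), (8.0, 10.1), (8.0, 9.9)]` for a gene measured in three paired hybridizations.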
  27. 27. List of significant genes picked up by permutation test
  28. 28. Significance Analysis of Microarrays (SAM) <ul><li>Limitations of parametric tests: </li></ul><ul><ul><li>Estimation of variance: limited sample size (= few replicates) </li></ul></ul><ul><ul><li>Normal distribution assumptions: error model still not clear </li></ul></ul><ul><ul><li>Multiple testing </li></ul></ul><ul><li>Excel add-in performing a robust method for differential analysis of microarray data (method developed and implemented by the Tibshirani group at Stanford; free for academic use). </li></ul><ul><li>Permutation technique: assuming no difference between conditions, all genes are from the same population. </li></ul><ul><li>False discovery rate: number of falsely called genes divided by the number of genes called differential in the original data. </li></ul><ul><li>Needs a large number of replicates. </li></ul>
  29. 29. SAM test statistic <ul><li>$d_i$ = score </li></ul><ul><li>$s_i$ = standard deviation </li></ul><ul><li>$s_0$ = fudge factor </li></ul>
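The slide's formula image is missing from the transcript; in the published SAM method the score has the form d_i = (mean difference between conditions) / (s_i + s_0). A sketch under that reading, with s_i taken as the pooled standard error of the difference, is shown below. The choice of s_0 is left to the caller here, whereas SAM selects it from the data; the function name is illustrative.

```python
import math
from statistics import mean

def sam_score(treatment, control, s0=0.0):
    """d_i = (mean_T - mean_C) / (s_i + s0), with s_i a pooled standard error."""
    n1, n2 = len(treatment), len(control)
    m1, m2 = mean(treatment), mean(control)
    ss = sum((x - m1) ** 2 for x in treatment) + sum((x - m2) ** 2 for x in control)
    s_i = math.sqrt((1 / n1 + 1 / n2) / (n1 + n2 - 2) * ss)
    return (m1 - m2) / (s_i + s0)

print(sam_score([3.0, 4.0, 5.0], [1.0, 2.0, 3.0]))          # s0 = 0: ordinary t statistic
print(sam_score([3.0, 4.0, 5.0], [1.0, 2.0, 3.0], s0=0.5))  # fudge factor shrinks the score
```

The fudge factor keeps genes with very small s_i, typically low-intensity genes, from producing large scores purely because of a tiny denominator.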
  30. 30. The SAM process <ul><li>Perform permutations and compute the test statistics for each permutation. </li></ul><ul><li>Rank the test statistics in ascending order. </li></ul><ul><li>Compute the mean test statistic for each “rank” over all permutations. </li></ul><ul><li>Plot the original “ranked” test statistic versus the mean test statistic from the permutations. </li></ul><ul><li>Define the distance from the mean permuted value beyond which a gene is called significant. </li></ul><ul><li>Compute the false discovery rate for this value. </li></ul><ul><li>Iterate until you get an appropriate FDR. </li></ul>
  31. 33. SAM analysis
  32. 34. Other Methods and Software <ul><li>ANOVA </li></ul><ul><li>Likelihood ratio test </li></ul><ul><li>Bayesian analysis </li></ul><ul><li>GeneSpring, GenePix etc. </li></ul>
  33. 35. Conclusion <ul><li>Cutoff point determination: set up a critical point to eliminate genes whose intensity is less than this point. </li></ul><ul><li>Statistically significant? No unique method to analyze data. Some methods are better for one data set, but may not be good for other data sets. In practice, we have to try different ways to see which methods work well. </li></ul><ul><li>Biologically significant? For those genes picked up by statistics, we have to be careful in drawing conclusions. Some genes shown to be significant may not be functionally meaningful. Conversely, genes that do not show up as significant may still be significant, especially those at the borderline in the statistical test. </li></ul>
  34. 36. Acknowledgements <ul><li>Dept. of Pharmacology & Therapeutics </li></ul><ul><li>Dr. Shiverick </li></ul><ul><li>Terry Medrano </li></ul><ul><li>Renita Handayani </li></ul><ul><li>Dept. of Statistics </li></ul><ul><li>Dr. Booth </li></ul>