
# DESeq Paper Journal club


Published in: Education, Technology


1. Background - DESeq
   - Modelling the number of reads sequenced from a gene X
     - Can use a Binomial $B(n, p)$, where $n$ = total number of reads and $p$ = probability that a read comes from gene X
     - Can approximate with a Poisson($np$), since $n$ is large and $p$ is small
     - The Poisson model works acceptably for a gene's variation between technical replicates
     - However, the Poisson model underestimates variation between biological replicates
     - edgeR and DESeq use a negative binomial instead (for gene $i$ in sample $j$)
       Equation (1): $K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma^2_{ij})$
     - The negative binomial has two parameters, mean $\mu$ and variance $\sigma^2$
     - The number of replicates is usually too small to estimate both for a gene X
   - edgeR
     - Assumes $\sigma^2 = \mu + \alpha \mu^2$, where $\alpha$ is the same for all genes
     - Just needs to estimate $\mu$ for a gene, then calculates $\sigma^2$ from that
   - DESeq
     - For each condition, makes a local regression of $\sigma^2$ versus $\mu$
     - Given $\mu$ for gene X, uses the local regression to estimate $\sigma^2$
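
To make the edgeR mean-variance assumption on this slide concrete, here is a minimal sketch that evaluates $\sigma^2 = \mu + \alpha \mu^2$ and converts a (mean, variance) pair into the usual (r, p) form of the negative binomial. The function names and the value `alpha=0.1` are illustrative assumptions, not taken from the paper or from either package.

```python
import math

def quadratic_variance(mu, alpha):
    """Variance under the slide's edgeR assumption: sigma^2 = mu + alpha * mu^2."""
    return mu + alpha * mu ** 2

def nb_shape_from_moments(mu, var):
    """Convert (mean, variance) to the (r, p) parameters of a negative binomial.

    Uses mean = r(1-p)/p and variance = r(1-p)/p^2, which requires var > mu.
    """
    if var <= mu:
        raise ValueError("a negative binomial needs variance > mean")
    p = mu / var
    r = mu * mu / (var - mu)
    return r, p

# alpha = 0.1 is an arbitrary illustrative dispersion, not an estimate from data.
for mu in (10, 100, 1000):
    var = quadratic_variance(mu, alpha=0.1)
    # For a Poisson the ratio var / mu would be exactly 1 at every mean.
    print(mu, var, var / mu)
```

The ratio of variance to mean grows with the mean, which is why the Poisson underestimate matters most for strongly expressed genes.
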
2. Results & Discussion
   - DESeq's model makes three assumptions
     - Equation (2): $\mu_{ij} = q_{i,\rho(j)} \, s_j$
       - $\mu_{ij}$ = expected value of the count (number of reads) for gene $i$ in sample $j$
       - $q_{i,\rho(j)}$ = proportional to the concentration of fragments from gene $i$ in sample $j$
       - $s_j$ = coverage (sampling depth) of library $j$
     - Equation (3): $\sigma^2_{ij} = \mu_{ij} + s_j^2 \, v_{i,\rho(j)}$
       - $\sigma^2_{ij}$ = variance of the number of reads for gene $i$ in sample $j$
       - $\mu_{ij}$ = variance due to the Poisson model (technical variation) = "shot noise"
       - $s_j^2 \, v_{i,\rho(j)}$ = variance due to biological variation(?) = "raw variance"
     - Equation (4): $v_{i,\rho(j)} = v_\rho(q_{i,\rho(j)})$
       - i.e. $v_{i,\rho(j)}$ is a function of $q_{i,\rho(j)}$
       - So we can make a regression of $v_{i,\rho(j)}$ against $q_{i,\rho(j)}$ for lots of genes $i$
       - Then estimate $v_{i,\rho(j)}$ for gene X, based on $q_{i,\rho(j)}$ and the regression line
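
As a small worked example of Equations (2) and (3), the sketch below splits the variance of one count into its shot-noise and raw-variance parts. The function name and the stand-in variance function $v(q) = \alpha q^2$ are assumptions for illustration only; DESeq obtains $v_\rho$ from a local regression rather than a fixed formula.

```python
import math

def variance_components(q, s, raw_variance_fn):
    """Split the count variance into the two terms of Equation (3).

    q: expression strength of the gene in this condition
    s: size factor (sampling depth) of the library
    raw_variance_fn: hypothetical stand-in for the fitted function v_rho(q)
    """
    mu = q * s                            # Equation (2)
    shot_noise = mu                       # Poisson (technical) part
    raw = s ** 2 * raw_variance_fn(q)     # biological part, scaled by depth
    return {"mu": mu, "shot_noise": shot_noise,
            "raw_variance": raw, "total": shot_noise + raw}

# Assumed smooth dependence v(q) = 0.1 * q^2, chosen only for illustration.
parts = variance_components(q=50.0, s=2.0, raw_variance_fn=lambda q: 0.1 * q ** 2)
print(parts)
```

With this particular choice of $v(q)$ the total reduces to $\mu + \alpha \mu^2$, i.e. the single-parameter form assumed by edgeR; a general $v_\rho$ does not have to follow that shape.
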
3. DESeq's model: estimating parameters
   - $s_j$: coverage (sampling depth) of library $j$
     - The total number of reads in library $j$ is not a good measure of depth
     - Instead, take the median (over all genes) of the ratios of each observed count to the gene's geometric mean across the $m$ samples:
       Equation (5): $s_j = \operatorname{median}_i \dfrac{k_{ij}}{\left( \prod_{v=1}^{m} k_{iv} \right)^{1/m}}$
   - $q_{i,\rho}$ = "expression strength" parameter for gene $i$ in condition $\rho$
     - Proportional to the concentration of fragments from gene $i$ in sample $j$
     - Use the average of the depth-normalised counts from the samples $j$ of condition $\rho$:
       Equation (6): $q_{i,\rho} = \dfrac{1}{m_\rho} \sum_{j:\,\rho(j)=\rho} \dfrac{k_{ij}}{s_j}$
   - $v_\rho$ = function describing how $v_{i,\rho(j)}$ depends on $q_{i,\rho(j)}$
     - Estimate the sample variance $w_{i,\rho}$ for each gene $i$ (Equation 7)
     - Fit a local regression line to $w_{i,\rho}$ versus $q_{i,\rho}$
     - For a particular $q_{i,\rho}$ value, predict $w = w_\rho(q_{i,\rho})$ from the regression line
     - Also calculate $z_{i,\rho}$ for gene $i$ (Equation 8)
     - Then use $v = w - z_{i,\rho}$ as an unbiased estimate of the variance $v_{i,\rho}$ for gene $i$ (Equation 9)
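
The size-factor and expression-strength steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the DESeq implementation: the function names are made up, genes with a zero count are skipped because their geometric mean is zero, and the per-gene variance is taken to be the ordinary sample variance of the depth-normalised counts.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors (Equation 5).

    counts: list of genes, each a list of raw counts, one per sample.
    Genes containing a zero count are skipped (an assumption of this sketch).
    """
    m = len(counts[0])
    ratios = [[] for _ in range(m)]
    for row in counts:
        if any(k <= 0 for k in row):
            continue
        # Geometric mean across samples, computed on the log scale.
        geo_mean = math.exp(sum(math.log(k) for k in row) / m)
        for j, k in enumerate(row):
            ratios[j].append(k / geo_mean)
    return [median(r) for r in ratios]

def expression_strength(row, s):
    """Mean of depth-normalised counts for one gene in one condition (Equation 6)."""
    return sum(k / sj for k, sj in zip(row, s)) / len(row)

def normalised_sample_variance(row, s):
    """Sample variance of depth-normalised counts (assumed form of w for one gene)."""
    q = expression_strength(row, s)
    return sum((k / sj - q) ** 2 for k, sj in zip(row, s)) / (len(row) - 1)

# Toy data: the second library is sequenced exactly twice as deeply as the first.
toy = [[10, 20], [100, 200], [5, 10]]
s = size_factors(toy)
print(s, expression_strength(toy[0], s), normalised_sample_variance(toy[0], s))
```

Because the median is taken over genes, a handful of very highly expressed genes cannot dominate the depth estimate the way they dominate the total read count.
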
4. DESeq's model: testing for differential expression
   - Null hypothesis: $q_{iA} = q_{iB}$
     - $q_{iA}$ = expression strength parameter for gene $i$ in the samples of condition A
     - $m_A$ = number of samples for condition A
   - Test statistic: total counts in each condition
     - Equation (10): $K_{iA}$ = counts in condition A = $\sum_{j:\,\rho(j)=A} K_{ij}$
   - P-value for the test of the null hypothesis
     - Under the null hypothesis, we can compute $\Pr(K_{iA} = a, K_{iB} = b) = p(a,b)$
     - Equation (11): the P-value for the observed counts $(k_{iA}, k_{iB})$ is the sum of the probabilities $p(a,b)$ that are no larger than $p(k_{iA}, k_{iB})$, divided by the sum of all $p(a,b)$, both taken over pairs with $a + b = k_{iA} + k_{iB}$:
       $$p_i = \frac{\sum_{a+b=k_{iA}+k_{iB},\; p(a,b) \le p(k_{iA},k_{iB})} p(a,b)}{\sum_{a+b=k_{iA}+k_{iB}} p(a,b)}$$
   - Computing the $p(a,b)$ values
     - $p(a,b) = \Pr(K_{iA} = a) \cdot \Pr(K_{iB} = b)$, assuming the samples are independent
     - $K_{iA}$ is the sum of $m_A$ NB-distributed variables
     - We approximate its distribution by an $\mathrm{NB}(\mu, \sigma^2)$ distribution whose parameters $\mu$, $\sigma^2$ are estimated using Equations 12, 13, 14
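
The P-value of Equation (11) can be sketched directly by enumerating every split of the total count. This is a hedged illustration: the function names are invented, the means and variances of $K_{iA}$ and $K_{iB}$ are passed in as given (the slide's Equations 12 to 14 for estimating them are not reproduced here), and the small relative tolerance in the comparison is an implementation choice to guard against floating-point ties.

```python
import math

def nb_log_pmf(k, mu, var):
    """Log-probability of count k under a negative binomial with mean mu, variance var."""
    if var <= mu:
        raise ValueError("a negative binomial needs variance > mean")
    p = mu / var
    r = mu * mu / (var - mu)
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log1p(-p))

def exact_test_pvalue(k_a, k_b, mu_a, var_a, mu_b, var_b):
    """P-value of Equation (11): condition on the total k_a + k_b.

    mu_*/var_* are the null-hypothesis mean and variance of the summed counts
    in each condition; how they are estimated is outside this sketch.
    """
    total = k_a + k_b

    def joint(a):
        # p(a, b) with b = total - a, assuming independent conditions.
        return math.exp(nb_log_pmf(a, mu_a, var_a)
                        + nb_log_pmf(total - a, mu_b, var_b))

    probs = [joint(a) for a in range(total + 1)]
    p_obs = probs[k_a]
    # Sum outcomes that are at most as likely as the observed one.
    numerator = sum(p for p in probs if p <= p_obs * (1 + 1e-9))
    return numerator / sum(probs)

# Illustrative numbers only: equal means and variances in both conditions.
print(exact_test_pvalue(60, 40, 50.0, 100.0, 50.0, 100.0))
print(exact_test_pvalue(90, 10, 50.0, 100.0, 50.0, 100.0))
```

The enumeration costs one pass over $k_{iA} + k_{iB} + 1$ outcomes per gene, and for extremely large totals the probabilities may underflow, which this sketch does not handle.
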
5. Applications
   - Variance estimation
     - Use RNA-seq data from fly embryos: 'A' and 'B' samples, 2 replicates each
     - Figure 1: estimated variances $w_i(\rho)$ plotted against $q_i(\rho)$ for fly sample A
     - The distance between the orange and purple lines is the noise due to biological sampling
     - Figure 1 line labels: regression, edgeR, "shot noise" (technical variation)
6. Testing for differential expression
   - Compared the 2 replicates for fly sample A
     - Figure 2: the empirical cumulative distribution functions (ECDFs) of the P-values
     - The ECDF curve (blue line) should be below the diagonal (gray line)
     - Type I error is controlled by edgeR and DESeq, but not by a Poisson-based test
     - edgeR has an excess of small P-values for low counts, but is more conservative for high counts
     - Figure 2 labels: DESeq, edgeR, Poisson; Low, High, All
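
The ECDF check described on this slide can be sketched as follows: when two replicates of the same condition are compared, the fraction of P-values at or below any threshold should not exceed that threshold. The function names, the threshold grid, and the toy P-values are assumptions for illustration and are not results from the paper.

```python
def ecdf(pvalues, x):
    """Fraction of P-values that are <= x."""
    return sum(1 for p in pvalues if p <= x) / len(pvalues)

def type_one_error_controlled(pvalues, thresholds=(0.01, 0.05, 0.1, 0.2)):
    """True if the ECDF stays on or below the diagonal at every threshold."""
    return all(ecdf(pvalues, a) <= a for a in thresholds)

# Toy P-values, made up for this example.
well_behaved = [(i + 0.5) / 100 for i in range(100)]   # evenly spread over (0, 1)
anti_conservative = [0.001, 0.004, 0.02, 0.03, 0.2, 0.5, 0.7, 0.9]

print(type_one_error_controlled(well_behaved))
print(type_one_error_controlled(anti_conservative))
```
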
7. Testing for differential expression
   - Compared fly A and B samples
     - Figure 3: obtained fold changes and P-values (red: significant P-value)
     - The ability to detect differential expression depends on overall counts
     - The strong shot noise (technical variation) for low counts causes the testing procedure to call only very high fold changes as significant
8. Comparison with edgeR
   - Ran edgeR with 4 settings:
     - (i) "Common-dispersion" or "tagwise-dispersion" modes for estimating variance
     - (ii) Size factors estimated by DESeq, or total number of reads
   - Results were very similar for the 4 settings
   - edgeR's single-value dispersion estimate of variance is lower than DESeq's for weakly expressed genes and higher for strongly expressed genes (Figure 1)
   - As a result, edgeR is anti-conservative for weakly expressed genes, but more conservative for strongly expressed genes
9. This biases the list of discoveries by edgeR
   - Figure 4 shows that weakly expressed genes seem to be over-represented
   - Few genes with a high average level are called differentially expressed by edgeR
   - DESeq produced results that were more balanced over the dynamic range
   - Figure 4 labels: all fly data, DESeq hits, edgeR hits
10. Working without replicates
    - DESeq can work if there are no replicates in one or both conditions
      - If there are replicates from just one condition, fit the regression line using that one
      - If there are no replicates, treat the samples as replicates to fit the regression
    - For neural cell data, variability between replicates ≈ variability between conditions
    - However, for fly data, variability between replicates << variability between conditions
11. Variance-stabilising transformation (VST)
    - Given a variance-mean regression, a VST transforms the values so that the variance is independent of the mean (Equation 15)
    - This yields (transformed) count values whose variances are approximately the same throughout the dynamic range
    - This is useful for sample clustering, since clustering assumes all genes have roughly the same variance
    - Figure 5 shows clustering for neural cell samples, using VST-transformed data
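
A variance-stabilising transformation can be sketched numerically with the standard construction $\tau(\kappa) = \int^{\kappa} dq / \sqrt{w(q)}$, where $w(q)$ is the variance as a function of the mean. The sketch below assumes this form, uses a midpoint rule starting from an arbitrary lower limit of 1, and plugs in the parametric variance $w(q) = q + \alpha q^2$ rather than a fitted local regression; all names and numbers are illustrative.

```python
import math

def vst(kappa, variance_fn, lower=1.0, steps=20000):
    """Integrate 1 / sqrt(w(q)) from `lower` to `kappa` with the midpoint rule.

    variance_fn: assumed mean-variance relation w(q), e.g. a fitted regression.
    lower: arbitrary starting point; it only shifts the result by a constant.
    """
    if kappa < lower:
        raise ValueError("kappa must be >= lower")
    h = (kappa - lower) / steps
    total = 0.0
    for i in range(steps):
        q = lower + (i + 0.5) * h
        total += h / math.sqrt(variance_fn(q))
    return total

alpha = 0.1  # illustrative dispersion, not estimated from data

def w(q):
    return q + alpha * q ** 2

def closed_form(q):
    # Antiderivative of 1 / sqrt(q + alpha q^2), used here only as a cross-check.
    return 2.0 / math.sqrt(alpha) * math.asinh(math.sqrt(alpha * q))

print(vst(1000.0, w), closed_form(1000.0) - closed_form(1.0))
```

On the transformed scale, a difference of one unit corresponds to roughly one standard deviation at any expression level, which is what makes distance-based clustering reasonable.
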
12. ChIP-Seq data
    - Compared HapMap IDs GM12878 and GM12891
    - DESeq does not give false positives when comparing replicates for 1 individual
    - Using a Poisson-based model, you would get many false positives
    - Figure labels: DESeq, Poisson; same individual, different individuals
13. Summary
    - A Poisson model underestimates the variance between biological samples; this leads to false positives in differential expression analyses
    - A negative binomial distribution is much better
    - This is especially true for highly expressed genes
    - DESeq and edgeR both use the negative binomial
    - However, DESeq estimates the sequencing depth differently
    - DESeq also estimates the variance for a gene by assuming it has a similar variance to genes of similar expression level
    - DESeq and edgeR have similar sensitivity, but edgeR calls a greater number of weakly expressed genes as significant, and fewer highly expressed genes as significant