RNA-seq for DE analysis: detecting differential expression - part 5
Upcoming SlideShare
Loading in...5
×
 

RNA-seq for DE analysis: detecting differential expression - part 5

on

  • 855 views

Part 5 of the training sesson 'RNA-seq for differential expression analysis' considers the algorithm used for detecting differential expression between conditions. See http://www.bits.vib.be

Part 5 of the training sesson 'RNA-seq for differential expression analysis' considers the algorithm used for detecting differential expression between conditions. See http://www.bits.vib.be

Statistics

Views

Total Views
855
Views on SlideShare
855
Embed Views
0

Actions

Likes
2
Downloads
22
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

CC Attribution-ShareAlike LicenseCC Attribution-ShareAlike License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

RNA-seq for DE analysis: detecting differential expression - part 5 Presentation Transcript

  • 1. Detecting differentially expressed genes RNA-seq for DE analysis training Joachim Jacob 20 and 27 January 2014 This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
  • 2. Goal Based on a count table, we want to detect differentially expressed genes between conditions of interest. We will assign to each gene a p-value (0-1), which shows us 'how surprised we should be' to see this difference, when we assume there is no difference. 0 p-value 1 Very big chance there is a difference Very small chance there is a real difference
  • 3. Goal Every single decision we have taken in previous analysis steps was done to improve this outcome of detecting DE expressed genes.
  • 4. Goal
  • 5. Algorithms under active development http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Detecting_differential_expression_by_count_analysis
  • 6. Algorithms under active development http://genomebiology.com/2010/11/10/r106
  • 7. Intuition gene_id CAF0006876 Condition A sample1 23171 sample2 22903 sample3 29227 sample4 24072 sample5 23151 sample6 26336 sample7 25252 sample8 24122 Condition B Sample9 19527 sample10 26898 sample11 18880 sample12 24237 sample13 26640 sample14 22315 sample15 20952 sample16 25629 Variability X Variability Y Compare and conclude given a Mean level: similar or not?
  • 8. Intuition gene_id CAF0006876 Condition A sample1 23171 sample2 22903 sample3 29227 sample4 24072 sample5 23151 sample6 26336 sample7 25252 sample8 24122 Condition B Sample9 19527 sample10 26898 sample11 18880 sample12 24237 sample13 26640 sample14 22315 sample15 20952 sample16 25629
  • 9. Intuition – model is fitted gene_id CAF0006876 Condition A sample1 23171 sample2 22903 sample3 29227 sample4 24072 sample5 23151 sample6 26336 sample7 25252 sample8 24122 Condition B Sample9 19527 sample10 26898 sample11 18880 sample12 24237 sample13 26640 sample14 22315 sample15 20952 sample16 25629 NB model is estimated: 2 parameters: mean and dispersion needed.
  • 10. Intuition – difference is quantified gene_id CAF0006876 Condition A sample1 23171 sample2 22903 sample3 29227 sample4 24072 sample5 23151 sample6 26336 sample7 25252 sample8 24122 Condition B Sample9 19527 sample10 26898 sample11 18880 sample12 24237 sample13 26640 sample14 22315 sample15 20952 sample16 25629 NB model is estimated: 2 parameters: mean and dispersion needed. Difference is put into p-value
  • 11. BUT counts are dependent on The read counts of a gene between different conditions, is dependent on (see first part): 1. Chance (NB model) 2. Expression level 3. Library size (number of reads in that library) 4. Length of transcript 5. GC content of the genes
  • 12. Normalize for library size Assumption: most genes are not DE between samples. DESeq calculates for every sample the 'effective library size' by a scale factor. 100% 100% Rest of the genes Rest of the genes http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426807/
  • 13. Normalize for library size Original library size * scale factor = effective library size DESeq will multiply original counts by the sample scaling factor. DESeq: This normalization method [14] is included in the DESeq Bioconductor package (version 1.6.0) [14] and is based on the hypothesis that most genes are not DE. A DESeq scaling factor for a given lane is computed as the median of the ratio, for each gene, of its read count over its geometric mean across all lanes. The underlying idea is that non-DE genes should have similar read counts across samples, leading to a ratio of 1. Assuming most genes are not DE, the median of this ratio for the lane provides an estimate of the correction factor that should be applied to all read counts of this lane to fulfill the hypothesis DESeq computes a scaling factor for a given sample by computing the median of the ratio, for each gene, of its read count over its geometric mean across all samples. It then uses the assumption that most genes are not DE and uses this median of ratios to obtain the scaling factor associated with this sample. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426807/
  • 14. Normalize for library size Geom Mean 24595 … … ... … … 1,1 0,9 1,1 1,0 1,1 0,8 … 1,2 1,2 1,0 1,2 1,2 1,0 0,85 … 0,8 0,9 1,0 1,2 0,8 0,9 1,0 0,8 0,85 1,2 0,8 1,01,1 1,2 … … 1,1 0,9 1,2 0,85 1,2 1,0 0,85 0,9 … 0,85 0,8 1,0 0,8 1,2 1,0 1,0 … 1,1 0,85 1,1 0,85 1,1 0,85 1,2 … 1,1 1,0 1,2 1,0 0,8 1,0 1,2 … 1,0 Divide each count by geomean Take median per column
  • 15. Other normalisations ● EdgeR: TMM, trimmed mean of M-values
  • 16. Dispersion estimation ● ● For every gene, an NB is fitted based on the counts. The most important factor in that model to be estimated is the dispersion. DESeq2 applies three steps ● Estimates dispersion parameter for each gene ● Plots and fits a curve ● Adjusts the dispersion parameter towards the curve ('shrinking')
  • 17. Dispersion estimation 1. Black dots: estimated from normalized data. 2. Red line: curve fitted 3. blue dots: final assigned dispersion parameter for that gene Model is fit!
  • 18. Test is run between conditions If 2 conditions are compared, for each gene 2 NB models (one for each condition) are made, and a test (Wald test) decides whether the difference is significant (red in plot). MA-plot: mean of counts versus the log2 fold change between 2 conditions. Significant (p-value < 0,01) Not significant
  • 19. Test is run between conditions If 2 conditions are compared, for each gene 2 NB models (one for each condition) are made, and a test (Wald test) decides whether the difference is significant (red in plot). This means that we are going to perform 1000's of tests. If we set a cut-off on the p-value of 0,01 and we have performed 20000 tests (= genes), 1000 genes will appear significant by chance.
  • 20. Check the distribution of p-values An enrichment (smaller or Bigger) should be seen at low P-values. If the histogram of the p-values does not match a profile as shown below, the test is not reliable. Perhaps the NB fitting step did not succeed, or confounding variables are present. Other p-values should not show a trend.
  • 21. Confounded distribution of p-values
  • 22. Improve test results A fraction is correctly identified as DE A fraction is false positive You set a cut-off of 0,05.
  • 23. Improve test results We can improve testing by 2 measures: avoid testing: apply a filtering before testing, an independent filtering. ● ● apply a multiple testing correction
  • 24. Independent filtering Some scientists just remove genes with mean counts in the samples <10. But there is a more formal method to remove genes, in order to reduce the testing. http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/ From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf
  • 25. Independent filtering Left a scatter plot of mean counts versus transformed p-values. The red line depicts a cut-off of 0,1. Note that genes with lower counts do not reach the p-value threshold. Some of them are save to exclude from testing. http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/ From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf
  • 26. Independent filtering If we filter out increasingly bigger portions of genes based on their mean counts, the number of significant genes increase.
  • 27. Independent filtering See later (slide 30) Choose the variable of interest. You can run it once on all to check the outcome. http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/ From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf
  • 28. Independent filtering
  • 29. Multiple testing correction Automatically in results: Benjamini/Hochberg correction, to control false discovery rate (FDR). FDR is the fraction of false positives in the genes that are classified as DE. If we set a threshold α of 0,05, 20% of the DE genes will be false positives.
  • 30. Including different factors Always remember your experimental setup and the goal. Summarize the setup in the sample descriptions file. GDA (=G) GDA + vit C (=AG) Yeast (=WT) Yeast mutant (=UPC) Day 1 Day 2 Day 1 Day 2 Additional metadata (batch factor)
  • 31. Including different factors We provide a combination of factors (the model, GLM) which influence the counts. Every factor should match the column name in the sample descriptions The levels of the factors corresponding To the 'base' or 'no perturbation'. The fraction filtered out, determined by the independent filter tool. Adjusted p-value cut-off
  • 32. Including different factors The 'detect differential expression' tool gives you four results: the first is the report including graphs. Only lower than cut-off and with indep filtering. All genes, with indep filtering applied. Complete DESeq results, without indep filtering applied.
  • 33. Standard Error (SE) of LogFC Log2(FC) Standard Error (SE) of LogFC Including different factors Log2(FC)
  • 34. Including different factors Volcano plot: shows the DE genes with our given cut-off.
  • 35. Comparing different conditions GDA (=G) GDA + vit C (=AG) Yeast (=WT) Yeast mutant (=UPC) Day 1 Day 2 Day 1 Day 2 Which genes are DE between UPC and WT? Which genes are DE between G and AG? Which genes are DE in WT between G and AG?
  • 36. Comparing different conditions Adjust the sample descriptions file and the model: 1. Which genes are DE between UPC and WT? 2. Which genes are DE between G and AG? 3. Which genes are DE in WT between G and AG? 1. 2. 3. Remove these Remove these
  • 37. Congratulations! We have reached our goal!
  • 38. Overview http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
  • 39. Reads
  • 40. Keywords Effective library size dispersion shrinking Significantly differentially expressed MA-plot Alpha cut-off Independent filtering FDR p-value Write in your own words what the terms mean
  • 41. Break