RNA-seq: general concept, goal and experimental design - part 1

First part of the presentation slides of 'RNA-seq for DE analysis'. See http://www.bits.vib.be for more information.

Transcript

  • 1. Defining the goal of RNA-seq analysis for differential expression Joachim Jacob 20 and 27 January 2014 This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
  • 2. Great power comes with great responsibility RNA-seq enables one to 1) get an idea of which genes are active 2) quantify expression of each transcript 3) quantify alternative splicing … (use your imagination) Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial. http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12109/abstract
  • 3. Great power comes with great responsibility You can't do it all. RNA-seq is powerful, but we have to aim for a specific goal. Our goal is to detect differential expression at the gene level.
  • 4. Differential expression: useful? What are we looking for? Explanations of observed phenotypes GDA yeast Yeast mutant GDA + vit C why?
  • 5. The central dogma causes the phenotypic differences GDA yeast Yeast mutant GDA + vit C ?
  • 6. The central dogma Difference in protein activity causes the phenotypic differences GDA yeast Yeast mutant GDA + vit C ?
  • 7. The central dogma Presence/concentration of proteins in a cell causes the phenotypic differences GDA yeast Yeast mutant GDA + vit C ?
  • 8. The central dogma Level of protein production causes the phenotypic differences GDA yeast Yeast mutant GDA + vit C ?
  • 9. The central dogma Level of templates for protein production causes the phenotypic differences GDA yeast Yeast mutant GDA + vit C ?
  • 10. The central dogma Level of mRNA copies causes the phenotypic differences GDA yeast Yeast mutant GDA + vit C ?
  • 11. Does it hold? Level of mRNA copies Level of templates for protein production Level of protein production Presence/concentration of proteins in a cell Difference in protein activity Phenotype
  • 12. Problem reduction We can measure mRNA levels (much easier than protein levels). So we measure mRNA. The level of mRNA is a proxy of the level of protein activity causing the aberrant phenotype.
  • 13. How to measure mRNA 1. Q-PCR (real-time) A lot of work to measure few genes, in a relatively wide array of tissues. Very accurate. 2. Microarray Easier way to measure many predefined genes in a relatively wide array of tissues. Robust. 3. RNA-seq
  • 14. RNA-seq protocol in a nutshell ● Get your sample (here: a yeast sample) ● Lyse the cells and extract RNA ● Convert the RNA to cDNA ● The cDNA pool gets sequenced The result is sequence information from scratch. No prior information is needed. Comprehensive comparative analysis of strand-specific RNA sequencing methods http://www.nature.com/nmeth/journal/v7/n9/full/nmeth.1491.html Comparative analysis of RNA sequencing methods for degraded or low-input samples http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html
  • 15. The predecessors of RNA-seq ● ESTs: expressed sequence tags, ideal for discovery of new genes. ● SAGE: serial analysis of gene expression, measurement of the number of copies of mRNA. http://www.montana.edu/observatory/people/mcdermottlab.html
  • 16. The predecessors of RNA-seq ● ESTs: expressed sequence tags, ideal for discovery of new genes. ● SAGE: serial analysis of gene expression, measurement of the number of copies of mRNA. http://www.sagenet.org/findings/index.html
  • 17. The predecessors of RNA-seq ● ESTs: expressed sequence tags ● SAGE: serial analysis of gene expression Low throughput: long sequence information, but for only ~thousands of genes.
  • 18. Concept of measuring with RNA-seq One template of protein production GeneA GeneB GeneC Extract mRNA and turn into cDNA Fragment, ligate adaptor, amplify. Put a fraction of the pool on sequencer to read fragments. Figure: All things must pass: contrasts and commonalities in eukaryotic and bacterial mRNA decay, Nature Reviews Molecular Cell Biology 11, 467–478
  • 19. RNA-seq protocol in a nutshell Yeast sample
  • 20. So many steps where our assumption can fail The chain of assumptions, from phenotype down to reads: proteins define the phenotype; mRNA levels are a proxy for protein activity; the cDNA pool represents the RNA pool we've extracted; the RNA-seq reads represent the cDNA pool we've created.
  • 21. So many steps where our assumption can fail Between phenotype and proteins: protein activity is regulated (phosphorylation, ubiquitination, ...). Between proteins and mRNA levels: mRNA templates have different speeds of protein production (availability of tRNAs, rate of mRNA degradation, alternative splicing events, ...). Between mRNA levels and cDNA pool: loss on RNA extraction, 90% of RNA in the cell is rRNA, ligation of adapters, conversion to cDNA is not 100%. Between cDNA pool and RNA-seq reads: failure to map reads to the correct gene, lane-specific biases on reading cDNA fragments, ...
  • 22. Consequence: focus on comparison Phenotype A Proteins Phenotype B Possibly due to differences in expression Proteins mRNA levels mRNA levels cDNA pool cDNA pool RNA-seq reads RNA-seq reads
  • 23. Consequence: focus on comparison Phenotype A Phenotype B Proteins Proteins mRNA levels mRNA levels cDNA pool cDNA pool RNA-seq reads RNA-seq reads DESIGN OF EXPERIMENT
  • 24. Comparing number of reads to genes (sample, RNA-seq, counts for GeneA, GeneB, GeneC) Obviously, the number of reads is dependent on: 1. the expression level of the gene (OUR QUESTION) 2. the total number of reads generated 3. the length of the transcript Normalisation is needed!
  • 25. Experimental design Our focus: which genes are differentially expressed between different conditions? Obviously, the number of reads is dependent on: 1. the expression level of the gene 2. the total number of reads generated 3. the length of the transcript How many reads to sequence? Which normalisation is needed?
  • 26. Experimental design Our focus: which genes are differentially expressed between different conditions? “How can we detect genes for which the counts of reads change between conditions more systematically than as expected by chance” We must design an experiment in which we can test this deviance from chance. Oshlack et al. 2010. From RNA-seq reads to differential expression results. Genome Biology 2010, 11:220 http://genomebiology.com/2010/11/12/220
  • 27. How many reads to sequence? In other words: how deep to sequence? What is the required 'depth of sequencing'? Two samples are sequenced, each giving counts for GeneA, GeneB and GeneC. The final test will look at ratios: GeneA 6 vs 5 (ratio 1.2), GeneB 5 vs 6 (ratio 0.83), GeneC 3 vs 4 (ratio 0.75).
  • 28. How many reads to sequence? The difference between the lowest gene count and the highest gene count is typically 10^5. This is called the dynamic range. Linear scale is useless. The logarithmic scale is better. Wait! Something's not correct here!
  • 29. Zero remains zero! We are working with counts. A count is >=1. A gene with zero counts is either not yet sequenced (not deep enough) or not expressed in that condition. It is not a full logarithmic scale: it starts at zero.
  • 30. So keep all counts above zero? Assuming equal sequencing depth in the samples, and these counts. Do all these genes differ in expression?
        Gene    sample  sample  RATIO
        GeneA   5       10      2
        GeneB   15      30      2
        GeneC   40      80      2
        GeneD   100     200     2
        GeneE   1000    2000    2
        GeneZ   1       2       2
  • 31. So keep everything above zero? Sequencing the result of the same steps again is called a technical replicate. Is there a trend in how these numbers change?
        Gene    sample  sample  RATIO
        GeneA   11      10      0.91
        GeneB   11      30      2.72
        GeneC   60      80      1.33
        GeneD   79      200     2.53
        GeneE   1150    2000    1.74
        GeneZ   5       1       0.20
    2?
  • 32. Technical replicates We take the same cDNA pool and sequence it several times: technical replicates.
        Gene    sample  sample  sample  sample
        GeneA   11      5       4       4
        GeneB   11      16      14      8
        GeneC   60      45      32      38
        GeneD   79      102     95      110
        GeneE   1150    1023    987     1005
        GeneZ   3       0       0       1
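As a small illustration of slide 32, the sketch below computes the mean and the sample variance per gene over the four technical replicates. The counts are those on the slide; the names `replicate_counts` and `summarize` are only illustrative, and with four replicates the variance estimate is of course very rough.

```python
from statistics import mean, variance

# Counts from slide 32: four technical replicates of the same cDNA pool.
replicate_counts = {
    "GeneA": [11, 5, 4, 4],
    "GeneB": [11, 16, 14, 8],
    "GeneC": [60, 45, 32, 38],
    "GeneD": [79, 102, 95, 110],
    "GeneE": [1150, 1023, 987, 1005],
    "GeneZ": [3, 0, 0, 1],
}


def summarize(counts):
    """Return (mean, sample variance) of one gene's replicate counts."""
    return mean(counts), variance(counts)


summary = {gene: summarize(counts) for gene, counts in replicate_counts.items()}

for gene, (m, v) in summary.items():
    print(f"{gene}: mean={m:.1f} variance={v:.1f}")
```

Comparing the two columns of the printout per gene gives a first impression of how the spread relates to the mean, which is the topic of the next slides.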
  • 33. The Poisson distribution The counts of technical replicates follow a Poisson distribution (Marioni et al. 2008). "The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare." (From Wikipedia.) The curves can be 3 different genes, each with their own Poisson distribution. Lambda is the mean of the gene's distribution, i.e. a certain number of reads. Y-axis: chance to pick that number of reads.
  • 34. The Poisson distribution So when we have 4 technical replicates sequenced to a great depth (say 10 M reads), we can get, by chance, these numbers for 3 different genes: GeneA 0, 0, 1, 3; GeneB 2, 3, 4, 7; GeneC 8, 9, 11, 14.
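To see how such numbers can arise by chance, here is a minimal simulation sketch. It assumes hypothetical expected counts (lambda) of 1, 4 and 10 for the three genes, which are not given on the slide, and draws four technical replicates per gene with Knuth's simple Poisson sampler (fine for small lambda).

```python
import math
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


# Hypothetical expected counts per gene at the chosen sequencing depth.
lambdas = {"GeneA": 1.0, "GeneB": 4.0, "GeneC": 10.0}

replicates = {
    gene: [poisson_draw(lam, rng) for _ in range(4)]
    for gene, lam in lambdas.items()
}

for gene, counts in replicates.items():
    print(gene, sorted(counts))
```

Changing the seed gives a different set of four counts per gene, which is exactly the chance variation between technical replicates.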
  • 35. Working the intuition How many blue balls? How many red balls? Draw 10 Draw 10 more Draw 10 more Estimate how large the fraction is in the set?
  • 36. The intuition with the balls Table to fill in: rows are the colors (Blue, Red, No color); columns are the cumulative draws (10 draws, 20 draws, 30 draws, 40 draws).
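The ball experiment can be mimicked in a few lines. The sketch below assumes a hypothetical pool of 1000 balls (5% blue, 30% red, the rest without color); these fractions are made up for illustration. It estimates each fraction after 10, 20, 30 and 40 draws without replacement.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical pool: 50 blue, 300 red and 650 colorless balls.
pool = ["blue"] * 50 + ["red"] * 300 + ["none"] * 650
rng.shuffle(pool)


def estimate_fractions(drawn):
    """Estimate the fraction of each color from the balls drawn so far."""
    n = len(drawn)
    return {color: drawn.count(color) / n for color in ("blue", "red", "none")}


# Taking a longer prefix of the shuffled pool equals "draw 10 more".
estimates = {n: estimate_fractions(pool[:n]) for n in (10, 20, 30, 40)}

for n, est in estimates.items():
    print(n, {color: round(frac, 2) for color, frac in est.items()})
```

The estimate for the abundant color typically settles sooner than the one for the rare color, which is the point of the exercise.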
  • 37. Conclusion of the experiment The bigger the fraction in the pool, the quicker (i.e. with less sequencing depth) we are certain about the estimate of that fraction. estimate = count; variance = count. For lower counts, the variance is relatively bigger than for higher counts. CV (coefficient of variation) = sqrt(count)/count. Genes with lower expression need much deeper sequencing than genes with higher expression levels.
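The CV formula on slide 37 is easy to tabulate. The sketch below simply evaluates sqrt(count)/count, which equals 1/sqrt(count), for a few counts; the function name `poisson_cv` is illustrative.

```python
import math


def poisson_cv(count):
    """Coefficient of variation of a Poisson count: sqrt(count) / count."""
    return math.sqrt(count) / count


for count in (1, 4, 10, 100, 1000, 10000):
    print(f"count={count:>5}  CV={poisson_cv(count):.3f}")
```

A count of 100 has a relative uncertainty of 10%, while a count of 4 still has 50%, which is why lowly expressed genes need deeper sequencing.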
  • 38. Comparing counts “Here we show the overlap of Poisson distributions of single measurements at different read counts. Because relative Poisson uncertainty is high at low read counts, a count of 1 versus 2 has very little power to discriminate a true 2X fold change, though at higher counts a 2X fold change becomes significant. In an actual experiment, the width of the distribution would be greater due to additional biological and technical uncertainty, but the uncertainty to the mean expression would narrow with each additional replicate.” Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics (2013) doi: 10.1093/bioinformatics/btt015
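The overlap described in the quote can be made concrete with a small calculation. As a rough stand-in for the figure (not the method used by Scotty), the sketch below computes the probability that a single Poisson measurement with mean 2λ comes out higher than a single measurement with mean λ. The function names and the truncation rule for the sums are assumptions of this sketch.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k counts for a Poisson mean lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def prob_second_exceeds_first(lam, fold=2.0):
    """P(Y > X) for X ~ Poisson(lam) and Y ~ Poisson(fold * lam)."""
    high = fold * lam
    # Truncate the sums far in the upper tail of the larger distribution.
    upper = int(high + 12 * math.sqrt(high) + 30)
    total = 0.0
    cdf_y = 0.0
    for k in range(upper + 1):
        cdf_y += poisson_pmf(k, high)
        total += poisson_pmf(k, lam) * (1.0 - cdf_y)
    return total


for lam in (1, 5, 20, 100):
    p = prob_second_exceeds_first(lam)
    print(f"{lam} vs {2 * lam}: P(higher count observed) = {p:.3f}")
```

For 1 versus 2 the two single measurements are ordered correctly only somewhat more often than a coin flip, while for 100 versus 200 they are almost always ordered correctly.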
  • 39. (Log2 of the counts) Comparing technical replicates Correlation between mean and variance according to Poisson Lowess fit through the data (Log2 of the counts) Risso et al. “GC-Content Normalization for RNA-Seq Data” BMC Bioinformatics 2011, 12:480 http://www.biomedcentral.com/1471-2105/12/480 - EDASeq package (R)
  • 40. But Poisson does not seem to fit Extending the samples to real biological samples, this mean-variance relationship does not hold... Plotted using the EDASeq package in R.
  • 41. But Poisson does not seem to fit Extending the samples to real biological samples, this mean-variance relationship does not hold! Something is going on! Reasonable fit. Plotted using the EDASeq package in R.
  • 42. An extra source of variation The counts are 'overdispersed' relative to the Poisson distribution: between biological replicates, the variance is bigger than expected for higher counts. Something is going on! Plotted using the EDASeq package in R.
  • 43. An extra source of variation For Poisson: CV = std dev / mean => CV² = 1/μ. If an additional distribution is involved (also dependent on π, the fraction of the gene in the cDNA pool), we have a mixture of distributions: CV² = 1/μ + φ, where the term 1/μ matters at low counts and φ is the dispersion. A generalization of Poisson with this extra parameter, the Negative Binomial model, fits better!
  • 44. The negative binomial model The NB model fits observed expression data of RNA-seq better. It is a generalization of Poisson, and 2 parameters need to be estimated (μ and φ). Counts (gene g in sample j) have Mean = μ_gj and Variance = μ_gj + φ_g μ_gj². Biological CV² = φ_g => Biological CV = √φ_g. Methods differ in how they estimate this dispersion per gene. It can only be measured with true biological replicates.
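The mean-variance relation of slide 44 can be checked with a simulation. The sketch below assumes hypothetical values μ = 50 and φ = 0.1 and draws NB counts as a gamma-Poisson mixture: a gamma-distributed rate with mean μ and CV² = φ, followed by a Poisson draw. This is one standard way to obtain NB counts, used here for illustration only.

```python
import math
import random
from statistics import mean, variance

rng = random.Random(7)  # fixed seed so the sketch is repeatable


def nb_variance(mu, phi):
    """Variance of the NB model on the slide: mu + phi * mu^2."""
    return mu + phi * mu ** 2


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def nb_draw(mu, phi, rng):
    """Draw one NB count: gamma rate (shape 1/phi, scale phi*mu), then Poisson."""
    rate = rng.gammavariate(1.0 / phi, phi * mu)
    return poisson_draw(rate, rng)


mu, phi = 50.0, 0.1  # hypothetical mean and dispersion
draws = [nb_draw(mu, phi, rng) for _ in range(5000)]

print("expected variance:", nb_variance(mu, phi))
print("simulated mean:", round(mean(draws), 1))
print("simulated variance:", round(variance(draws), 1))
```

The simulated variance lands far above the mean, unlike a pure Poisson count where both would be about equal.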
  • 45. Variation summary, intuitively Total CV² = Technical CV² + Biological CV² For low counts, the Poisson (technical) variation or the measurement error is dominant. For higher counts, the Poisson variation gets smaller, and another source of variation becomes dominant, the dispersion or the biological variation. Biological variation does not get smaller with higher counts.
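A short sketch of this decomposition, using technical CV² = 1/μ and biological CV² = φ from the previous slides. The dispersion φ = 0.04 (a biological CV of 20%) is a hypothetical value; the two terms are equal at a mean count of 1/φ.

```python
def cv2_components(mu, phi):
    """Return (technical CV^2, biological CV^2, total CV^2) for mean mu."""
    technical = 1.0 / mu
    biological = phi
    return technical, biological, technical + biological


def crossover_count(phi):
    """Mean count at which technical and biological CV^2 are equal."""
    return 1.0 / phi


phi = 0.04  # hypothetical dispersion, i.e. a biological CV of 20%
for mu in (1, 10, 25, 100, 1000):
    technical, biological, total = cv2_components(mu, phi)
    dominant = "technical" if technical > biological else "biological"
    print(f"mean={mu:>4}  total CV^2={total:.3f}  dominant: {dominant}")

print("crossover at mean count", crossover_count(phi))
```

Below the crossover, sequencing deeper helps; above it, only more biological replicates reduce the uncertainty of the mean.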
  • 46. Beyond the NB model It appears from analysis of many biological replicates (#=69) that not every gene can be modeled as NB: the Poisson-Tweedie model provides a further generalisation and a better fit for many genes (with an additional shape parameter). Left figure: raw data shows that about 26% of the genes fit a NB model. Depending on the estimated shape parameter, other distributions fit better. Esnaola et al. BMC Bioinformatics 2013, 14:254 http://www.biomedcentral.com/1471-2105/14/254
  • 47. Consequence for our design ● For low counts: the uncertainty is big due to Poisson. ● For high counts: the uncertainty is big due to biological variation (highly expressed genes differ in their natural variation, regulated by cellular processes, more than lowly expressed genes). ● If we focus on the ratios between the conditions: is it reasonable to set a restriction on fold change? Highly expressed genes can have a smaller fold change and be significant. Lowly expressed genes can exceed a fold change of 2.
  • 48. Consequence on fold change The fold-change cut-off readily applied in microarray analysis is of no use in RNA-seq. Volcano plot. Blue and red: known DE genes. These often-applied cut-offs can prevent detecting DE genes.
  • 49. Long story to say... We need to estimate the model behind the count. Never work without biological replicates. Never work with 2 biological replicates. Try avoiding working with 3 biological replicates. Go for at least 4 biological replicates.
  • 50. Break?
  • 51. Overview Six samples (Sample 1 to Sample 6), each sequenced by RNA-seq into counts for GeneA, GeneB and GeneC, divided over Condition X and Condition Y.
  • 52. Summary Obviously, the number of reads is dependent on: 1. chance → Define the count model (NB) from replicates 2. the expression level of the gene → Compare the ratios with a test 3. the total number of reads generated 4. the length of the transcript
  • 53. The total number of reads generated sample RNA-seq GeneA GeneB GeneC sample More RNA-seq GeneA GeneB GeneC The number of reads is dependent on the total number of reads generated. If one library is sequenced to 20M reads, and another one to 40M, most genes will ~double their counts.
  • 54. Normalization for library size Naive approach: divide by total library size. Is not applied anymore! Why not? Composition matters! 2 things to remember: - zero sum system (or “we cannot count what we can't sequence”) - 5 orders of magnitude
  • 55. Normalization for library size 2 things to remember: - zero sum system - 5 orders of magnitude In every sample, a lot of reads are spent on a few extremely highly expressed genes. Which genes? That differs between libraries, and it negatively affects the naïve size normalization if we include those genes.
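A toy example of why composition matters. The counts below are invented: three genes are identical in both libraries, and one extremely highly expressed gene (`GeneH`) differs. Dividing by the total library size then makes the unchanged genes look five times lower in library B.

```python
# Hypothetical counts: Gene1-3 are unchanged, GeneH takes up many more reads in B.
library_a = {"Gene1": 100, "Gene2": 200, "Gene3": 300, "GeneH": 400}
library_b = {"Gene1": 100, "Gene2": 200, "Gene3": 300, "GeneH": 4400}


def scale_by_total(counts):
    """Naive normalization: divide every count by the total library size."""
    total = sum(counts.values())
    return {gene: count / total for gene, count in counts.items()}


scaled_a = scale_by_total(library_a)
scaled_b = scale_by_total(library_b)

# Apparent fold change B/A after naive normalization.
ratios = {gene: scaled_b[gene] / scaled_a[gene] for gene in library_a}

for gene, ratio in ratios.items():
    print(f"{gene}: apparent fold change B/A = {ratio:.2f}")
```

The zero sum system shows up directly: reads taken by `GeneH` are reads the other genes cannot get a share of after scaling.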
  • 56. Normalization for library size Schematically: when normalized on library size (squares represent the number of reads). A few genes with enormous counts: there is NO SATURATION of these counts. Rest of the genes. All counts for library A. Rest of the genes. All counts for library B.
  • 57. Normalization for library size Better normalization would be as shown below (figure: the 'rest of the genes' scaled to 100% in both libraries). DESeq2 and edgeR apply such an approach (see later).
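One way to get such a normalization is the median-of-ratios idea, sketched below under simplifying assumptions. It follows the general principle of DESeq-style size factors (per sample, the median ratio of each gene's count to that gene's geometric mean over samples), but it is a bare illustration and not the actual DESeq2 or edgeR code. The counts are invented: library B is sequenced twice as deep, and `GeneH` additionally changes a lot.

```python
import math
from statistics import median

# Hypothetical counts per gene for [library A, library B].
counts = {
    "Gene1": [100, 200],
    "Gene2": [200, 400],
    "Gene3": [300, 600],
    "GeneH": [400, 8800],
}


def size_factors(counts):
    """Median-of-ratios size factor per sample (genes with a zero are skipped)."""
    n_samples = len(next(iter(counts.values())))
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(counts[g][j]) - lgm for g, lgm in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


factors = size_factors(counts)
normalized = {
    gene: [c / f for c, f in zip(row, factors)] for gene, row in counts.items()
}

print("size factors:", [round(f, 3) for f in factors])
for gene, row in normalized.items():
    print(gene, [round(x, 1) for x in row])
```

Because the median ignores the single extreme gene, the size factors differ by the factor 2 in depth only, and the unchanged genes end up with equal normalized counts.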
  • 58. Gene length influences the count “Longer transcripts generate more reads”: true! But the transcript length does not differ between samples. Since we are concerned with relative differences between samples, this needs no normalization (this story changes in case of absolute quantification). (Figure: Gene A and Gene B in Sample A and Sample B.)
  • 59. Between sample variation Properties of libraries/samples can affect the counts, and lead to variation. This is called between-lane variation. Obvious ones: library size (how many reads are sampled), library composition. Different libraries/samples can exhibit increased variation by differing in how gene properties relate to gene counts. This is called within-lane variation.
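A toy model makes the cancellation explicit. It assumes, purely for illustration, that the expected read count is proportional to expression × transcript length × sequencing depth; the gene lengths and expression values are made up.

```python
def expected_reads(expression, length_kb, depth_millions):
    """Toy model: reads proportional to expression, length and depth."""
    return expression * length_kb * depth_millions


# Hypothetical transcript lengths in kb; both genes double in expression.
lengths = {"GeneA": 1.0, "GeneB": 5.0}
expression_sample_a = 10.0
expression_sample_b = 20.0
depth = 30.0  # same depth in both samples

fold_changes = {}
for gene, length_kb in lengths.items():
    reads_a = expected_reads(expression_sample_a, length_kb, depth)
    reads_b = expected_reads(expression_sample_b, length_kb, depth)
    fold_changes[gene] = reads_b / reads_a
    print(f"{gene}: reads A={reads_a:.0f}, reads B={reads_b:.0f}, "
          f"fold change={fold_changes[gene]:.1f}")
```

The longer gene gets five times more reads in each sample, yet both genes show the same fold change, because the length appears in numerator and denominator.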
  • 60. GC-content of genes can influence counts GC-content differs between genes. But it does not change between samples, so there should be no problem for relative expression comparison. We can visualize the relationship between counts and GC very easily (see right). There is some trend, and it is equal for all samples. EDAseq (R)
  • 61. GC-content of genes can influence counts Sometimes, samples show different relationships between GC-content of the genes and the counts. This within-lane (or intra-sample) variation needs to be corrected for, so that in one sample not all differentially expressed genes are also the GC-rich ones. Length can also have this effect.
  • 62. What we need to know for our set-up We want to detect differentially expressed genes between 2 or more conditions. For this, we need to apply the conditions in a controlled environment (randomisation,...). For good testing, we need to have some biological replicates per condition. For cost effectiveness, we determine how deep we will sequence from each sample. We analyse the reads, get raw counts and do the test!
  • 63. Library preparation and lane loading HiSeq2000: 24 single-index barcodes available. 1 lane gives 150-180 M reads. One lane of 50 bp SE costs approx. €1,500.
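The figures on slide 63 translate into reads and cost per sample with simple arithmetic. The sketch below assumes the lane is divided evenly over the multiplexed samples, which is an idealization, and uses the slide's numbers (24 indexes, 150 to 180 M reads, approx. 1500 euro per lane).

```python
MAX_INDEXES = 24                   # single-index barcodes available
LANE_READS_RANGE = (150e6, 180e6)  # reads per lane, low and high estimate
LANE_COST_EUR = 1500.0             # approximate cost of one 50 bp SE lane


def per_sample(n_samples):
    """Reads and cost per sample when n_samples share one lane evenly."""
    if not 1 <= n_samples <= MAX_INDEXES:
        raise ValueError("number of samples must be between 1 and 24")
    low, high = LANE_READS_RANGE
    return {
        "reads_low": low / n_samples,
        "reads_high": high / n_samples,
        "cost_eur": LANE_COST_EUR / n_samples,
    }


for n in (6, 12, 24):
    result = per_sample(n)
    print(f"{n} samples: {result['reads_low'] / 1e6:.1f}-"
          f"{result['reads_high'] / 1e6:.1f} M reads each, "
          f"about {result['cost_eur']:.0f} euro each")
```

Such a calculation helps when trading depth per sample against the number of biological replicates within a fixed budget.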
  • 64. Bioinformatics analysis will take most of your time Steps of the workflow: Quality control (QC) of raw reads; Preprocessing: filtering of reads and read parts, to help our goal of differential detection; QC of preprocessing; Mapping to a reference genome (alternative: to a transcriptome); QC of the mapping; Count table extraction; QC of the count table; DE test; Biological insight.
  • 65. Bioinformatics analysis will take most of your time Steps of the workflow: Quality control (QC) of raw reads; Preprocessing: filtering of reads and read parts, to help our goal of differential detection; QC of preprocessing; Mapping to a reference genome (alternative: to a transcriptome); QC of the mapping; Count table extraction; QC of the count table; DE test; Biological insight.
  • 66. Bioinformatics analysis will take most of your time Biological insight 6 1 DE test Quality control (QC) of raw reads 5 QC of the count table 4 Count table extraction Preprocessing: filtering of reads 2 and read parts, to help our goal of differential detection. QC of the mapping 3 QC of preprocessing Mapping to a reference genome (alternative: to a transcriptome)
  • 67. Overview http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
  • 68. The numbers get reduced with every step 25M 20M 15M
  • 69. Deeper, or more replicates? Variance will be lower with more reads: but sequencing another biological replicate is preferred over sequencing deeper, or technical reps. Doi: 10.1093/bioinformatics/btt015
  • 70. There is a tool to help you set up
  • 71. Scotty – power analysis Power: the probability to reject the null hypothesis if the alternative is true. 'How many samples and how deep in order to minimize false negatives'. (a null hypothesis is always a scenario in which there is no difference, hence no differential expression). Alternative tools: http://wiki.bits.vib.be/index.php/RNAseq_toolbox
  • 72. Help with design http://wiki.bits.vib.be/index.php/RNAseq_toolbox
  • 73. How many samples to sequence? → Scotty exercise
  • 74. Keywords A read count of a gene is dependent on: 1. chance 2. expression level 3. transcript length 4. depth of sequencing 5. GC-content. Terms: Poisson distribution, negative binomial distribution, condition, sample, normalization. Write in your own words what the terms mean.
  • 75. Reads All my references available at: https://www.zotero.org/groups/dernaseq/items