RNA-seq
RNA-seq, also called "Whole Transcriptome Shotgun Sequencing" ("WTSS"), refers to the use of high-throughput sequencing technologies to sequence cDNA in order to obtain information about a sample's RNA content.
Analysis of RNA-seq data
Single nucleotide variation discovery: currently being applied to cancer research and microbiology.
Fusion gene detection: fusion genes have gained attention because of their relationship with cancer. The idea follows from the process of aligning the short transcriptomic reads to a reference genome. Most of the short reads will fall within one complete exon, and a smaller but still large set would be expected to map to known exon-exon junctions. The remaining unmapped short reads would then be further analyzed to determine whether they match an exon-exon junction where the exons come from different genes. Such a match would be evidence of a possible fusion.
Gene expression
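The fusion-detection idea can be sketched in a few lines of base R. This is only a toy illustration under stated assumptions: the exon names, gene names and split-read table are invented, and a real fusion caller would also have to perform the alignment and filter artifacts.

```r
# Hypothetical exon-to-gene annotation (all names invented for illustration).
exon_gene <- c(exA1 = "GENE_A", exA2 = "GENE_A", exB1 = "GENE_B")

# Hypothetical split alignments: each read has been matched to the end of
# one exon and the start of another. r1 is a within-gene junction, kept
# here only for contrast.
split_reads <- data.frame(
  read       = c("r1", "r2", "r3"),
  left_exon  = c("exA1", "exA2", "exA1"),
  right_exon = c("exA2", "exB1", "exB1"),
  stringsAsFactors = FALSE
)

left_gene  <- unname(exon_gene[split_reads$left_exon])
right_gene <- unname(exon_gene[split_reads$right_exon])

# A junction joining exons from different genes is a fusion candidate.
split_reads$fusion_candidate <- left_gene != right_gene

# Number of supporting reads per candidate gene pair.
pairs <- paste(left_gene, right_gene, sep = "--")
fusion_support <- table(pairs[split_reads$fusion_candidate])
print(split_reads)
print(fusion_support)
```

Counting supporting reads per gene pair matters because a single chimeric read is weak evidence; several independent reads across the same junction are more convincing.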
Paired-end
Fusion gene detection
Definitions
Gene expression
Detect differences in gene-level expression between samples. This sort of analysis is particularly relevant for controlled experiments comparing expression in wild-type and mutant strains of the same tissue, comparing treated versus untreated cells, cancer versus normal, and so on.
Differential expression (2)
RNA-seq gives a discrete measurement (a read count) for each gene. Transformed count data are not well approximated by continuous distributions, especially in the lower count range and for small samples. Therefore, statistical models appropriate for count data are vital to extracting the most information from RNA-seq data.
In general, the Poisson distribution forms the basis for modeling RNA-seq count data.
RNA-seq pipeline
Mapping
The first step in this procedure is read mapping, or alignment: finding the unique location where a short read is identical to the reference. In reality, however, the reference is never a perfect representation of the actual biological source of the RNA being sequenced: the sample carries SNPs and indels, and the reads arise from a spliced transcriptome rather than from a genome. Short reads can sometimes align perfectly to multiple locations and can contain sequencing errors that have to be accounted for. The real task is to find the location where each short read best matches the reference, while allowing for errors and structural variation.
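To make "best match while allowing for errors" concrete, here is a deliberately naive base-R sketch that slides a read along a reference and keeps the positions with the fewest mismatches. The sequences, the function names and the `max_mismatch` threshold are assumptions made for illustration; real aligners use indexed data structures and also handle indels and splicing.

```r
# Toy reference and read (hypothetical sequences).
reference <- "ACGTTGCAACGTAGCTAGGCTA"
read      <- "AGCTTGG"   # differs by one base from reference positions 13-19

# Number of mismatching bases between the read and the reference window
# that starts at position `start` (Hamming distance).
hamming_at <- function(ref, read, start) {
  window <- substring(ref, start, start + nchar(read) - 1)
  sum(strsplit(window, "")[[1]] != strsplit(read, "")[[1]])
}

# Return the best-matching start position(s), or NULL if even the best
# window exceeds the allowed number of mismatches (read stays unmapped).
best_match <- function(ref, read, max_mismatch = 2) {
  starts <- seq_len(nchar(ref) - nchar(read) + 1)
  mm <- vapply(starts, function(s) hamming_at(ref, read, s), integer(1))
  best <- min(mm)
  if (best > max_mismatch) return(NULL)
  list(positions = starts[mm == best], mismatches = best)
}

hit <- best_match(reference, read)
print(hit)
```

When `positions` contains more than one entry the read is a multimap, which is the situation the aligner strategies on the next slide have to resolve.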
Aligners
Aligners differ in how they handle 'multimaps' (reads that map equally well to several locations). Most aligners either discard multimaps, allocate them randomly, or allocate them on the basis of an estimate of local coverage. Paired-end reads reduce the problem of multi-mapping, as both ends of the cDNA fragment from which the short reads were generated should map nearby on the transcriptome, allowing the ambiguity of multimaps to be resolved in most circumstances.
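The third strategy, allocation by local coverage, can be sketched as a proportional split. This is a simplified illustration, not the rule of any particular aligner: the locus names and counts are invented, and fractional allocation is just one possible convention.

```r
# Hypothetical coverage from uniquely mapped reads at three candidate loci,
# and the number of reads that map equally well to all three.
unique_cov <- c(locusA = 30, locusB = 10, locusC = 0)
n_multi    <- 8

# Split the multimapped reads in proportion to unique coverage. If no locus
# has unique coverage, fall back to an even split.
allocate_multimaps <- function(unique_cov, n_multi) {
  total <- sum(unique_cov)
  if (total == 0) {
    w <- rep(1 / length(unique_cov), length(unique_cov))
  } else {
    w <- unique_cov / total
  }
  setNames(n_multi * w, names(unique_cov))
}

alloc <- allocate_multimaps(unique_cov, n_multi)
print(alloc)
```

Discarding multimaps corresponds to adding nothing to any locus, and random allocation corresponds to sampling a locus for each read instead of splitting fractionally.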
Reference genome
The most commonly used approach is to use the genome itself as the reference. This has the benefit of being easy and not biased towards any known annotation. However, reads that span exon boundaries will not map to this reference. Thus, using the genome as a reference will give greater coverage (at the same true expression level) to transcripts with fewer exons, as they will contain fewer exon junctions.
In order to account for junction reads, it is common practice to build exon junction libraries in which reference sequences are constructed using boundaries between annotated exons, a proxy genome generated with known exonic sequences.
Another option is the de novo assembly of the transcriptome, for use as a reference, using genome assembly tools.
A commonly used approach for transcriptome mapping is to progressively increase the complexity of the mapping strategy to handle the unaligned reads.
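An exon junction library can be sketched as follows: for every ordered pair of annotated exons in a gene, join the end of the upstream exon to the start of the downstream exon. The exon sequences, the `build_junctions` helper and the flank length are assumptions for illustration; in practice the flank is usually chosen relative to the read length so that a read must actually cross the junction to align.

```r
# Hypothetical exons of one gene, in transcript order.
exons <- c(e1 = "ATGGCCAAGT", e2 = "GGTACCTTGA", e3 = "CCATAGGTAA")

# Build one reference sequence per exon pair (i < j): the last `flank` bases
# of the upstream exon followed by the first `flank` bases of the downstream
# exon. Non-adjacent pairs cover exon-skipping isoforms.
build_junctions <- function(exons, flank = 4) {
  idx <- combn(length(exons), 2)
  seqs <- apply(idx, 2, function(p) {
    left  <- exons[[p[1]]]
    right <- exons[[p[2]]]
    paste0(substring(left, nchar(left) - flank + 1, nchar(left)),
           substring(right, 1, flank))
  })
  names(seqs) <- apply(idx, 2, function(p) {
    paste(names(exons)[p], collapse = "_")
  })
  seqs
}

junctions <- build_junctions(exons)
print(junctions)
```

Reads that failed to align to the genome can then be aligned to these sequences, which is one step of the progressive strategy mentioned above.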
Normalization
Normalization (2)
Within-library normalization allows quantification of expression levels of each gene relative to other genes in the sample. Because longer transcripts have higher read counts (at the same expression level), a common method for within-library normalization is to divide the summarized counts by the length of the gene [32,34]. The widely used RPKM (reads per kilobase of exon model per million mapped reads) accounts for both library size and gene length effects in within-sample comparisons.
When testing individual genes for DE between samples, technical biases, such as gene length and nucleotide composition, will mainly cancel out because the underlying sequence used for summarization is the same between samples. However, between-sample normalization is still essential for comparing counts from different libraries relative to each other. The simplest and most commonly used normalization adjusts by the total number of reads in the library [34,51], accounting for the fact that more reads will be assigned to each gene if a sample is sequenced to a greater depth.
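As a worked illustration, RPKM divides each count by the library size in millions and by the gene length in kilobases. The sketch below uses base R with invented counts, lengths and library sizes; the `rpkm` function name is a local helper defined here, not a call into any package.

```r
# RPKM = count / (library size / 1e6) / (gene length / 1e3)
# counts: genes x samples matrix; gene_length_bp: one length per gene;
# lib_size: total mapped reads per sample (defaults to the column sums).
rpkm <- function(counts, gene_length_bp, lib_size = colSums(counts)) {
  per_million <- sweep(counts, 2, lib_size / 1e6, "/")  # depth correction
  per_million / (gene_length_bp / 1e3)                  # length correction
}

# Hypothetical data: 3 genes, 2 samples.
counts <- matrix(c(100, 200, 300,
                   400, 500, 1000),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length_bp <- c(1000, 2000, 500)
lib_size       <- c(1e6, 2e6)

r <- rpkm(counts, gene_length_bp, lib_size)
print(r)
```

In sample s1, genes g1 and g2 end up with the same RPKM even though g2 has twice the count, because g2 is twice as long. Dividing by total library size alone is the simple between-sample adjustment described in the last sentence of the slide.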
Normalization: methods
Normalization (example)
NG-5045 (Diabetes)
Pool 1: 2, 4, 12, 16
Pool 2: 3, 9, 13, 14
Pool 3: 1, 5, 6, 7
Pool 4: 8, 10, 11, 15
Morbidly obese persons without insulin resistance: 2, 3, 4, 9, 12, 13, 14, 16.
Morbidly obese persons with high insulin resistance: 1, 5, 6, 7, 8, 10, 11, 15.
Differential expression
The goal of a DE analysis is to highlight genes that have changed significantly in abundance across experimental conditions. In general, this means taking a table of summarized count data for each library and performing statistical testing between samples of interest.
Transformed count data are not well approximated by continuous distributions, especially in the lower count range and for small samples. Therefore, statistical models appropriate for count data are vital to extracting the most information from RNA-seq data.
Poisson-based analysis
In an early RNA-seq study using a single source of RNA, goodness-of-fit statistics suggested that the distribution of counts across lanes for the majority of genes was indeed Poisson. This has been independently confirmed using a technical experiment, and software tools are readily available to perform these analyses.
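A goodness-of-fit check of this kind can be sketched for a single gene with a Pearson chi-square statistic: under a Poisson model with a rate proportional to lane depth, the expected count in each lane is the gene's total count times that lane's share of the total depth. This is one common form of such a test and is not necessarily the exact statistic used in the study cited on the slide; the counts and depths below are invented.

```r
# Pearson chi-square goodness-of-fit of one gene's lane counts to a Poisson
# model whose mean is proportional to lane sequencing depth.
poisson_gof <- function(x, depth) {
  expected <- sum(x) * depth / sum(depth)
  stat <- sum((x - expected)^2 / expected)
  df <- length(x) - 1
  list(statistic = stat,
       df = df,
       p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Hypothetical gene counted in seven lanes (depths in millions of reads).
x     <- c(12, 9, 15, 11, 10, 13, 14)
depth <- c(13.1, 12.9, 14.7, 13.5, 13.0, 14.0, 14.2)
fit <- poisson_gof(x, depth)
print(fit)
```

A small p-value indicates more lane-to-lane variation than the Poisson model predicts (extra-Poisson variation). Repeating the test for every gene and inspecting the distribution of p-values gives a picture of how well the model fits overall.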
Each RNA sample was sequenced in seven lanes, producing 12.9–14.7 million reads per lane at the 3 pM concentration and 8.4–9.3 million reads at the 1.5 pM concentration. We aligned all reads against the whole genome. 40% of reads mapped uniquely to a genomic location, and of these, 65% mapped to autosomal or sex chromosomes (the remainder mapped almost exclusively to mitochondrial DNA).
Poisson-based software
R packages in Bioconductor: DEGseq (Wang et al., 2010)
Alternative strategies
Biological variability is not captured well by the Poisson assumption. Hence, Poisson-based analyses for datasets with biological replicates will be prone to high false positive rates resulting from the underestimation of sampling error.
Goodness-of-fit tests indicate that a small proportion of genes show clear deviations from this model (extra-Poisson variation), and although we found that these deviations did not lead to false-positive identification of differentially expressed genes at a stringent FDR, there is nevertheless room for improved models that account for the extra-Poisson variation. One natural strategy would be to replace the Poisson distribution with another distribution, such as the quasi-Poisson distribution (Venables and Ripley 2002) or the negative binomial distribution (Robinson and Smyth 2007), which have an additional parameter that estimates over- (or under-) dispersion relative to a Poisson model.
Poisson-Negative Binomial
The negative binomial distribution can be used as an alternative to the Poisson distribution. It is especially useful for discrete data over an unbounded positive range whose sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a Poisson distribution, for which the mean is equal to the variance, so a Poisson distribution is not an appropriate model. Since the negative binomial distribution has one more parameter than the Poisson, the second parameter can be used to adjust the variance independently of the mean.
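The mean-variance relationships behind this comparison can be written out explicitly. The symbols are assumptions, since the slides do not name them: $\mu$ is the mean count and $\phi$ is the dispersion, in the mean-dispersion parameterization of the negative binomial.

```latex
\begin{align}
Y \sim \mathrm{Poisson}(\mu) &: \quad \operatorname{Var}(Y) = \mu \\
Y \sim \mathrm{NB}(\mu, \phi) &: \quad \operatorname{Var}(Y) = \mu + \phi\,\mu^{2},
  \qquad \phi \ge 0
\end{align}
```

Setting $\phi = 0$ recovers the Poisson variance, and any $\phi > 0$ gives overdispersion that grows with the mean. In R's `rnbinom(size = , mu = )` parameterization the variance is $\mu + \mu^{2}/\text{size}$, so $\phi = 1/\text{size}$; the simulation in the edgeR example uses `size = 4`, which corresponds to $\phi = 0.25$.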
Negative-binomial-based analysis
In order to account for biological variability, methods that have been developed for serial analysis of gene expression (SAGE) data have recently been applied to RNA-seq data. The major difference between SAGE and RNA-seq data is the scale of the datasets. To account for biological variability, the negative binomial distribution has been used as a natural extension of the Poisson distribution, requiring an additional dispersion parameter to be estimated.
Description of SAGE
Serial analysis of gene expression (SAGE) is a method for comprehensive analysis of gene expression patterns. Three principles underlie the SAGE methodology:
1. A short sequence tag (10-14 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript.
2. Sequence tags can be linked together to form long serial molecules that can be cloned and sequenced.
3. Quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript.
Robinson, McCarthy, Smyth (2010)
edgeR paper
Robinson, McCarthy, Smyth (2010) (2)
Robinson and Smyth 2008
Robinson and Smyth 2008 (2)
Negative-binomial-based software
R packages in Bioconductor:
edgeR (Robinson et al., 2010): Exact test based on Negative Binomial distribution.
DESeq (Anders and Huber, 2010): Exact test based on Negative Binomial distribution.
baySeq (Hardcastle et al., 2010): Estimation of the posterior likelihood of differential expression (or more complex hypotheses) via empirical Bayesian methods using Poisson or NB distributions.
CLC Genomics Workbench approach
19.4.2.1 Kal et al.'s test (Z-test)
Kal et al.'s test [Kal et al., 1999] compares a single sample against another single sample, and thus requires that each group in your experiment has only one sample. The test relies on an approximation of the binomial distribution by the normal distribution [Kal et al., 1999]. Because it considers proportions rather than raw counts, the test is also suitable in situations where the sum of counts differs between the samples.
19.4.2.2 Baggerly et al.'s test (Beta-binomial)
Baggerly et al.'s test [Baggerly et al., 2003] compares the proportions of counts in a group of samples against those of another group of samples, and is suited to cases where replicates are available in the groups. The samples are given different weights depending on their sizes (total counts). The weights are obtained by assuming a Beta distribution on the proportions in a group, and estimating these, along with the proportion of a binomial distribution, by the method of moments. The result is a weighted t-type test statistic.
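The single-sample Z-test described above can be sketched as the standard pooled two-proportion Z statistic, which is the usual form of a normal approximation to the binomial. This is an assumption about the form of the test, not a reproduction of the CLC implementation, and the `kal_z_test` name is a local helper.

```r
# Two-sample test on proportions using the normal approximation to the
# binomial. x1, x2: counts for one gene; n1, n2: total counts per sample.
kal_z_test <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p0 <- (x1 + x2) / (n1 + n2)                       # pooled proportion
  z  <- (p1 - p2) / sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
  list(z = z, p.value = 2 * pnorm(-abs(z)))         # two-sided p-value
}

# tag.1 from the simulated edgeR example: 15 reads in a library of 40000
# versus 117 reads in a library of 38000.
res <- kal_z_test(15, 40000, 117, 38000)
print(res)
```

Because the statistic works on proportions, the two libraries do not need the same total count. With only one sample per group, however, the test cannot estimate biological variability, which is why replicate-aware tests such as Baggerly et al.'s are preferred when replicates exist.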
Baggerly, K., Deng, L., Morris, J., and Aldaz, C. (2003). Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics, 19(12):1477-1483.
Solution with edgeR
> library(edgeR)
> set.seed(101)
> # Simulate 200 tags in 4 libraries; tags 1-5 are 8-fold higher in libraries 3 and 4
> n <- 200
> lib.sizes <- c(40000, 50000, 38000, 40000)
> p <- runif(n, min = 1e-04, max = 0.001)
> mu <- outer(p, lib.sizes)
> mu[1:5, 3:4] <- mu[1:5, 3:4] * 8
> y <- matrix(rnbinom(4 * n, size = 4, mu = mu), nrow = n)
> rownames(y) <- paste("tag", 1:nrow(y), sep = ".")
> y[1:10, ]
       [,1] [,2] [,3] [,4]
tag.1    15   13  117   77
tag.2     3    4   49   33
tag.3    25   56  302  332
tag.4    40   13  271   91
tag.5    13    3   51   56
tag.6    14    7   31   18
tag.7    16   39   19    9
tag.8     6   28    6    6
tag.9    10   42   80   14
tag.10   33   25    5   27
> # Libraries 1-2 form group 1 and libraries 3-4 form group 2
> d <- DGEList(counts = y, group = rep(1:2, each = 2), lib.size = lib.sizes)
> # Estimate one common dispersion, then run the negative binomial exact test
> d <- estimateCommonDisp(d)
> de.common <- exactTest(d)
Comparison of groups:  2 - 1
> topTags(de.common)
Comparison of groups:  2 - 1
           logConc     logFC       PValue         FDR
tag.184 -13.636760 -5.236853 6.112570e-05 0.005195714
tag.2   -11.769438  3.766465 6.405229e-05 0.005195714
tag.3    -8.550981  3.214682 7.793571e-05 0.005195714
tag.4    -9.188394  2.911743 3.300004e-04 0.013214944
tag.1   -10.135230  2.984351 3.303736e-04 0.013214944
tag.5   -10.944756  2.868619 1.035516e-03 0.034517212
tag.105 -10.693557  2.618355 2.337750e-03 0.066792856
tag.164 -11.253348 -2.209660 1.090272e-02 0.233310771
tag.14  -11.258031  2.238669 1.090272e-02 0.233310771
tag.123 -13.277812 -2.756096 1.166554e-02 0.233310771
Suggested pipeline?
Quality control: FastQC, DNAA
Mapping the reads:
Obtaining the reference

EiB Seminar from Antoni Miñarro, Ph.D
