EiB Seminar from Antoni Miñarro, Ph.D


Seminar from Antonio Miñarro, Ph.D.
Statistics and Bioinformatics research group.
Department of Statistics
University of Barcelona

Published in: Technology

  1. 1. RNA-seq
     RNA-seq, also called "Whole Transcriptome Shotgun Sequencing" ("WTSS"), refers to the use of high-throughput sequencing technologies to sequence cDNA in order to get information about a sample's RNA content.
  2. 2. Analysis of RNA-seq data
     • Single nucleotide variation discovery: currently being applied to cancer research and microbiology.
     • Fusion gene detection: fusion genes have gained attention because of their relationship with cancer. The idea follows from the process of aligning the short transcriptomic reads to a reference genome. Most of the short reads will fall within one complete exon, and a smaller but still large set would be expected to map to known exon-exon junctions. The remaining unmapped short reads would then be further analyzed to determine whether they match an exon-exon junction where the exons come from different genes. This would be evidence of a possible fusion.
     • Gene expression
  3. 3. Paired-end
  4. 4. Fusion gene detection
  5. 5. Definitions
  6. 6. Gene expression
     Detect differences in gene-level expression between samples. This sort of analysis is particularly relevant for controlled experiments comparing expression in wild-type and mutant strains of the same tissue, comparing treated versus untreated cells, cancer versus normal, and so on.
  7. 7. Differential expression (2)
     RNA-seq gives a discrete measurement for each gene. Transformation of count data is not well approximated by continuous distributions, especially in the lower count range and for small samples. Therefore, statistical models appropriate for count data are vital to extracting the most information from RNA-seq data.
     In general, the Poisson distribution forms the basis for modeling RNA-seq count data.
  8. 8. RNA-seq Pipeline
  9. 9. Mapping
     The first step in this procedure is read mapping, or alignment: finding the unique location where a short read is identical to the reference.
     In reality, however, the reference is never a perfect representation of the actual biological source of the RNA being sequenced. The sample may carry SNPs and indels, and the reads arise from a spliced transcriptome rather than from a genome.
     Short reads can sometimes align perfectly to multiple locations, and they can contain sequencing errors that have to be accounted for.
     The real task is to find the location where each short read best matches the reference, while allowing for errors and structural variation.
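The "best match while allowing for errors" idea can be illustrated with a toy brute-force search. This is only a sketch under simplifying assumptions: ungapped placement, mismatches only (no indels or splicing), and a hypothetical `max_mismatches` cutoff. Real aligners use indexed data structures rather than a linear scan.

```python
def best_hits(read, reference, max_mismatches=2):
    """Return (mismatch_count, start_positions) for the best ungapped placements.

    Hypothetical helper: scans every window of the reference and keeps the
    placements with the fewest mismatches, up to max_mismatches.
    """
    best, positions = None, []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches > max_mismatches:
            continue
        if best is None or mismatches < best:
            best, positions = mismatches, [start]
        elif mismatches == best:
            positions.append(start)
    return best, positions


reference = "ACGTACGTTAGCCGATACGTACGA"
print(best_hits("TAGCCGAT", reference))  # exact, unique placement
print(best_hits("TAGCAGAT", reference))  # one sequencing error tolerated
print(best_hits("ACGTACG", reference))   # equally good at two locations
```

The third read is a multimap: two placements are equally good, so the reference alone cannot say which one is correct.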
  10. 10. Aligners
     Aligners differ in how they handle 'multimaps' (reads that map equally well to several locations). Most aligners either discard multimaps, allocate them randomly or allocate them on the basis of an estimate of local coverage.
     Paired-end reads reduce the problem of multi-mapping, as both ends of the cDNA fragment from which the short reads were generated should map nearby on the transcriptome, allowing the ambiguity of multimaps to be resolved in most circumstances.
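A minimal sketch of how a mate can disambiguate a multimapped read, assuming candidate positions are already known for both ends. The function name and the `max_insert` threshold are hypothetical, chosen only for illustration.

```python
def resolve_with_mate(read_hits, mate_hits, max_insert=500):
    """Keep candidate positions of a read that have a mate hit within max_insert bases."""
    return [
        pos for pos in read_hits
        if any(abs(pos - mate) <= max_insert for mate in mate_hits)
    ]


# The first end maps equally well to three places; its mate maps to one.
read_hits = [1_200, 48_700, 910_300]
mate_hits = [48_950]
print(resolve_with_mate(read_hits, mate_hits))
```

If more than one candidate survives, the read is still ambiguous and falls back to whatever multimap policy the aligner applies.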
  11. 11. Reference genome
     The most commonly used approach is to use the genome itself as the reference. This has the benefit of being easy and not biased towards any known annotation. However, reads that span exon boundaries will not map to this reference. Thus, using the genome as a reference will give greater coverage (at the same true expression level) to transcripts with fewer exons, as they will contain fewer exon junctions.
     In order to account for junction reads, it is common practice to build exon junction libraries in which reference sequences are constructed using boundaries between annotated exons, a proxy genome generated with known exonic sequences.
     Another option is the de novo assembly of the transcriptome, for use as a reference, using genome assembly tools.
     A commonly used approach for transcriptome mapping is to progressively increase the complexity of the mapping strategy to handle the unaligned reads.
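As a rough sketch of an exon junction library, the snippet below joins the end of each upstream exon to the start of each downstream exon of one gene. The flank length of `read_length - 1` is an assumption made here so that any read matching a junction sequence must cross the boundary; the exon sequences and naming scheme are invented for illustration.

```python
from itertools import combinations


def junction_library(exons, read_length):
    """Build junction reference sequences from an ordered list of exon sequences."""
    if read_length < 2:
        raise ValueError("read_length must be at least 2 to span a junction")
    flank = read_length - 1
    library = {}
    # Pair every upstream exon with every downstream exon (allows exon skipping).
    for (i, upstream), (j, downstream) in combinations(enumerate(exons, start=1), 2):
        library[f"exon{i}-exon{j}"] = upstream[-flank:] + downstream[:flank]
    return library


exons = ["AAAACCCC", "GGGGTTTT", "ACGTACGT"]  # toy exons of a single gene
for name, seq in junction_library(exons, read_length=5).items():
    print(name, seq)
```

Reads that fail to align to the genome can then be aligned against these junction sequences, which matches the progressive strategy described on the slide.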
  12. 12. Normalization
  13. 13. Normalization (2)
     Within-library normalization allows quantification of expression levels of each gene relative to other genes in the sample. Because longer transcripts have higher read counts (at the same expression level), a common method for within-library normalization is to divide the summarized counts by the length of the gene [32,34]. The widely used RPKM (reads per kilobase of exon model per million mapped reads) accounts for both library size and gene length effects in within-sample comparisons.
     When testing individual genes for DE between samples, technical biases, such as gene length and nucleotide composition, will mainly cancel out because the underlying sequence used for summarization is the same between samples. However, between-sample normalization is still essential for comparing counts from different libraries relative to each other. The simplest and most commonly used normalization adjusts by the total number of reads in the library [34,51], accounting for the fact that more reads will be assigned to each gene if a sample is sequenced to a greater depth.
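The two normalizations described above reduce to short formulas. The sketch below assumes the usual definition of RPKM (count scaled by gene length in kilobases and library size in millions) and uses counts per million as the total-count adjustment; the gene names and numbers are made up.

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def counts_per_million(count, total_mapped_reads):
    """Between-sample scaling by library size only."""
    return count * 1e6 / total_mapped_reads


library_size = 10_000_000
genes = {"geneA": (500, 1_000), "geneB": (500, 4_000)}  # (count, length in bp)
for name, (count, length_bp) in genes.items():
    print(name, rpkm(count, length_bp, library_size),
          counts_per_million(count, library_size))
```

Both toy genes have the same count, but the gene that is four times longer gets a quarter of the RPKM, which is the length effect the slide describes.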
  14. 14. Normalization: methods
  15. 15. Normalization (example)
  16. 16. NG-5045 (Diabetes)
     Pool 1: 2, 4, 12, 16
     Pool 2: 3, 9, 13, 14
     Pool 3: 1, 5, 6, 7
     Pool 4: 8, 10, 11, 15
     Morbidly obese persons without insulin resistance: 2, 3, 4, 9, 12, 13, 14, 16.
     Morbidly obese persons with high insulin resistance: 1, 5, 6, 7, 8, 10, 11, 15.
  17. 17. Differential expression
     The goal of a DE analysis is to highlight genes that have changed significantly in abundance across experimental conditions. In general, this means taking a table of summarized count data for each library and performing statistical testing between samples of interest.
     Transformation of count data is not well approximated by continuous distributions, especially in the lower count range and for small samples. Therefore, statistical models appropriate for count data are vital to extracting the most information from RNA-seq data.
  18. 18. Poisson-based analysis
     In an early RNA-seq study using a single source of RNA, goodness-of-fit statistics suggested that the distribution of counts across lanes for the majority of genes was indeed Poisson. This has been independently confirmed using a technical experiment, and software tools are readily available to perform these analyses.
  19. 19. Each RNA sample was sequenced in seven lanes, producing 12.9–14.7 million reads per lane at the 3 pM concentration and 8.4–9.3 million reads at the 1.5 pM concentration. We aligned all reads against the whole genome. 40% of reads mapped uniquely to a genomic location, and of these, 65% mapped to autosomal or sex chromosomes (the remainder mapped
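One way to picture a lane-level goodness-of-fit check is a Pearson chi-square statistic for a single gene. This is a generic sketch, not necessarily the statistic used in the cited study: it assumes that, under the Poisson model, the gene's reads split across lanes in proportion to lane depth. The counts and depths are invented.

```python
def poisson_lane_gof(counts, lane_depths):
    """Pearson chi-square statistic and degrees of freedom for one gene across lanes."""
    total_count = sum(counts)
    if total_count == 0:
        raise ValueError("gene has no reads in any lane")
    total_depth = sum(lane_depths)
    # Expected count per lane if reads are allocated in proportion to lane depth.
    expected = [total_count * depth / total_depth for depth in lane_depths]
    statistic = sum((obs - exp) ** 2 / exp for obs, exp in zip(counts, expected))
    return statistic, len(counts) - 1


depths = [1_000_000] * 4
CHI2_CRIT_DF3 = 7.815  # 5% critical value of chi-square with 3 degrees of freedom

for counts in ([48, 52, 45, 55], [20, 80, 35, 65]):
    stat, df = poisson_lane_gof(counts, depths)
    print(counts, round(stat, 2), df, "extra-Poisson" if stat > CHI2_CRIT_DF3 else "consistent")
```

The first gene varies about as much as Poisson sampling predicts; the second shows far more lane-to-lane spread than the model allows.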
  21. 21. Poisson-based software
     R packages in Bioconductor:
     • DEGseq (Wang et al., 2010)
  22. 22. Alternative strategies
     Biological variability is not captured well by the Poisson assumption. Hence, Poisson-based analyses for datasets with biological replicates will be prone to high false positive rates resulting from the underestimation of sampling error.
     Goodness-of-fit tests indicate that a small proportion of genes show clear deviations from this model (extra-Poisson variation), and although we found that these deviations did not lead to false-positive identification of differentially expressed genes at a stringent FDR, there is nevertheless room for improved models that account for the extra-Poisson variation. One natural strategy would be to replace the Poisson distribution with another distribution, such as the quasi-Poisson distribution (Venables and Ripley 2002) or the negative binomial distribution (Robinson and Smyth 2007), which have an additional parameter that estimates over- (or under-) dispersion relative to a Poisson model.
  23. 23. Poisson-Negative Binomial
     The negative binomial distribution can be used as an alternative to the Poisson distribution. It is especially useful for discrete data over an unbounded positive range whose sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a Poisson distribution, for which the mean is equal to the variance. Hence a Poisson distribution is not an appropriate model. Since the negative binomial distribution has one more parameter than the Poisson, the second parameter can be used to adjust the variance independently of the mean.
  24. 24. Negative-Binomial based analysis
     In order to account for biological variability, methods that have been developed for serial analysis of gene expression (SAGE) data have recently been applied to RNA-seq data. The major difference between SAGE and RNA-seq data is the scale of the datasets. To account for biological variability, the negative binomial distribution has been used as a natural extension of the Poisson distribution, requiring an additional dispersion parameter to be estimated.
  25. 25. Description of SAGE
     Serial analysis of gene expression (SAGE) is a method for comprehensive analysis of gene expression patterns. Three principles underlie the SAGE methodology:
     1. A short sequence tag (10-14 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript;
     2. Sequence tags can be linked together to form long serial molecules that can be cloned and sequenced; and
     3. Quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript.
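A small sketch of how the extra parameter decouples variance from mean. It assumes the parametrization variance = mu + dispersion * mu^2 (the form used by edgeR-style models, not stated on the slide) and a simple method-of-moments dispersion estimate, which is an illustration rather than the estimator any particular package uses.

```python
from statistics import mean, variance


def nb_variance(mu, dispersion):
    """Negative binomial variance under the assumed mu + dispersion * mu^2 form."""
    return mu + dispersion * mu ** 2


def moment_dispersion(counts):
    """Method-of-moments dispersion; 0.0 means no evidence of overdispersion."""
    m = mean(counts)
    if m == 0:
        return 0.0
    s2 = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (s2 - m) / m ** 2)


technical = [38, 42, 40, 40]    # variance below the mean: Poisson is adequate
biological = [10, 30, 50, 70]   # variance far above the mean of 40
print(moment_dispersion(technical), moment_dispersion(biological))
print(nb_variance(40, 0.0), nb_variance(40, 0.25))
```

With dispersion 0 the variance equals the mean, which recovers the Poisson case; any positive dispersion inflates the variance without moving the mean.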
  26. 26. Robinson, McCarthy, Smyth (2010)
  27. 27. Robinson, McCarthy, Smyth (2010): edgeR paper; edgeR paper (2)
  28. 28. Robinson and Smyth 2008
  29. 29. Robinson and Smyth 2008 (2)
  30. 30. Negative-Binomial based software
     R packages in Bioconductor:
     • edgeR (Robinson et al., 2010): Exact test based on the negative binomial distribution.
     • DESeq (Anders and Huber, 2010): Exact test based on the negative binomial distribution.
     • baySeq (Hardcastle et al., 2010): Estimation of the posterior likelihood of differential expression (or more complex hypotheses) via empirical Bayesian methods using Poisson or NB distributions.
  31. 31. CLC Genomics Workbench approach
     19.4.2.1 Kal et al.'s test (Z-test)
     Kal et al.'s test [Kal et al., 1999] compares a single sample against another single sample, and thus requires that each group in your experiment has only one sample. The test relies on an approximation of the binomial distribution by the normal distribution [Kal et al., 1999]. Because it considers proportions rather than raw counts, the test is also suitable in situations where the sum of counts is different between the samples.
     Baggerly et al.'s test (Beta-binomial)
     Baggerly et al.'s test [Baggerly et al., 2003] compares the proportions of counts in a group of samples against those of another group of samples, and is suited to cases where replicates are available in the groups. The samples are given different weights depending on their sizes (total counts). The weights are obtained by assuming a Beta distribution on the proportions in a group, and estimating these, along with the proportion of a binomial distribution, by the method of moments. The result is a weighted t-type test statistic.
  32. 32. Baggerly, K., Deng, L., Morris, J., and Aldaz, C. (2003). Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics, 19(12):1477-1483.
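The normal approximation behind a single-sample versus single-sample comparison can be sketched as a standard two-proportion Z-test with a pooled proportion. This is offered as an illustration of the idea, not as the exact computation performed by CLC Genomics Workbench; the function name and the counts are invented.

```python
import math


def two_proportion_z_test(count1, total1, count2, total2):
    """Z statistic and two-sided p-value comparing count1/total1 with count2/total2."""
    p1 = count1 / total1
    p2 = count2 / total2
    pooled = (count1 + count2) / (total1 + total2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total1 + 1 / total2))
    if se == 0:
        raise ValueError("no counts observed in either sample")
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# One gene: 30 of 100,000 reads in sample 1 versus 60 of 100,000 in sample 2.
z, p = two_proportion_z_test(30, 100_000, 60, 100_000)
print(round(z, 3), p)
```

Working with proportions is what makes the test usable when the two libraries have different total counts.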
  33. 33. Worked example with edgeR
     > # Simulate negative binomial counts for 200 tags in 4 libraries;
     > # tags 1-5 are 8-fold higher in libraries 3 and 4
     > set.seed(101)
     > n <- 200
     > lib.sizes <- c(40000, 50000, 38000, 40000)
     > p <- runif(n, min = 1e-04, max = 0.001)
     > mu <- outer(p, lib.sizes)
     > mu[1:5, 3:4] <- mu[1:5, 3:4] * 8
     > y <- matrix(rnbinom(4 * n, size = 4, mu = mu), nrow = n)
     > rownames(y) <- paste("tag", 1:nrow(y), sep = ".")
     > y[1:10, ]
            [,1] [,2] [,3] [,4]
     tag.1    15   13  117   77
     tag.2     3    4   49   33
     tag.3    25   56  302  332
     tag.4    40   13  271   91
     tag.5    13    3   51   56
     tag.6    14    7   31   18
     tag.7    16   39   19    9
     tag.8     6   28    6    6
     tag.9    10   42   80   14
     tag.10   33   25    5   27
     > # Estimate a common dispersion and run the exact test between the two groups
     > library(edgeR)
     > d <- DGEList(counts = y, group = rep(1:2, each = 2), lib.size = lib.sizes)
     > d <- estimateCommonDisp(d)
     > de.common <- exactTest(d)
     Comparison of groups: 2 - 1
     > topTags(de.common)
     Comparison of groups: 2 - 1
                logConc     logFC       PValue         FDR
     tag.184 -13.636760 -5.236853 6.112570e-05 0.005195714
     tag.2   -11.769438  3.766465 6.405229e-05 0.005195714
     tag.3    -8.550981  3.214682 7.793571e-05 0.005195714
     tag.4    -9.188394  2.911743 3.300004e-04 0.013214944
     tag.1   -10.135230  2.984351 3.303736e-04 0.013214944
     tag.5   -10.944756  2.868619 1.035516e-03 0.034517212
     tag.105 -10.693557  2.618355 2.337750e-03 0.066792856
     tag.164 -11.253348 -2.209660 1.090272e-02 0.233310771
     tag.14  -11.258031  2.238669 1.090272e-02 0.233310771
     tag.123 -13.277812 -2.756096 1.166554e-02 0.233310771
  34. 34. Suggested pipeline?
     • Quality control: FastQC, DNAA
     • Mapping the reads:
        • Obtaining the reference
        • Aligning reads to the reference: Bowtie
     • Differential expression:
        • Summarization of reads
        • Differential expression testing: edgeR
     • Gene set testing (GO): goseq
  35. 35. Experimental design?
     Many of the current strategies for DE analysis of count data are limited to simple experimental designs, such as pairwise or multiple group comparisons. To the best of our knowledge, no general methods have been proposed for the analysis of more complex designs, such as paired samples or time course experiments, in the context of RNA-seq data. In the absence of such methods, researchers have transformed their count data and used tools appropriate for continuous data. Generalized linear models provide the logical extension to the count models presented above, and clever strategies to share information over all genes will need to be developed; software tools now provide these methods (such as edgeR).
     Auer, P.L., and Doerge, R.W. (2010). Statistical Design and Analysis of RNA Sequencing Data. Genetics, 185, 405-416.
  36. 36. Integration with other data
     There is wide scope for integrating the results of RNA-seq data with other sources of biological data to establish a more complete picture of gene regulation [69]. For example, RNA-seq has been used in conjunction with genotyping data to identify genetic loci responsible for variation in gene expression between individuals (expression quantitative trait loci or eQTLs) [35,70]. Furthermore, integration of expression data with transcription factor binding, RNA interference, histone modification and DNA methylation information has the potential for greater understanding of a variety of regulatory mechanisms. A few reports of these 'integrative' analyses have emerged recently [71-73].