RNA-seq
RNA-seq, also called "Whole Transcriptome Shotgun Sequencing"
("WTSS"), refers to the use of high-throughput sequencing technologies to
sequence cDNA in order to get information about a sample's RNA content.
Analysis of RNA-seq data
Single nucleotide variation discovery: currently being applied to cancer
research and microbiology.
Fusion gene detection: fusion genes have gained attention because of
their relationship with cancer. The idea follows from the process of
aligning the short transcriptomic reads to a reference genome.
Most of the short reads will fall within one complete exon, and a
smaller but still large set would be expected to map to known
exon-exon junctions. The remaining unmapped short reads would
then be further analyzed to determine whether they match an
exon-exon junction where the exons come from different genes.
This would be evidence of a possible fusion.
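The last step of this idea can be sketched in a few lines of Python. This is only a minimal illustration, not a real fusion caller: the function name `find_fusion_candidates`, the `exons_by_gene` dictionary and the toy sequences are all hypothetical, and the sketch assumes exact matches with no sequencing errors.

```python
from itertools import product

def find_fusion_candidates(read, exons_by_gene, min_overlap=4):
    """Return (gene_a, gene_b, split) tuples where the read's prefix matches the
    end of an exon of gene_a and its suffix matches the start of an exon of a
    different gene_b. Exact matching only (toy example)."""
    hits = []
    for split in range(min_overlap, len(read) - min_overlap + 1):
        left, right = read[:split], read[split:]
        donors = [g for g, exons in exons_by_gene.items()
                  if any(e.endswith(left) for e in exons)]
        acceptors = [g for g, exons in exons_by_gene.items()
                     if any(e.startswith(right) for e in exons)]
        for a, b in product(donors, acceptors):
            if a != b:  # exons from different genes: possible fusion
                hits.append((a, b, split))
    return hits

# Hypothetical exon sequences for two genes.
exons_by_gene = {
    "geneA": ["ATGGCCTTAGGA", "CCGTTAAGCTAG"],
    "geneB": ["TTGACCGGATCA", "GGCATTACGGAT"],
}
# An unmapped read: last 5 bases of a geneA exon followed by the first 6 of a geneB exon.
read = "TAGGATTGACC"
candidates = find_fusion_candidates(read, exons_by_gene)
print(candidates)
```

A real pipeline would also require several independent reads supporting the same junction before reporting a fusion.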
Gene expression
Paired-end
Fusion gene detection
Definitions
Gene expression
Detect differences in gene level expression between samples. This
sort of analysis is particularly relevant for controlled experiments
comparing expression in wild-type and mutant strains of the same
tissue, comparing treated versus untreated cells, cancer versus
normal, and so on.
Differential expression (2)
RNA-seq gives a discrete measurement (a read count) for each gene.
Even after transformation, count data are not well approximated by
continuous distributions, especially in the lower count range and for
small samples. Therefore, statistical models appropriate for count data
are vital to extracting the most information from RNA-seq data.
In general, the Poisson distribution forms the basis for modeling
RNA-seq count data.
RNA-seq Pipeline
Mapping
The first step in this procedure is the read mapping or alignment: to find the
unique location where a short read is identical to the reference.
However, in reality the reference is never a perfect representation of the
actual biological source of the RNA being sequenced: the sample carries SNPs
and indels, and the reads arise from a spliced transcriptome rather than from
the genome.
Short reads can sometimes align perfectly to multiple locations and can
contain sequencing errors that have to be accounted for.
The real task is to find the location where each short read best matches the
reference, while allowing for errors and structural variation.
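As a rough illustration of "best match while allowing for errors", the Python sketch below slides a read along a reference and keeps the ungapped placement with the fewest mismatches. It is a brute-force toy under stated assumptions (no indels, no splicing, hypothetical function name and sequences); real aligners use indexed data structures instead.

```python
def best_alignment(read, reference, max_mismatches=2):
    """Return (position, mismatches) for the ungapped placement of the read
    with the fewest mismatches, or None if no placement is within the limit."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best

reference = "ACGTTGCAAGGCTTACGATCG"
read = "GGCTAACG"  # differs from the reference at one base (SNP or sequencing error)
hit = best_alignment(read, reference)
print(hit)
```

Ties are resolved here by keeping the leftmost placement, which is exactly the multimapping problem that aligners have to handle explicitly.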
Aligners
Aligners differ in how they handle ‘multimaps’ (reads that
map equally well to several locations). Most aligners
either discard multimaps, allocate them randomly or
allocate them on the basis of an estimate of local
coverage.
Paired-end reads reduce the problem of multi-mapping, as
both ends of the cDNA fragment from which the short
reads were generated should map nearby on the
transcriptome, allowing the ambiguity of multimaps to be
resolved in most circumstances.
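The coverage-based strategy can be sketched as follows. This is a simplified Python illustration, assuming that local coverage is summarized by the number of uniquely mapped reads per locus; the function name, locus names and counts are hypothetical and do not reproduce any particular aligner.

```python
import random

def allocate_multimaps(unique_counts, multimaps, seed=0):
    """Assign each multimapped read to one of its candidate loci, with
    probability proportional to the uniquely mapped reads at each candidate."""
    rng = random.Random(seed)
    counts = dict(unique_counts)
    for candidates in multimaps:
        weights = [unique_counts.get(locus, 0) for locus in candidates]
        if sum(weights) == 0:
            # No coverage information: fall back to a uniform random choice.
            chosen = rng.choice(candidates)
        else:
            chosen = rng.choices(candidates, weights=weights, k=1)[0]
        counts[chosen] = counts.get(chosen, 0) + 1
    return counts

unique_counts = {"geneA": 90, "geneB": 10}
# 50 reads map equally well to geneA/geneB, 5 reads to geneA/geneC.
multimaps = [("geneA", "geneB")] * 50 + [("geneA", "geneC")] * 5
result = allocate_multimaps(unique_counts, multimaps)
print(result)
```

Discarding multimaps corresponds to skipping the loop entirely, and purely random allocation corresponds to always taking the fallback branch.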
Reference genome
The most commonly used approach is to use the genome itself as the
reference. This has the benefit of being easy and not biased towards any
known annotation. However, reads that span exon boundaries will not map to
this reference. Thus, using the genome as a reference will give greater
coverage (at the same true expression level) to transcripts with fewer exons,
as they will contain fewer exon junctions.


In order to account for junction reads, it is common practice to build exon
junction libraries, in which reference sequences are constructed using
boundaries between annotated exons (a proxy genome generated with known
exonic sequences).
Another option is the de novo assembly of the transcriptome, for use as a
reference, using genome assembly tools.
A commonly used approach for transcriptome mapping is to progressively
increase the complexity of the mapping strategy to handle the unaligned reads.
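One way to picture an exon junction library is the Python sketch below. It assumes a single transcript given as an ordered list of (name, sequence) exons, joins only consecutive exons (so exon skipping is not covered), and assumes a read length of at least 2; the function name and sequences are hypothetical.

```python
def junction_library(exons, read_length):
    """For each pair of consecutive annotated exons, join the last
    (read_length - 1) bases of the upstream exon to the first
    (read_length - 1) bases of the downstream exon, so that any read
    spanning the junction fits inside the resulting sequence."""
    flank = read_length - 1
    library = {}
    for i in range(len(exons) - 1):
        (up_name, up_seq), (down_name, down_seq) = exons[i], exons[i + 1]
        library[f"{up_name}|{down_name}"] = up_seq[-flank:] + down_seq[:flank]
    return library

exons = [("e1", "AAAACCCC"), ("e2", "GGGGTTTT"), ("e3", "ACGTACGT")]
junctions = junction_library(exons, read_length=4)
print(junctions)
```

Reads that fail to align to the genome can then be aligned to these junction sequences as a second pass.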
Normalization
Normalization (2)
Within-library normalization allows quantification of expression levels of each gene
relative to other genes in the sample. Because longer transcripts have higher read
counts (at the same expression level), a common method for within-library
normalization is to divide the summarized counts by the length of the gene
[32,34]. The widely used RPKM (reads per kilobase of exon model per million
mapped reads) accounts for both library size and gene length effects in within-
sample comparisons.
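Written out, RPKM = 10^9 x C / (N x L), where C is the number of reads assigned to the gene, N the total number of mapped reads in the library and L the length of the exon model in base pairs. A minimal Python sketch of this formula, with made-up numbers:

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)

total_mapped = 10_000_000
long_gene = rpkm(count=500, gene_length_bp=2000, total_mapped_reads=total_mapped)
short_gene = rpkm(count=250, gene_length_bp=1000, total_mapped_reads=total_mapped)
print(long_gene, short_gene)
```

The longer gene has twice the reads but the same RPKM as the shorter one, which is the length effect the normalization is meant to remove.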

When testing individual genes for DE between samples, technical biases, such as
gene length and nucleotide composition, will mainly cancel out because the
underlying sequence used for summarization is the same between samples.
However, between-sample normalization is still essential for comparing counts
from different libraries relative to each other. The simplest and most commonly
used normalization adjusts by the total number of reads in the library [34,51],
accounting for the fact that more reads will be assigned to each gene if a sample is
sequenced to a greater depth.
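Total-count normalization can be sketched as a simple rescaling to counts per million. The sample names and counts below are invented for illustration.

```python
def counts_per_million(counts_by_sample):
    """Divide each gene count by its library's total read count (times 1e6)."""
    normalized = {}
    for sample, counts in counts_by_sample.items():
        total = sum(counts.values())
        normalized[sample] = {gene: c * 1e6 / total for gene, c in counts.items()}
    return normalized

# Sample B was sequenced twice as deeply as sample A, with no true change.
counts_by_sample = {
    "A": {"gene1": 100, "gene2": 900},
    "B": {"gene1": 200, "gene2": 1800},
}
cpm = counts_per_million(counts_by_sample)
print(cpm)
```

After scaling, the two libraries give the same value for each gene, so the difference in sequencing depth no longer looks like differential expression.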
Normalization: methods
Normalization (example)
NG-5045 (Diabetes)
Pool 1: 2, 4, 12, 16
Pool 2: 3, 9, 13, 14
Pool 3: 1, 5, 6, 7
Pool 4: 8, 10, 11, 15
Morbidly obese persons without insulin resistance: 2, 3, 4, 9, 12, 13, 14, 16.
Morbidly obese persons with high insulin resistance: 1, 5, 6, 7, 8, 10, 11, 15.
Differential expression
The goal of a DE analysis is to highlight genes that have changed
significantly in abundance across experimental conditions. In general,
this means taking a table of summarized count data for each library
and performing statistical testing between samples of interest.
Poisson-based analysis
In an early RNA-seq study using a single source of RNA, goodness-of-fit
statistics suggested that, for the majority of genes, the distribution of
counts across lanes was indeed Poisson. This has been independently confirmed
using a technical experiment, and software tools are readily available to
perform these analyses.
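A goodness-of-fit check of this kind can be sketched with a Pearson chi-square statistic. The Python example below is an illustration under stated assumptions, not the procedure of the cited study: it assumes that, under the Poisson model, the expected count of a gene in each lane is proportional to that lane's total reads, and the lane totals and counts are invented. It also assumes the gene has at least one read, otherwise the expected counts are zero.

```python
def poisson_gof_statistic(gene_counts, lane_totals):
    """Pearson chi-square statistic for one gene across lanes, with expected
    counts proportional to lane depth. Under the Poisson model it is
    approximately chi-square with (number of lanes - 1) degrees of freedom."""
    gene_total = sum(gene_counts)
    depth_total = sum(lane_totals)
    statistic = 0.0
    for observed, lane_total in zip(gene_counts, lane_totals):
        expected = gene_total * lane_total / depth_total
        statistic += (observed - expected) ** 2 / expected
    return statistic

lane_totals = [1_000_000] * 4
technical = poisson_gof_statistic([100, 110, 90, 100], lane_totals)
overdispersed = poisson_gof_statistic([40, 160, 60, 140], lane_totals)
print(technical, overdispersed)
```

With 3 degrees of freedom the 5% critical value is about 7.81, so the first gene is consistent with Poisson variation while the second shows clear extra-Poisson variation.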
Each RNA sample was sequenced in seven lanes, producing 12.9–14.7 million reads per lane at the 3 pM concentration
and 8.4–9.3 million reads at the 1.5 pM concentration. We aligned all reads against the whole genome. 40% of reads
mapped uniquely to a genomic location, and of these, 65% mapped to autosomal or sex chromosomes (the remainder mapped
Poisson based software
R packages in Bioconductor:

      • DEGseq (Wang et al., 2010)
Alternative strategies

Biological variability is not captured well by the Poisson assumption.
Hence, Poisson-based analyses for datasets with biological replicates will
be prone to high false positive rates resulting from the underestimation
of sampling error.
Goodness-of-fit tests indicate that a small proportion of genes show clear
deviations from this model (extra-Poisson variation), and although we
found that these deviations did not lead to false-positive identification of
differentially expressed genes at a stringent FDR, there is nevertheless
room for improved models that account for the extra-Poisson variation.
One natural strategy would be to replace the Poisson distribution with
another distribution, such as the quasi-Poisson distribution (Venables and
Ripley 2002) or the negative binomial distribution (Robinson and Smyth
2007), which have an additional parameter that estimates over- (or under-)
dispersion relative to a Poisson model.
Poisson-Negative Binomial
The negative binomial distribution can be used as an alternative to the Poisson distribution. It is
especially useful for discrete data over an unbounded positive range whose sample variance
exceeds the sample mean. In such cases, the observations are overdispersed with respect to a
Poisson distribution, for which the mean is equal to the variance. Hence a Poisson distribution is not
an appropriate model. Since the negative binomial distribution has one more parameter than the
Poisson, the second parameter can be used to adjust the variance independently of the mean.
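To make the role of the second parameter concrete, the Python sketch below assumes the parametrization variance = mean + dispersion x mean^2 (the one used by edgeR) and estimates the dispersion by the method of moments. This crude per-gene estimator is only an illustration, not the estimator edgeR uses, and the counts are invented; it assumes a non-zero mean.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion under variance = mean + phi * mean**2.
    Returns 0.0 when the sample variance does not exceed the mean
    (no evidence of overdispersion relative to Poisson)."""
    m = mean(counts)
    v = variance(counts)  # unbiased sample variance
    return max(0.0, (v - m) / m ** 2)

overdispersed = moment_dispersion([2, 4, 6, 8, 30])   # mean 10, variance 130
poisson_like = moment_dispersion([9, 10, 11, 10])     # variance below the mean
print(overdispersed, poisson_like)
```

A dispersion of 0 recovers the Poisson case, where the variance equals the mean.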
Negative-Binomial based analysis

In order to account for biological variability, methods that have been
developed for serial analysis of gene expression (SAGE) data have
recently been applied to RNA-seq data. The major difference between
SAGE and RNA-seq data is the scale of the datasets. To account for
biological variability, the negative binomial distribution has been used
as a natural extension of the Poisson distribution, requiring an
additional dispersion parameter to be estimated.
Description of SAGE
Serial analysis of gene expression (SAGE) is
a method for comprehensive analysis of
gene expression patterns.

Three principles underlie the SAGE
methodology:
1.   A short sequence tag (10-14bp) contains
     sufficient information to uniquely identify a
     transcript, provided that the tag is
     obtained from a unique position within each
     transcript;
2.   Sequence tags can be linked together to
     form long serial molecules that can be
     cloned and sequenced; and
3.   Quantitation of the number of times a
     particular tag is observed provides the
     expression level of the corresponding
     transcript.
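The third principle amounts to counting tags. The Python sketch below is a deliberately simplified illustration: it assumes a serial molecule made of back-to-back fixed-length tags and ignores the ditag structure and the anchoring-enzyme sites of real SAGE concatemers. The function name and sequences are hypothetical.

```python
from collections import Counter

def count_tags(concatemer, tag_length=10):
    """Cut a serial molecule into consecutive fixed-length tags and count
    how often each tag occurs."""
    tags = [concatemer[i:i + tag_length]
            for i in range(0, len(concatemer) - tag_length + 1, tag_length)]
    return Counter(tags)

# Three tags linked together; the first and third come from the same transcript.
concatemer = "AAAAACCCCC" + "GGGGGTTTTT" + "AAAAACCCCC"
tag_counts = count_tags(concatemer)
print(tag_counts)
```

The resulting table of tag counts has the same form as the per-gene count table used in RNA-seq, which is why SAGE methods carry over.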
Robinson, McCarthy, Smyth (2010)


edgeR paper
edgeR paper (2)
Robinson and Smyth 2008
Robinson and Smyth 2008 (2)
Negative-Binomial based software

 R packages in Bioconductor:

 • edgeR (Robinson et al., 2010): Exact test based on Negative
 Binomial distribution.

 • DESeq (Anders and Huber, 2010): Exact test based on
 Negative Binomial distribution.

 • baySeq (Hardcastle et al., 2010): Estimation of the posterior
 likelihood of differential expression (or more complex
 hypotheses) via empirical Bayesian methods using Poisson or
 NB distributions.
CLC Genomics Workbench approach

19.4.2.1 Kal et al.'s test (Z-test)
Kal et al.'s test [Kal et al., 1999] compares a single sample against another
single sample, and thus requires that each group in your experiment has only
one sample. The test relies on an approximation of the binomial distribution
by the normal distribution [Kal et al., 1999]. Because it considers proportions
rather than raw counts, the test is also suitable in situations where the sum
of counts differs between the samples.
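A test of this kind can be sketched as a two-proportion Z-test. The Python example below assumes the standard pooled form of the statistic; the exact formula used by CLC Genomics Workbench should be checked against its manual. The function name and the counts are hypothetical, and the sketch assumes at least one read for the gene across the two samples.

```python
from math import erfc, sqrt

def two_proportion_z_test(count_a, total_a, count_b, total_b):
    """Compare the proportion of reads assigned to a gene in two samples,
    using the normal approximation to the binomial with a pooled proportion.
    Returns (z, two-sided p-value)."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    pooled = (count_a + count_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value

# Same proportion in both libraries despite different library sizes.
z_same, p_same = two_proportion_z_test(40, 40000, 50, 50000)
# Proportion in sample B is about eight times higher.
z_diff, p_diff = two_proportion_z_test(40, 40000, 400, 50000)
print(z_same, p_same, z_diff, p_diff)
```

Because the statistic is built from proportions, the different library sizes in the first comparison do not produce a spurious difference.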

19.4.2.2 Baggerly et al.'s test (Beta-binomial)
Baggerly et al.'s test [Baggerly et al., 2003] compares the proportions of
counts in a group of samples against those of another group of samples, and
is suited to cases where replicates are available in the groups. The samples
are given different weights depending on their sizes (total counts). The
weights are obtained by assuming a Beta distribution on the proportions in a
group, and estimating these, along with the proportion of a binomial
distribution, by the method of moments. The result is a weighted t-type test
statistic.
Baggerly, K., Deng, L., Morris, J., and Aldaz, C. (2003). Differential
expression in SAGE: accounting for normal between-library variation.
Bioinformatics, 19(12):1477-1483.
Solution with edgeR
> library(edgeR)
> set.seed(101)
> n <- 200
> lib.sizes <- c(40000, 50000, 38000, 40000)
> p <- runif(n, min = 1e-04, 0.001)
> mu <- outer(p, lib.sizes)
> mu[1:5, 3:4] <- mu[1:5, 3:4] * 8
> y <- matrix(rnbinom(4 * n, size = 4, mu = mu), nrow = n)
> rownames(y) <- paste("tag", 1:nrow(y), sep = ".")
> y[1:10, ]
       [,1] [,2] [,3] [,4]
tag.1    15   13  117   77
tag.2     3    4   49   33
tag.3    25   56  302  332
tag.4    40   13  271   91
tag.5    13    3   51   56
tag.6    14    7   31   18
tag.7    16   39   19    9
tag.8     6   28    6    6
tag.9    10   42   80   14
tag.10   33   25    5   27
> d <- DGEList(counts = y, group = rep(1:2, each = 2), lib.size = lib.sizes)
> d <- estimateCommonDisp(d)
> de.common <- exactTest(d)
Comparison of groups: 2 - 1
> topTags(de.common)
Comparison of groups: 2 - 1
           logConc     logFC       PValue         FDR
tag.184 -13.636760 -5.236853 6.112570e-05 0.005195714
tag.2   -11.769438  3.766465 6.405229e-05 0.005195714
tag.3    -8.550981  3.214682 7.793571e-05 0.005195714
tag.4    -9.188394  2.911743 3.300004e-04 0.013214944
tag.1   -10.135230  2.984351 3.303736e-04 0.013214944
tag.5   -10.944756  2.868619 1.035516e-03 0.034517212
tag.105 -10.693557  2.618355 2.337750e-03 0.066792856
tag.164 -11.253348 -2.209660 1.090272e-02 0.233310771
tag.14  -11.258031  2.238669 1.090272e-02 0.233310771
tag.123 -13.277812 -2.756096 1.166554e-02 0.233310771
Suggested pipeline ?

• Quality Control: fastQC, DNAA

• Mapping the reads:
    • Obtaining the reference
    • Aligning reads to the reference: BOWTIE

• Differential Expression
    • Summarization of reads
    • Differential Expression Testing: edgeR

• Gene Set testing (GO): goseq
Experimental design ?

Many of the current strategies for DE analysis of count data are
limited to simple experimental designs, such as pairwise or multiple
group comparisons. To the best of our knowledge, no general
methods have been proposed for the analysis of more complex
designs, such as paired samples or time course experiments, in the
context of RNA-seq data. In the absence of such methods,
researchers have transformed their count data and used tools
appropriate for continuous data. Generalized linear models provide
the logical extension to the count models presented above, and
clever strategies to share information over all genes will need to be
developed; software tools now provide these methods (such as
edgeR).
Auer, P.L., and Doerge, R.W. (2010). Statistical Design and
Analysis of RNA Sequencing Data. Genetics, 185, 405-416.
Integration with other data
There is wide scope for integrating the results of RNA-seq data with
other sources of biological data to establish a more complete picture of
gene regulation [69]. For example, RNA-seq has been used in
conjunction with genotyping data to identify genetic loci responsible
for variation in gene expression between individuals (expression
quantitative trait loci or eQTLs) [35,70]. Furthermore, integration of
expression data with transcription factor binding, RNA interference,
histone modification and DNA methylation information has the
potential for greater understanding of a variety of regulatory
mechanisms. A few reports of these ‘integrative’ analyses have
emerged recently [71-73].

TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 

Recently uploaded (20)

Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
Introducing Milvus Lite: Easy-to-Install, Easy-to-Use vector database for you...
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 

EiB Seminar from Antoni Miñarro, Ph.D

  • 2. RNA-seq RNA-seq, also called "Whole Transcriptome Shotgun Sequencing" ("WTSS"), refers to the use of high-throughput sequencing technologies to sequence cDNA in order to get information about a sample's RNA content.
  • 3. Analysis of RNA-seq data Single nucleotide variation discovery: currently being applied to cancer research and microbiology. Fusion gene detection: fusion genes have gained attention because of their relationship with cancer. The idea follows from the process of aligning the short transcriptomic reads to a reference genome. Most of the short reads will fall within one complete exon, and a smaller but still large set would be expected to map to known exon-exon junctions. The remaining unmapped short reads would then be further analyzed to determine whether they match an exon-exon junction where the exons come from different genes. This would be evidence of a possible fusion. Gene expression
  • 7. Gene expression Detect differences in gene level expression between samples. This sort of analysis is particularly relevant for controlled experiments comparing expression in wild-type and mutant strains of the same tissue, comparing treated versus untreated cells, cancer versus normal, and so on.
  • 8. Differential expression (2) RNA-seq gives a discrete measurement for each gene. Transformed count data are not well approximated by continuous distributions, especially in the lower count range and for small samples. Therefore, statistical models appropriate for count data are vital to extracting the most information from RNA-seq data. In general, the Poisson distribution forms the basis for modeling RNA-seq count data.
  • 10. Mapping The first step in this procedure is the read mapping or alignment: to find the unique location where a short read is identical to the reference. However, in reality the reference is never a perfect representation of the actual biological source of RNA being sequenced: there are SNPs and indels, and the reads arise from a spliced transcriptome rather than a genome. Short reads can sometimes align perfectly to multiple locations and can contain sequencing errors that have to be accounted for. The real task is to find the location where each short read best matches the reference, while allowing for errors and structural variation.
  • 11. Aligners Aligners differ in how they handle ‘multimaps’ (reads that map equally well to several locations). Most aligners either discard multimaps, allocate them randomly or allocate them on the basis of an estimate of local coverage. Paired-end reads reduce the problem of multi-mapping, as both ends of the cDNA fragment from which the short reads were generated should map nearby on the transcriptome, allowing the ambiguity of multimaps to be resolved in most circumstances.
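The three ways of handling multimaps named on this slide can be sketched in a few lines of base R. This is only an illustration, not the logic of any particular aligner: the function `allocate_multimap`, its arguments, the locus names and the pseudo-count of 1 are all hypothetical.

```r
# Hypothetical helper: decide where one multimapped read should go.
# candidates:      names of the locations the read maps to equally well
# unique_coverage: named numeric vector of uniquely mapped reads per location
allocate_multimap <- function(candidates, unique_coverage,
                              strategy = c("discard", "random", "coverage")) {
  strategy <- match.arg(strategy)
  if (strategy == "discard") {
    # Strategy 1: throw the read away
    return(NA_character_)
  }
  if (strategy == "random") {
    # Strategy 2: pick one location uniformly at random
    return(candidates[sample.int(length(candidates), 1)])
  }
  # Strategy 3: pick a location with probability proportional to local
  # coverage (the +1 pseudo-count is an assumption, so that locations
  # without unique reads can still be chosen)
  weights <- unique_coverage[candidates] + 1
  candidates[sample.int(length(candidates), 1, prob = weights)]
}

set.seed(1)
coverage <- c(locusA = 90, locusB = 10)  # toy coverage values
hits <- c("locusA", "locusB")
res_discard  <- allocate_multimap(hits, coverage, "discard")
res_random   <- allocate_multimap(hits, coverage, "random")
res_coverage <- allocate_multimap(hits, coverage, "coverage")
```

With paired-end data the candidate list would typically shrink before any of these strategies is needed, because both mates have to map near each other.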
  • 12. Reference genome The most commonly used approach is to use the genome itself as the reference. This has the benefit of being easy and not biased towards any known annotation. However, reads that span exon boundaries will not map to this reference. Thus, using the genome as a reference will give greater coverage (at the same true expression level) to transcripts with fewer exons, as they will contain fewer exon junctions. In order to account for junction reads, it is common practice to build exon junction libraries in which reference sequences are constructed using boundaries between annotated exons, a proxy genome generated with known exonic sequences. Another option is the de novo assembly of the transcriptome, for use as a reference, using genome assembly tools. A commonly used approach for transcriptome mapping is to progressively increase the complexity of the mapping strategy to handle the unaligned reads.
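As a rough illustration of an exon junction library, the base R sketch below joins the end of each upstream exon to the start of each downstream exon, keeping `read_length - 1` bases on each side so that a read aligned to the junction sequence must cross the boundary. The function name `junction_library`, the flank rule and the toy exon sequences are assumptions made for this example.

```r
# Hypothetical helper: build junction sequences for all ordered exon pairs.
# exons:       named character vector of exon sequences in 5' to 3' order
# read_length: length of the short reads
junction_library <- function(exons, read_length) {
  flank <- read_length - 1
  pairs <- combn(length(exons), 2)  # columns are (upstream, downstream) indices
  seqs <- apply(pairs, 2, function(ij) {
    up <- exons[ij[1]]
    down <- exons[ij[2]]
    up_start <- max(1, nchar(up) - flank + 1)
    # last `flank` bases of the upstream exon + first `flank` of the downstream
    paste0(substr(up, up_start, nchar(up)), substr(down, 1, flank))
  })
  names(seqs) <- apply(pairs, 2, function(ij) {
    paste(names(exons)[ij], collapse = "_")
  })
  seqs
}

# Toy exons (made-up sequences)
exons <- c(e1 = "AAAACCCC", e2 = "GGGGTTTT", e3 = "ACGTACGT")
junctions <- junction_library(exons, read_length = 5)
junctions
```

Reads that fail to align to the genome could then be aligned against `junctions` as a second pass, in line with the progressive mapping strategy mentioned on the slide.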
  • 14. Normalization (2) Within-library normalization allows quantification of expression levels of each gene relative to other genes in the sample. Because longer transcripts have higher read counts (at the same expression level), a common method for within-library normalization is to divide the summarized counts by the length of the gene [32,34]. The widely used RPKM (reads per kilobase of exon model per million mapped reads) accounts for both library size and gene length effects in within-sample comparisons. When testing individual genes for DE between samples, technical biases, such as gene length and nucleotide composition, will mainly cancel out because the underlying sequence used for summarization is the same between samples. However, between-sample normalization is still essential for comparing counts from different libraries relative to each other. The simplest and most commonly used normalization adjusts by the total number of reads in the library [34,51], accounting for the fact that more reads will be assigned to each gene if a sample is sequenced to a greater depth.
  • 17. NG-5045 (Diabetes)
      Pool 1: 2, 4, 12, 16
      Pool 2: 3, 9, 13, 14
      Pool 3: 1, 5, 6, 7
      Pool 4: 8, 10, 11, 15
      Morbidly obese persons without insulin resistance: 2, 3, 4, 9, 12, 13, 14, 16.
      Morbidly obese persons with high insulin resistance: 1, 5, 6, 7, 8, 10, 11, 15.
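The RPKM definition above translates directly into a one-line function. The sketch below uses base R; the gene names, counts, lengths and library size are made-up toy values.

```r
# RPKM = reads * 1e9 / (gene length in bp * total mapped reads),
# i.e. reads per kilobase of exon model per million mapped reads.
rpkm <- function(counts, gene_length_bp, total_mapped_reads) {
  counts * 1e9 / (gene_length_bp * total_mapped_reads)
}

# Toy library with 20 million mapped reads (hypothetical numbers)
counts     <- c(geneA = 500,  geneB = 500,  geneC = 2000)
lengths_bp <- c(geneA = 1000, geneB = 4000, geneC = 4000)
rpkm_values <- rpkm(counts, lengths_bp, total_mapped_reads = 2e7)
rpkm_values
# geneA = 25, geneB = 6.25, geneC = 25
```

Here geneA and geneB have the same raw count but different RPKM because geneB is four times longer, while geneC has four times the count of geneA and the same RPKM.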
  • 22. Differential expression The goal of a DE analysis is to highlight genes that have changed significantly in abundance across experimental conditions. In general, this means taking a table of summarized count data for each library and performing statistical testing between samples of interest. Transformed count data are not well approximated by continuous distributions, especially in the lower count range and for small samples. Therefore, statistical models appropriate for count data are vital to extracting the most information from RNA-seq data.
  • 23. Poisson-based analysis In an early RNA-seq study using a single source of RNA, goodness-of-fit statistics suggested that the distribution of counts across lanes for the majority of genes was indeed Poisson distributed. This has been independently confirmed using a technical experiment, and software tools are readily available to perform these analyses.
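One way such a goodness-of-fit check could look for a single gene is a Pearson chi-square statistic comparing the observed count in each lane with the count expected under a common Poisson rate scaled by lane depth. The slide does not say which statistic the study used, so the choice of Pearson's statistic, the function name `poisson_gof` and the toy numbers are assumptions.

```r
# Hypothetical helper: Poisson goodness of fit across lanes for one gene.
# lane_counts: reads assigned to the gene in each lane
# lane_totals: total mapped reads in each lane
poisson_gof <- function(lane_counts, lane_totals) {
  # Expected counts if every lane shares the same per-read rate for this gene
  expected <- sum(lane_counts) * lane_totals / sum(lane_totals)
  statistic <- sum((lane_counts - expected)^2 / expected)
  df <- length(lane_counts) - 1
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
  list(statistic = statistic, df = df, p_value = p_value)
}

# Toy gene measured in four lanes of equal depth (made-up values)
gof <- poisson_gof(lane_counts = c(10, 12, 9, 11),
                   lane_totals = rep(1e6, 4))
gof$statistic  # 5 / 10.5, about 0.476
gof$p_value    # large, so no evidence against the Poisson model here
```

A small p-value for a gene would point to extra-Poisson variation across lanes.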
  • 24. Each RNA sample was sequenced in seven lanes, producing 12.9–14.7 million reads per lane at the 3 pM concentration and 8.4–9.3 million reads at the 1.5 pM concentration. We aligned all reads against the whole genome. 40% of reads mapped uniquely to a genomic location, and of these, 65% mapped to autosomal or sex chromosomes (the remainder mapped
  • 26. Poisson-based software R packages in Bioconductor: • DEGseq (Wang et al., 2010)
  • 27. Alternative strategies Biological variability is not captured well by the Poisson assumption. Hence, Poisson-based analyses for datasets with biological replicates will be prone to high false positive rates resulting from the underestimation of sampling error. Goodness-of-fit tests indicate that a small proportion of genes show clear deviations from this model (extra-Poisson variation), and although we found that these deviations did not lead to false-positive identification of differentially expressed genes at a stringent FDR, there is nevertheless room for improved models that account for the extra-Poisson variation. One natural strategy would be to replace the Poisson distribution with another distribution, such as the quasi-Poisson distribution (Venables and Ripley 2002) or the negative binomial distribution (Robinson and Smyth 2007), which have an additional parameter that estimates over- (or under-) dispersion relative to a Poisson model.
  • 28. Poisson-Negative Binomial The negative binomial distribution can be used as an alternative to the Poisson distribution. It is especially useful for discrete data over an unbounded positive range whose sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a Poisson distribution, for which the mean is equal to the variance. Hence a Poisson distribution is not an appropriate model. Since the negative binomial distribution has one more parameter than the Poisson, the second parameter can be used to adjust the variance independently of the mean.
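The difference between the two distributions can be seen in a small base R simulation. In R's `rnbinom(size, mu)` parameterization the variance is `mu + mu^2 / size`, so `1 / size` plays the role of the extra dispersion parameter. The mean of 50 and `size = 4` are arbitrary illustrative choices.

```r
set.seed(1)
n    <- 1e5
mu   <- 50
size <- 4   # dispersion = 1 / size = 0.25

x_pois <- rpois(n, lambda = mu)
x_nb   <- rnbinom(n, size = size, mu = mu)

# Variance-to-mean ratio: close to 1 for Poisson, well above 1 for NB
ratio_pois <- var(x_pois) / mean(x_pois)
ratio_nb   <- var(x_nb) / mean(x_nb)

# Theoretical NB variance under this parameterization
nb_var_theory <- mu + mu^2 / size   # 50 + 2500 / 4 = 675
c(poisson = ratio_pois, negbin = ratio_nb)
```

Both samples have the same mean, but only the negative binomial one shows the overdispersion described on the slide.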
  • 29. Negative-Binomial based analysis In order to account for biological variability, methods that have been developed for serial analysis of gene expression (SAGE) data have recently been applied to RNA-seq data. The major difference between SAGE and RNA-seq data is the scale of the datasets. To account for biological variability, the negative binomial distribution has been used as a natural extension of the Poisson distribution, requiring an additional dispersion parameter to be estimated.
  • 30. Description of SAGE Serial analysis of gene expression (SAGE) is a method for comprehensive analysis of gene expression patterns. Three principles underlie the SAGE methodology: 1. A short sequence tag (10-14 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript; 2. Sequence tags can be linked together to form long serial molecules that can be cloned and sequenced; and 3. Quantification of the number of times a particular tag is observed provides the expression level of the corresponding transcript.
  • 31. Robinson, McCarthy, Smyth (2010)
  • 32. Robinson, McCarthy, Smyth (2010) edgeR paper edgeR paper (2)
  • 34. Robinson and Smyth 2008 (2)
  • 35. Negative-Binomial based software R packages in Bioconductor: • edgeR (Robinson et al., 2010): exact test based on the negative binomial distribution. • DESeq (Anders and Huber, 2010): exact test based on the negative binomial distribution. • baySeq (Hardcastle et al., 2010): estimation of the posterior likelihood of differential expression (or more complex hypotheses) via empirical Bayesian methods using Poisson or NB distributions.
  • 36. CLC Genomics Workbench approach 19.4.2.1 Kal et al.'s test (Z-test) Kal et al.'s test [Kal et al., 1999] compares a single sample against another single sample, and thus requires that each group in your experiment has only one sample. The test relies on an approximation of the binomial distribution by the normal distribution [Kal et al., 1999]. Because it considers proportions rather than raw counts, the test is also suitable in situations where the sum of counts is different between the samples. 19.4.2.2 Baggerly et al.'s test (Beta-binomial) Baggerly et al.'s test [Baggerly et al., 2003] compares the proportions of counts in a group of samples against those of another group of samples, and is suited to cases where replicates are available in the groups. The samples are given different weights depending on their sizes (total counts). The weights are obtained by assuming a Beta distribution on the proportions in a group, and estimating these, along with the proportion of a binomial distribution, by the method of moments. The result is a weighted t-type test statistic.
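A test on proportions using the normal approximation to the binomial can be sketched as a pooled two-proportion Z-test. Whether this matches the exact formula used in CLC Genomics Workbench is an assumption; the function name `two_sample_z` is hypothetical, and the example numbers are the first and third samples of tag.1 from the edgeR example in this deck.

```r
# Hypothetical helper: Z-test comparing one tag between two single samples.
# x1, x2: counts for the tag in samples 1 and 2
# n1, n2: total counts (library sizes) of samples 1 and 2
two_sample_z <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p0 <- (x1 + x2) / (n1 + n2)  # pooled proportion under the null hypothesis
  z <- (p1 - p2) / sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
  list(z = z, p_value = 2 * pnorm(-abs(z)))
}

# tag.1: 15 reads out of 40000 versus 117 reads out of 38000
res <- two_sample_z(15, 40000, 117, 38000)
res$z        # strongly negative: the tag is less abundant in sample 1
res$p_value

# Equal proportions with unequal totals give z = 0
res_equal <- two_sample_z(10, 1000, 20, 2000)
```

Working with `x / n` rather than raw counts is what lets the test cope with samples of different total size.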
  • 37. Baggerly, K., Deng, L., Morris, J., and Aldaz, C. (2003). Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics, 19(12):1477-1483.
  • 41. Solution with edgeR
      > library(edgeR)
      > # Simulate negative binomial counts: 200 tags, 4 libraries
      > set.seed(101)
      > n <- 200
      > lib.sizes <- c(40000, 50000, 38000, 40000)
      > p <- runif(n, min = 1e-04, max = 0.001)
      > mu <- outer(p, lib.sizes)
      > # Make the first five tags 8 times more abundant in libraries 3 and 4
      > mu[1:5, 3:4] <- mu[1:5, 3:4] * 8
      > y <- matrix(rnbinom(4 * n, size = 4, mu = mu), nrow = n)
      > rownames(y) <- paste("tag", 1:nrow(y), sep = ".")
      > y[1:10, ]
             [,1] [,2] [,3] [,4]
      tag.1    15   13  117   77
      tag.2     3    4   49   33
      tag.3    25   56  302  332
      tag.4    40   13  271   91
      tag.5    13    3   51   56
      tag.6    14    7   31   18
      tag.7    16   39   19    9
      tag.8     6   28    6    6
      tag.9    10   42   80   14
      tag.10   33   25    5   27
      > # Two groups of two libraries; common dispersion, then the exact test
      > d <- DGEList(counts = y, group = rep(1:2, each = 2), lib.size = lib.sizes)
      > d <- estimateCommonDisp(d)
      > de.common <- exactTest(d)
      Comparison of groups: 2 - 1
      > topTags(de.common)
      Comparison of groups: 2 - 1
                 logConc     logFC       PValue         FDR
      tag.184 -13.636760 -5.236853 6.112570e-05 0.005195714
      tag.2   -11.769438  3.766465 6.405229e-05 0.005195714
      tag.3    -8.550981  3.214682 7.793571e-05 0.005195714
      tag.4    -9.188394  2.911743 3.300004e-04 0.013214944
      tag.1   -10.135230  2.984351 3.303736e-04 0.013214944
      tag.5   -10.944756  2.868619 1.035516e-03 0.034517212
      tag.105 -10.693557  2.618355 2.337750e-03 0.066792856
      tag.164 -11.253348 -2.209660 1.090272e-02 0.233310771
      tag.14  -11.258031  2.238669 1.090272e-02 0.233310771
      tag.123 -13.277812 -2.756096 1.166554e-02 0.233310771
  • 42. Suggested pipeline?
      • Quality control: fastQC, DNAA
      • Mapping the reads:
          • Obtaining the reference
          • Aligning reads to the reference: BOWTIE
      • Differential expression:
          • Summarization of reads
          • Differential expression testing: edgeR
      • Gene set testing (GO): goseq
  • 43. Experimental design ? Many of the current strategies for DE analysis of count data are limited to simple experimental designs, such as pairwise or multiple group comparisons. To the best of our knowledge, no general methods have been proposed for the analysis of more complex designs, such as paired samples or time course experiments, in the context of RNA-seq data. In the absence of such methods, researchers have transformed their count data and used tools appropriate for continuous data. Generalized linear models provide the logical extension to the count models presented above, and clever strategies to share information over all genes will need to be developed; software tools now provide these methods (such as edgeR). Auer, P.L., and Doerge R.W. (2010) Statistical Design and Analysis of RNA Sequencing Data. Genetics, 185, 405-416.
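To illustrate how a generalized linear model extends the count models to a paired design, the sketch below fits a Poisson GLM with base R's `glm`, using the log library size as an offset. It is a stand-in for illustration only: edgeR's GLM functions work with the negative binomial distribution, which base R's `glm` does not offer, and the data frame `toy` with its three patients is invented.

```r
# Toy paired design: one gene, three patients, each measured untreated and
# treated. Counts are made up so that treatment exactly doubles expression.
toy <- data.frame(
  count     = c(10, 20, 40, 20, 40, 80),
  patient   = factor(rep(c("p1", "p2", "p3"), times = 2)),
  treatment = factor(rep(c("untreated", "treated"), each = 3),
                     levels = c("untreated", "treated")),
  lib_size  = rep(1e6, 6)
)

# The patient term absorbs between-patient differences (the pairing);
# the offset accounts for sequencing depth.
fit <- glm(count ~ patient + treatment + offset(log(lib_size)),
           family = poisson(), data = toy)

log_fc <- coef(fit)[["treatmenttreated"]]
fold_change <- exp(log_fc)
fold_change  # about 2: treated versus untreated within patients
```

A time course or another multi-factor design would be handled the same way, by adding terms to the model formula.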
  • 44. Integration with other data There is wide scope for integrating the results of RNA-seq data with other sources of biological data to establish a more complete picture of gene regulation [69]. For example, RNA-seq has been used in conjunction with genotyping data to identify genetic loci responsible for variation in gene expression between individuals (expression quantitative trait loci or eQTLs) [35,70]. Furthermore, integration of expression data with transcription factor binding, RNA interference, histone modification and DNA methylation information has the potential for greater understanding of a variety of regulatory mechanisms. A few reports of these ‘integrative’ analyses have emerged recently [71-73].