RNA-Seq_Presentation

RNA-Seq analysis pipeline for
unraveling traits in crops:
Tuxedo protocol
By Abdulsalam Toyin

OUTLINE
RNA-Seq
Biology of Gene Expression
Concept of Genes and Transcripts
Alignment
  RNA-Seq specific aligners: TopHat, STAR
  SAM/BAM format
Quantification
  Cufflinks (gene level, isoform level, splicing…)
  Alternative methods
Post-Alignment Quality Control
Other analyses with RNA-Seq

RNA-Seq
The diagram summarizes the RNA-Seq technique: (1) RNA is isolated from a panel of tissues or treatments. (2) A pool of core tissues (e.g., leaf, root, flower, fruit) is used to create the reference cDNA library, which is then sequenced.

RNA-Seq
• RNA-Seq seeks to answer the following questions:
  • What genes are expressed in a sample?
  • What transcripts are expressed for each gene?
  • What are the expression levels of those genes and transcripts?
  • How do expression levels and splicing patterns differ between two conditions?

Biology of Gene Expression
Gene expression is a complex process. “Genes take a body of their own in a process called gene expression”. The process starts with transcription, during which RNA polymerase creates a copy of the gene, nucleotide by nucleotide, as a single-stranded RNA molecule. While still in the nucleus, the new transcript is highly unstable; it is stabilized by 5’ capping, 3’ cleavage and the addition of a long poly(A) tail.

Genes are encoded in the genome, each occupying a specific location. A gene consists of informative blocks (“exons”) and non-informative blocks (“introns”). As the transcript is processed, the introns are spliced out and the exons are joined together to form the mature mRNA.

Mapping
• The problem: how to match millions of reads (fragments of ~100 characters) against a reference sequence of billions of characters.
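To make the scale problem concrete, here is a minimal sketch of index-based mapping: the reference is indexed once, then each read is looked up by a short seed instead of being compared against every position. It is a toy under stated assumptions: it uses a plain hash table of k-mers and exact matching, whereas Bowtie relies on a compressed Burrows-Wheeler index and tolerates mismatches. The function names and the example sequences are invented for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return the 0-based positions where the read matches the reference exactly."""
    seed = read[:k]  # look up only the first k bases of the read
    hits = []
    for pos in index.get(seed, []):
        # Verify the full read at each candidate position found by the seed.
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Made-up reference and reads, for illustration only.
reference = "ACGTTGCAACGTAGCTAGGCTA"
index = build_kmer_index(reference, k=4)

print(map_read("GCAACGTAG", reference, index, k=4))  # a read with one location
print(map_read("ACGT", reference, index, k=4))       # a read with two locations
```

The second read matches two places in the reference, which is the small-scale version of a multi-mapping read.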

Mapping for RNA-Seq
• A transcript may result from splicing, and the mapping strategy must account for this.
• Either we map to the transcriptome, or we map to the genome with a “splice-aware” aligner.
• TopHat was one of the first splice-aware aligners for RNA-Seq data and became one of the most widely used.
• It was built “on top” of Bowtie.

Mapping for RNA-Seq
• TopHat first maps reads to the genome; reads that do not align are broken into pieces, which are mapped separately and then extended to “possible splice junctions”.
• We can also pass an annotation file (GTF format) as an argument. TopHat then:
  • Extracts the transcripts and builds an index.
  • Maps to the “transcriptome” first and then to the genome.
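The idea of breaking a read into pieces and extending to a splice junction can be sketched as follows. This is a toy illustration, not TopHat's actual algorithm: it assumes exact matches, a single intron per read, and it ignores the GT-AG splice signals that real aligners use to place the junction. The function name and the sequences are invented.

```python
def find_splice_junction(read, genome, seg_len):
    """Locate a read that spans one intron by anchoring its two ends.

    Returns (read_start, intron_start, intron_end) as 0-based coordinates,
    with the intron occupying genome[intron_start:intron_end], or None.
    """
    left, right = read[:seg_len], read[-seg_len:]
    left_pos = genome.find(left)
    if left_pos == -1:
        return None
    right_pos = genome.find(right, left_pos + seg_len)
    if right_pos == -1:
        return None
    right_end = right_pos + seg_len

    # Extend from both anchors: try each split point of the read and keep the
    # first one where the prefix and the suffix both match the genome.
    for split in range(seg_len, len(read) - seg_len + 1):
        prefix, suffix = read[:split], read[split:]
        intron_start = left_pos + split
        intron_end = right_end - len(suffix)
        if (intron_end > intron_start
                and genome[left_pos:intron_start] == prefix
                and genome[intron_end:right_end] == suffix):
            return left_pos, intron_start, intron_end
    return None


# Made-up gene: two exons separated by one intron.
exon1 = "ACGTACGGTCA"
intron = "GTAAG" + "T" * 8 + "CAG"
exon2 = "CCTGAATCGGA"
genome = exon1 + intron + exon2

# A read from the spliced transcript, crossing the exon-exon junction.
read = exon1[-6:] + exon2[:6]
print(find_splice_junction(read, genome, seg_len=4))
```

The read cannot be aligned end to end against the genome, but its two ends anchor on either side of the intron, and the gap between them is reported as the junction.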

TopHat
• Using an annotation:
  • Improves accuracy, mostly around splice junctions.
  • Introduces a “bias” in favor of known transcripts.
  • Gives less power to detect novel transcripts or novel isoforms.
  • Beware if your annotation is incomplete.
• At the mapping step, it is better to keep multi-mappers (within a reasonable limit, e.g. 10 or 20 hits). TopHat provides an option to control multiple mappings.
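As a rough sketch of how these choices translate into a run, the helper below assembles a TopHat command line with an optional annotation (`-G`) and a multi-mapper limit (`-g`). The helper itself, the file names and the index prefix are hypothetical; the flags are the ones documented for TopHat 2, so check `tophat --help` for the version you have installed. The command is only built and printed here, not executed.

```python
import shlex


def tophat_command(index_prefix, reads, out_dir, gtf=None,
                   max_multihits=20, threads=1):
    """Build (but do not run) a TopHat command as a list of arguments."""
    cmd = ["tophat",
           "-o", out_dir,             # output directory
           "-p", str(threads),        # number of threads
           "-g", str(max_multihits)]  # maximum alignments kept per read
    if gtf is not None:
        cmd += ["-G", gtf]            # known transcripts in GTF format
    cmd.append(index_prefix)          # Bowtie index prefix of the genome
    cmd += reads                      # one FASTQ file, or two for paired-end
    return cmd


# Hypothetical file names, for illustration only.
cmd = tophat_command("genome_index",
                     ["sample_1.fastq", "sample_2.fastq"],
                     "tophat_out",
                     gtf="genes.gtf",
                     max_multihits=10)
print(shlex.join(cmd))
```

Leaving `gtf=None` gives an annotation-free run, which is the safer choice when the annotation is known to be incomplete and novel isoforms are of interest.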

SAM/BAM FORMAT
• A SAM file is composed of two parts (BAM is the compressed binary equivalent of SAM):
  • Header: describes the source of the data, the reference sequence, the method of alignment and so on.
  • Alignments: describe the reads, their location and the nature of each alignment.
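The two parts are easy to tell apart in a SAM file: header lines start with `@`, and every other line is one tab-separated alignment record with eleven mandatory fields followed by optional tags. A minimal parsing sketch, assuming plain-text SAM input; the example records are made up, and real pipelines would normally use a dedicated library or samtools instead.

```python
def parse_sam(lines):
    """Split SAM text into header lines and parsed alignment records."""
    header, alignments = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        fields = line.split("\t")
        alignments.append({
            "qname": fields[0],      # read name
            "flag": int(fields[1]),  # bitwise flag (strand, paired, unmapped, ...)
            "rname": fields[2],      # reference sequence name
            "pos": int(fields[3]),   # 1-based leftmost mapping position
            "mapq": int(fields[4]),  # mapping quality
            "cigar": fields[5],      # alignment description; N marks a skipped region
            "tags": fields[11:],     # optional fields such as NH (number of hits)
        })
    return header, alignments


# Made-up example: three header lines and one spliced alignment.
sam_text = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@PG\tID:TopHat",
    "read1\t0\tchr1\t101\t50\t20M300N30M\t*\t0\t0\t"
    + "A" * 50 + "\t" + "I" * 50 + "\tNH:i:1",
]
header, alignments = parse_sam(sam_text)
print(len(header), alignments[0]["cigar"], alignments[0]["tags"])
```

In the example record, the CIGAR string `20M300N30M` describes a read split across a 300-base skipped region, which is how a spliced alignment appears in SAM.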

Quantification
• In general, we want more than just alignments.
• In theory, RNA-Seq is a quantitative assay, and we want to measure gene expression.

Cufflinks
• Working from the alignments, Cufflinks has two goals:
  • Transcript assembly
  • Transcript quantification

Cufflinks
• Assembly: find the minimal number of paths in a graph that fully represent the alignments.
• Quantification: estimate the most likely abundance of the different isoforms.
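The quantification step has to deal with reads that are compatible with more than one isoform. One common way to illustrate "most likely abundance" is an expectation-maximization loop that shares ambiguous reads between isoforms in proportion to the current estimates. The sketch below is a deliberate simplification and not Cufflinks' actual statistical model: it ignores transcript length and fragment-size effects and returns the fraction of reads assigned to each isoform. The isoform names and read counts are invented.

```python
def estimate_isoform_fractions(read_compat, isoforms, n_iter=200):
    """Estimate the fraction of reads coming from each isoform.

    read_compat: one set per read, listing the isoforms it is compatible with.
    """
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        counts = {iso: 0.0 for iso in isoforms}
        # E-step: share each read between its compatible isoforms.
        for compat in read_compat:
            total = sum(theta[iso] for iso in compat)
            if total == 0:
                continue
            for iso in compat:
                counts[iso] += theta[iso] / total
        # M-step: re-estimate the fractions from the shared counts.
        n = sum(counts.values())
        if n == 0:
            return theta
        theta = {iso: counts[iso] / n for iso in isoforms}
    return theta


# Made-up example: 60 reads fall in exons shared by both isoforms,
# 30 reads are specific to isoform A and 10 are specific to isoform B.
reads = [{"A", "B"}] * 60 + [{"A"}] * 30 + [{"B"}] * 10
theta = estimate_isoform_fractions(reads, ["A", "B"])
print(theta)
```

The isoform-specific reads (30 versus 10) determine how the 60 ambiguous reads are split, so the estimate settles at three quarters of the reads for isoform A.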

Unit of Abundance
• We can only obtain relative measures: out of X reads in the library, Y map to gene A and Z map to gene B. If we change X, both Y and Z change.
• Now, if gene A and gene B have the same number of reads mapped, do they have similar expression levels?
  • If gene A and gene B have the same size: yes.
  • Otherwise: no, because at the same expression level a longer gene receives more reads than a shorter gene.

FPKM
Fragments Per Kilobase of transcript per Million mapped fragments.
• Kilobase: not the annotated gene size, but the effective size.
• Effective length: the number of possible start sites on the transcript (depends on the estimated fragment size).
• Million: millions of mapped reads (not millions of sequenced reads).
• To a first approximation, FPKM lets you compare the expression of a gene between samples (because it accounts for differences in library depth).
• It also lets you compare the expression of two genes within the same sample (because it accounts for differences in gene length).
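The definition can be written out as a short calculation. This sketch assumes the common simplification that the effective length is the transcript length minus the mean fragment length plus one; the function names and all numbers are made up for illustration.

```python
def effective_length(transcript_length, mean_fragment_length):
    """Number of positions where a fragment of the mean size can start."""
    return max(transcript_length - mean_fragment_length + 1, 1)


def fpkm(fragments, transcript_length, mean_fragment_length,
         total_mapped_fragments):
    """Fragments per kilobase of effective length per million mapped fragments."""
    eff_len = effective_length(transcript_length, mean_fragment_length)
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (eff_len * total_mapped_fragments)


# Made-up example: two genes receive the same number of fragments,
# but gene B is about three times longer than gene A.
total_mapped = 20_000_000
fpkm_a = fpkm(600, 1199, 200, total_mapped)  # effective length 1000
fpkm_b = fpkm(600, 3199, 200, total_mapped)  # effective length 3000
print(fpkm_a, fpkm_b)
```

With equal fragment counts, the shorter gene A comes out with three times the FPKM of gene B (30 versus 10), which is the length correction described above.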

Cufflinks and Cuffdiff
• In addition to estimating isoform expression, Cufflinks outputs gene-level expression (more or less the sum of the different isoforms).
• The Cufflinks package includes a program called Cuffdiff for differential expression.
• Cuffdiff estimates isoform expression in two groups (each of which can be composed of multiple replicates) and performs statistical tests for:
  • Differential gene expression
  • Differential transcript levels
  • Alternative splicing
  • Differential usage of transcription start sites
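The two-group design with replicates maps directly onto the Cuffdiff command line: one annotation file, then one comma-separated list of BAM files per condition. The helper below is a hypothetical sketch that only builds the argument list; the file names and condition labels are invented, and the options should be checked against the documentation of the installed Cufflinks version.

```python
def cuffdiff_command(gtf, groups, out_dir, threads=1):
    """Build (but do not run) a Cuffdiff command.

    groups: dict mapping a condition label to its list of replicate BAM files.
    """
    labels = list(groups)
    cmd = ["cuffdiff",
           "-o", out_dir,            # output directory
           "-p", str(threads),       # number of threads
           "-L", ",".join(labels),   # condition labels, in order
           gtf]                      # transcripts to test
    for label in labels:
        # Replicates of one condition are joined by commas into one argument.
        cmd.append(",".join(groups[label]))
    return cmd


# Hypothetical experiment: two conditions with two replicates each.
groups = {
    "control": ["control_1.bam", "control_2.bam"],
    "drought": ["drought_1.bam", "drought_2.bam"],
}
diff_cmd = cuffdiff_command("merged.gtf", groups, "cuffdiff_out")
print(" ".join(diff_cmd))
```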

Quality Control
• A number of metrics are important to look at:
  • Ribosomal contamination: map the entire library against a set of ribosomal RNA sequences and count the number of reads that map.
  • Number of reads mapping, and number of reads mapping uniquely.
  • Distribution of expression: a few highly expressed, some mid-expressed and many lowly expressed genes.
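The mapping counts can be read straight from the alignment records. This is a minimal sketch assuming single-end reads in plain-text SAM, where bit 0x4 of the FLAG marks an unmapped read and the optional `NH:i:` tag (written by TopHat) gives the number of reported hits. The helper names and example records are invented.

```python
def mapping_stats(sam_lines):
    """Count total, mapped and uniquely mapped reads in SAM text."""
    all_reads, mapped, unique = set(), set(), set()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        all_reads.add(name)
        if flag & 0x4:  # read is unmapped
            continue
        mapped.add(name)
        if "NH:i:1" in fields[11:]:  # exactly one reported alignment
            unique.add(name)
    total = len(all_reads)
    return {
        "total": total,
        "mapped": len(mapped),
        "unique": len(unique),
        "mapped_rate": len(mapped) / total if total else 0.0,
        "unique_rate": len(unique) / total if total else 0.0,
    }


def record(name, flag, chrom, pos, tags=""):
    """Build a made-up SAM line with placeholder sequence and qualities."""
    cigar = "*" if flag & 0x4 else "4M"
    line = f"{name}\t{flag}\t{chrom}\t{pos}\t50\t{cigar}\t*\t0\t0\tACGT\tIIII"
    return line + ("\t" + tags if tags else "")


sam_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    record("r1", 0, "chr1", 10, "NH:i:1"),
    record("r2", 0, "chr1", 50, "NH:i:2"),    # multi-mapper, first hit
    record("r2", 256, "chr1", 700, "NH:i:2"), # multi-mapper, secondary hit
    record("r3", 4, "*", 0),                  # unmapped read
    record("r4", 16, "chr1", 300, "NH:i:1"),
]
stats = mapping_stats(sam_lines)
print(stats)
```

The same counting applied to an alignment against ribosomal RNA sequences would give the ribosomal contamination rate.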

Quality Control
• Other metrics:
  • Duplication rate (based on the location of the alignments): a high rate might bias the estimation of gene expression.
  • Percentage of reads mapping to CDS, UTR, intron and intergenic regions: the more in CDS and UTR, the better.
  • Unsupervised clustering, to verify that the samples cluster according to biological differences and not according to experimental batches:
    • Hierarchical clustering
    • PCA, MDS plot
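A quick way to see whether samples group by biology is to compare their expression profiles pairwise. The sketch below is a much simpler stand-in for hierarchical clustering or PCA: it computes the Pearson correlation of log-transformed expression values and reports, for each sample, its most similar neighbour. The sample names and FPKM values are invented, and the log2(x + 1) transform is an assumption commonly made to reduce the weight of highly expressed genes.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def nearest_sample(expr):
    """For each sample, find the other sample with the most similar profile."""
    logged = {s: [math.log2(v + 1) for v in vals] for s, vals in expr.items()}
    nearest = {}
    for sample in logged:
        others = [t for t in logged if t != sample]
        nearest[sample] = max(
            others, key=lambda t: pearson(logged[sample], logged[t]))
    return nearest


# Made-up FPKM values for five genes in four samples.
expr = {
    "control_1": [100, 5, 50, 0, 20],
    "control_2": [110, 4, 45, 1, 22],
    "drought_1": [10, 80, 55, 30, 2],
    "drought_2": [12, 75, 60, 28, 3],
}
nearest = nearest_sample(expr)
print(nearest)
```

If a sample's closest neighbour came from the other condition, or from the same sequencing batch rather than the same treatment, that would be a signal to look for batch effects before testing for differential expression.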
