RNASeq - Analysis Pipeline for Differential Expression

RNA-Seq
• Application of Next Generation Sequencing (NGS) technology
to RNA, for transcript identification and quantification.
• Can be used for:
– Estimating the abundance of transcripts in the sample
(transcriptomics or expression profiling)
– Revealing sequence variation
– Detecting alternative splicing
– Comparing gene expression profiles of healthy versus diseased tissue

RNA-Seq vs Microarray
BMC Bioinformatics 2014, 15(Suppl 11):S3, DOI: 10.1186/1471-2105-15-S11-S3

Data Generation Steps
Nature Reviews Genetics 12, 671-682 (October 2011), DOI: 10.1038/nrg3068

RNA-Seq analysis Pipeline for Detecting
Differential Expression
Genome Biology 2010 11:220, DOI: 10.1186/gb-2010-11-12-220

Read-Mapping Challenges
• NGS Computational challenges
• Memory footprint
• Millions of short reads
• RNA-Seq Special Mapping Concerns
• New technology old problems
• Exact vs inexact matches
From wikipedia

Algorithms For Read Mapping
• Build an index
- Hash table
- Burrows-Wheeler transform (BWT); FM index
• Find the set of positions where reads are most likely to align
• Refine the alignment at the target locations
• Seed and extend

Hash Tables
• Use hash tables to store the positions of all k-mers
in a genome
                0         1         2
Positions:      0123456789012345678901
Genome (Chr 9): AATCGCATAGTTATTAATGCTA
k-mers (k = 10) and their locations:
AATCGCATAG - Chr 9, location 0
ATCGCATAGT - Chr 9, location 1
TCGCATAGTT - Chr 9, location 2
CGCATAGTTA - Chr 9, location 3
GCATAGTTAT - Chr 9, location 4
CATAGTTATT - Chr 9, location 5
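
The k-mer table on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how any particular aligner stores its index: the function name `build_kmer_index` is made up for the example, and a real genome would need a far more memory-efficient structure than a dictionary of strings.

```python
from collections import defaultdict

def build_kmer_index(genome, k):
    """Map every k-mer in `genome` to the 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

genome = "AATCGCATAGTTATTAATGCTA"  # sequence shown on the slide
index = build_kmer_index(genome, k=10)

print(index["AATCGCATAG"])  # [0]
print(index["GCATAGTTAT"])  # [4]
```

Looking up a read's k-mer in `index` returns candidate locations in constant time on average, which is why the hash table serves as the first filtering step before any detailed alignment.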

Burrows-Wheeler Transformation (BWT)
• Reversible transformation
• Repetitive nature of the outcome makes it easier to
compress
Input String: GCTAGCTA
Rotations:
GCTAGCTA
CTAGCTAG
TAGCTAGC
AGCTAGCT
GCTAGCTA
CTAGCTAG
TAGCTAGC
AGCTAGCT
Sorting:
AGCTAGCT
AGCTAGCT
CTAGCTAG
CTAGCTAG
GCTAGCTA
GCTAGCTA
TAGCTAGC
TAGCTAGC
Output String (last column of the sorted rotations): TTGGAACC
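
The rotate, sort, and take-the-last-column procedure can be written directly, as in the sketch below. It is a naive version for illustration only (it materializes every rotation, which real BWT-based aligners avoid). The `inverse_bwt` helper and the `$` end marker are assumptions added here to show reversibility: without a unique end marker, a periodic string such as the slide's example cannot be recovered unambiguously.

```python
def bwt(text):
    """Return the last column of the sorted rotations of `text`."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, sentinel="$"):
    """Naively rebuild the text from a BWT that was computed with a sentinel."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepend the BWT column to every row, then re-sort the rows.
        table = sorted(last_column[i] + table[i] for i in range(n))
    for row in table:
        if row.endswith(sentinel):
            return row[:-1]

print(bwt("GCTAGCTA"))                 # TTGGAACC, as on the slide
print(inverse_bwt(bwt("GCTAGCTA$")))   # GCTAGCTA
```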

Seed and Extend
Read:   ATGCTAGT
Target: ATGCTGTT
[Figure: the read aligned against the target, with each base marked as match or mismatch]
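
A simplified version of seed and extend is sketched below for the read and target on this slide. The assumptions are loud ones: the seed is taken as the first `seed_len` bases of the read, the extension is ungapped (only mismatches are counted, no insertions or deletions), and the function name and thresholds are invented for the example. Production aligners use several seeds per read and gapped extension.

```python
def seed_and_extend(read, target, seed_len=4, max_mismatches=2):
    """Locate exact seed hits, then extend each hit and count mismatches.

    Returns (target_position, mismatches) for every placement within the limit.
    """
    seed = read[:seed_len]
    hits = []
    start = target.find(seed)
    while start != -1:
        window = target[start:start + len(read)]
        if len(window) == len(read):  # skip hits that run off the target end
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
        start = target.find(seed, start + 1)
    return hits

print(seed_and_extend("ATGCTAGT", "ATGCTGTT"))  # [(0, 2)]
```

The seed `ATGC` matches exactly at position 0; extending across the remaining bases finds two mismatches, so the placement is kept under the chosen limit.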

RNA-Seq: Special Mapping Concerns
www.ensembl.org

genome.gov
Alternative Splicing

• For RNA sequencing data, many reads will map directly to the reference
genome, but many will not because, coming from spliced RNA, they
span exon–exon junctions.
• Methods to deal with junction reads:
– Align to the reference transcriptome (requires a well-annotated transcriptome).
– Align to the reference genome, build a junction library
from known adjacent exons, and then align the unmapped reads to
the junction library.
– Map reads to the genome and identify putative exons (indel
finding algorithm); using these candidate exons, build all
possible exon-exon junctions.
– De novo assembly of RNA-Seq reads.
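
The junction-library approach can be illustrated with a small sketch. It assumes exon sequences are already available as strings in transcript order and that a junction entry is formed by joining `read_len - 1` bases from each side, so any read crossing the junction fits inside the entry. The function name and the exon sequences are hypothetical and not taken from any tool.

```python
def build_junction_library(exons, read_len):
    """Join the flanks of adjacent exons into junction sequences.

    `exons` is a list of exon sequences in transcript order; `read_len` must be >= 2.
    Returns {(i, i + 1): junction_sequence}.
    """
    flank = read_len - 1
    junctions = {}
    for i in range(len(exons) - 1):
        left, right = exons[i], exons[i + 1]
        junctions[(i, i + 1)] = left[-flank:] + right[:flank]
    return junctions

# Hypothetical exon sequences, for illustration only.
exons = ["AAAACCCC", "GGGGTTTT", "ACGTACGT"]
library = build_junction_library(exons, read_len=5)
print(library)  # {(0, 1): 'CCCCGGGG', (1, 2): 'TTTTACGT'}

# A read spanning the first junction fails against either exon alone
# but is found in the junction library.
read = "CCGGG"
print(any(read in exon for exon in exons))   # False
print(read in library[(0, 1)])               # True
```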

Genome Biology 2013 14:R36, DOI: 10.1186/gb-2013-14-4-r36

Reference Based Mapping Methods
BMC Genomics. 2014; 15(1): 570, Doi: 10.1186/1471-2164-15-570

TopHat2
Genome Biology 2013 14:R36, DOI: 10.1186/gb-2013-14-4-r36

Transcript Assembly
IEEE/ACM Trans Comput Biol Bioinform. 2013 Sep-Oct; 10(5): 1234–1240.

Summarizing Reads
• Aggregate reads over biologically meaningful units such as transcripts or
genes
• Count the number of reads overlapping the exons in a gene (but a significant
proportion of the reads will also map outside annotated regions)
Genome Biology 2010 11:220, DOI: 10.1186/gb-2010-11-12-220
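
Counting reads over the exons of a gene reduces to an interval-overlap test, sketched below. The coordinates are hypothetical, intervals are assumed to be half-open `(start, end)` on a single chromosome and strand, and a read is counted once if it overlaps any exon. Dedicated counting tools apply additional rules (for example, for reads that overlap several genes) that are not modeled here.

```python
def count_reads_in_gene(exons, reads):
    """Count reads that overlap at least one exon; intervals are half-open."""
    count = 0
    for read_start, read_end in reads:
        if any(read_start < exon_end and exon_start < read_end
               for exon_start, exon_end in exons):
            count += 1
    return count

# Hypothetical coordinates, for illustration only.
exons = [(100, 200), (300, 400)]
reads = [(90, 140), (180, 230), (220, 270), (350, 400), (400, 450)]
print(count_reads_in_gene(exons, reads))  # 3
```

The reads at `(220, 270)` and `(400, 450)` fall outside the annotated exons and are not counted, which is the situation the second bullet warns about.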

Count Normalization
• Number of reads aligned to a gene gives a measure of
its level of expression
• Normalization of the count data
• Sequencing depth
• Length bias
[Figure: read counts for short and long transcripts at low and high expression]
Nature Methods 8, 469–477 (2011), Doi:10.1038/nmeth.1613

Count Normalization
• RPKM (Reads Per Kilobase of exon model per Million mapped reads)
• FPKM (Fragments Per Kilobase of exon model per Million mapped reads)
• TPM (Transcripts Per Million)
RPKM = Raw number of reads / (Exon length in kb × (Number of mapped reads in the sample / 1,000,000))
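
TPM is listed on this slide without a formula. Under its usual definition, each count is first divided by the transcript length, and the resulting rates are then rescaled so that they sum to one million within the sample. The sketch below follows that definition; the function name, transcript names, and numbers are hypothetical.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {name: counts[name] / lengths_kb[name] for name in counts}
    scale = sum(rates.values()) / 1_000_000
    return {name: rate / scale for name, rate in rates.items()}

# Hypothetical counts and lengths, for illustration only.
counts = {"t1": 500, "t2": 300, "t3": 200}
lengths_kb = {"t1": 2.0, "t2": 1.0, "t3": 4.0}

values = tpm(counts, lengths_kb)
print(values)
print(round(sum(values.values())))  # 1000000
```

Because the values always sum to the same total, TPM figures are easier to compare across samples than RPKM figures, whose per-sample sum varies.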

Count Normalization
Gene/Transcript Name     R1 counts    R2 counts
A (50 kb)                37000        70000
B (100 kb)               50000        110000
C (200 kb)               50000        88000
D (-- kb)                ----         ----
XDD (-- kb)              ----         ----
Total number of reads    2000000      4000000
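
As a worked example, the RPKM definition (reads, divided by length in kb, divided by millions of reads in the sample) can be applied to rows A to C of this table. The sketch assumes the "Total number of reads" row is the number of mapped reads per sample; rows D and XDD are left out because their values are not given.

```python
def rpkm(count, length_kb, total_reads):
    """Reads per kilobase per million mapped reads."""
    return count / (length_kb * total_reads / 1_000_000)

# (length in kb, R1 count, R2 count), copied from the table.
genes = {"A": (50, 37000, 70000), "B": (100, 50000, 110000), "C": (200, 50000, 88000)}
totals = {"R1": 2_000_000, "R2": 4_000_000}

for name, (length_kb, r1, r2) in genes.items():
    print(name, rpkm(r1, length_kb, totals["R1"]), rpkm(r2, length_kb, totals["R2"]))
# A 370.0 350.0
# B 250.0 275.0
# C 125.0 110.0
```

Genes B and C have the same raw count in R1, yet C's RPKM is half of B's because C is twice as long. Likewise, the raw counts for A nearly double from R1 to R2, but after accounting for the doubled sequencing depth its RPKM drops slightly.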

Differential Expression
• Goal of the DE analysis is to identify the genes
for which abundance across different
experimental conditions has changed
significantly
• Biological replicates (to account for biological
variation)
• Ranked list of genes with associated p-values
and fold changes
• DE tools: edgeR, DESeq
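
The fold-change half of the ranked list can be sketched as follows. This is deliberately not what edgeR or DESeq do: those tools fit negative binomial models to the replicate counts to obtain p-values, whereas this example only computes a log2 fold change between replicate means and ranks genes by its magnitude. The gene names, values, and pseudocount are hypothetical.

```python
import math

def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 ratio of replicate means; the pseudocount avoids division by zero."""
    mean_control = sum(control) / len(control)
    mean_treated = sum(treated) / len(treated)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))

# Hypothetical normalized counts: (control replicates, treated replicates).
expression = {
    "geneX": ([100, 110, 90], [400, 420, 380]),
    "geneY": ([50, 55, 45], [52, 48, 50]),
    "geneZ": ([300, 310, 290], [30, 40, 35]),
}

ranked = sorted(expression,
                key=lambda gene: abs(log2_fold_change(*expression[gene])),
                reverse=True)
for gene in ranked:
    print(gene, round(log2_fold_change(*expression[gene]), 2))
```

A fold change alone says nothing about significance; the biological replicates are what allow a DE tool to judge whether a change exceeds the natural variation between samples.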

Alignment Independent Quantification
• Sailfish
• Salmon
• Kallisto
Main Idea
• Quantify the abundance of known transcripts
• Read mapping is unnecessary
• Replace inexact pattern matching with exact sub-pattern counting

Sailfish
Nature Biotechnology 32, 462–464 (2014), Doi:10.1038/nbt.2862

K-mers are Robust to Errors
Transcript: TACGTACTAGACCTAA…
Read:       TGCGTACTAGCCCT
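
The robustness claim can be checked on the transcript and read shown here. The sketch uses only the visible prefix of the transcript (the remainder is elided on the slide) and assumes k = 4 purely for illustration; alignment-free quantifiers use much longer k-mers.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

transcript = "TACGTACTAGACCTAA"  # visible prefix only; the slide elides the rest
read = "TGCGTACTAGCCCT"
k = 4  # assumption for illustration

transcript_kmers = set(kmers(transcript, k))
shared = [kmer for kmer in kmers(read, k) if kmer in transcript_kmers]
print(shared)  # ['CGTA', 'GTAC', 'TACT', 'ACTA', 'CTAG']
```

The read differs from the transcript at two positions, so it would not match exactly as a whole, yet five of its eleven 4-mers still occur in the transcript. Exact k-mer lookups can therefore assign the read without an inexact alignment step.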

Kallisto
arXiv:1505.02710v2 [q-bio.QM]
