RNA-seq for gene finding

Transcript discovery and gene model correction using next-generation sequencing data

Sucheta Tripathy, 6th July 2012

NextGen Sequencing Methods
• 454 sequencing methods (2006)
  • Principles of pyrophosphate detection (1985, 1988)
• Illumina (Solexa) genome sequencing methods (2007)
• Applied Biosystems ABI SOLiD System (2007)
• Helicos single-molecule sequencing (HeliScope, 2007)
• Pacific Biosciences single-molecule real-time (SMRT) technology (2010)
• Sequenom for nanotechnology-based sequencing
• BioNanomatrix nanofluidics
• RNAP technology

Roberts et al.
Genome Biology 2011

RNASeq
• Catalogue all species of transcripts
  • mRNA
  • Non-coding RNA
  • Small RNA
• Splicing patterns or other post-transcriptional modifications
• Quantify the expression levels

Topics covered
• Sequence formats
• Calculate the sequencing depth of coverage
• Data analysis workflow
• Mapping programs
• Output data files
  • SAM
  • SHRiMP
  • MAQ
• Clustering and assembly programs
• Finding new genes and correction of existing genes
• Annotation of RNAseq data

Input File Types
Raw sequence files in csfasta or fastq format

fastq:
@SNPSTER4_90_307R0AAXX:2:41:528:604 run=080625_SNPSTER4_090_307R0AAXX
GCGCCTATCCACTTTGCGGTCTTCCAAAGNCTCCGG
+
IIIIIIIIIIIIIIIIIIIIIIIIII,II!IIIIII

csfasta (colorspace):
>853_22_43_F3
T32310120021231211023112232332233113303231202211332

Matching quality values:
>853_22_43_F3
20 24 23 22 14 13 18 12 23 22 14 14 17 26 26 18 12 17 16 26 23 16 15 16 25 5 14
25 26 23 8 10 9 20 2 11 2 9 25 26 8 6 19 24 15 18 6 10 20 12
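As a rough illustration of how the fastq record above is laid out, the sketch below splits its four lines and converts the quality characters to numbers. It assumes Phred+33 (Sanger) encoding, which the slide does not state; `parse_fastq_record` is a hypothetical helper name, not part of any library.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (read_id, sequence, scores)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    read_id = header[1:].split()[0]
    # Assumption: Phred+33 encoding, so '!' decodes to 0 and 'I' to 40.
    scores = [ord(ch) - 33 for ch in quality]
    return read_id, sequence, scores


record = [
    "@SNPSTER4_90_307R0AAXX:2:41:528:604 run=080625_SNPSTER4_090_307R0AAXX",
    "GCGCCTATCCACTTTGCGGTCTTCCAAAGNCTCCGG",
    "+",
    "IIIIIIIIIIIIIIIIIIIIIIIIII,II!IIIIII",
]
read_id, sequence, scores = parse_fastq_record(record)
# The uncalled base 'N' should carry the lowest quality in the read.
n_position = sequence.index("N")
print(read_id, len(sequence), scores[n_position])
```

Under that encoding the uncalled base (N) lines up with the `!` character, the lowest possible score.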

Calculate the sequencing depth of coverage
• Read length
• Number of reads
• GeneSpace size/genome size

Depth of coverage = (Read Length * Number of Reads) / GeneSpace (or genome size)

Problem: 12 million reads, read length = 50 bases, total GeneSpace = 8 Mb
(12 * 10^6 * 50) / (8 * 10^6) = 75X
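The same arithmetic can be written as a small function. This is only a sketch of the formula on this slide; the function and parameter names are made up for illustration.

```python
def depth_of_coverage(read_length, number_of_reads, target_size):
    """Average depth = read length * number of reads / target size (bases)."""
    if target_size <= 0:
        raise ValueError("target size must be positive")
    return read_length * number_of_reads / target_size


# Worked problem from the slide: 12 million 50-base reads over an 8 Mb gene space.
coverage = depth_of_coverage(
    read_length=50, number_of_reads=12_000_000, target_size=8_000_000
)
print(f"{coverage:.0f}X")
```

The target size is the gene space for RNA-seq or the genome size for genomic reads, both expressed in bases.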

Part 1: Alignment of the reads to the reference genome

Raw sequence data files (FastQ/colorspace) -> QC with R (ShortRead) -> Reads mapped to reference (Bowtie, BWA, SHRiMP)

Processing of mapped reads:
1. Filter out spike-ins
2. Filter reads mapping to multiple locations
3. SAM -> BAM
4. Remove PCR duplicates
5. Sort, view, pileup, merge

BEDTools:
1. Read depth of coverage
2. Manipulation of BED, SAM, BAM, GTF, GFF files

SNP discovery, indel
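Two of the filtering steps in this workflow (dropping reads that map to several locations and removing PCR duplicates) can be sketched on toy data. This is a simplified stand-in for what dedicated tools do, not their actual algorithm: the dictionary fields are hypothetical, and duplicates are assumed to be alignments sharing reference, start position and strand.

```python
from collections import Counter


def filter_alignments(alignments):
    """Drop multi-mapping reads, then keep one alignment per (ref, pos, strand)."""
    hits_per_read = Counter(a["read"] for a in alignments)
    unique = [a for a in alignments if hits_per_read[a["read"]] == 1]
    best = {}
    for a in unique:
        key = (a["ref"], a["pos"], a["strand"])
        # Alignments sharing start and strand are treated as PCR duplicates;
        # keep the one with the highest mapping quality (first seen wins ties).
        if key not in best or a["mapq"] > best[key]["mapq"]:
            best[key] = a
    return sorted(best.values(), key=lambda a: (a["ref"], a["pos"]))


alignments = [
    {"read": "r1", "ref": "scaffold_1", "pos": 100, "strand": "+", "mapq": 30},
    {"read": "r2", "ref": "scaffold_1", "pos": 100, "strand": "+", "mapq": 60},
    {"read": "r3", "ref": "scaffold_1", "pos": 500, "strand": "-", "mapq": 60},
    {"read": "r4", "ref": "scaffold_1", "pos": 900, "strand": "+", "mapq": 0},
    {"read": "r4", "ref": "scaffold_2", "pos": 40, "strand": "+", "mapq": 0},
]
kept = filter_alignments(alignments)
print([a["read"] for a in kept])
```

Here r4 is discarded because it maps twice, and r1 is discarded as a lower-quality duplicate of r2.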

Part 2: Data Analysis

• Assembly of mapped reads (Cufflinks)
• Assembly of raw QC'd reads by de novo methods (ABySS, Velvet)
• Merging Cufflinks outputs from different libraries (cuffcompare)
• Align assembled reads back to genome (BLAT)
• Gene model correction/junction finding (TopHat, Trans-ABySS)
• Splice variants
• Copy number variation
• Expression analysis and differential expression (cuffdiff, DEGseq, edgeR)

Zhong Wang et al., Nat. Rev. Genetics, 2009

Mapping
• One or two mismatches allowed for reads < 35 bases
• One insertion/deletion
• K-mer based seeding

• Identification of novel transcripts
• Transcript abundance
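To make k-mer based seeding with a mismatch budget concrete, here is a minimal seed-and-extend sketch. It is not the algorithm of any particular mapper listed in this deck; the reference string, read and function names are invented for illustration, and indels are ignored.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return sorted (position, mismatches) hits found by seed-and-extend."""
    hits = {}
    # Non-overlapping seeds: with at most 2 mismatches and 3 or more seeds,
    # at least one seed must match the reference exactly.
    for offset in range(0, len(read) - k + 1, k):
        for seed_pos in index.get(read[offset:offset + k], []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


reference = "ACGTTGCAAGGCTTACGATCGGATCCTAGGCATG"  # toy sequence
index = build_kmer_index(reference, k=4)
# The read is reference[10:22] with two substitutions introduced.
print(map_read("GATTACGATCTG", reference, index, k=4))
```

Only the middle seed matches exactly here, yet it is enough to place the read; the extension step then counts the two mismatches.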

Available tools for Nextgen sequence alignment
BFAST: BLAT-like Fast Accurate Search Tool.
Bowtie: Burrows-Wheeler-Transformed (BWT) index.
BWA: Gapped global alignment with respect to query sequences.
ELAND: Part of the Illumina distribution; runs on a single processor; local alignment.
SOAP: Short Oligonucleotide Alignment Program.
SSAHA: Sequence Search and Alignment by Hashing Algorithm.
SHRiMP: Short Read Mapping Package.
SOCS: Rabin-Karp string search algorithm, which

Integrated Pipeline

• SOLiD™ System Analysis Pipeline Tool (Corona Lite)
• CLC bio Genomics Workbench
• Partek
• Galaxy server
• ERANGE: a full package for RNA-seq and ChIP-seq data analysis
• DESeq (differential expression package; related to, but separate from, edgeR)

Output File Formats
• SAM (Sequence Alignment/Map)
• SAM -> BAM conversion
• Sorting/indexing BAM/SAM files
• Extracting and viewing alignments
• SNP calling (mpileup)
• Text alignment viewer (tview)

1082_1988_1406_F3 16 scaffold_1 31452 255 48M * 0 0 TCCACGTCACCAGCAAGCCTCCGGTCAATCCGTCTGACTTGTCCTGTC 8E/./:R*$BIG/!%GP9@MMK;@FMJIXVNSWNNUUOTXQNGFQUPN XA:i:0 MD:Z:48 NM:i:0 CM:i:5

Flag (second field):
0 -> the read is unpaired and mapped to the forward strand
4 -> unmapped read
16 -> mapped to the reverse strand
http://samtools.sourceforge.net/SAM1.pdf
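The flag is a bit field, so values other than 0, 4 and 16 are sums of individual bits. The sketch below decodes a flag using the bit meanings from the SAM specification linked above; the helper names `decode_flag` and `summarize` are illustrative only.

```python
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "unmapped",
    0x8: "mate unmapped",
    0x10: "reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "failed quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def summarize(flag):
    """One-line description covering the three cases shown on the slide."""
    if flag & 0x4:
        return "unmapped"
    strand = "reverse" if flag & 0x10 else "forward"
    pairing = "paired" if flag & 0x1 else "unpaired"
    return f"{pairing}, mapped, {strand} strand"


for flag in (0, 4, 16):
    print(flag, "->", summarize(flag))
```

For the record shown above, the flag 16 sets only the 0x10 bit, so the read is unpaired and aligned to the reverse strand.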

SHRiMP and MAQ Format

SHRiMP record:
>947_1567_1384_F3 reftig_991 + 22901 22923 3 25 25 2020 18x2x3

Edit string (last field, here 18x2x3):
A perfect match for 25-bp tags is: "25"
A SNP at the 16th base of the tag is: "15A9"
A four-base insertion in the reference: "3(TGCT)20"
A four-base deletion in the reference: "5----20"
Two sequencing errors: "4x15x6" (i.e. 25 matches with 2 crossovers)
http://compbio.cs.toronto.edu/shrimp/README
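A small tokenizer shows how these edit strings break down into operations. It handles only the constructs listed on this slide and uses the slide's own labels; the SHRiMP README remains the authoritative description, and `parse_edit_string` is a hypothetical helper, not part of SHRiMP.

```python
import re

# One alternative per construct: match run, SNP base, insertion, deletion, crossover.
TOKEN = re.compile(r"(\d+)|([ACGTNacgtn])|\(([ACGTNacgtn]+)\)|(-+)|(x)")


def parse_edit_string(edit):
    """Turn a SHRiMP-style edit string into a list of (operation, value) pairs."""
    ops = []
    pos = 0
    while pos < len(edit):
        m = TOKEN.match(edit, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {edit[pos]!r}")
        matches, snp, insertion, deletion, crossover = m.groups()
        if matches:
            ops.append(("match", int(matches)))
        elif snp:
            ops.append(("snp", snp))
        elif insertion:
            ops.append(("insertion", insertion))
        elif deletion:
            ops.append(("deletion", len(deletion)))
        else:
            ops.append(("crossover", 1))
        pos = m.end()
    return ops


for example in ("25", "15A9", "3(TGCT)20", "5----20", "4x15x6", "18x2x3"):
    print(example, parse_edit_string(example))
```

For the record shown above, "18x2x3" yields 23 matched positions split by two crossovers, which agrees with the read coordinates 3 to 25 in that line.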

MAQ record:
ID19_190907_6_195_127_427 Contig0_2091311 60 + 0 0 30 30 30 0 0 1 4 35 GTGCAGCCATTTGCGTACaAGCaTCtCaaGctACt ?IIIIIIIIIIIIII@EI6<II6HB9I(8I6.G<-

Assembly programs
• ABySS
  • Supports multiple k values
  • Fast
  • Assemblies from different k values can be merged
  • The Trans-ABySS pipeline runs on it
• MIRA (Mimicking Intelligent Read Assembly)
  • Hybrid de novo assembler
  • Genome mapper
• Velvet

Splice Junction Prediction
• TopHat
• Cufflinks
• MapSplice
• Trans-ABySS

An overview of the MapSplice pipeline.

© The Author(s) 2010. Published by Oxford University Press.

Wang K et al. Nucl. Acids Res. 2010;38:e178-e178

Cufflinks
• Transcript assembly
• Expression levels with a reference GTF
• Expression levels without a GTF
• Merging experimental replicates (cuffcompare)
• Differential expression analysis (cuffdiff)

Annotation of RNASeq Data

• De novo assembled reads (contigs)
• Reads mapped to reference, assembled
• Map back to genome (BLAT)
• Train for gene prediction
• Expression profiling
• Junction/novel transcripts/splice variants
• Differential expression analysis
• CNV

Genome Viewer
• Desktop/standalone applications
  • Tbrowse
  • BamView
  • Savant
  • IGV
  • IGB
• Web-based browsers
  • GBrowse
  • UCSC Genome Browser
  • VBI Transcriptomics browser

Other Applications
• SNP detection
• Splice variant discovery
• Identification of miRNA targets
• TF binding sites
• Genome methylation pattern
• RNA editing
• Metagenomic projects
• Gene expression analysis

Differences from other expression sequencing methods
• EST: low throughput, expensive, NOT quantitative
• SAGE, CAGE, MPSS: high-throughput, digital gene expression levels
  • Expensive
  • Sanger sequencing methods
  • Only a portion of the transcript is analyzed
  • Isoforms are indistinguishable

Advantages:
• Little or no background noise
• Sensitive to isoform discovery
• Both low and highly expressed genes can be quantified
• Highly reproducible

Transcripts discovered/corrected
• 10,000 new transcription start sites discovered in Rhesus macaque (Liu et al., NAR 2010)
• 602 transcriptionally active regions and numerous introns in Candida albicans (Bruno et al., 2010, Genome Research)
• 96% of the genes were corrected in Laccaria bicolor (Larsen et al., PLoS One 2010)
• 16,923 regions in mouse (Mortazavi et al., 2008)
• 3,724 novel isoforms (Trapnell et al., 2010)

Bioinformatics Challenges
• Storing, retrieving and analyzing large amounts of data
• Matching of reads to multiple locations
• Short reads with higher copy number and long reads representing less expressed genes

References:
• Wilhelm J. Ansorge. Next-generation DNA sequencing techniques. New Biotechnology, Volume 25, Issue 4, April 2009, Pages 195-203.
• Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 January; 10(1): 57–63.
• Peter E. Larsen et al. Using Deep RNA Sequencing for the Structural Annotation of the Laccaria Bicolor Mycorrhizal Transcriptome. PLoS One. 2010; 5(7): e9780.
• Wang et al. MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. NAR, 2010.
• Denoeud et al. Annotating genomes with massive-scale RNA sequencing. Genome Biology, 2008.
• Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. doi:10.1038/nbt.1621
• Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btp120
• Mortazavi et al. Nature Methods, May 2008.
