An overview of the MapSplice pipeline. The algorithm contains two phases: tag alignment (Step 1–Step 4) and splice inference (Step 5–Step 6). In the ‘tag alignment’ phase, candidate alignments of the mRNA tags to the reference genome are determined. In the ‘splice inference’ phase, splice junctions that appear in one or more tag alignments are analyzed to determine a splice significance score based on the quality and diversity of alignments that include the splice. Ambiguous candidate alignments are resolved by selecting the alignment with the highest overall match quality and the highest confidence splice junctions.
Cap analysis of gene expression (CAGE), Massively parallel signature sequencing (MPSS), Serial analysis of gene expression (SAGE)
Transcript discovery and gene model correction using next generation sequencing data Sucheta Tripathy, 6th July 2012
NextGen Sequencing Methods
• 454 sequencing methods (2006)
• Principles of pyrophosphate detection (1985, 1988)
• Illumina (Solexa) genome sequencing methods (2007)
• Applied Biosystems ABI SOLiD System (2007)
• Helicos single molecule sequencing (HeliScope, 2007)
• Pacific Biosciences single-molecule real-time (SMRT) technology (2010)
• Sequenom for nanotechnology based sequencing
• BioNanomatrix nanofluidics
• RNAP technology
RNASeq
• Catalogue all species of transcripts:
  • mRNA
  • Non-coding RNA
  • Small RNA
• Splicing patterns or other post-transcriptional modifications.
• Quantify the expression levels.
Topics covered
• Sequence formats
• Calculate the sequencing depth of coverage
• Data analysis workflow
• Mapping programs
• Output data files
  • SAM
  • SHRiMP
  • MAQ
• Clustering and assembly programs
• Finding new genes and correction of existing genes
• Annotation of RNAseq data
Calculate the sequencing depth of coverage
• Read length
• Number of reads
• GeneSpace size / genome size
Depth of coverage = (Read Length * Number of Reads) / GeneSpace (or genome size)
Problem: 12 million reads, read length = 50 bases, total GeneSpace = 8 Mb
(12 * 10^6 * 50) / (8 * 10^6) = 75X
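The depth calculation above can be sketched in a few lines of Python. The function name `sequencing_depth` and its parameter names are illustrative only and do not come from any particular tool.

```python
def sequencing_depth(num_reads: int, read_length: int, target_size: int) -> float:
    """Average depth of coverage = reads * read length / target size (in bases)."""
    if target_size <= 0:
        raise ValueError("target_size must be a positive number of bases")
    return num_reads * read_length / target_size


# Worked problem from the slide: 12 million reads of 50 bases over an 8 Mb gene space.
depth = sequencing_depth(num_reads=12_000_000, read_length=50, target_size=8_000_000)
print(f"{depth:.0f}X")  # prints "75X"
```

The same function applies to a whole genome by passing the genome size in bases as `target_size`.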
Part 1: Alignment of the reads to the reference genome (workflow diagram)
• Raw sequence data files (FastQ/colorspace)
• QC by R ShortRead
• Reads mapped to reference: Bowtie, BWA, SHRiMP
• Processing steps:
  1. Filter out spike-ins
  2. Filter reads mapping to multiple locations
  3. SAM -> BAM
  4. Remove PCR duplicates
  5. Sort, view, pileup, merge
• BEDTools:
  1. Read depth of coverage
  2. Manipulation of BED, SAM, BAM, GTF, GFF files
• SNP discovery, indel
Part 2: Data Analysis (workflow diagram)
• Assembly of mapped reads (Cufflinks)
• Assembly of raw QC'd reads by de novo methods: ABySS, Velvet
• Gene model correction/junction finding: TopHat, splice variants
• Merging Cufflinks outputs from different libraries (cuffcompare)
• Align assembled reads back to genome (BLAT), Trans-ABySS
• Expression analysis and differential expression (cuffdiff, DEGseq, edgeR)
• Copy number variation
Mapping
• One or two mismatches (< 35 bases)
• One insertion/deletion
• K-mer based seeding
• Identification of novel transcripts
• Transcript abundance
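K-mer based seeding can be illustrated with a toy example. This is a minimal sketch of the general idea, not the method of any specific mapper named in these slides. It indexes every k-mer of the reference, looks up each k-mer of the read, and lets the hits vote for a candidate start position. The function names and the example sequences are made up for illustration.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_candidates(read: str, index: dict, k: int) -> list:
    """Return (candidate_start, seed_hits) pairs, best supported first.

    Each k-mer hit votes for the reference position at which the read
    would have to start for that k-mer to line up.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            votes[ref_pos - offset] += 1
    return votes.most_common()


reference = "ACGTTGCATGCCGATAGCTA"   # toy reference (assumption)
read = "GCATGCCGAA"                  # reference[5:15] with the last base changed
index = build_kmer_index(reference, k=4)
print(seed_candidates(read, index, k=4))
```

In this toy case six of the seven 4-mers in the read still hit the reference and all agree on start position 5, so the single mismatch does not prevent the read from being placed. A real mapper would follow the seeding step with a full alignment of the read at each candidate position.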
Available tools for NextGen sequence alignment
• BFAST: BLAT-like Fast Accurate Search Tool.
• Bowtie: Burrows-Wheeler-Transformed (BWT) index.
• BWA: Gapped global alignment wrt query sequences.
• ELAND: Is part of the Illumina distribution and runs on a single processor; local alignment.
• SOAP: Short Oligonucleotide Alignment Program.
• SSAHA: Sequence Search and Alignment by Hashing Algorithm.
• SHRiMP: Short Read Mapping algorithm.
• SOCS: Rabin-Karp string search algorithm, which
Integrated Pipeline
• SOLiD™ System Analysis Pipeline Tool (Corona Lite)
• CLC bio Genomics Workbench
• Partek
• Galaxy server
• ERANGE: a full package for RNA-Seq and ChIP-Seq data analysis
• DESeq (Bioconductor package for differential expression, comparable to edgeR)
Output File Formats
• SAM (Sequence Alignment/Map)
• SAM, BAM
• Sorting/indexing BAM/SAM files
• Extracting and viewing alignment
• SNP calling (mpileup)
• Text viewer (tview)
Example SAM record:
1082_1988_1406_F3 16 scaffold_1 31452 255 48M * 0 0 TCCACGTCACCAGCAAGCCTCCGGTCAATCCGTCTGACTTGTCCTGTC 8E/./:R*$BIG/!%GP9@MMK;@FMJIXVNSWNNUUOTXQNGFQUPN XA:i:0 MD:Z:48 NM:i:0 CM:i:5
FLAG values (second field):
• 0 -> the read is not paired and mapped, forward strand
• 4 -> unmapped read
• 16 -> mapped to the reverse strand
http://samtools.sourceforge.net/SAM1.pdf
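The FLAG field is a bit field, so values such as 0, 4 and 16 can be decoded programmatically. The sketch below uses the bit values defined in the SAM specification. The label strings, the function names `decode_flag` and `parse_sam_record`, and the use of whitespace splitting (real SAM files are tab-delimited) are assumptions made for this illustration.

```python
# Bit values follow the SAM specification; the label strings are illustrative.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS.items() if flag & bit]


def parse_sam_record(line: str) -> dict:
    """Split one alignment line into a few named fields.

    Uses whitespace splitting because the slide example is space-separated;
    a real SAM file should be split on tabs.
    """
    fields = line.split()
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
        "tags": fields[11:],
    }


record = parse_sam_record(
    "1082_1988_1406_F3 16 scaffold_1 31452 255 48M * 0 0 "
    "TCCACGTCACCAGCAAGCCTCCGGTCAATCCGTCTGACTTGTCCTGTC "
    "8E/./:R*$BIG/!%GP9@MMK;@FMJIXVNSWNNUUOTXQNGFQUPN "
    "XA:i:0 MD:Z:48 NM:i:0 CM:i:5"
)
print(record["qname"], record["rname"], record["pos"], decode_flag(record["flag"]))
```

For the example record the FLAG is 16, so only the reverse-strand bit is set. A FLAG of 0 decodes to an empty list, which matches the slide's reading of an unpaired read mapped to the forward strand.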
SHRiMP and MAQ Format
SHRiMP record:
>947_1567_1384_F3 reftig_991 + 22901 22923 3 25 25 2020 18x2x3
Edit string (last field):
• A perfect match for 25-bp tags is: "25"
• A SNP at the 16th base of the tag is: "15A9"
• A four-base insertion in the reference: "3(TGCT)20"
• A four-base deletion in the reference: "5----20"
• Two sequencing errors: "4x15x6" (i.e. 25 matches with 2 crossovers)
http://compbio.cs.toronto.edu/shrimp/README
MAQ record:
ID19_190907_6_195_127_427 Contig0_2091311 60 + 0 0 30 30 30 0 0 1 4 35 GTGCAGCCATTTGCGTACaAGCaTCtCaaGctACt ?IIIIIIIIIIIIII@EI6<II6HB9I(8I6.G<-
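The edit-string examples above follow a small grammar: a run of digits is a count of matching bases, a letter is a mismatch, a parenthesised group is an insertion, a run of dashes is a deletion, and "x" is a colour-space crossover. The tokenizer below is a rough sketch derived only from those five examples. It is not taken from the SHRiMP source, and the operation labels and function names are assumptions.

```python
import re

# One alternative per token type seen in the slide's examples.
_TOKEN = re.compile(r"(\d+)|\(([A-Za-z]+)\)|(-+)|(x)|([A-Za-z])")


def parse_edit_string(edit: str) -> list:
    """Split a SHRiMP-style edit string into (operation, value) pairs."""
    ops = []
    pos = 0
    while pos < len(edit):
        m = _TOKEN.match(edit, pos)
        if m is None:
            raise ValueError(f"unexpected character {edit[pos]!r} at offset {pos}")
        digits, inserted, dashes, crossover, base = m.groups()
        if digits is not None:
            ops.append(("match", int(digits)))
        elif inserted is not None:
            ops.append(("insertion", inserted))
        elif dashes is not None:
            ops.append(("deletion", len(dashes)))
        elif crossover is not None:
            ops.append(("crossover", 1))
        else:
            ops.append(("mismatch", base))
        pos = m.end()
    return ops


def matched_bases(edit: str) -> int:
    """Total number of matching positions recorded in the edit string."""
    return sum(value for op, value in parse_edit_string(edit) if op == "match")


for example in ["25", "15A9", "3(TGCT)20", "5----20", "4x15x6", "18x2x3"]:
    print(example, parse_edit_string(example), matched_bases(example))
```

For "4x15x6" this yields 25 matches and 2 crossovers, in line with the note on the slide. For the example record, "18x2x3" gives 23 matches, which agrees with the read span 3 to 25 and the contig span 22901 to 22923.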
Assembly programs
• ABySS
  • Supports multiple k values
  • Fast
  • Merging of assemblies with different k values is possible
  • The Trans-ABySS pipeline runs on this
• MIRA (Mimicking Intelligent Read Assembly)
  • Hybrid de novo assembler
  • Genome mapper
• Velvet
Cufflinks
• Transcript assembly
• Expression levels with a reference GTF
• Expression levels without GTF
• Merging experimental replicates (cuffcompare)
• Differential expression analysis (cuffdiff)
Annotation of RNASeq Data (workflow diagram)
• De novo assembled reads (contigs)
• Reads mapped to reference, assembled
• Map back to genome (BLAT)
• Train for gene prediction
• Expression profiling
• Junction/novel transcripts/splice variants
• Differential expression analysis
• CNV
Other Applications
• SNP detection
• Splice variant discovery
• Identification of miRNA targets
• TF binding sites
• Genome methylation pattern
• RNA editing
• Metagenomic projects
• Gene expression analysis
Difference with other expression sequencing
• EST: low throughput, expensive, NOT quantitative.
• SAGE, CAGE, MPSS: high throughput, digital gene expression levels
  • Expensive Sanger sequencing methods
  • A portion of the transcript is analyzed
  • Isoforms are indistinguishable
Advantages:
• Zero or very little background noise.
• Sensitive to isoform discovery.
• Both low and highly expressed genes can be quantified.
• Highly reproducible.
Transcripts discovered/corrected
• 10,000 new transcription start sites discovered in Rhesus macaque (Liu et al., NAR 2010)
• 602 transcriptionally active regions and numerous introns in Candida albicans (Bruno et al., 2010, Genome Research)
• 96% of the genes were corrected in Laccaria bicolor (Larsen et al., PLoS One 2010)
• 16,923 regions in mouse (Mortazavi et al., 2008)
• 3,724 novel isoforms (Trapnell et al., 2010)
Bioinformatics Challenges
• Store, retrieve and analyze large amounts of data
• Matching of reads to multiple locations
• Short reads with higher copy number and long reads representing less expressed genes
References:
• Ansorge WJ. Next-generation DNA sequencing techniques. New Biotechnology, Volume 25, Issue 4, April 2009, Pages 195-203.
• Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 January; 10(1): 57–63.
• Larsen PE et al. Using Deep RNA Sequencing for the Structural Annotation of the Laccaria Bicolor Mycorrhizal Transcriptome. PLoS One. 2010; 5(7): e9780.
• Wang et al. MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. NAR, 2010.
• Denoeud et al. Annotating genomes with massive-scale RNA sequencing. Genome Biology, 2008.
• Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. doi:10.1038/nbt.1621
• Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btp120
• Mortazavi et al. Nature Methods, May 2008.