RNA-Seq for gene finding

  • CAGE: cap analysis of gene expression; MPSS: massively parallel signature sequencing; SAGE: serial analysis of gene expression.
  • An overview of the MapSplice pipeline. The algorithm contains two phases: tag alignment (Step 1–Step 4) and splice inference (Step 5–Step 6). In the ‘tag alignment’ phase, candidate alignments of the mRNA tags to the reference genome are determined. In the ‘splice inference’ phase, splice junctions that appear in one or more tag alignments are analyzed to determine a splice significance score based on the quality and diversity of alignments that include the splice. Ambiguous candidate alignments are resolved by selecting the alignment with the overall highest quality match and highest confidence splice junctions.

    1. Transcript discovery and gene model correction using next generation sequencing data
        Sucheta Tripathy, VBI, 11th Nov 2010
    2. Brief History of Sequencing
        • Sanger dideoxy sequencing method (1977).
        • Maxam-Gilbert chemical degradation method (1977).
        • Two labs that developed automated sequencers:
          1. Leroy Hood at Caltech, 1986 (commercialized by Applied Biosystems)
          2. Wilhelm Ansorge at EMBL, 1986 (commercialized by Pharmacia-Amersham and GE Healthcare)
    3. Brief History of Sequencing
        • Hypoxanthine-guanine phosphoribosyltransferase (HGPRT)
        • Alu sequences
    4. Brief History of Sequencing
        • 1991: a patent filed by EMBL on media-less, solid-support-based sequencing.
        • 1996: Hitachi Laboratory developed a high-throughput capillary array sequencer.
    5. NextGen Sequencing Methods
        • 454 sequencing methods (2006)
        • Principles of pyrophosphate detection (1985, 1988)
        • Illumina (Solexa) genome sequencing methods (2007)
        • Applied Biosystems ABI SOLiD System (2007)
        • Helicos single-molecule sequencing (HeliScope, 2007)
        • Pacific Biosciences single-molecule real-time (SMRT) technology, 2010
        • Sequenom for nanotechnology-based sequencing.
        • BioNanomatrix nanofluidics.
        • RNAP technology.
    6. Figure 1. (A) Outline of the GS 454 DNA sequencer workflow. Library construction (I) ligates 454-specific adapters to DNA fragments (indicated as A and B) and couples amplification beads with DNA in an emulsion PCR to amplify fragments before sequencing (II). The beads are loaded into the picotiter plate (III). (B) Schematic illustration of the pyrosequencing reaction which occurs on nucleotide incorporation to report sequencing-by-synthesis. (Adapted from http://www.454.com.)
    7. Outline of the Illumina Genome Analyzer workflow. Similar fragmentation and adapter ligation steps take place (I), before applying the library onto the solid surface of a flow cell. Attached DNA fragments form ‘bridge’ molecules which are subsequently amplified via an isothermal amplification process, leading to a cluster of identical fragments that are subsequently denatured for sequencing primer annealing (II). Amplified DNA fragments are subjected to sequencing-by-synthesis using 3′ blocked labelled nucleotides (III). (Adapted from the Genome Analyzer brochure, http://www.solexa.com.)
    8. (A) Primers hybridise to the P1 adapter within the library template. A set of four fluorescence-labelled di-base probes competes for ligation to the sequencing primer. These probes have partly degenerate DNA sequences (indicated by n and z). Specificity of the di-base probe is achieved by interrogating the first and second base in each ligation reaction (CA in this case for the complementary strand). (B) Sequence determination by the SOLiD DNA sequencing platform is performed in multiple ligation cycles, using different primers, each one shorter than the previous one by a single base. The number of ligation cycles determines the eventual read length, whilst for each sequence tag, six rounds of primer reset occur [from primer (n) to primer (n − 4)]. (Adapted and modified from http://www.appliedbiosystems.com.)
    9. Cost
        Adapted from Eric Lander, 2010
    10. Throughput
        • Standard ABI “Sanger” sequencing
          96 samples/day
          Read length ~650 bp
          Total = 450,000 bases of sequence data
        • 454 was the game changer!
          ~400,000 different templates (reads)/day
          Read length ~250 bp
          Total = 100,000,000 bases of sequence data
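The totals on the throughput slide are reads per day multiplied by read length. A minimal Python sketch of that arithmetic, using only the numbers quoted on the slide (the function name `bases_per_day` is made up for illustration):

```python
def bases_per_day(reads_per_day: int, read_length_bp: int) -> int:
    """Raw daily yield in bases: number of reads times read length."""
    return reads_per_day * read_length_bp


# One 96-sample Sanger run at ~650 bp per read
sanger_single_run = bases_per_day(96, 650)

# 454: ~400,000 reads per day at ~250 bp per read
roche_454 = bases_per_day(400_000, 250)

# Number of 650 bp reads implied by the slide's 450,000-base Sanger total
implied_sanger_reads = 450_000 / 650

print(sanger_single_run, roche_454, round(implied_sanger_reads))
```

With these inputs the 454 figure matches the slide exactly (100,000,000 bases). A single 96-sample Sanger run comes to 62,400 bases, so the slide's 450,000-base total presumably assumes several runs per day (roughly 690 reads, or about seven 96-sample runs); that reading is an inference, not something the slide states.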
    11. Throughput
        • 454 Life Sciences/Roche
          Genome Sequencer FLX: currently produces 400-600 million bases per day per machine
          Published 1 million bases of Neanderthal DNA in 2006
          May 2007: published complete genome of James Watson (3.2 billion bases, ~20x coverage)
        • Solexa/Illumina
          10 Gb per machine/week
          May 2008: published complete genomes for 3 HapMap subjects (14x coverage)
        • ABI SOLiD
          20 Gb per machine/week
    12. RNA-Seq
        • Catalogue all species of transcripts:
          mRNA
          Non-coding RNA
          Small RNA
        • Splicing patterns or other post-transcriptional modifications.
        • Quantify the expression levels.
    13. Zhong Wang et al., Nat. Rev. Genetics, 2009
    14. Other Applications
        • SNP detection
        • Splice variant discovery
        • Identification of miRNA targets
        • TF binding sites
        • Genome methylation pattern
        • RNA editing
        • Metagenomic projects
        • Gene expression analysis
    15. Differences from other expression sequencing methods
        • EST: low throughput, expensive, NOT quantitative.
        • SAGE, CAGE, MPSS: high throughput, digital gene expression levels
          Expensive
          Sanger sequencing methods
          Only a portion of the transcript is analyzed
          Isoforms are indistinguishable
    16. Advantages:
        • Little or no background noise.
        • Sensitive to isoform discovery.
        • Both low and highly expressed genes can be quantified.
        • Highly reproducible.
    17. Data Analysis
        • Mapping reads to the reference assembly
        • Filtering output:
          Reads mapping more than x times
        • Downstream data analysis
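The filtering step above (dropping reads that map more than x times) can be sketched in Python as follows. This is only an illustration under assumed inputs: the `(read_id, chromosome, position)` record layout, the function name `filter_multimappers`, and the sample data are all hypothetical and do not come from any particular aligner.

```python
from collections import Counter


def filter_multimappers(alignments, max_hits):
    """Keep alignments only for reads that map at most `max_hits` times."""
    # Count how many alignment records each read produced
    hits = Counter(read_id for read_id, _chrom, _pos in alignments)
    return [record for record in alignments if hits[record[0]] <= max_hits]


# Hypothetical (read_id, chromosome, position) alignment records
alignments = [
    ("r1", "chr1", 100),
    ("r2", "chr1", 250),
    ("r2", "chr3", 900),
    ("r2", "chr7", 40),
    ("r3", "chr2", 10),
    ("r3", "chr2", 5000),
]

# With x = 2, read r2 (three hits) is discarded; r1 and r3 are kept
kept = filter_multimappers(alignments, max_hits=2)
print(kept)
```

The threshold x is a design choice: a low value removes repeats and paralogous hits at the cost of losing expression signal from gene families.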
    18. Mapping
        • One or two mismatches (reads < 35 bases)
        • One insertion/deletion.
        • K-mer based seeding.
        • Identification of novel transcripts.
        • Transcript abundance.
    19. Available tools for NextGen sequence alignment
        • BFAST: BLAT-like Fast Accurate Search Tool.
        • Bowtie: Burrows-Wheeler-transformed (BWT) index.
        • BWA: gapped global alignment with respect to query sequences.
        • ELAND: part of the Illumina distribution; runs on a single processor; local alignment.
        • SOAP: Short Oligonucleotide Alignment Program.
        • SSAHA: Sequence Search and Alignment by Hashing Algorithm.
        • SOCS: Rabin-Karp string search algorithm, which uses hashing.
        • Vmatch: a large string matching toolbox.
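The k-mer seeding idea from the Mapping slide can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the method of any tool listed above: the function names, the reference string, and the read are invented for the example, and only ungapped alignment with a mismatch limit is handled (no insertions or deletions).

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Seed with exact k-mer hits, then verify each candidate by counting mismatches."""
    candidates = set()
    # Non-overlapping seeds taken from the read
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # implied start of the whole read
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Hypothetical reference and a 12-base read carrying one mismatch
reference = "ACGTACGTTAGCCGATAGGCTTAACG"
read = "TAGCCTATAGGC"
index = build_kmer_index(reference, k=4)
hits = map_read(read, reference, index, k=4)
print(hits)
```

Splitting a read into three non-overlapping seeds guarantees that at least one seed is error-free when the read has at most two mismatches, which is why exact seed lookup followed by verification is sufficient here.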
    20. Integrated Pipeline
        • SOLiD™ System Analysis Pipeline Tool (Corona Lite)
        • CLC bio Genomics Workbench.
        • Galaxy server.
        • ERANGE: a full package for RNA-Seq and ChIP-Seq data analysis
        • DESeq (used by edgeR package)
    24. Trapnell et al., 2009
    25. (No text on this slide.)
    26. An overview of the MapSplice pipeline.
        © The Author(s) 2010. Published by Oxford University Press.
        Wang K et al. Nucl. Acids Res. 2010;38:e178
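To make the idea of splice junction discovery concrete, here is a toy Python sketch that splits a read into two exactly matching segments separated by a gap in the genome. It is not the MapSplice algorithm (which scores junctions by alignment quality and diversity); the function name, the `min_anchor` parameter, and the sequences are assumptions made up for the illustration.

```python
def find_spliced_alignment(read, genome, min_anchor=4):
    """Return (left_start, donor_end, acceptor_start) for two-segment exact alignments.

    The read is split at every point that leaves at least `min_anchor` bases
    on each side; the left part must match upstream of the right part with
    a gap (the candidate intron) in between.
    """
    results = []
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        while left_pos != -1:
            donor_end = left_pos + split  # first genomic base after the left segment
            right_pos = genome.find(right, donor_end + 1)
            while right_pos != -1:
                results.append((left_pos, donor_end, right_pos))
                right_pos = genome.find(right, right_pos + 1)
            left_pos = genome.find(left, left_pos + 1)
    return results


# Hypothetical genome: exon 1, a GT...AG intron, exon 2
genome = "ACGTTGCA" + "GTAAAAAAAG" + "CCTGGATC"
# Hypothetical read spanning the exon-exon junction
read = "TTGCA" + "CCTGG"

junctions = find_spliced_alignment(read, genome)
print(junctions)
for left_start, donor_end, acceptor_start in junctions:
    print("candidate intron:", genome[donor_end:acceptor_start])
```

In real data many reads support the same junction and some split points are ambiguous, which is where a scoring step such as the splice inference phase described in the notes becomes necessary.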
    27. Larsen et al., 2010
    28. Denoeud et al., 2008
    29. Transcripts Discovered/Corrected
        • 10,000 new transcription start sites discovered in rhesus macaque (Liu et al., NAR 2010).
        • 602 transcriptionally active regions and numerous introns in Candida albicans (Bruno et al., 2010, Genome Research).
        • 96% of the genes were corrected in Laccaria bicolor (Larsen et al., PLoS One 2010).
        • 16,923 regions in mouse (Mortazavi et al., 2008).
        • 3,724 novel isoforms (Trapnell et al., 2010).
    30. Bioinformatics Challenges
        • Store, retrieve and analyze large amounts of data
        • Matching of reads to multiple locations
        • Short reads with higher copy number and long reads representing less expressed genes.
    31. References:
        • Wilhelm J. Ansorge. Next-generation DNA sequencing techniques. New Biotechnology, Volume 25, Issue 4, April 2009, Pages 195-203.
        • Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 January; 10(1): 57–63.
        • Peter E. Larsen et al. Using Deep RNA Sequencing for the Structural Annotation of the Laccaria Bicolor Mycorrhizal Transcriptome. PLoS One. 2010; 5(7): e9780.
        • Wang et al. MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. NAR, 2010.
        • Denoeud et al. Annotating genomes with massive-scale RNA sequencing. Genome Biology, 2008.
        • Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. doi:10.1038/nbt.1621
        • Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. doi:10.1093/bioinformatics/btp120
        • Mortazavi et al. Nature Methods, May 2008.