RNA-Seq-based Genome Annotation
using mGene.ngs and MiTie
Gunnar Rätsch
Biomedical Data Science Group
Computational Biol...
Memorial Sloan-Kettering Cancer Center

Acknowledgements and Disclosures
Main contributors
Gabriele Schweikert
Jonas Behr
...

Genome Annotation Pipeline(s)

© Gunnar Rätsch (cBio@MSKCC)

RNA-Seq-based Annot...
Proposed new gene finding method (mGene.ngs) for reannotation of
19 A. thaliana genomes (and genome assembly + analysis).

...

mGene.ngs Overview
Goal: Predict annotation based on RNA-seq and genomic
sequence ...

Training of mGene
[Figure: true gene model along the genomic position]
STEP 1: SVM Signal...

Training of mGene
[Figure: true gene model and wrong gene model along the genomic position]
S...

Training of mGene.ngs
[Figure: true gene model along the genomic position]
Wrong gene mod...

Results for C. elegans
RNA-seq:
paired-end, strand-specific RNA ligation based prot...

Digestion
Observations:
RNA-seq helps to improve performance
Genomic signals help ...

Skimming and Non-coding Transcripts

Learning Strategy

Results for C. elegans
[Figure: F-score plot]
mGene − ab initio w/ annotation
mGen...

Gene Finding vs. Transcript Assembly
Gene expression level: low to high
Genefinding ...
BIOINFORMATICS

ORIGINAL PAPER

Genome analysis

Vol. 29 no. 20 2013, pages 2529–2538
doi:10.1093/bioinformatics/btt442

A...

Transcript Reconstruction with RNA-seq
Reads

Genome Based Assembly
(Cufflinks, Sc...

Enumerate and Quantify all Transcripts
Segment Graph

Potential Transcripts

[Behr...

Enumerate and Quantify all Transcripts
Segment Graph

Abundance

Potential Transcr...

Simultaneous Identification & Quantification
Segment Graph

Abundance

Transcripts M...

MiTie’s Main Features
Uses a likelihood function L based on a probabilistic model ...

MiTie Results
[Figure, panels A and B: F-score on transcript level]
Huma...

Gene Finding vs. Transcript Assembly
Gene expression level: low to high

mGene.ngs
= ...

Conclusions
Genome annotation pipeline
Transcript Skimmer identifies highly express...
Just published:

Check out:
http://oqtans.org
http://galaxy.cbio.mskcc.org
[Sreedharan et al., 2014]
References I
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. ...
References II

VT Sreedharan, SJ Schultheiss, G Jean, A Kahles, R Bohnert, P Drewe, P Mudrakarta, N Görnitz, G Zeller, and...

RNA-seq based Genome Annotation with mGene.ngs and MiTie


Talk in gene discovery session at PAGXXII (https://pag.confex.com/pag/xxii/webprogram/Session2128.html)

Joint work with Jonas Behr, Gabriele Schweikert, Andre Kahles and others.

Abstract: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and transcripts. However, the immense dynamic range of gene expression, the limitations and biases of the sequencing technology, and the observed complexity of the transcriptional landscape pose profound computational challenges. We discuss several of these challenges and, based on illustrative simulation examples, identify the limits of state-of-the-art tools in reconstructing multiple alternative transcripts even when sufficient information is provided. We propose a novel framework, called MiTie, for simultaneous transcript reconstruction and quantification based on combinatorial optimization. We use the negative binomial distribution to define a likelihood function and a regularization approach to select a small number of transcripts that quantitatively explain the observed read data. We show that the resulting regularized maximum likelihood problem can be formulated as a mixed integer programming (MIP) problem, which can be solved optimally using standard optimization approaches. We also describe an extension of the discriminative gene finding system mGene that takes advantage of RNA-seq reads. We demonstrate that the extended system, mGene.ngs, predicts transcript annotations significantly more accurately when using RNA-seq data, and more accurately than transcriptome reconstruction tools that rely solely on RNA-seq data. Finally, we illustrate how a combination of gene finding and transcriptome reconstruction methods like MiTie can be used to accurately annotate newly sequenced genomes without prior annotations.
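
To make the regularized likelihood idea in the abstract concrete, here is a minimal sketch on a toy locus with three segments. It is not MiTie's implementation: it replaces the mixed integer program with brute-force enumeration over candidate subsets and a coarse abundance grid, and the candidate transcripts, the dispersion value `r`, the penalty `gamma` and all function names are assumptions made only for illustration.

```python
import itertools
import math


def nb_neg_log_lik(k, mu, r=50.0, eps=1e-9):
    """Negative log-probability of count k under a negative binomial
    with mean mu and dispersion (size) r. r=50 is an arbitrary toy value."""
    mu = max(mu, eps)  # avoid log(0) when no selected transcript covers a segment
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return -log_p


def expected_coverage(transcripts, abundances):
    """Per-segment expected coverage: sum of abundances of transcripts using the segment."""
    n_segments = len(transcripts[0])
    return [sum(t[s] * w for t, w in zip(transcripts, abundances))
            for s in range(n_segments)]


def fit_subset(transcripts, observed, grid):
    """Grid search over abundances for a fixed set of transcripts."""
    best_loss, best_abundances = math.inf, None
    for abundances in itertools.product(grid, repeat=len(transcripts)):
        mu = expected_coverage(transcripts, abundances)
        loss = sum(nb_neg_log_lik(k, m) for k, m in zip(observed, mu))
        if loss < best_loss:
            best_loss, best_abundances = loss, abundances
    return best_loss, best_abundances


def select_transcripts(candidates, observed, gamma, grid):
    """Brute-force stand-in for the MIP: minimize loss + gamma * (number of transcripts)."""
    best = (math.inf, None, None)
    names = sorted(candidates)
    for size in range(1, len(names) + 1):
        for subset in itertools.combinations(names, size):
            loss, abundances = fit_subset([candidates[n] for n in subset], observed, grid)
            objective = loss + gamma * size
            if objective < best[0]:
                best = (objective, subset, abundances)
    return best


# Hypothetical toy locus: 1 = segment is part of the transcript.
candidates = {"T1": (1, 1, 1), "T2": (1, 0, 1), "T3": (1, 1, 0)}
observed = (100, 20, 100)  # observed read coverage per segment
objective, subset, abundances = select_transcripts(
    candidates, observed, gamma=2.0, grid=range(0, 101, 10))
print(subset, abundances)
```

Enumeration is only feasible for a handful of candidates; the abstract's point is that the same objective can instead be handed to a MIP solver.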


RNA-seq based Genome Annotation with mGene.ngs and MiTie

  1. RNA-Seq-based Genome Annotation using mGene.ngs and MiTie. Gunnar Rätsch, Biomedical Data Science Group, Computational Biology Center, Memorial Sloan-Kettering Cancer Center. gxr #mGene #MiTie #PAGXXII
  2. Acknowledgements and Disclosures. Main contributors: Gabriele Schweikert, Jonas Behr, Andre Kahles. Funding. Financial interest disclosure. © Gunnar Rätsch (cBio@MSKCC), RNA-Seq-based Annotation using mGene.ngs and MiTie, PAG XXII Gene Discovery Workshop.
  6. Genome Annotation Pipeline(s)
  11. Proposed new gene finding method (mGene.ngs) for reannotation of 19 A. thaliana genomes (and genome assembly + analysis).
  12. mGene.ngs Overview. Goal: predict annotation based on RNA-seq and genomic sequence information. Learn a function $f(y|x)$ that scores gene models $y$ based on different sources of information $x$. Train parameters such that $f(y|x) \gg f(y'|x)$ for all $y' \neq y$ ("large margin"). Hidden semi-Markov Support Vector Machines (HsM-SVMs) [Altun et al., 2003; Rätsch and Sonnenburg, 2007]. Automatically adapts to the quality of RNA-seq data/alignments.
  15. Training of mGene. [Figure: true gene model along the genomic position; STEP 1: SVM signal predictions (tss, tis, acc, don, stop); score f(y|x).]
  17. Training of mGene. [Figure: true gene model and wrong gene model along the genomic position; STEP 1: SVM signal predictions (tss, tis, acc, don, stop); score f(y|x), large margin.]
  18. Training of mGene.ngs. [Figure: true gene model and wrong gene model along the genomic position; STEP 1: SVM signal predictions (tss, tis, acc, don, stop); RNA-seq coverage; intron support from spliced reads; score f(y|x), large margin.]
  19. Training of mGene.ngs. [Figure: same inputs as the previous slide; score f(y|x), larger margin.]
  20. Results for C. elegans. RNA-seq: paired-end, strand-specific, RNA-ligation-based protocol; 76 bp reads, 50 million reads; alignment with PALMapper. Evaluation: transcript-level F-score of coding transcripts for different expression levels. Compared: mGene (ab initio), mGene.ngs, Cufflinks.
  21. Results for C. elegans
  22. Digestion. Observations: RNA-seq helps to improve performance; genomic signals help much (see Cufflinks). Problems: an existing annotation is needed for training; non-coding transcripts cannot be predicted.
  23. Skimming and Non-coding Transcripts
  25. Learning Strategy
  28. Results for C. elegans. [Figure: F-score (y-axis, 0 to 0.7) against expression percentile (x-axis, 0 to 100) for mGene ab initio w/ annotation, mGene.ngs w/ annotation, Cufflinks (Trapnell et al. 2010), and mGene.ngs w/o annotation.]
  29. Results for C. elegans. [Figure: same plot with mGene.nc w/o annotation added.]
  30. Results for C. elegans. [Figure: same plot.] De novo prediction works! Modeling noncoding transcripts improves coding transcript prediction.
  31. Gene Finding vs. Transcript Assembly. Gene expression level: low to high. Gene finding + RNA-seq => only one transcript. RNA transcript assembly => multiple transcripts.
  32. [Paper front page] Bioinformatics, Original Paper, Genome analysis. Vol. 29 no. 20 2013, pages 2529–2538, doi:10.1093/bioinformatics/btt442. Advance Access publication August 25, 2013. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Jonas Behr (1,2), André Kahles (1), Yi Zhong (1), Vipin T. Sreedharan (1), Philipp Drewe (1) and Gunnar Rätsch (1). (1) Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA; (2) Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany. Associate Editor: Ivo Hofacker. ABSTRACT. Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction. Results: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-...
     [Introduction, partly visible] ... genic locus by means of alternative splicing, transcription start and termination (e.g. Nilsen and Graveley, 2010; Rätsch et al., 2007; Schweikert et al., 2009). A comprehensive catalog of all transcripts encoded by a genomic locus is essential for downstream analyses that aim at a more detailed understanding of gene expression and RNA processing regulation. RNA-Seq is a method for parallel sequencing of a large number of RNA molecules based on high-throughput sequencing technologies (ENCODE Project Consortium et al., 2012; Mortazavi et al., 2008; Wang et al., 2009). Currently available sequencing platforms typically provide several 10–100 millions of sequence fragments (reads) with a typical length of 50–150 bases. By mapping these reads back to the genome, one can determine where gene products are encoded in the genome (e.g. Denoeud et al., 2008; Guttman et al., 2010; Trapnell et al., 2010; Xia et al., 2011) and collect evidence of RNA processing such as splicing (Bradley et al., 2012; Sonnenburg et al., 2007) or RNA-editing (Bahn et al., 2012). In many cases, the RNA-Seq reads are first aligned to a reference genome using an alignment tool that identifies possible read origins within the genome. Contiguous regions covered with read alignments (possibly with small gaps) are candidates for exonic segments. Alignment tools for RNA-Seq reads, such as PALMapper (De Bona et al., 2008; Jean et al., 2010), TopHat ... [Slide caption] MiTie: Transcript prediction via combinatorial optimization that combines evidence from multiple experiments & achieves higher accuracy.
  33. Transcript Reconstruction with RNA-seq. [Diagram: reads; genome-based assembly (Cufflinks, Scripture) from read alignments to genomic DNA; de novo assembly (Trinity, Oases); data processing; segment graph; optimization.] 108 possible transcripts, 1028 possible subsets of transcripts.
  35. Enumerate and Quantify all Transcripts. [Figure: segment graph and potential transcripts.] [Behr et al., 2013]
  36. Enumerate and Quantify all Transcripts. [Figure: segment graph and binary matrix of potential transcripts.] [Behr et al., 2013]
  37. Enumerate and Quantify all Transcripts. [Figure: segment graph, binary matrix of potential transcripts, and expected coverage.] Abundance, Sample1: 0.0 0.2 0.0 0.8 0.0 0.0 0.0 0.0; Sample2: 0.0 0.0 0.1 0.9 0.0 0.0 0.0 0.0. R. Bohnert and G. Rätsch, NAR (2010). [Behr et al., 2013]
  38. Enumerate and Quantify all Transcripts. [Figure: segment graph, binary matrix of potential transcripts, and expected coverage.] Abundance, Sample1: 0.0 0.2 0.0 0.8 0.0 0.0 0.0 0.0; Sample2: 0.0 0.0 0.1 0.9 0.0 0.0 0.0 0.0. Objective: $\min_{W} L(U^T \times W, C) + \gamma \times \|W\|_1$, where $U^T \times W$ is the expected coverage and $C$ is the observed coverage. R. Bohnert and G. Rätsch, NAR (2010). [Behr et al., 2013]
  39. Simultaneous Identification & Quantification. [Figure: segment graph, transcripts matrix with k rows, and expected coverage.] Abundance, Sample1: 0.8 0.2 0.0 0.0; Sample2: 0.9 0.0 0.1 0.0. [Behr et al., 2013]
  40. Simultaneous Identification & Quantification. [Figure: segment graph, transcripts matrix $U$ with $N$ rows, abundance $W$, and expected coverage.] Abundance, Sample1: 0.8 0.2 0.0 0.0; Sample2: 0.9 0.0 0.1 0.0. Objective: $\min_{U,W} L(U^T \times W, C) + \gamma \times N$, where $U^T \times W$ is the expected coverage and $C$ is the observed coverage. [Behr et al., 2013]
  42. Simultaneous Identification & Quantification. [Figure: segment graph, transcripts matrix $U$ with $N$ rows, abundance $W$, and expected coverage.] Abundance, Sample1: 0.8 0.2 0.0 0.0; Sample2: 0.9 0.0 0.1 0.0. Objective: $\min_{U,W} L(U^T \times W, C) + \gamma \times N$ s.t. $U$ is valid.
  46. MiTie's Main Features. Uses a likelihood function L based on a probabilistic model for the read coverage. Uses combinatorial optimization to find transcripts that explain data from multiple RNA-seq libraries. Newly predicted transcripts are penalized (once). Can use already known/confirmed transcripts without penalty. Provides a p-value for each transcript as a confidence measure for the presence of the predicted transcript. Log-likelihood ratio test: $T_t = -2 \log \frac{p(D|M)}{p(D|M_t)}$ [Behr et al., 2013]
  47. MiTie Results. [Figure A: F-score on transcript level, human simulated data, for MITIE + MMO, MITIE, Cufflinks + Cuffmerge, and Cufflinks. Figure B: F-score on transcript level, D. melanogaster modENCODE data, for MITIE and Cufflinks + Cuffmerge. x-axis: number of samples.] [Behr et al., 2013]
  48. Gene Finding vs. Transcript Assembly. Gene expression level: low to high. mGene.ngs => only one transcript. MiTie => multiple transcripts. Low confidence to high confidence for alternative transcripts.
  49. Conclusions. Genome annotation pipeline: Transcript Skimmer identifies highly expressed genes for training; mGene.ngs predicts coding and non-coding transcripts; MiTie predicts alternative transcripts for highly expressed genes. The pipeline requires only the genome sequence and RNA-seq alignments. Good for annotating new genomes or improving existing ones. Sources are free: http://bioweb.me/mgene, http://bioweb.me/mitie. Functionality partially available in Galaxy instance (http://galaxy.cbio.mskcc.org). Thank you!
  55. Just published [Sreedharan et al., 2014]. Check out: http://oqtans.org and http://galaxy.cbio.mskcc.org
  56. References I
      Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3–10, 2003.
      J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. RNA-seq and tiling arrays for improved gene finding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URL http://www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.
      Jonas Behr, André Kahles, Yi Zhong, Vipin T Sreedharan, Philipp Drewe, and Gunnar Rätsch. MiTie: Simultaneous RNA-seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–38, Oct 2013. doi: 10.1093/bioinformatics/btt442.
      RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu, G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Schölkopf, M Nordborg, G Rätsch, JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi: 10.1126/science.1138632.
      G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda, B. Schölkopf, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.
      G Rätsch and S Sonnenburg. Large scale hidden semi-Markov SVMs. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS’06), volume 19, pages 1161–1168, Cambridge, MA, 2007. MIT Press. URL http://www.fml.tuebingen.mpg.de/raetsch/projects/HSMSVM.
      G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.
      Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Krüger, Sören Sonnenburg, and Gunnar Rätsch. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009. URL http://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html. Advance access June 29, 2009.
      S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, 2002.
      Sören Sonnenburg, Alexander Zien, and Gunnar Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
  57. References II
      VT Sreedharan, SJ Schultheiss, G Jean, A Kahles, R Bohnert, P Drewe, P Mudrakarta, N Görnitz, G Zeller, and Gunnar Rätsch. Oqtans: The RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis. Bioinformatics, 2014. Advance Access published January 11, 2014.
      G Zeller, RM Clark, K Schneeberger, A Bohlen, D Weigel, and G Rätsch. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.
      A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799–807, September 2000.
