RNA-seq based Genome Annotation with mGene.ngs and MiTie

RNA-Seq-based Genome Annotation
using mGene.ngs and MiTie
Gunnar Rätsch
Biomedical Data Science Group
Computational Biology Center
Memorial Sloan-Kettering Cancer Center
gxr #mGene #MiTie #PAGXXII


Acknowledgements and Disclosures
Main contributors
Gabriele Schweikert
Jonas Behr

Andre Kahles

Funding

Financial interest disclosure

© Gunnar Rätsch (cBio@MSKCC)
RNA-Seq-based Annotation using mGene.ngs and MiTie
PAG XXII Gene Discovery Workshop


Genome Annotation Pipeline(s)


Proposed new gene finding method (mGene.ngs) for reannotation of
19 A. thaliana genomes (and genome assembly + analysis).



mGene.ngs Overview
Goal: Predict annotation based on RNA-seq and genomic
sequence information
Learn function f(y|x) that scores gene models y based on
different sources of information x
Train parameters such that the true gene model y outscores every other gene model y' by a large margin:
$$f(y \mid x) \gg f(y' \mid x) \quad \text{for all } y' \neq y$$
Hidden semi-Markov Support Vector Machines (HsM-SVMs)
[Altun et al., 2003; Rätsch and Sonnenburg, 2007]

Automatically adapts to quality of RNA-seq data/alignments
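The large-margin idea can be illustrated with a minimal sketch. It assumes a linear score over named features and uses a plain hinge-loss subgradient step, which stands in for the actual HsM-SVM solver and is not the mGene.ngs implementation. The feature names and values are hypothetical.

```python
from typing import Dict

Features = Dict[str, float]


def score(weights: Features, features: Features) -> float:
    """Linear stand-in for f(y|x): dot product of weights and features of (x, y)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def margin_violation(weights: Features, true_feats: Features,
                     wrong_feats: Features, margin: float = 1.0) -> float:
    """Hinge loss: positive when the true model does not win by `margin`."""
    gap = score(weights, true_feats) - score(weights, wrong_feats)
    return max(0.0, margin - gap)


def subgradient_step(weights: Features, true_feats: Features,
                     wrong_feats: Features, lr: float = 0.5,
                     margin: float = 1.0) -> Features:
    """Move weights toward the true model's features if the margin is violated."""
    if margin_violation(weights, true_feats, wrong_feats, margin) == 0.0:
        return dict(weights)
    updated = dict(weights)
    for name, value in true_feats.items():
        updated[name] = updated.get(name, 0.0) + lr * value
    for name, value in wrong_feats.items():
        updated[name] = updated.get(name, 0.0) - lr * value
    return updated


# Hypothetical feature values for one true and one wrong gene model.
true_model = {"don": 1.0, "intron_support": 2.0}
wrong_model = {"don": 1.0, "intron_support": 0.0}
w = subgradient_step({}, true_model, wrong_model)
```

Features shared by both gene models cancel out in the update, so only the evidence that distinguishes the true model (here the hypothetical `intron_support`) gains weight.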



Training of mGene
[Figure: along the genomic position, the true gene model is shown above the signal tracks from STEP 1: SVM Signal Predictions (tss, tis, acc, don, stop). The score f(y|x) of the true gene model is separated from that of a wrong gene model by a large margin.]


Training of mGene.ngs
[Figure: same layout as for mGene (true gene model, wrong gene model, signal tracks tss, tis, acc, don, stop), extended by two RNA-seq tracks: coverage and intron support from spliced reads. With these tracks the score f(y|x) separates the true gene model from the wrong gene model by a larger margin.]


Results for C. elegans
RNA-seq:
paired-end, strand-specific, RNA-ligation-based protocol
76 bp reads, 50 million reads
Alignment with PALMapper
Evaluation:
Transcript-level F-score of coding transcripts
... for different expression levels
Compare mGene (ab initio), mGene.ngs, Cufflinks
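A transcript-level F-score can be computed as sketched below. The slides do not state the matching criterion, so this sketch assumes a predicted transcript counts as correct only if its chromosome, strand and complete exon structure match an annotated transcript exactly. The type and function names are illustrative.

```python
from typing import Iterable, Tuple

# (chromosome, strand, ((exon_start, exon_end), ...)); hypothetical representation
Transcript = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def transcript_f_score(predicted: Iterable[Transcript],
                       annotated: Iterable[Transcript]) -> float:
    """Harmonic mean of precision and recall over exactly matching transcripts."""
    pred, anno = set(predicted), set(annotated)
    true_positives = len(pred & anno)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(anno)
    return 2 * precision * recall / (precision + recall)


# Toy data: two predictions, one of which matches one of four annotations.
annotation = [
    ("I", "+", ((100, 200), (300, 400))),
    ("I", "+", ((100, 200), (350, 400))),
    ("I", "-", ((900, 950),)),
    ("II", "+", ((10, 50), (80, 120))),
]
prediction = [
    ("I", "+", ((100, 200), (300, 400))),   # exact match
    ("II", "+", ((10, 50), (90, 120))),     # wrong exon boundary
]
f_score = transcript_f_score(prediction, annotation)
```

To reproduce an evaluation by expression level, the same function could be applied separately to the transcripts in each expression percentile bin.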



Digestion
Observations:
RNA-seq helps to improve performance
Genomic signals help considerably (see Cufflinks)
Problems:
Need existing annotation for training
Cannot predict non-coding transcripts



Skimming and Non-coding Transcripts



Learning Strategy



[Figure: transcript-level F-score (y-axis, 0 to 0.7) against expression percentile (x-axis, 0 to 100). Curves: mGene, ab initio, w/ annotation; mGene.ngs, w/ annotation; Cufflinks (Trapnell et al. 2010); mGene.ngs, w/o annotation; and, in later builds, mGene.nc, w/o annotation.]
De novo prediction works!
Modeling noncoding transcripts improves coding transcript prediction.


Gene Finding vs. Transcript Assembly
Gene expression level: low to high
Gene finding + RNA-seq => only one transcript
RNA transcript assembly => multiple transcripts


BIOINFORMATICS, ORIGINAL PAPER, Genome analysis
Vol. 29 no. 20 2013, pages 2529–2538
doi:10.1093/bioinformatics/btt442
Advance Access publication August 25, 2013

MITIE: Simultaneous RNA-Seq-based transcript identification and
quantification in multiple samples
Jonas Behr, André Kahles, Yi Zhong, Vipin T. Sreedharan, Philipp Drewe and Gunnar Rätsch
Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA and Friedrich
Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany

Associate Editor: Ivo Hofacker
ABSTRACT
Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led
to tremendous improvements in the detection of expressed genes and
reconstruction of RNA transcripts. However, the extensive dynamic
range of gene expression, technical limitations and biases, as well
as the observed complexity of the transcriptional landscape, pose
profound computational challenges for transcriptome reconstruction.
Results: We present the novel framework MITIE (Mixed Integer
Transcript IdEntification) for simultaneous transcript reconstruction
and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few
transcripts collectively explaining the observed read data and show
how to find the optimal solution using Mixed Integer Programming.
MITIE can (i) take advantage of known transcripts, (ii) reconstruct
and quantify transcripts simultaneously in multiple samples, and
(iii) resolve the location of multi-mapping reads. It is designed for
genome- and assembly-based transcriptome reconstruction. We
present an extensive study based on realistic simulated RNA-Seq
data. When compared with state-of-the-art approaches, MITIE
proves to be significantly more sensitive and overall more accurate.
Moreover, MITIE yields substantial performance gains when used with
multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of
reconstructing omitted transcript annotations and the specificity with
respect to annotated transcripts. Our results corroborate that a well-

genic locus by means of alternative splicing, transcription start
and termination (e.g. Nilsen and Graveley, 2010; Rätsch et al.,
2007; Schweikert et al., 2009). A comprehensive catalog of all
transcripts encoded by a genomic locus is essential for downstream analyses that aim at a more detailed understanding of
gene expression and RNA processing regulation.
RNA-Seq is a method for parallel sequencing of a large number of RNA molecules based on high-throughput sequencing
technologies (ENCODE Project Consortium et al., 2012;
Mortazavi et al., 2008; Wang et al., 2009). Currently available
sequencing platforms typically provide several 10–100 millions of
sequence fragments (reads) with a typical length of 50–150 bases.
By mapping these reads back to the genome, one can determine
where gene products are encoded in the genome (e.g. Denoeud
et al., 2008; Guttman et al., 2010; Trapnell et al., 2010; Xia et al.,
2011) and collect evidence of RNA processing such as splicing
(Bradley et al., 2012; Sonnenburg et al., 2007) or RNA-editing
(Bahn et al., 2012).
In many cases, the RNA-Seq reads are first aligned to a reference genome using an alignment tool that identifies possible
read origins within the genome. Contiguous regions covered with
read alignments (possibly with small gaps) are candidates for
exonic segments. Alignment tools for RNA-Seq reads, such as
PALMapper (De Bona et al., 2008; Jean et al., 2010), TopHat

MiTie: transcript prediction via combinatorial optimization that combines
evidence from multiple experiments and achieves higher accuracy.
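The abstract mentions a likelihood function based on the negative binomial distribution. The sketch below shows one common way to write such a log-likelihood for read counts per segment. The mean/dispersion parametrization and the function names are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def nb_log_pmf(count: int, mean: float, dispersion: float) -> float:
    """Log-probability of `count` under a negative binomial distribution.

    Assumed parametrization: mean `mean` > 0 and size `dispersion` > 0,
    which gives variance mean + mean**2 / dispersion.
    """
    if mean <= 0 or dispersion <= 0:
        raise ValueError("mean and dispersion must be positive")
    r = dispersion
    return (
        math.lgamma(count + r) - math.lgamma(count + 1) - math.lgamma(r)
        + r * math.log(r / (r + mean))
        + count * math.log(mean / (r + mean))
    )


def coverage_log_likelihood(observed: Sequence[int],
                            expected: Sequence[float],
                            dispersion: float) -> float:
    """Sum of per-segment log-probabilities, treating segments as independent."""
    return sum(nb_log_pmf(c, mu, dispersion) for c, mu in zip(observed, expected))


# Hypothetical counts: observed reads per segment vs. expected coverage.
log_lik = coverage_log_likelihood([12, 0, 7], [10.0, 1.0, 8.0], dispersion=2.0)
```

Compared with a Poisson model, the extra dispersion parameter lets the variance exceed the mean, which is a frequent reason for choosing this distribution for read counts.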


Transcript Reconstruction with RNA-seq
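[Figure: workflow from reads to transcripts. Reads are handled either by genome-based assembly (Cufflinks, Scripture), using read alignments to the genomic DNA, or by de novo assembly (Trinity, Oases). Data processing produces a segment graph, which is passed to the optimization.]
$10^{8}$ possible transcripts, $10^{28}$ possible subsets of transcripts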


Enumerate and Quantify all Transcripts
[Figure: the segment graph next to a binary matrix U of potential transcripts, marking for each enumerated transcript which segments it contains. Later builds add an abundance matrix W with one column per sample and the expected coverage.]
Abundance, Sample1: 0.0, 0.2, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0
Abundance, Sample2: 0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0, 0.0
Quantification of the enumerated transcripts:
$$\min_{W} \; L(U^{T} \times W, C) + \gamma \times \|W\|_{1}$$
where $U^{T} \times W$ is the expected coverage and $C$ is the observed coverage.
R. Bohnert and G. Rätsch, NAR (2010)
[Behr et al., 2013]
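The enumeration step can be sketched as follows: every path through the segment graph becomes one binary row of U, and the expected coverage of a sample is the abundance-weighted sum of those rows. The toy graph, the abundance values and all function names are hypothetical and serve only to make the matrix notation concrete.

```python
from typing import Dict, List, Set


def enumerate_transcripts(edges: Dict[int, List[int]], n_segments: int,
                          sources: List[int], sinks: Set[int]) -> List[List[int]]:
    """Depth-first enumeration of all source-to-sink paths as binary rows."""
    rows: List[List[int]] = []

    def walk(node: int, path: List[int]) -> None:
        path = path + [node]
        if node in sinks:
            rows.append([1 if s in path else 0 for s in range(n_segments)])
        for nxt in edges.get(node, []):
            walk(nxt, path)

    for source in sources:
        walk(source, [])
    return rows


def expected_coverage(U: List[List[int]], w: List[float]) -> List[float]:
    """U^T w for a single sample: abundance-weighted sum over transcripts."""
    n_segments = len(U[0])
    return [sum(U[t][s] * w[t] for t in range(len(U))) for s in range(n_segments)]


# Toy segment graph with four segments; segments 1 and 2 can each be skipped.
toy_edges = {0: [1, 2, 3], 1: [2, 3], 2: [3]}
U = enumerate_transcripts(toy_edges, n_segments=4, sources=[0], sinks={3})
coverage = expected_coverage(U, [0.8, 0.2, 0.0, 0.0])
```

Even this small graph yields four candidate transcripts. The number of paths grows quickly with the number of optional segments, which is why exhaustive enumeration becomes impractical for complex loci.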


Simultaneous Identification & Quantification
[Figure: the segment graph, a transcripts matrix U containing N transcripts, and the abundance matrix W with one column per sample.]
Abundance, Sample1: 0.8, 0.2, 0.0, 0.0
Abundance, Sample2: 0.9, 0.0, 0.1, 0.0
$$\min_{U, W} \; L(U^{T} \times W, C) + \gamma \times N \quad \text{s.t. } U \text{ is valid}$$
where $U^{T} \times W$ is the expected coverage and $C$ is the observed coverage.
[Behr et al., 2013]
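The trade-off expressed by the penalty term γ × N can be illustrated by evaluating the objective for two candidate solutions. This sketch swaps in a squared-error loss for the likelihood L and does not perform the mixed integer optimization itself. The matrices and the values of gamma are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def expected_coverage(U: Matrix, W: Matrix) -> Matrix:
    """Compute U^T W: rows are segments, columns are samples."""
    n_segments, n_samples = len(U[0]), len(W[0])
    return [
        [sum(U[t][s] * W[t][j] for t in range(len(U))) for j in range(n_samples)]
        for s in range(n_segments)
    ]


def objective(U: Matrix, W: Matrix, C: Matrix, gamma: float) -> float:
    """Squared-error stand-in for L(U^T W, C), plus gamma times N transcripts."""
    E = expected_coverage(U, W)
    loss = sum((E[s][j] - C[s][j]) ** 2
               for s in range(len(C)) for j in range(len(C[0])))
    return loss + gamma * len(U)


# Observed coverage of three segments in one sample (hypothetical values).
C = [[1.0], [0.9], [1.0]]

# Candidate 1: a single transcript covering all three segments.
U1, W1 = [[1, 1, 1]], [[1.0]]
# Candidate 2: adds a second transcript that skips the middle segment.
U2, W2 = [[1, 1, 1], [1, 0, 1]], [[0.9], [0.1]]

strong_penalty = (objective(U1, W1, C, 0.05), objective(U2, W2, C, 0.05))
weak_penalty = (objective(U1, W1, C, 0.005), objective(U2, W2, C, 0.005))
```

With the stronger penalty the single-transcript solution has the lower objective. With the weaker penalty the second transcript pays for itself by explaining the dip in coverage on the middle segment.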


MiTie’s Main Features
Uses a likelihood function L based on a probabilistic model for
the read coverage.
Uses combinatorial optimization to find transcripts that explain
data from multiple RNA-seq libraries
Newly predicted transcripts are penalized (once).
Can use already known/confirmed transcripts without penalty.
Provides a p-value for each transcript as a confidence
measure for the presence of the predicted transcript.
Log-likelihood ratio test:
$$T_t = -2 \log \frac{p(D \mid M)}{p(D \mid M_t)}$$
[Behr et al., 2013]
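A log-likelihood ratio statistic can be turned into a p-value as sketched below. Two assumptions are made here that the slide does not state: the two models are nested and differ by one free parameter, so that the statistic is compared against a chi-square distribution with one degree of freedom, and the arguments are named by their role (restricted vs. full model) rather than mapped onto M and M_t.

```python
import math


def lrt_statistic(loglik_restricted: float, loglik_full: float) -> float:
    """-2 * log(likelihood ratio) for a restricted model nested in a full model."""
    return -2.0 * (loglik_restricted - loglik_full)


def chi2_sf_1dof(statistic: float) -> float:
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))


# Hypothetical log-likelihoods of the data without and with one transcript.
t_stat = lrt_statistic(loglik_restricted=-105.0, loglik_full=-100.0)
p_value = chi2_sf_1dof(t_stat)
```

A small p-value indicates that removing the transcript makes the data considerably less likely, which supports the presence of that transcript.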



MiTie Results
[Figure with panels A and B: F-score on transcript level against number of samples.
Human Simulated Data (F-score axis 0.35 to 0.45, 1 to 5 samples): MITIE + MMO, MITIE, Cufflinks + Cuffmerge, Cufflinks.
D. melanogaster modENCODE Data (F-score axis 0.29 to 0.37, 1 to 7 samples): MITIE, Cufflinks + Cuffmerge.]
[Behr et al., 2013]


Gene Finding vs. Transcript Assembly
Gene expression level: low to high
mGene.ngs => only one transcript
MiTie => multiple transcripts
(from low confidence to high confidence for alternative transcripts)



Conclusions
Genome annotation pipeline
Transcript Skimmer identifies highly expressed genes for training
mGene.ngs predicts coding and non-coding transcripts
MiTie predicts alternative transcripts for highly expressed genes

Genome annotation pipeline requires only
Genome sequence
RNA-seq alignments

Good for annotating new genomes or improving existing ones
Source code is freely available: http://bioweb.me/mgene
and http://bioweb.me/mitie
Functionality partially available in Galaxy instance
(http://galaxy.cbio.mskcc.org)

Thank you!

Just published:

Check out:
http://oqtans.org
http://galaxy.cbio.mskcc.org
[Sreedharan et al., 2014]

References
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3–10, 2003.
J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. RNA-seq and tiling arrays for improved gene finding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URL http://www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.
J. Behr, A. Kahles, Y. Zhong, V. T. Sreedharan, P. Drewe, and G. Rätsch. MITIE: Simultaneous RNA-seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–38, Oct 2013. doi: 10.1093/bioinformatics/btt442.
R. M. Clark, G. Schweikert, C. Toomajian, S. Ossowski, G. Zeller, P. Shinn, N. Warthmann, T. T. Hu, G. Fu, D. A. Hinds, H. Chen, K. A. Frazer, D. H. Huson, B. Schölkopf, M. Nordborg, G. Rätsch, J. R. Ecker, and D. Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi: 10.1126/science.1138632.
G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda, B. Schoelkopf, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.
G. Rätsch and S. Sonnenburg. Large scale hidden semi-Markov SVMs. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS'06), volume 19, pages 1161–1168, Cambridge, MA, 2007. MIT Press. URL http://www.fml.tuebingen.mpg.de/raetsch/projects/HSMSVM.
G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.
G. Schweikert, A. Zien, G. Zeller, J. Behr, C. Dieterich, C. S. Ong, P. Philips, F. De Bona, L. Hartmann, A. Bohlen, N. Krüger, S. Sonnenburg, and G. Rätsch. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009. URL http://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html. Advance access June 29, 2009.
S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, 2002.
S. Sonnenburg, A. Zien, and G. Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
V. T. Sreedharan, S. J. Schultheiss, G. Jean, A. Kahles, R. Bohnert, P. Drewe, P. Mudrakarta, N. Görnitz, G. Zeller, and G. Rätsch. Oqtans: The RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis. Bioinformatics, 2014. Bioinformatics Advance Access published January 11, 2014.
G. Zeller, R. M. Clark, K. Schneeberger, A. Bohlen, D. Weigel, and G. Rätsch. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.
A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799–807, September 2000.
