RNA-Seq-based Genome Annotation
using mGene.ngs and MiTie
Gunnar Rätsch
Biomedical Data Science Group
Computational Biology Center
Memorial Sloan-Kettering Cancer Center
gxr #mGene #MiTie #PAGXXII

Acknowledgements and Disclosures
Main contributors
Gabriele Schweikert
Jonas Behr

Andre Kahles

Funding

Financial interest disclosure

© Gunnar Rätsch (cBio@MSKCC)

RNA-Seq-based Annotation using mGene.ngs and MiTie

PAG XXII Gene Discovery Workshop

2

Genome Annotation Pipeline(s)

Proposed new gene finding method (mGene.ngs) for reannotation of
19 A. thaliana genomes (and genome assembly + analysis).


mGene.ngs Overview
Goal: Predict annotation based on RNA-seq and genomic
sequence information
Learn function $f(y \mid x)$ that scores gene models $y$ based on
different sources of information $x$
Train parameters such that $f(y \mid x) \gg f(y' \mid x)$ for all $y' \neq y$ ("large margin"),
where $y$ is the true gene model and $y'$ is any other gene model
Hidden semi-Markov Support Vector Machines (HsM-SVMs)
[Altun et al., 2003; Rätsch and Sonnenburg, 2007]

Automatically adapts to quality of RNA-seq data/alignments
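
The large-margin condition can be illustrated with a minimal sketch. It assumes a linear score f(y|x) = w · Φ(x, y) over hand-made feature vectors; the feature names, weights and numbers are hypothetical. Actual HsM-SVM training also has to search over all gene models by dynamic programming, which is left out here.

```python
def score(w, features):
    """Linear score f(y|x) = w . phi(x, y) for one candidate gene model."""
    return sum(wi * fi for wi, fi in zip(w, features))


def margin_loss(w, true_features, wrong_features, margin=1.0):
    """Sum of hinge losses: positive whenever a wrong gene model scores
    within `margin` of the true gene model."""
    true_score = score(w, true_features)
    return sum(
        max(0.0, margin + score(w, phi) - true_score)
        for phi in wrong_features
    )


# Hypothetical features: (summed signal score, intron support from spliced reads)
w = [1.0, 2.0]
true_model = [3.0, 1.0]          # score 5.0
wrong_models = [[2.5, 0.0],      # score 2.5: separated by more than the margin
                [3.0, 0.75]]     # score 4.5: violates the margin by 0.5

loss = margin_loss(w, true_model, wrong_models)
```

Training would adjust w until this loss is zero for every wrong gene model, which is one way to read the "large margin" requirement.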


Training of mGene
[Figure: true gene model and wrong gene model along the genomic position. STEP 1: SVM signal predictions (tss, tis, acc, don, stop). The score f(y|x) of the true gene model exceeds that of the wrong gene model by a large margin.]

Training of mGene.ngs
[Figure: as above, with RNA-seq coverage and intron support from spliced reads as additional evidence. The score f(y|x) separates the true gene model from the wrong gene model by a larger margin.]

Results for C. elegans
RNA-seq:
paired-end, strand-specific, RNA-ligation-based protocol
76 bp reads, 50 million reads
Alignment with PALMapper
Evaluation:
Transcript-level F-score of coding transcripts
. . . for different expression levels
Compare mGene (ab initio), mGene.ngs, cufflinks
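
A minimal sketch of a transcript-level F-score follows. It assumes that a predicted transcript counts as correct only if its exon chain is identical to an annotated one; the slides do not state the exact matching criterion, and the coordinates are made up.

```python
def transcript_f_score(predicted, annotated):
    """Harmonic mean of transcript-level precision and recall.

    A predicted transcript counts as correct only if its exon chain is
    identical to an annotated one (assumed matching criterion).
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(annotated)
    return 2 * precision * recall / (precision + recall)


# Each transcript is a tuple of (start, end) exons; coordinates are made up.
annotated = {
    ((100, 200), (300, 400)),
    ((100, 200), (350, 400)),
    ((500, 650),),
}
predicted = {
    ((100, 200), (300, 400)),   # matches the annotation
    ((100, 210), (300, 400)),   # wrong exon boundary
}

f = transcript_f_score(predicted, annotated)  # precision 1/2, recall 1/3
```

Evaluating "for different expression levels" would amount to calling the same function on the annotated transcripts of each expression bin.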


Digestion
Observations:
RNA-seq helps to improve performance
Genomic signals help a lot (compare Cufflinks, which relies on RNA-seq alone)
Problems:
Need existing annotation for training
Cannot predict non-coding transcripts


Skimming and Non-coding Transcripts


Learning Strategy


Results for C. elegans
[Figure: F-score (y-axis, 0 to 0.7) versus expression percentile (x-axis, 0 to 100) for:
mGene − ab initio w/ annotation
mGene.ngs − w/ annotation
cufflinks − Trapnell et al. 2010
mGene.ngs − w/o annotation
mGene.nc − w/o annotation]
De novo prediction works!
Modeling noncoding transcripts improves coding transcript prediction.

Gene Finding vs. Transcript Assembly
Gene expression level: low to high
Gene finding + RNA-seq => only one transcript
RNA transcript assembly => multiple transcripts

BIOINFORMATICS

ORIGINAL PAPER

Genome analysis

Vol. 29 no. 20 2013, pages 2529–2538
doi:10.1093/bioinformatics/btt442

Advance Access publication August 25, 2013

MITIE: Simultaneous RNA-Seq-based transcript identification and
quantification in multiple samples
Jonas Behr1,2,*,†, André Kahles1, Yi Zhong1, Vipin T. Sreedharan1, Philipp Drewe1 and
Gunnar Rätsch1,*
1 Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA and 2 Friedrich
Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany

Associate Editor: Ivo Hofacker
ABSTRACT


Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led
to tremendous improvements in the detection of expressed genes and
reconstruction of RNA transcripts. However, the extensive dynamic
range of gene expression, technical limitations and biases, as well
as the observed complexity of the transcriptional landscape, pose
profound computational challenges for transcriptome reconstruction.
Results: We present the novel framework MITIE (Mixed Integer
Transcript IdEntification) for simultaneous transcript reconstruction
and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few
transcripts collectively explaining the observed read data and show
how to find the optimal solution using Mixed Integer Programming.
MITIE can (i) take advantage of known transcripts, (ii) reconstruct
and quantify transcripts simultaneously in multiple samples, and
(iii) resolve the location of multi-mapping reads. It is designed for
genome- and assembly-based transcriptome reconstruction. We
present an extensive study based on realistic simulated RNA-Seq
data. When compared with state-of-the-art approaches, MITIE
proves to be significantly more sensitive and overall more accurate.
Moreover, MITIE yields substantial performance gains when used with
multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of
reconstructing omitted transcript annotations and the specificity with
respect to annotated transcripts. Our results corroborate that a well- [...]

[...] genic locus by means of alternative splicing, transcription start
and termination (e.g. Nilsen and Graveley, 2010; Rätsch et al.,
2007; Schweikert et al., 2009). A comprehensive catalog of all
transcripts encoded by a genomic locus is essential for downstream analyses that aim at a more detailed understanding of
gene expression and RNA processing regulation.
RNA-Seq is a method for parallel sequencing of a large number of RNA molecules based on high-throughput sequencing
technologies (ENCODE Project Consortium et al., 2012;
Mortazavi et al., 2008; Wang et al., 2009). Currently available
sequencing platforms typically provide several 10–100 millions of
sequence fragments (reads) with a typical length of 50–150 bases.
By mapping these reads back to the genome, one can determine
where gene products are encoded in the genome (e.g. Denoeud
et al., 2008; Guttman et al., 2010; Trapnell et al., 2010; Xia et al.,
2011) and collect evidence of RNA processing such as splicing
(Bradley et al., 2012; Sonnenburg et al., 2007) or RNA-editing
(Bahn et al., 2012).
In many cases, the RNA-Seq reads are first aligned to a reference genome using an alignment tool that identifies possible
read origins within the genome. Contiguous regions covered with
read alignments (possibly with small gaps) are candidates for
exonic segments. Alignment tools for RNA-Seq reads, such as
PALMapper (De Bona et al., 2008; Jean et al., 2010), TopHat [...]
MiTie

Transcript prediction via combinatorial optimization that combines
evidence from multiple experiments & achieves higher accuracy.


Transcript Reconstruction with RNA-seq
[Diagram labels: Reads; Read alignments; Genomic DNA; Data processing; Segment graph; Optimization]
Genome-based assembly (Cufflinks, Scripture)
De novo assembly (Trinity, Oases)

108 possible transcripts, 1028 possible subsets of transcripts
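
Why the search space grows so quickly can be seen on a toy segment graph. The sketch below assumes that every candidate transcript is a path from one source segment to one sink segment in an acyclic graph; the graph itself is hypothetical and not taken from the slides.

```python
def enumerate_transcripts(edges, source, sink):
    """Return every path from `source` to `sink` in an acyclic segment graph."""
    paths = []

    def extend(path):
        last = path[-1]
        if last == sink:
            paths.append(path)
            return
        for nxt in edges.get(last, []):
            extend(path + [nxt])

    extend([source])
    return paths


# Hypothetical graph: four exonic segments; an edge joins two segments that
# are adjacent or connected by an intron.
edges = {0: [1, 2], 1: [2, 3], 2: [3]}
transcripts = enumerate_transcripts(edges, source=0, sink=3)
n_subsets = 2 ** len(transcripts)   # number of candidate transcript sets
```

Even this four-segment graph yields three candidate transcripts and eight candidate transcript sets, so exhaustive enumeration is only feasible for small loci.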

Enumerate and Quantify all Transcripts
[Figure: segment graph; binary matrix U of all potential transcripts over the segments; abundance matrix W with one column per sample (Sample1, Sample2); expected coverage $U^{T} \times W$ compared with observed coverage $C$.]
$$\min_{W} \; L(U^{T} \times W,\, C) + \gamma \times \|W\|_{1}$$
R. Bohnert and G. Rätsch, NAR (2010); [Behr et al., 2013]

Simultaneous Identification & Quantification
[Figure: segment graph; transcripts matrix U with N rows; abundance matrix W with one column per sample (Sample1, Sample2); expected coverage $U^{T} \times W$ compared with observed coverage $C$.]
$$\min_{U,W} \; L(U^{T} \times W,\, C) + \gamma \times N \quad \text{s.t. } U \text{ is valid}$$
[Behr et al., 2013]

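The objective with the penalty γ × N can be made concrete with a small sketch. It is illustrative only: squared error stands in for the likelihood-based loss L, "U is valid" is read as "every row of U follows edges of the segment graph", and all matrices are made up. A real solver would search over U and W, whereas this sketch only compares two hand-picked candidates.

```python
def is_valid(U, edges):
    """Each transcript (row of U) must follow edges of the segment graph."""
    for row in U:
        used = [s for s, bit in enumerate(row) if bit]
        if not used or any((a, b) not in edges for a, b in zip(used, used[1:])):
            return False
    return True


def expected_coverage(U, W):
    """U^T x W: rows are segments, columns are samples."""
    return [
        [sum(U[t][s] * W[t][r] for t in range(len(U))) for r in range(len(W[0]))]
        for s in range(len(U[0]))
    ]


def objective(U, W, C, gamma):
    """Squared-error loss (a stand-in for L) plus gamma * N,
    where N is the number of transcripts (rows of U)."""
    E = expected_coverage(U, W)
    loss = sum(
        (e - c) ** 2
        for row_e, row_c in zip(E, C)
        for e, c in zip(row_e, row_c)
    )
    return loss + gamma * len(U)


edges = {(0, 1), (1, 2), (0, 2)}           # hypothetical segment graph
C = [[1.0, 1.0], [0.8, 0.9], [1.0, 1.0]]   # observed coverage: 3 segments x 2 samples

# Candidate A uses two transcripts, candidate B only one.
U_a, W_a = [[1, 1, 1], [1, 0, 1]], [[0.8, 0.9], [0.2, 0.1]]
U_b, W_b = [[1, 1, 1]], [[1.0, 1.0]]

small_gamma = (objective(U_a, W_a, C, 0.01), objective(U_b, W_b, C, 0.01))
large_gamma = (objective(U_a, W_a, C, 0.1), objective(U_b, W_b, C, 0.1))
```

With a small γ the two-transcript candidate has the lower objective because it fits the coverage better; with a larger γ the single-transcript candidate wins, which shows how the penalty trades fit against the number of transcripts.
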
MiTie’s Main Features
Uses a likelihood function L based on a probabilistic model for
the read coverage.
Uses combinatorial optimization to find transcripts that explain
data from multiple RNA-seq libraries
Newly predicted transcripts are penalized (once).
Can use already known/confirmed transcripts without penalty.
Provides a p-value for each transcript providing a confidence
measure for presence of predicted transcript.
Log-likelihood ratio test: $T_t = -2 \log \frac{p(D \mid M)}{p(D \mid M_t)}$
[Behr et al., 2013]
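
How a test statistic of this form can be turned into a p-value is sketched below. The sketch assumes that T_t is non-negative (the model in the denominator fits at least as well) and that it is compared against a chi-square distribution with one degree of freedom; neither assumption is stated on the slide, and the log-likelihood values are made up.

```python
import math


def lrt_statistic(loglik_m, loglik_mt):
    """T_t = -2 * log( p(D|M) / p(D|M_t) ), computed from log-likelihoods."""
    return -2.0 * (loglik_m - loglik_mt)


def chi2_sf_1df(t):
    """Upper tail probability of a chi-square distribution with 1 degree
    of freedom (assumed reference distribution)."""
    return math.erfc(math.sqrt(max(t, 0.0) / 2.0))


# Made-up log-likelihoods for illustration only.
t_stat = lrt_statistic(loglik_m=-105.0, loglik_mt=-100.0)
p_value = chi2_sf_1df(t_stat)
```

A small p-value would then indicate that the two models explain the read data D very differently, which is what makes it usable as a confidence measure for a predicted transcript.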


MiTie Results
[Figure, panel A: Human Simulated Data. F-score on transcript level (0.35 to 0.45) versus number of samples (1 to 5) for MITIE + MMO, MITIE, Cufflinks + Cuffmerge and Cufflinks.]
[Figure, panel B: D. melanogaster modENCODE Data. F-score on transcript level (0.29 to 0.37) versus number of samples (1 to 7) for MITIE and Cufflinks + Cuffmerge.]
[Behr et al., 2013]

Gene Finding vs. Transcript Assembly
Gene expression level: low to high
mGene.ngs => only one transcript
MiTie => multiple transcripts
Confidence for alternative transcripts: low to high


Conclusions
Genome annotation pipeline
Transcript Skimmer identifies highly expressed genes for training
mGene.ngs predicts coding and non-coding transcripts
MiTie predicts alternative transcripts for highly expressed genes

Genome annotation pipeline requires only
Genome sequence
RNA-seq alignments

Good for annotating new genomes or improving existing ones
Source code is freely available: http://bioweb.me/mgene and
http://bioweb.me/mitie
Functionality partially available in Galaxy instance
(http://galaxy.cbio.mskcc.org)

Thank you!

Just published:

Check out:
http://oqtans.org
http://galaxy.cbio.mskcc.org
[Sreedharan et al., 2014]
References
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3–10, 2003.
J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. RNA-seq and tiling arrays for improved gene finding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URL http://www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.
J. Behr, A. Kahles, Y. Zhong, V. T. Sreedharan, P. Drewe, and G. Rätsch. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–2538, October 2013. doi: 10.1093/bioinformatics/btt442.
R. M. Clark, G. Schweikert, C. Toomajian, S. Ossowski, G. Zeller, P. Shinn, N. Warthmann, T. T. Hu, G. Fu, D. A. Hinds, H. Chen, K. A. Frazer, D. H. Huson, B. Schölkopf, M. Nordborg, G. Rätsch, J. R. Ecker, and D. Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338–342, 2007. doi: 10.1126/science.1138632.
G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In B. Schölkopf, K. Tsuda, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.
G. Rätsch and S. Sonnenburg. Large scale hidden semi-Markov SVMs. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS'06), volume 19, pages 1161–1168, Cambridge, MA, 2007. MIT Press. URL http://www.fml.tuebingen.mpg.de/raetsch/projects/HSMSVM.
G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.
G. Schweikert, A. Zien, G. Zeller, J. Behr, C. Dieterich, C. S. Ong, P. Philips, F. De Bona, L. Hartmann, A. Bohlen, N. Krüger, S. Sonnenburg, and G. Rätsch. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009. URL http://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html. Advance access June 29, 2009.
S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, 2002.
S. Sonnenburg, A. Zien, and G. Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
V. T. Sreedharan, S. J. Schultheiss, G. Jean, A. Kahles, R. Bohnert, P. Drewe, P. Mudrakarta, N. Görnitz, G. Zeller, and G. Rätsch. Oqtans: The RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis. Bioinformatics, 2014. Advance Access published January 11, 2014.
G. Zeller, R. M. Clark, K. Schneeberger, A. Bohlen, D. Weigel, and G. Rätsch. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. doi: 10.1101/gr.070169.107.
A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799–807, September 2000.

More Related Content

Similar to RNA-seq based Genome Annotation with mGene.ngs and MiTie

Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)
Gunnar Rätsch
 
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Elia Brodsky
 
2023 GIAB AMP Update
2023 GIAB AMP Update2023 GIAB AMP Update
2023 GIAB AMP Update
GenomeInABottle
 
Data analysis pipelines for NGS applications
Data analysis pipelines for NGS applicationsData analysis pipelines for NGS applications
Data analysis pipelines for NGS applications
Vall d'Hebron Institute of Research (VHIR)
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
RNA-seq based Genome Annotation with mGene.ngs and MiTie

  • 1. RNA-Seq-based Genome Annotation using mGene.ngs and MiTie. Gunnar Rätsch, Biomedical Data Science Group, Computational Biology Center, Memorial Sloan-Kettering Cancer Center. gxr #mGene #MiTie #PAGXXII
  • 2. Acknowledgements and Disclosures. Main contributors: Gabriele Schweikert, Jonas Behr, Andre Kahles. Funding. Financial interest disclosure. © Gunnar Rätsch (cBio@MSKCC), RNA-Seq-based Annotation using mGene.ngs and MiTie, PAG XXII Gene Discovery Workshop.
  • 6. Genome Annotation Pipeline(s)
  • 11. Proposed new gene finding method (mGene.ngs) for reannotation of 19 A. thaliana genomes (and genome assembly + analysis).
  • 12. mGene.ngs Overview. Goal: predict annotation based on RNA-seq and genomic sequence information. Learn a function $f(y \mid x)$ that scores gene models $y$ based on different sources of information $x$. Train parameters such that $f(y \mid x) \gg f(y' \mid x)$ for all $y' \neq y$ ("large margin"). Hidden semi-Markov Support Vector Machines (HsM-SVMs) [Altun et al., 2003; Rätsch and Sonnenburg, 2007]. Automatically adapts to quality of RNA-seq data/alignments.
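The large-margin condition on slide 12 can be illustrated with a toy linear scorer. This is only a sketch: actual HsM-SVM training solves a structured optimization problem over all gene models, whereas the code below uses a simple perceptron-style update as a stand-in. The feature names (`splice_signal`, `intron_support`) and all numeric values are hypothetical.

```python
from typing import Dict, List

Features = Dict[str, float]


def score(weights: Features, feats: Features) -> float:
    """Linear score f(y|x) = sum_k w_k * phi_k(x, y)."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


def train_large_margin(true_feats: Features, wrong_feats_list: List[Features],
                       margin: float = 1.0, lr: float = 0.1,
                       epochs: int = 50) -> Features:
    """Update weights until f(true) - f(wrong) >= margin for every wrong model."""
    weights: Features = {}
    for _ in range(epochs):
        violated = False
        for wrong in wrong_feats_list:
            if score(weights, true_feats) - score(weights, wrong) < margin:
                violated = True
                # Move towards the true model's features, away from the wrong one's.
                for name, value in true_feats.items():
                    weights[name] = weights.get(name, 0.0) + lr * value
                for name, value in wrong.items():
                    weights[name] = weights.get(name, 0.0) - lr * value
        if not violated:
            break
    return weights


# Hypothetical features: both models share the same splice signals,
# but only the true model is supported by spliced reads.
true_model = {"splice_signal": 2.0, "intron_support": 3.0}
wrong_model = {"splice_signal": 2.0, "intron_support": 0.0}
w = train_large_margin(true_model, [wrong_model])
print(score(w, true_model) - score(w, wrong_model))
```

In this toy setting only the RNA-seq-derived feature separates the two models, so all of the learned weight ends up on it.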
  • 15. Training of mGene (figure). Along the genomic position axis: the true gene model and a wrong gene model. Step 1: SVM signal predictions (tss, tis, acc, don, stop). Score f(y|x), with a large margin.
  • 18. Training of mGene.ngs (figure). Along the genomic position axis: the true gene model and a wrong gene model. Step 1: SVM signal predictions (tss, tis, acc, don, stop), RNA-seq coverage, and intron support from spliced reads. Score f(y|x), with a larger margin.
  • 20. Results for C. elegans. RNA-seq: paired-end, strand-specific, RNA ligation based protocol; 76 bp reads, 50 million reads; alignment with PALMapper. Evaluation: transcript-level F-score of coding transcripts for different expression levels. Compare mGene (ab initio), mGene.ngs, Cufflinks.
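A transcript-level F-score of the kind named on slide 20 could be computed as sketched below. The slide does not state the matching criterion, so the sketch assumes that a predicted transcript counts as correct only if its chromosome, strand and full exon structure exactly match an annotated transcript. The tuple representation and the example coordinates are hypothetical.

```python
from typing import Iterable, Tuple

# A transcript is (chromosome, strand, ((exon_start, exon_end), ...)).
Transcript = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def transcript_f_score(predicted: Iterable[Transcript],
                       annotated: Iterable[Transcript]) -> float:
    """Harmonic mean of transcript-level precision and recall (exact matches)."""
    pred, ann = set(predicted), set(annotated)
    true_positives = len(pred & ann)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ann)
    return 2 * precision * recall / (precision + recall)


annotated = [
    ("I", "+", ((100, 200), (300, 400))),
    ("I", "+", ((100, 200), (350, 400))),
    ("II", "-", ((50, 80),)),
]
predicted = [
    ("I", "+", ((100, 200), (300, 400))),   # exact match
    ("I", "+", ((100, 210), (300, 400))),   # wrong exon boundary
]
print(transcript_f_score(predicted, annotated))
```

Binning transcripts by expression percentile before calling the function would give one F-score per expression level, as in the evaluation described on the slide.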
  • 21. Results for C. elegans (figure).
  • 22. Digestion. Observations: RNA-seq helps to improve performance; genomic signals help much (see Cufflinks). Problems: need existing annotation for training; cannot predict non-coding transcripts.
  • 23. Skimming and Non-coding Transcripts
  • 25. Learning Strategy
  • 28. Results for C. elegans (figure). F-score (0 to 0.7) versus expression percentile (0 to 100) for: mGene ab initio w/ annotation; mGene.ngs w/ annotation; Cufflinks (Trapnell et al. 2010); mGene.ngs w/o annotation; mGene.nc w/o annotation. De novo prediction works! Modeling noncoding transcripts improves coding transcript prediction.
  • 31. Gene Finding vs. Transcript Assembly. Gene expression level: low to high. Gene finding + RNA-seq => only one transcript. RNA transcript assembly => multiple transcripts.
  • 32. Bioinformatics, Original Paper, Genome analysis, Vol. 29 no. 20 2013, pages 2529–2538, doi:10.1093/bioinformatics/btt442, Advance Access publication August 25, 2013. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Jonas Behr, André Kahles, Yi Zhong, Vipin T. Sreedharan, Philipp Drewe and Gunnar Rätsch. Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA, and Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany. Associate Editor: Ivo Hofacker.
    ABSTRACT. Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction. Results: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well- [...]
    [...] genic locus by means of alternative splicing, transcription start and termination (e.g. Nilsen and Graveley, 2010; Rätsch et al., 2007; Schweikert et al., 2009). A comprehensive catalog of all transcripts encoded by a genomic locus is essential for downstream analyses that aim at a more detailed understanding of gene expression and RNA processing regulation. RNA-Seq is a method for parallel sequencing of a large number of RNA molecules based on high-throughput sequencing technologies (ENCODE Project Consortium et al., 2012; Mortazavi et al., 2008; Wang et al., 2009). Currently available sequencing platforms typically provide several 10–100 millions of sequence fragments (reads) with a typical length of 50–150 bases. By mapping these reads back to the genome, one can determine where gene products are encoded in the genome (e.g. Denoeud et al., 2008; Guttman et al., 2010; Trapnell et al., 2010; Xia et al., 2011) and collect evidence of RNA processing such as splicing (Bradley et al., 2012; Sonnenburg et al., 2007) or RNA-editing (Bahn et al., 2012). In many cases, the RNA-Seq reads are first aligned to a reference genome using an alignment tool that identifies possible read origins within the genome. Contiguous regions covered with read alignments (possibly with small gaps) are candidates for exonic segments. Alignment tools for RNA-Seq reads, such as PALMapper (De Bona et al., 2008; Jean et al., 2010), TopHat [...]
    MiTie: Transcript prediction via combinatorial optimization that combines evidence from multiple experiments & achieves higher accuracy.
  • 33. Transcript Reconstruction with RNA-seq Reads. Genome-based assembly (Cufflinks, Scripture): read alignments. De novo assembly (Trinity, Oases). Genomic DNA, data processing, segment graph, optimization. $10^{8}$ possible transcripts, $10^{28}$ possible subsets of transcripts.
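The combinatorial growth mentioned on this slide can be made concrete by counting paths in a segment graph. The sketch below assumes that each candidate transcript corresponds to one path from a start segment to an end segment in a directed acyclic graph. The graph encoding and the helper `optional_exon_chain` are hypothetical and not taken from the talk.

```python
from typing import Dict, List


def count_paths(successors: Dict[int, List[int]], source: int, sink: int) -> int:
    """Count source-to-sink paths in a DAG by memoised depth-first search."""
    memo: Dict[int, int] = {}

    def visit(node: int) -> int:
        if node == sink:
            return 1
        if node not in memo:
            memo[node] = sum(visit(nxt) for nxt in successors.get(node, ()))
        return memo[node]

    return visit(source)


def optional_exon_chain(n_optional: int) -> Dict[int, List[int]]:
    """Segments 0..n+1 where every internal segment may be skipped."""
    last = n_optional + 1
    return {i: list(range(i + 1, last + 1)) for i in range(last + 1)}


# Exon skipping: segment 1 is optional between segments 0 and 2.
small_graph = {0: [1, 2], 1: [2], 2: [3], 3: []}
print(count_paths(small_graph, 0, 3))

# Each additional optional segment doubles the number of candidate transcripts.
for n in (5, 10, 20):
    print(n, count_paths(optional_exon_chain(n), 0, n + 1))
```

Because the number of paths grows exponentially with the number of alternative segments, and the number of transcript subsets grows exponentially again, exhaustive search quickly becomes impractical.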
  • 35. Enumerate and Quantify all Transcripts [R. Bohnert and G. Rätsch, NAR (2010); Behr et al., 2013]. Figure: a segment graph; all potential transcripts encoded as a binary matrix $U$ (which segments each of the 8 potential transcripts contains); abundances $W$ per sample (Sample 1: 0.0, 0.2, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0; Sample 2: 0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0, 0.0); expected coverage $U^{T} \times W$ compared with observed coverage $C$. Optimization: $\min_{W} L(U^{T} \times W, C) + \gamma \times \|W\|_{1}$.
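The quantification step with an L1 penalty can be sketched for a single sample as follows. Two simplifications are assumed here: a squared-error loss stands in for the likelihood $L$, and the problem is solved by projected gradient descent with non-negative abundances. The matrix `U`, the coverage vector and the value of `gamma` are hypothetical toy values.

```python
from typing import List


def quantify(U: List[List[int]], coverage: List[float],
             gamma: float = 0.01, iters: int = 5000) -> List[float]:
    """Minimise 0.5 * ||U^T w - c||^2 + gamma * sum(w) subject to w >= 0.

    U[t][s] is 1 if transcript t contains segment s; coverage[s] is the
    observed coverage of segment s.
    """
    n_t, n_s = len(U), len(coverage)
    # Step size 1 / ||U||_F^2 never exceeds 1 / (largest eigenvalue of U U^T).
    step = 1.0 / sum(u * u for row in U for u in row)
    w = [0.0] * n_t
    for _ in range(iters):
        expected = [sum(U[t][s] * w[t] for t in range(n_t)) for s in range(n_s)]
        resid = [expected[s] - coverage[s] for s in range(n_s)]
        # For w >= 0 the L1 norm equals sum(w), so its gradient is the constant gamma.
        grad = [sum(U[t][s] * resid[s] for s in range(n_s)) + gamma
                for t in range(n_t)]
        # Gradient step followed by projection onto w >= 0.
        w = [max(0.0, w[t] - step * grad[t]) for t in range(n_t)]
    return w


# Three candidate transcripts over four segments (toy example).
U = [
    [1, 1, 1, 1],   # contains every segment
    [1, 0, 1, 1],   # skips segment 2
    [1, 1, 0, 1],   # skips segment 3
]
coverage = [1.0, 0.8, 1.0, 1.0]   # generated from abundances 0.8, 0.2, 0.0
abundances = quantify(U, coverage)
print([round(a, 3) for a in abundances])
```

The penalty drives the abundance of the unsupported third transcript to zero and shrinks the others slightly. That sparsity is the reason for using an L1 term when all potential transcripts are enumerated.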
  • 39. Memorial Sloan-Kettering Cancer Center Simultaneous Identification & Quantification Segment Graph Abundance Transcripts Matrix ... 1 1 0 k 1 1 0 0 1 0 1 1 1 0 Sample1 Sample2 0 1 0 0 1 1 1 0 1 1 1 0 1 1 1 0 0.8 0.2 0.0 0.0 Expected coverage 0.9 0.0 0.1 0.0 [Behr et al., 2013] c Gunnar R¨tsch (cBio@MSKCC) a RNA-Seq-based Annotation using mGene.ngs and MiTie PAG XXII Gene Discovery Workshop 16
  • 40. Memorial Sloan-Kettering Cancer Center Simultaneous Identification & Quantification Segment Graph Abundance Transcripts Matrix ... 1 1 0 N 1 1 0 0 1 0 1 1 1 0 0 1 0 0 U 1 1 1 0 1 1 1 0 1 1 1 0 min L( U T × W , U,W Expected coverage Sample1 Sample2 0.8 0.2 0.0 0.0 0.9 0.0 0.1 0.0 W )+γ×N C expected coverage observed coverage [Behr et al., 2013] c Gunnar R¨tsch (cBio@MSKCC) a RNA-Seq-based Annotation using mGene.ngs and MiTie PAG XXII Gene Discovery Workshop 16
  • 41. Memorial Sloan-Kettering Cancer Center Simultaneous Identification & Quantification Segment Graph Abundance Transcripts Matrix ... 1 1 0 N 1 1 0 0 1 0 1 1 1 0 Sample1 Sample2 0 1 0 0 U 1 1 1 0 1 1 1 0 1 1 1 0 0.8 0.2 0.0 0.0 Expected coverage 0.9 0.0 0.1 0.0 W min L(U T × W , C ) + γ × N U,W [Behr et al., 2013] c Gunnar R¨tsch (cBio@MSKCC) a RNA-Seq-based Annotation using mGene.ngs and MiTie PAG XXII Gene Discovery Workshop 16
  • 42. Simultaneous Identification & Quantification. [Figure: segment graph, transcripts matrix $U$ ($N$ transcripts), and abundance matrix $W$ for Sample1 and Sample2 (abundance rows: 0.8 0.2 0.0 0.0 and 0.9 0.0 0.1 0.0).] Optimization problem: $\min_{U,W} L(U^T \times W, C) + \gamma \times N$ s.t. $U$ is valid.
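A rough sketch of how a candidate solution $(U, W)$ could be scored under this constrained objective is given below. Several things here are assumptions, not taken from the slides: "valid" is read as "consecutive segments of each transcript are connected by an edge of the segment graph", $N$ is counted as the number of transcripts with nonzero abundance in at least one sample, squared error stands in for $L$, and all names and values are hypothetical. The sketch only evaluates one candidate; it does not perform the combinatorial optimization itself.

```python
def is_valid_transcript(row, edges):
    """Check one row of U against the segment graph (assumed validity rule).

    row[s] is 1 if the transcript uses segment s. The row counts as valid
    here when every pair of consecutive used segments is a graph edge.
    """
    used = [s for s, bit in enumerate(row) if bit]
    if not used:
        return False
    return all((a, b) in edges for a, b in zip(used, used[1:]))


def joint_objective(U, W, C, gamma, edges):
    """Return squared-error loss + gamma * N, or None when U is not valid."""
    if not all(is_valid_transcript(row, edges) for row in U):
        return None
    loss = 0.0
    for s in range(len(C)):
        for j in range(len(C[0])):
            expected = sum(U[t][s] * W[t][j] for t in range(len(U)))
            loss += (expected - C[s][j]) ** 2
    # Assumed definition of N: transcripts that are actually used somewhere.
    n_used = sum(1 for row in W if any(w > 0 for w in row))
    return loss + gamma * n_used


# Toy segment graph over four segments (invented for illustration).
EDGES = {(0, 1), (1, 2), (0, 2), (2, 3)}
U = [[1, 1, 1, 1],   # uses every segment
     [1, 0, 1, 1],   # skips segment 1 via edge (0, 2)
     [0, 1, 1, 1]]   # valid path, but unused below
W = [[0.8, 0.9],
     [0.2, 0.0],
     [0.0, 0.0]]
C = [[1.0, 0.9], [0.8, 0.9], [1.0, 0.9], [1.0, 0.9]]
score = joint_objective(U, W, C, gamma=0.1, edges=EDGES)
```

Penalizing the count $N$ rather than the abundances means a transcript pays a fixed cost once it is used at all, regardless of how strongly it is expressed.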
  • 46. MiTie’s Main Features [Behr et al., 2013]
      • Uses a likelihood function $L$ based on a probabilistic model for the read coverage.
      • Uses combinatorial optimization to find transcripts that explain data from multiple RNA-seq libraries.
      • Newly predicted transcripts are penalized (once).
      • Can use already known/confirmed transcripts without penalty.
      • Provides a p-value for each predicted transcript as a confidence measure for its presence.
      • Log-likelihood ratio test: $T_t = -2 \log \frac{p(D \mid M)}{p(D \mid M_t)}$
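The test statistic on this slide can be computed directly from two log-likelihoods, as in the sketch below. The slide does not say how the p-value is derived from $T_t$; the chi-square reference distribution with one degree of freedom used here is an assumption (the usual asymptotic choice for a likelihood ratio test with one extra parameter), and the function names and numbers are hypothetical.

```python
import math


def lrt_statistic(loglik_m, loglik_mt):
    """T_t = -2 * log(p(D|M) / p(D|M_t)), computed from log-likelihoods."""
    return -2.0 * (loglik_m - loglik_mt)


def chi2_sf_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    For one degree of freedom this equals erfc(sqrt(x / 2)).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


# Invented log-likelihoods for the two models being compared.
t_stat = lrt_statistic(loglik_m=-105.0, loglik_mt=-100.0)
# Assumption: T_t is compared against chi-square with 1 degree of freedom.
p_value = chi2_sf_1df(t_stat)
```

Working with log-likelihoods avoids underflow, since the raw likelihoods of coverage data over many positions are far too small to represent directly.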
  • 47. MiTie Results. [Figure: F-score on transcript level versus number of samples. Panel A: human simulated data (MITIE + MMO, MITIE, Cufflinks + Cuffmerge, Cufflinks). Panel B: D. melanogaster modENCODE data (MITIE, Cufflinks + Cuffmerge).] [Behr et al., 2013]
  • 48. Gene Finding vs. Transcript Assembly. [Figure: axis of gene expression level, from low to high. mGene.ngs = only one transcript; MiTie = multiple transcripts. Confidence for alternative transcripts: from low to high.]
  • 49. Conclusions
      • Genome annotation pipeline:
          • Transcript Skimmer identifies highly expressed genes for training.
          • mGene.ngs predicts coding and non-coding transcripts.
          • MiTie predicts alternative transcripts for highly expressed genes.
      • The genome annotation pipeline requires only a genome sequence and RNA-seq alignments.
      • Good for annotating new genomes or improving existing ones.
      • Sources are free: http://bioweb.me/mgene and http://bioweb.me/mitie
      • Functionality is partially available in a Galaxy instance (http://galaxy.cbio.mskcc.org).
      • Thank you!
  • 55. Just published: check out http://oqtans.org and http://galaxy.cbio.mskcc.org [Sreedharan et al., 2014]
  • 56. References I
      • Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3–10, 2003.
      • J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. RNA-seq and tiling arrays for improved gene finding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URL http://www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.
      • Jonas Behr, André Kahles, Yi Zhong, Vipin T Sreedharan, Philipp Drewe, and Gunnar Rätsch. MiTie: Simultaneous RNA-seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–38, Oct 2013. doi: 10.1093/bioinformatics/btt442.
      • RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu, G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Schölkopf, M Nordborg, G Rätsch, JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi: 10.1126/science.1138632.
      • G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda, B. Schölkopf, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.
      • G Rätsch and S Sonnenburg. Large scale hidden semi-Markov SVMs. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS’06), volume 19, pages 1161–1168, Cambridge, MA, 2007. MIT Press. URL http://www.fml.tuebingen.mpg.de/raetsch/projects/HSMSVM.
      • G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.
      • Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Krüger, Sören Sonnenburg, and Gunnar Rätsch. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009. URL http://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html. Advance access June 29, 2009.
      • S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, 2002.
      • Sören Sonnenburg, Alexander Zien, and Gunnar Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
  • 57. References II
      • VT Sreedharan, SJ Schultheiss, G Jean, A Kahles, R Bohnert, P Drewe, P Mudrakarta, N Görnitz, G Zeller, and Gunnar Rätsch. Oqtans: The RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis. Bioinformatics, 2014. Bioinformatics Advance Access published January 11, 2014.
      • G Zeller, RM Clark, K Schneeberger, A Bohlen, D Weigel, and G Rätsch. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.
      • A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799–807, September 2000.