1. Transcript reconstruction algorithms available in the
Trinity RNA-Seq package
Daniel Standage
Brendel Group, Indiana University
4 Mar 2014
Daniel Standage (Brendel Group @ IU)
Trinity Assembly
4 Mar 2014
1 / 24
5. Introduction
Assembly with Trinity
Transcriptome assembly
In the absence of full-length transcript sequences,
reconstruct full-length sequences from fragments.
7. Introduction
Assembly with Trinity
Trinity RNA-Seq
Now with 3 transcript reconstruction modes!
Butterfly (default)
--PasaFly
--CuffFly
8. Introduction
Assembly with Trinity
Review outline
Trinity algorithm
PASA algorithm
Cufflinks algorithm
Discussion
9. Trinity
Inchworm
Step 1: Inchworm
Assemble unique contigs representing transcript
subsequences.
Often produces dominant isoform in full length, and then just unique
portions of alternative isoforms.
10. Trinity
Inchworm
Inchworm procedure
  1. Create a dictionary of k-mers (k = 25)
  2. Remove k-mers containing probable errors (based on coverage?)
  3. Select the highest-occurring k-mer
  4. Build a contig by extending the k-mer (find the highest-occurring k-mer with a k − 1 bp overlap, extend by 1 bp), and remove that k-mer from the dictionary
  5. Repeat the previous step until the contig cannot be extended further, then report the contig
  6. Repeat steps 3-5 until all k-mers are exhausted
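The greedy procedure above can be sketched in a few lines of Python. This is a minimal illustration, not Inchworm's actual implementation: the function names, the `min_count` error filter, the lexicographic tie-breaking, and the choice to extend in both directions are all assumptions made for the example, and reverse complements are ignored.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Step 1: count every k-mer in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend(contig, pool, k, rightwards):
    """Steps 4-5: grow the contig 1 bp at a time using the most abundant overlapping k-mer."""
    while True:
        anchor = contig[-(k - 1):] if rightwards else contig[:k - 1]
        candidates = [anchor + b if rightwards else b + anchor for b in "ACGT"]
        candidates = [km for km in candidates if km in pool]
        if not candidates:
            return contig
        best = max(candidates, key=lambda km: (pool[km], km))
        del pool[best]  # each k-mer is used at most once
        contig = contig + best[-1] if rightwards else best[0] + contig


def greedy_contigs(reads, k=25, min_count=1):
    """Hypothetical greedy assembler following steps 1-6 (assumes k >= 2)."""
    counts = kmer_counts(reads, k)
    # Step 2 (assumed rule): treat k-mers seen fewer than min_count times as errors.
    pool = {km: c for km, c in counts.items() if c >= min_count}
    contigs = []
    while pool:  # Step 6: repeat until the dictionary is exhausted
        # Step 3: seed with the most abundant remaining k-mer.
        seed = max(pool, key=lambda km: (pool[km], km))
        del pool[seed]
        contig = extend(seed, pool, k, rightwards=True)
        contig = extend(contig, pool, k, rightwards=False)
        contigs.append(contig)
    return contigs
```

For example, `greedy_contigs(["ATGGCGTGCA"], k=4)` rebuilds the single read as one contig, because every 3-mer in it is unique and each extension step has exactly one candidate.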
11. Trinity
Chrysalis
Step 2: Chrysalis
Group Inchworm contigs, construct de Bruijn
graph for each cluster.
Each connected component of the graph corresponds to one or more genes
with shared sequence.
12. Trinity
Chrysalis
Chrysalis procedure
  1. Group contigs if they share a perfect overlap of k − 1 bp (with reads supporting the overlap)
  2. Build a de Bruijn graph with k − 1 word size for nodes, k for edges; edges weighted by supporting reads
  3. Assign each read to the component with which it shares the largest number of k-mers
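Steps 1 and 3 can be illustrated with a small sketch. It is a simplification under stated assumptions: contigs are grouped whenever they share any (k − 1)-mer, the read-support check on overlaps is omitted, reverse complements are ignored, and the names `cluster_contigs` and `assign_read` are hypothetical, not part of Chrysalis.

```python
from collections import defaultdict


def words(seq, size):
    """Set of all substrings of the given size."""
    return {seq[i:i + size] for i in range(len(seq) - size + 1)}


def cluster_contigs(contigs, k):
    """Step 1 (simplified): union contigs that share a (k - 1)-mer."""
    parent = list(range(len(contigs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    first_owner = {}  # (k - 1)-mer -> index of first contig containing it
    for idx, contig in enumerate(contigs):
        for w in words(contig, k - 1):
            if w in first_owner:
                parent[find(idx)] = find(first_owner[w])
            else:
                first_owner[w] = idx

    groups = defaultdict(list)
    for idx in range(len(contigs)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


def assign_read(read, contigs, clusters, k):
    """Step 3: pick the cluster sharing the most k-mers with the read (None if no k-mer is shared)."""
    read_kmers = words(read, k)
    best, best_shared = None, 0
    for cid, members in enumerate(clusters):
        cluster_kmers = set().union(*(words(contigs[i], k) for i in members))
        shared = len(read_kmers & cluster_kmers)
        if shared > best_shared:
            best, best_shared = cid, shared
    return best
```

With `contigs = ["AAACCCGGG", "CCGGGTTTA", "TATATATAT"]` and `k = 5`, the first two contigs share the 4-mer `CCGG` and land in one cluster, while the third stays alone.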
13. Trinity
Butterfly
Step 3: Butterfly
Traverse read-supported paths in each subgraph,
enumerate plausible sequences.
14. Trinity
Butterfly
Butterfly procedure
  1. Graph simplification: merge consecutive nodes in linear paths, pruning minor deviations
  2. Plausible path scoring: identify paths in graph with read support
     - Initialize DP table with source nodes (no incoming edges)
     - Fill in table by extending path prefixes by one node
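The prefix-extension idea in step 2 can be sketched as follows. This is a rough illustration, not Butterfly's scoring: a per-edge read-count threshold (`min_support`, an assumed parameter) stands in for its read-path compatibility checks, the graph is assumed to be acyclic, and `enumerate_paths` is a hypothetical name.

```python
def enumerate_paths(edges, min_support=1):
    """Enumerate paths from source nodes, extending each prefix one node at a time.

    edges maps (u, v) -> number of reads supporting that edge.
    Assumes the graph is a DAG; edges below min_support are not followed.
    """
    nodes = {n for edge in edges for n in edge}
    successors = {n: [] for n in nodes}
    has_incoming = set()
    for (u, v), support in edges.items():
        has_incoming.add(v)
        if support >= min_support:
            successors[u].append(v)

    # Initialize the table with source nodes (no incoming edges).
    table = [[n] for n in sorted(nodes - has_incoming)]
    complete = []
    while table:
        next_table = []
        for prefix in table:
            extensions = sorted(successors[prefix[-1]])
            if not extensions:
                complete.append(prefix)  # nothing left to extend: report the path
            for v in extensions:
                next_table.append(prefix + [v])  # extend the prefix by one node
        table = next_table
    return sorted(complete)
```

On a graph with a well-supported branch `B -> C -> E` and a weakly supported branch `B -> D -> E`, raising `min_support` drops the weak path and leaves a single reported sequence.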
15. PASA
PASA
Program to Assemble Spliced Alignments
designed for ESTs and FL-cDNAs (pre-NGS era)
works on sequence alignments
computes consensus spliced alignments
16. PASA
PASA algorithm
Input: a set of spliced cDNA alignments A
Output: for each alignment a ∈ A, the largest assembly containing a
  1. Sort alignments
  2. Test overlapping alignments for compatibility
  3. Build DP table, backtrace to find maximal assembly A∗
  4. If ∃a ∉ A∗, build reciprocal DP table, trace to enumerate additional assemblies
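The compatibility test in step 2 can be illustrated with a simplified check. The sketch assumes that two spliced alignments are compatible when their genomic spans overlap and they imply the same introns within the overlapping region. Strand is ignored, alignments are represented as sorted lists of inclusive exon coordinates, and the names `introns` and `compatible` are hypothetical rather than taken from PASA.

```python
def introns(exons):
    """Introns implied by a sorted list of (start, end) exon coordinates (inclusive)."""
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])]


def compatible(a, b):
    """Assumed rule: spans overlap and both alignments show the same introns in the overlap."""
    lo = max(a[0][0], b[0][0])    # start of the shared region
    hi = min(a[-1][1], b[-1][1])  # end of the shared region
    if lo > hi:
        return False  # spans do not overlap at all

    def in_overlap(exons):
        # Keep only introns that touch the shared region.
        return {(s, e) for s, e in introns(exons) if s <= hi and e >= lo}

    return in_overlap(a) == in_overlap(b)
```

For instance, an alignment with exons `(1, 100)` and `(201, 300)` is compatible with a single-exon alignment `(250, 400)`, but not with `(150, 400)`, which places exon sequence inside the first alignment's intron.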
17. PASA
PASA algorithm
Recurrences
$L_a = \max_b \{ C_a,\; L_b + C_{a/b} \}$
$R_a = \max_b \{ C_a,\; R_b + C_{a/b} \}$
$L_a$, $R_a$: maximum number of cDNAs in an assembly that contains alignment $a$, starting from left and right (respectively)
$C_a$: number of $a$-compatible alignments in the span of $a$
$C_{a/b}$: number of $a$-compatible alignments in the span of $a$ but not in the span of $b$
19. Cufflinks
Cufflinks
designed for short transcript reads (NGS era)
works on read alignments (mappings)
identifies fewest number of transcripts that “explain” the read
mappings
20. Cufflinks
Cufflinks algorithm
Input: overlap graph G of mapped reads
Output: a minimal path cover of G , with each path corresponding
to a single assembled transcript
  1. Alignments divided into non-overlapping loci
  2. Erroneous read alignments removed
  3. Compute transitive reduction of G
  4. Construct bipartite graph G∗ from transitive closure of G, with edges weighted by coverage to “phase” distant exons by their coverage
  5. Compute minimum-cost maximum matching in G∗, which corresponds to minimum path cover of G
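The link between matching and path cover in steps 4-5 can be sketched on a toy DAG. To keep the example short, a plain unweighted maximum bipartite matching (simple augmenting paths) stands in for the minimum-cost weighted matching, so coverage-based phasing is not modelled. Nodes are integers and `min_path_cover` is a hypothetical name.

```python
def transitive_closure(n, edges):
    """reach[u] = set of nodes reachable from u in a DAG with nodes 0..n-1."""
    reach = [set() for _ in range(n)]
    for u, v in edges:
        reach[u].add(v)
    for mid in range(n):  # Warshall-style closure
        for u in range(n):
            if mid in reach[u]:
                reach[u] |= reach[mid]
    return reach


def min_path_cover(n, edges):
    """Cover all nodes with the fewest chains of the transitive closure.

    Each matched pair (u, v) links u to v on the same chain, so the number
    of chains equals n minus the size of a maximum matching.
    """
    reach = transitive_closure(n, edges)
    match_right = {}  # right-side node -> left-side node it is matched to

    def try_augment(u, seen):
        for v in sorted(reach[u]):
            if v in seen:
                continue
            seen.add(v)
            if v not in match_right or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    for u in range(n):
        try_augment(u, set())

    successor = {u: v for v, u in match_right.items()}
    starts = [u for u in range(n) if u not in match_right]  # nodes with no predecessor
    paths = []
    for s in starts:
        path = [s]
        while path[-1] in successor:
            path.append(successor[path[-1]])
        paths.append(path)
    return paths
```

On the DAG with edges `0 -> 1 -> 2` and `0 -> 3 -> 4`, the matching has size 3, so the five nodes are covered by two chains.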
21. Discussion
Three different construction approaches
Butterfly: enumerate all plausible transcripts with minimal read
support
PASA: for each alignment, find largest assembly (transcript)
containing the alignment
Cufflinks: find the minimal assembly (or assemblies) that explain the data,
using read coverage to “phase” distant exons
22. Discussion
Next time: comparison of 8 Trinity assemblies
Four assembly settings
Butterfly
--PasaFly
--CuffFly
Butterfly, --min_kmer_cov 2
Two input data sets
Groomed data
Groomed data with digital normalization
23. Discussion
Next time: comparison of 8 Trinity assemblies
Hypotheses
(transcripts per assembly)
Butterfly > PasaFly > CuffFly
Diginorm > No diginorm