
Combining Transcriptome Assemblies - Kenlee Nakasugi



Combining transcriptome assemblies from multiple de novo assemblers to generate full length RNA silencing gene transcripts in Nicotiana benthamiana

To produce an assembly that contains (but is not limited to) full length RNA silencing gene transcripts, to facilitate more informative first pass searches, and to increase the chances of finding paralogous transcripts while limiting redundancy, we combined the sequences from multiple assemblies generated by four popular de novo transcriptome assemblers: Trans-ABySS, Trinity, SOAPdenovo-Trans and Oases. The subject organism is Nicotiana benthamiana, an allopolyploid plant.

Two methods were implemented to reduce the redundancy of combined assemblies - a clustering based approach (TGI clustering tools), and one that selects a 'best set' of mRNA sequences rather than producing longest possible transcripts (EvidentialGene pipeline). Metrics used to assess the quality of assemblies include the average length of the 1000 longest proteins, average bit-scores from blast comparisons against reference databases, and feature response curves.
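Two of the length-based metrics used here are easy to compute once the contig or protein lengths are known. The following is a minimal sketch in Python, not the authors' scripts: the function names `n50` and `mean_of_longest` are chosen for illustration, and the input is assumed to be a plain list of sequence lengths.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def mean_of_longest(lengths: Iterable[int], top: int = 1000) -> float:
    """Return the mean length of the `top` longest sequences
    (for example, the 1000 longest predicted proteins)."""
    longest = sorted(lengths, reverse=True)[:top]
    return sum(longest) / len(longest) if longest else 0.0


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6]
    print(n50(toy_lengths))
    print(mean_of_longest(toy_lengths, top=2))
```

Both statistics reward long sequences regardless of whether they are correct, which is why they are reported alongside bit-scores and feature response curves rather than on their own.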

By combining the output of different assemblers run with varying k-mer sizes and input read counts, we were able to detect all 35 query RNA silencing gene transcripts as full length from simple first pass blast searches. Previously, only 24 RNA silencing transcripts could be detected as complete using one assembler. While the TGI clustering tool could produce longer transcripts, the average bit-scores of blast searches and the feature response curves show that the EvidentialGene pipeline produced higher quality assemblies.
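The first step of combining assemblies, pooling contigs from several runs and discarding exact duplicates, can be sketched as follows. This is an illustrative simplification under stated assumptions, not the TGI or EvidentialGene implementation: the labels such as `Tr_k25` are hypothetical, sequences are assumed to be nucleotide FASTA, and only identical sequences (on either strand) are collapsed, whereas the real pipelines also handle near-identical and contained sequences.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def combine_assemblies(assemblies: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Pool contigs from several assemblies, prefix each contig name with
    its assembly label, and keep only the first copy of identical sequences."""
    seen = set()
    combined = []
    for label, lines in assemblies.items():
        for name, seq in read_fasta(lines):
            seq = seq.upper()
            # Strand-independent key: a contig and its reverse complement
            # are treated as the same sequence.
            key = min(seq, reverse_complement(seq))
            if key in seen:
                continue
            seen.add(key)
            combined.append((f"{label}_{name}", seq))
    return combined


if __name__ == "__main__":
    toy = {
        "Tr_k25": [">c1", "ACGT", "AA", ">c2", "GGGG"],
        "Oa_k25": [">x1", "CCCC", ">x2", "TTTT"],
    }
    for name, seq in combine_assemblies(toy):
        print(name, seq)
```

Prefixing each contig with its assembly label keeps names unique across assemblers and makes it possible to trace which assembler and k-mer size contributed each retained transcript.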

By using a combined assemblies approach, as recommended by the EvidentialGene pipeline, one can recover more completely assembled transcripts while limiting redundancy and maximising the quality of the assembly.

Published in: Technology


Combining transcriptome assemblies from multiple de novo assemblers to generate full length RNA silencing gene transcripts in Nicotiana benthamiana

Kenlee Nakasugi and Peter Waterhouse
School of Molecular Bioscience, University of Sydney

Nicotiana benthamiana is an allopolyploid plant (hybridisation of 2 genomes followed by whole genome duplication). This can be problematic for transcriptome assembly because duplicated gene copies yield novel but highly similar transcripts with markedly differing expression levels, which can lead to unassembled, partial or chimerically assembled transcripts. In addition, no single assembler with any single parameter set can assemble all possible transcripts.

To produce an assembly that contains (but is not limited to) full length RNA silencing gene transcripts, to facilitate more informative first pass searches, and to increase the chances of finding paralogous transcripts while limiting redundancy, we combined the sequences from multiple assemblies generated by four popular de novo transcriptome assemblers: Trans-ABySS, Trinity, SOAPdenovo-Trans and Oases. We varied kmer sizes and, to a lesser extent, input read depth to try to assemble as many variants as possible. We then applied two pipelines to reduce redundancy, the TGI clustering tool and the EvidentialGene tr2aacds pipeline, and compared the quality of assemblies before and after processing with these tools.

Workflow: processed input RNAseq reads; assembly (Ta, Tr, So, Oa); redundancy reduction (TGI clustering tool or EvidentialGene tr2aacds); assessment.

TGI clustering tool
Cluster and create assemblies (contigs) from a set of DNA sequences.
1. Group similar sequences into clusters
2. Assemble clusters using CAP3

EvidentialGene tr2aacds
“Principle is that over-assembling transcript reads with many assembly options produces a subset of accurate assemblies in a superset of crappy ones.”
tr2aacds selects the most biologically useful "best" set of mRNA, which are classified into primary and alternate transcripts.
1. Remove identical/highly similar transcripts
2. Find longest AA and remove highly similar AA
3. Self-blast to identify highly similar transcripts
4. Filter and classify 'main' and 'alternate' transcripts

Assess
- Std metrics (contig length/numbers)
- Avg. length of top 1000 longest prots
- CEGMA analysis
- Blast against reference databases
- Read mapping statistics
- Feature Response Curves
- Query alignment coverage
- Impact of sequencing depth

Details of assemblies:

Input reads         Dataset 1 (ds1)    Dataset 2 (ds2)
Paired end reads    189,333,894        228,279,832
Single end reads    48,827,381         50,249,303

No.  Assembler                A: Dataset 1 (ds1)     B: Dataset 2 (ds2)
1    Trans-ABySS (Ta)         k48-86, step size 2    k20-44, step size 4
2    Trans-ABySS (Ta)                                k48-86, step size 2
3    Trinity (Tr)             k25                    k25
4    SOAPdenovo-Trans (So)    k31                    k21-81, step size 10
5    Oases (Oa)               -                      k25-75, step size 10

Combined assemblies:
SasmM (sum of merged assemblies)*    Contains: A1, A3, A4, B1+B2, B2, B3, B4, B5
SasmK (sum of all kmer assemblies)   Contains: A1, A3, A4, B1, B2, B3, B4, B4, B5

* k-mer assemblies merged by: Ta: Ta merge utility; So: TGI clustering software; Oa: Oa merge utility

Assembly overview:
1. Assemble kmers individually
2. Merge kmer assemblies: TaM, SoM, OaM ('M' denotes merged assembly). Tr is single kmer.
3. Combine assemblies:
   - combine merged assemblies
   - combine individual kmer assemblies without any merging
4. Apply Tgi and Evi tools to individually merged and combined assemblies
5. Assess via multiple metrics

Statistics of assemblies generated from dataset 2 reads. Metrics are of merged assemblies of individual assemblers as well as combined assemblies. Each 'raw' assembly was processed by the TGI clustering tool (Tgi) or the EvidentialGene (Evi) tr2aacds pipeline. While the Tgi assemblies generated longer transcripts and proteins, average bitscores and mean contig length are higher and contig number lower in most Evi assemblies.

Columns: Contigs = number of contigs; Mean = mean contig length; Longest = longest contig; Prot1000 = average length of top 1000 longest proteins; CEGMA-C = % of complete CEGMA proteins; CEGMA-P = % of partial CEGMA proteins.

Assembly    Contigs   N50    Mean   Longest  Prot1000  CEGMA-C  CEGMA-P
TaMraw      750713    906    706    14633    1778      97.18    99.6
TaMtgi      178700    1463   925    19508    1433      99.6     100
TaMevi      128938    1231   933    14453    1405      93.55    96.37
Trraw       284583    1853   1017   16069    1678      97.58    99.19
Trtgi       247174    1809   966    16069    1552      97.58    99.19
Trevi       44726     1951   1323   16069    1325      90.32    92.74
Soraw       1248430   468    433    15107    1239      49.19    83.06
SoMtgi      243483    883    642    19151    1119      87.1     98.79
SoMevi      95212     686    618    15107    932       47.18    80.65
OaMraw      797243    2024   1407   22494    2409      99.19    100
OaMtgi      393516    2226   1605   22626    2033      99.19    100
OaMevi      72637     2131   1671   16312    1660      90.73    92.34
SasmMraw    3562114   1400   910    22494    2811      99.19    100
SasmMtgi    1066952   2079   1327   22494    2521      99.6     100
SasmMevi    216177    1806   1283   16312    1840      97.18    97.98
SasmKraw    9891564   971    681    18177    3261      99.6     100
SasmKtgi    735375    2162   1340   20330    2456      99.6     100
SasmKevi    234526    2208   1674   16038    2137      98.79    99.6

Columns: Bits-tomato = average bitscore to top 1000 longest uniprot tomato proteins; Bits-Nben = average bitscore to top 1000 longest solgenomics Nben proteins; BWA = mapping % of input reads (BWA); Chimeric = mapping % of chimeric reads mapQ>30 (BWA); Bowtie2 = mapping % of input reads (Bowtie2).

Assembly    Bits-tomato  Bits-Nben  BWA     Chimeric  Bowtie2
TaMraw      290.13       319.99     99.53   0.081     97.54
TaMtgi      294.64       344.99     99.55   0.265     97.17
TaMevi      303.54       333.77     88.54   0.237     79.93
Trraw       346.19       420.45     99.21   0.480     91.53
Trtgi       329.90       403.25     99.19   0.536     91.40
Trevi       386.05       434.89     84.16   0.381     74.01
Soraw       145.81       163.87     99.62   0.062     96.80
SoMtgi      199.58       240.74     99.64   0.570     96.38
SoMevi      202.17       224.79     81.89   0.344     71.18
OaMraw      380.15       442.55     99.63   0.015     94.73
OaMtgi      358.97       423.07     99.64   0.033     94.55
OaMevi      437.37       503.58     85.88   0.450     72.40
SasmMraw    320.95       365.16     99.77   0.002     98.90
SasmMtgi    362.51       421.91     99.77   0.013     98.83
SasmMevi    368.25       418.70     93.57   0.124     88.82
SasmKraw    263.83       298.63     99.78   0.002     98.79
SasmKtgi    374.40       449.56     99.77   0.016     98.81
SasmKevi    493.05       554.61     93.87   0.179     88.86

Feature response curves of assemblies, using the "High_spanning_PE" feature, which measures the number of PE reads where the pairs are mapped onto different contigs. The feature threshold is used to filter out contigs whose number of features falls above the threshold. That is, only contigs that contain fewer than the threshold number of features are used to calculate the coverage achieved. Except for the TransAbyss-Merged assembly, the EvidentialGene (Evi) tr2aacds pipeline appears to perform best, or at least on par, on all assemblies (raw vs tgi vs evi), as higher coverage is achieved at a lower feature threshold.

Blast results summarized by Circos showing the proportion of transcripts in each assembly that matched the Solgenomics N. benthamiana predicted protein database (green track) and the percentage of these hits that were also found by other assemblies (orange tracks). Common db hits between Trraw and, in clockwise direction: TaMraw, TaMtgi, TaMevi, Trraw, Trtgi, Trevi, SoMraw, SoMtgi, SoMevi, OaMraw, OaMtgi, OaMevi, SasmMraw, SasmMtgi, SasmMevi, SasmKraw, SasmKtgi, SasmKevi (self comparison will always be 100%).

Figure labels: Total hits against db (>80% db coverage); Total hits against db; db % actually expressed (RSEM); Total db; % of Trraw assembly that matched against ref db.
Blue bars: Blastx against all proteins in ref db. Greyscale bars: Blastx against top 1000 longest proteins in ref db.

Links showing the proportion of the Trraw sequences that are found in the combined assembly SasmM before and after the EvidentialGene pipeline.

Increased read counts prevent assembly of full length Dcl1 transcript in Trinity

A single complete transcript of Dcl1 can be generated by assembly of dataset 1 reads. Only two partial transcripts can be generated from assembly of dataset 2 reads. There is a noticeable increase in read depth between positions 2200 and 2950, and towards the end of the Dcl1 transcript, in dataset 2 compared with dataset 1, implying that increased reads from Dcl1 variants could be preventing the assembly of the full-length Dcl1 transcript. Inspection of the transcript sequences at the read depth 'cliffs' (Box B) shows the presence of Dcl1 variants, including transcripts with intron sequence (Box A), which are supported by reads spanning potential intron/exon junctions (Box C). The full-length Dcl1 transcript could not be assembled with dataset 2 reads, most likely due to increased read 'noise' from other variants.

[Figure: Dcl1 read depth and alignments, positions 1 to 5000, with Boxes A (intron sequence), B and C (reads 1 to 4). Sequences shown: ds1_comp79561_c0_seq14, ds2_comp85436_c0_seq7, scaffold13731, scaffold165219.]

Varying the read depth presented to the assembler (either by raw data, or by coverage threshold options) is important in generating full length transcripts. This also depends on the type of assembler used, as Trans-ABySS was not able to generate the full-length Dcl1 sequence with either dataset 1 or 2.

Alignment coverage of 35 RNA silencing gene transcripts. The consensus CDSs (identified previously) of these genes were screened with blastn against each assembly.
Genes such as Dcl2 and Dcl3 were always found to be assembled to 100% in all assemblies, whereas Dcl1 could only be assembled to 100% by Oases. As expected, the combined assemblies always contained the query sequences that were close to 100% assembled.

Summary:
1. 35 query RNA silencing gene sequences were detected close to 100% assembled in the combined assemblies.
2. Using multiple assemblers, varying kmer sizes, and varying input read depth yielded more full-length transcripts in general.
3. The Dcl1 transcript example illustrates the effect of input read depth on the assembly of a complete transcript.
4. The EvidentialGene tr2aacds pipeline produced the best quality assemblies overall based on several metrics, including Feature Response Curves and bitscores, despite not producing the longest transcripts based on traditional length based metrics. In addition, the number of contigs/transcripts was substantially lower than with the TGI clustering tool. The caveat is that the Tgi tool can correctly generate additional full-length transcripts post-assembly, while the Evi pipeline will only be as good as the input transcripts.
5. In terms of full length RNA silencing gene transcript assembly in N. benthamiana, assembler performance was: Oa > Ta >= Tr >> So

References:
* EvidentialGene pipeline (by Don Gilbert @ Indiana University)
* TGI Clustering tool
* Visualization tools:
  - Circos plot: Krzywinski, M. et al. Circos: an Information Aesthetic for Comparative Genomics. Genome Res (2009) 19:1639-1645
  - IGV
  - Geneious
* Feature Response Curve: Vezzi F et al. (2012) Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons. PLoS ONE 7(12): e52210
* Assemblers: Trans-ABySS, Trinity, SOAPdenovo-Trans, Velvet/Oases
* Other software: R (v3.0.1) with ggplot2 package, BWA (v0.7.5a), Bowtie2 (v2.2.1), NCBI blast (v2.2.26+), CEGMA (v2.4)
* HPC resources: Intersect Australia, 'Orange' server; UQ Research Computing Centre, 'Barrine' cluster