Solanum lycopersicum Heinz 1706
genome assembly and annotation
SL4.0 and ITAG4.0
Sol Genomics Network
https://solgenomics.net/
SL4.0 assembly
● 80X PacBio coverage with RSII and Sequel (13 kb read N50)
● Canu assembly (N50 5.5 Mb)
● Hi-C scaffolding (12 chromosomes and unplaced contigs)
● Corrected with Illumina DNAseq (60X coverage)
● Filtered for mitochondrial and chloroplast contigs
● Validated with Bionano optical maps and 10X linked reads
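The read N50 and contig N50 figures quoted above both follow the usual definition: the length L such that sequences of length L or longer contain at least half of all bases. A minimal sketch of that calculation follows; the contig lengths are toy values chosen for illustration, not data from the SL4.0 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths, for illustration only.
toy_contigs = [8, 6, 5, 3, 2, 1]
print(n50(toy_contigs))  # 6
```

Here the total is 25 bases, and the two longest contigs (8 + 6 = 14) are the first to pass the halfway mark of 12.5, so the N50 is 6.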
Comparison with the previous assemblies
Genome assembly version                             SL4.0        SL3.0        SL2.5
Assembly size (bp)                            782,520,133  828,076,956  823,944,041
Non-N bases (bp)                              782,475,302  746,357,470  737,636,348
N’s (bp)                                           44,831   81,719,486   86,307,693
Chr 00 / unplaced contig size (bp)              9,643,350   20,852,292   21,805,821
Number of Chr 00 contigs                              152        3,141        4,410
Repeat content (RepeatModeler/RepeatMasker)        64.19%       56.39%       56.34%
Repeat content (REPET)                             71.77%       61.55%       60.94%
Assembly completeness (k-mer based estimate)       99.24%       98.96%       98.83%
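The size rows of this comparison are internally consistent: for each version, non-N bases plus N bases equals the assembly size. The short sketch below checks that and derives the share of each assembly made up of N's, a figure not given on the slide. The dictionary layout and variable names are illustrative assumptions; the numbers are copied from the comparison above.

```python
# Values copied from the assembly comparison; the structure is illustrative.
assemblies = {
    "SL4.0": {"size": 782_520_133, "non_n": 782_475_302, "n": 44_831},
    "SL3.0": {"size": 828_076_956, "non_n": 746_357_470, "n": 81_719_486},
    "SL2.5": {"size": 823_944_041, "non_n": 737_636_348, "n": 86_307_693},
}

gap_pct = {}
for name, a in assemblies.items():
    # Non-N bases and N bases should add up to the total assembly size.
    assert a["non_n"] + a["n"] == a["size"], name
    gap_pct[name] = 100 * a["n"] / a["size"]

for name, pct in gap_pct.items():
    print(f"{name}: {pct:.4f}% of the assembly is N")
```

By this calculation N's account for under 0.01% of SL4.0, compared with roughly 10% of SL3.0 and SL2.5.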
SL3.0 vs SL4.0
Genome assembly co-linearity
Input data for genome annotation
- Full-length cDNA sequenced using PacBio IsoSeq (Breaker and Mature
green fruit stages)
- RNAseq Illumina data from >1,300 libraries with >14 billion reads
- Disease resistance data (Martin and Jones labs)
- 3’ and 5’ UTR enriched data (Giovannoni, Aharoni and Sinha labs)
- Public data from NCBI SRA
- NCBI EST sequences (~300 K)
- Full-length cDNA sequences (~13 K) from Micro-Tom (Aoki et al., 2010)
Annotation of protein-coding gene models
                                   ITAG4.0   ITAG2.4
Number of protein-coding genes      34,075    34,725
Average transcript length            1,303     1,209
Average number of exons per gene      4.74      4.61
Fraction of genes with 5' UTR         0.49      0.34
Fraction of genes with 3' UTR         0.58      0.41
Long non-coding RNAs in ITAG4.0: 5,874, with 6,694 alternatively spliced isoforms
Annotation Edit Distance (AED)
Annotation Edit Distance (AED) provides a means to evaluate the quality of annotations given the evidence set.
The AED cumulative plot shows improvements in ITAG4.0 compared to ITAG2.4.
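For reference, AED is commonly defined (as in the MAKER annotation pipeline) as 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the annotation and SP is the fraction of annotation bases supported by the evidence. A score of 0 means perfect agreement and 1 means no support. The sketch below applies that formula to a single gene model; the function names and exon coordinates are hypothetical, and collapsing all evidence into one set of intervals is a simplifying assumption, not necessarily how the ITAG4.0 scores were computed.

```python
def covered_bases(exons):
    """Expand (start, end) exon intervals (1-based, inclusive) into a set of positions."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def aed(annotation_exons, evidence_exons):
    """Annotation Edit Distance: 0.0 = full agreement, 1.0 = no overlap."""
    annotated = covered_bases(annotation_exons)
    evidence = covered_bases(evidence_exons)
    if not annotated or not evidence:
        return 1.0
    shared = len(annotated & evidence)
    sensitivity = shared / len(evidence)   # evidence bases recovered by the annotation
    specificity = shared / len(annotated)  # annotation bases backed by evidence
    return 1 - (sensitivity + specificity) / 2


# Hypothetical gene model and evidence alignment, for illustration only.
model = [(1, 100), (201, 300)]
support = [(1, 100), (201, 250)]
print(aed(model, support))  # 0.125
```

In this toy case the evidence is fully covered (SN = 1.0) but only 150 of the 200 annotated bases are supported (SP = 0.75), giving an AED of 0.125. A cumulative AED plot shows the fraction of gene models at or below each AED value, so a curve that rises faster indicates better-supported annotations.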
Novel protein-coding genes in ITAG4.0
Novel genes in ITAG4.0 are enriched in stress response genes.
GO terms enriched in novel genes are shown by fold enrichment and by the minus log10 of their corresponding P-values.
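The two quantities plotted for each GO term can be illustrated with a small calculation. The slide does not say which statistical test was used, so the one-sided hypergeometric test below is an assumption, as are the function name and the counts, which are invented for illustration.

```python
from math import comb, log10


def go_enrichment(k, n, K, N):
    """Fold enrichment and -log10(P) for one GO term.

    k: novel genes with the term     n: all novel genes
    K: background genes with the term     N: all background genes
    """
    # Ratio of the term's frequency among novel genes to its background frequency.
    fold = (k * N) / (n * K)
    # Hypergeometric upper tail: probability of seeing k or more annotated genes by chance.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    p_value = tail / comb(N, n)
    return fold, -log10(p_value)


# Hypothetical counts, not taken from the ITAG4.0 analysis.
fold, minus_log10_p = go_enrichment(k=12, n=100, K=40, N=1000)
print(fold, minus_log10_p)
```

With these counts the term occurs three times as often among the novel genes as in the background, so the fold enrichment is 3.0. Larger values of minus log10(P) correspond to smaller P-values, that is, stronger evidence of enrichment.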
Thank you!
Submit your annotation corrections using the Tomato Apollo annotation editor. Contact SGN for an account:
https://solgenomics.net/contact/form
