Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)


Presented at Solanaceae workshop at PAG 2017


  1. Improvements in the Tomato Reference Genome (SL3.0) and Annotation (ITAG3.0)
     Prashant S Hosmani, Surya Saha, Mirella Flores, Stephane Rombauts, Florian Maumus, Henri van de Geest, Gabino Sanchez-Perez and Lukas Mueller
     • Boyce Thompson Institute, Ithaca, NY
     • VIB Department of Plant Systems Biology, Ghent University, Gent, Belgium
     • URGI, INRA, Université Paris-Saclay, Versailles, France
     • Wageningen Plant Research, Wageningen University, Netherlands
     psh65@cornell.edu
  2. Acknowledgements
     • Gabino Sanchez
     • Henri van de Geest
     • SGN Community (You!)
     • RNAseq data contributors
     • Stephane Rombauts
     • Florian Maumus
  3. SL3.0: Solanum lycopersicum Heinz 1706
  4. BAC Integration Workflow
     • Automatic integration of BACs
     • Manual validation
     • NCBI validation
     https://github.com/solgenomics/Bio-GenomeUpdate
     • BAC assemblies
     • Align to SL2.50 (500bp BAC ends, 100% identity)
     • Place BACs
     1,069 full-length phase htgs3 BACs integrated and ~11Mb of contig gaps removed
  5. BioNano Workflow
     • Assemble molecules into CMaps
     • Hybrid assembly with NGS scaffolds
     • Manual validation
     Hybrid assembly statistics
     • Scaffolds: 57
     • Total Genome Map Length: 779.789 Mb
     • Avg. Genome Map Length: 13.681 Mb
     • Genome Map N50: 25.384 Mb
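The hybrid assembly statistics above follow standard definitions. A minimal sketch, assuming N50 means the length at which the cumulative sum of maps, sorted longest first, reaches half the total; the five example lengths are hypothetical and are not the SL3.0 genome maps. The average can be checked directly from the two figures on the slide.

```python
def n50(lengths):
    """Return the length L such that pieces of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")

# Hypothetical genome map lengths in Mb (illustration only, not SL3.0 data)
example_maps = [30.0, 25.0, 20.0, 15.0, 10.0]
print(n50(example_maps))  # 25.0

# Average map length from the slide: total length / number of scaffolds
avg_map_length = 779.789 / 57
print(round(avg_map_length, 3))  # 13.681
```

The second print reproduces the reported average genome map length, which suggests the slide's figures are internally consistent.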
  6. Chr00 Integration
     Figure labels: Chr00, Chr02, Cmap 84
     • Chr00 contig NW_004194391.1 (203,142bp) inserted in chr09 150kb scaffold gap
     • Two inversions on chromosome 12
     • 19 gaps resized
     • Chr00 contig NW_004194387.1 (561,203bp) integrated in 1.4Mb scaffold gap
  7. ITAG3.0 Annotation
  8. Structural annotation pipeline
     • Repeat masking genome
     • Evidence: RNA and protein
     • ITAG 2.40 gene models
     Post-processing
     • Genes with functional domain support
     • Assign Solyc-ID to novel genes
  9. Repeat identification and masking of the genome
     • Generated custom repeat library (RepeatModeler)
     • Excluded repeats with similarity to known proteins in SwissProt (ProtExcluder)
     • Masked 56.39% of the genome (RepeatMasker)
  10. Repeat identification and classification
     Extensive identification and classification of repeats using REPET, which masks 61% of the SL3.0 reference genome. (Florian Maumus)
  11. ITAG 2.40 processing
     • ITAG2.40 protein-coding genes: 34,725
     • Webapollo curated genes
     • Removed contamination (56)
     • Removed transposon (2,244)
     Genes remaining: 32,425
     • ITAG2.40 mapped with GMAP
     • Mapped to SL3.0 repeat masked genome: 31,309
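The counts on this slide can be checked with simple arithmetic. The sketch below assumes the 32,425 figure is what remains after removing contamination and transposon genes from the 34,725 ITAG2.40 models, and that 31,309 is the subset that GMAP placed on the repeat-masked SL3.0 genome; the variable names are illustrative.

```python
itag240_genes = 34_725   # ITAG2.40 protein-coding genes
contamination = 56       # removed as contamination
transposons = 2_244      # removed as transposon genes

retained = itag240_genes - contamination - transposons
print(retained)  # 32425

mapped = 31_309          # mapped to the repeat-masked SL3.0 genome
print(f"{mapped / retained:.1%} of retained models mapped")
```

The subtraction reproduces the 32,425 shown on the slide, which supports that reading of the numbers.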
  12. Expression evidence for annotation
     Expression data evidence
     • 8 billion RNAseq reads
     • Tissue and treatment specific RNAseq
     • 5’ and 3’ UTR enriched RNAseq
     • RENseq for NBS-LRR genes
     • Pacbio Iso-seq data
     • SwissProt plant proteins
     Reads were mapped onto SL3.0 and the transcriptome was assembled.
     Mapping rate ~85%
  13. RNAseq data sources
     • Jim Giovannoni (BTI/USDA)
     • Jocelyn Rose (Cornell)
     • Greg Martin (BTI)
     • Zhangjun Fei (BTI/USDA)
     • Jonathan Jones (The Sainsbury Laboratory)
     • Asaph Aharoni (Weizmann Institute of Science)
     • Neelima Sinha (University of California, Davis)
  14. MAKER pipeline
     Ab-initio gene prediction methods
     • Augustus (Training using BRAKER1)
     • SNAP (MAKER based training)
     • GeneMark (with high quality genes)
     • Eugene (Stephane Rombauts)
     Updating legacy annotation (ITAG2.40)
     Post-processing
     • Added genes only with functional domain support (Pfam): ~800 genes
     • Removed genes with 70% overlap with repeats (674 genes)
     • Assigned Solyc IDs to novel genes following the ITAG convention: a novel gene receives a Solyc ID that falls between the existing Solyc IDs.
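The repeat-overlap filter in the post-processing step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: coordinates are half-open intervals, overlap is measured against the whole gene span (the slide does not say whether exons or the full locus were used), and the cutoff is read as "at least 70%". The function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def repeat_fraction(gene, repeats):
    """Fraction of the gene span covered by repeats (each base counted once)."""
    g_start, g_end = gene
    covered = 0
    for r_start, r_end in merge_intervals(repeats):
        covered += max(0, min(g_end, r_end) - max(g_start, r_start))
    return covered / (g_end - g_start)

# Hypothetical gene and repeat coordinates
gene = (1000, 2000)
repeats = [(900, 1500), (1400, 1800)]
fraction = repeat_fraction(gene, repeats)
print(fraction, fraction >= 0.70)  # 0.8 True
```

Merging the repeat intervals first avoids counting bases twice when repeat annotations overlap each other.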
  15. Improvements in ITAG 3.0 compared with ITAG 2.40

                         ITAG 2.40    ITAG 3.0
     # of genes          34,725       34,769
     Avg. gene length    1,209 bp     1,529 bp
     Exons per gene      4.61         5.10
     5’ UTR per gene     0.39         0.63
     3’ UTR per gene     0.44         0.62

     Novel genes in ITAG3.0: 5,822
  16. Gene structure improvement example
     [Figure: two panels, "Correct fusion example" and "UTR example", each showing ITAG3.0 and ITAG2.40 gene models with an RNAseq XY plot]
  17. Quality check - Annotation Edit Distance (AED)
     • AED = 0: complete support
     • AED = 1: lack of support
     [Figure: plot with axis labeled AED]
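The slide gives only the two endpoints of the AED scale. As commonly defined for MAKER annotations, AED is one minus the average of sensitivity and specificity of an annotation relative to its aligned evidence. A minimal sketch at the nucleotide level, assuming that definition; the `aed` function and the coordinate sets are illustrative and are not taken from the ITAG pipeline.

```python
def aed(annotation_bases, evidence_bases):
    """AED = 1 - (sensitivity + specificity) / 2, computed over base positions."""
    annotation = set(annotation_bases)
    evidence = set(evidence_bases)
    if not annotation or not evidence:
        return 1.0  # no evidence overlap is possible, so no support
    shared = len(annotation & evidence)
    sensitivity = shared / len(evidence)    # share of evidence covered by the annotation
    specificity = shared / len(annotation)  # share of the annotation backed by evidence
    return 1 - (sensitivity + specificity) / 2

print(aed(range(0, 100), range(0, 100)))    # 0.0, complete support
print(aed(range(0, 100), range(200, 300)))  # 1.0, lack of support
print(aed(range(0, 100), range(50, 150)))   # 0.5, half of each is shared
```

Lower values therefore indicate gene models that agree more closely with the RNAseq and protein evidence.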
  18. Functional annotation
     Automated Assignment of Human Readable Descriptions (AHRD)
     • Swissprot plant protein database
     • TrEMBL plant protein database
     • Araport 11 (Arabidopsis latest annotation)
     • User curated locus information from solgenomics.net (2000+)
     Unknown proteins
     In ITAG 3.0, 409 have a functional description of “Unknown proteins” compared to 7,689 in ITAG2.40
  19. Functional annotation
     Automated Assignment of Human Readable Descriptions (AHRD)
     AHRD-Version 3.3.2

     Examples, with quality score in parentheses:
     • Solyc08g081780.1.1 Dirigent protein (***)
     • Solyc01g008960.2.1 Argonaute family protein (***)
     • Solyc01g013880.1.1 Leucine-rich repeat receptor-like protein kinase family protein (*-*)

     Position  Criteria
     1         Bit score of the blast result is >50 and e-value is <e-10
     2         Alignment of the blast result is >60%
     3         Human Readable Description score is >0.5

     “AHRD’s quality-code consists of a three character string, where each character is either ‘*’ if the respective criteria is met or ‘-’ otherwise.”
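The three-character quality code can be illustrated with a short sketch built only from the criteria table on this slide. The function name, argument names, and input values are hypothetical, and this is not AHRD's own implementation; the e-value threshold "<e-10" is read as 1e-10.

```python
def quality_code(bit_score, e_value, alignment_pct, description_score):
    """Build a quality string: '*' where a criterion is met, '-' otherwise."""
    checks = [
        bit_score > 50 and e_value < 1e-10,  # position 1: blast bit score and e-value
        alignment_pct > 60,                  # position 2: blast alignment
        description_score > 0.5,             # position 3: Human Readable Description score
    ]
    return "".join("*" if ok else "-" for ok in checks)

# Hypothetical hits, not values from ITAG3.0
print(quality_code(120, 1e-30, 85, 0.9))  # ***
print(quality_code(120, 1e-30, 40, 0.9))  # *-*
```

Under this reading, a code of (*-*) such as the one shown for Solyc01g013880.1.1 would mean the blast hit passed the score and description criteria but not the alignment criterion.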
  20. Novel genes in ITAG3.0
     5,822 novel genes in ITAG 3.0
  21. Future work
     Genome
     • Improving genome assembly by sequencing with Pacbio technology
     Annotation
     • tRNA, non-coding RNA annotation
     • Multiple isoforms
     • Co-expression network based functional annotation
  22. Workshop: SGN and RTB Databases
     Tuesday, Jan 17 10:30 AM
     Posters
     • Surya Saha: Improved Tomato Genome Reference (SL3.0) using Full-Length BACs, BioNano Optical Maps and SGN Community Resources (P0798)
     • Prashant Hosmani: ITAG3.0 Annotation for the New Tomato Reference Genome SL3.0 (P0797)
  23. Thank you!! Questions??
  24. Data available to download from FTP
     • ITAG 3.0
     • GFF, proteins, transcripts, CDS
     • List of fused genes
     SGN Workshop, SOL 2016
  25. Gap Reduction
     [Figure: chart over categories 1 to 12 with two value axes (0 to 500 and 0.00% to 70.00%); series labels: BACs, Reduction in contig gaps, BACs Integrated]
  26. Repeat classification

     LTR retrotransposon        Copia              64840935
                                Gypsy              260719161
                                TRIM/LARD          671571
     Non-LTR retrotransposon    LINE               9871924
     Putative_retrotransposon   Putative_RT        528982
     DNA                        DNA                20712725
     Helitron                   Helitron           1210271
     TIR                        TIR                12144035
     Confused                   Confused           48373586
     Unclassified               Unclassified       70850157
     Hostgene                   Endogenous virus   5839457
     Tandem repeats             Hostgene           5044454
                                Tandem repeats     8901715
                                Ns
     SUM repeats                                   509708973
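As a consistency check on the repeat classification, the thirteen category values can be summed and compared with the "SUM repeats" figure. In this sketch the last three values are paired with the label printed immediately before each number; that pairing is an assumption, because the class and type columns are ambiguous in the extracted slide text. The slide does not state the unit of the values.

```python
repeat_values = {
    "Copia": 64_840_935,
    "Gypsy": 260_719_161,
    "TRIM/LARD": 671_571,
    "LINE": 9_871_924,
    "Putative_RT": 528_982,
    "DNA": 20_712_725,
    "Helitron": 1_210_271,
    "TIR": 12_144_035,
    "Confused": 48_373_586,
    "Unclassified": 70_850_157,
    "Endogenous virus": 5_839_457,  # label pairing assumed
    "Hostgene": 5_044_454,          # label pairing assumed
    "Tandem repeats": 8_901_715,    # label pairing assumed
}
reported_sum = 509_708_973  # "SUM repeats" on the slide

total = sum(repeat_values.values())
print(total == reported_sum)  # True

ltr = sum(repeat_values[k] for k in ("Copia", "Gypsy", "TRIM/LARD"))
print(f"LTR retrotransposons: {ltr / total:.1%} of the summed repeats")
```

The category values add up exactly to the reported total, so no row appears to have been lost in extraction, and the sum does not depend on how the last three labels are paired.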
  27. Mapping rates for different RNAseq data

     RNAseq data    # of reads in Millions    REPET light    RepeatModeler light
     AC_Jim         637                       86.87%         88.03%
     epigenome      82                        60.77%         64.35%
     UTR seq        87                        85.88%         86.57%
     TEA part A     4,295                     84.41%         84.39%
     TEA part B     2,449                     84.40%         84.71%
     RENseq         15                        32.91%         39.83%
     Yang           331                       79.94%         80.28%
     Total reads    7,930
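An overall mapping rate for each masking strategy can be estimated as a read-weighted average of the per-dataset rates in this table. This is a sketch of that arithmetic, not a figure reported by the authors; the `weighted_rate` helper is illustrative. Note that the listed read counts sum to 7,896 million, slightly below the 7,930 million total on the slide, and the slide does not explain the difference.

```python
# (reads in millions, REPET light %, RepeatModeler light %), copied from the slide
datasets = {
    "AC_Jim": (637, 86.87, 88.03),
    "epigenome": (82, 60.77, 64.35),
    "UTR seq": (87, 85.88, 86.57),
    "TEA part A": (4295, 84.41, 84.39),
    "TEA part B": (2449, 84.40, 84.71),
    "RENseq": (15, 32.91, 39.83),
    "Yang": (331, 79.94, 80.28),
}

def weighted_rate(rows, column):
    """Average the mapping rate in `column`, weighting each dataset by its read count."""
    rows = list(rows)
    total_reads = sum(row[0] for row in rows)
    return sum(row[0] * row[column] for row in rows) / total_reads

listed_reads = sum(row[0] for row in datasets.values())
repet_rate = weighted_rate(datasets.values(), 1)
repeatmodeler_rate = weighted_rate(datasets.values(), 2)
print(listed_reads)
print(f"REPET light: {repet_rate:.2f}%  RepeatModeler light: {repeatmodeler_rate:.2f}%")
```

Both weighted averages come out close to the ~85% mapping rate quoted on the expression evidence slide, since the two large TEA datasets dominate the read count.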
