
Comparative genomics to the rescue: How complete is your plant genome sequence?


5th Plant Genomics & Gene Editing Congress
16-17 March 2017, Amsterdam

Published in: Science

  1. Comparative genomics to the rescue: How complete is your plant genome sequence?
     Klaas Vandepoele, Ghent University - VIB, Belgium
     5th Plant Genomics & Gene Editing Congress, 16-17 March 2017, Amsterdam
     plaza_genomics
  2. Plant genome sequencing is booming
     • New and faster sequencing technologies
     • Generating a draft genome sequence has become cheap
     • The number of published plant genomes grows exponentially
     >150 published plant genomes (Credit: Usadel lab)
  3. From read data to knowledge
     The basic genome analysis toolkit:
     • Genome assembly
     • Structural annotation shows where genes are
     • Functional annotation tells you what genes do
     • Data availability of genome sequence & gene annotation
     Facilitate biological discovery
  4. Yet another “draft” plant genome
     • What is the quality and completeness of plant genome sequences?
     The N50 denotes that 50% of the total assembly length is contained in scaffolds of length N50 or longer
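The N50 definition on slide 4 can be made concrete with a short sketch. This is only an illustration under simple assumptions: the function name `n50` and the toy scaffold lengths are hypothetical, and real assemblies would read the lengths from a FASTA index or similar.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are sorted from longest to shortest and their lengths are
    accumulated; the N50 is the length of the scaffold at which the running
    sum first reaches half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (total length 300): 80 + 70 already covers half.
scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(scaffold_lengths))  # 70
```

A high N50 only describes contiguity; it says nothing about whether the gene space is complete, which is the question the following slides address.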
  5. Quality of a genome: what to expect?
  6. Quality of a genome: what to expect?
     Transcript mapping
  7. Transcript mapping: tools & settings
     • The transcript mapping score is stable (standard deviation < 1%) in bin sizes of at least 3,000 ESTs
     • Correctly estimating the assembled gene space is challenging. The estimate is influenced by
         • mapping tools
         • coverage cutoffs
     Intron-aware transcript mapping using GMAP
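The bin-size stability check on slide 7 can be sketched as repeated random subsampling of a library, followed by the standard deviation of the per-bin mapping score. This is a minimal sketch, not the authors' procedure: the function names, the simulated library (about 95% mapped) and the number of bins are all assumptions made for illustration.

```python
import random
import statistics


def mapping_score(mapped_flags):
    """Percentage of transcripts flagged as mapped to the assembly."""
    return 100.0 * sum(mapped_flags) / len(mapped_flags)


def bin_score_stdev(mapped_flags, bin_size, n_bins=100, seed=0):
    """Standard deviation of the mapping score over random bins of one size."""
    rng = random.Random(seed)
    scores = [
        mapping_score(rng.sample(mapped_flags, bin_size)) for _ in range(n_bins)
    ]
    return statistics.stdev(scores)


# Hypothetical library of 20,000 ESTs; True means the EST mapped.
_rng = random.Random(42)
library = [_rng.random() < 0.95 for _ in range(20000)]

for size in (100, 1000, 3000):
    print(size, round(bin_score_stdev(library, size), 2))
```

In a real analysis the `True`/`False` flags would come from an aligner's output after applying the chosen coverage and identity cutoffs, which is exactly where the tool and cutoff choices mentioned on the slide enter.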
  8. Transcript mapping: library size
     • For A. thaliana libraries containing more than 10,000 ESTs, the EST mapping scores converge to the same value as for subsampling bins of >10,000 ESTs.
     • RNA-Seq de novo assembled transcripts can lead to
         • over-estimation of the expected number of genes (allelic transcripts, splice variants and fragmented transcripts)
         • under-estimation due to the failure to reconstruct low-abundance transcripts
  9. Estimating gene space completeness along an evolutionary scale
     Expected gene spaces range from evolutionarily conserved to species-specific, influenced by within-species diversity and between-species diversity
     Figure labels:
     • CEGMA: 248, single copy
     • BUSCO: 952, single copy
     • PLAZA CoreGF: 7k gene families
     • Transcript mapping
     • Species tree of life
     • PLAZA CoreGF: 3k gene families
  10. Biases in the expected “conserved” gene space
  11. A diverse set of genome quality metrics
  12. Evaluation
  13. Evaluation
      • Arabidopsis and Oryza have consistently high completeness scores
      • Over-estimation of completeness by CEGMA
      • Lolium: discrepancy between genome and gene set completeness
  14. Improving Lolium gene annotation
      Evidence sources (# loci):
      • 2 transcriptomes, aligned with GenomeThreader
          • de novo assembly: 300k
          • Orthology-guided assembly: 80k
      • 4 proteomes, aligned with GenomeThreader
          • Brachypodium distachyon: 16k
          • Oryza sativa: 11k
          • Sorghum bicolor: 11k
          • Zea mays: 10k
      • 2 annotation sets
          • Byrne et al. (2015): 28k
          • ab initio predictions: 41k
      EVM consensus: 39,967 loci
      Haas et al. (2008), Gremme et al. (2005), Ruttink et al. (2013)
  15. Updated completeness scores Lolium
      Figure: completeness score (%), Byrne et al. (2015) vs. EVM consensus
      Figure labels: CEGMA (248, single copy); BUSCO (952, single copy); PLAZA CoreGF (7k gene families); Transcript mapping; Species tree of life; PLAZA CoreGF (3k gene families)
      >900 new coreGF loci found in the genome!
  16. Evaluation
      • Arabidopsis and Oryza have consistently high completeness scores
      • Over-estimation of completeness by CEGMA
      • Lolium: discrepancy between genome and gene set completeness
      • Cicer: EST mapping score much lower than BUSCO gene set or coreGF score
          • More than half of the unmapped sequences are of non-plant origin (mostly from Fusarium oxysporum)
          • Proper taxonomic binning of expected transcripts is essential!
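The Cicer observation on slide 16 suggests filtering transcripts by taxonomic origin before computing a mapping score. The sketch below shows the effect on invented toy data; the record fields (`mapped`, `taxon`), the label `Viridiplantae` as the plant bin, and the counts are all assumptions for illustration, not values from the study.

```python
def mapping_score(records):
    """Percentage of transcript records flagged as mapped."""
    return 100.0 * sum(r["mapped"] for r in records) / len(records)


def keep_plant(records, plant_label="Viridiplantae"):
    """Keep only transcripts whose assigned taxonomic bin is the plant bin."""
    return [r for r in records if r["taxon"] == plant_label]


# Toy library: 70 mapped plant ESTs, 10 unmapped plant ESTs and
# 20 unmapped contaminant ESTs assigned to fungi.
records = (
    [{"mapped": True, "taxon": "Viridiplantae"} for _ in range(70)]
    + [{"mapped": False, "taxon": "Viridiplantae"} for _ in range(10)]
    + [{"mapped": False, "taxon": "Fungi"} for _ in range(20)]
)

print(mapping_score(records))              # 70.0
print(mapping_score(keep_plant(records)))  # 87.5
```

How the taxonomic label is assigned (for example from a best sequence-similarity hit) is left open here; the point is only that contaminant transcripts in the expected set depress the score without reflecting a gap in the assembly.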
  17. Guidelines to assess the quality of a new genome sequence
      1. Estimate genome size using different methods
      2. Define and evaluate the expected gene space based on transcript mapping AND evolutionary conservation
          • Cleaning and mapping transcripts
          • Prefer coreGF/BUSCO over CEGMA to model expected conserved genes
      3. Large differences in completeness scores between genome assembly / annotated gene set can point to gene prediction issues
      4. To perform cross-species genome comparisons, focus on genomes with complete and contiguous assemblies
      Veeckman, E., Ruttink, T., and Vandepoele, K. (2016). Are We There Yet? Reliably Estimating the Completeness of Plant Genome Sequences. Plant Cell 28, 1759-1768.
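Guidelines 2 and 3 can be illustrated with a deliberately simplified completeness measure: the percentage of expected conserved gene families that are recovered, computed once for the genome assembly and once for the annotated gene set. This unweighted fraction is an assumption made for the sketch and is not the published coreGF scoring; the family identifiers and the 5-point threshold are likewise hypothetical.

```python
def completeness(expected_families, found_families):
    """Percentage of expected gene families that were recovered (unweighted)."""
    expected = set(expected_families)
    if not expected:
        raise ValueError("expected gene space is empty")
    return 100.0 * len(expected & set(found_families)) / len(expected)


def large_difference(genome_score, geneset_score, threshold=5.0):
    """Flag a gap between assembly-level and gene-set-level completeness."""
    return abs(genome_score - geneset_score) > threshold


# Hypothetical expected gene space of ten conserved families.
expected = [f"GF{i}" for i in range(1, 11)]
found_in_genome = [f"GF{i}" for i in range(1, 10)]   # 9 of 10 detected
found_in_geneset = [f"GF{i}" for i in range(1, 7)]   # 6 of 10 annotated

genome_score = completeness(expected, found_in_genome)
geneset_score = completeness(expected, found_in_geneset)
print(genome_score, geneset_score)                    # 90.0 60.0
print(large_difference(genome_score, geneset_score))  # True
```

A flagged gap of this kind means families are present in the sequence but missing from the annotation, which is the pattern the slides report for Lolium before re-annotation.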
  18. • Gene family annotation and phylogenetic trees
      • Traceable functional annotation (GO/InterPro/MapMan)
      • Colinearity and synteny
      • Integrative gene orthology inference
      Highly integrative platform to translate knowledge from model to crop
      • 55 species/genomes
      • Highly scalable design
      • Web-based mobile user interface
      • Integrated Workbench for analysis of sets of genes
      http://bioinformatics.psb.ugent.be/plaza/
  19. Coverage of gene function information
      Figure panels: Gene descriptions; Gene Ontology (Biological Process)
      Legend: blue = primary GO; green = GO projection (orthology + homology)
  20. TRAPID: analysis of non-model transcriptomes
      • Homology-based ORF detection, incl. frameshift correction
      • Gene family assignment
      • Functional annotation based on Gene Ontology and/or protein domains
      • Two reference databases: PLAZA 2.5 and OrthoMCL-DB
      • Applications
          • Sugar cane, wheat, Crocus sativus, conifers, Coffea arabica, Prunus
          • Dinoflagellates, diatoms, worms, fishes
      Figure label: SRA Viridiplantae Transcriptomic
      Van Bel, … & Vandepoele, Genome Biology 2013
  21. Drought Tolerance Conferred to Sugarcane by Association with Gluconacetobacter diazotrophicus: A Transcriptomic View of Hormone Pathways
      Vargas et al., PLoS One 2014
  22. Further reading
      • Veeckman, E., Ruttink, T., and Vandepoele, K. (2016). Are We There Yet? Reliably Estimating the Completeness of Plant Genome Sequences. Plant Cell 28, 1759-1768.
      • Proost, S., Van Bel, M. … and Vandepoele, K. (2015). PLAZA 3.0: an access point for plant comparative genomics. Nucleic Acids Res 43 (Database issue), D974-81.
      • Vandepoele, K. (2017). A Guide to the PLAZA 3.0 Plant Comparative Genomic Database. Methods Mol Biol 1533, 183-200.
      • Van Bel, M., Proost, S., Van Neste, C., Deforce, D., Van de Peer, Y., and Vandepoele, K. (2013). TRAPID: an efficient online tool for the functional and comparative analysis of de novo RNA-Seq transcriptomes. Genome Biol 14, R134.
      plaza_genomics
      Code freely available to efficiently compute coreGF completeness score
      Want a free PDF? Check out PLAZA poster
  23. PLAZA 3.0 user statistics (2016)
      • >11,000 users (+13%), 370K page views (+30%)
      • Users from >95 countries
      • Intensively used by
          • academia (>400 citations)
          • industry
  24. Text-mining; Orthology
  25. PLAZA Workbench
      • Create a custom gene set (~experiment) using gene identifiers or BLAST
          • External/internal gene IDs (e.g. AN3, AT5G28640, GRMZM2G180246_T01)
          • BLAST interface can be used to map sequence data from a non-model species to a reference species present in PLAZA
      • A toolbox is available to analyze user-defined gene sets (~experiment)
      • 2,132 registered users processed 11,875 Workbench experiments