VectorBase gene sets


An introduction to the gene sets in VectorBase - how they are made and how to use them

Published in: Health & Medicine, Technology

  1. VectorBase gene sets: a tutorial. Martin Hammond, VectorBase, European Bioinformatics Institute. December 2008. Slide 1 of 18
  2. What this tutorial covers
     • What is a VectorBase gene set?
     • How are the gene sets made?
     • What problems may there be with the gene models?
     • How are the sets affected by the genome assembly?
     • Are there different issues for the different organisms?
     • Can manual & community input improve gene sets?
     • How do gene models get their annotation?
  3. What is a VB gene set? VB provides a single ‘official’ gene set for each of its main species: three mosquitoes (Anopheles, Aedes, Culex) plus the tick Ixodes scapularis. VB & its collaborators make the best set we can, initially using automated gene modeling systems. The same set is the primary gene annotation on the GenBank / EMBL records for the sequence assembly. Having an official set with stable identifiers makes it easier for the research community to talk to one another about genes. And when improvements are made to a gene set, we keep track of how old & new models are related, so any work on a gene won’t be lost.
  4. What’s in a VB gene set? Most people are interested in protein-coding genes. The gene structures VB presents are predictions (‘models’) built on the whole genome shotgun sequence assembly for the species. Many models will not be exact representations of the ‘real’ gene in a real animal, and some ‘real’ genes may be missing altogether. The reasons for this apply to all genome annotation projects, not just to VB, and are discussed in later sections of this tutorial. VB also presents non-coding RNA genes (ncRNA) such as tRNAs, miRNAs, snRNAs etc. Again, these are built on the assembly and are likely to be incomplete sets.
  5. How are the gene sets made?
     • The initial gene set for each species is a collaboration between VB and one or more of the institutes involved in sequencing the genome
     • The initial annotation is automated (as opposed to manual) and uses a variety of approaches to find genes & predict their structures
     • VB annotation is combined with sets produced by our collaborators using complementary approaches, to produce the initial gene set
     • VB then takes over ongoing curation and improvement of the gene set
     • VB’s annotation procedure is outlined in the next few slides
  6. VB automatic annotation: overview
     The raw genome is repeat-masked, and five gene sets are built on the masked genome:
     • Set 1, Targeted set: Genewise using species-specific proteins
     • Set 2, Arthropod similarity set: Genewise using arthropod proteins
     • Set 3, EST gene set: Exonerate using species-specific ESTs + combiner
     • Set 4, Metazoan similarity set: Genewise using all other metazoan proteins
     • Set 5, ab initio gene set: SNAP + require Pfam domain
     The sets are then merged, giving priority to higher-confidence sets.
  7. VB automatic annotation: Repeat masking
     • Several approaches (TRF, Recon, RepeatScout, RepeatMasker) are used to identify & mark repeated sequences in the genome assembly
       – simple repeats, transposable elements etc
     • Using this repeat-masked genome sequence helps avoid predicting bad ‘genes’
     • The repeat-masked sequence is available at VB from each species’ genome home page
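As a rough illustration of what masking does to the sequence handed to the gene predictors, the sketch below hard-masks repeat intervals by replacing them with N. It is only a toy: the function name and the interval list are hypothetical, the intervals would in practice come from tools such as those named on the slide (their output formats are not parsed here), and the slides do not say whether the VB pipeline hard-masks or soft-masks.

```python
def hard_mask(sequence, repeats):
    """Replace every base inside a repeat interval with 'N'.

    `repeats` is a list of (start, end) pairs, 0-based and end-exclusive.
    These intervals are assumed input, not the output of any real tool.
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clamp each interval to the sequence so stray coordinates are harmless.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


# Toy example: a simple (AT)n repeat at positions 4..9 is masked out.
toy = "GCGCATATATGCGC"
masked_toy = hard_mask(toy, [(4, 10)])
print(masked_toy)  # prints GCGCNNNNNNGCGC
```

A predictor run on the masked string cannot build a model across the N run, which is the sense in which masking helps avoid bad ‘genes’.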
  8. VB automatic annotation: Gene sets
     • We make genome-wide sets of gene models using 3 main approaches:
       – aligning various sets of protein sequences to the genome using Genewise
       – aligning ESTs and combining them to make ‘EST genes’ (both these approaches use the Ensembl system)
       – running an ab initio gene predictor called SNAP (from Ian Korf)
     • We then combine these sets, prioritizing the higher confidence genes, and adding in lower confidence ones only where there are gaps to be filled (illustrated on next slide)
     • We may also combine a protein-based and an EST-based model to produce a protein-based model with its untranslated 5’ & 3’ regions (UTRs)
     • The next slide shows how we combine the sets, but be aware that the details are tailored to suit different species
  9. VB automatic annotation: gap filling
     The 2 genes from the Targeted set 1 have been placed, and one gene from set 2 can be added into a gap. We will subsequently add single genes from sets 3 & 4, but nothing from set 5.
     [Diagram: tracks for Set 1 (Targeted), Set 2 (Arthropod), Set 3 (EST-based), Set 4 (Metazoan) and Set 5 (Ab initio), together with the gene set being assembled]
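The gap-filling idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not VB's actual merge code: the `Model` type, the `merge_by_priority` function and all coordinates are invented for the example, models are reduced to a single start-end span on one sequence, and strand, exon structure and UTR combination are ignored.

```python
from typing import NamedTuple


class Model(NamedTuple):
    """A gene model reduced to one span (hypothetical, for illustration)."""
    name: str
    start: int
    end: int  # inclusive


def overlaps(a, b):
    """True if the two spans share at least one base."""
    return a.start <= b.end and b.start <= a.end


def merge_by_priority(sets_in_priority_order):
    """Take sets from highest to lowest confidence; a model is accepted
    only if it falls in a gap, i.e. overlaps nothing accepted so far."""
    accepted = []
    for gene_set in sets_in_priority_order:
        for model in gene_set:
            if not any(overlaps(model, kept) for kept in accepted):
                accepted.append(model)
    return sorted(accepted, key=lambda m: m.start)


# Toy data arranged to mirror the slide: two Targeted genes are placed,
# one gene each from sets 2, 3 and 4 fills a gap, nothing from set 5 fits.
set1 = [Model("T1", 100, 500), Model("T2", 2000, 2600)]
set2 = [Model("A1", 150, 450), Model("A2", 800, 1200)]
set3 = [Model("E1", 900, 1100), Model("E2", 3000, 3400)]
set4 = [Model("M1", 2100, 2500), Model("M2", 4000, 4300)]
set5 = [Model("S1", 3100, 3300), Model("S2", 4100, 4200)]

merged = merge_by_priority([set1, set2, set3, set4, set5])
print([m.name for m in merged])  # prints ['T1', 'A2', 'T2', 'E2', 'M2']
```

Processing the sets strictly in priority order means a lower-confidence model can never displace a higher-confidence one; it can only occupy space that is still empty.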
  10. Combining annotation from VB and collaborators
      • In most of our projects, the initial gene annotation was produced in collaboration with the J Craig Venter Institute (JCVI) &/or the Broad Institute
      • Each of the collaborating institutes generated a gene set
      • Approaches included EST-based modeling using PASA, Genewise, ab initio programs such as Augustus etc.
      • All sets were then merged into one:
        – No alternative transcripts (a limited number were added later in some species)
        – Genes with compatible structures: keep the longest
        – Overlapping genes with different structures: keep the best-supported
        – Where ab initio model only: eliminate short ones unless similar to known protein or domain
        – Re-screen to eliminate CDS from transposable elements
  11. Limitations of gene sets
      • Gene sets made by automated methods will never be perfect!
      • Also dependent on quality of the assembly (see next slide)
      • Genes may be missed
        – gaps in assembly; lack of EST or protein-homology evidence in the databases
      • Genes may be incomplete
        – gaps in assembly; inability to model less-conserved start & end exons
      • Merges & splits
        – adjacent genes may occasionally be merged into one model
        – partial support or gaps can lead to one ‘real’ gene being split into two or more models
  12. Genome assembly issues
      • Whole genome shotgun sequencing projects are often assessed on coverage and on number & average size of contigs and supercontigs
        – the VB projects have quite high coverage (bases of sequence generated >6X number of bases in genome)
        – but many gaps are still present in all our assemblies
      • Polymorphism problems
        – VB animals are small, and the DNA for sequencing comes from many individuals which may have significant genetic diversity
        – causes assembly problems including artifactual duplications/deletions and missed regions
      • Repeat problems
        – Genomes with high levels of repeated sequences are harder to assemble and, in trying to mask the repeats, gene families can occasionally be masked
      • Remember, as well as the assembly, the raw traces (sequence reads) are also available and can be searched.
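The coverage figure quoted above is simply the ratio of bases sequenced to bases in the genome. A small worked example follows; the read count, read length and genome size are hypothetical round numbers chosen for the arithmetic and do not describe any VB project.

```python
def fold_coverage(total_read_bases, genome_size_bases):
    """Fold coverage: bases of sequence generated per base of genome."""
    if genome_size_bases <= 0:
        raise ValueError("genome size must be positive")
    return total_read_bases / genome_size_bases


# Hypothetical project: 5 million reads averaging 700 bases, 500 Mb genome.
reads = 5_000_000
mean_read_length = 700
genome_size = 500_000_000

coverage = fold_coverage(reads * mean_read_length, genome_size)
print(f"{coverage:.1f}X")  # prints 7.0X
```

Coverage is an average over the whole genome, so a value above 6X is compatible with individual regions being sampled thinly or not at all, which is one reason gaps remain.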
  13. Gene set comparisons, December 2008
      Species     Assembly length   # supercontigs                                  Predicted gene models
      Culex       580 Mb            3,171                                           18,883
      Aedes       1.38 Gb           4,758                                           15,419
      Anopheles   280 Mb            8,987 (4,654 ordered on 5 chromosome arms)      12,945
      Ixodes      1.77 Gb           369,495                                         20,486
      Can you conclude that Ixodes & Culex really have more genes? No. They might, but the number of predictions depends on the state of the assembly and gene annotation as well. For example, the number of predicted genes for Anopheles decreased in the first revisions as bad predictions were eliminated, and is now set to increase again as a result of detailed manual annotation.
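One crude way to see how different the states of these assemblies are is to divide assembly length by the number of supercontigs, as in the sketch below. The input figures are the ones in the table above; the dictionary layout and variable names are just for illustration, and a mean is only a rough summary because supercontig sizes are very uneven.

```python
# Assembly length in bases and number of supercontigs, from the table above.
assemblies = {
    "Culex": (580e6, 3_171),
    "Aedes": (1.38e9, 4_758),
    "Anopheles": (280e6, 8_987),
    "Ixodes": (1.77e9, 369_495),
}

# Mean supercontig length per species, in bases.
mean_supercontig = {
    species: length / count for species, (length, count) in assemblies.items()
}

# List species from most to least fragmented (smallest mean first).
for species, mean_len in sorted(mean_supercontig.items(), key=lambda kv: kv[1]):
    print(f"{species:10s} {mean_len / 1e3:8.1f} kb")
```

On these numbers the Ixodes assembly averages under 5 kb per supercontig, far smaller than the mosquito assemblies, which is consistent with the caution that its gene count reflects the assembly as well as the biology.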
  14. Issues for different species
      By now you will be aware that all gene sets, including those at VB, need to be used with a degree of caution. Here are a few additional points for each of the VB species, emphasizing how they differ.
      • Anopheles gambiae: The only assembly where scaffolds have been assigned to chromosomes; known polymorphism issues partially addressed; gene set now in its fourth version; gene set includes much manual annotation.
      • Aedes aegypti: Genome much larger & with higher repeat content than the other mosquitoes.
      • Culex quinquefasciatus: Higher gene count may reflect some real family expansion but may also be some overprediction.
      • Ixodes scapularis: Large genome; high level of polymorphism leading to assembly with many gaps and large number of separate supercontigs. Gene set expected to be missing genes and to include models that may be incomplete.
  15. Manual & community input can improve gene sets
      • Automatic annotation can be applied to a whole genome relatively rapidly and although it has limitations, these can be taken into account when making use of the gene set.
      • Expert manual annotation can improve the structures of individual genes, but is a slow process
        – VB has carried out some systematic manual annotation, mostly on Anopheles so far
        – VB has also done targeted manual annotation leading to correction of some models in all 3 mosquito species
        – Community annotation for individual genes is welcomed and can be submitted via our Community Annotation system - read more in the tutorial here:
  16. Anopheles browser showing manually-annotated models on chromosome arm 2R
      The manual annotator suggests merging 2 existing models and changing the structure of another. These changes will be incorporated into build 5 of the Anopheles gene set.
  17. Adding annotations to gene models
      • VB adds value by automatically annotating features of gene models
      • Protein features, including:
        – transmembrane regions & signal peptides
        – families (Prints, TIGRFam etc)
        – domains (Pfam etc)
      • Cross references to other resources
        – database records that may represent the same gene
        – GO terms
      • Community annotations are also welcomed - see the guide here:
  18. Further information and help
      VectorBase help documentation starts at
      Please email the VectorBase help desk with any further comments or questions. The address is: