VectorBase gene sets
European Bioinformatics Institute
Slide 1 of 18
What this tutorial covers
• What is a VectorBase gene set?
• How are the gene sets made?
• What problems may there be with the gene models?
• How are the sets affected by the genome assembly?
• Are there different issues for the different organisms?
• Can manual & community input improve gene sets?
• How do gene models get their annotation?
Slide 2 of 18
What is a VB gene set?
VB provides a single ‘official’ gene set for its main
species: 3 mosquitoes (Anopheles, Aedes, Culex), plus
the tick Ixodes scapularis. VB & its collaborators make
the best set we can, initially using automated gene
The same set is the primary gene annotation on the
GenBank / EMBL records for the sequence assembly.
Having an official set with stable identifiers makes it easier
for the research community to talk to one another about
genes. And when improvements are made to a gene set,
we keep track of how old & new models are related, so
any work on a gene won’t be lost.
Slide 3 of 18
What’s in a VB gene set?
Most people are interested in protein-coding genes.
The gene structures VB presents are predictions
(‘models’) built on the whole genome shotgun
sequence assembly for the species. Many models will
not be exact representations of the ‘real’ gene in a real
animal, and some ‘real’ genes may be missing altogether.
The reasons for this apply to all genome annotation
projects, not just to VB, and are discussed in later
sections of this tutorial.
VB also presents non-coding RNA genes (ncRNA) such
as tRNAs, miRNAs, snRNAs etc.
Again, these are built on the assembly and are likely to be
incomplete sets.
Slide 4 of 18
How are the gene sets made?
• The initial gene set for each species is a collaboration
between VB and one or more of the institutes involved in
sequencing the genome
• The initial annotation is automated (as opposed to
manual) and uses a variety of approaches to find genes &
predict their structures
• VB annotation is combined with sets produced by our
collaborators using complementary approaches, to
produce the initial gene set
• VB then takes over ongoing curation and improvement of
the gene set
• VB’s annotation procedure is outlined in the next few slides
Slide 5 of 18
VB automatic annotation: overview
Diagram: the raw genome is repeat-masked, five gene sets are built on the masked genome, and the sets are then merged.
Set 1 Targeted set: Genewise using species-specific
Set 2 Arthropod similarity set: Genewise using arthropod proteins
Set 3 EST gene set: Exonerate using species-specific ESTs + combiner
Set 4 Metazoan similarity set: Genewise using all other
Set 5 ab initio gene set: SNAP + require Pfam
Slide 6 of 18
VB automatic annotation:
• Several approaches (TRF, Recon, RepeatScout,
RepeatMasker) are used to identify & mark repeated
sequences in the genome assembly
• simple repeats, transposable elements etc
• Using this repeat-masked genome sequence helps avoid
predicting bad ‘genes’
• The repeat-masked sequence is available at VB from
each species’ genome home page
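As a rough illustration of what masking means in practice, the sketch below hard-masks a sequence by replacing every base inside a repeat interval with N. The function name `hard_mask`, the interval list, and the coordinate convention are assumptions made for this example; it does not reproduce the output format of TRF, Recon, RepeatScout or RepeatMasker.

```python
def hard_mask(sequence, repeats):
    """Return `sequence` with every base inside a repeat interval replaced by 'N'.

    `repeats` is a list of (start, end) pairs, 0-based and end-exclusive.
    This coordinate convention is an assumption of the sketch.
    """
    masked = list(sequence)
    for start, end in repeats:
        # Clamp to the sequence bounds so a stray interval cannot raise IndexError.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


# Hypothetical 20-base sequence with one simple repeat (ATATATAT) at offsets 4..11.
example = "ACGTATATATATGGCCTTAA"
masked_example = hard_mask(example, [(4, 12)])
```

Gene prediction tools that skip runs of N will then not build models inside the masked region, which is the point of masking before annotation.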
Slide 7 of 18
VB automatic annotation: Gene sets
• We make genome-wide sets of gene models using 3 main approaches:
• aligning various sets of protein sequences to the genome using Genewise
• aligning ESTs and combining them to make ‘EST genes’
(both these approaches use the Ensembl system)
• running an ab initio gene predictor called SNAP (from Ian Korf).
• We then combine these sets, prioritizing the higher confidence
genes, and adding in lower confidence ones only where there
are gaps to be filled (illustrated on next slide).
• We may also combine a protein-based and an EST-based
model to produce a protein-based model with its untranslated 5’
& 3’ regions (UTRs)
• The next slide shows how we combine the sets - but be aware
that the details are tailored to suit different species
Slide 8 of 18
VB automatic annotation: gap filling
The 2 genes from set 1 (Targeted) have been placed, and one gene from
set 2 can be added into a gap. We will subsequently add single genes from
sets 3 & 4, but nothing from set 5.
Set 1 Targeted
Set 2 Arthropod
Gene set being
Set 3 EST-based
Set 5 Ab initio
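The gap-filling idea on this slide can be sketched as a priority merge: sets are visited from highest to lowest confidence, and a model is accepted only if it overlaps nothing already accepted. This is a simplified sketch, not VB's actual merge code. The record layout, the coordinates, the gene names and the overlap test (same sequence, any shared base, strand ignored) are all assumptions chosen to mirror the figure.

```python
def model(gene_id, start, end, seq="scaffold_1"):
    """Build a minimal gene model record (all names here are hypothetical)."""
    return {"id": gene_id, "seq": seq, "start": start, "end": end}


def overlaps(a, b):
    """True if two models on the same sequence share at least one base.

    Coordinates are treated as inclusive; strand is ignored in this sketch.
    """
    return a["seq"] == b["seq"] and a["start"] <= b["end"] and b["start"] <= a["end"]


def merge_by_priority(gene_sets):
    """Merge gene sets listed from highest to lowest confidence.

    A model is added only where it fills a gap, i.e. overlaps no accepted model.
    """
    merged = []
    for gene_set in gene_sets:
        for candidate in gene_set:
            if not any(overlaps(candidate, kept) for kept in merged):
                merged.append(candidate)
    return merged


# Hypothetical models arranged to reproduce the outcome described on the slide.
set1 = [model("T1", 100, 500), model("T2", 2000, 2600)]            # Targeted
set2 = [model("A1", 150, 480), model("A2", 800, 1200),
        model("A3", 2100, 2500)]                                   # Arthropod
set3 = [model("E1", 900, 1100), model("E2", 3000, 3400)]           # EST-based
set4 = [model("M1", 3100, 3300), model("M2", 4000, 4500)]          # Metazoan
set5 = [model("S1", 120, 300), model("S2", 4100, 4400)]            # Ab initio

merged = merge_by_priority([set1, set2, set3, set4, set5])
merged_ids = [m["id"] for m in merged]
```

With these made-up coordinates, both set 1 genes are placed, one gene each is added from sets 2, 3 and 4, and nothing from set 5 survives, matching the situation in the figure.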
Slide 9 of 18
Combining annotation from VB and collaborators
• In most of our projects, the initial gene annotation was produced
in collaboration with the J Craig Venter Institute (JCVI) &/or the
• Each of the collaborating institutes generated a gene set
• Approaches included EST-based modeling using PASA,
Genewise, ab initio programs such as Augustus, etc.
• All sets were then merged into one:
– No alternative transcripts (a limited number were added later in
– Genes with compatible structures: keep the longest
– Overlapping genes with different structures: keep the best-
– Where ab initio model only: eliminate short ones unless similar to
known protein or domain
– Re-screen to eliminate CDS from transposable elements
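Two of the merge rules above lend themselves to a short sketch: keeping the longest of several compatible models, and dropping short ab initio-only models unless they resemble a known protein or domain. The slide gives no length cutoff and does not say how "longest" is measured, so the 300-base threshold, the use of CDS length, and the field names below are all assumptions for illustration only.

```python
MIN_CDS_LENGTH = 300  # hypothetical cutoff in bases; the source does not state one


def pick_compatible(models):
    """Among models with compatible structures, keep the longest.

    Length is measured as CDS length here, which is an assumption.
    """
    return max(models, key=lambda m: m["cds_length"])


def keep_ab_initio(model, min_length=MIN_CDS_LENGTH):
    """Keep an ab initio-only model if it is long enough, or if it is
    similar to a known protein or domain."""
    has_similarity = model["protein_hit"] or model["domain_hit"]
    return model["cds_length"] >= min_length or has_similarity


# Hypothetical ab initio-only models.
candidates = [
    {"id": "ab1", "cds_length": 150, "protein_hit": False, "domain_hit": False},
    {"id": "ab2", "cds_length": 150, "protein_hit": False, "domain_hit": True},
    {"id": "ab3", "cds_length": 900, "protein_hit": False, "domain_hit": False},
]
kept_ids = [m["id"] for m in candidates if keep_ab_initio(m)]
longest_id = pick_compatible(candidates)["id"]
```

Here the short model without any similarity is eliminated, while the short model with a domain hit and the long model are both kept.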
Slide 10 of 18
Limitations of gene sets
• Gene sets made by automated methods will never be perfect!
• Also dependent on quality of the assembly (see next slide)
• Genes may be missed
– gaps in assembly; lack of EST or protein-homology evidence in the
• Genes may be incomplete
– gaps in assembly; inability to model less-conserved start & end
• Merges & splits
– adjacent genes may occasionally be merged into one model
– partial support or gaps can lead to one ‘real’ gene being split into
two or more models
Slide 11 of 18
Genome assembly issues
• Whole genome shotgun sequencing projects are often assessed on
coverage and on number & average size of contigs and supercontigs
– the VB projects have quite high coverage (bases of sequence generated
>6X number of bases in genome)
– but many gaps are still present in all our assemblies
• Polymorphism problems
– VB animals are small, and the DNA for sequencing comes from many
individuals which may have significant genetic diversity
– causes assembly problems including artifactual duplications/deletions and
• Repeat problems
– Genomes with high levels of repeated sequences are harder to assemble
and, in trying to mask the repeats, gene families can occasionally be
• Remember, as well as the assembly, the raw traces (sequence reads)
are also available and can be searched.
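The coverage figure quoted above is the ratio of bases sequenced to bases in the genome. The sketch below works that ratio through for one case. The read count and mean read length are invented for the example; only the 280 Mb genome size matches a figure given elsewhere in this tutorial (the Anopheles assembly).

```python
def fold_coverage(read_count, mean_read_length, genome_size):
    """Fold coverage = total bases sequenced / bases in the genome."""
    return read_count * mean_read_length / genome_size


# Hypothetical project: 2.4 million reads averaging 700 bases on a 280 Mb genome.
coverage = fold_coverage(read_count=2_400_000,
                         mean_read_length=700,
                         genome_size=280_000_000)
```

High average coverage does not guarantee a gap-free assembly, because reads are not spread evenly and repeats or polymorphic regions may still fail to assemble.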
Slide 12 of 18
Gene set comparisons
Species     Assembly                      Predicted genes
Culex       580 Mb                        18,883
Aedes       1.38 Gb                       15,419
Anopheles   280 Mb (4,654 ordered on 5    12,945
Ixodes      1.77 Gb                       20,486
Can you conclude that Ixodes & Culex really have more genes? No - they
might, but the number of predictions depends on the state of the assembly and
gene annotation as well. For example, the number of predicted genes for
Anopheles decreased in the first revisions as bad predictions were eliminated,
and is now set to increase again as a result of detailed manual annotation.
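One way to see why raw counts are hard to compare is to put the table's figures on a common footing, as predicted genes per megabase of assembly. The sketch below does only that arithmetic on the numbers from the table, assuming 1 Gb = 1,000 Mb. It says nothing about true gene content.

```python
# (assembly size in Mb, predicted gene count), taken from the table above.
GENE_SETS = {
    "Culex": (580, 18883),
    "Aedes": (1380, 15419),
    "Anopheles": (280, 12945),
    "Ixodes": (1770, 20486),
}


def genes_per_mb(assembly_mb, gene_count):
    """Predicted genes per megabase of assembled sequence."""
    return gene_count / assembly_mb


density = {
    species: round(genes_per_mb(size_mb, count), 1)
    for species, (size_mb, count) in GENE_SETS.items()
}
ranked = sorted(density, key=density.get, reverse=True)
```

The species with the most predicted genes (Ixodes) has one of the lowest densities, and the smallest assembly (Anopheles) has the highest. This is consistent with the point that prediction counts reflect assembly size and state as well as biology.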
Slide 13 of 18
Issues for different species
By now you will be aware that all gene sets, including those at VB, need to be used
with a degree of caution. Here are a few additional points for each of the VB
species, emphasizing how they differ.
The only assembly where scaffolds have been assigned to chromosomes; known
polymorphism issues partially addressed; gene set now in its fourth version; gene set
includes much manual annotation.
Genome much larger & with higher repeat content than the other mosquitoes.
Higher gene count may reflect some real family expansion but may also be some
Large genome; high level of polymorphism leading to assembly with many gaps and
large number of separate supercontigs. Gene set expected to be missing genes and
to include models that may be incomplete.
Slide 14 of 18
Manual & community input can
improve gene sets
• Automatic annotation can be applied to a whole genome
relatively rapidly and although it has limitations, these can
be taken into account when making use of the gene set.
• Expert manual annotation can improve the structures of
individual genes, but is a slow process
– VB has carried out some systematic manual annotation -
mostly on Anopheles so far
– VB has also done targeted manual annotation leading to
correction of some models in all 3 mosquito species
– Community annotation for individual genes is welcomed and
can be submitted via our Community Annotation system -
read more in the tutorial here:
Slide 15 of 18
Anopheles browser showing manually-annotated
models on chromosome arm 2R
The manual annotator suggests merging 2 existing models
and changing the structure of another. These changes will
be incorporated into build 5 of the Anopheles gene set.
Slide 16 of 18
Adding annotations to gene models
• VB adds value by automatically annotating features of gene models
• Protein features, including:
– transmembrane regions & signal peptides
– families (Prints, TIGRFam etc)
– domains (Pfam etc)
• Cross references to other resources
– database records that may represent the same gene
– GO terms
• Community annotations are also welcomed - see the guide
Slide 17 of 18
Further information and help
VectorBase help documentation starts at
Please email the VectorBase help desk with any further
comments or questions. The address is:
Slide 18 of 18