Erin Pleasance and Steven Jones February 23, 2004
(c) 2004 CGDN 1
1
The Ensembl Database
http://www.ensembl.org
Lecture 7.1 2
Ensembl is a genome browser for vertebrate genomes that
supports research in comparative genomics, evolution,
sequence variation and transcriptional regulation.
Ensembl annotates genes, computes multiple alignments,
predicts regulatory function and collects disease data.
Ensembl
3
What is Ensembl?
• Public annotation of mammalian and other
genomes
• Open source software
4
The Ensembl Project
“Ensembl is a joint project between EMBL
European Bioinformatics Institute and the
Sanger Institute to develop a software
system which produces and maintains
automatic annotation on eukaryotic
genomes. Ensembl is primarily funded by
the Wellcome Trust”
5
The Ensembl Project
“The main aim of this campaign is to
encourage scientists across the world - in
academia, pharmaceutical companies, and
the biotechnology and computer industries -
to use this free information.”
- Dr. Mike Dexter, Director of the Wellcome Trust
6
Goal: An Accessible, Annotated Genome
[Figure: diagram of ContigView as “what we want in the end”]
7
Ensembl Genome Annotation
• Utilizes raw DNA sequence data from public sources
• Creates a tracking database (The “Ensembl database”)
• Joins the sequences - based on a sequence scaffold
• Automatically finds genes and other features of the sequence
• Associates sequence and features with data from other sources
• Provides a publicly accessible web based interface to the database
8
The Genome Problem
• The problem with the genome (particularly
human) is that it is “large, complicated, and
opaque to analysis” (Ewan Birney, Ensembl)
• Genome features to identify include:
– Genes: protein coding, RNA, pseudogenes
– Regulatory elements
– SNPs, repeats, etc.
9
DNA sequence in Ensembl
• Sequences are determined in fragments (contigs)
• Features cross boundaries between fragments
• Entire sequence too large and changes too much
(constantly updated and reassembled) to be stored
as one long database entry
10
DNA sequence in Ensembl
• Core design feature is the “virtual contig”
object
• Allows genome sequence to be accessed as
a single large contiguous sequence even
though it is stored as a collection of fragments
• VC object handles reading and writing
features to the DNA sequence
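The virtual contig idea can be sketched in a few lines. This is only an illustration, not Ensembl's actual code: the `Fragment` and `VirtualContig` classes, the 0-based end-exclusive coordinates, and the use of `N` to pad unsequenced gaps are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One stored contig placed on the assembly (hypothetical layout)."""
    start: int      # 0-based offset of the fragment on the assembly
    sequence: str   # bases stored for this fragment


class VirtualContig:
    """Presents separately stored, non-overlapping fragments as one sequence."""

    def __init__(self, fragments, gap_char="N"):
        self.fragments = sorted(fragments, key=lambda f: f.start)
        self.gap_char = gap_char

    def subseq(self, start, end):
        """Return assembly bases in [start, end), padding gaps between fragments."""
        out = []
        pos = start
        for frag in self.fragments:
            frag_end = frag.start + len(frag.sequence)
            if frag_end <= pos or frag.start >= end:
                continue  # fragment lies outside the remaining window
            if frag.start > pos:
                # Unsequenced gap between the previous fragment and this one.
                out.append(self.gap_char * (frag.start - pos))
                pos = frag.start
            take_to = min(end, frag_end)
            out.append(frag.sequence[pos - frag.start:take_to - frag.start])
            pos = take_to
        if pos < end:
            out.append(self.gap_char * (end - pos))  # window runs past the last fragment
        return "".join(out)


vc = VirtualContig([Fragment(0, "ACGTAC"), Fragment(8, "GGTTAA")])
print(vc.subseq(4, 10))  # spans the end of one fragment, a gap, and the next fragment
```

The caller asks for a coordinate range on the whole assembly and never needs to know where one stored fragment ends and the next begins, which is what lets features cross fragment boundaries.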
11
Ensembl Gene Build System
• Three-part gene build system
– “Best in genome” matches for known genes
– Alignment of homologous genes
– Ab initio gene finding
• Genes predicted on repeat-masked DNA
• All genes predicted based on experimental
(available sequence) evidence
12
“Best in genome” predictions
• Locate known proteins from SwissProt/TrEMBL
on the genome
• Incorporate cDNAs using exonerate and
EST_genome
– Align with gaps placed preferentially at splice
consensus sites
– Allows prediction of 5’ and 3’ UTRs
• Refine predictions using genewise
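To illustrate what "gaps placed preferentially at splice consensus sites" means, here is a toy sketch. It is not how exonerate or EST_genome work internally (they use full alignment scoring models); it only assumes the canonical GT...AG intron boundaries and nudges a fixed-length gap by a few bases until it lands on them. The function names and the `wobble` parameter are hypothetical.

```python
def is_canonical_intron(genome, intron_start, intron_end):
    """True if genome[intron_start:intron_end] starts with GT and ends with AG."""
    intron = genome[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def best_gap_placement(genome, approx_start, intron_len, wobble=2):
    """Shift a gap of fixed length by up to `wobble` bases to hit GT...AG boundaries.

    Returns (start, end) of the adjusted gap, or None if no canonical placement
    exists within the allowed shift.
    """
    # Try the smallest shifts first so the gap moves as little as possible.
    for shift in sorted(range(-wobble, wobble + 1), key=abs):
        start = approx_start + shift
        if start < 0 or start + intron_len > len(genome):
            continue
        if is_canonical_intron(genome, start, start + intron_len):
            return start, start + intron_len
    return None


# Toy genomic sequence: exon, 12-base intron (GT...AG), exon.
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
# A naive alignment put the gap one base too far right (start 7 instead of 6).
placement = best_gap_placement(genome, approx_start=7, intron_len=12)
print(placement)
```

Once introns are anchored on consensus sites, the exon boundaries of the aligned cDNA are fixed, and the aligned sequence outside the coding region can be read off as 5’ and 3’ UTR.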
13
“Best in genome” predictions
• Alignments shown in ContigView
[Figure: ContigView of a best-in-genome gene with associated evidence. Labelled tracks: known gene (p53), proteins aligned, cDNAs aligned, UTRs predicted, UniGene clusters aligned]
14
Homology predictions
• Align homologous proteins using BLAST,
genewise
– Paralogs (from same organism)
– Orthologs (from closely related organisms)
• Assemble novel genes
15
Ab initio gene predictions
• Use Genscan to identify novel exons
• Confirm exons by BLAST to known proteins, mRNAs,
UniGene clusters
• Based on ab initio predictions but require homology
evidence
[Figure: ContigView of a homology gene with associated evidence. Labelled tracks: novel gene, Genscan predictions, proteins aligned, UniGene clusters aligned]
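The "ab initio prediction plus homology evidence" rule can be sketched as a simple overlap filter. This is an assumed simplification for illustration: exons and evidence hits are plain `(start, end)` intervals (0-based, end-exclusive), and any overlap counts as support. The real pipeline works from BLAST alignments and applies its own thresholds.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def supported_exons(predicted_exons, evidence_hits):
    """Keep only predicted exons that overlap at least one evidence hit."""
    return [
        exon
        for exon in predicted_exons
        if any(overlaps(*exon, *hit) for hit in evidence_hits)
    ]


# Hypothetical coordinates: three predicted exons, two evidence alignments.
predicted = [(100, 220), (400, 510), (900, 1000)]
evidence = [(150, 260), (880, 950)]
confirmed = supported_exons(predicted, evidence)
print(confirmed)  # the exon at 400-510 has no supporting evidence and is dropped
```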
16
Pseudogenes
• Many pseudogenes also predicted
17
Manual gene annotation: Otter
• Manual annotation done with applications, e.g. Apollo
• Otter database/server allows manual annotations to be
integrated with automated annotations
18
Manually curated genes: VEGA
• Chromosomes 6, 7, 13, 14, 20 and 22 contain manually
curated genes from the VEGA database
19
Gene information in Ensembl:
GeneView
20
Transcript information in Ensembl:
TransView
21
Protein information in Ensembl:
ProteinView
22
Comparative genomics in Ensembl
Gene orthologue pairs:
• Human <-> Mouse <-> Rat
<-> Fugu <-> Zebrafish
• C. elegans <-> C. briggsae
• Fly <-> Mosquito
DNA homology:
• Human <-> Mouse <-> Rat
23
Comparative genomics in
Ensembl: Gene orthologs
• Gene ortholog pairs shown in GeneView
• Calculated by BLAST (reciprocal best BLAST hits, or
BLAST + synteny)
• dN/dS = nonsynonymous/synonymous change
(measure of selection)
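As a rough sketch of the two ideas on this slide, the code below pairs genes by reciprocal best hits and computes a dN/dS ratio. The gene names and scores are made up, the inputs are assumed to be precomputed BLAST scores keyed by `(query, subject)`, and the BLAST + synteny variant is not covered.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


def dn_ds(dn, ds):
    """Ratio of nonsynonymous to synonymous change; None if ds is zero."""
    return None if ds == 0 else dn / ds


# Hypothetical BLAST scores between two species' gene sets.
human_vs_mouse = {("hA", "mA"): 900, ("hA", "mB"): 300,
                  ("hB", "mB"): 500, ("hC", "mB"): 650}
mouse_vs_human = {("mA", "hA"): 880, ("mB", "hC"): 640, ("mB", "hB"): 480}

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
print(pairs)             # hB is excluded: its best hit mB prefers hC
print(dn_ds(0.02, 0.4))  # well below 1
```

A ratio well below 1 is usually read as purifying selection, a ratio near 1 as neutral change, and a ratio above 1 as a sign of positive selection.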
24
Comparative genomics in
Ensembl: DNA homology
• DNA homology shown in ContigView
[Figure: ContigView tracks showing mouse and rat homology]
25
Comparative genomics in
Ensembl: Synteny
• Large-scale homology
shown in SyntenyView
– Synteny = homologous
sequence blocks, in
same order and
orientation
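The definition above ("same order and orientation") can be turned into a small grouping routine. This is an illustrative sketch only, not the method behind SyntenyView: it assumes homologous anchors are given as `(chrom_b, pos_b, strand)` tuples already sorted by position in the first genome, with `strand` set to +1 or -1 for relative orientation, and it uses a hypothetical `min_size` cutoff.

```python
def _extends(prev, nxt):
    """True if `nxt` continues the block ending at `prev`."""
    same_chrom = prev[0] == nxt[0]
    same_strand = prev[2] == nxt[2]
    # Forward strand: positions must increase; reverse strand: decrease.
    in_order = (nxt[1] - prev[1]) * prev[2] > 0
    return same_chrom and same_strand and in_order


def synteny_blocks(anchors, min_size=2):
    """Group consecutive anchors into runs that keep order and orientation."""
    blocks, current = [], []
    for anchor in anchors:
        if current and _extends(current[-1], anchor):
            current.append(anchor)
        else:
            if len(current) >= min_size:
                blocks.append(current)
            current = [anchor]
    if len(current) >= min_size:
        blocks.append(current)
    return blocks


# Hypothetical anchors along one chromosome of genome A.
anchors = [
    ("chr11", 10, 1), ("chr11", 20, 1), ("chr11", 35, 1),  # forward run
    ("chr4", 90, -1), ("chr4", 80, -1),                    # inverted run
    ("chr2", 5, 1),                                        # isolated hit
]
blocks = synteny_blocks(anchors)
print(len(blocks))
```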
26
Other features in Ensembl
• Menus provide other feature options
• Features, e.g. SNPs and markers, have special views
27
Other data sources in Ensembl
• Ensembl incorporates gene and feature info
from many other data sources
[Figure: examples shown are OMIM and SwissProt]
28
Other data sources in Ensembl:
Link out
29
The Distributed Annotation
System
• Allows viewing third-party annotation of the
genomic scaffold
• Users can choose the annotation they are
interested in
• Features are viewed in consistent user
interface/display
• Allows specialized feature annotation and the
comparison of different methodologies
30
Sequence similarity searching
• Two search methods
– SSAHA: very fast, good for identifying near-exact
DNA-DNA matches
– BLAST: slower but more accurate, can do DNA or
protein searches
• Can search against any species
• Can search against genomic sequence,
cDNAs (Ensembl or Genscan), or protein
sequences
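Why a hashing search is fast for near-exact DNA-DNA matches can be shown with a toy k-mer index. This sketch illustrates the general hash-lookup idea only and is not the SSAHA implementation: the function names, the choice of `k`, and the simple offset voting are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Hash every non-overlapping k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, k):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_offsets(query, index, k):
    """Vote for reference offsets implied by each overlapping query k-mer.

    Returns (offset, votes) pairs, most-supported offset first.
    """
    votes = Counter()
    for qpos in range(len(query) - k + 1):
        for rpos in index.get(query[qpos:qpos + k], ()):
            votes[rpos - qpos] += 1
    return votes.most_common()


reference = "ACGTTGCATGCCGATAGCTA"
index = build_index(reference, k=4)
query = reference[6:17]  # an exact substring of the reference
hits = candidate_offsets(query, index, k=4)
print(hits)
```

Each lookup is a dictionary access, so the cost depends on the query length rather than the reference length. The price is sensitivity: a query with scattered mismatches may share no exact k-mer with the index, which is where an alignment-based search such as BLAST does better.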
31
[Figure: search results page. Hits are shown relative to the genome; each hit links to the alignment [A], sequence [S], or ContigView [C]]
32
Ensembl updates
• Monthly
• Include:
– Changes in genome builds (with new annotations)
– Changes in code or database schema
– Additional views and tools on website
33
Pre-Ensembl
• Full annotation can take weeks
• Pre-Ensembl site provides in-progress annotation
– Placement of known proteins
– Ab initio gene predictions
– Repeat masking
– BLAST and SSAHA searching
34
Ensembl Software System
• Software can be accessed by FTP
• Can also be accessed through CVS
(concurrent versions system)
• Possible to set up a mirror of the entire
Ensembl system.
