HAVANA / Ensembl / GENCODE
annotation on GRCh38
Jonathan M. Mudge
Wellcome Trust Sanger Institute
HAVANA group
HAVANA provide manual gene annotation
cDNAs
ESTs
Genomic sequence (human, mouse, zebrafish…)
Protein
Transcript model
Publication data
Comparative analyses
Next generation datasets
HAVANA annotation can be viewed in Vega
vega.sanger.ac.uk V61 Jun1 2015
‘update’ annotation
v20 was the first GENCODE on GRCh38
v19 on GRCh37
GRCh38
(1) HAVANA liftover
(2) HAVANA reannotation
(3) Merge into full new Ensembl genebuild (Ensembl release 76)
Most gene IDs are preserved on GRCh38
GRCh37 GRCh38
Gene IDs were transferred based on contig-contig mapping strategy
… also used to map variation etc
ESPN
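A minimal sketch of the idea behind contig-based transfer, assuming each assembly records where a contig sits on a chromosome and in which orientation. The `Placement` record, the function names and the coordinates are hypothetical illustrations, not the actual HAVANA / Ensembl mapping code: a position is converted to contig coordinates on the old assembly, then back to chromosome coordinates on the new one.

```python
from typing import NamedTuple, Optional


class Placement(NamedTuple):
    """Hypothetical record of a contig slice placed on a chromosome (1-based)."""
    chrom: str
    chrom_start: int   # chromosome coordinate of the first placed base
    contig_start: int  # lowest contig coordinate in the placed slice
    length: int        # number of placed bases
    strand: int        # +1 or -1: contig orientation on the chromosome


def chrom_to_contig(p: Placement, pos: int) -> Optional[int]:
    """Convert a chromosome position to a contig position, or None if outside."""
    offset = pos - p.chrom_start
    if not 0 <= offset < p.length:
        return None
    if p.strand == 1:
        return p.contig_start + offset
    # Reverse orientation: the first chromosome base is the last contig base.
    return p.contig_start + p.length - 1 - offset


def contig_to_chrom(p: Placement, cpos: int) -> Optional[int]:
    """Convert a contig position to a chromosome position, or None if outside."""
    offset = cpos - p.contig_start
    if not 0 <= offset < p.length:
        return None
    if p.strand == 1:
        return p.chrom_start + offset
    return p.chrom_start + p.length - 1 - offset


def project(pos: int, old: Placement, new: Placement) -> Optional[int]:
    """Project a position between assemblies via a contig shared by both."""
    cpos = chrom_to_contig(old, pos)
    return None if cpos is None else contig_to_chrom(new, cpos)


# Toy example: the same contig, shifted and flipped in the new assembly.
old = Placement("1", 1000, 1, 500, 1)
new = Placement("1", 2000, 1, 500, -1)
first_base = project(1000, old, new)
last_base = project(1499, old, new)
outside = project(1500, old, new)
```

Positions that fall outside the shared contig return `None`; in this sketch those are the cases that would need manual attention.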
• Fixed gene issues caused by GRCh37 > GRCh38 changes
• Major QC performed
• New complex regions on chr 1, 9, X
• Alt loci / Haplotype annotation
The new pericentromeric region of chr9
New p-arm
Gaps closed / clones flipped round / clones moved to correct arm
Optical mapping data
Hundreds of new / rebuilt models
Old p-arm
Ongoing strategy for patch annotation
Ensembl: annotate patches when released without full gene build
HAVANA: prioritise certain fix / novel patches and alt loci for annotation
• some patches don’t contain genes that need re-annotating
• others are exceptionally complex
NOVEL patch HG-2048
GRCh38.p3
HAVANA pseudogene
HAVANA LRC annotation on GRCh38
Annotation of 34 Leukocyte Receptor Complexes (LRCs) completed for v20
COX2
COX1
PGF1
PGF2
DM1A
DM1B
MC1B
MC1A
LILRs KIRs
GENCODE remains a work in progress
… arguably, far from complete
• We are missing genes, transcripts and exons
• 1000s of our models are incomplete
• Functional annotation is largely putative
Which transcripts are functional?
How do they function?
GRCh38 GENCODE incorporates NextGen data
Transcript capture and completion / Functional annotation
Next generation experimental data
Short read data: querying transcript-level support of existing introns / exons; examining expression patterns, e.g. tissue specificity
Long read data: querying transcript-level support of existing introns / exons
CAGE / RAMPAGE / PolyAseq: establishing start and end points of genes / transcripts
Ribosome profiling: reappraising initiation codon usage
Mass spectrometry: identifying novel protein-coding regions
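As an illustration of querying transcript-level support for existing introns, the sketch below derives introns from a model's exons and flags those with no matching splice junction in read data. The function names, the junction table layout and the toy coordinates are assumptions made for this example, not a description of the GENCODE pipeline.

```python
from typing import Dict, List, Tuple

# (chromosome, intron start, intron end, strand), 1-based inclusive
Intron = Tuple[str, int, int, str]


def introns_from_exons(chrom: str, strand: str,
                       exons: List[Tuple[int, int]]) -> List[Intron]:
    """Derive the introns lying between consecutive exons of one model."""
    ordered = sorted(exons)
    return [
        (chrom, left_end + 1, right_start - 1, strand)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


def unsupported_introns(introns: List[Intron],
                        junction_reads: Dict[Intron, int],
                        min_reads: int = 1) -> List[Intron]:
    """Return introns seen in fewer than min_reads spliced reads."""
    return [i for i in introns if junction_reads.get(i, 0) < min_reads]


# Toy data: a three-exon model and a junction table supporting one intron.
model_exons = [(100, 200), (300, 400), (500, 600)]
model_introns = introns_from_exons("chr1", "+", model_exons)
junction_reads = {("chr1", 201, 299, "+"): 12}
flagged = unsupported_introns(model_introns, junction_reads)
```

Here the second intron is flagged because no spliced read matches it exactly; a real analysis would also need to consider read depth, tissue and mapping quality.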
GENCODE v23 compared with v19
v23 has 2,678 more genes… but 548 fewer protein-coding genes
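A comparison of this kind can be sketched by counting gene records per biotype in two releases' GTF files and taking the difference. The sketch assumes the `gene_type` attribute used in GENCODE GTF files; the helper names are hypothetical and the toy lines are invented for illustration, not real annotation.

```python
import re
from collections import Counter
from typing import Dict, Iterable

GENE_TYPE = re.compile(r'gene_type "([^"]+)"')


def count_gene_types(gtf_lines: Iterable[str]) -> Counter:
    """Count 'gene' records per gene_type in GTF-formatted lines."""
    counts: Counter = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header / comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # skip transcripts, exons and malformed lines
        match = GENE_TYPE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def diff_counts(old: Counter, new: Counter) -> Dict[str, int]:
    """Per-biotype change in gene count from the old to the new release."""
    return {k: new.get(k, 0) - old.get(k, 0) for k in sorted(set(old) | set(new))}


def toy_gene(gene_id: str, gene_type: str) -> str:
    """Build an invented GTF gene line for the example."""
    attrs = f'gene_id "{gene_id}"; gene_type "{gene_type}";'
    return "\t".join(["chr1", "HAVANA", "gene", "1", "100", ".", "+", ".", attrs])


old_release = ["##toy header", toy_gene("G1", "protein_coding"),
               toy_gene("G2", "protein_coding")]
new_release = ["##toy header", toy_gene("G1", "protein_coding"),
               toy_gene("G3", "lincRNA"), toy_gene("G4", "lincRNA")]
change = diff_counts(count_gene_types(old_release), count_gene_types(new_release))
```

With real data, the two inputs would be open file handles on the two releases' GTF files rather than in-memory lists.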
In conclusion
GENCODE is now a GRCh38 genebuild
Compared to GRCh37 builds it is:
• More accurate
• More comprehensive
• More sophisticated
We recommend you use GENCODE v23 on GRCh38
Acknowledgements
Major funding:
GENCODE partners: Wellcome Trust Sanger Institute; European Bioinformatics
Institute; The University of Lausanne; The Centre de Regulació Genòmica; The
University of California, Santa Cruz; The Massachusetts Institute of Technology;
Yale University; The Spanish National Cancer Research Centre.