1. HAVANA / Ensembl / GENCODE
annotation on GRCh38
Jonathan M. Mudge
Wellcome Trust Sanger Institute
HAVANA group
2. HAVANA provide manual gene annotation
cDNAs
ESTs
Genomic sequence (human, mouse, zebrafish…)
Protein
Transcript model
Publication data
Comparative analyses
Next generation datasets
11. HAVANA annotation can be viewed in Vega
vega.sanger.ac.uk V61 Jun1 2015
‘update’ annotation
12. v20 was the first GENCODE on GRCh38
v19 on GRCh37
GRCh38
(1) HAVANA
liftover
(2) HAVANA
reannotation
(3) Merge into full new Ensembl genebuild
(Ensembl release 76)
13. Most gene IDs are preserved on GRCh38
GRCh37 GRCh38
Gene IDs were transferred based on contig-contig mapping strategy
… also used to map variation etc
ESPN
15. • Fixed gene issues caused by 37 > 38 changes
• Major QC performed
• New complex regions on chr 1, 9, X
• Alt loci / Haplotype annotation
v20 was the first GENCODE on GRCh38
V19 on GRCh37
GRCh38
(1) HAVANA
liftover
(2) HAVANA
reannotation
(3) Merge into full new Ensembl genebuild
(Ensembl release 76)
16. The new pericentromic region of chr9
New p-arm
Gaps closed / clones flipped round / clones moved to correct arm
Optical mapping data
Hundreds of new / rebuilt models
Old p-arm
17. Ongoing strategy for patch annotation
Ensembl: annotate patches when released without full gene build
HAVANA: prioritise certain fix / novel patches and alt loci for annotation
• some patches don’t contain genes that need re-annotating
• others are exceptionally complex
NOVEL patch HG-2048
GRCh38.p3
HAVANA pseudogene
18. HAVANA LRC annotation on GRCh38
Annotation of 34 Leukoctye Receptor
Complexes (LRCs) completed for v20
COX2
COX1
PGF1
PGF2
DM1A
DM1B
MC1B
MC1A
LILRs KIRs
19. GENCODE remains a work in progress
… arguably, far from complete
• We are missing genes, transcripts and exons
• 1000s of our models are incomplete
• Functional annotation is largely putative
Which transcripts are functional?
How do they function?
20. GRCh38 GENCODE incorporates NextGen data
Transcript capture and completion Functional annotation
Next generation experimental data
Short read data: querying transcript-level support of existing introns / exons
examining expression patterns, e.g. tissue specificity
Long read data: querying transcript-level support of existing introns / exons
CAGE / RAMPAGE / PolyAseq: establishing start and end points of genes / transcripts
Ribosome profiling: reappraising initiation codon usage
Mass spectrometry: identifying novel protein-coding regions
21. GENCODE v23 compared with v19
v23 has 2,678 more genes… 548 less protein coding genes
22. In conclusion
GENCODE is now a GRCh38 genebuild
Compared to GRCh37 builds it is:
• More accurate
• More comprehensive
• More sophisticated
We recommend you use GENCODEv23 on GRC38
23. Acknowledgements
Major funding:
GENCODE partners: Wellcome Trust Sanger Institute; European Bioinformatics
Institute; The University of Lausanne; The Centre de Regulació Genòmica; The
University of California, Santa Cruz; The Massachusetts Institute of Technology;
Yale University; The Spanish National Cancer Research Centre.