HAVANA / Ensembl / GENCODE
annotation on GRCh38
Jonathan M. Mudge
Wellcome Trust Sanger Institute
HAVANA group
HAVANA provide manual gene annotation
cDNAs
ESTs
Genomic sequence (human, mouse, zebrafish…)
Protein
Transcript model
Publication data
Comparative analyses
Next generation datasets
Ensembl: computational genome annotation
Ensembl genebuild based on genomic alignments
Not all Ensembl releases represent new genebuilds
GENCODE is a HAVANA / Ensembl merge
… with 8 institutes contributing
Run every 3-6 months
GENCODEv23 released July 2015
19,797 protein-coding genes
15,931 long non-coding RNA genes
14,477 pseudogenes
Human GENCODE v23: 60,498 genes containing 198,619 transcripts
CDS exon
Non-coding / UTR
79,795 CDS transcripts due to alternative splicing
27,817 lncRNA transcripts
1,112 transcribed
GENCODE is the geneset for ENCODE
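Per-biotype gene counts like those on this slide can be tallied from a GENCODE GTF file. The following is a minimal sketch, not the GENCODE pipeline: it assumes the standard nine-column GTF layout with a `gene_type` attribute on `gene` records, and the sample records and IDs are made up for illustration.

```python
import re
from collections import Counter

# Tiny made-up GTF excerpt (illustrative records, not real GENCODE lines).
SAMPLE_GTF = """\
##description: illustrative example
chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG_EXAMPLE_1.1"; gene_type "protein_coding"; gene_name "GENE_A";
chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSG_EXAMPLE_1.1"; transcript_id "ENST_EXAMPLE_1.1"; gene_type "protein_coding";
chr1\tENSEMBL\tgene\t2000\t2500\t.\t-\t.\tgene_id "ENSG_EXAMPLE_2.1"; gene_type "lincRNA"; gene_name "GENE_B";
chr2\tHAVANA\tgene\t50\t400\t.\t+\t.\tgene_id "ENSG_EXAMPLE_3.2"; gene_type "processed_pseudogene"; gene_name "GENE_C";
"""

# Matches key "value" pairs in the ninth (attributes) column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def count_gene_types(gtf_text):
    """Count 'gene' records per gene_type, skipping comments and other features."""
    counts = Counter()
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        counts[attrs.get("gene_type", "unknown")] += 1
    return counts


gene_type_counts = count_gene_types(SAMPLE_GTF)
print(gene_type_counts)
```

Counting only `gene` records avoids double-counting genes that have several transcript records. Grouping biotypes into the broad classes shown on the slide (protein coding, lncRNA, pseudogene) would need an extra mapping that is not shown here.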
GENCODE has a designated web portal
www.gencodegenes.org
Viewing GENCODE in genome browsers
www.ensembl.org Ensembl 81/82 = GENCODEv23
Viewing GENCODE in genome browsers
https://genome.ucsc.edu
HAVANA annotation can be viewed in Vega
vega.sanger.ac.uk V61 Jun1 2015
‘update’ annotation
v20 was the first GENCODE on GRCh38
v19 on GRCh37
GRCh38
(1) HAVANA
liftover
(2) HAVANA
reannotation
(3) Merge into full new Ensembl genebuild
(Ensembl release 76)
Most gene IDs are preserved on GRCh38
GRCh37 GRCh38
Gene IDs were transferred based on contig-contig mapping strategy
… also used to map variation etc
ESPN
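One way to check how many gene IDs carry over between a GRCh37-based and a GRCh38-based release is to compare the stable IDs with their version suffixes removed. This is a hedged sketch of that comparison only, not the contig-contig mapping itself; the function names and example IDs are hypothetical.

```python
def strip_version(stable_id):
    """Drop the '.N' version suffix, e.g. 'ENSG_EXAMPLE_A.3' -> 'ENSG_EXAMPLE_A'."""
    return stable_id.split(".", 1)[0]


def compare_gene_ids(old_ids, new_ids):
    """Classify stable IDs as preserved, retired, or new between two releases."""
    old = {strip_version(i) for i in old_ids}
    new = {strip_version(i) for i in new_ids}
    return {
        "preserved": sorted(old & new),
        "retired": sorted(old - new),
        "new": sorted(new - old),
    }


# Hypothetical ID lists standing in for a GRCh37 and a GRCh38 release.
grch37_ids = ["ENSG_EXAMPLE_A.3", "ENSG_EXAMPLE_B.1", "ENSG_EXAMPLE_C.7"]
grch38_ids = ["ENSG_EXAMPLE_A.4", "ENSG_EXAMPLE_C.7", "ENSG_EXAMPLE_D.1"]

id_report = compare_gene_ids(grch37_ids, grch38_ids)
print(id_report)
```

Stripping the version first matters because a preserved gene may still have its version number incremented when its annotation changes.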
GRCh37
GRCh37
patch
GRCh38
Ensembl re-annotation of SRGAP2
ENSG00000266028
ENSG00000266028
ENSG00000163486
Assembly Gene ID
• Fixed gene issues caused by 37 > 38 changes
• Major QC performed
• New complex regions on chr 1, 9, X
• Alt loci / Haplotype annotation
v20 was the first GENCODE on GRCh38
v19 on GRCh37
GRCh38
(1) HAVANA
liftover
(2) HAVANA
reannotation
(3) Merge into full new Ensembl genebuild
(Ensembl release 76)
The new pericentromeric region of chr9
New p-arm
Gaps closed / clones flipped round / clones moved to correct arm
Optical mapping data
Hundreds of new / rebuilt models
Old p-arm
Ongoing strategy for patch annotation
Ensembl: annotate patches when released without full gene build
HAVANA: prioritise certain fix / novel patches and alt loci for annotation
• some patches don’t contain genes that need re-annotating
• others are exceptionally complex
NOVEL patch HG-2048
GRCh38.p3
HAVANA pseudogene
HAVANA LRC annotation on GRCh38
Annotation of 34 Leukocyte Receptor Complexes (LRCs) completed for v20
COX2
COX1
PGF1
PGF2
DM1A
DM1B
MC1B
MC1A
LILRs KIRs
GENCODE remains a work in progress
… arguably, far from complete
• We are missing genes, transcripts and exons
• 1000s of our models are incomplete
• Functional annotation is largely putative
Which transcripts are functional?
How do they function?
GRCh38 GENCODE incorporates NextGen data
Transcript capture and completion Functional annotation
Next generation experimental data
Short read data: querying transcript-level support of existing introns / exons
examining expression patterns, e.g. tissue specificity
Long read data: querying transcript-level support of existing introns / exons
CAGE / RAMPAGE / PolyAseq: establishing start and end points of genes / transcripts
Ribosome profiling: reappraising initiation codon usage
Mass spectrometry: identifying novel protein-coding regions
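Querying read support for existing introns can be pictured as follows. This is an illustrative sketch under stated assumptions, not HAVANA's actual method: exons are 1-based inclusive `(start, end)` pairs on one strand, splice junctions arrive as a precomputed dictionary of read counts keyed by intron coordinates, and all names and numbers are hypothetical.

```python
def introns_from_exons(exons):
    """Derive intron intervals (1-based, inclusive) from a transcript's exons."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def intron_support(exons, junction_counts, min_reads=1):
    """Flag each annotated intron as supported if enough reads span its junction."""
    return {
        intron: junction_counts.get(intron, 0) >= min_reads
        for intron in introns_from_exons(exons)
    }


# Hypothetical three-exon transcript model and junction read counts.
model_exons = [(100, 200), (300, 400), (500, 600)]
junctions = {(201, 299): 12}

support = intron_support(model_exons, junctions)
print(support)
```

Exact coordinate matching is a deliberate simplification: a junction that is off by even one base counts as no support, which is the conservative choice when judging an annotated intron.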
GENCODE v23 compared with v19
v23 has 2,678 more genes… 548 fewer protein-coding genes
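The v19 totals implied by these differences can be worked out from the v23 figures quoted earlier in the slides (60,498 genes, 19,797 protein-coding genes). The short calculation below is only arithmetic on those slide numbers; the variable names are illustrative.

```python
# Figures quoted in the slides for GENCODE v23.
v23_genes = 60_498
v23_protein_coding = 19_797

# Differences between v23 and v19 quoted in the slides.
extra_genes_in_v23 = 2_678
fewer_protein_coding_in_v23 = 548

# Implied v19 totals.
v19_genes = v23_genes - extra_genes_in_v23
v19_protein_coding = v23_protein_coding + fewer_protein_coding_in_v23

# Net change in all other gene categories (lncRNA, pseudogenes, etc.).
other_categories_change = extra_genes_in_v23 + fewer_protein_coding_in_v23

print(v19_genes, v19_protein_coding, other_categories_change)
```

The last figure shows that the overall increase is driven by non-protein-coding categories, which grew by more than the headline difference because the protein-coding set shrank.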
In conclusion
GENCODE is now a GRCh38 genebuild
Compared to GRCh37 builds it is:
• More accurate
• More comprehensive
• More sophisticated
We recommend you use GENCODEv23 on GRCh38
Acknowledgements
Major funding:
GENCODE partners: Wellcome Trust Sanger Institute; European Bioinformatics
Institute; The University of Lausanne; The Centre de Regulació Genòmica; The
University of California, Santa Cruz; The Massachusetts Institute of Technology;
Yale University; The Spanish National Cancer Research Centre.

Grc ashg2015 workshop_mudge
