RefSeq curation and annotation of the
reference human genome GRCh38
Kim D. Pruitt
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
www.ncbi.nlm.nih.gov/refseq/
RefSeq Background
• RefSeq provides -
• Human genome annotation
• Known transcripts & proteins (manually curated)
• Model transcripts & proteins (annotation pipeline)
• Collaborations -
• Genome Reference Consortium (GRC)
• HUGO Gene Nomenclature Committee (HGNC)
• Consensus CDS (CCDS) Collaboration (HAVANA curators)
• RefSeqGene/Locus Reference Genomic (LRG)/LSDB
RefSeq: www.ncbi.nlm.nih.gov/refseq/ Gene: www.ncbi.nlm.nih.gov/gene/
An NCBI project to provide reference sequence
standards that incorporate current knowledge.
Archaea – Bacteria – Eukaryotes – Viruses
Curation support of genic regions of
the reference human assembly
• RefSeqGene and LRG collaboration
• Genomic and cDNA standards for clinical reporting
• Report potential issues to the GRC
• Consensus CDS collaboration
• Stabilized human CDS annotation
• Report potential issues to the GRC
• RefSeq
• Curation of genes, transcript & protein records
• Report potential issues to the GRC
• Review GRC patch updates for gene annotation impact
Genome annotation leverages
curation + computation
Genes:
• Type, location, length
Sequence:
• Accuracy, length
• Alternate splice products
• Functional annotation
Align curated RefSeqs
Align transcripts, proteins
Align RNA-Seq
Filter best alignments
Build model RefSeqs
Assign accessions, GeneID
Evidence-based genome
annotation pipeline
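The "Filter best alignments" step can be pictured with a small sketch. The slide does not say how alignments are ranked, so the `Alignment` record, its `identity` and `coverage` fields, the score, and both thresholds are hypothetical; the code only illustrates keeping one top-scoring placement per aligned sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    query: str       # hypothetical name of the aligned transcript or protein
    locus: str       # hypothetical label for the genomic placement
    identity: float  # fraction of matching bases, 0..1
    coverage: float  # fraction of the query that aligned, 0..1


def best_alignments(alignments, min_identity=0.95, min_coverage=0.90):
    """Keep the highest-scoring alignment per query that passes both cutoffs."""
    best = {}
    for aln in alignments:
        if aln.identity < min_identity or aln.coverage < min_coverage:
            continue  # too weak to support a model
        score = aln.identity * aln.coverage
        current = best.get(aln.query)
        if current is None or score > current.identity * current.coverage:
            best[aln.query] = aln
    return best


candidates = [
    Alignment("tx_A", "locus_1", 0.999, 1.00),
    Alignment("tx_A", "locus_2", 0.970, 0.95),  # passes, but scores lower
    Alignment("tx_B", "locus_3", 0.900, 1.00),  # below the identity cutoff
]
kept = best_alignments(candidates)
for query, aln in kept.items():
    print(query, "->", aln.locus)
```

A production pipeline would weigh many more signals (splice-site support, RNA-Seq coverage, curated placements); the product of identity and coverage is only a stand-in.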
Manual Curation
Sequence - Literature
                 Transcripts   Proteins
Known RefSeqs    50,540        39,363
Model RefSeqs    112,735       60,599

Annotated Genes  Count
Protein-coding   20,576
Non-coding       18,037
Pseudogene       12,474
Transition from GRCh37 to GRCh38
• Identify gene/sequence differences vs. GRCh38
• Automatic update at synonymous mismatches
• Curation review of remainder
• >5,100 Known RefSeq transcripts updated since October 2013
• 47,031 Known RefSeqs identical to genome
• 2,916 intentionally retain a mismatch or indel
• ~600 pending
• ~132 genes merged
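As a quick consistency check on these figures, the arithmetic below compares them with the 50,540 Known RefSeq transcripts reported on the annotation slide. It assumes both slides count the same set of records, which the talk does not state explicitly, and it treats "~600" as exactly 600.

```python
# Counts as given on the slides (Known RefSeq transcripts).
known_transcripts = 50_540    # from the annotation summary slide
identical_to_genome = 47_031
retain_difference = 2_916     # intentionally keep a mismatch or indel
pending_approx = 600          # the slide gives "~600"

resolved = identical_to_genome + retain_difference
accounted = resolved + pending_approx
fraction_identical = identical_to_genome / known_transcripts

print(f"resolved (identical or intentionally different): {resolved:,}")
print(f"resolved + pending: ~{accounted:,} of {known_transcripts:,}")
print(f"identical to GRCh38: {fraction_identical:.1%}")
```

The small gap between the sum and the total is within the rounding implied by "~600".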
[Figure: number of updates per quarter, 2013 Q1 through 2015 Q3; * = GRCh38, 12/24/2013]
Updating RefSeq to match GRCh38
• Post GRCh38 review:
• NM_173477 updated to match genome (NM_173477.4)
• Model RefSeq XM_005257026.1 promoted to Known RefSeq
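The accessions on this slide encode both record category and version. A minimal sketch of reading them, assuming the standard RefSeq prefix conventions (N-prefixed RNA and protein accessions for Known records, X-prefixed for Model records); the function name and return shape are illustrative only:

```python
import re

# Prefix meanings for the RNA and protein accession types used in this talk.
PREFIXES = {
    "NM": ("Known", "mRNA"),
    "NR": ("Known", "non-coding RNA"),
    "NP": ("Known", "protein"),
    "XM": ("Model", "mRNA"),
    "XR": ("Model", "non-coding RNA"),
    "XP": ("Model", "protein"),
}

ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def parse_accession(text):
    """Split an accession such as NM_173477.4 into its parts."""
    match = ACCESSION.match(text)
    if match is None or match.group(1) not in PREFIXES:
        raise ValueError(f"not a recognized RefSeq RNA/protein accession: {text!r}")
    prefix, number, version = match.groups()
    category, molecule = PREFIXES[prefix]
    return {
        "accession": f"{prefix}_{number}",
        "version": int(version) if version else None,
        "category": category,
        "molecule": molecule,
    }


for acc in ("NM_173477.4", "XM_005257026.1"):
    print(parse_accession(acc))
```

A sequence update increments the version suffix, as with NM_173477.4, while promoting a Model record to Known means assigning a new N-prefixed accession.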
[Figure: alignments to GRCh38 and GRCh37]
RefSeq curation & genome
maintenance
[Figure: GRCh38 and GRCh37 views; label: gap]
GRCh37 issue:
• SCX duplication
• MROH1 split
GRCh38 update:
• Gap closed
• MROH1 complete
• One SCX gene
RefSeq curation & genome
maintenance
• POLR2A (GeneID:5430) NM_000937.4 has a 2 nt deletion vs. GRCh38
• This maintains the correct reading frame
[Figure: alignment to GRCh38]
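Why a 2 nt difference matters for the reading frame can be shown with a toy example. The sequences below are invented and are not POLR2A; the sketch only demonstrates that an indel whose length is not a multiple of three changes every downstream codon, which is why a curated transcript may deliberately differ from the assembly.

```python
def shifts_frame(indel_length):
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


def codons(seq):
    """Split a coding sequence into complete codons, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Toy CDS standing in for the curated transcript (hypothetical sequence).
transcript_cds = "ATGGCTGAAACCTAA"
# Toy genomic version carrying 2 extra nucleotides relative to the transcript.
genome_cds = transcript_cds[:6] + "GG" + transcript_cds[6:]

print(codons(transcript_cds))  # frame kept, ends in the stop codon TAA
print(codons(genome_cds))      # downstream codons differ, stop codon is lost
print(shifts_frame(2), shifts_frame(3))
```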
RefSeq curation & genome
maintenance
• RefSeq reported this sequence issue to the GRC
GRCh38 ALT LOCI and PATCHES
Pre-Patch & ALT review
Polymorphic pseudogenes
Haplotype & CNV variation
ALT-specific RefSeq records
Curator-stored placement data
Evidence-based genome
annotation pipeline
Manual Curation
Assembly-ALT alignments
Alignment quality reports
Subsequent genome
annotation build corrects
the annotation
Interim alignment updates
Polymorphic pseudogenes
• RefSeq provides different transcripts to represent the protein-coding gene versus the pseudogene
• Curators store assembly placement information (chromosome versus ALT) in a local database
• This is used by the annotation pipeline to ensure correct annotation
An example – GSTT cluster on chromosome 22:
Assembly Unit    GSTT1    GSTT2    GSTT2B   GSTTP1   GSTTP2
GRCh38 chr22     null     pseudo   coding   pseudo   null
ALT_REF_LOCI_1   coding   coding   coding   pseudo   pseudo
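The curator-stored placement data for the GSTT cluster can be sketched as a lookup keyed by assembly unit and gene. The values are taken from the slide; the dictionary layout, the function names, and the reading of "null" as "no feature annotated on that unit" are assumptions, not a description of the actual NCBI database.

```python
# Placement status per assembly unit, as given on the slide.
PLACEMENT = {
    "GRCh38 chr22": {
        "GSTT1": "null", "GSTT2": "pseudo", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "null",
    },
    "ALT_REF_LOCI_1": {
        "GSTT1": "coding", "GSTT2": "coding", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "pseudo",
    },
}

# Assumed mapping from stored status to the kind of feature to annotate.
FEATURE_FOR_STATUS = {
    "coding": "protein-coding gene",
    "pseudo": "pseudogene",
    "null": None,  # assumption: nothing is annotated on this unit
}


def feature_type(assembly_unit, gene):
    """Return the feature type to annotate for a gene on one assembly unit."""
    return FEATURE_FOR_STATUS[PLACEMENT[assembly_unit][gene]]


def genes_that_differ(unit_a, unit_b):
    """List genes whose status differs between two assembly units."""
    return sorted(
        gene for gene in PLACEMENT[unit_a]
        if PLACEMENT[unit_a][gene] != PLACEMENT[unit_b][gene]
    )


print(feature_type("GRCh38 chr22", "GSTT2"))
print(feature_type("ALT_REF_LOCI_1", "GSTT2"))
print(genes_that_differ("GRCh38 chr22", "ALT_REF_LOCI_1"))
```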
GSTT* variation, chromosome 22
• Copy number variation of glutathione-S-transferase theta genes is associated with digestive tract cancers and more
• Accurate gene annotation is important to downstream users
[Figure: GRCh38 chr22 vs. GRCh38 ALT; labels: pseudogene, chr22 = null allele, coding allele]
ulcerative colitis – laryngeal cancer – esophageal cancer – colorectal cancer
GSTT2 polymorphism
GRCh38 chr22: AT splice donor, premature stop codon
GRCh38 ALT: GT splice donor, stop codon
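The splice-donor difference can be illustrated with a toy check. Most introns begin with the dinucleotide GT, so an AT at that position is one signal that an allele may not splice normally. The sequences, coordinates, and function names below are hypothetical and are not the real GSTT2 sequence.

```python
CANONICAL_DONOR = "GT"  # the most common intron-start dinucleotide


def donor_site(pre_mrna, exon_end):
    """Return the first two intron bases; exon_end is the 0-based index after the exon."""
    return pre_mrna[exon_end:exon_end + 2].upper()


def is_canonical_donor(dinucleotide):
    return dinucleotide == CANONICAL_DONOR


exon = "ATGCAG"                    # toy exon sequence
alt_allele = exon + "GTAAGTCC"     # toy stand-in for the ALT (coding) allele
chr22_allele = exon + "ATAAGTCC"   # toy stand-in for the chr22 (pseudogene) allele

for name, seq in (("ALT", alt_allele), ("chr22", chr22_allele)):
    site = donor_site(seq, len(exon))
    print(name, site, "canonical" if is_canonical_donor(site) else "non-canonical")
```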
GRCh38 chr22 GSTT2 pseudogene
GRCh38 chr22
Data access
• Genes:
• <…ncbi root url…>/gene/
• ftp://ftp.ncbi.nlm.nih.gov/gene/
• NCBI YouTube ‘Download genomic sequence for a gene’
• https://www.youtube.com/watch?v=RHz2nZbzjpA
• RefSeq transcripts and proteins:
• Links from NCBI Gene
• Nucleotide/protein query:
• human[organism] + use facets to specify RefSeq and molecule type
• ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
• NCBI Genome Annotation
• Links from NCBI Assembly or Genome resources
• <ncbi>/assembly/ or <ncbi>/genome/
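The Nucleotide/protein query above is described in terms of web-page facets. For scripted access, a comparable search can be expressed through NCBI E-utilities, a different route from the one on the slide. This sketch only builds the request URL and sends nothing; the filter terms `refseq[filter]` and `biomol_mrna[prop]` are assumed to be the Entrez equivalents of the facets, and the function name is illustrative.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_search_url(db="nuccore", molecule_filter="biomol_mrna[prop]", retmax=20):
    """Build an ESearch URL for human RefSeq records of one molecule type."""
    term = f"human[organism] AND refseq[filter] AND {molecule_filter}"
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = refseq_search_url()
print(url)
```

Fetching the URL with any HTTP client returns matching record identifiers; NCBI's usage guidelines on request rates apply.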
Data access to annotated genome
Gene
Assembly details
Genome FTP formats
• FASTA
• genome, transcripts, proteins
• GenBank file format
• genome, transcripts, proteins
• GFF genome annotation
• Feature table
• features and locations in tabular format
• AGP, Assembly details & statistics
• RepeatMasker results
• MD5 checksums
• Documentation
• README files
• <ncbi>/genome/doc/ftpfaq/
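The MD5 checksums are there so that downloads can be verified. A minimal sketch of that check, assuming the checksum file uses the common `md5sum` layout (one "checksum, whitespace, file name" pair per line, with an optional leading "./" on the name); the function names are illustrative:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Map file names to expected digests from md5sum-style lines."""
    expected = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        checksum, name = line.split(maxsplit=1)
        name = name.strip()
        if name.startswith("./"):
            name = name[2:]
        expected[name] = checksum.lower()
    return expected


def verify(path, name, checksum_text):
    """Return True when the local file matches its listed checksum."""
    return parse_checksums(checksum_text).get(name) == md5_of_file(path)
```

Usage: after downloading a file and the checksum list from the same FTP directory, call `verify(local_path, remote_file_name, checksum_list_text)` and re-download on `False`.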
Acknowledgements
RefSeq Curators
Annotation pipeline
Paul Kitts
Terence Murphy
Francoise Thibaud-Nissen
Eric Cox
Catherine Farrell
Tamara Goldfarb
Tripti Gupta
Vinita Joardar
Vamsi Kodali
Kelly McGarvey
Mike Murphy
Nuala O'Leary
Shashi Pujar
Bhanu Rajput
Sanjida Rangwala
Lillian Riddick
Dave Webb
Matt Wright
Susan Hiatt
www.ncbi.nlm.nih.gov/refseq/
Collaborators
Elspeth Bruford (HGNC)
Jen Harrow (HAVANA)
Locus-Specific Databases
Expert databases
Individual scientists
NCBI Posters & Booth 2405

Ashg2015 grc-pruitt
