
Ashg2015 grc-pruitt


  1. RefSeq curation and annotation of the reference human genome GRCh38
     Kim D. Pruitt
     National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
     www.ncbi.nlm.nih.gov/refseq/
  2. RefSeq Background
     An NCBI project to provide reference sequence standards that incorporate current knowledge. Archaea, Bacteria, Eukaryotes, Virus
     • RefSeq provides:
       • Human genome annotation
       • Known transcripts & proteins (manually curated)
       • Model transcripts & proteins (annotation pipeline)
     • Collaborations:
       • Genome Reference Consortium (GRC)
       • HUGO Gene Nomenclature Committee (HGNC)
       • Consensus CDS (CCDS) Collaboration (HAVANA curators)
       • RefSeqGene/Locus Reference Genomic (LRG)/LSDB
     RefSeq: www.ncbi.nlm.nih.gov/refseq/
     Gene: www.ncbi.nlm.nih.gov/gene/
  3. Curation support of genic regions of the reference human assembly
     • RefSeqGene and LRG collaboration
       • Genomic and cDNA standards for clinical reporting
       • Report potential issues to the GRC
     • Consensus CDS collaboration
       • Stabilized human CDS annotation
       • Report potential issues to the GRC
     • RefSeq
       • Curation of genes, transcript & protein records
       • Report potential issues to the GRC
       • Review GRC patch updates for gene annotation impact
  4. Genome annotation leverages curation + computation
     Manual Curation (Sequence - Literature)
     • Genes: type, location, length
     • Sequence: accuracy, length
     • Alternate splice products
     • Functional annotation
     Evidence-based genome annotation pipeline
     • Align curated RefSeqs
     • Align transcripts, proteins
     • Align RNA-Seq
     • Filter best alignments
     • Build model RefSeqs
     • Assign accessions, GeneID
     RefSeq records     Transcripts   Proteins
     Known RefSeqs           50,540     39,363
     Model RefSeqs          112,735     60,599
     Annotated Genes      Count
     Protein-coding      20,576
     Non-coding          18,037
     Pseudogene          12,474
  5. Transition from GRCh37 to GRCh38
     • Identify gene/sequence differences vs. GRCh38
       • Automatic update at synonymous mismatches
       • Curation review of remainder
     • >5,100 Known RefSeq transcripts updated since October 2013
     • 47,031 Known RefSeqs identical to genome
     • 2,916 intentionally retain a mismatch or indel
     • ~600 pending
     • ~132 genes merged
     [Chart: number of updates per quarter, 2013 Q1 to 2015 Q3, y-axis 0 to 1200; * marks GRCh38, 12/24/2013]
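The split on slide 5 between automatic updates at synonymous mismatches and curator review of everything else can be illustrated with a small codon comparison. This is only a sketch under simplifying assumptions: the slides do not describe RefSeq's actual implementation, the function name `classify_cds_mismatches` and the label `needs_review` are hypothetical, and the code handles only equal-length, in-frame coding sequences (no indels).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_cds_mismatches(transcript_cds, genome_cds):
    """Compare two CDS strings codon by codon (hypothetical helper).

    Returns a list of (codon_number, transcript_codon, genome_codon, label),
    where label is "synonymous" or "needs_review".
    """
    if len(transcript_cds) != len(genome_cds) or len(genome_cds) % 3:
        raise ValueError("sketch handles only equal-length, in-frame CDS")
    results = []
    for i in range(0, len(genome_cds), 3):
        t = transcript_cds[i:i + 3].upper()
        g = genome_cds[i:i + 3].upper()
        if t == g:
            continue
        # Codons with ambiguous bases are not in the table and go to review.
        same_aa = t in CODON_TABLE and CODON_TABLE[t] == CODON_TABLE.get(g)
        results.append((i // 3 + 1, t, g, "synonymous" if same_aa else "needs_review"))
    return results
```

For example, `classify_cds_mismatches("ATGGCTTAA", "ATGGCCTAA")` flags codon 2 (GCT vs. GCC, both alanine) as synonymous, while a GAT vs. GAA difference changes the amino acid and is labeled for review.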
  6. Updating RefSeq to match GRCh38
     • Post GRCh38 review:
       • NM_173477 updated to match genome (NM_173477.4)
       • Model RefSeq XM_005257026.1 promoted to Known RefSeq
     [Figure: GRCh38 alignment and GRCh37 alignment]
  7. RefSeq curation & genome maintenance
     • GRCh37 issue: SCX duplication, MROH1 split
     • GRCh38 update: gap closed, MROH1 complete, one SCX gene
     [Figure labels: GRCh38, GRCh37, gap]
  8. RefSeq curation & genome maintenance
     • POLR2A (GeneID:5430) NM_000937.4 has a 2 nt deletion vs. GRCh38
     • This maintains the correct reading frame
     [Figure: GRCh38 alignment]
  9. RefSeq curation & genome maintenance
     • RefSeq reported this sequence issue to the GRC
  10. GRCh38 ALT LOCI and PATCHES
      • Pre-Patch & ALT review
      • Polymorphic pseudogenes
      • Haplotype & CNV variation
      • ALT-specific RefSeq records
      • Curator-stored placement data
      • Assembly-ALT alignments
      • Alignment quality reports
      • Subsequent genome annotation build corrects the annotation
      • Interim alignment updates
      [Diagram labels: Manual Curation; Evidence-based genome annotation pipeline]
  11. Polymorphic pseudogenes
      • RefSeq provides different transcripts to represent the protein-coding gene versus the pseudogene
      • Curators store assembly placement information (chromosome versus ALT) in a local database
      • This is used by the annotation pipeline to ensure correct annotation
      An example, the GSTT cluster on chromosome 22:
      Assembly Unit    GSTT1   GSTT2   GSTT2B  GSTTP1  GSTTP2
      GRCh38 chr22     null    pseudo  coding  pseudo  null
      ALT_REF_LOCI_1   coding  coding  coding  pseudo  pseudo
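The curator-stored placement data on slide 11 amounts to a lookup keyed by assembly unit and gene. A minimal sketch of such a structure follows; the values are copied from the slide's GSTT table, but the dictionary layout and the function names are hypothetical, since the slides say only that the data sits in a local database.

```python
# Placement values copied from the GSTT table on slide 11.
# The layout and the helper names below are hypothetical.
GSTT_PLACEMENT = {
    "GRCh38 chr22": {
        "GSTT1": "null", "GSTT2": "pseudo", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "null",
    },
    "ALT_REF_LOCI_1": {
        "GSTT1": "coding", "GSTT2": "coding", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "pseudo",
    },
}


def expected_annotation(assembly_unit, gene, table=GSTT_PLACEMENT):
    """Return the status a gene should be annotated with on an assembly unit."""
    return table[assembly_unit][gene]


def polymorphic_genes(table=GSTT_PLACEMENT):
    """List genes whose status differs between assembly units."""
    genes = next(iter(table.values())).keys()
    return sorted(
        gene for gene in genes
        if len({unit[gene] for unit in table.values()}) > 1
    )
```

Here `expected_annotation("GRCh38 chr22", "GSTT2")` returns `"pseudo"`, and `polymorphic_genes()` picks out the genes whose representation changes between the chromosome and the ALT locus.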
  12. GSTT* variation, chromosome 22
      • Copy number variation of glutathione-S-transferase theta genes is associated with digestive tract cancers and more
      • Accurate gene annotation is important to downstream users
      ulcerative colitis - laryngeal cancer - esophageal cancer - colorectal cancer
      [Figure: GRCh38 chr22 and GRCh38 ALT; labels: pseudogene, chr22 = null allele, coding allele]
  13. GSTT2 polymorphism
      • GRCh38 chr22: AT splice donor, premature stop codon
      • GRCh38 ALT: GT splice donor, stop codon
  14. GRCh38 chr22 GSTT2 pseudogene
      [Figure: GRCh38 chr22]
  15. Data access
      • Genes:
        • <…ncbi root url…>/gene/
        • ftp://ftp.ncbi.nlm.nih.gov/gene/
        • NCBI YouTube ‘Download genomic sequence for a gene’
          • https://www.youtube.com/watch?v=RHz2nZbzjpA
      • RefSeq transcripts and proteins:
        • Links from NCBI Gene
        • Nucleotide/protein query:
          • human[organism] + use facets to specify RefSeq and molecule type
        • ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
      • NCBI Genome Annotation
        • Links from NCBI Assembly or Genome resources
        • <ncbi>/assembly/ or <ncbi>/genome/
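The nucleotide query on slide 15 is described for the web interface, where facets narrow the results to RefSeq and a molecule type. A scripted equivalent could go through NCBI's E-utilities ESearch endpoint, which is not mentioned in the slides. The sketch below only builds the request URL. The filter terms `srcdb_refseq[PROP]` and `biomol_mrna[PROP]` are assumed stand-ins for the web facets, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_mrna_query(organism="human", retmax=20):
    """Build an ESearch URL for RefSeq mRNA records (hypothetical helper).

    The two [PROP] filters are assumptions standing in for the web facets
    "RefSeq" and "mRNA"; adjust them for other molecule types.
    """
    term = f"{organism}[organism] AND srcdb_refseq[PROP] AND biomol_mrna[PROP]"
    params = {"db": "nuccore", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL; fetching it (e.g. with urllib.request) is left to the reader.
    print(build_refseq_mrna_query())
```

Keeping URL construction separate from the network call makes the query easy to inspect before sending it, and leaves room to add an API key or rate limiting as NCBI's usage guidelines require.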
  16. Data access to annotated genome
      [Screenshot labels: Gene; Assembly details]
  17. Genome FTP formats
      • FASTA
        • genome, transcripts, proteins
      • GenBank file format
        • genome, transcripts, proteins
      • GFF genome annotation
      • Feature table
        • features and locations in tabular format
      • AGP, Assembly details & statistics
      • RepeatMasker results
      • Md5checksums
      • Documentation
        • README files
        • <ncbi>/genome/doc/ftpfaq/
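Of the formats listed on slide 17, GFF is the one most often parsed by hand. Assuming the files follow the GFF3 convention of nine tab-separated columns with `key=value` attributes, a minimal reader for one line might look like the sketch below. The function name and the sample line in the usage note are made up for illustration, and for real work a maintained GFF library is the safer choice.

```python
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (hypothetical helper).

    Returns None for blank lines and for comment or directive lines ("#...").
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # GFF3 coordinates are 1-based and inclusive.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record
```

For a made-up line such as `"seq1\texample\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=demo%20gene"`, the parser returns integer `start` and `end` values and an `attributes` dict in which `Name` decodes to `demo gene`.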
  18. Acknowledgements
      RefSeq Curators and Annotation pipeline: Paul Kitts, Terence Murphy, Francoise Thibaud-Nissen, Eric Cox, Catherine Farrell, Tamara Goldfarb, Tripti Gupta, Vinita Joardar, Vamsi Kodali, Kelly McGarvey, Mike Murphy, Nuala O'Leary, Shashi Pujar, Bhanu Rajput, Sanjida Rangwala, Lillian Riddick, Dave Webb, Matt Wright, Susan Hiatt
      Collaborators: Elspeth Bruford (HGNC), Jen Harrow (HAVANA), Locus-Specific Databases, Expert databases, Individual scientists
      www.ncbi.nlm.nih.gov/refseq/
  19. NCBI Posters & Booth 2405
