RefSeq curation and annotation of the
reference human genome GRCh38
Kim D. Pruitt
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
www.ncbi.nlm.nih.gov/refseq/
RefSeq Background
• RefSeq provides -
• Human genome annotation
• Known transcripts & proteins (manually curated)
• Model transcripts & proteins (annotation pipeline)
• Collaborations -
• Genome Reference Consortium (GRC)
• HUGO Gene Nomenclature Committee (HGNC)
• Consensus CDS (CCDS) Collaboration (HAVANA curators)
• RefSeqGene/Locus Reference Genomic (LRG)/LSDB
RefSeq: www.ncbi.nlm.nih.gov/refseq/ Gene: www.ncbi.nlm.nih.gov/gene/
An NCBI project to provide reference sequence
standards that incorporate current knowledge.
Archaea – Bacteria – Eukaryotes - Virus
Curation support of genic regions of
the reference human assembly
• RefSeqGene and LRG collaboration
• Genomic and cDNA standards for clinical reporting
• Report potential issues to the GRC
• Consensus CDS collaboration
• Stabilized human CDS annotation
• Report potential issues to the GRC
• RefSeq
• Curation of genes, transcript & protein records
• Report potential issues to the GRC
• Review GRC patch updates for gene annotation impact
Genome annotation leverages
curation + computation
Manual Curation (Sequence - Literature)
Genes:
• Type, location, length
Sequence:
• Accuracy, length
• Alternate splice products
• Functional annotation
Evidence-based genome annotation pipeline
• Align curated RefSeqs
• Align transcripts, proteins
• Align RNA-Seq
• Filter best alignments
• Build model RefSeqs
• Assign accessions, GeneID
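The "Filter best alignments" step can be illustrated with a toy sketch. The deck does not say how the pipeline scores or filters alignments, so the `Alignment` fields, the thresholds, and the identity-times-coverage score below are all assumptions made for illustration, not NCBI's actual method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    query: str       # identifier of the aligned transcript (hypothetical)
    locus: str       # genomic placement label (hypothetical)
    identity: float  # fraction of matching bases, 0..1
    coverage: float  # fraction of the query that aligned, 0..1


def filter_best(alignments, min_identity=0.95, min_coverage=0.90):
    """Keep the top-scoring alignment per query that passes both thresholds.

    Thresholds and the scoring rule are illustrative assumptions.
    """
    best = {}
    for aln in alignments:
        if aln.identity < min_identity or aln.coverage < min_coverage:
            continue  # drop weak alignments outright
        score = aln.identity * aln.coverage
        current = best.get(aln.query)
        if current is None or score > current.identity * current.coverage:
            best[aln.query] = aln
    return best


candidates = [
    Alignment("TX_A", "chr1", 0.99, 1.00),
    Alignment("TX_A", "chr5", 0.97, 0.95),
    Alignment("TX_B", "chr2", 0.80, 1.00),  # fails the identity threshold
]
for query, aln in filter_best(candidates).items():
    print(query, aln.locus)
```

Only the best placement per transcript survives, which is the input a later model-building step would need.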
                Transcripts   Proteins
Known RefSeqs        50,540     39,363
Model RefSeqs       112,735     60,599

Annotated Genes    Count
Protein-coding    20,576
Non-coding        18,037
Pseudogene        12,474
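Gene counts by type, like those above, can in principle be tallied from a GFF3 annotation file. The sketch below assumes that gene-level rows carry a `gene_biotype` attribute in column 9. The deck does not state how its counts were produced, and the sample rows are invented for illustration.

```python
from collections import Counter


def count_gene_biotypes(gff_lines):
    """Tally gene-level features by their gene_biotype attribute.

    Assumes GFF3: nine tab-separated columns, attributes as key=value
    pairs separated by ';' in the last column.
    """
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and directives/comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] not in ("gene", "pseudogene"):
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        counts[attrs.get("gene_biotype", "unknown")] += 1
    return counts


# Invented example rows (not real annotation records).
sample = [
    "##gff-version 3",
    "seq1\tExample\tgene\t100\t900\t.\t+\t.\tID=gene0;gene_biotype=protein_coding",
    "seq1\tExample\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0",
    "seq1\tExample\tpseudogene\t1500\t1900\t.\t-\t.\tID=gene1;gene_biotype=pseudogene",
]
print(count_gene_biotypes(sample))
```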
Transition from GRCh37 to GRCh38
• Identify gene/sequence differences vs. GRCh38
• Automatic update at synonymous mismatches
• Curation review of remainder
• >5,100 Known RefSeq transcripts updated since October 2013
• 47,031 Known RefSeqs identical to genome
• 2,916 intentionally retain a mismatch or indel
• ~600 pending
• ~132 genes merged
[Figure: number of updates per quarter, 2013 Q1 to 2015 Q3 (axis 0 to 1,200); * marks GRCh38, 12/24/2013]
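As a quick consistency check, the three status categories can be summed and compared with the Known RefSeq transcript total reported earlier in the deck. The deck does not say that these categories partition the Known set, so that is an assumption here, and "~600" is approximate.

```python
identical_to_genome = 47_031   # Known RefSeqs identical to the genome
intentional_mismatch = 2_916   # intentionally retain a mismatch or indel
pending_review = 600           # "~600 pending" (approximate)

accounted_for = identical_to_genome + intentional_mismatch + pending_review
known_transcripts = 50_540     # Known RefSeq transcripts from the counts table

print(accounted_for)                      # 50547
print(accounted_for - known_transcripts)  # 7
```

The small difference is within the rounding implied by "~600".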
Updating RefSeq to match GRCh38
• Post GRCh38 review:
• NM_173477 updated to match genome (NM_173477.4)
• Model RefSeq XM_005257026.1 promoted to Known RefSeq
[Figure: transcript alignments to GRCh38 and GRCh37]
RefSeq curation & genome maintenance
[Figure: GRCh38 and GRCh37 views of the region; a gap is labeled]
GRCh37 issue:
• SCX duplication
• MROH1 split
GRCh38 update:
• Gap closed
• MROH1 complete
• One SCX gene
RefSeq curation & genome
maintenance
• POLR2A (GeneID:5430) NM_000937.4 has a 2 nt deletion vs. GRCh38
• This maintains the correct reading frame
[Figure: alignment to GRCh38]
RefSeq curation & genome
maintenance
• RefSeq reported this sequence issue to the GRC
GRCh38 ALT LOCI and PATCHES
Pre-Patch & ALT review
Polymorphic pseudogenes
Haplotype & CNV variation
ALT-specific RefSeq records
Curator-stored placement data
Evidence-based genome
annotation pipeline
Manual Curation
Assembly-ALT alignments
Alignment quality reports
Subsequent genome annotation build corrects the annotation
Interim alignment updates
Polymorphic pseudogenes
• RefSeq provides different transcripts to represent the protein-coding gene versus the pseudogene
• Curators store assembly placement information (chromosome versus ALT) in a local database
• The annotation pipeline uses this information to ensure correct annotation
An example: GSTT cluster on chromosome 22
Assembly Unit     GSTT1    GSTT2    GSTT2B   GSTTP1   GSTTP2
GRCh38 chr22      null     pseudo   coding   pseudo   null
ALT_REF_LOCI_1    coding   coding   coding   pseudo   pseudo
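The curator-stored placement data for the GSTT cluster can be pictured as a simple lookup keyed by assembly unit and gene. This is a minimal sketch that holds the table's values in a dictionary. The deck does not describe the real database schema, so the structure and function names are hypothetical.

```python
# Values copied from the GSTT table; the dict layout itself is an assumption.
PLACEMENT = {
    "GRCh38 chr22": {
        "GSTT1": "null", "GSTT2": "pseudo", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "null",
    },
    "ALT_REF_LOCI_1": {
        "GSTT1": "coding", "GSTT2": "coding", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "pseudo",
    },
}


def expected_status(assembly_unit, gene):
    """Return the stored status for a gene on a given assembly unit."""
    return PLACEMENT[assembly_unit][gene]


def genes_differing_between_units(unit_a, unit_b):
    """List genes whose stored status differs between two assembly units."""
    a, b = PLACEMENT[unit_a], PLACEMENT[unit_b]
    return [gene for gene in a if a[gene] != b[gene]]


print(expected_status("GRCh38 chr22", "GSTT2"))
print(genes_differing_between_units("GRCh38 chr22", "ALT_REF_LOCI_1"))
```

Three of the five genes differ between the chromosome and the ALT locus, which is why a per-unit lookup is needed rather than a single status per gene.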
GSTT* variation, chromosome 22
• Copy number variation of glutathione-S-transferase theta genes is associated with digestive tract cancers and more
• Accurate gene annotation is important to downstream users
[Figure: GSTT cluster on GRCh38 chr22 (null allele) and GRCh38 ALT (coding allele); pseudogene labeled]
ulcerative colitis - laryngeal cancer - esophageal cancer - colorectal cancer
GSTT2 polymorphism
GRCh38 chr22 (GSTT2 pseudogene): AT splice donor, premature stop codon
GRCh38 ALT: GT splice donor, stop codon
Data access
• Genes:
• <…ncbi root url…>/gene/
• ftp://ftp.ncbi.nlm.nih.gov/gene/
• NCBI YouTube ‘Download genomic sequence for a gene’
• https://www.youtube.com/watch?v=RHz2nZbzjpA
• RefSeq transcripts and proteins:
• Links from NCBI Gene
• Nucleotide/protein query:
• human[organism] + use facets to specify RefSeq and molecule type
• ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
• NCBI Genome Annotation
• Links from NCBI Assembly or Genome resources
• <ncbi>/assembly/ or <ncbi>/genome/
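The nucleotide/protein query above is described for the web interface. A roughly equivalent programmatic search can be built against the NCBI E-utilities `esearch` endpoint. The deck does not mention E-utilities, so the endpoint and the `refseq[filter]` term (standing in for the RefSeq facet) are assumptions added here. The sketch only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_search_url(database="nuccore", retmax=20):
    """Build an esearch URL for human RefSeq records.

    database: "nuccore" for nucleotide or "protein" for protein records.
    The refseq[filter] term is an assumed stand-in for the RefSeq facet.
    """
    term = "human[organism] AND refseq[filter]"
    params = {
        "db": database,
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(build_refseq_search_url())
print(build_refseq_search_url(database="protein", retmax=5))
```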
Data access to annotated genome
Gene
Assembly details
Genome FTP formats
• FASTA
• genome, transcripts, proteins
• GenBank file format
• genome, transcripts, proteins
• GFF genome annotation
• Feature table
• features and locations in tabular format
• AGP, Assembly details & statistics
• RepeatMasker results
• MD5 checksums
• Documentation
• README files
• <ncbi>/genome/doc/ftpfaq/
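The MD5 checksums listed among the FTP formats can be used to confirm that a download is intact. The sketch below assumes the checksum listing has one `<md5>  <path>` pair per line. The deck does not specify that layout, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_md5_listing(text):
    """Map file name -> expected MD5 from lines of the form '<md5>  <path>'."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, path = parts
            expected[Path(path.strip()).name] = digest.lower()
    return expected


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, listing_text):
    """Return True when the file's MD5 matches its entry in the listing."""
    expected = parse_md5_listing(listing_text)
    return expected.get(Path(path).name) == md5_of_file(path)
```

A file that is missing from the listing is reported as unverified (`False`) rather than raising an error. That is a design choice for this sketch.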
Acknowledgements
RefSeq Curators
Annotation pipeline
Paul Kitts
Terence Murphy
Francoise Thibaud-Nissen
Eric Cox
Catherine Farrell
Tamara Goldfarb
Tripti Gupta
Vinita Joardar
Vamsi Kodali
Kelly McGarvey
Mike Murphy
Nuala O'Leary
Shashi Pujar
Bhanu Rajput
Sanjida Rangwala
Lillian Riddick
Dave Webb
Matt Wright
Susan Hiatt
www.ncbi.nlm.nih.gov/refseq/
Collaborators
Elspeth Bruford (HGNC)
Jen Harrow (HAVANA)
Locus-Specific Databases
Expert databases
Individual scientists
NCBI Posters & Booth 2405

Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
 
Advances in AI-driven Image Recognition for Early Detection of Cancer
Advances in AI-driven Image Recognition for Early Detection of CancerAdvances in AI-driven Image Recognition for Early Detection of Cancer
Advances in AI-driven Image Recognition for Early Detection of Cancer
 
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
 
Introduction Classification Of Alkaloids
Introduction Classification Of AlkaloidsIntroduction Classification Of Alkaloids
Introduction Classification Of Alkaloids
 
Production technology of Brinjal -Solanum melongena
Production technology of Brinjal -Solanum melongenaProduction technology of Brinjal -Solanum melongena
Production technology of Brinjal -Solanum melongena
 
DETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptxDETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptx
 
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsTimeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
 
Remote patient monitoring :Health care transformation
Remote patient monitoring :Health care transformationRemote patient monitoring :Health care transformation
Remote patient monitoring :Health care transformation
 
Think Science: What Are Eclipses, by Craig Bobchin
Think Science: What Are Eclipses, by Craig BobchinThink Science: What Are Eclipses, by Craig Bobchin
Think Science: What Are Eclipses, by Craig Bobchin
 
Bioenergetics and the role of ATP to drive the beats of life.
Bioenergetics and the role of ATP to drive the beats of life.Bioenergetics and the role of ATP to drive the beats of life.
Bioenergetics and the role of ATP to drive the beats of life.
 
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasBACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 
Presentation about adversarial image attacks
Presentation about adversarial image attacksPresentation about adversarial image attacks
Presentation about adversarial image attacks
 

Ashg2015 grc-pruitt

  • 1. RefSeq curation and annotation of the reference human genome GRCh38 Kim D. Pruitt National Center for Biotechnology Information National Library of Medicine National Institutes of Health www.ncbi.nlm.nih.gov/refseq/
  • 2. RefSeq Background • RefSeq provides - • Human genome annotation • Known transcripts & proteins (manually curated) • Model transcripts & proteins (annotation pipeline) • Collaborations - • Genome Reference Consortium (GRC) • HUGO Gene Nomenclature Committee (HGNC) • Consensus CDS (CCDS) Collaboration (HAVANA curators) • RefSeqGene/Locus Reference Genomic (LRG)/LSDB RefSeq: www.ncbi.nlm.nih.gov/refseq/ Gene: www.ncbi.nlm.nih.gov/gene/ An NCBI project to provide reference sequence standards that incorporate current knowledge. Archaea – Bacteria – Eukaryotes - Virus
  • 3. Curation support of genic regions of the reference human assembly • RefSeqGene and LRG collaboration • Genomic and cDNA standards for clinical reporting • Report potential issues to the GRC • Consensus CDS collaboration • Stabilized human CDS annotation • Report potential issues to the GRC • RefSeq • Curation of genes, transcript & protein records • Report potential issues to the GRC • Review GRC patch updates for gene annotation impact
  • 4. Genome annotation leverages curation + computation
      Genes:
        • Type, location, length
      Sequence:
        • Accuracy, length
        • Alternate splice products
        • Functional annotation
      Evidence-based genome annotation pipeline:
        • Align curated RefSeqs
        • Align transcripts, proteins
        • Align RNA-Seq
        • Filter best alignments
        • Build model RefSeqs
        • Assign accessions, GeneID
      Manual Curation: Sequence - Literature

                        Transcripts   Proteins
      Known RefSeqs          50,540     39,363
      Model RefSeqs         112,735     60,599

      Annotated Genes     Count
      Protein-coding     20,576
      Non-coding         18,037
      Pseudogene         12,474
  • 5. Transition from GRCh37 to GRCh38
      • Identify gene/sequence differences vs. GRCh38
      • Automatic update at synonymous mismatches
      • Curation review of remainder
      • >5,100 Known RefSeq transcripts updated since October 2013
      • 47,031 Known RefSeqs identical to genome
      • 2,916 intentionally retain a mismatch or indel
      • ~600 pending
      • ~132 genes merged
      [Chart: number of updates per quarter, 2013 Q1 to 2015 Q3, y-axis 0 to 1,200; * marks GRCh38, 12/24/2013]
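The status counts on this slide can be cross-checked with simple arithmetic. The sketch below assumes the three categories (identical, intentional mismatch or indel, pending) partition the Known RefSeq transcript set, and it treats the approximate "~600" as exactly 600; the variable and function names are illustrative only.

```python
# Known RefSeq status counts quoted on the slide.
# "~600 pending" is approximate; 600 is used here as a stand-in value.
status_counts = {
    "identical to genome": 47_031,
    "intentional mismatch or indel": 2_916,
    "pending": 600,
}

total = sum(status_counts.values())


def share(category: str) -> float:
    """Percentage of the tallied Known RefSeqs that fall in one category."""
    return 100.0 * status_counts[category] / total


for name, count in status_counts.items():
    print(f"{name:32s} {count:>7,d}  {share(name):5.1f}%")
print(f"{'total':32s} {total:>7,d}")
```

Under these assumptions the tally comes to 50,547, close to the 50,540 Known RefSeq transcripts reported on the annotation slide, which is what one would expect if the categories are roughly exhaustive.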
  • 6. Updating RefSeq to match GRCh38
      • Post GRCh38 review:
      • NM_173477 updated to match genome (NM_173477.4)
      • Model RefSeq XM_005257026.1 promoted to Known RefSeq
      [Figure: alignments to GRCh38 and GRCh37]
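The two accessions on this slide follow the RefSeq `PREFIX_number.version` pattern, in which NM_ marks a curated (Known) mRNA and XM_ a pipeline-predicted (Model) mRNA. A small sketch of splitting an accession and classifying it follows; `parse_accession` and the prefix table are illustrative, cover only the common transcript and protein prefixes, and are not an NCBI API.

```python
import re
from typing import NamedTuple, Optional

# Common RefSeq prefixes -> (category, molecule). Deliberately not exhaustive.
PREFIXES = {
    "NM": ("Known", "mRNA"),
    "NR": ("Known", "non-coding RNA"),
    "NP": ("Known", "protein"),
    "XM": ("Model", "mRNA"),
    "XR": ("Model", "non-coding RNA"),
    "XP": ("Model", "protein"),
}

_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


class Accession(NamedTuple):
    prefix: str
    number: str
    version: Optional[int]  # None when the accession is given without a version
    category: str
    molecule: str


def parse_accession(text: str) -> Accession:
    """Split 'NM_173477.4' into its parts and look up the prefix category."""
    match = _PATTERN.match(text.strip())
    if match is None or match.group(1) not in PREFIXES:
        raise ValueError(f"not a recognised RefSeq accession: {text!r}")
    prefix, number, version = match.groups()
    category, molecule = PREFIXES[prefix]
    return Accession(prefix, number, int(version) if version else None,
                     category, molecule)


for acc in ("NM_173477.4", "XM_005257026.1"):
    print(parse_accession(acc))
```

The version suffix is what changes when a record such as NM_173477 is updated to match the genome; the number before the dot stays the same.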
  • 7. RefSeq curation & genome maintenance
      • GRCh37 issue: SCX duplication; MROH1 split
      • GRCh38 update: gap closed; MROH1 complete; one SCX gene
      [Figure labels: GRCh38, GRCh37, gap]
  • 8. RefSeq curation & genome maintenance • POLR2A (GeneID:5430) NM_000937.4 has a 2 nt deletion vs. GRCh38 • This maintains the correct reading frame GRCh38 alignment
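An indel whose length is not a multiple of three shifts the reading frame, which is why a transcript may deliberately differ from the assembly. The toy sketch below uses made-up sequences (not POLR2A) in which the "genome" copy carries 2 extra nucleotides relative to the "transcript"; the `is_intact_cds` helper is hypothetical and checks only the start codon, the frame, and stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq: str) -> list:
    """Split a sequence into complete triplets, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def is_intact_cds(seq: str) -> bool:
    """True if seq starts with ATG, is a whole number of codons long,
    ends with a stop codon, and has no stop codon before the last one."""
    seq = seq.upper()
    if len(seq) % 3 != 0 or not seq.startswith("ATG"):
        return False
    triplets = codons(seq)
    if triplets[-1] not in STOP_CODONS:
        return False
    return not any(c in STOP_CODONS for c in triplets[:-1])


# Toy data: the transcript CDS reads ATG GCT GAA AAG TAA.
transcript_cds = "ATGGCTGAAAAGTAA"
# The toy "genome" has 2 extra bases, so the transcript shows a 2 nt
# deletion when aligned against it.
genome_cds = transcript_cds[:8] + "GA" + transcript_cds[8:]

print("transcript intact:", is_intact_cds(transcript_cds))
print("genome copy intact:", is_intact_cds(genome_cds))
```

In this toy case, forcing the transcript to match the genome would break the coding sequence, so keeping the 2 nt difference is the frame-preserving choice.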
  • 9. RefSeq curation & genome maintenance • RefSeq reported this sequence issue to the GRC
  • 10. GRCh38 ALT LOCI and PATCHES Pre-Patch & ALT review Polymorphic pseudogenes Haplotype & CNV variation ALT-specific RefSeq records Curator-stored placement data Evidence-based genome annotation pipeline Manual Curation Assembly-ALT alignments Alignment quality reports Subsequent genome annotation build corrects the annotation Interim alignment updates
  • 11. Polymorphic pseudogenes
      • RefSeq provides different transcripts to represent the protein-coding gene versus the pseudogene
      • Curators store assembly placement information (chromosome versus ALT) in a local database
      • This is used by annotation pipeline to ensure correct annotation
      Example: GSTT cluster on chromosome 22

      Assembly Unit     GSTT1    GSTT2    GSTT2B   GSTTP1   GSTTP2
      GRCh38 chr22      null     pseudo   coding   pseudo   null
      ALT_REF_LOCI_1    coding   coding   coding   pseudo   pseudo
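The placement table on this slide can be pictured as a lookup keyed by assembly unit and gene. The sketch below is only an illustration of that idea: the dictionary layout and function names are assumptions, and the slide does not describe the actual schema of the curators' local database.

```python
# Status of each GSTT-cluster gene per assembly unit, copied from the slide.
PLACEMENT = {
    "GRCh38 chr22": {
        "GSTT1": "null", "GSTT2": "pseudo", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "null",
    },
    "ALT_REF_LOCI_1": {
        "GSTT1": "coding", "GSTT2": "coding", "GSTT2B": "coding",
        "GSTTP1": "pseudo", "GSTTP2": "pseudo",
    },
}


def expected_status(assembly_unit: str, gene: str) -> str:
    """Return the stored status ('coding', 'pseudo' or 'null') for a gene."""
    return PLACEMENT[assembly_unit][gene]


def genes_that_differ(unit_a: str, unit_b: str) -> list:
    """Genes whose stored status is not the same on the two assembly units."""
    a, b = PLACEMENT[unit_a], PLACEMENT[unit_b]
    return sorted(gene for gene in a if a[gene] != b[gene])


print(expected_status("GRCh38 chr22", "GSTT2"))
print(genes_that_differ("GRCh38 chr22", "ALT_REF_LOCI_1"))
```

Listing the genes that differ between units highlights exactly the loci where a pipeline would need the stored placement to choose between the coding and pseudogene transcript.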
  • 12. GSTT* variation, chromosome 22
      • Copy number variation of glutathione-S-transferase theta genes is associated with digestive tract cancers and more
      • Accurate gene annotation is important to downstream users
      ulcerative colitis - laryngeal cancer - esophageal cancer - colorectal cancer
      [Figure: GRCh38 chr22 and GRCh38 ALT; labels: pseudogene, chr22 = null allele, coding allele]
  • 13. GSTT2 polymorphism AT splice donor Premature stop codon GT splice donor Stop codon GRCh38 chr22 GRCh38 ALT
  • 14. GRCh38 chr22 GSTT2 pseudogene GRCh38 chr22
  • 15. Data access
      • Genes:
          • <…ncbi root url…>/gene/
          • ftp://ftp.ncbi.nlm.nih.gov/gene/
          • NCBI YouTube ‘Download genomic sequence for a gene’
              • https://www.youtube.com/watch?v=RHz2nZbzjpA
      • RefSeq transcripts and proteins:
          • Links from NCBI Gene
          • Nucleotide/protein query:
              • human[organism] + use facets to specify RefSeq and molecule type
          • ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
      • NCBI Genome Annotation
          • Links from NCBI Assembly or Genome resources
          • <ncbi>/assembly/ or <ncbi>/genome/
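The slide describes a web query (`human[organism]` plus facets for RefSeq and molecule type). As a rough programmatic equivalent, the sketch below builds an Entrez query string and an E-utilities ESearch URL without sending any request. This route is not mentioned on the slide: the `srcdb_refseq[PROP]` and `biomol_mrna[PROP]` filters, the `nuccore` database choice, and the helper names are assumptions standing in for the facet selections.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint (not taken from the slide).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_refseq_query(organism: str = "human",
                       molecule_filter: str = "biomol_mrna[PROP]") -> str:
    """Combine the organism term with RefSeq and molecule-type filters."""
    return f"{organism}[organism] AND srcdb_refseq[PROP] AND {molecule_filter}"


def esearch_url(term: str, db: str = "nuccore", retmax: int = 20) -> str:
    """Return a URL-encoded ESearch request; nothing is fetched here."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term, 'retmax': retmax})}"


term = build_refseq_query()
url = esearch_url(term)
print(term)
print(url)
```

Only the URL is constructed, so the example runs offline; anyone sending such requests should follow NCBI's usage guidelines for E-utilities.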
  • 16. Data access to annotated genome Gene Assembly details
  • 17. Genome FTP formats
      • FASTA
          • genome, transcripts, proteins
      • GenBank file format
          • genome, transcripts, proteins
      • GFF genome annotation
      • Feature table
          • features and locations in tabular format
      • AGP, Assembly details & statistics
      • Repeat masker results
      • MD5 checksums
      • Documentation
          • README files
          • <ncbi>/genome/doc/ftpfaq/
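Of the formats listed, GFF is the one most often parsed by hand. The sketch below reads a single GFF3 feature line, assuming the standard nine tab-separated columns with `key=value` attributes separated by semicolons; the example line, its identifiers, and the `parse_gff3_line` helper are made up for illustration and are not taken from an NCBI file.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line: str):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines and for comment/directive lines
    (those starting with '#').
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive in GFF3.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = unquote(value)  # undo %XX escapes
    record["attributes"] = attributes
    return record


# Hypothetical feature line, for illustration only.
example = "chrToy\tExampleSource\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=TOY1"
record = parse_gff3_line(example)
print(record["type"], record["start"], record["end"], record["attributes"])
```

For whole files, an established GFF library is usually a better choice than a hand-written parser; this sketch only shows how the columns are laid out.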
  • 18. Acknowledgements RefSeq Curators Annotation pipeline Paul Kitts Terence Murphy Francoise Thibaud-Nissen Eric Cox Catherine Farrell Tamara Goldfarb Tripti Gupta Vinita Joardar Vamsi Kodali Kelly McGarvey Mike Murphy Nuala O'Leary Shashi Pujar Bhanu Rajput Sanjida Rangwala Lillian Riddick Dave Webb Matt Wright Susan Hiatt www.ncbi.nlm.nih.gov/refseq/ Collaborators Elspeth Bruford (HGNC) Jen Harrow (HAVANA) Locus-Specific Databases Expert databases Individual scientists
  • 19. NCBI Posters & Booth 2405