This document discusses the evolution of genome references at the National Center for Biotechnology Information (NCBI). It describes how genomic data is stored and tracked in GenBank, and how reference assemblies are developed and annotated through collaborations between NCBI, other genome centers, and the research community. The goal is to provide consistent, high-quality reference genomes and annotations across multiple assemblies.
Church gmod2012 pt2
1. Navigating Genome Resources at NCBI
The Evolution of the Human Genome Reference
Part 2
Deanna M. Church, NCBI
@deannachurch
2. Data Archives
GenBank
Data in a common format
Data in a single location (and mirrored)
Most quality checked prior to deposition
Robust data tracking mechanism (accession.version)
Data owned by submitter
3. Data tracking
ABC14-1065514J1
Accession.version  Date         Phase  Gaps  Length
FP565796.1         21-Oct-2009  1      1
FP565796.2         14-Oct-2010  1      0
FP565796.3         07-Nov-2010  3      0
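The accession.version scheme in the table above can be handled programmatically. The following is a minimal Python sketch, not NCBI code: the function names `parse_accession_version` and `latest_versions` are hypothetical, and it assumes an identifier is simply an accession, a dot, and an integer version.

```python
from typing import Dict, Iterable, NamedTuple


class AccessionVersion(NamedTuple):
    accession: str
    version: int


def parse_accession_version(identifier: str) -> AccessionVersion:
    """Split 'FP565796.3' into the stable accession and its integer version."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not (version.isascii() and version.isdigit()):
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return AccessionVersion(accession, int(version))


def latest_versions(identifiers: Iterable[str]) -> Dict[str, str]:
    """Keep only the highest version seen for each accession."""
    newest: Dict[str, int] = {}
    for identifier in identifiers:
        parsed = parse_accession_version(identifier)
        if parsed.version > newest.get(parsed.accession, 0):
            newest[parsed.accession] = parsed.version
    return {acc: f"{acc}.{ver}" for acc, ver in newest.items()}


records = ["FP565796.1", "FP565796.2", "FP565796.3"]
print(latest_versions(records))
```

The accession stays fixed while the version increments with each sequence update, so any of the three records above can still be retrieved exactly as it was deposited.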
8. By any other name…
Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
9. Genome Browser Agreement
Submitter deposits assembly to GenBank/EMBL/DDBJ
Assembly QA
Submitter updates assembly based on QA results
Browsers pick up assembly from GenBank/EMBL/DDBJ
Assemblies must be in GenBank/EMBL/DDBJ
12. Assembly (e.g. GRCh37.p5)
Unit                                    GenBank           RefSeq
Assembly                                GCA_000001405.6   GCF_000001405.17
Primary Assembly                        GCA_000001305.1   GCF_000001305.13
ALT 1                                   GCA_000001315.1   GCF_000001315.1
ALT 2                                   GCA_000001325.1   GCF_000001325.2
ALT 3                                   GCA_000001335.1   GCF_000001335.1
ALT 4                                   GCA_000001345.1   GCF_000001345.1
ALT 5                                   GCA_000001355.1   GCF_000001355.1
ALT 6                                   GCA_000001365.1   GCF_000001365.2
ALT 7                                   GCA_000001375.1   GCF_000001375.1
ALT 8                                   GCA_000001385.1   GCF_000001385.1
ALT 9                                   GCA_000001395.1   GCF_000001395.1
Non-nuclear assembly unit (e.g. MT)     GCA_000006015.1   GCF_000006015.1
Patches                                 GCA_000005045.5   GCF_000005045.4
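The accession pairs on this slide share a numeric identifier and differ in prefix (GCA for GenBank, GCF for RefSeq) and sometimes in version. A minimal Python sketch of checking that relationship follows. The function names and the regular expression are assumptions made for illustration, and the rule "same nine digits means paired" is read off the pairs listed here, not taken from an NCBI specification.

```python
import re
from typing import Tuple

# Assumed pattern: GCA or GCF, underscore, nine digits, dot, integer version.
ASSEMBLY_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession: str) -> Tuple[str, str, int]:
    """Return (prefix, numeric identifier, version) for an assembly accession."""
    match = ASSEMBLY_RE.match(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return prefix, number, int(version)


def is_paired(genbank_accession: str, refseq_accession: str) -> bool:
    """True when a GCA and a GCF accession carry the same numeric identifier."""
    g_prefix, g_number, _ = parse_assembly_accession(genbank_accession)
    r_prefix, r_number, _ = parse_assembly_accession(refseq_accession)
    return g_prefix == "GCA" and r_prefix == "GCF" and g_number == r_number


# The full GRCh37.p5 assembly: versions differ (6 vs 17) but the pair matches.
print(is_paired("GCA_000001405.6", "GCF_000001405.17"))
```

Versions are deliberately ignored in the comparison, because the GenBank and RefSeq copies are versioned independently.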
13. GenBank vs RefSeq
GenBank               RefSeq
Submitter Owned       RefSeq Owned
Redundancy            Non-Redundant
Updated rarely        Curated
INSDC                 Not INSDC

BRCA1
83 genomic records    3 genomic records
31 mRNA records       5 mRNA records
                      1 RNA record
27 protein records    5 protein records
15. RefSeq for Assemblies
Typical assembly edits
Addition of non-nuclear (e.g. MT) assembly units
Removal of contamination
Drop unlocalized/unplaced scaffolds
Mask contamination that is placed on chromosome
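As an illustration of the last edit, masking a contaminated interval that sits inside a placed sequence can be sketched in Python as below. This is not the RefSeq procedure: the function name `mask_region`, the use of `N` as the masking character, and the 0-based half-open coordinates are all assumptions made for the example.

```python
def mask_region(sequence: str, start: int, end: int, mask_char: str = "N") -> str:
    """Replace sequence[start:end] with mask_char, keeping the length unchanged.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("region falls outside the sequence")
    return sequence[:start] + mask_char * (end - start) + sequence[end:]


# Hypothetical toy sequence with a contaminated stretch at positions 2..4.
print(mask_region("ACGTACGT", 2, 5))  # ACNNNCGT
```

Masking instead of deleting keeps downstream coordinates on the chromosome stable, which is why it suits contamination that is already placed.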
20. NCBI36
GRCh37.p5
No second pass alignments in GRCh37.p5
http://www.ncbi.nlm.nih.gov/tools/gbench/
21. Annotation pipeline
Assemblies
Transcripts Proteins
Set of genes
Other decoration
Francoise Thibaud-Nissen
22. Content of the final annotation product
Columns: Description | In sequence database | In a BLAST database | On FTP site
Chromosomes (NC_ or AC_)
Scaffolds (NW_ or NT_)
Curated transcripts/proteins (NM_, NR_/NP_)
Predicted transcripts/proteins (fully or partially supported) (XM_, XR_/XP_)
Non-transcribed pseudogenes
tRNA (annotated with tRNAScan)
Ab initio Gnomon models
Legend: Annotation Pipeline, RefSeq
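The accession prefixes listed on this slide are enough to sort records into the product categories it names. A minimal Python sketch follows. The lookup table restates only the prefixes given above, the function name `classify_refseq` is hypothetical, and the example accessions are placeholders, not real records.

```python
# Prefix-to-category table restating the slide; other RefSeq prefixes are omitted.
REFSEQ_PREFIXES = {
    "NC_": "chromosome",
    "AC_": "chromosome",
    "NW_": "scaffold",
    "NT_": "scaffold",
    "NM_": "curated transcript",
    "NR_": "curated transcript",
    "NP_": "curated protein",
    "XM_": "predicted transcript",
    "XR_": "predicted transcript",
    "XP_": "predicted protein",
}


def classify_refseq(accession: str) -> str:
    """Map an accession to a product category by its three-character prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "unknown")


# Placeholder accessions for illustration only.
for acc in ["NC_000000.1", "XM_000000.1", "NP_000000.1", "FP565796.3"]:
    print(acc, "->", classify_refseq(acc))
```

A GenBank accession such as FP565796.3 has no RefSeq prefix, so it falls through to "unknown".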
23. Where to find the annotation products?
• Nucleotide/Protein databases
• Gene http://www.ncbi.nlm.nih.gov/gene
• Map Viewer http://www.ncbi.nlm.nih.gov/mapview
• BLAST databases
• FTP site
24. Annotating multiple assemblies
• Assembly-assembly alignments
Available at http://www.ncbi.nlm.nih.gov/genome/tools/remap
Group 1
Transcript
Assembly 1
Group 2
Assembly 2
• Consistent placement of transcripts
• Consistent labelling of the genes
• Consistent annotation on all assemblies
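To make the idea of assembly-assembly alignments concrete, here is a minimal Python sketch of projecting a position from one assembly onto another through ungapped alignment blocks. It is not the algorithm behind the NCBI Remap tool: the `Block` type, the function name `remap_position`, the 0-based coordinates, and the forward-strand-only handling are assumptions made for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Block(NamedTuple):
    """One ungapped, forward-strand aligned segment between two assemblies."""
    source_start: int  # 0-based start on assembly 1
    target_start: int  # 0-based start on assembly 2
    length: int


def remap_position(position: int, blocks: Sequence[Block]) -> Optional[int]:
    """Project a position on assembly 1 onto assembly 2, or None if unaligned."""
    for block in blocks:
        offset = position - block.source_start
        if 0 <= offset < block.length:
            return block.target_start + offset
    return None


# Hypothetical alignment: two blocks separated by an unaligned gap.
alignment = [Block(100, 500, 50), Block(200, 560, 30)]
print(remap_position(120, alignment))  # 520
print(remap_position(160, alignment))  # None
```

A feature whose endpoints both project through the same alignment lands in a consistent place on each assembly, which is the property the bullets above rely on.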
28. Thanks!
The Genome Reference Consortium
The Genome Center at Washington University
The Wellcome Trust Sanger Institute
The European Bioinformatics Institute
The National Center for Biotechnology Information
For Slides:
Francoise Thibaud-Nissen
Evan Eichler
Steve Sherry
Church group at NCBI
Valerie Schneider
Nathan Bouk
Hsiu-Chuan Chen
Peter Meric
Victor Ananiev
Chao Chen
John Lopez
John Garner
Tim Hefferon
NCBI
Cliff Clausen
Editor's Notes
Show alignment of a feature from first slide to show how far down the chromosome it has moved…
Keeping track of people is way easier than keeping track of assemblies.