The document summarizes the evolution of genome data over time at the National Center for Biotechnology Information (NCBI). It describes how the amount of genome data and the number of users have grown exponentially since 1989. It also discusses advances in genome assembly, including the representation of structural variation and alternate loci. The development of the Genome Reference Consortium to maintain updated genome assemblies deposited in public archives is also covered.
The document discusses the evolution of genomic resources at the National Center for Biotechnology Information (NCBI) over the past 22 years. It shows graphs of the growth in data volumes for resources like GenBank, users accessing services, and the number of human variations cataloged in dbSNP. Key resources highlighted include PubMed, BLAST, Entrez, GenBank, dbSNP, Reference Sequence (RefSeq), Genome Remapping Service, Sequence Read Archive, and more. The document outlines NCBI's role in organizing and providing access to genomic and biomedical literature data.
This document summarizes research characterizing DNA methylation in the Pacific oyster Crassostrea gigas. High-throughput bisulfite sequencing was used to analyze DNA methylation patterns at high resolution. Several genes were found to have different levels and patterns of methylation across tissues and developmental stages. The results provide evidence that DNA methylation plays an important regulatory role and may be involved in environmental responses in C. gigas. Future work will investigate how epigenetic mechanisms are affected by environmental stressors.
The document characterized DNA methylation in the Pacific oyster (Crassostrea gigas). Results showed DNA methylation is present and predictive analysis aligned with experimental measurements. High-throughput bisulfite sequencing of gill tissue revealed methylation in exons, introns, and intergenic regions. Methylation levels correlated negatively with gene expression. Comparisons between tissues identified differentially methylated regions, with half in gene bodies. Methylation may distinguish housekeeping from inducible genes and have a role in tissue-specific functions.
This document discusses the process of analyzing sequencing data from the NA12878 reference sample. It describes the 3 phases required to turn raw sequencing reads into usable variant calls: 1) NGS data processing, 2) variant discovery and genotyping, and 3) integrative analysis. Phase 1 involves tasks like mapping, local realignment, and duplicate marking to produce analysis-ready reads. Phase 2 identifies SNPs, indels and structural variants. Phase 3 performs quality control and combines results with other data. The document emphasizes the extensive processing needed to produce reliable variant calls from raw sequencing data.
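One Phase 1 step, duplicate marking, can be sketched in miniature. This is a toy illustration of the idea (reads sharing mapping coordinates are presumed PCR duplicates, and the highest-quality read per position is kept), not the actual Picard/GATK algorithm; the read fields and values here are hypothetical:

```python
def mark_duplicates(reads):
    """Flag all but the highest-quality read among reads sharing
    the same (chrom, pos, strand) mapping coordinates."""
    best = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        if key not in best or read["qual"] > best[key]["qual"]:
            best[key] = read
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        read["is_duplicate"] = read is not best[key]
    return reads

reads = [
    {"name": "r1", "chrom": "1", "pos": 100, "strand": "+", "qual": 30},
    {"name": "r2", "chrom": "1", "pos": 100, "strand": "+", "qual": 45},
    {"name": "r3", "chrom": "1", "pos": 250, "strand": "-", "qual": 20},
]
marked = mark_duplicates(reads)  # r1 is flagged; r2 (higher quality) and r3 survive
```

Downstream variant callers then ignore flagged reads so that PCR amplification artifacts do not inflate allele counts.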
Consortium to produce biofuels from Jatropha
This document summarizes a consortium project between institutions in Japan, Indonesia, and Botswana to develop Jatropha plants that can produce clean biofuel through molecular breeding. The goals are to increase Jatropha productivity and develop plants that absorb more carbon dioxide. Participating organizations will work on molecular breeding techniques, field testing in different environments, and evaluating fuel production from higher yielding Jatropha varieties. The end goal is to assist energy needs in Asia and Africa through a sustainable Jatropha biofuel production system.
This document provides a summary of a talk on metagenome assembly. [1] Digital normalization is introduced as an approach that discards redundant reads prior to assembly to reduce data size and eliminate errors, improving scaling. [2] Two soil metagenome datasets totaling over 1,800 gigabase pairs were assembled, generating over 4.5 million contigs and estimating the equivalent of around 1,200 bacterial genomes. [3] While assembly approaches are improving, interpreting the function of genes from unknown organisms remains a major challenge.
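The digital normalization idea can be sketched as follows. This is a simplified toy version of the median-k-mer-count approach (the real khmer implementation uses a probabilistic counting structure to fit in memory); the `k` and `cutoff` values here are illustrative:

```python
from statistics import median

def kmers(seq, k=4):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalize(reads, k=4, cutoff=2):
    """Keep a read only if the median count of its k-mers (among reads
    seen so far) is below the coverage cutoff; otherwise discard it as
    redundant. Errors in high-coverage regions are discarded with it."""
    counts = {}
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if median(counts.get(km, 0) for km in ks) < cutoff:
            kept.append(read)
            for km in ks:
                counts[km] = counts.get(km, 0) + 1
    return kept
```

For example, five identical reads collapse to two at `cutoff=2`, while a read from an unseen region always passes, which is how the approach shrinks data volume without losing novel sequence.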
This study uses microfluidic devices containing wells connected by tunnels to culture neuronal networks and control the direction of connections. Multi-electrode array recordings of neuronal activity from the wells are analyzed using Granger causality to validate its ability to determine connectivity directionality. The results show that Granger causality correctly identified unidirectional propagation of activity from older to younger neuronal populations through the tunnels. However, the analysis is sensitive to the time scale used, and both the bin size and time constant must match the time scale of interactions between neurons.
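The binning step that the analysis is sensitive to can be sketched as follows (a minimal illustration, assuming spike times and bin sizes in seconds; the Granger model fitting itself is omitted):

```python
def bin_spikes(spike_times, bin_size, duration):
    """Convert spike times into per-bin spike counts; Granger causality
    is then fit to these count series, so the chosen bin_size must match
    the timescale of the neuronal interactions."""
    n_bins = int(duration / bin_size)
    counts = [0] * n_bins
    for t in spike_times:
        i = int(t / bin_size)
        if i < n_bins:
            counts[i] += 1
    return counts

spikes = [0.5, 1.2, 1.8, 3.9]
fine = bin_spikes(spikes, bin_size=1.0, duration=4.0)    # [1, 2, 0, 1]
coarse = bin_spikes(spikes, bin_size=2.0, duration=4.0)  # [3, 1]
```

The coarse binning merges the first three spikes into one count, illustrating how an overly large bin can hide the lead-lag structure that directed connectivity measures depend on.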
NCBO Webinar: Translating unstructured, crowdsourced content into structured data (Andrew Su)
The use of crowdsourcing in biology is gaining popularity as a mechanism to tackle challenges of massive scale. However, to maximize participation and lower the barriers to entry, contributions to crowdsourcing efforts are typically not well-structured, which makes computing on these data challenging. The presentation will discuss strategies for translating this unstructured content into structured data. Three vignettes (in varying degrees of completion) will be described, one each from our Gene Wiki [1], BioGPS [2], and serious gaming [3] initiatives.
[1]: http://en.wikipedia.org/wiki/Portal:Gene_Wiki
[2]: http://biogps.org
[3]: http://genegames.org
The document describes the sequencing of the wheat genome, specifically chromosome 3B. Key points:
1. An international effort led by the IWGSC sequenced individual wheat chromosomes including 3B using a physical map-based approach.
2. Sequencing of the 1 Gb chromosome 3B generated over 1,000 scaffolds covering 995 Mb with an N50 of 463 kb. Genes and markers were annotated.
3. The sequenced and ordered chromosome 3B provides a foundation for accelerating wheat improvement through map-based cloning, marker development, and integrating genetic and genomic resources.
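The N50 statistic quoted in point 2 has a standard definition: the length such that scaffolds at least that long together contain half of the assembled bases. A minimal sketch:

```python
def n50(contig_lengths):
    """Return the length L such that contigs/scaffolds of length >= L
    together cover at least half of the total assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

For example, `n50([100, 80, 50, 30, 20])` returns 80, since the two longest pieces (180 of 280 total bases) already span half the assembly.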
The document discusses RNA-seq analysis. It begins with an introduction to Mikael Huss, a bioinformatics scientist, and provides an overview of how genomics, RNA profiles, protein profiles, and interactomics relate within systems biology. The document then discusses how gene expression analysis can provide insights into basic research questions regarding tissue and cell identity, as well as insights into diseases by identifying genes that are over- or under-expressed in patients. Finally, it provides a brief overview of the typical workflow for RNA-seq analysis, which involves mapping RNA sequencing reads to a reference genome or transcriptome.
This document provides information about a QIIME workshop. It includes instructions on how to get started with QIIME, an overview of the typical QIIME analysis pipeline from raw sequencing data to results, and details on specific QIIME tools and files like the mapping file, OTU table, and parameters file. The document also discusses the "Moving Pictures of the Human Microbiome" time-series analysis performed with QIIME.
Stephen Friend, Nature Genetics Colloquium, 2012-03-24 (Sage Base)
This document proposes using data intensive science to build models of disease within a shared computing environment or "commons". It notes that current disease models often oversimplify complex conditions. Five pilot projects are described that could leverage shared clinical and genomic data as well as model building to better represent diseases: 1) sharing comparator arm data from clinical trials, 2) a federated aging analysis project, 3) portable legal consent, 4) a Sage Congress modeling competition, and 5) the BRIDGE initiative for democratizing medical research. The document argues this approach could accelerate disease understanding and new therapy development.
The document discusses the BioHDF project which aims to develop scalable data infrastructure for bioinformatics using HDF5. It notes that next generation DNA sequencing is producing vast amounts of complex data that is challenging to analyze and compare across samples due to lack of consistent data models and structured storage. The BioHDF project seeks to address this by developing HDF5 domain extensions and tools to organize, index, annotate and access sequencing data in a way that enables more efficient analysis, visualization and exploration of results within and between samples.
Stephen Friend, Fanconi Anemia Research Fund, 2012-01-21 (Sage Base)
This document summarizes Stephen Friend's presentation on using data intensive science and bionetworks to build better maps of human diseases. It discusses how collecting and integrating massive amounts of molecular and clinical data using open information systems and computing could enable the development of more comprehensive and probabilistic causal models of diseases. These evolving disease maps may help identify causal genes and pathways involved in various conditions. The presentation outlines Sage Bionetworks' mission to create a commons for scientists to collaborate on building and refining such integrative bionetworks to accelerate the elimination of human disease.
The document summarizes a presentation about developing open access tools to maximize the value of genomic data through the Genome Commons. The Genome Commons Database will be a repository of variants and associated traits. The Genome Commons Navigator will integrate this data and external tools to facilitate basic research, clinical applications, and more. Participation in the Critical Assessment of Genome Interpretation initiative aims to improve predictions of variant impacts on molecular, cellular and organismal phenotypes. Analysis of variants in folate pathway genes found classes of effects on yeast growth and folate remediation.
The National Center for Biotechnology Information (NCBI) was created in 1988 as part of the National Library of Medicine at NIH. It establishes public databases for biological research, develops software tools for sequence analysis, and disseminates biomedical information from its location in Bethesda, MD. NCBI houses several integrated databases including PubMed, GenBank, RefSeq, and UniGene that contain literature, sequences, gene information, and more.
The GeneArt® Gene Synthesis service consists of chemical synthesis, cloning, and sequence verification of virtually any desired genetic sequence. You will receive a bacterial stab and/or purified plasmid containing your synthesized gene—ready for downstream applications.
Whether you have limited cloning experience or simply want to save time, the GeneArt® Gene Synthesis service helps you move your ideas from the planning stage to the laboratory more quickly. Benefit from our experience in successfully producing over 180,000 constructs for customers as diverse as large pharmaceutical companies, biotechnology start-ups, and basic research institutions. The comparison shown in the figure below highlights the time and effort saved compared to traditional cloning. For more information visit:
https://www.invitrogen.com/site/us/en/home/Products-and-Services/Applications/Cloning/gene-synthesis.html?CID=genesynthesis-SS-12312
The document discusses next generation sequencing methods and RNA sequencing. It covers topics like sequencing formats, data analysis workflows including mapping, clustering, assembly programs, finding new genes and correcting existing ones. It discusses input file types, calculating sequencing depth, available tools for alignment, output file formats, assembly programs, splice junction prediction, and applications of RNA sequencing like gene expression analysis and annotation.
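The sequencing-depth calculation mentioned above reduces, in its simplest form, to total sequenced bases divided by genome size (the Lander-Waterman average coverage); a sketch:

```python
def sequencing_depth(num_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size

# e.g. 900 million 100 bp reads over a 3 Gb genome give 30x average depth
depth = sequencing_depth(900_000_000, 100, 3_000_000_000)
```

Note this is an average: actual per-base depth varies with GC content, mappability, and library biases, which is why callers inspect local coverage rather than relying on this single number.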
Microarrays allow researchers to examine gene expression patterns across thousands of genes simultaneously. A microarray contains probes for known genes that are used to detect complementary mRNA in a biological sample. Microarrays can be used to study gene expression differences between normal and diseased tissues, classify tumor subtypes, and diagnose cancers. They also show promise for personalized cancer treatment by predicting patient prognosis and response to therapy.
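The over-/under-expression comparison described here is commonly reported as a log2 fold change between the two intensity values; a toy sketch (assuming already-normalized intensities):

```python
from math import log2

def log2_fold_change(diseased, normal):
    """Log2 ratio of expression intensities: positive means over-expressed
    in the diseased sample, negative means under-expressed."""
    return log2(diseased / normal)

fc_up = log2_fold_change(8.0, 2.0)    # four-fold over-expression
fc_down = log2_fold_change(2.0, 8.0)  # four-fold under-expression
```

The log2 scale makes up- and down-regulation symmetric around zero, which is why it is the conventional axis for expression comparisons.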
NCBI has developed a powerful suite of online biomedical and bioinformatics resources, including old friends like PubMed and OMIM and newer resources such as Genome. This collection of databases and tools is widely used by scientists and medical professionals across the world. With such a wealth of information, it is easy to get overwhelmed. Join us for an overview of NCBI resources for the information professional, with an emphasis on biodata connectivity. No science degree required!
Scratchpads in the Biodiversity Informatics Landscape (Vince Smith)
Roberts, D., Harman, K., Rycroft, S.D. & Smith, V.S. Stockholm Biodiversity Informatics Symposium 2008, Swedish Museum of Natural History, Stockholm, Sweden 1-4 December 2008.
Unison: Enabling easy, rapid, and comprehensive proteomic mining (Reece Hart)
Unison is an online database and data integration platform that aggregates proteomic and genomic data from multiple sources and provides over 200 million precomputed predictions on protein sequences, domains, structures, and more. It aims to enable easy, rapid, and comprehensive proteomic mining through semantic integration of distinct data types and automated querying of predictions. Custom data mining projects using Unison have led to discoveries about proteins like Bcl-2 that regulate apoptosis.
This document describes a comparative analysis of the human gut microbiota of Koreans using barcoded pyrosequencing. It finds that the Korean gut microbiome has high diversity at the species and strain levels, with over 800 species-level phylotypes identified on average per individual. The analysis identifies 14 core genera that are consistently present across Korean guts, including Bacteroides, Prevotella, Clostridium, and Ruminococcus. The phylum-level diversity of the Korean gut microbiome is similar to other human populations.
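Diversity at the genus or phylotype level, as described here, is often summarized with the Shannon index (one common metric, not necessarily the exact one used in this study); a minimal sketch:

```python
from math import log

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxon abundances."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

# four equally abundant genera are maximally diverse for 4 taxa: H' = ln(4)
h = shannon_index([25, 25, 25, 25])
```

The index rises with both the number of taxa and the evenness of their abundances, so a gut dominated by one genus scores far lower than one with many balanced genera.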
This document discusses lessons learned from building cancer models and realities around sharing, rewards, and affordability. It notes that oncogenes only make good targets in particular molecular contexts, as seen with the EGFR story. Predicting treatment response to known oncogenes is complex and requires detailed understanding of how different genetic backgrounds function. It also discusses preliminary probabilistic models being used to identify genes causal for disease. Extensive publications now substantiate the scientific approach of using probabilistic causal bionetwork models for conditions such as metabolic, cardiovascular, and bone diseases. Sage Bionetworks is working to build an information commons for biological functions, with collaborative disease maps and data repositories, to better relate the genetic features of cancers to drug efficacy.
Presentation of Eugeni Belda (LABGeM-Genoscope) at the Biocuration 2012 conference (Georgetown University, Washington DC): From bacterial genome annotation to metabolic pathway curation
The document discusses ways to improve the diagnostic yield of exome sequencing by addressing limitations in analytical and clinical validity. It notes that standard exomes do not fully cover the exome or reference genomes, and clinical interpretation is limited by incomplete knowledge in the literature and databases. Improving coverage, integrating more information sources, and enhancing data processing could help uncover more diagnostic variants.
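The coverage limitation described above is often quantified as breadth of coverage: the fraction of targeted bases sequenced to at least some minimum depth. A hypothetical sketch (the 20x threshold is illustrative):

```python
def breadth_of_coverage(per_base_depth, min_depth=20):
    """Fraction of targeted bases covered at or above min_depth;
    bases below the threshold cannot be confidently genotyped,
    reducing diagnostic yield."""
    covered = sum(1 for d in per_base_depth if d >= min_depth)
    return covered / len(per_base_depth)

# 3 of 5 target bases reach 20x; variants in the other 2 may be missed
frac = breadth_of_coverage([30, 25, 10, 0, 40], min_depth=20)
```

Reporting breadth at a depth threshold, rather than mean depth alone, exposes exactly the poorly covered exonic regions the document identifies as a source of missed diagnostic variants.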
This document provides information about variation resources available from the National Center for Biotechnology Information (NCBI). It lists the staff members who work on variation resources and key collaborators. It describes some of the major databases hosted by NCBI that contain genetic variation data, including dbSNP, dbVar, ClinVar and GTR. It also summarizes some of the tools and viewers available for exploring genetic variation data from NCBI.
The document discusses the human reference genome assembly. It provides information on what a reference assembly is, how it is constructed, and how it has evolved over time. Key points include:
- The reference assembly is a model of the human genome built from many sequencing reads and is continually improved.
- Early assemblies had gaps and errors that have been improved on in newer releases. The current primary assembly is GRCh38.
- Alternate loci are now included to represent structural and haplotype variations not in the primary assembly.
- The reference assembly is important for mapping variants and interpreting genomic data.
This document discusses analyzing individual genomes and the human reference genome assembly. It provides an overview of how the reference assembly is constructed from sequencing data and improved over time. Key points discussed include how gaps are filled, alternate loci are represented, and new sequences are added to improve representation of structural and sequence variation.
This document discusses the reference genome assembly and how it is changing. It provides an overview of why the reference assembly matters, how the assembly is constructed and updated, and tools for finding assembly and variation data. Key points include: the assembly is a model that may have gaps; the human reference assembly has been updated several times; alternate loci are used to represent structural variants and haplotypes; and ongoing work involves adding novel sequence and fixing rare incorrect bases or assembly problems.
This document discusses the GeT-RM Project and Browser, which provides a resource for clinical testing laboratories to submit and analyze genomic variant call data. It lists the project team members and participating laboratories. The GeT-RM Browser allows laboratories to analyze variant call concordance and validation data across different sequencing platforms. Looking forward, the project aims to improve analysis tools and the browser interface with features like consensus genotype sets, investigation of discordant regions, and improved gene navigation.
This document discusses improvements to the human reference genome assembly (GRCh38) which will be released in September 2013. It highlights several key areas of focus for the new assembly including adding novel sequence from alternate loci, improving problematic regions through patching, increasing contiguity, and masking regions of high identity to aid read alignment and variant calling. The overall goal is to provide a more complete and accurate representation of the human genome sequence.
This document summarizes the challenges of integrating historical human genetic variation data from analog formats into digital genomic databases. It discusses issues with standardizing phenotypic data, variant call formats from clinical labs, reference assemblies, and defining mutations consistently. Harmonizing these diverse data sources will improve access and interpretation of human genetic variation.
This document discusses the Human Genome Project and summarizes two studies related to human genomes. The first study analyzed genetic variation in human meiotic recombination. The second studied population stratification of a common gene deletion polymorphism. Figures from both published studies are included to illustrate their findings.
The document discusses the human reference genome assembly, noting that it is a composite model that is not static, as new versions are periodically released with changes to sequence and coordinates. It emphasizes that accession versions are important for data management when the reference updates, and that tools are available to help with identifying changes between assemblies. The human reference assembly aims to represent the composite human genome but continues to be improved over time.
This document discusses improving the accuracy of variant identification by evolving the reference assembly. It describes how the reference assembly is updated through patches that add novel sequence, coordinate remapping between versions, and collaboration between groups to centralize assembly data. The goal is to facilitate reporting and fixing problems while building tools and managing data.
This document summarizes work on representing genomes and identifying genetic variants. It discusses challenges in genome assembly due to structural variation between haplotypes and the need for new assembly models that represent multiple haplotypes. It also describes the Genome Reference Consortium's efforts to improve the human reference genome sequence through patching and releasing alternate loci and haplotypes. This includes releasing over 70 patches to fix errors and add novel sequences, with patches being released quarterly.
This document discusses the evolution of genome references at the National Center for Biotechnology Information (NCBI). It describes how genomic data is stored and tracked in GenBank, and how reference assemblies are developed and annotated through collaborations between NCBI, other genome centers, and the research community. The goal is to provide consistent, high-quality reference genomes and annotations across multiple assemblies.
The document summarizes an IMGS 2011 bioinformatics workshop. It discusses next-generation sequencing technologies including Roche 454, Illumina/Solexa, and AB SOLiD. It also covers topics like sequence alignments, file formats, tools for analysis including BWA and TopHat, and visualization. The document provides links to video tutorials and resources on sequencing technologies, alignments, and analyzing RNA-seq data.
This is the talk I gave at the 4th annual Sequencing, Finishing, Analysis in the Future meeting. I tried to sync the meeting recording of my talk onto the slides, but it didn't work well. You can view the talk, with the audio matched to the slides, at http://www.scivee.tv/node/11410.
9. GRC Beginnings
(Diagram: the old assembly model — distributed data, genome not in an INSDC database.)
10. Build sequence contigs based on the contigs defined in the TPF:
- Check for orientation consistency
- Select switch points
- Instantiate the sequence for further analysis
(Diagram: a switch point within the consensus sequence.)
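The contig-building steps above can be sketched as a toy consensus builder (the data and the helper function are hypothetical, not the GRC pipeline; a switch point here is simply the consensus offset where one component stops contributing sequence and the next begins):

```python
# Toy contig builder: join overlapping component sequences at switch points.
# The switch point of each component is the offset in the growing consensus
# where that component takes over from the previous one.

def build_contig(components):
    """components: list of (sequence, switch_point) pairs, where
    switch_point is the consensus coordinate at which this component
    begins contributing (0 for the first component)."""
    consensus = ""
    for seq, switch in components:
        overlap = len(consensus) - switch
        # Consistency check: the overlapping region must agree exactly.
        if overlap > 0 and consensus[switch:] != seq[:overlap]:
            raise ValueError("components disagree across the switch point")
        consensus = consensus[:switch] + seq
    return consensus

# Two components overlapping by 4 bases; the switch point is at offset 6.
contig = build_contig([("ATGCGTGCAA", 0), ("GCAAAATGCA", 6)])
print(contig)  # ATGCGTGCAAAATGCA
```

Real assembly additionally reconciles orientation and base-level disagreements; this sketch only shows the switch-point bookkeeping.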
15. (Diagram: the old assembly model — distributed data, genome not in an INSDC database — moving toward centralized data.)
16. Large-Scale Variation Complicates Genome Assembly
Given sequences from haplotype 1 and sequences from haplotype 2:
- Old assembly model: compress them into a consensus
- New assembly model: represent both haplotypes
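A minimal illustration of the two models (hypothetical sequences; the old model collapses heterozygous sites into IUPAC ambiguity codes, losing phasing, while the new model keeps each haplotype intact):

```python
# Collapsing two haplotypes into one consensus loses phased variation;
# keeping both haplotypes preserves it. Note the collapse only works at
# all when the haplotypes are the same length -- large structural
# variants cannot be compressed, which is the slide's point.

IUPAC = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
         frozenset("GT"): "K", frozenset("AT"): "W", frozenset("CG"): "S"}

def old_model(hap1, hap2):
    """Compress two same-length haplotypes into one ambiguity-coded consensus."""
    return "".join(a if a == b else IUPAC[frozenset((a, b))]
                   for a, b in zip(hap1, hap2))

def new_model(hap1, hap2):
    """Represent both haplotypes (e.g. primary assembly + alternate locus)."""
    return {"primary": hap1, "alt": hap2}

h1, h2 = "ATGCGT", "ATACGA"
print(old_model(h1, h2))  # ATRCGW -- which alleles co-occur is lost
print(new_model(h1, h2))
```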
19. GRCh37 (hg19) includes alternate loci at UGT2B17, the MHC, and MAPT, with 7 alternate haplotypes at the MHC.
Alternate loci are released as:
- FASTA
- AGP
- Alignment to chromosome
http://genomereference.org
21. (Diagram: structure of an assembly such as GRCh37 — a primary assembly unit (including the PAR), a non-nuclear assembly unit (e.g. MT), and alternate loci ALT 1-9 grouped by genomic region: MHC, UGT2B17, and MAPT.)
24. Oh No! Not a new version of the human genome!
http://genomereference.org
26. (Diagram: structure of an assembly such as GRCh37.p5 — the primary assembly unit (including the PAR), the non-nuclear assembly unit (e.g. MT), alternate loci ALT 1-9 grouped by genomic region (MHC, UGT2B17, MAPT), and patch scaffolds covering further regions such as ABO, SMA, and PECAM1.)
27. (Figure: the TBC1D3 gene family — TBC1D3C, TBC1D3, TBC1D3H — in the Myo19 region, 17q21.)
28. 70 FIX patches: the chromosome will update in GRCh38 (adds >1 Mb of novel sequence to the assembly).
71 NOVEL patches: additional sequence added (adds >800 kb of novel sequence to the assembly).
Patches are released quarterly.
29. (Diagram: the old assembly model — distributed data, genome not in an INSDC database — versus the updated assembly model — centralized data, genome in the INSDC database.)
30. Data Archives: GenBank
- Data in a common format
- Data in a single location (and mirrored)
- Most data quality-checked prior to deposition
- Robust data-tracking mechanism (accession.version)
- Data owned by the submitter
31. Data tracking
Clone ABC14-1065514J1:
Accession.version   Date          Phase   Gaps   Length
FP565796.1          21-Oct-2009   1       1
FP565796.2          14-Oct-2010   1       0
FP565796.3          07-Nov-2010   3       0
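The accession.version mechanism can be sketched in a few lines (hypothetical helper functions, not an NCBI tool): the accession is a stable identifier, and the version increments whenever the underlying sequence changes, as in the FP565796 history above.

```python
# Track sequence updates via accession.version identifiers: the accession
# stays stable while the version increments on each sequence change.

def parse_accver(accver):
    """Split 'FP565796.3' into ('FP565796', 3)."""
    accession, version = accver.rsplit(".", 1)
    return accession, int(version)

def latest(accvers):
    """Return the newest accession.version for each distinct accession."""
    best = {}
    for accver in accvers:
        acc, ver = parse_accver(accver)
        if ver > best.get(acc, 0):
            best[acc] = ver
    return {f"{acc}.{ver}" for acc, ver in best.items()}

history = ["FP565796.1", "FP565796.2", "FP565796.3"]
print(latest(history))  # {'FP565796.3'}
```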
39. (Diagram: an assembly such as GRCh37.p5, GCA_000001405.6 / GCF_000001405.17, with GenBank/RefSeq accessions for each assembly unit:)
- Primary assembly: GCA_000001305.1 / GCF_000001305.13
- Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1
- ALT 1: GCA_000001315.1 / GCF_000001315.1
- ALT 2: GCA_000001325.1 / GCF_000001325.2
- ALT 3: GCA_000001335.1 / GCF_000001335.1
- ALT 4: GCA_000001345.1 / GCF_000001345.1
- ALT 5: GCA_000001355.1 / GCF_000001355.1
- ALT 6: GCA_000001365.1 / GCF_000001365.2
- ALT 7: GCA_000001375.1 / GCF_000001375.1
- ALT 8: GCA_000001385.1 / GCF_000001385.1
- ALT 9: GCA_000001395.1 / GCF_000001395.1
- Patches: GCA_000005045.5 / GCF_000005045.4
40. GenBank vs RefSeq
- GenBank: submitter owned; redundant; updated rarely; INSDC
- RefSeq: RefSeq (NCBI) owned; non-redundant; curated; not INSDC
BRCA1 example:
- GenBank: 83 genomic records, 31 mRNA records, 27 protein records
- RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records
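The two accession styles can be told apart programmatically: RefSeq sequence accessions carry a prefixed underscore (NM_, NP_, NC_, and so on), and for assemblies GCA_ denotes GenBank while GCF_ denotes RefSeq, as in the accession diagram above. A small sketch (the `classify` helper is hypothetical, not an NCBI API):

```python
# Classify an accession as GenBank (INSDC) or RefSeq by its prefix style.
# RefSeq sequence accessions have an underscore prefix (NM_, NP_, NC_, ...);
# assembly accessions use GCA_ (GenBank) vs GCF_ (RefSeq), and GCA_ is the
# one underscore-bearing GenBank form, so it is special-cased first.

def classify(accession):
    if accession.startswith("GCA_"):
        return "GenBank"
    if accession.startswith("GCF_"):
        return "RefSeq"
    return "RefSeq" if "_" in accession else "GenBank"

for acc in ["FP565796.3", "NM_000336.2",
            "GCA_000001405.6", "GCF_000001405.17"]:
    print(acc, "->", classify(acc))
```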
42. RefSeq for Assemblies
Typical assembly edits:
- Addition of non-nuclear (e.g. MT) assembly units
- Removal of contamination:
  - Drop contaminated unlocalized/unplaced scaffolds
  - Mask contamination that is placed on a chromosome
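The contamination edits above can be illustrated with a toy in-memory assembly (a plain dict of scaffold sequences; `apply_edits`, the scaffold names, and the coordinates are all hypothetical, not a RefSeq tool):

```python
# Two "typical assembly edits": drop an unplaced contaminated scaffold
# entirely; hard-mask (replace with N) contamination placed on a chromosome.

def apply_edits(assembly, drop, mask):
    """assembly: {name: sequence}; drop: set of scaffold names to remove;
    mask: {name: (start, end)} half-open ranges to replace with N."""
    edited = {name: seq for name, seq in assembly.items() if name not in drop}
    for name, (start, end) in mask.items():
        seq = edited[name]
        edited[name] = seq[:start] + "N" * (end - start) + seq[end:]
    return edited

assembly = {"chr1": "ATGCGTGCAA", "scaffold_7": "CCCC"}
result = apply_edits(assembly, drop={"scaffold_7"}, mask={"chr1": (2, 5)})
print(result)  # {'chr1': 'ATNNNTGCAA'}
```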
48. Genome Data is MORE than just the Genome
(Figure: a pileup of identical aligned sequencing reads — ATGCGTGCAAAATGCAGTGAGT — annotated with the variant NM_000336.2:c.800C>T.)
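The variant notation shown, NM_000336.2:c.800C>T, is HGVS-style: a versioned RefSeq transcript accession, a coding-sequence position, and the reference and alternate bases. A minimal parser for this simple substitution form (it deliberately does not cover the full HGVS grammar):

```python
import re

# Parse a simple HGVS coding substitution like NM_000336.2:c.800C>T into
# its parts: transcript accession, version, position, ref base, alt base.
HGVS_SUB = re.compile(
    r"(?P<acc>[A-Z]+_\d+)\.(?P<ver>\d+)"   # versioned transcript accession
    r":c\.(?P<pos>\d+)"                    # coding-sequence coordinate
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])")    # reference > alternate base

def parse_hgvs_substitution(s):
    m = HGVS_SUB.fullmatch(s)
    if not m:
        raise ValueError(f"not a simple coding substitution: {s}")
    d = m.groupdict()
    d["ver"], d["pos"] = int(d["ver"]), int(d["pos"])
    return d

print(parse_hgvs_substitution("NM_000336.2:c.800C>T"))
# {'acc': 'NM_000336', 'ver': 2, 'pos': 800, 'ref': 'C', 'alt': 'T'}
```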
52. Thanks!
The Genome Reference Consortium
The Genome Center at Washington University
The Wellcome Trust Sanger Institute
The European Bioinformatics Institute
The National Center for Biotechnology Information
Church group at NCBI: Valerie Schneider, Nathan Bouk, Hsiu-Chuan Chen, Peter Meric, Victor Ananiev, Chao Chen, John Lopez, John Garner, Tim Hefferon
For slides: Francoise Thibaud-Nissen, Evan Eichler, Steve Sherry
NCBI: Cliff Clausen
Editor's Notes
Alignments refer to pairs of sequences. Once you know how a pair of sequences fits together, you can string the pairs along into a contig. The contig is essentially the consensus sequence produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
Show the alignment of a feature from the first slide to show how far down the chromosome it has moved…
Keeping track of people is way easier than keeping track of assemblies.