The document discusses the evolution of genomic resources at the National Center for Biotechnology Information (NCBI) over the past 22 years. It shows graphs of the growth in data volumes for resources like GenBank, users accessing services, and the number of human variations cataloged in dbSNP. Key resources highlighted include PubMed, BLAST, Entrez, GenBank, dbSNP, Reference Sequence (RefSeq), Genome Remapping Service, Sequence Read Archive, and more. The document outlines NCBI's role in organizing and providing access to genomic and biomedical literature data.
The document summarizes the evolution of genome data over time at the National Center for Biotechnology Information (NCBI). It describes how the amount of genome data and the number of users have grown exponentially since 1989. It also discusses advances in genome assembly, including representing structural variation and alternate loci. The development of the Genome Reference Consortium to maintain updated genome assemblies deposited in public archives is also covered.
The document characterized DNA methylation in the Pacific oyster (Crassostrea gigas). Results showed DNA methylation is present and predictive analysis aligned with experimental measurements. High-throughput bisulfite sequencing of gill tissue revealed methylation in exons, introns, and intergenic regions. Methylation levels correlated negatively with gene expression. Comparisons between tissues identified differentially methylated regions, with half in gene bodies. Methylation may distinguish housekeeping from inducible genes and have a role in tissue-specific functions.
This document summarizes research characterizing DNA methylation in the Pacific oyster Crassostrea gigas. High-throughput bisulfite sequencing was used to analyze DNA methylation patterns at high resolution. Several genes were found to have different levels and patterns of methylation across tissues and developmental stages. The results provide evidence that DNA methylation plays an important regulatory role and may be involved in environmental responses in C. gigas. Future work will investigate how epigenetic mechanisms are affected by environmental stressors.
Consortium to produce biofuels from Jatropha
This document summarizes a consortium project between institutions in Japan, Indonesia, and Botswana to develop Jatropha plants that can produce clean biofuel through molecular breeding. The goals are to increase Jatropha productivity and develop plants that absorb more carbon dioxide. Participating organizations will work on molecular breeding techniques, field testing in different environments, and evaluating fuel production from higher yielding Jatropha varieties. The end goal is to assist energy needs in Asia and Africa through a sustainable Jatropha biofuel production system.
This document discusses the process of analyzing sequencing data from the NA12878 reference sample. It describes the three phases required to turn raw sequencing reads into usable variant calls: 1) NGS data processing, 2) variant discovery and genotyping, and 3) integrative analysis. Phase 1 involves tasks like mapping, local realignment, and duplicate marking to produce analysis-ready reads. Phase 2 identifies SNPs, indels, and structural variants. Phase 3 performs quality control and combines results with other data. The document emphasizes the extensive processing needed to produce reliable variant calls from raw sequencing data.
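The duplicate-marking step in phase 1 can be illustrated with a minimal sketch (hypothetical read records, not the actual Picard/GATK implementation): reads sharing the same mapping coordinates are assumed to be PCR duplicates, and only the highest-quality copy is kept unmarked.

```python
# Minimal illustration of duplicate marking: among reads mapped to the
# same (chromosome, position, strand), keep the one with the highest
# mapping quality and flag the rest as duplicates.
def mark_duplicates(reads):
    best = {}  # (chrom, pos, strand) -> index of the highest-quality read
    for i, r in enumerate(reads):
        key = (r["chrom"], r["pos"], r["strand"])
        if key not in best or r["mapq"] > reads[best[key]]["mapq"]:
            best[key] = i
    keep = set(best.values())
    # One boolean per input read: True means "marked as duplicate".
    return [i not in keep for i in range(len(reads))]

# Hypothetical reads: the first two map to the same coordinates.
reads = [
    {"chrom": "chr1", "pos": 100, "strand": "+", "mapq": 60},
    {"chrom": "chr1", "pos": 100, "strand": "+", "mapq": 30},
    {"chrom": "chr1", "pos": 200, "strand": "-", "mapq": 50},
]
flags = mark_duplicates(reads)
```

Real pipelines additionally consider mate pairs and base qualities; this sketch only shows the core idea of collapsing reads by mapping coordinates.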
This document discusses the evolution of metagenomics from culturing microorganisms to direct high-throughput sequencing using next-generation sequencing (NGS) technologies. It describes how early metagenomics relied on cloning environmental DNA into libraries for Sanger sequencing, but NGS allows direct sequencing without cloning. NGS produces large volumes of sequence data at low cost, enabling assembly of large DNA fragments and reliable annotation of genes and pathways. The future of metagenomics involves comprehensively cataloging human and environmental microbiomes using NGS and exploiting microbial diversity for biotechnology applications like enzymes, antibiotics, and probiotics.
This document describes a comparative analysis of the human gut microbiota of Koreans using barcoded pyrosequencing. It finds that the Korean gut microbiome has high diversity at the species and strain levels, with over 800 species-level phylotypes identified on average per individual. The analysis identifies 14 core genera that are consistently present across Korean guts, including Bacteroides, Prevotella, Clostridium, and Ruminococcus. The phylum-level diversity of the Korean gut microbiome is similar to other human populations.
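The notion of a "core" set of genera — taxa detected in every individual sampled — reduces to a set intersection across samples. A minimal sketch with made-up sample data (the subject names and genus lists are illustrative, not from the study):

```python
# Each sample maps to the set of genera detected in that individual.
samples = {
    "subject1": {"Bacteroides", "Prevotella", "Clostridium", "Ruminococcus", "Dorea"},
    "subject2": {"Bacteroides", "Prevotella", "Clostridium", "Ruminococcus"},
    "subject3": {"Bacteroides", "Prevotella", "Clostridium", "Ruminococcus", "Blautia"},
}

# The core microbiome is the intersection of genera across all samples.
core = set.intersection(*samples.values())
```

Studies often relax "present in all samples" to "present in, say, 90% of samples" to tolerate detection noise; that variant is a simple count over samples instead of a strict intersection.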
Personalis is transitioning to using the GRCh38 human reference genome. This new version includes 3.6 Mb of novel sequence and 153 genes not present in the previous assembly. Analysis of variants is more challenging with the new assembly due to additional paralogous and allelic duplications as well as alternate loci. New computational tools are needed to properly align sequences and call variants in these complex genomic regions.
This document provides an introduction to bioinformatics. It defines bioinformatics as the interdisciplinary field that develops methods for storing, organizing, and analyzing vast amounts of biological data generated by new technologies. It discusses the explosive growth of genomic and protein data. It also describes the roles and skills of bioinformaticians, including knowledge of biology, computer science, and quantitative disciplines. Finally, it outlines where bioinformatics is typically conducted, such as specialized centers and universities, and how it is usually done through online and open source solutions.
The document provides instructions for using MyNCBI and linking publications to grants and awards in order to comply with NIH public access policy requirements. It describes how to create a MyNCBI account and link it to an eRA Commons account. It explains how to add publications to My Bibliography and link them to relevant grants or awards. It also discusses designating a delegate and changing a publication's compliance status. The document provides resources for further information on public access policy and using MyNCBI and My Bibliography.
This document contains slides from a lecture on the evolution of DNA sequencing technologies taught by Jonathan Eisen at UC Davis in winter 2014. The lecture covers the timeline of sequencing technology development from early manual Sanger and Maxam-Gilbert sequencing methods through modern next-generation sequencing platforms. It discusses the key innovations that enabled automation and high-throughput sequencing, such as labeled dideoxynucleotides, capillary electrophoresis, emulsion PCR, and sequencing by synthesis using reversible terminators. The slides illustrate sequencing workflows and compare different sequencing platforms such as 454, Illumina, SOLiD, and Helicos.
FAIR Data, Operations and Model management for Systems Biology and Systems Me... (Carole Goble)
This document discusses the FAIRDOM consortium's efforts to promote FAIR (Findable, Accessible, Interoperable, Reusable) principles for managing data, operations, and models from systems biology and systems medicine projects. It outlines challenges in asset management for multi-partner, multi-disciplinary projects using multiple formats and repositories. FAIRDOM provides pillars of support including community actions, platforms/tools, and a public project commons to help address these challenges and better enable sharing, reuse, and reproducibility of research assets according to FAIR principles.
The document discusses using NCBI databases to design quantitative PCR (qPCR) assays. It describes several NCBI tools that can be used:
1) The NCBI Nucleotide and Gene databases to obtain sequence information for the gene of interest.
2) NCBI BLAST to perform sequence searches and check primer specificity against relevant databases.
3) NCBI dbSNP to search for single nucleotide polymorphisms (SNPs) in the primer binding sites that could affect assay performance.
The document provides guidance on how to use these NCBI tools at various steps of the qPCR assay design process.
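A specificity check of the kind BLAST performs can be sketched in miniature as exact-match counting of a primer and its reverse complement against a candidate template (illustrative only — real primer checks allow mismatches and score binding thermodynamics; the sequences below are made up):

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (uppercase A/C/G/T only)."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def count_sites(primer, template):
    """Count exact binding sites for a primer on either strand of a template."""
    return template.count(primer) + template.count(revcomp(primer))

# Hypothetical template and primer: a specific primer should hit the
# intended site once; extra hits suggest off-target amplification.
template = "ATGGCGTACGTTAGCCGTACGCCATGGCG"
primer = "GCGTACG"
hits = count_sites(primer, template)
```

Here the primer matches once on the forward strand and once (via its reverse complement) on the reverse strand, so an assay designer would flag it for redesign.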
This document provides an introduction to next generation sequencing (NGS) technologies. It begins with an outline of topics to be covered, including the evolution of NGS technologies, their descriptions and comparisons, bioinformatics challenges of NGS data analysis, and some aspects of NGS data analysis workflows and tools. The document then delves into explanations of specific NGS platforms, their performance characteristics, and the sequencing processes. It discusses the large computational infrastructure and data management needs of NGS, as well as quality control, preprocessing of NGS data, and popular analysis tools and workflows.
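One of the preprocessing steps mentioned — quality trimming — can be sketched as cutting a read back to the last base above a Phred threshold. This is a simplified sketch assuming Phred+33 quality encoding and naive 3'-end trimming (real trimmers use sliding windows or running-sum algorithms):

```python
def phred_scores(qual_string, offset=33):
    """Decode an ASCII quality string into Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual_string]

def trim_3prime(seq, qual, min_q=20):
    """Trim bases below min_q from the 3' end of a read."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Hypothetical read: 'I' encodes Phred 40, '#' encodes Phred 2.
seq, qual = trim_3prime("ACGTACGT", "IIIIII##")
```

The two low-quality 3' bases are removed, leaving a six-base read; a Q20 threshold corresponds to a 1% per-base error rate.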
The document describes the sequencing of the wheat genome, specifically chromosome 3B. Key points:
1. An international effort led by the IWGSC sequenced individual wheat chromosomes including 3B using a physical map-based approach.
2. Sequencing of the 1 Gb chromosome 3B generated over 1,000 scaffolds covering 995 Mb with an N50 of 463 kb. Genes and markers were annotated.
3. The sequenced and ordered chromosome 3B provides a foundation for accelerating wheat improvement through map-based cloning, marker development, and integrating genetic and genomic resources.
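The N50 statistic cited above (463 kb for the 3B scaffolds) is the scaffold length at which half of the total assembly size is contained in scaffolds of that length or longer. A minimal computation on toy scaffold lengths:

```python
def n50(lengths):
    """Return the N50 of a list of contig/scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk scaffolds from longest to shortest until half the
    # assembly size is accumulated.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

lengths = [100, 200, 300, 400, 500]  # total = 1500, half = 750
value = n50(lengths)
```

Walking from the longest scaffold down, 500 + 400 = 900 first reaches half of 1,500, so the N50 here is 400.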
Stephen Friend, Nature Genetics Colloquium, 2012-03-24 (Sage Base)
This document proposes using data intensive science to build models of disease within a shared computing environment or "commons". It notes that current disease models often oversimplify complex conditions. Five pilot projects are described that could leverage shared clinical and genomic data as well as model building to better represent diseases: 1) sharing comparator arm data from clinical trials, 2) a federated aging analysis project, 3) portable legal consent, 4) a Sage Congress modeling competition, and 5) the BRIDGE initiative for democratizing medical research. The document argues this approach could accelerate disease understanding and new therapy development.
Stephen Friend, Fanconi Anemia Research Fund, 2012-01-21 (Sage Base)
This document summarizes Stephen Friend's presentation on using data intensive science and bionetworks to build better maps of human diseases. It discusses how collecting and integrating massive amounts of molecular and clinical data using open information systems and computing could enable the development of more comprehensive and probabilistic causal models of diseases. These evolving disease maps may help identify causal genes and pathways involved in various conditions. The presentation outlines Sage Bionetworks' mission to create a commons for scientists to collaborate on building and refining such integrative bionetworks to accelerate the elimination of human disease.
Presentation of Eugeni Belda (LABGeM-Genoscope) at the Biocuration 2012 conference (Georgetown University, Washington DC): From bacterial genome annotation to metabolic pathway curation
The document summarizes a presentation about developing open access tools to maximize the value of genomic data through the Genome Commons. The Genome Commons Database will be a repository of variants and associated traits. The Genome Commons Navigator will integrate this data and external tools to facilitate basic research, clinical applications, and more. Participation in the Critical Assessment of Genome Interpretation initiative aims to improve predictions of variant impacts on molecular, cellular and organismal phenotypes. Analysis of variants in folate pathway genes found classes of effects on yeast growth and folate remediation.
The document discusses the BioHDF project which aims to develop scalable data infrastructure for bioinformatics using HDF5. It notes that next generation DNA sequencing is producing vast amounts of complex data that is challenging to analyze and compare across samples due to lack of consistent data models and structured storage. The BioHDF project seeks to address this by developing HDF5 domain extensions and tools to organize, index, annotate and access sequencing data in a way that enables more efficient analysis, visualization and exploration of results within and between samples.
The document discusses RNA-seq analysis. It begins with an introduction to Mikael Huss, a bioinformatics scientist, and provides an overview of how genomics, RNA profiles, protein profiles, and interactomics relate within systems biology. The document then discusses how gene expression analysis can provide insights into basic research questions regarding tissue and cell identity, as well as insights into diseases by identifying genes that are over- or under-expressed in patients. Finally, it provides a brief overview of the typical workflow for RNA-seq analysis, which involves mapping RNA sequencing reads to a reference genome or transcriptome.
This document discusses lessons learned from building cancer models and realities around sharing, rewards, and affordability. It notes that oncogenes only make good targets in particular molecular contexts, as seen with the EGFR story. Predicting treatment response to known oncogenes is complex and requires a detailed understanding of how different genetic backgrounds function. It also discusses preliminary probabilistic models being used to identify genes causal for disease. Extensive publications now substantiate the scientific approach of using probabilistic causal bionetwork models for metabolic, cardiovascular, and bone diseases. Sage Bionetworks is working to build an information commons for biological functions through collaborative disease maps and data repositories to better relate the genetic features of cancer to drug efficacy.
This document provides information about a QIIME workshop. It includes instructions on how to get started with QIIME, an overview of the typical QIIME analysis pipeline from raw sequencing data to results, and details on specific QIIME tools and files like the mapping file, OTU table, and parameters file. The document also discusses the "Moving Pictures of the Human Microbiome" analysis using QIIME.
Scratchpads in the Biodiversity Informatics Landscape (Vince Smith)
Roberts, D., Harman, K., Rycroft, S.D. & Smith, V.S. Stockholm Biodiversity Informatics Symposium 2008, Swedish Museum of Natural History, Stockholm, Sweden 1-4 December 2008.
The GeneArt® Gene Synthesis service consists of chemical synthesis, cloning, and sequence verification of virtually any desired genetic sequence. You will receive a bacterial stab and/or purified plasmid containing your synthesized gene—ready for downstream applications.
Whether you have limited cloning experience or simply want to save time, the GeneArt® Gene Synthesis service helps you move your ideas from the planning stage to the laboratory more quickly. Benefit from our experience in successfully producing over 180,000 constructs for customers as diverse as large pharmaceutical companies, biotechnology start-ups, and basic research institutions. The comparison shown in the figure below highlights the time and effort saved compared to traditional cloning. For more information visit:
https://www.invitrogen.com/site/us/en/home/Products-and-Services/Applications/Cloning/gene-synthesis.html?CID=genesynthesis-SS-12312
The National Center for Biotechnology Information (NCBI) was created in 1988 as part of the National Library of Medicine at NIH. It establishes public databases for biological research, develops software tools for sequence analysis, and disseminates biomedical information from its location in Bethesda, MD. NCBI houses several integrated databases including PubMed, GenBank, RefSeq, and UniGene that contain literature, sequences, gene information, and more.
This document discusses the marriage of translational medicine and big data. It notes that predicting treatment response to known oncogenes like EGFR is complex and requires a detailed understanding of genetic backgrounds. Networks can identify genes causal for disease. The approach uses probabilistic causal network models, with over 80 publications validating the scientific approach. Sage Bionetworks is building disease maps and data repositories through collaborations with industry, foundations, government, and academia. Fundamentally, omics has not changed biological science itself, but iterative, networked approaches are needed to generate, analyze, and support new disease models.
NCBI has developed a powerful suite of online biomedical and bioinformatics resources, including old friends like PubMed and OMIM and newer resources such as Genome. This collection of databases and tools is widely used by scientists and medical professionals across the world. With such a wealth of information, it is easy to get overwhelmed. Join us for an overview of NCBI resources for the information professional, with an emphasis on biodata connectivity. No science degree required!
Unison: Enabling easy, rapid, and comprehensive proteomic mining (Reece Hart)
Unison is an online database and data integration platform that aggregates proteomic and genomic data from multiple sources and provides over 200 million precomputed predictions on protein sequences, domains, structures, and more. It aims to enable easy, rapid, and comprehensive proteomic mining through semantic integration of distinct data types and automated querying of predictions. Custom data mining projects using Unison have led to discoveries about proteins like Bcl-2 that regulate apoptosis.
Microarrays allow researchers to examine gene expression patterns across thousands of genes simultaneously. A microarray contains probes for known genes that are used to detect complementary mRNA in a biological sample. Microarrays can be used to study gene expression differences between normal and diseased tissues, classify tumor subtypes, and diagnose cancers. They also show promise for personalized cancer treatment by predicting patient prognosis and response to therapy.
The document discusses ways to improve the diagnostic yield of exome sequencing by addressing limitations in analytical and clinical validity. It notes that standard exomes do not fully cover the exome or reference genomes, and clinical interpretation is limited by incomplete knowledge in the literature and databases. Improving coverage, integrating more information sources, and enhancing data processing could help uncover more diagnostic variants.
This document provides information about variation resources available from the National Center for Biotechnology Information (NCBI). It lists the staff members who work on variation resources and key collaborators. It describes some of the major databases hosted by NCBI that contain genetic variation data, including dbSNP, dbVar, ClinVar and GTR. It also summarizes some of the tools and viewers available for exploring genetic variation data from NCBI.
The document discusses the human reference genome assembly. It provides information on what a reference assembly is, how it is constructed, and how it has evolved over time. Key points include:
- The reference assembly is a model of the human genome built from many sequencing reads and is continually improved.
- Early assemblies had gaps and errors that have been improved on in newer releases. The current primary assembly is GRCh38.
- Alternate loci are now included to represent structural and haplotype variations not in the primary assembly.
- The reference assembly is important for mapping variants and interpreting genomic data.
This document discusses analyzing individual genomes and the human reference genome assembly. It provides an overview of how the reference assembly is constructed from sequencing data and improved over time. Key points discussed include how gaps are filled, alternate loci are represented, and new sequences are added to improve representation of structural and sequence variation.
This document discusses the reference genome assembly and how it is changing. It provides an overview of why the reference assembly matters, how the assembly is constructed and updated, and tools for finding assembly and variation data. Key points include: the assembly is a model that may have gaps; the human reference assembly has been updated several times; alternate loci are used to represent structural variants and haplotypes; and ongoing work involves adding novel sequence and fixing rare incorrect bases or assembly problems.
This document discusses the GeT-RM Project and Browser, which provides a resource for clinical testing laboratories to submit and analyze genomic variant call data. It lists the project team members and participating laboratories. The GeT-RM Browser allows laboratories to analyze variant call concordance and validation data across different sequencing platforms. Looking forward, the project aims to improve analysis tools and the browser interface with features like consensus genotype sets, investigation of discordant regions, and improved gene navigation.
This document discusses improvements to the human reference genome assembly (GRCh38) which will be released in September 2013. It highlights several key areas of focus for the new assembly including adding novel sequence from alternate loci, improving problematic regions through patching, increasing contiguity, and masking regions of high identity to aid read alignment and variant calling. The overall goal is to provide a more complete and accurate representation of the human genome sequence.
This document summarizes the challenges of integrating historical human genetic variation data from analog formats into digital genomic databases. It discusses issues with standardizing phenotypic data, variant call formats from clinical labs, reference assemblies, and defining mutations consistently. Harmonizing these diverse data sources will improve access and interpretation of human genetic variation.
This document discusses the Human Genome Project and summarizes two studies related to human genomes. The first study analyzed genetic variation in human meiotic recombination. The second studied population stratification of a common gene deletion polymorphism. Figures from both published studies are included to illustrate their findings.
The document discusses the human reference genome assembly, noting that it is a composite model that is not static, as new versions are periodically released with changes to sequence and coordinates. It emphasizes that accession versions are important for data management when the reference updates, and that tools are available to help with identifying changes between assemblies. The human reference assembly aims to represent the composite human genome but continues to be improved over time.
This document discusses improving the accuracy of variant identification by evolving the reference assembly. It describes how the reference assembly is updated through patches that add novel sequence, coordinate remapping between versions, and collaboration between groups to centralize assembly data. The goal is to facilitate reporting and fixing problems while building tools and managing data.
This document summarizes work on representing genomes and identifying genetic variants. It discusses challenges in genome assembly due to structural variation between haplotypes and the need for new assembly models that represent multiple haplotypes. It also describes the Genome Reference Consortium's efforts to improve the human reference genome sequence through patching and releasing alternate loci and haplotypes. This includes releasing over 70 patches to fix errors and add novel sequences, with patches being released quarterly.
This document discusses the evolution of genome references at the National Center for Biotechnology Information (NCBI). It describes how genomic data is stored and tracked in GenBank, and how reference assemblies are developed and annotated through collaborations between NCBI, other genome centers, and the research community. The goal is to provide consistent, high-quality reference genomes and annotations across multiple assemblies.
The document summarizes an IMGS 2011 bioinformatics workshop. It discusses next-generation sequencing technologies including Roche 454, Illumina/Solexa, and AB SOLiD. It also covers topics like sequence alignments, file formats, tools for analysis including BWA and TopHat, and visualization. The document provides links to video tutorials and resources on sequencing technologies, alignments, and analyzing RNA-seq data.
This is the talk I gave at the 4th annual Sequencing, Finishing, Analysis in the Future meeting. I tried to sync the meeting recording of my talk to the slides, but it didn't work well. You can view the talk at http://www.scivee.tv/node/11410 to match the words to the slides.
3. Twenty-Two Years of Growth: NCBI Data and User Services Public Access

[Chart: growth from 1989 to 2011 in GenBank base pairs (millions; left axis, 0 to 140,000) and average weekday users (right axis, 0 to 2,500,000). The timeline is annotated with the debut of NCBI resources, including BLAST, dbEST, GenBank, Entrez, UniGene, dbSTS, Taxonomy, ePCR, 3D Structure/Cn3D, OMIM, dbSNP, RefSeq, PubMed, PubMed Central, LocusLink, LinkOut, the Human Genome, Trace Archive, GEO, MapViewer, WGS, CCDS, PubChem, dbGaP, Genome-Wide Association Studies, the Sequence Read Archive, dbVar, Epigenomics, 1000 Genomes, RefSeqGene, the Genome Reference Consortium, BioSystems, the Genome Remapping Service, PubMed Health, CloneDB, ClinVar, and GTR.]
4. NCBI

Tools: BLAST, GBench, Splign, Cn3D, e-PCR, e-Utilities, …
Literature: PubMed, PubMed Central, Bookshelf, MeSH, GeneReviews, …
Data: GenBank, Protein DB, SRA, GEO, dbSNP, Gene, RefSeq, …
5. Entrez: Pathway to Discovery

[Diagram: Entrez connects data domains through computed links: MEDLINE abstracts related by term-frequency statistics; literature citations embedded in sequence databases; nucleotide sequences linked by nucleotide sequence similarity; protein sequences linked by amino acid sequence similarity; and nucleotide and protein records joined via coding region features.]
14. GRC Beginnings

[Diagram: the old assembly model relied on distributed data, and the genome was not in an INSDC database.]
18. From certified overlaps to contigs:

- Build sequence contigs based on the contigs defined in the TPF.
- Check for orientation consistency.
- Select switch points.
- Instantiate the sequence for further analysis.

[Diagram: a consensus sequence built from overlapping components, with a switch point marking where the consensus changes source component.]
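The contig-building steps above hinge on switch points: positions where the consensus stops following one component and starts following the next. Here is a minimal sketch of that idea; the component names, coordinates, and the `build_consensus` helper are invented for illustration and are not part of the GRC pipeline:

```python
# Toy sketch of stitching a contig consensus from ordered, overlapping
# components using switch points. All names and coordinates are invented.

def build_consensus(components, switch_points):
    """components: list of (name, seq, start), ordered along the contig,
    where `start` is the component's 0-based offset on the contig.
    switch_points: contig coordinates where the consensus stops using
    one component and begins using the next."""
    # The last boundary is simply the end of the final component.
    last_name, last_seq, last_start = components[-1]
    boundaries = list(switch_points) + [last_start + len(last_seq)]
    consensus, pos = [], 0
    for (name, seq, start), end in zip(components, boundaries):
        # Take this component's bases from the current contig position
        # up to the switch point (or the contig end for the last one).
        consensus.append(seq[pos - start:end - start])
        pos = end
    return "".join(consensus)

# Two components that agree across their overlap (contig positions 4-8);
# the switch point at position 6 falls inside that overlap.
comps = [("A1", "ACGTACGT", 0), ("B1", "ACGTGGTT", 4)]
print(build_consensus(comps, [6]))  # -> ACGTACGTGGTT
```

Because the two components agree across the overlap, any switch point inside it yields the same consensus; disagreements in the overlap are exactly what the certification step is meant to catch first.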
23. [Diagram, revisited: moving from distributed data and the old assembly model, with the genome not in an INSDC database, toward centralized data.]
24. Large-Scale Variation Complicates Genome Assembly

Given sequences from haplotype 1 and sequences from haplotype 2:
- Old assembly model: compress them into a single consensus.
- New assembly model: represent both haplotypes.
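The contrast between the two models can be sketched as data structures. This is a toy illustration only; the dict layout, the region names, and the `alignment_targets` helper are invented and are not NCBI's actual representation:

```python
# Toy contrast of the two assembly models; all names are invented.

def alignment_targets(assembly):
    """Every sequence a read can be aligned against in this model."""
    return [assembly["primary"]] + [a["seq"] for a in assembly["alt_loci"]]

# Old model: sequences from two haplotypes are compressed into a
# single mosaic consensus, so haplotype-specific bases are lost.
old_model = {"primary": "ACGTACGTTTGA", "alt_loci": []}

# New model: a primary path plus an alternate locus for the second
# haplotype, which keeps an alignment back to the chromosome.
new_model = {
    "primary": "ACGTACGTTTGA",
    "alt_loci": [{"name": "REGION_ALT_1",
                  "seq": "ACGTAAAATTGA",
                  "aligns_to": ("chr1", 0, 12)}],
}

print(len(alignment_targets(old_model)))  # -> 1 (mosaic only)
print(len(alignment_targets(new_model)))  # -> 2 (one per haplotype)
```

The practical consequence is the second print: under the new model a read from either haplotype has a faithful target, instead of being forced onto a mosaic that may match neither.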
27. Alternate loci in GRCh37 (hg19)

Regions with alternate representations include UGT2B17, the MHC, and MAPT, with 7 alternate haplotypes at the MHC.

Alternate loci are released as:
- FASTA
- AGP
- Alignment to chromosome

http://genomereference.org
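AGP files are tab-delimited, nine-column descriptions of how components (or gaps) are placed on an assembled object. A minimal reader might look like the sketch below; the example line is invented, not taken from a GRC release, and real parsing should follow the published AGP specification:

```python
# Minimal AGP line reader (a sketch; handles only the common cases).

def parse_agp_line(line):
    f = line.rstrip("\n").split("\t")
    row = {"object": f[0], "object_beg": int(f[1]),
           "object_end": int(f[2]), "part_number": int(f[3]),
           "component_type": f[4]}
    if f[4] in ("N", "U"):                 # gap line
        row.update(gap_length=int(f[5]), gap_type=f[6], linkage=f[7])
    else:                                  # component line, e.g. type W
        row.update(component_id=f[5], component_beg=int(f[6]),
                   component_end=int(f[7]), orientation=f[8])
    return row

# Invented example: one 40 kb component placed on an MHC alt locus.
line = "HSCHR6_MHC_ALT\t1\t40000\t1\tW\tAC000001.1\t1\t40000\t+"
rec = parse_agp_line(line)
print(rec["component_id"], rec["orientation"])  # -> AC000001.1 +
```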
29. Assembly (e.g. GRCh37)

[Diagram: the assembly comprises a primary assembly unit (including the PAR), a non-nuclear assembly unit (e.g. MT), and alternate loci grouped by genomic region: ALT 1-3 at the MHC, ALT 4-6 at UGT2B17, and ALT 7-9 at MAPT.]
31. "Oh no! Not a new version of the human genome!"

http://genomereference.org
33. Assembly (e.g. GRCh37.p5)

[Diagram: the patched assembly keeps the primary assembly unit (including the PAR), the non-nuclear assembly unit (e.g. MT), and the alternate loci grouped by genomic region (ALT 1-3 at the MHC, ALT 4-6 at UGT2B17, ALT 7 at MAPT, ALT 8 at ABO, ALT 9 at SMA), and adds patches, with further regions such as PECAM1 now represented.]
34. The Myo19 region (17q21)

[Diagram: paralogous TBC1D3 family members (TBC1D3C, TBC1D3, TBC1D3H) in the Myo19 region at 17q21.]
35. Patches

- 60 FIX patches: the chromosome will update in GRCh38 (adds >1 Mb of novel sequence to the assembly).
- 70 NOVEL patches: additional sequence added (adds >800 kb of novel sequence to the assembly).
- Patches are released quarterly.
36. [Diagram: the shift from distributed data and the old assembly model, with the genome not in an INSDC database, to centralized data and an updated assembly model, with the genome in an INSDC database.]
Editor's Notes
TPFs are loaded to a centralized system for tracking. This system also manages QA on the files as an ongoing process. The first level of QA is to look at the overlap between adjacent sequences on the TPF.
When certifying an overlap, external evidence supporting the alignment must be available. Evidence typically consists of sequence data from another source, spanning clone ends, or experimental verification (such as a PCR assay detecting the join). These certificates are reviewed by other GRC members and may be approved or rejected. Certification information is publicly available.
Alignments refer to pairs of sequences. Once you know how a pair of sequences goes together, you can look at stringing the pairs along into a contig. The contig is essentially the consensus sequence produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
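The certification workflow described in the notes above can be caricatured in a few lines. Everything here is invented for illustration; the real GRC certificates live in a curated tracking system, not a Python dict:

```python
# Toy model of an overlap certificate and its review; the accessions,
# field names, and review_certificate helper are all invented.

def review_certificate(cert):
    """A certificate without external supporting evidence is rejected."""
    if not cert["evidence"]:
        return "rejected: no supporting evidence"
    return "approved"

cert = {
    "component_a": "AC000001.1",   # invented accessions
    "component_b": "AC000002.1",
    "evidence": ["spanning clone ends", "PCR assay detecting the join"],
}
print(review_certificate(cert))  # -> approved
```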