  • Alignments refer to pairs of sequences. Once you know how a pair of sequences goes together, you can string the pairs along into a contig. The contig is essentially the consensus sequence produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
  • Show alignment of a feature from first slide to show how far down the chromosome it has moved…
  • Keeping track of people is way easier than keeping track of assemblies.
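
The switch-point idea from the first note can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the GRC pipeline: the `Component` class, the `build_contig` function, and the 0-based contig coordinates are all hypothetical names and conventions chosen for the example, and the components are assumed to be already oriented and placed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """One sequence in the tiling path (hypothetical structure for this sketch)."""
    name: str
    start: int      # 0-based contig coordinate where the component begins
    sequence: str   # component sequence, already in contig orientation


def build_contig(components: List[Component], switch_points: List[int]) -> str:
    """Stitch ordered, overlapping components into one consensus sequence.

    switch_points[i] is the contig coordinate at which we stop reading
    components[i] and start reading components[i + 1].
    """
    if len(switch_points) != len(components) - 1:
        raise ValueError("need exactly one switch point between each pair")

    pieces = []
    begin = components[0].start
    for i, comp in enumerate(components):
        comp_end = comp.start + len(comp.sequence)
        # The last component is used through to its end.
        end = switch_points[i] if i < len(switch_points) else comp_end
        # The slice we take must lie inside this component.
        if not (comp.start <= begin <= end <= comp_end):
            raise ValueError(f"switch point outside component {comp.name}")
        pieces.append(comp.sequence[begin - comp.start:end - comp.start])
        begin = end
    return "".join(pieces)


# Two components overlapping on contig coordinates 5..8 ("CGT").
a = Component("A", 0, "ACGTACGT")
b = Component("B", 5, "CGTTTT")
contig = build_contig([a, b], switch_points=[6])
print(contig)
```

Any switch point inside the overlap gives the same consensus when the overlapping bases agree; where they differ, the choice of switch point decides which component's bases end up in the contig.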
    1. 1. The Evolution of Genome Data. Deanna M. Church, NCBI. @deannachurch
    2. 2. Collins FS et al, 1998. Throughput: 500 Mb/year. Cost: < $0.25 per base. Variation: 100,000 SNPs mapped.
    3. 3. Twenty Two Years of Growth: NCBI Data and User Services. [Chart, 1989 to 2011: GenBank base pairs (millions) and users per weekday (average), annotated with NCBI resources such as GenBank, BLAST, Entrez, dbSNP, RefSeq, PubMed Central, the Sequence Read Archive, ClinVar, GTR and the Genome Remapping Service.]
    4. 4. Steve Sherry, NCBI. [Charts: NCBI dbSNP database growth of rs-ids (human variations, in millions), 1999 to 2011, with legend entries SNP, non-redundant STR & indel, and ambiguous mapping; submissions by project (millions of submissions) for 1000 Genomes, HapMap, TSC and other projects.] dbSNP build 135, November 2011.
    5. 5. Kidd et al, 2007. APOBEC cluster. Black: deletion. White: insertion.
    6. 6.
    7. 7. Church et al., 2011 PLoS
    8. 8. GRC Beginnings. Distributed data. Old Assembly Model. Genome not in INSDC Database.
    9. 9. Build sequence contigs based on contigs defined in TPF: check for orientation consistencies; select switch points; instantiate sequence for further analysis. [Figure labels: switch point, consensus sequence]
    10. 10.
    11. 11. Community Input
    12. 12. Distributed data. Centralized Data. Old Assembly Model. Genome not in INSDC Database.
    13. 13. Large-Scale Variation Complicates Genome Assembly. Sequences from haplotype 1; sequences from haplotype 2. Old Assembly model: compress into a consensus. New Assembly model: represent both haplotypes.
    14. 14. UGT2B17 Region: NCBI36 (hg18)
    15. 15. UGT2B17 Region. NCBI36 NC_000004.10 (chr4) tiling path: AC079749.5, AC147055.2, AC019173.4, AC021146.7, AC074378.4, AC134921.2, AC140484.1, AC093720.2; genes: TMPRSS11E, TMPRSS11E2. GRCh37 NC_000004.11 (chr4) tiling path: AC079749.5, AC147055.2, AC021146.7, AC074378.4, AC134921.1, AC093720.2; gene: TMPRSS11E. GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC019173.4, AC021146.7, AC074378.4, AC226496.2, AC140484.1; gene: TMPRSS11E2. Xue Y et al, 2008
    16. 16. GRCh37 (hg19): UGT2B17, MHC, MAPT. 7 alternate haplotypes at the MHC. Alternate loci released as: FASTA, AGP, alignment to chromosome.
    17. 17. Assembly (e.g. GRCh37): PAR; primary assembly; non-nuclear assembly unit (e.g. MT); ALT 1 to ALT 9; genomic regions (MHC), (UGT2B17), (MAPT).
    18. 18. Richa Agarwala. MHC alternate locus: alignment to chr6.
    19. 19. Oh No! Not a new version of the human genome!
    20. 20. Assembly (e.g. GRCh37.p5): PAR; primary assembly; non-nuclear assembly unit (e.g. MT); ALT 1 to ALT 9; genomic regions (MHC), (UGT2B17), (MAPT), (ABO), (SMA), (PECAM1); patches …
    21. 21. Myo19 region (17q21): TBC1D3C, TBC1D3, TBC1D3H, TBC1D3C
    22. 22. 70 fix patches: chromosome will update in GRCh38 (adds >1 Mb of novel sequence to the assembly). 71 novel patches: additional sequence added (adds >800K of novel sequence to the assembly). Releasing patches quarterly.
    23. 23. Distributed data. Centralized Data. Old Assembly Model. Updated Assembly Model. Genome not in INSDC Database. Genome in INSDC Database.
    24. 24. Data Archives: GenBank. Data in a common format. Data in a single location (and mirrored). Most quality checked prior to deposition. Robust data tracking mechanism (accession.version). Data owned by submitter.
    25. 25. Data tracking: ABC14-1065514J1. Columns: Date, Phase, Gaps, Length. FP565796.1: 21-Oct-2009, phase 1, 1 gap. FP565796.2: 14-Oct-2010, phase 1, 0 gaps. FP565796.3: 07-Nov-2010, phase 3, 0 gaps.
    26. 26. Mouse chrX: 34,800,000-34,890,000 NC_000086.1 2 4 3 6 5 7 CM001013.1 2
    27. 27. Mouse chrX: 35,000,000-36,000,000 MGSCv3 MGSCv36 X
    28. 28. What’s in a name? GRCh37, hg19, Zv7, danRer5, MGSCv37, mm8, NCBIM37
    29. 29. By any other name… chr21:8,913,216-9,246,964
    30. 30. By any other name… Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
    31. 31. hg19 GRCh37
    32. 32. Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17. Primary Assembly: GCA_000001305.1 / GCF_000001305.13. Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1. ALT 1: GCA_000001315.1 / GCF_000001315.1. ALT 2: GCA_000001325.1 / GCF_000001325.2. ALT 3: GCA_000001335.1 / GCF_000001335.1. ALT 4: GCA_000001345.1 / GCF_000001345.1. ALT 5: GCA_000001355.1 / GCF_000001355.1. ALT 6: GCA_000001365.1 / GCF_000001365.2. ALT 7: GCA_000001375.1 / GCF_000001375.1. ALT 8: GCA_000001385.1 / GCF_000001385.1. ALT 9: GCA_000001395.1 / GCF_000001395.1. Patches: GCA_000005045.5 / GCF_000005045.4.
    33. 33. GenBank vs RefSeq. GenBank: submitter owned; redundancy; updated rarely; INSDC; BRCA1: 83 genomic records, 31 mRNA records, 27 protein records. RefSeq: RefSeq owned; non-redundant; curated; not INSDC; BRCA1: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records.
    34. 34. RefSeq for Assemblies. Typical assembly edits: addition of non-nuclear (e.g. MT) assembly units; removal of contamination; drop unlocalized/unplaced scaffolds; mask contamination that is placed on chromosome.
    35. 35.
    36. 36. Understanding relationships between assemblies using alignments. First Pass: reciprocal best hit. Second Pass: non-reciprocal, duplicative hits.
    37. 37. NCBI36, GRCh37.p5. No second pass alignments in GRCh37.p5.
    38. 38. Genome Data is MORE than just the Genome
    41. 41.
    42. 42. @NCBI
    43. 43. Thanks! The Genome Reference Consortium The Genome Center at Washington University The Wellcome Trust Sanger Institute The European Bioinformatics Institute The National Center for Biotechnology Information Church group at NCBI For Slides: Valerie Schneider Francoise Thibaud-Nissen Nathan Bouk Evan Eichler Hsiu-Chuan Chen Steve Sherry Peter Meric Victor Ananiev Chao Chen John Lopez John Garner Tim Hefferon NCBI Cliff Clausen
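
The accession.version tracking mechanism mentioned on the Data Archives and Data tracking slides can be illustrated with a small Python sketch. It assumes a simplified identifier shape (uppercase letters and underscores, then digits, then a dot and an integer version), which covers the examples in this deck such as FP565796.3, NC_000004.11 and GCA_000001405.6 but is not a full specification of INSDC or RefSeq identifiers. The names `VersionedAccession`, `parse_accession` and `latest_versions` are hypothetical helpers written for this example.

```python
import re
from typing import Dict, Iterable, NamedTuple


class VersionedAccession(NamedTuple):
    accession: str   # stable identifier, e.g. "FP565796"
    version: int     # incremented each time the sequence changes


# Simplified pattern (an assumption for this sketch, not an official grammar).
_ACCESSION_VERSION = re.compile(r"^([A-Z_]+[0-9]+)\.([0-9]+)$")


def parse_accession(text: str) -> VersionedAccession:
    """Split 'accession.version' into its stable part and integer version."""
    match = _ACCESSION_VERSION.match(text.strip())
    if match is None:
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return VersionedAccession(match.group(1), int(match.group(2)))


def latest_versions(identifiers: Iterable[str]) -> Dict[str, VersionedAccession]:
    """Keep only the highest version seen for each accession."""
    latest: Dict[str, VersionedAccession] = {}
    for text in identifiers:
        acc = parse_accession(text)
        current = latest.get(acc.accession)
        if current is None or acc.version > current.version:
            latest[acc.accession] = acc
    return latest


# Identifiers taken from the slides above.
seen = ["FP565796.1", "FP565796.3", "FP565796.2", "NC_000004.10", "NC_000004.11"]
newest = latest_versions(seen)
for acc in newest.values():
    print(f"{acc.accession}.{acc.version}")
```

Because the accession stays fixed while the version number increases, two coordinates can only be compared safely when both the accession and the version match.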