Church gmod2012 pt1


Part one of my talk at the GMOD 2012 meeting

Published in: Technology
  • TPFs (tiling path files) are loaded into a centralized tracking system, which also manages QA on the files as an ongoing process. The first level of QA is to look at the overlap between adjacent sequences on the TPF.
  • When certifying an overlap, external evidence supporting the alignment must be available. Evidence typically consists of sequence data from another source, spanning clone ends, or experimental verification (such as a PCR assay detecting the join). These certificates are reviewed by other GRC members and may be approved or rejected. Certification information is publicly available.
  • Alignments refer to pairs of sequences. Once you know how a pair of sequences goes together, you can string the pairs along into a contig. The contig is essentially the consensus sequence produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
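The role of switch points can be illustrated with a minimal sketch. It assumes each component is already in the correct orientation and that the switch points have been chosen from a certified overlap. The `Component` class, its field names, and the toy sequences are hypothetical and are not taken from the GRC pipeline.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One sequence on the tiling path (hypothetical structure)."""
    name: str      # made-up label, not a real accession
    sequence: str  # component sequence, already oriented
    start: int     # 0-based index of the first base used in the contig
    end: int       # one past the last base used (switch point out of this component)


def build_contig(components):
    """Join the used slice of each component, in tiling path order."""
    return "".join(c.sequence[c.start:c.end] for c in components)


# Two toy components that overlap by the four bases "GGTT".
comp_a = Component("compA", "AACCGGTT", start=0, end=6)  # use AACCGG, then switch
comp_b = Component("compB", "GGTTCCAA", start=2, end=8)  # resume at TTCCAA

contig = build_contig([comp_a, comp_b])
print(contig)
```

Because the two components agree across the overlap, any switch point inside it yields the same consensus. Where the components differ, the choice of switch point decides which component's bases end up in the contig.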

    1. 1. The Evolution of the Resources Navigating Genome Reference Human Genome at NCBI Part 1. Deanna M. Church, NCBI, @deannachurch
    2. 2. NCBI: BLAST, PubMed, GenBank
    3. 3. Twenty Two Years of Growth: NCBI Data and User Services [chart: GenBank base pairs (millions) and average users per weekday, 1989 to 2011, annotated with the launch of NCBI resources from GenBank, BLAST and Entrez through dbSNP, RefSeq, PubMed Central, PubChem, dbGaP, the Genome Reference Consortium, 1000 Genomes, dbVar and ClinVar]
    4. 4. NCBI. Tools: Blast, GBench, Splign, Cn3D, e-PCR, e-Utilities, … Literature: PubMed, PubMed Central, Bookshelf, MeSH, Gene Reviews, … Data: GenBank, Protein DB, SRA, GEO, dbSNP, Gene, RefSeq, …
    5. 5. Entrez: Pathway to Discovery [diagram labels: MEDLINE abstracts; term frequency statistics; literature citations in sequence databases; nucleotide sequences; protein sequences; nucleotide sequence similarity; amino acid sequence similarity; coding region features]
    6. 6. Programmatic access: [journal]+AND+breast+cancer+AND+2008[pdat]&usehistory=y <eSearchResult> <Count>6</Count> <RetMax>6</RetMax> <RetStart>0</RetStart> <IdList> <Id>19008416</Id> <Id>18927361</Id> <Id>18787170</Id> <Id>18487186</Id> <Id>18239126</Id> <Id>18239125</Id> </IdList> …
    7. 7. @NCBI
    8. 8. Collins FS et al, 1998. Throughput: 500 Mb/year. Cost: < $0.25 per base. Variation: 100,000 SNPs mapped
    9. 9. Steve Sherry, NCBI. NCBI dbSNP database growth [charts: rs-ids of human variations (millions), 1999 to 2011, with labels SNP, STR & Indel, ambiguous mapping and non-redundant annotations; submissions (millions) by project: 1000 Genomes, HapMap, TSC, other projects]. dbSNP build 135, November 2011
    10. 10. Kidd et al, 2007. APOBEC cluster. Black: Deletion. White: Insertion
    11. 11.
    12. 12. Church et al., 2011 PLoS
    13. 13. GRC Beginnings. Distributed data. Old Assembly Model. Genome not in INSDC Database
    14. 14. Build sequence contigs based on contigs defined in TPF: check for orientation consistencies; select switch points; instantiate sequence for further analysis. [figure labels: Switch point, Consensus sequence]
    15. 15.
    16. 16. Community Input
    17. 17. Distributed data. Centralized Data. Old Assembly Model. Genome not in INSDC Database
    18. 18. Large-Scale Variation Complicates Genome Assembly. Sequences from haplotype 1. Sequences from haplotype 2. Old Assembly model: compress into a consensus. New Assembly model: represent both haplotypes
    19. 19. UGT2B17 Region. NCBI36 (hg18)
    20. 20. UGT2B17 Region. NCBI36 NC_000004.10 (chr4) tiling path: AC079749.5, AC147055.2, AC019173.4, AC021146.7, AC074378.4, AC134921.2, AC140484.1, AC093720.2 (gene labels: TMPRSS11E, TMPRSS11E2). GRCh37 NC_000004.11 (chr4) tiling path: AC079749.5, AC147055.2, AC021146.7, AC074378.4, AC134921.1, AC093720.2 (gene label: TMPRSS11E). GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC019173.4, AC021146.7, AC074378.4, AC226496.2, AC140484.1 (gene label: TMPRSS11E2). Xue Y et al, 2008
    21. 21. GRCh37 (hg19): UGT2B17, MHC, MAPT. 7 alternate haplotypes at the MHC. Alternate loci released as: FASTA, AGP, alignment to chromosome
    22. 22. Assembly (e.g. GRCh37) [diagram: primary assembly with PAR; non-nuclear assembly unit (e.g. MT); alternate loci units ALT 1 to ALT 9; genomic regions (MHC), (UGT2B17) and (MAPT)]
    23. 23. Richa Agarwala. MHC alternate locus alignment to chr6
    24. 24. Oh No! Not a new version of the human genome!
    25. 25. Assembly (e.g. GRCh37.p5) [diagram: primary assembly with PAR; non-nuclear assembly unit (e.g. MT); alternate loci units ALT 1 to ALT 9; Patches; genomic regions (MHC), (UGT2B17), (MAPT), (ABO), (SMA), (PECAM1), …]
    26. 26. TBC1D3C, TBC1D3, TBC1D3H, TBC1D3C. Myo19 region (17q21)
    27. 27. 60 Fix PATCHES: chromosome will update in GRCh38 (adds >1 Mb of novel sequence to the assembly). 70 Novel PATCHES: additional sequence added (adds >800K of novel sequence to the assembly). Releasing patches quarterly
    28. 28. Distributed data. Centralized Data. Old Assembly Model. Updated Assembly Model. Genome not in INSDC Database. Genome in INSDC Database
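The eSearch response shown on the "Programmatic access" slide can be read with a standard XML parser. The sketch below is one possible way to do it, not NCBI's own client code. It copies the response from the slide and closes the root element by hand, since the slide truncates the XML after the ID list. No network request is made, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Response body as shown on the slide; the closing </eSearchResult> tag is
# added here as an assumption, because the slide cuts the document off.
response = """<eSearchResult>
  <Count>6</Count>
  <RetMax>6</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>19008416</Id>
    <Id>18927361</Id>
    <Id>18787170</Id>
    <Id>18487186</Id>
    <Id>18239126</Id>
    <Id>18239125</Id>
  </IdList>
</eSearchResult>"""

root = ET.fromstring(response)

# Total number of matching records and the PubMed IDs returned in this batch.
count = int(root.findtext("Count"))
pmids = [id_el.text for id_el in root.findall("./IdList/Id")]

print(count)
print(pmids)
```

The IDs collected this way could then be passed to a follow-up E-utilities call. The `usehistory=y` parameter in the slide's query suggests the results were also stored on the server for that purpose.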