Church emory2013

Seminar at Emory Sep 2013

Published in: Technology

  1. 1. Deanna M. Church, Staff Scientist, NCBI (@deannachurch). The intersection of genome assembly and variation management.
  2. 2. http://genomereference.org Valerie Schneider, NCBI
  3. 3. Variation Resources Team at NCBI: Ming Ward, Lon Phan, Brad Holmes, Anna Glodek, Michael Kholodov, Rama Maiti, Juliana Sampson, David Shao, Eugene Shekhtman, Qiang Wang, Hua Zhang, Donna Maglott, Melissa Landrum, Jennifer Lee, George Riley, Ray Tully, Craig Wallin, Shanmuga Chitipiralla, Douglas Hoffman, Wonhee Jang, Ken Katz, Michael Ovetsky, Ricardo Villamarin, Tim Hefferon, John Lopez, John Garner, Chao Chen
  4. 4. Learning Objectives: why the reference assembly matters for your analysis; how the reference assembly is changing; tools and resources to find data
  5. 5. Why should you care about the Reference Assembly?
  6. 6. Genes, NCBI Homo sapiens Annotation Release 105 (Transcript, CDS); dbSNP Build 138 using annotation release 104
  7. 7. http://www.bioplanet.com/gcat
  8. 8. What is the Reference Assembly?
  9. 9. An assembly is a MODEL of the genome
  10. 10. BAC insert; BAC vector. Shotgun sequence, then assemble. Gaps: “finishers” go in to manually fill the gaps, often by PCR
  11. 11. http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1012
  12. 12. http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-1321
  13. 13. RP11-34P13, 64E8, RP4-669L17, RP5-857K21, RP11-206L10, RP11-54O7; Gaps
  14. 14. http://genomereference.org
  15. 15. NCBI36 (hg18); GRCh37 (hg19)
  16. 16. NCBI35 (hg17) GRCh37 (hg19) AL139246.20 AL139246.21
  17. 17. Build sequence contigs based on contigs defined in TPF (Tiling Path File). Check for orientation consistencies. Select switch points. Instantiate sequence for further analysis. Figure labels: switch point; consensus sequence
  18. 18. NCBI36
  19. 19. nsv832911 (nstd68) Submitted on NCBI35 (hg17)
  20. 20. NCBI35 (hg17) Tiling Path; GRCh37 (hg19) Tiling Path. Gap inserted; moved approximately 2 Mb distal on chr15. NC_000015.8 (chr15); NC_000015.9 (chr15). Removed from assembly; added to assembly. HG-24
  21. 21. Sequences from haplotype 1; sequences from haplotype 2. Old assembly model: compress into a consensus. New assembly model: represent both haplotypes
  22. 22. AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7; NCBI36 NC_000004.10 (chr4) Tiling Path; Xue Y et al., 2008; TMPRSS11E, TMPRSS11E2; GRCh37 NC_000004.11 (chr4) Tiling Path; AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7; TMPRSS11E; GRCh37: NT_167250.1 (UGT2B17 alternate locus); AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7; TMPRSS11E2; nsv532126 (nstd37)
  23. 23. GRCh37 (hg19), http://genomereference.org. 7 alternate haplotypes at the MHC. Alternate loci released as: FASTA, AGP, alignment to chromosome. UGT2B17; MHC; MAPT
  24. 24. MHC (chr6): Chr 6 representation (PGF); Alt_Ref_Locus_2 (COX)
  25. 25. Data management and the Reference Assembly?
  26. 26. NC_000086.123456 CM001013.17 2Mouse chrX: 34,800,000-34,890,000
  27. 27. Mouse chrX: 35,000,000-36,000,000 X MGSCv3 MGSCv36
  28. 28. ABC14-1065514J1; Gaps, Phase, Length, Date; FP565796.1: 1, 1, 21-Oct-2009; FP565796.2: 1, 0, 14-Oct-2010; FP565796.3: 3, 0, 07-Nov-2010
  29. 29. hg19 GRCh37 mm8 MGSCv37 NCBIM37 danRer5 Zv7
  30. 30. chr21:8,913,216-9,246,964
  31. 31. Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
  32. 32. http://www.ncbi.nlm.nih.gov/genome/assembly
  33. 33. GenBank vs RefSeq. GenBank: submitter owned; redundancy; updated rarely; INSDC; BRCA1: 83 genomic records, 31 mRNA records, 27 protein records. RefSeq: RefSeq owned; non-redundant; curated; not INSDC; BRCA1: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records
  34. 34. http://www.ncbi.nlm.nih.gov/refseq/rsg http://www.lrg-sequence.org/
  35. 35. http://www.ncbi.nlm.nih.gov/refseq/rsg RefSeq Gene L R
  36. 36. http://www.ncbi.nlm.nih.gov/genome/tools/remap From: Assembly 1 <-> Assembly 2; Assembly <-> RefSeqGene/LRG; Primary Assembly <-> Alternate loci
  37. 37. Variant Calling and the Reference Assembly
  38. 38. Kidd et al., 2007. APOBEC cluster: part of chr22 assembly; alternate locus for chr22. White: insertion. Black: deletion
  39. 39. http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
  40. 40. Hydin: chr16 (16q22.2). Hydin2: chr1 (1q21.1). Missing in NCBI35/NCBI36; unlocalized in GRCh37; finished in GRCh38. Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID (paralogous). Alignment to Hydin1 CHM1_1.0, >99.9% ID (allelic). Doggett et al., 2006
  41. 41. http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes CDC27; 1KG Phase 1 strict accessibility mask; SNP (all); SNP (not 1KG)
  42. 42. Sudmant et al., 2010
  43. 43. Issues with the Reference Assembly
  44. 44. http://genomereference.org
  45. 45. Dennis et al., 2012. 1q32; 1q21; 1p21. 1p21 patch alignment to chromosome 1
  46. 46. Fixing Rare/Incorrect Bases
  47. 47. Adding Novel Sequence. Karen Miga and Jim Kent, arXiv:1307.0035
  48. 48. Preview of GRCh38 (scheduled Fall 2013): TEX28; TKTL1; LOC101060233 (opsin related); LOC101060234 (TEX28 related). GRCh37 (current reference assembly): NC_000023.10 (chrX); NW_003871103.3
  49. 49. FAM23_MRC1 Region, chr10. Segmental Duplications; 1KG accessibility mask; novel patch. 250 kb of artificial duplication
  50. 50. Adding Novel Sequence
  51. 51. GRCh37.p13: 120 fix patches; 60 novel patches. Human: resolved for GRCh38. http://genomereference.org
  52. 52. How to identify problem regions in the Reference Assembly
  53. 53. 1000 Genomes Browser: http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes; GeT-RM Browser: http://www.ncbi.nlm.nih.gov/variation/tools/getrm; Variation Viewer: http://www.ncbi.nlm.nih.gov/variation/view (coming Oct 2013!)
  54. 54. Tiling Path; Sequence Bar; Segmental Duplications, Eichler Lab; 1000 Genomes strict accessibility mask; Annotated clone assembly problems
  55. 55. dbSNP Build 138 based on annotation run 104; Model based paralogous sequence differences, NCBI annotation run #; Paralogous/pseudo gene alignments, NCBI annotation run #; Single Unique Nucleotide (SUN) map, Sudmant 2010; ClinVar Long Variations; GRC Curation Issues; ClinVar Short Variations
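
Several slides contrast versioned sequence identifiers, such as NC_000004.10 (NCBI36) against NC_000004.11 (GRCh37), or FP565796.1 through FP565796.3 for a single clone. As a rough illustration of why that version suffix matters for data management, here is a minimal Python sketch that splits an accession.version identifier and checks whether two identifiers name the same sequence version. It assumes only the general accession.version convention; the function names `parse_accession` and `same_sequence_version` are hypothetical and not part of any NCBI tool.

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str  # stable identifier, e.g. "NC_000004"
    version: int    # increments whenever the underlying sequence changes


def parse_accession(text: str) -> VersionedAccession:
    """Split an 'accession.version' string into its two parts."""
    accession, sep, version = text.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return VersionedAccession(accession, int(version))


def same_sequence_version(a: str, b: str) -> bool:
    """True only if both identifiers name the same accession AND version.

    Coordinates reported on different versions of the same accession
    should not be compared directly (assumption: they need remapping).
    """
    return parse_accession(a) == parse_accession(b)


if __name__ == "__main__":
    # Identifiers taken from the slides above.
    for left, right in [("NC_000004.10", "NC_000004.11"),
                        ("FP565796.3", "FP565796.3")]:
        verdict = "same" if same_sequence_version(left, right) else "different"
        print(f"{left} vs {right}: {verdict} sequence version")
```

Recording the full accession.version alongside every coordinate, rather than a bare chromosome name such as "chr4", is one way to make it explicit which assembly model a variant call refers to.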
