Church iowa2013


Talk at Iowa State University 6 Nov 2013

Published in: Technology, Business

  • Signpost for biological knowledge: ideogram + list of tracks.
  • To address assembly issues, the GRC centralizes the production of the reference assembly. This gives the community a single point of contact for reporting problems and finding information about the assembly. Additionally, we serve as an aggregator of information: as individual labs find or fix problems, we can integrate this information into the reference assembly so everyone can have access to this data.
  • Insert dot matrix alignment- pull from assembly-assembly alignments
  • Alignments refer to pairs of sequences. Once you know how a pair of sequences go together, you can look at stringing the pairs along into a contig. The contig is essentially the consensus sequence that is produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
  • If you are not using the entire assembly in your efforts, you may be missing genes in your exome capture reagents.
  • 44 SNVs between Ren2 Tx alignment and Primary; 29 of these have rsIDs. Of these 29: 19 Alt base = Ref (likely paralog diff and no evidence for polymorphism), 9 Alt base = Tx base (SNP and paralog diff?), 1 Alt base != Ref and Alt base != Tx (craziness)
  • There are several mechanisms we can use for capturing decoy. Much of the decoy represents centromeric repeat sequence. In collaboration with Karen Hayden in Jim Kent’s lab at UCSC, the GRC is planning to include modeled centromeric sequences in GRCh38.
  • Look up how much novel sequence was added. Across all patches: 35 Mb of sequence added
  • For the intermediate build GRCh37B, we are updating a subset of the high-confidence bases, about 1000, as our proof-of-principle. This panel shows reads from NA12878 aligned to chr. 19 that identify a base with MAF=0 in the LIN37 locus. This creates a non-consensus splice site. To create accessioned sequence for correcting the reference, we are using cortex_con (Iqbal and Caccamo) to generate mini-contigs (>= 50 bp) from collections of 1kG and RP11 WGS reads, the former selected from random 1kG populations.
  • Adding NOVEL sequence for GRCh38 doesn’t just mean adding sequence that is completely unrepresented in GRCh37. While many of the NOVEL patches, like the one on the previous slide, represent indels, adding novel sequence also means adding sequence variants for regions too complex to be represented by a single path. There is substantial variation at the LRC/KIR region on chr. 19. As shown on this slide, not only has the GRC replaced the GRCh37 path, which was derived from components from different clone libraries, with a single haplotype path from the CHM1 assembly, it also now has 8 different haplotypes represented as alternate loci. The addition of another 10+ haplotypes at this locus is also under consideration.
  • Update to GRCh37.p13. The GRC has been releasing patches to the human assembly on a quarterly cycle, and we’re now at GRCh37.p12. There are two varieties of patches. FIX patches correct existing assembly problems: the chromosome will update, and the patches are integrated in GRCh38. NOVEL patches add new sequence representations: these will become alternate loci. This ideogram shows the current distribution of patches and alternate loci, and you can see that many regions have changed since GRCh37. Note that approximately 3% of the current public human assembly GRCh37 is associated with a region that is represented by a patch or alternate locus.
  • Remap
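
The note on contigs and switch points can be made concrete with a small sketch. It assumes each component is already oriented and that the switch points are given as a start and stop index into the component; the `Component` class, the `build_contig` function, and the toy sequences are all made up for illustration and are not the GRC's actual software or data.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One entry of a toy tiling path (hypothetical structure)."""
    name: str       # made-up label, not a real accession
    sequence: str   # full component sequence, already in contig orientation
    start: int      # 0-based index where the contig starts using this component
    stop: int       # 0-based exclusive index: the switch point to the next component


def build_contig(tiling_path):
    """Concatenate the used slice of each component, in tiling-path order."""
    pieces = []
    for comp in tiling_path:
        # A switch point must fall inside the component it refers to.
        if not (0 <= comp.start < comp.stop <= len(comp.sequence)):
            raise ValueError(f"bad switch points for {comp.name}")
        pieces.append(comp.sequence[comp.start:comp.stop])
    return "".join(pieces)


# Toy example: the last 5 bases of compA overlap the first 5 bases of compB.
# We switch inside the overlap: stop using compA after index 7,
# and pick up compB at the base that follows in the alignment (index 3).
path = [
    Component("compA", "ACGTACGTAA", 0, 8),
    Component("compB", "CGTAATTGGC", 3, 10),
]
contig = build_contig(path)
print(contig)
```

Because the two components agree across the overlap in this toy case, any switch point inside the overlap yields the same contig; when components disagree, the choice of switch point decides which component's bases end up in the consensus.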

    1. 1. Analyzing Individual Genomes Deanna M. Church Staff Scientist, NCBI @deannachurch
    2. 2. Valerie Schneider, NCBI
    3. 3. ISCA ClinVar Christa Lese Martin (Geisinger) Erin Riggs (Geisinger) Jose Mena Mike Feolo Tim Hefferon John Garner John Lopez Alex Astashyn Shanmuga Chitipiralla Douglas Hoffman Wonhee Jang Brandi Kattman Melissa Landrum Jennifer Lee Adriana Malheiro Wendy Rubinstein George Riley Amanjeev Sethi Ricardo Villamarin Donna Maglott GRC Valerie Schneider (NCBI) The Genome Institute at Washington University The Wellcome Trust Sanger Institute The European Bioinformatics Institute Acknowledgements GeT-RM Lisa Kalman (CDC) Birgit Funke (Harvard) Madhuri Hegde (Emory) Maryam Halavi Chao Chen Jon Trow Douglas Slotta Peter Meric Daniel Frishberg Victor Ananiev
    4. 4. Phenotypes Variation
    5. 5. Why should you care about the Reference Assembly?
    6. 6. Genes, NCBI Homo sapiens Annotation Release 105 Transcript CDS dbSNP Build 138 using annotation release 104
    7. 7.
    8. 8.
    9. 9. What is the Reference Assembly?
    10. 10. An assembly is a MODEL of the genome
    11. 11. BAC insert BAC vector Shotgun sequence Assemble GAPS “finishers” go in to manually fill the gaps, often by PCR
    12. 12.
    13. 13.
    14. 14. RP11-34P13 64E8 Gaps RP4-669L17 RP5-857K21 RP11-206L10 RP11-54O7
    15. 15.
    16. 16. GRCh37 (hg19) NCBI36 (hg18)
    17. 17. AL139246.20 NCBI35 (hg17) AL139246.21 GRCh37 (hg19)
    18. 18. Build sequence contigs based on contigs defined in TPF (Tiling Path File). Check for orientation consistencies Select switch points Instantiate sequence for further analysis Switch point Consensus sequence
    19. 19. NCBI36
    20. 20. nsv832911 (nstd68) Submitted on NCBI35 (hg17)
    21. 21. NCBI35 (hg17) Tiling Path Moved approximately 2 Mb distal on chr15 NC_000015.8 (chr15) Gap Inserted GRCh37 (hg19) Tiling Path NC_000015.9 (chr15) HG-24 Removed from assembly Added to assembly
    22. 22. Sequences from haplotype 1 Sequences from haplotype 2 Old Assembly model: compress into a consensus New Assembly model: represent both haplotypes
    23. 23. nsv532126 (nstd37) NCBI36 NC_000004.10 (chr4) Tiling Path AC079749.5 AC074378.4 AC147055.2 AC134921.2 AC019173.4 AC140484.1 AC021146.7 AC093720.2 TMPRSS11E2 TMPRSS11E GRCh37 NC_000004.11 (chr4) Tiling Path AC079749.5 AC074378.4 AC147055.2 AC134921.1 AC021146.7 AC093720.2 TMPRSS11E GRCh37: NT_167250.1 (UGT2B17 alternate locus) AC021146.7 AC019173.4 AC074378.4 AC226496.2 AC140484.1 TMPRSS11E2 Xue Y et al, 2008
    24. 24. UGT2B17 MHC MAPT GRCh37 (hg19) 7 alternate haplotypes at the MHC Alternate loci released as: FASTA AGP Alignment to chromosome
    25. 25. MHC (chr6) Chr 6 representation (PGF) Alt_Ref_Locus_2 (COX)
    26. 26. Variant Calling and the Reference Assembly
    27. 27. Part of chr22 assembly Alternate locus for chr22 White: Insertion Black: Deletion Kidd et al, 2007 APOBEC cluster
    28. 28. Rawe et al, 2013
    29. 29. Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320 NM_031192.3: transcript from C57BL/6J NM_031193.2: transcript from FVB/N 129S6/SvEvTac Alt Locus Alignment Ren1 (allelic) FVB/N Transcript Alignment Ren2 (paralog)
    30. 30. Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320 NM_031192.3: transcript from C57BL/6J NM_031193.2: transcript from FVB/N 129S6/SvEvTac Ren1 FVB Ren2 Tx Paralogous diff SNP + Paralogous diff
    31. 31. Doggett et al., 2006 Hydin: chr16 (16q22.2) Hydin2: chr1 (1q21.1) Missing in NCBI35/NCBI36 Unlocalized in GRCh37 Finished in GRCh38 Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID (Paralogous) (Allelic) Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID Alignment to Hydin1 CHM1_1.0, >99.9% ID Alignment to Hydin1 CHM1_1.0, >99.9% ID
    32. 32. CDC27 1KG Phase 1 Strict accessibility mask SNP (all) SNP (not 1KG)
    33. 33.
    34. 34. Sudmant et al., 2010
    35. 35. GRCh38 is coming (September, 2013)
    36. 36.
    37. 37. Adding Novel Sequence Karen Miga and Jim Kent arXiv:1307.0035
    38. 38. Dennis et al., 2012 1q32 1q21 1p21 1p21 patch alignment to chromosome 1
    39. 39. Fixing Rare/Incorrect Bases
    40. 40. GRCh37 (current reference assembly) NC_000023.10 (chrX) Preview of GRCh38 (scheduled Fall 2013) NW_003871103.3 TEX28 LOC101060233 (opsin related) TKTL1 LOC101060234 (TEX28 related)
    41. 41. FAM23_MRC1 Region, chr10 Segmental Duplications 1KG accessibility Mask Novel Patch 250 kb of artificial duplication
    42. 42. Adding Novel Sequence
    43. 43. Human Resolved for GRCh38 GRCh37.p13 120 Fix Patches 60 Novel
    44. 44. From Assembly 1 <-> Assembly 2 Assembly <-> RefSeqGene/LRG Primary Assembly <-> Alternate loci
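
The three-way comparison in the Ren1/Ren2 speaker note (slides 29 and 30) can be sketched as a small classifier. This is only an illustration under an assumed reading of the note: "Alt" is taken to be the base on the alternate locus, "Ref" the base on the primary assembly, and "Tx" the base in the Ren2 transcript. The function name, category labels, and toy bases are hypothetical; only the counts 19, 9, 1, and 29 come from the note.

```python
from collections import Counter


def classify_site(alt_base, ref_base, tx_base):
    """Place one transcript-vs-primary difference into one of three categories."""
    if ref_base == tx_base:
        # The note only concerns sites where the transcript differs from the primary.
        raise ValueError("not a transcript-vs-primary difference")
    if alt_base == ref_base:
        return "alt_equals_ref"         # likely a paralogous difference only
    if alt_base == tx_base:
        return "alt_equals_tx"          # possibly a SNP plus a paralogous difference
    return "alt_differs_from_both"      # matches neither sequence


# Toy sites as (alt, ref, tx) tuples; these bases are invented for illustration.
toy_sites = [("A", "A", "G"), ("T", "C", "T"), ("G", "A", "C"), ("C", "C", "T")]
tally = Counter(classify_site(*site) for site in toy_sites)
print(tally)

# Counts reported in the speaker note; they should add up to the 29 sites with rsIDs.
reported = {"alt_equals_ref": 19, "alt_equals_tx": 9, "alt_differs_from_both": 1}
print(sum(reported.values()))
```

The three categories are mutually exclusive once sites with identical Ref and Tx bases are excluded, which is why the reported counts can be checked against the total by simple addition.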