Church iowa2013



Talk at Iowa State University 6 Nov 2013





Usage Rights

CC Attribution License

  • Signpost for biological knowledge: ideogram + list of tracks.
  • To address assembly issues, the GRC centralizes production of the reference assembly. This gives the community a single point of contact for reporting problems and finding information about the assembly. We also serve as an aggregator of information: as individual labs find or fix problems, we can integrate this information into the reference assembly so that everyone has access to it.
  • Insert dot matrix alignment- pull from assembly-assembly alignments
  • Alignments refer to pairs of sequences. Once you know how a pair of sequences go together, you can string the pairs along into a contig. The contig is essentially the consensus sequence produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
  • If you are not using the entire assembly in your efforts, you may be missing genes in your exome capture reagents.
  • 44 SNVs between the Ren2 Tx alignment and Primary; 29 of these have rsIDs. Of these 29: 19 have Alt base = Ref (likely paralog diff and no evidence for polymorphism), 9 have Alt base = Tx base (SNP and paralog diff?), and 1 has Alt base != Ref and Alt base != Tx (craziness).
  • There are several mechanisms we can use for capturing decoy. Much of the decoy represents centromeric repeat sequence. In collaboration with Karen Hayden in Jim Kent’s lab at UCSC, the GRC is planning to include modeled centromeric sequences in GRCh38.
  • Look up how much novel sequence added. Across all patches: 35 Mb of sequence added
  • For the intermediate build GRCh37B, we are updating a subset of the high-confidence bases, about 1000, as our proof-of-principle. This panel shows reads from NA12878 aligned to chr. 19 that identify a base with MAF=0 in the LIN37 locus. This creates a non-consensus splice site. To create accessioned sequence for correcting the reference, we are using cortex_con (Iqbal and Caccamo) to generate mini-contigs (>= 50 bp) from collections of 1kG and RP11 WGS reads, the former selected from random 1kG populations.
  • Adding NOVEL sequence for GRCh38 doesn’t just mean adding sequence that is completely unrepresented in GRCh37. While many of the NOVEL patches, like the one on the previous slide, represent indels, adding novel sequence also means adding sequence variants for regions too complex to be represented by a single path. There is substantial variation at the LRC/KIR region on chr. 19. As shown on this slide, not only has the GRC replaced the GRCh37 path, which was derived from components from different clone libraries, with a single haplotype path from the CHM1 assembly, it also now has 8 different haplotypes represented as alternate loci. The addition of another 10+ haplotypes at this locus is also under consideration.
  • Update to GRCh37.p13. The GRC has been releasing patches to the human assembly on a quarterly cycle, and we’re now at GRCh37.p12. There are two varieties of patches. FIX patches correct existing assembly problems: the chromosome will update, and the patches are integrated in GRCh38. NOVEL patches add new sequence representations: they will become alternate loci. This ideogram shows the current distribution of patches and alternate loci, and you can see that many regions have changed since GRCh37. Note that approximately 3% of the current public human assembly GRCh37 is associated with a region that is represented by a patch or alternate locus.
  • Remap
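
The note above on alignments and switch points can be illustrated with a small Python sketch. It is not the GRC's assembly code: the `build_contig` function, the `(contig_offset, sequence)` component layout, and the toy sequences are all hypothetical, and it assumes gapless, same-strand components whose placement on the contig is already known from the pairwise alignments.

```python
from typing import List, Tuple

# Hypothetical layout: (offset of the component on the contig, component sequence).
Component = Tuple[int, str]


def build_contig(components: List[Component], switch_points: List[int]) -> str:
    """Stitch ordered, overlapping components into one sequence.

    switch_points[i] is the contig coordinate at which we stop reading
    component i and start reading component i + 1.
    """
    if len(switch_points) != len(components) - 1:
        raise ValueError("need exactly one switch point per adjacent pair")

    pieces = []
    start = components[0][0]  # begin at the first base of the first component
    for i, (offset, seq) in enumerate(components):
        # The last component is used through to its final base.
        end = switch_points[i] if i < len(switch_points) else offset + len(seq)
        if not (offset <= start <= end <= offset + len(seq)):
            raise ValueError(f"switch point falls outside component {i}")
        pieces.append(seq[start - offset:end - offset])
        start = end
    return "".join(pieces)


# Toy tiling path: the second component differs from the first inside the overlap.
path = [(0, "AAAACCCC"), (4, "CCTCGGGG"), (8, "GGGGTTTT")]
early_switch = build_contig(path, [6, 10])
late_switch = build_contig(path, [8, 10])
```

In this toy case the two results differ at one base, because the switch point decides which component supplies the overlapping region.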

Church iowa2013 Presentation Transcript

  • Analyzing Individual Genomes Deanna M. Church Staff Scientist, NCBI @deannachurch
  • Valerie Schneider, NCBI
  • ISCA ClinVar Christa Lese Martin (Geisinger) Erin Riggs (Geisinger) Jose Mena Mike Feolo Tim Hefferon John Garner John Lopez Alex Astashyn Shanmuga Chitipiralla Douglas Hoffman Wonhee Jang Brandi Kattman Melissa Landrum Jennifer Lee Adriana Malheiro Wendy Rubinstein George Riley Amanjeev Sethi Ricardo Villamarin Donna Maglott GRC Valerie Schneider (NCBI) The Genome Institute at Washington University The Wellcome Trust Sanger Institute The European Bioinformatics Institute Acknowledgements GeT-RM Lisa Kalman (CDC) Birgit Funke (Harvard) Madhuri Hegde (Emory) Maryam Halavi Chao Chen Jon Trow Douglas Slotta Peter Meric Daniel Frishberg Victor Ananiev
  • Phenotypes Variation
  • Why should you care about the Reference Assembly?
  • Genes, NCBI Homo sapiens Annotation Release 105 Transcript CDS dbSNP Build 138 using annotation release 104
  • What is the Reference Assembly?
  • An assembly is a MODEL of the genome
  • BAC insert BAC vector Shotgun sequence Assemble GAPS “finishers” go in to manually fill the gaps, often by PCR
  • RP11-34P13 64E8 Gaps RP4-669L17 RP5-857K21 RP11-206L10 RP11-54O7
  • GRCh37 (hg19) NCBI36 (hg18)
  • AL139246.20 NCBI35 (hg17) AL139246.21 GRCh37 (hg19)
  • Build sequence contigs based on contigs defined in TPF (Tiling Path File). Check for orientation consistencies Select switch points Instantiate sequence for further analysis Switch point Consensus sequence
  • NCBI36
  • nsv832911 (nstd68) Submitted on NCBI35 (hg17)
  • NCBI35 (hg17) Tiling Path Moved approximately 2 Mb distal on chr15 NC_000015.8 (chr15) Gap Inserted GRCh37 (hg19) Tiling Path NC_000015.9 (chr15) HG-24 Removed from assembly Added to assembly
  • Sequences from haplotype 1 Sequences from haplotype 2 Old Assembly model: compress into a consensus New Assembly model: represent both haplotypes
  • nsv532126 (nstd37) NCBI36 NC_000004.10 (chr4) Tiling Path AC079749.5 AC074378.4 AC147055.2 AC134921.2 AC019173.4 AC140484.1 AC021146.7 AC093720.2 TMPRSS11E2 TMPRSS11E GRCh37 NC_000004.11 (chr4) Tiling Path AC079749.5 AC074378.4 AC147055.2 AC134921.1 AC021146.7 AC093720.2 TMPRSS11E GRCh37: NT_167250.1 (UGT2B17 alternate locus) AC021146.7 AC019173.4 AC074378.4 AC226496.2 AC140484.1 TMPRSS11E2 Xue Y et al, 2008
  • UGT2B17 MHC MAPT GRCh37 (hg19) 7 alternate haplotypes at the MHC Alternate loci released as: FASTA AGP Alignment to chromosome
  • MHC (chr6) Chr 6 representation (PGF) Alt_Ref_Locus_2 (COX)
  • Variant Calling and the Reference Assembly
  • Part of chr22 assembly Alternate locus for chr22 White: Insertion Black: Deletion Kidd et al, 2007 APOBEC cluster
  • Rawe et al, 2013
  • Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320 NM_031192.3: transcript from C57BL/6J NM_031193.2: transcript from FVB/N 129S6/SvEvTac Alt Locus Alignment Ren1 (allelic) FVB/N Transcript Alignment Ren2 (paralog)
  • Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320 NM_031192.3: transcript from C57BL/6J NM_031193.2: transcript from FVB/N 129S6/SvEvTac Ren1 FVB Ren2 Tx Paralogous diff SNP + Paralogous diff
  • Doggett et al., 2006 Hydin: chr16 (16q22.2) Hydin2: chr1 (1q21.1) Missing in NCBI35/NCBI36 Unlocalized in GRCh37 Finished in GRCh38 Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID (Paralogous) (Allelic) Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID Alignment to Hydin1 CHM1_1.0, >99.9% ID Alignment to Hydin1 CHM1_1.0, >99.9% ID
  • CDC27 1KG Phase 1 Strict accessibility mask SNP (all) SNP (not 1KG)
  • Sudmant et al., 2010
  • GRCh38 is coming (September, 2013)
  • Adding Novel Sequence Karen Miga and Jim Kent arXiv:1307.0035
  • Dennis et al., 2012 1q32 1q21 1p21 1p21 patch alignment to chromosome 1
  • Fixing Rare/Incorrect Bases
  • GRCh37 (current reference assembly) NC_000023.10 (chrX) Preview of GRCh38 (scheduled Fall 2013) NW_003871103.3 TEX28 LOC101060233 (opsin related) TKTL1 LOC101060234 (TEX28 related)
  • FAM23_MRC1 Region, chr10 Segmental Duplications 1KG accessibility Mask Novel Patch 250 kb of artificial duplication
  • Adding Novel Sequence
  • Human Resolved for GRCh38 GRCh37.p13 120 Fix Patches 60 Novel
  • From Assembly 1 <-> Assembly 2 Assembly <-> RefSeqGene/LRG Primary Assembly <-> Alternate loci
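
The assembly-to-assembly mapping named on the last slide rests on alignments between the two sequences. As a rough illustration only, and not the algorithm used by any NCBI tool, the Python sketch below projects a single position through a list of gapless alignment blocks. The `remap_position` function, the `(source_start, target_start, length)` block layout, and the coordinates are hypothetical; it assumes 0-based coordinates, same-strand blocks, and blocks sorted by source start.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical gapless alignment block: (source_start, target_start, length).
Block = Tuple[int, int, int]


def remap_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Project a source-assembly position onto the target assembly.

    Returns None when the position lies outside every aligned block,
    for example in sequence that exists in only one of the assemblies.
    """
    starts = [block[0] for block in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    source_start, target_start, length = blocks[i]
    if pos < source_start + length:
        return target_start + (pos - source_start)
    return None


# Two toy blocks with an unaligned stretch of source sequence between them.
blocks = [(100, 1100, 50), (200, 1250, 30)]
inside_first = remap_position(120, blocks)
in_gap = remap_position(160, blocks)
inside_second = remap_position(205, blocks)
```

A position that returns `None` here is the kind of case the talk warns about: a feature annotated on one assembly may have no counterpart in the other, or may belong on an alternate locus rather than the primary assembly.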