Church sfaf13


Published in: Technology
  • Picture of a really bored teenager?
  • 1000Gs and ENCODE logos: What ties them together? Data analysis was absolutely dependent on the reference assembly.
  • CtgN50 stats here
  • Look up how much novel sequence was added. Across all patches: 35 Mb of sequence added.
  • 44 SNVs between the Ren2 Tx alignment and Primary; 29 of these have rsIDs. Of these 29: 19 have Alt base = Ref (likely paralog diff and no evidence for polymorphism), 9 have Alt base = Tx base (SNP and paralog diff?), and 1 has Alt base != Ref and Alt base != Tx (craziness).
  • Insert dot matrix alignment; pull from assembly-assembly alignments.
  • Daly paper on VNTR
  • For the intermediate build GRCh37B, we are updating a subset of the high-confidence bases, about 1000, as our proof of principle. This panel shows reads from NA12878 aligned to chr. 19 that identify a base with MAF=0 in the LIN37 locus. This base creates a non-consensus splice site. To create accessioned sequence for correcting the reference, we are using cortex_con (Iqbal and Caccamo) to generate mini-contigs (>= 50 bp) from collections of 1kG and RP11 WGS reads, the former selected from random 1kG populations.
  • In ph1, 1000G identified just over 235K bases with an MAF < 0.05. These may represent wrong or rare bases, and the GRC has been urged to change all of these. The GRC decided to take a conservative approach: focus on a high-confidence subset of these bases (provided by the 1kG analysis group: Poplin, Clarke, Streeter). There are 54K of these; 5K are “wrong”; 1.5K overlap a transcript. Criteria for the subset: in the strict accessibility mask; have clone sequence supporting the alt base; no failed variants within 150 bp of the questionable base. We will fix the wrong bases in the set, since those cause the most problems for variant analyses, and will update only those rare bases in the set that have functional effects. We are not updating ph1 indels. Sanger is also doing some independent analyses for bases and indels. The summer will be spent defining the final collection of bases to be updated.
  • Stats for the mini-contigs built for GRCh37B. This slide shows the correction of the LIN37 issues via insertion of two mini-contigs into the tiling path. In GRCh37B, 56 RefSeqs, corresponding to 26 distinct loci, have had their alignments improved by the addition of the mini-contigs. We have some development left for GRCh38: tweaking the process to build contigs that address clustered bases and/or indels, and defining the final set of bases.
  • Alignments refer to pairs of sequences. Once you know how a pair of sequences go together, you can look at stringing the pairs along into a contig. The contig is essentially the consensus sequence that is produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
  • Adding novel sequence for GRCh38. One source of this novel sequence is the 1kG ph1 decoy sequence. Decoy doesn’t provide chromosome context. Thus, if we can place much of the decoy in chromosome context for GRCh38, that adds even more value to the assembly. This slide shows the breakdown of the decoy: by source (bottom), by alignment to GenBank, and by amount and type of repeat. The GRC intends to assess capture by looking at 1kG reads that used to align to the decoy and seeing where they align in the updated assembly.
  • Other portions of the decoy likely represent sequence that belongs in reference assembly gaps. We are aligning all HuRef and ALLPATHS scaffolds to the reference assembly to identify sequences that extend into or span gaps. This slide shows how a combination of HuRef WGS and PCR product closes a gap on chr. 16 and provides complete representation for TMEM114. Analysis of GRCh37B shows: 46 of 73 HuRef scaffold insertions involve decoy. 77 ALLPATHS decoy contigs are being added at 46 gaps.
  • Lastly, some portions of the decoy will represent sequence variants. In these cases, the primary assembly does not need to be changed, but the decoy can be added as a NOVEL patch/alt locus. This slide shows a NOVEL locus that was created to capture a decoy sequence containing 30 kb of additional sequence, which represents a repeat expansion. As of GRCh37.p12, 87 of 781 decoy sequences have been captured in chromosome updates/fix patches or as novel patches/alt loci.
  • There are several mechanisms we can use for capturing decoy. Much of the decoy represents centromeric repeat sequence. In collaboration with Karen Hayden in Jim Kent’s lab at UCSC, the GRC is planning to include modeled centromeric sequences in GRCh38.
  • The reference is not just the chromosome sequences of the primary assembly unit, but also includes the alternate loci and patches, which are used to provide additional sequence representations at selected genomic regions. The GRC has been releasing patches to the human assembly on a quarterly cycle, and we’re now at GRCh37.p12. There are two varieties of patches: FIX patches correct existing assembly problems (the chromosome will update, and the patches will be integrated in GRCh38); NOVEL patches add new sequence representations (these will become alternate loci). This ideogram shows the current distribution of patches and alternate loci, and you can see that many regions have changed since GRCh37. Note that approximately 3% of the current public human assembly GRCh37 is associated with a region that is represented by a patch or alternate locus.
  • Adding NOVEL sequence for GRCh38 doesn’t just mean adding sequence that is completely unrepresented in GRCh37. While many of the NOVEL patches, like the one on the previous slide, represent indels, adding novel sequence also means adding sequence variants for regions too complex to be represented by a single path. There is substantial variation at the LRC/KIR region on chr. 19. As shown on this slide, not only has the GRC replaced the GRCh37 path, which was derived from components from different clone libraries, with a single haplotype path from the CHM1 assembly, it also now has 8 different haplotypes represented as alternate loci. The addition of another 10+ haplotypes at this locus is also under consideration.
  • The excess of red in the cSRA alignment track comes from secondary alignments. Somewhere in the SAM to cSRA conversion it seems that the secondary alignment CIGAR strings got messed up, resulting in what looks like really bad alignments. There’s no way to turn off the display for just the secondary alignments in Gbench. We will have to try and regenerate the cSRA to get rid of these…
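The note on switch points above can be illustrated with a small sketch. This is only a toy model under stated assumptions, not the GRC's actual software: the `Component` class, its field names, and the example sequences are all hypothetical. Each component contributes the slice between its incoming and outgoing switch points, and the slices are joined in tiling-path order to give the contig sequence.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical tiling-path component (names and fields are illustrative)."""
    name: str
    sequence: str        # component sequence, already in tiling-path orientation
    switch_in: int       # 0-based index where we begin using this component
    switch_out: int      # 0-based index where we stop using it (exclusive)


def build_contig(components):
    """Join the [switch_in, switch_out) slice of each component, in order."""
    pieces = []
    for comp in components:
        if not 0 <= comp.switch_in <= comp.switch_out <= len(comp.sequence):
            raise ValueError(f"switch points out of range for {comp.name}")
        pieces.append(comp.sequence[comp.switch_in:comp.switch_out])
    return "".join(pieces)


# Two toy components whose ends overlap by 5 bases ("CGTAA").
# The switch point falls inside the overlap: use A up to index 8,
# then pick up B at the equivalent position (index 3).
a = Component("A", "ACGTACGTAA", switch_in=0, switch_out=8)
b = Component("B", "CGTAATTGGC", switch_in=3, switch_out=10)

print(build_contig([a, b]))  # ACGTACGTAATTGGC
```

Because the switch point sits inside the overlap, moving it to another position within the overlap gives the same contig only when the two components agree there; where they differ, the choice of switch point decides which component's bases end up in the reference.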

    1. 1. Keep Calm and Carry on Sequencing. Deanna M. Church, Staff Scientist, NCBI, @deannachurch
    2. 2. http://genomereference.org Valerie Schneider, NCBI
    3. 3. Photograph: Paul Popper/Popperfoto/Getty Images
    4. 4. GRCh38 is coming (September, 2013)
    5. 5.
    6. 6. [Bar charts: Contig N50 for GRCh37.p12, CHM1.0, HuRef, HsapALLPATHS1 and YH1; second panel shows Contig N50 for CHM1.0, HuRef, HsapALLPATHS1 and YH1 only]
    7. 7. [Bar charts: Number of Contigs for GRCh37.p12, CHM1.0, HuRef, HsapALLPATHS1 and YH1; second panel shows Number of Contigs for GRCh37.p12, CHM1.0, HuRef and HsapALLPATHS1 only]
    8. 8.
    9. 9.
    10. 10. Dennis et al., 2012. 1q32; 1q21; 1p21. 1p21 patch alignment to chromosome 1
    11. 11. Phase 1 strict accessibility mask; SNP (all); SNP (not 1KG)
    12. 12. Sudmant et al., 2010
    13. 13. Kidd et al., 2007. APOBEC cluster. Part of chr22 assembly; Alternate locus for chr22. White: Insertion; Black: Deletion
    14. 14.
    15. 15. Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320. 129S6/SvEvTac tiling path; Alignment to C57BL/6J chr1; B6 Genes; 129S6/SvEvTac Genes; + 32 Kb in 129S6/SvEvTac
    16. 16. Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320. NM_031192.3: transcript from C57BL/6J; NM_031193.2: transcript from FVB/N; 129S6/SvEvTac Alt Locus Alignment (allelic); FVB/N Transcript Alignment (paralog)
    17. 17. 129S6/SvEvTac Ren1; FVB Ren2 Tx; Paralogous diff; SNP + Paralogous diff. Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320. NM_031192.3: transcript from C57BL/6J; NM_031193.2: transcript from FVB/N
    18. 18. An assembly is a MODEL of the genome
    19. 19. Assembly Model
    20. 20. BAC insert; BAC vector; Shotgun sequence; Assemble; GAPS; Finishing
    21. 21. (hg18) GRCh37 (hg19)
    22. 22. NCBI35 (hg17); GRCh37 (hg19); AL139246.20; AL139246.21
    23. 23. Daly et al., 2013
    24. 24.
    25. 25.
    26. 26. Fixing Rare/Incorrect Bases
    27. 27. Fixing Rare/Incorrect Bases
    28. 28. Fixing Rare/Incorrect Bases. GRCh37B Sites for Update: n=1164; Sites with unique successful ctg: 1148 (98.6%); Avg Length: 448 bp; Min/Max Success Length: 51/791 bp; Avg Coverage: 80x; Read Source (all contigs): High coverage 32%, Low coverage 57%, Exome 10%
    29. 29. Build sequence contigs based on contigs defined in TPF (Tiling Path File). Check for orientation consistencies; Select switch points; Instantiate sequence for further analysis. Switch point; Representative chromosome sequence
    30. 30. RP11-34P13; 64E8; RP4-669L17; RP5-857K21; RP11-206L10; RP11-54O7; Gaps
    31. 31. NCBI36
    32. 32. nsv832911 (nstd68) Submitted on NCBI35 (hg17)
    33. 33. NCBI35 (hg17) Tiling Path; GRCh37 (hg19) Tiling Path; Gap Inserted; Moved approximately 2 Mb distal on chr15; NC_000015.8 (chr15); NC_000015.9 (chr15); Removed from assembly; Added to assembly
    34. 34. Sequences from haplotype 1; Sequences from haplotype 2. Old Assembly model: compress into a consensus. New Assembly model: represent both haplotypes
    35. 35. AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7; NCBI36 NC_000004.10 (chr4) Tiling Path; Xue Y et al., 2008; TMPRSS11E, TMPRSS11E2; GRCh37 NC_000004.11 (chr4) Tiling Path; AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7; TMPRSS11E; GRCh37: NT_167250.1 (UGT2B17 alternate locus); AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7; TMPRSS11E2; nsv532126 (nstd37)
    36. 36. Adding Novel Sequence. 1000G ph1 decoy sequence, viewed by: GenBank alignment; Percent Repeat Masker; Repeat Masker type; Sequence Source (HTG, HuRef, ALLPATHS)
    37. 37. Adding Novel Sequence
    38. 38. Adding Novel Sequence
    39. 39. Genovese et al., 2013
    40. 40. Adding Novel Sequence. Karen Hayden and Jim Kent
    41. 41. Human Resolved for GRCh38
    42. 42. Examples
    43. 43. Preview of GRCh38 (scheduled Fall 2013); TEX28; TKTL1; LOC101060233 (opsin related); LOC101060234 (TEX28 related); GRCh37 (current reference assembly); chrX
    44. 44. Hydin: chr16 (16q22.2); Hydin2: chr1 (1q21.1). Missing in NCBI35/NCBI36; Unlocalized in GRCh37; Finished in GRCh38. Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID; Alignment to Hydin1 CHM1_1.0, >99.9% ID; Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID; Alignment to Hydin1 CHM1_1.0, >99.9% ID. Doggett et al., 2006
    45. 45. FAM23_MRC1 Region, chr10; Segmental Duplications; 1KG accessibility Mask; Novel Patch; 250 kb of artificial duplication
    46. 46. Adding Novel Sequence
    47. 47. Richa Agarwala. MHC Alternate locus; Alignment to chr6
    48. 48. Making the assembly accessible to existing tools: masking. Query set: 439,109,084 NA12878 HiSeq reads
    49. 49. Masking effectively blocks alignments in regions with high identity. Simulated reads from GRCh37.p9: Unpaired reads; 101 bp; 1x coverage; Default wgsim parameters. Masking parameters: Percent Id: 100%; Step size: 5 bp; Minimum length: 101 bp; Center SNPs in unmasked regions
    50. 50. Masking improves alignments in regions with alternate loci or patches
    51. 51. NA12878 reads whose best alignment was on an alt/patch in the masked assembly were evaluated for their alignment location when aligned to the primary assembly alone. Masking effectively reduces the increase in NA12878 reads that have alignments with MAPQ=0 that occurs when the full assembly is used as an alignment substrate
    52. 52. GRCh38 is coming (September, 2013)
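
Slides 6 and 7 compare assemblies by contig N50 and by number of contigs. As a reminder of what the N50 statistic measures, here is a minimal sketch of the standard definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the summed contig length. The function name and the example lengths are made up for illustration; they are not values from the slides.

```python
def n50(lengths):
    """Return the contig N50 for a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Illustrative lengths only (total = 290; 80 + 70 = 150 covers half).
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

N50 and contig count move in opposite directions as an assembly becomes more contiguous, which is why the two slides are shown together: fewer, longer contigs give a higher N50.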