1000Gs and ENCODE logos: What ties them together? Data analysis was absolutely dependent on the reference assembly.
CtgN50 stats here
Look up how much novel sequence was added.
Across all patches: 35 Mb of sequence added
44 SNVs between Ren2 Tx alignment and Primary; 29 of these have rsIDs. Of these 29:
• 19: Alt base = Ref (likely paralog diff and no evidence for polymorphism)
• 9: Alt base = Tx base (SNP and paralog diff?)
• 1: Alt base != Ref and Alt base != Tx (craziness)
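The three-way comparison in this note can be written down as a small classifier. This is only a sketch: the function name, the category labels, and the example bases are made up for illustration, and "Alt base" is assumed to mean the base on the alternate locus.

```python
from collections import Counter


def classify_site(ref_base, alt_base, tx_base):
    """Classify one site where the transcript (Tx) differs from the primary (Ref).

    The category names are hypothetical labels, not GRC terminology.
    """
    if ref_base == tx_base:
        raise ValueError("site is not a Tx-vs-Ref difference")
    if alt_base == ref_base:
        # Alt agrees with Ref, so only the paralogous transcript differs.
        return "paralogous_difference"
    if alt_base == tx_base:
        # Alt agrees with Tx: could be a SNP, a paralog difference, or both.
        return "snp_or_paralogous_difference"
    # Alt matches neither sequence.
    return "unexplained"


# Made-up example sites as (ref, alt, tx) triples.
sites = [("A", "A", "G"), ("C", "T", "T"), ("G", "C", "T")]
counts = Counter(classify_site(*site) for site in sites)
```

Summing the counter over the real sites should reproduce the 19 / 9 / 1 split of the 29 SNVs with rsIDs.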
Insert dot matrix alignment: pull from assembly-assembly alignments
Daly paper on VNTR
For the intermediate build GRCh37B, we are updating a subset of the high-confidence bases, about 1000, as our proof-of-principle. This panel shows reads from NA12878 aligned to chr. 19 that identify a base with MAF=0 in the LIN37 locus. This creates a non-consensus splice site.
To create accessioned sequence for correcting the reference, we are using cortex_con (Iqbal and Caccamo) to generate mini-contigs (>= 50 bp) from collections of 1kG and RP11 WGS reads, the former selected from random 1kG populations.
In ph1, 1000G identified just over 235K bases with an MAF < 0.05. These may represent wrong or rare bases, and the GRC has been urged to change all of these.
GRC decided to take a conservative approach:
• Focus on high-confidence subset of these bases (provided by 1kG analysis group: Poplin, Clarke, Streeter): 54K of these; 5K “wrong”; 1.5K overlap a transcript:
  • In strict accessibility mask
  • Have clone sequence supporting alt base
  • No failed variants within 150 bp of questionable base
• Will fix wrong bases in set: those cause the most problems for variant analyses.
• Will update only rare bases in set with functional effects.
• Not updating ph1 indels.
Sanger also doing some independent analyses for bases and indels. Summer will be spent defining the final collection of bases to be updated.
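A rough sketch of how the selection rules above could be expressed as a filter. The record layout and field names are hypothetical, and treating a "wrong" base as one whose reference allele frequency is 0 is an assumption based on the MAF=0 example in the LIN37 note, not a stated GRC definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateBase:
    """Hypothetical record for one questionable reference base."""
    position: int
    ref_allele_freq: float                 # frequency of the reference base
    in_strict_mask: bool                   # inside the strict accessibility mask
    clone_supports_alt: bool               # clone sequence carries the alt base
    nearest_failed_variant: Optional[int]  # distance in bp; None if none known
    has_functional_effect: bool


def is_high_confidence(base, window=150):
    """Apply the three high-confidence criteria from the notes."""
    clear_of_failures = (
        base.nearest_failed_variant is None
        or base.nearest_failed_variant > window
    )
    return base.in_strict_mask and base.clone_supports_alt and clear_of_failures


def should_update(base, rare_cutoff=0.05):
    """Fix 'wrong' bases; update rare bases only if they have functional effects."""
    if not is_high_confidence(base):
        return False
    if base.ref_allele_freq == 0.0:  # assumption: 'wrong' means never observed
        return True
    return base.ref_allele_freq < rare_cutoff and base.has_functional_effect
```

Indels are left out on purpose, matching the decision not to update ph1 indels.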
Stats for the mini-contigs built for GRCh37B.
This slide shows the correction of the LIN37 issues via insertion of two mini-contigs into the tiling path.
In GRCh37B, 56 RefSeqs, corresponding to 26 distinct loci, have had their alignments improved by addition of the mini-contigs.
We have some development left for GRCh38:
• Tweaking process to build contigs that address clustered bases and/or indels
• Define the final set of bases
Alignments refer to pairs of sequences. Once you know how a pair of sequences goes together, you can look at stringing the pairs along into a contig. The contig is essentially the consensus sequence that is produced from the components.
To create a contig, we use the steps shown on this slide.
What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
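To make the switch-point idea concrete, here is a toy sketch. It assumes each component is already oriented and placed at a known offset in contig coordinates, and that each switch point is a contig coordinate inside the overlap of two neighboring components. The function and its inputs are illustrative and are not the GRC pipeline.

```python
def build_contig(components, switch_points):
    """Stitch oriented components into one contig sequence.

    components: list of (offset, sequence) pairs in tiling-path order, where
        offset is the contig coordinate of the component's first base.
    switch_points: contig coordinates; entry i is where we stop reading
        component i and start reading component i + 1.
    """
    if len(switch_points) != len(components) - 1:
        raise ValueError("need exactly one switch point per adjacent pair")
    pieces = []
    begin = components[0][0]
    for i, (offset, seq) in enumerate(components):
        # The last component is used through to its end.
        end = switch_points[i] if i < len(switch_points) else offset + len(seq)
        if not (offset <= begin <= end <= offset + len(seq)):
            raise ValueError(f"switch point falls outside component {i}")
        pieces.append(seq[begin - offset:end - offset])
        begin = end
    return "".join(pieces)


# Two made-up components that overlap on "CGT" at contig coordinates 5-7.
components = [(0, "ACGTACGT"), (5, "CGTTTGCA")]
contig = build_contig(components, switch_points=[6])
```

When the overlap is identical, any switch point inside it gives the same contig; the choice only matters where the components disagree.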
Adding novel sequence for GRCh38.
One source of this novel sequence is the 1kG ph1 decoy sequence. Decoy doesn’t provide chromosome context. Thus, if we can place much of the decoy in chromosome context for GRCh38, that adds even more value to the assembly.
This slide shows the breakdown of the decoy: by source (bottom), by alignment to GenBank, and by amount and type of repeat.
The GRC intends to assess capture by looking at 1kG reads that used to align to the decoy and seeing where they align in the updated assembly.
Other portions of the decoy likely represent sequence that belongs in reference assembly gaps. We are aligning all HuRef and ALLPATHS scaffolds to the reference assembly to identify sequences that extend into or span gaps.
This slide shows how a combination of HuRef WGS and PCR product closes a gap on chr. 16 and provides complete representation for TMEM114.
Analysis of GRCh37B shows:
• 46 of 73 HuRef scaffold insertions involve decoy.
• 77 ALLPATHS decoy contigs are being added at 46 gaps.
Lastly, some portions of the decoy will represent sequence variants. In these cases, the primary assembly does not need to be changed, but the decoy can be added as a NOVEL patch/alt locus.
This slide shows a NOVEL locus that was created to capture a decoy sequence containing 30 kb of additional sequence, which represents a repeat expansion.
As of GRCh37.p12, 87 of 781 decoy sequences have been captured in chromosome updates/fix patches or as novel patches/alt loci.
There are several mechanisms we can use for capturing decoy.
Much of the decoy represents centromeric repeat sequence. In collaboration with Karen Hayden in Jim Kent’s lab at UCSC, the GRC is planning to include modeled centromeric sequences in GRCh38.
The reference is not just the chromosome sequences of the primary assembly unit, but also includes the alternate loci and patches, which are used to provide additional sequence representations at selected genomic regions. The GRC has been releasing patches to the human assembly on a quarterly cycle, and we’re now at GRCh37.p12. There are two varieties of patches:
• FIX patches correct existing assembly problems: chromosome will update, patches integrated in GRCh38
• NOVEL patches add new sequence representations: will become alternate loci
This ideogram shows the current distribution of patches and alternate loci, and you can see that many regions have changed since GRCh37. Note that approximately 3% of the current public human assembly GRCh37 is associated with a region that is represented by a patch or alternate locus.
Adding NOVEL sequence for GRCh38 doesn’t just mean adding sequence that is completely unrepresented in GRCh37. While many of the NOVEL patches, like the one on the previous slide, represent indels, adding novel sequence also means adding sequence variants for regions too complex to be represented by a single path.
There is substantial variation at the LRC/KIR region on chr. 19. As shown on this slide, not only has the GRC replaced the GRCh37 path, which was derived from components from different clone libraries, with a single haplotype path from the CHM1 assembly, it also now has 8 different haplotypes represented as alternate loci. The addition of another 10+ haplotypes at this locus is also under consideration.
The excess of red in the cSRA alignment track comes from secondary alignments. Somewhere in the SAM to cSRA conversion it seems that the secondary alignment CIGAR strings got messed up, resulting in what looks like really bad alignments. There’s no way to turn off the display for just the secondary alignments in Gbench. We will have to try and regenerate the cSRA to get rid of these…
Keep Calm and Carry on Sequencing
Deanna M. Church
Staff Scientist, NCBI
@deannachurch
[Figure: Contig N50 by assembly. Panel 1: GRCh37.p12, CHM1.0, HuRef, HsapALLPATHS1, YH1. Panel 2: CHM1.0, HuRef, HsapALLPATHS1, YH1.]
[Figure: Number of Contigs by assembly. Panel 1: GRCh37.p12, CHM1.0, HuRef, HsapALLPATHS1, YH1. Panel 2: GRCh37.p12, CHM1.0, HuRef, HsapALLPATHS1.]
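For reference, contig N50 (the statistic in these charts) is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembled bases. A minimal sketch follows; the example lengths are made up and are not values from the charts.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings us to at least half the total.
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


example_n50 = n50([2, 3, 4, 5, 6])
```

A handful of very long contigs can give a high N50 even when the assembly also contains many short ones, which is why the contig-count chart sits alongside it.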
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320
129S6/SvEvTac tiling path
Alignment to C57BL/6J chr1
B6 Genes
129S6/SvEvTac Genes
+ 32 kb in 129S6/SvEvTac
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320
NM_031192.3: transcript from C57BL/6J
NM_031193.2: transcript from FVB/N
129S6/SvEvTac Alt Locus Alignment (allelic)
FVB/N Transcript Alignment (paralog)
129S6/SvEvTac Ren1
FVB Ren2 Tx
Paralogous diff
SNP + Paralogous diff
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320
NM_031192.3: transcript from C57BL/6J
NM_031193.2: transcript from FVB/N
Build sequence contigs based on contigs defined in TPF (Tiling Path File).
• Check for orientation consistencies
• Select switch points
• Instantiate sequence for further analysis
Switch point
Representative chromosome sequence
NCBI35 (hg17) Tiling Path
GRCh37 (hg19) Tiling Path
Gap Inserted
Moved approximately 2 Mb distal on chr15
NC_0000015.8 (chr15)
NC_0000015.9 (chr15)
Removed from assembly
Added to assembly
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-24
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7
NCBI36
NC_000004.10 (chr4) Tiling Path
Xue Y et al, 2008
TMPRSS11E, TMPRSS11E2
GRCh37
NC_000004.11 (chr4) Tiling Path
AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7
TMPRSS11E
GRCh37: NT_167250.1 (UGT2B17 alternate locus)
AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7
TMPRSS11E2
nsv532126 (nstd37)
Richa Agarwala
MHC Alternate locus
Alignment to chr6
Making the assembly accessible to existing tools: masking
Query set: 439,109,084 NA12878 HiSeq reads
Masking effectively blocks alignments in regions with high identity
Simulated reads from GRCh37.p9
• Unpaired reads
• 101 bp
• 1x coverage
• Default wgsim parameters
Masking parameters
• Percent Id: 100%
• Step size: 5 bp
• Minimum length: 101 bp
• Center SNPs in unmasked regions
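A simplified sketch of the masking idea, under loud assumptions: the alt/patch sequence and the primary are given as a gapless, equal-length alignment, the 100% identity requirement is modeled as runs of exact matches, and a hypothetical flank parameter stands in for keeping SNPs centered in unmasked sequence. The 5 bp step size is not modeled, and this is not the procedure used to produce the results on the slide.

```python
def mask_identical_runs(alt, primary, min_len=101, flank=0):
    """Replace long exact-match runs in `alt` with N.

    alt, primary: equal-length strings representing a gapless alignment.
    min_len: shortest identical run that gets masked.
    flank: identical bases left unmasked on each side of a mismatch.
    """
    if len(alt) != len(primary):
        raise ValueError("sequences must have equal length (gapless alignment)")
    masked = list(alt)
    n = len(alt)
    i = 0
    while i < n:
        if alt[i] != primary[i]:
            i += 1
            continue
        # Extend the run of identical positions [i, j).
        j = i
        while j < n and alt[j] == primary[j]:
            j += 1
        # Keep a flank next to any neighboring mismatch, not at sequence ends.
        start = i + flank if i > 0 else i
        end = j - flank if j < n else j
        if end - start >= min_len:
            masked[start:end] = "N" * (end - start)
        i = j
    return "".join(masked)


# Toy example: one mismatch in the middle, small thresholds for readability.
alt_seq = "A" * 10 + "C" + "A" * 10
primary_seq = "A" * 10 + "G" + "A" * 10
masked_seq = mask_identical_runs(alt_seq, primary_seq, min_len=5, flank=2)
```

Reads from the identical stretches can then only align to the primary, while the variant position and its flanks stay available on the alt.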
Masking improves alignments in regions with alternate loci or patches
NA12878 reads whose best alignment was on an alt/patch in the masked assembly were evaluated for their alignment location when aligned to the primary assembly alone
Masking effectively reduces the increase in NA12878 reads that have alignments with MAPQ=0 that occurs when the full assembly is used as an alignment substrate