Missing and misassembled sequence in the reference assembly can have dire consequences for genome interpretation. In this example, gene2 is missing from the reference but present in the sample we are analyzing. Regardless of whether gene2 is missing because of an assembly error or because it is polymorphic in the population, the outcome can be the same. In the best-case scenario, reads from gene2 don’t align to the reference and we just can’t analyze gene2. However, if gene2 is related to gene1, we can get off-target alignments that confound analysis of gene1 as well, leading either to under-calling in the region or to inappropriately calling paralogous sequence variants as allelic sequence variants. If we take sequences we know to be missing from GRCh37, simulate reads and then align these reads to GRCh37, we see that 75% of them find an off-target alignment, regardless of the alignment method used. This is why Heng Li created decoy sequences for the 1000 Genomes Project: to reduce off-target alignments. However, we still lack the ability to analyze gene2 in this scenario. This underscores the importance of representing all common human sequences in the reference assembly.
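To make the simulated-read experiment concrete, here is a minimal sketch of how an off-target rate of this kind might be tallied from aligner output. It is not the analysis behind the 75% figure. It assumes single-end reads simulated only from sequence absent from the reference (so any alignment at all counts as off-target) and SAM-formatted text as input. The function name `off_target_fraction` and the example records are hypothetical.

```python
def off_target_fraction(sam_lines):
    """Return the fraction of reads with at least one alignment.

    Assumption: every read was simulated from sequence missing in the
    reference, so any mapped record is an off-target alignment.
    """
    seen = set()      # all read names encountered
    aligned = set()   # read names with at least one mapped record
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seen.add(name)
        if not flag & 0x4:  # SAM FLAG bit 0x4 set means "read unmapped"
            aligned.add(name)
    return len(aligned) / len(seen) if seen else 0.0


# Hypothetical records for four simulated reads (not real data).
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "read1\t256\tchr1\t700\t0\t50M\t*\t0\t0\t*\t*",  # secondary hit, same read
    "read2\t16\tchr1\t200\t0\t50M\t*\t0\t0\t*\t*",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",           # unmapped
    "read4\t0\tchr1\t300\t60\t50M\t*\t0\t0\t*\t*",
]
print(off_target_fraction(example))
```

Counting by read name rather than by record keeps secondary and supplementary alignments from inflating the rate. Paired-end data would need the mate flags folded into the key.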
Mutations in DPYD result in dihydropyrimidine dehydrogenase deficiency, an error of pyrimidine metabolism associated with thymine-uraciluria and an increased risk of toxicity in cancer patients receiving 5-fluorouracil.
Replace this with protein coding info and stats? And Valerie’s poster
The CCL3 region on chromosome 17 allows us to explore two major updates seen in GRCh38, and hopefully will underscore the importance of representing missing paralogs in the reference. This region is known to be copy number variable, with individuals having 0-4 copies of a 90 kb repeat unit. In GRCh37, the region was assembled from several sources that contain different structural variants. This led to the creation of a false gap and a genomic representation that likely does not exist in anyone on the planet. Being able to correctly represent the genomic architecture of this region is important, as there is some evidence, albeit conflicting, that the number of copies of CCL3L1 correlates with HIV infection and progression to AIDS.
To better represent this region, the GRC made a new clone tiling path from a single-haplotype resource derived from a hydatidiform mole. An additional allele, representing a 100 kb insertion, was also generated and placed in the assembly as an alternate locus. The reference assembly now has two correct representations of this region, though we may need more.
For this reason, many people have just ignored these sequences, but doing so in GRCh38 means losing 3.6 Mb of sequence unique to the alternate loci, sequence that contains 153 genes. This graph shows the distribution of the amount of unique sequence per alternate locus. While it is clear they do not all contribute equal amounts of novel sequence, in aggregate the amount is significant. The GRC recently held a workshop to encourage development of new tools that can handle the full assembly, and Heng Li has already distributed a version of BWA-MEM that is alt-locus aware, but we need to do considerable testing and additional development to make sure we are using these sequences correctly. We also need to assess the ramifications of this new structure on other parts of the tool chain.
• Need aligners that can distinguish allelic and
• Need variant callers/modules that can correctly
assign genotypes in complex regions
• Need to extend file formats to accommodate new