2. Acknowledgements
Personalis
Jason Harris
Sarah Garcia
Jeanie Tirch
Gabor Bartha
Mark Pratt
Scott Kirk
Michael Clark
Rich Chen
John West
Genome Reference Consortium
Personalis, Inc. | Confidential 2 and Proprietary
NCBI
Valerie Schneider
Nathan Bouk
Terence Murphy
Alex Astashyn
Donna Maglott
Melissa Landrum
Wendy Rubinstein
Jennifer Lee
3. Who we are
Inherited Disease Diagnostics
Cancer Services
ACE Platform
Research Services
4. Accuracy is key to what we do
Novel 2bp deletion in GATAD2B
Called by GATK as paternally inherited
• Both affected children
– Macrocephaly
– Low muscle tone, hypotonia
– Delay in early milestones
– Dysphagia
– Esotropia
• Affected Male (3 yr)
– Intellectual disability
– Mild hearing loss
– High arched palate
– Small cyst near eye
• Affected Female (15 mo)
– Sleep apnea
– Failure to thrive
– Laryngomalacia
– Anisocoria
– Small optic nerves
Case courtesy of Geisinger Health System
5. Accuracy is key to what we do
Sample    GATK-determined genotype   Ref   Alt   Depth   Allele freq.
Father    0/1                        108   11    121     0.09
Mother    0/0                        111   0     112     0.00
Brother   0/1                        63    44    109     0.40
Sister    0/1                        64    52    119     0.44
Case courtesy of Geisinger Health System
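The allele frequencies on this slide are the alt read count divided by total depth. A minimal Python sketch of that arithmetic follows; the 0.25 cutoff and the helper names are illustrative assumptions, not part of the GATK output or the original analysis.

```python
from dataclasses import dataclass


@dataclass
class SampleCall:
    sample: str
    genotype: str
    ref_reads: int
    alt_reads: int
    depth: int

    @property
    def alt_fraction(self) -> float:
        # Alt-supporting reads as a fraction of total depth at the site.
        return self.alt_reads / self.depth if self.depth else 0.0


# Read counts copied from the slide.
CALLS = [
    SampleCall("Father", "0/1", 108, 11, 121),
    SampleCall("Mother", "0/0", 111, 0, 112),
    SampleCall("Brother", "0/1", 63, 44, 109),
    SampleCall("Sister", "0/1", 64, 52, 119),
]

# Hypothetical cutoff: a germline heterozygote is expected near 0.5,
# so a 0/1 call far below that deserves a second look.
LOW_FRACTION = 0.25


def flag_low_fraction_hets(calls, threshold=LOW_FRACTION):
    """Return samples called 0/1 whose alt fraction is below the threshold."""
    return [
        c.sample
        for c in calls
        if c.genotype == "0/1" and 0 < c.alt_fraction < threshold
    ]


if __name__ == "__main__":
    for call in CALLS:
        print(f"{call.sample}\t{call.genotype}\t{call.alt_fraction:.2f}")
    print("Review:", flag_low_fraction_hets(CALLS))
```

With the slide's numbers, only the father's heterozygous call falls below the illustrative cutoff.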
6. Excitement about GRCh38
DPYD
GGAACGCAG
GGAACACAG
R->C
Alt loci
Model Centromere Sequences
Miga et al., 2014
7. Medical content not on chromosome sequences
8. Medical content not on chromosome sequences
GRCh37
NT_113939: chr19 unlocalized contig
GRCh38
9. Medical content not on chromosome sequences
NT_167246.2: MHC alternate locus
Sparse SNP annotation
No SNP annotation
10. By any other name
chr19 vs 19
GenBank: CM000681.2
RefSeq: NC_000019.10
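One way to cope with the same sequence carrying several names is an explicit alias table. The sketch below is a hedged illustration in Python: the style labels ("ucsc", "ensembl", "refseq") and the function name are assumptions made for this example, and only the chr19 names shown on this slide are filled in.

```python
# Alias table keyed by naming style. Only chromosome 19 is listed here;
# a real table would cover every sequence in the assembly.
ALIASES = [
    {"ucsc": "chr19", "ensembl": "19", "refseq": "NC_000019.10"},
]

# Reverse index: any known name -> the full alias record.
_BY_NAME = {name: record for record in ALIASES for name in record.values()}


def rename_sequence(name: str, target_style: str) -> str:
    """Translate a sequence name into the requested naming style.

    Raises KeyError for unknown names or styles instead of guessing,
    because a silent mismatch would drop or misplace variants.
    """
    record = _BY_NAME[name]
    return record[target_style]


if __name__ == "__main__":
    print(rename_sequence("19", "ucsc"))
    print(rename_sequence("chr19", "refseq"))
```

Failing loudly on an unknown name is a deliberate choice: tools that skip unmatched sequence names quietly are a common source of missing calls.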
11. By any other name
chr19_KI270938v1_alt
CHR_HSCHR19KIR_G248_BA2_HAP_CTG3_1
GenBank: KI270886.1
RefSeq: NT_187640.1
12. Unflattening the data
MICB
Reporting formats (GFF, VCF, etc.) don’t manage multiple locations easily
15. Using Fix patches to improve alignments
Incremental steps: using fix patches
16. Migrating to GRCh38: using Fix patches
Fix patch
hs37d5
17. Migrating to GRCh38: using Fix patches
Fix patch
hs37d5
18. Migrating to GRCh38: using Fix patches
GRCh37 vs. Fix Patch
GRCh38
19. GRCh37.p13 Improved alignments outside of fix patch regions
Jason Harris
Regions outside of fix patches
Panels: hs37d5 vs. GRCh37.p13 (two pairs)
378 10 kb windows that don’t overlap fix patches have >10 SNV call differences
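A count like this can be produced by binning SNV call differences into fixed windows. The following Python sketch is one possible approach under stated assumptions: calls are represented as (chrom, pos, alt) tuples, patch regions as half-open (chrom, start, end) intervals, and the function name is hypothetical. It is not the pipeline used for the slide.

```python
from collections import Counter

WINDOW = 10_000   # window size in bases (10 kb, as on the slide)
MIN_DIFFS = 10    # report windows with more than this many differences


def discordant_windows(calls_a, calls_b, patches=(), window=WINDOW,
                       min_diffs=MIN_DIFFS):
    """Find windows where two SNV call sets disagree heavily.

    calls_a, calls_b: sets of (chrom, pos, alt) tuples from the two runs.
    patches: iterable of (chrom, start, end) half-open fix patch regions;
             windows overlapping any of them are skipped.
    Returns a sorted list of (chrom, window_start, window_end, n_diffs).
    """
    # A call difference is any SNV present in exactly one of the two sets.
    diffs = calls_a ^ calls_b
    counts = Counter((chrom, pos // window) for chrom, pos, _alt in diffs)

    flagged = []
    for (chrom, index), n_diffs in sorted(counts.items()):
        start, end = index * window, (index + 1) * window
        overlaps_patch = any(
            p_chrom == chrom and p_start < end and p_end > start
            for p_chrom, p_start, p_end in patches
        )
        if n_diffs > min_diffs and not overlaps_patch:
            flagged.append((chrom, start, end, n_diffs))
    return flagged
```

Windows here are fixed and non-overlapping; a sliding window would find more boundary cases at the cost of double counting.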
20. GRCh37.p13 Improved alignments outside of fix patch regions
Jason Harris
Panels: hs37d5 vs. GRCh37.p13 (three pairs)
21. Using Fix patches
22. Aligning GRCh37 and GRCh38
Seq in assembly 1: A, B, B’
Seq in assembly 2: A, B
A: unique, well-aligned region in both assemblies.
First Pass (FP) alignments
Second Pass (SP) alignments
SP only: expansion in assembly 1
SP + FP: collapse in assembly 2
23. Aligning GRCh37 and GRCh38
24. Mapping to GRCh38
25. Mapping to GRCh38
Dataset        Starting loci   Failure   Unique to Primary   Unique to Alts   Collapse in GRCh37   Collapse in GRCh38
GWAS catalog   7,991           0         7,827               0                14                   0
ClinVar*       88,343          3         86,549              5                278                  4
GO-ESP 6500    1,982,177       180       1,920,864           339              5,792                324
GIAB           2,915,713       274       2,874,786           47               1,662                4
NCBI assembly-assembly alignments from:
Sept 20, 2014, software version 1.7
*clinvar_20140902.vcf
26. Remap vs. liftOver
liftOver-dbSNP remap
rs141109950
chr7
27. Remap vs. liftOver
rs267602252
remap liftOver
28. Migrating to GRCh38
First Pass remap Second Pass remap
29. Migrating to GRCh38
New PRODH paralog
Sequence is unlocalized on chr22.
30. Using GRCh38 to improve GRCh37 annotation
KCNE1
Alignment to new paralog added in GRCh38
31. Getting the most out of the reference
Still challenging because tools and data structures expect a flat assembly
Remap/liftOver not the final answer for moving variation
Even modest changes (via fix patches) are promising
Editor's Notes
Sarah mosaicism/UDP slide
Mutations in DPYD result in dihydropyrimidine dehydrogenase deficiency, an error in pyrimidine metabolism associated with thymine-uraciluria and an increased risk of toxicity in cancer patients receiving 5-fluorouracil.
Replace this with protein coding info and stats? And Valerie’s poster
We can also see improvements outside of fix patch regions. Here we see another normalized read plot, with blue representing GRCh37 and green showing alignments to our fix patch version. Not only do we see alignment improvements, but these carry through to variant calling. We have identified 378 10 kb windows that don’t overlap fix patches but have more than 10 SNV call differences. Here is one such example, where a seemingly SNP-dense region with many heterozygous SNPs now looks much cleaner and has no heterozygous SNPs.
The NCBI assembly-assembly alignment process uses a two-step approach. In the first pass, a set of heuristics, including assembly structure, is used to generate a set of essentially reciprocal best hits. In the second pass, the process looks for regions greater than 5 kb in each assembly and tries to recover alignments in these regions. A report is produced that marks whether a region is in a first pass or second pass alignment; analyzing this report can identify regions that are likely expanded or collapsed in an assembly.
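The reading of first pass and second pass coverage described here can be sketched in a few lines of Python. The record layout below, (assembly, region, pass label), is a hypothetical simplification for illustration and is not the format of the NCBI report; the toy records mirror the A, B, B’ diagram on slide 22.

```python
from collections import defaultdict

# Toy records mirroring the slide 22 diagram: B' in assembly 1 aligns to B
# in assembly 2 only in the second pass.
RECORDS = [
    ("assembly1", "A", "FP"), ("assembly2", "A", "FP"),
    ("assembly1", "B", "FP"), ("assembly2", "B", "FP"),
    ("assembly1", "B'", "SP"), ("assembly2", "B", "SP"),
]


def classify_regions(records):
    """Label each (assembly, region) by the alignment passes that cover it."""
    passes = defaultdict(set)
    for assembly, region, pass_label in records:
        if pass_label not in ("FP", "SP"):
            raise ValueError(f"unknown pass label: {pass_label!r}")
        passes[(assembly, region)].add(pass_label)

    labels = {}
    for key, seen in passes.items():
        if seen == {"FP"}:
            labels[key] = "first pass only"
        elif seen == {"SP"}:
            # Extra copy with no reciprocal best hit in the other assembly.
            labels[key] = "SP only: candidate expansion in this assembly"
        else:
            # One locus here absorbs two loci from the other assembly.
            labels[key] = "SP + FP: candidate collapse in this assembly"
    return labels


if __name__ == "__main__":
    for (assembly, region), label in sorted(classify_regions(RECORDS).items()):
        print(assembly, region, label, sep="\t")
```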
We can use this data to plot the amount of collapse in each assembly, expressed as a percentage of each chromosome. It is worth noting that there is some collapse in GRCh38, which is expected because several misassembled regions contained more than one haplotype and these haplotype expansions were removed in GRCh38. However, the landscape is dominated by collapse in GRCh37. Variants called within these regions in GRCh37 are candidates for false positive variant calls.
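The per-chromosome percentage can be computed by merging the collapsed intervals and dividing the covered bases by the chromosome length. This Python sketch assumes half-open (chrom, start, end) intervals and a dictionary of chromosome lengths; both inputs and the function name are hypothetical, and the example values are toy numbers rather than real assembly data.

```python
def percent_collapsed(intervals, chrom_lengths):
    """Percent of each chromosome covered by collapsed intervals.

    intervals: iterable of (chrom, start, end), half-open coordinates.
    chrom_lengths: dict mapping chromosome name to its length in bases.
    Overlapping intervals are merged so no base is counted twice.
    """
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))

    result = {}
    for chrom, length in chrom_lengths.items():
        covered = 0
        cur_start = cur_end = None
        for start, end in sorted(by_chrom.get(chrom, [])):
            if cur_end is None or start > cur_end:
                # Close the previous merged block before starting a new one.
                if cur_end is not None:
                    covered += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        if cur_end is not None:
            covered += cur_end - cur_start
        result[chrom] = 100.0 * covered / length
    return result


if __name__ == "__main__":
    toy_intervals = [("chrA", 0, 100), ("chrA", 50, 200), ("chrB", 0, 10)]
    toy_lengths = {"chrA": 1000, "chrB": 100, "chrC": 500}
    print(percent_collapsed(toy_intervals, toy_lengths))
```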