10X Genomics
Novel variants and variant validation
September 2016
2
Partitioning to Linked Reads
1.0 ng input
3
Linked read data
Confidential — Do not distribute
4
Unlinked, unphased short read SNP
5
Linked reads, phased SNP
6
Standard Short Read Alignment
Close Paralogs
Short Reads
Short Read Aligners Cannot Place Reads Correctly
7
Long Ranger – Lariat™ Aligner
1. Confident mapping provides anchors
2. Barcodes recruit short reads into paralogous loci
Close Paralogs
Lariat™ Aligner Correctly Places Short Reads Even in Paralogous Loci
Linked-Reads
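The two steps on this slide can be illustrated with a toy sketch. This is not the Lariat implementation: the function name `place_read`, the 50 kb window, and the coordinates are all hypothetical, chosen only to show how confidently mapped reads sharing a barcode could break a tie between paralogous loci.

```python
def place_read(barcode, candidates, anchors, window=50_000):
    """Pick the candidate locus with the most same-barcode anchors nearby.

    candidates: list of (chrom, pos) loci where the read aligns equally well.
    anchors: dict mapping barcode -> list of (chrom, pos) for confidently
             mapped reads (hypothetical structure, for illustration only).
    Returns the winning candidate, or None if the barcode does not break the tie.
    """
    barcode_anchors = anchors.get(barcode, [])
    support = [
        sum(1 for a_chrom, a_pos in barcode_anchors
            if a_chrom == chrom and abs(a_pos - pos) <= window)
        for chrom, pos in candidates
    ]
    best = max(support, default=0)
    if best == 0 or support.count(best) > 1:
        return None  # no anchor support, or still ambiguous
    return candidates[support.index(best)]


# Toy coordinates: two paralogous loci on the same chromosome.
anchors = {"BC1": [("chr5", 1_990_000), ("chr5", 2_010_000)]}
candidates = [("chr5", 1_000_000), ("chr5", 2_000_000)]
print(place_read("BC1", candidates, anchors))
```

A read whose barcode has no anchors near either locus stays unplaced (`None`), which mirrors the situation a standard short read aligner faces for every read in a close paralog.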
8
Improved alignment leads to improved variant calling
• SMN1 and SMN2: part of an inverted tandem duplication on chr5
– Differ by 8 nucleotides (3 exonic)
• SMN1: causative gene for spinal muscular atrophy
• SMN2: low-function copy, not disease-causing
[Figure: Haplotype 1 and Haplotype 2 reads at SMN2, Standard Genome vs. Chromium Genome; NA12878 WGS, 128 Gb]
9
Inference
[Figure: chr1, chr3, chr5, chr11, chr13; labels: source, sink]
• For every active alignment in the source whose read also has an alignment in the sink, switch the alignment in the sink to active and score probabilistically. If the source is left with few or no active alignments, the score goes up.
10
Inference
[Figure: chr1, chr3, chr5, chr11, chr13; labels: source, sink]
• This source is also now inactive.
11
Inference
[Figure: chr1, chr3, chr5, chr11, chr13]
• Fast forward and we have the following active molecules left.
12
• Called by 10X data, not in GIAB 3.2.2 (whole genome, not restricted to confident regions)
• Validated with PacBio, requiring > 2 reads supporting the alt allele and > 15% allele fraction
• In regions with PacBio coverage >= 12, validation rates are 94% for 10X and 89% for TruSeq.
Novel variants
             10X    TruSeq   Diff   10X validated   TruSeq validated   Diff
SNPs         335k   292k     43k    289k            237k               52k
Deletions    76k    56k      20k    73k             54k                19k
Insertions   59k    43k      16k    58k             42k                16k
Total        470k   391k     79k    420k            333k               87k
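The validation criteria on this slide can be written down as a small sketch. It assumes that "> 2 alt alleles supported" means more than two PacBio reads supporting the alt allele, and that sites with PacBio coverage below 12 are simply excluded from the rate. The function names and the example site counts are hypothetical, not data from the talk.

```python
def is_validated(alt_reads, total_reads):
    """Apply the slide's thresholds: > 2 alt-supporting reads and > 15% allele fraction."""
    if total_reads == 0:
        return False
    return alt_reads > 2 and alt_reads / total_reads > 0.15


def validation_rate(sites, min_coverage=12):
    """Fraction of validated sites among those with coverage >= min_coverage.

    sites: iterable of (alt_reads, total_reads) pairs.
    Returns None when no site meets the coverage cutoff (assumed behavior).
    """
    eligible = [(alt, total) for alt, total in sites if total >= min_coverage]
    if not eligible:
        return None
    return sum(is_validated(alt, total) for alt, total in eligible) / len(eligible)


# Made-up example sites: (alt-supporting reads, total PacBio reads).
sites = [(3, 12), (2, 12), (3, 30), (5, 10)]
print(validation_rate(sites))
```

In this example the last site is dropped for low coverage, the second fails the read-count threshold, and the third fails the allele-fraction threshold.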
13
Novel variant validation method
• PacBio validation – align PacBio reads to the reference, then align them to the reference with the alt allele in place of the reference allele. Count a read as support only if one alignment scores higher than the other.
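A minimal sketch of the ref-versus-alt comparison described above, under stated assumptions: a toy global alignment score (Needleman-Wunsch with match +1, mismatch -1, gap -1) stands in for a real PacBio aligner, and the helper names `apply_variant`, `align_score`, and `classify_read` are hypothetical.

```python
def apply_variant(ref, pos, ref_allele, alt_allele):
    """Return the reference with alt_allele substituted for ref_allele at pos."""
    assert ref[pos:pos + len(ref_allele)] == ref_allele, "ref allele mismatch"
    return ref[:pos] + alt_allele + ref[pos + len(ref_allele):]


def align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score with a linear gap penalty (toy scoring scheme)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def classify_read(read, ref, alt):
    """Return 'ref' or 'alt' for the better-scoring haplotype, None on a tie."""
    ref_score, alt_score = align_score(read, ref), align_score(read, alt)
    if alt_score > ref_score:
        return "alt"
    if ref_score > alt_score:
        return "ref"
    return None  # equal scores: the read supports neither allele


ref = "ACGTACGTAC"
alt = apply_variant(ref, 4, "A", "G")  # SNP A>G at 0-based position 4
print(classify_read("ACGTGCGTAC", ref, alt))
```

Reads that score the same against both sequences return `None`, so they count toward neither allele.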
• Can we validate this validation method?
• Sensitivity of validation in confident region
• Negative predictive value of “random” mutations
• For SNPs, random is straightforward (could include Ti/Tv bias)
• For indels
  – Pick length from geometric distribution
  – For deletions, the alt allele is trivial
  – For insertions, the alt allele used is the bases in the reference at that locus, repeated
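The "random" mutation scheme above could be sketched as follows. The success probability `p=0.5`, the uniform choice of SNP alt base (no Ti/Tv bias), and the allele representation (empty string for the missing side of an indel) are assumptions made for illustration. The slides do not give these details.

```python
import random

BASES = "ACGT"


def geometric_length(rng, p=0.5):
    """Sample an indel length >= 1 from a geometric distribution (p is assumed)."""
    length = 1
    while rng.random() >= p:
        length += 1
    return length


def random_snp(ref, pos, rng):
    """Pick a uniformly random alt base different from the reference base."""
    ref_base = ref[pos]
    alt_base = rng.choice([b for b in BASES if b != ref_base])
    return pos, ref_base, alt_base


def random_deletion(ref, pos, rng, p=0.5):
    """Delete a geometric-length run of reference bases; the alt allele is empty."""
    length = min(geometric_length(rng, p), len(ref) - pos)
    return pos, ref[pos:pos + length], ""


def random_insertion(ref, pos, rng, p=0.5):
    """Insert a copy of the reference bases at the locus (a tandem repeat)."""
    length = min(geometric_length(rng, p), len(ref) - pos)
    return pos, "", ref[pos:pos + length]


def mutate(ref, pos, ref_allele, alt_allele):
    """Apply a (pos, ref_allele, alt_allele) mutation to the reference string."""
    return ref[:pos] + alt_allele + ref[pos + len(ref_allele):]


rng = random.Random(0)  # fixed seed so the example is reproducible
ref = "ACGTACGTACGT"
for make in (random_snp, random_deletion, random_insertion):
    pos, ref_allele, alt_allele = make(ref, 4, rng)
    print(make.__name__, repr(ref_allele), repr(alt_allele),
          mutate(ref, pos, ref_allele, alt_allele))
```

Running such mutations through the PacBio validation step and counting how many are rejected gives an estimate of its negative predictive value.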
14
Acknowledgements and references
• Entire 10X team, especially Patrick Marks and Deanna Church
• GIAB workshop organizers
1. Zheng, Grace XY, et al. "Haplotyping germline and cancer genomes with high-throughput linked-read sequencing." Nature Biotechnology (2016).
2. Samonte, Rhea Vallente, and Evan E. Eichler. "Segmental duplications and the evolution of the primate genome." Nature Reviews Genetics 3.1 (2002): 65-72.
3. Bishara, A., et al. "Read clouds uncover variation in complex regions of the human genome." Genome Research 25 (2015): 1570-1580.
4. Li, Heng, and Richard Durbin. "Fast and accurate short read alignment with Burrows–Wheeler transform." Bioinformatics 25.14 (2009): 1754-1760.
15
Addendum
16
SNP validation validation 
Used for validation
17
Deletion validation validation
Used for validation
18
Insertion validation validation
Used for validation
