GIAB Workshop
Len Trigg & Sean Irvine
Phasing NA12878 by segregation in children
● Joint calling of the 17-member CEPH pedigree.
● Benefits:
○ High Mendelian consistency across all members.
○ (Near) full phasing of NA12878 (and NA12877) according
to segregation in the 11 children.
● Latest run incorporates 300x Illumina reads for NA12878
RM8398 sample (other members ~30x).
● Calls that segregate well are more likely to be correct.
● Could look at phasing-inconsistent calls in more detail.
○ Structural variants
○ Somatic variants
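The idea of phasing a parent by segregation can be illustrated with a toy sketch. It assumes that, for two heterozygous sites in one parent, we already know for each child whether that child inherited the parent's ALT allele at each site (for example because the other parent is homozygous reference there). The function name and the transmission pattern below are hypothetical and made up for illustration; this is not the RTG pedigree caller's actual method.

```python
def phase_by_segregation(inherited_alt_a, inherited_alt_b):
    """Decide whether a parent's ALT alleles at two het sites are in cis or trans.

    Each argument holds one boolean per child: True if the child inherited
    the parent's ALT allele at that site. Alleles on the same parental
    haplotype are transmitted together (ignoring recombination).
    """
    if len(inherited_alt_a) != len(inherited_alt_b):
        raise ValueError("both sites need one observation per child")
    same = sum(a == b for a, b in zip(inherited_alt_a, inherited_alt_b))
    diff = len(inherited_alt_a) - same
    if same == diff:
        return None  # children do not favour either phasing
    return "cis" if same > diff else "trans"


# Hypothetical transmission pattern for 11 children (made-up values).
site_a = [True, False, True, True, False, False, True, False, True, True, False]
site_b = list(site_a)               # always co-transmitted with site_a
site_c = [not x for x in site_a]    # never co-transmitted with site_a

print(phase_by_segregation(site_a, site_b))  # cis
print(phase_by_segregation(site_a, site_c))  # trans
```

With many children, a call whose transmission pattern agrees with neither phasing stands out, which is one way to read "calls that segregate well are more likely to be correct".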
Concordance of NA12878 with GIAB 3.2.2
NA24385, RTG on 10X Genomics Chromium
Unifying call sets
Different callers, different representations.
Different samples, different representations.
Given some number of call sets, represent the calls in as
consistent a manner as possible.
● Incrementally accumulate alleles from call sets.
● Recode call sets using accumulated alleles.
● Harmonization rather than canonicalization (the chosen
representation comes from within the call sets rather than
being externally specified).
Example: chr20, NA12878
Example: Harmonization of AJ trio
Example from v3.3 AJ trio
3 non-Mendelian calls become consistent on recoding.
12 original alleles recoded into 6 alleles.
Original
Position    REF     ALT  child  mother  father
1:73974514  GAACCC  G    .      0|1     .
1:73974515  A       T    0/1    .       .
1:73974516  ACCC    A    0/1    .       .
1:73974520  TC      T    .      0|1     .
1:73974521  CATA    C    0/1    .       .
1:73974524  A       C    .      0|1     .

Recoded
Position    REF     ALT  child  mother  father
1:73974515  A       T    0/1    0/1     .
1:73974516  ACCC    A    0/1    0/1     .
1:73974521  CATA    C    0/1    0/1     .
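Why the two representations are interchangeable can be checked by replaying each set of variants onto the reference. The sketch below reconstructs the reference window 1:73974514-73974524 from the REF alleles in the table, and assumes the child's three unphased ALT alleles lie on the same haplotype. The function name is illustrative, not part of any tool.

```python
REF_START = 73974514
# Reference bases 73974514..73974524, pieced together from the REF alleles above.
REF_WINDOW = "GAACCCTCATA"


def replay(variants, ref=REF_WINDOW, start=REF_START):
    """Apply non-overlapping (pos, ref_allele, alt_allele) variants to the window."""
    out = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        offset = pos - start
        if offset < cursor:
            raise ValueError("overlapping variants")
        if ref[offset:offset + len(ref_allele)] != ref_allele:
            raise ValueError("REF allele does not match the reference window")
        out.append(ref[cursor:offset])   # untouched reference bases
        out.append(alt_allele)           # substituted allele
        cursor = offset + len(ref_allele)
    out.append(ref[cursor:])
    return "".join(out)


# ALT haplotype as originally called in the mother (all 0|1).
mother_original = [
    (73974514, "GAACCC", "G"),
    (73974520, "TC", "T"),
    (73974524, "A", "C"),
]
# ALT haplotype as called in the child, which is also the recoded representation.
child_representation = [
    (73974515, "A", "T"),
    (73974516, "ACCC", "A"),
    (73974521, "CATA", "C"),
]

print(replay(mother_original))       # GTATC
print(replay(child_representation))  # GTATC
```

Both sets of records spell the same haplotype, which is why the mother's three calls can be recoded onto the child's three records.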
Notes and Limitations
● Recoding loses existing annotations. Could recover in
simple cases, but not clear what to do when calls are
moved, split, or combined as a result of the recoding.
● If a new call set needs to be added, the new sample can be
accumulated incrementally, but the existing call sets will need
to be recoded.
● Final result is dependent on the order in which call sets are
accumulated.
● Minimizes number of alleles (can in rare cases introduce
Mendelian violations).
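The Mendelian consistency mentioned here and in the AJ trio example can be checked record by record. The sketch below is a minimal trio check that assumes a missing genotype (".") means homozygous reference; that is an assumption made for illustration, and the function names are hypothetical. Genotypes are taken from the AJ trio example as (child, mother, father).

```python
def parse_gt(gt):
    """Return the two allele indices of a diploid genotype; '.' is treated as 0/0."""
    if gt == ".":
        return (0, 0)  # assumption: missing call means homozygous reference
    first, second = gt.replace("|", "/").split("/")
    return (int(first), int(second))


def is_mendelian(child, mother, father):
    """True if one child allele can come from the mother and the other from the father."""
    c, m, f = parse_gt(child), parse_gt(mother), parse_gt(father)
    return (c[0] in m and c[1] in f) or (c[1] in m and c[0] in f)


def count_violations(records):
    return sum(not is_mendelian(*gts) for gts in records)


ORIGINAL = [
    (".", "0|1", "."),
    ("0/1", ".", "."),
    ("0/1", ".", "."),
    (".", "0|1", "."),
    ("0/1", ".", "."),
    (".", "0|1", "."),
]
RECODED = [
    ("0/1", "0/1", "."),
    ("0/1", "0/1", "."),
    ("0/1", "0/1", "."),
]

print(count_violations(ORIGINAL))  # 3
print(count_violations(RECODED))   # 0
```

Under that assumption the three child-only records are the violations, and they disappear once the mother's calls are recoded onto the same records.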
Phase Transfer
Another mode of operation for vcfeval. The phasing in one call
set can be lifted over to another call set without losing
annotations or changing the representation of calls.
[Figure: two phase-transfer examples, labels and values as shown]

v3.3 HG002/NA24385    9.7%
RTG AJ trio 300x      88.1%
phase-transferred     90.2%

chr20 NA12878 GATK    0%
RTG CEPH SP 37.7.0    99.9%
                      89.0%
Illumina PG 8.0.1     99.9%
phase-transferred     90.8%
Phase Transfer
During normal operation vcfeval ignores phasing information
and tries each allele on each haplotype.
During phase transfer vcfeval will obey the phasing of one (or
both) of the samples. Effectively restricts the matches that
can be made. Ideally want at least one sample to be fully
phased.
A special output mode is used to report the phasing found
during the matching. Apart from the phasing, the calls are
not changed and all the original annotations are retained.
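The effect of obeying one sample's phasing can be sketched with a brute-force toy. It is not vcfeval's algorithm or interface: it simply tries each orientation of the unphased calls and keeps the one whose haplotype pair equals that of the phased call set, leaving the records themselves untouched. The reference window is reconstructed from the REF alleles in the AJ trio example, the two sets of records from that example are treated as two call sets for one sample purely for illustration, and all function names are hypothetical.

```python
from itertools import product

REF_START = 73974514
REF_WINDOW = "GAACCCTCATA"  # bases 73974514..73974524, from the example's REF alleles


def replay(variants):
    """Apply non-overlapping (pos, ref, alt) variants to the reference window."""
    out, cursor = [], 0
    for pos, ref_allele, alt_allele in sorted(variants):
        offset = pos - REF_START
        if offset < cursor:
            raise ValueError("overlapping variants")
        out.append(REF_WINDOW[cursor:offset])
        out.append(alt_allele)
        cursor = offset + len(ref_allele)
    out.append(REF_WINDOW[cursor:])
    return "".join(out)


def haplotypes(phased_calls):
    """Return the (left, right) haplotype sequences implied by phased genotypes."""
    haps = []
    for side in (0, 1):
        chosen = [(p, r, a) for p, r, a, gt in phased_calls
                  if gt.split("|")[side] == "1"]
        haps.append(replay(chosen))
    return tuple(haps)


def transfer_phase(phased_baseline, unphased_calls):
    """Phase heterozygous calls so their haplotypes match the baseline's, or return None."""
    target = haplotypes(phased_baseline)
    for orientation in product(("0|1", "1|0"), repeat=len(unphased_calls)):
        candidate = [(p, r, a, gt)
                     for (p, r, a), gt in zip(unphased_calls, orientation)]
        try:
            if haplotypes(candidate) == target:
                return candidate
        except ValueError:
            continue  # this orientation puts overlapping variants on one haplotype
    return None


baseline = [
    (73974514, "GAACCC", "G", "0|1"),
    (73974520, "TC", "T", "0|1"),
    (73974524, "A", "C", "0|1"),
]
unphased = [
    (73974515, "A", "T"),
    (73974516, "ACCC", "A"),
    (73974521, "CATA", "C"),
]

for call in transfer_phase(baseline, unphased):
    print(call)
```

The output keeps the three original records and only replaces the unphased genotypes with phased ones, which mirrors the point that representation and annotations are left as they were.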

Sept2016 smallvar rtg
