140127 rtg phased pedigree analyses

  1. Development & applications of a segregation-phasing ground truth
     GENOME-IN-A-BOTTLE WORKSHOP
     Francisco M. De La Vega, D.Sc., Visiting Scholar, Department of Genetics, Stanford University School of Medicine
     In collaboration with Real Time Genomics, Inc.
  2. Evaluating Variant Calls
     O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Medicine 5, 28 (2013).
  3. Beyond Venn Diagrams
     Experimental validation (e.g. Sanger, qPCR)
       • Expensive
       • Limited by platform success
       • Statistical sample
     Reference orthogonal data available for some genomes
       • SNP array data
       • Sparse fosmid sequencing data
       • Incomplete
     Reference genomes sequenced by multiple platforms
       • Arbitration methods (e.g. NIST, Genome-in-a-Bottle)
       • Low FP, but unknown FN (genome-wide)
       • Biases?
  4. Mendelian segregation as “ground truth”
  5. CEPH/Utah Pedigree 1463
     Sequenced by CGI and Illumina (Platinum Genomes)
     Started with 2x100bp 50X WGS Illumina Platinum data
       • Aligned and variant-called with rtgVariant 1.1, filtered by quality score (AVR≥0.15) across the samples, excluding problematic sites
     [Pedigree diagram; samples: NA12889 NA12890 NA12891 NA12877 NA12879 NA12880 NA12881 NA12882 NA12892 NA12878 NA12883 NA12884 NA12885 NA12886 NA12887 NA12888 NA12893]
  6. Example: Heterozygous variant segregation
     [Pedigree diagram, annotated "Trio Calling"; samples: NA12890 NA12877 NA12891 NA12889 NA12892 NA12878 NA12879 NA12880 NA12881 NA12882 NA12883 NA12884 NA12885 NA12886 NA12887 NA12888 NA12893]
     Genotype labels on the diagram: 0/0 0/1, followed by eleven labels: 0/0 0/1 0/1 0/1 0/1 0/0 0/1 0/0 0/1 0/0 0/0
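
The segregation example on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names are hypothetical, and the assignment of the labels 0/0 and 0/1 to the father and mother is an assumption, since the transcript does not say which label belongs to which sample. The eleven offspring genotypes are taken in the order they appear in the transcript.

```python
from itertools import product

def parse_gt(gt):
    """Split an unphased VCF-style genotype such as '0/1' into a sorted allele tuple."""
    return tuple(sorted(int(a) for a in gt.split("/")))

def mendelian_consistent(father, mother, child):
    """True if the child genotype can be built from one paternal and one maternal allele."""
    f, m, c = parse_gt(father), parse_gt(mother), parse_gt(child)
    return any(tuple(sorted(pair)) == c for pair in product(f, m))

def count_segregating(children, allele=1):
    """Number of offspring carrying at least one copy of the given allele."""
    return sum(allele in parse_gt(gt) for gt in children)

# ASSUMPTION: which parent is 0/0 and which is 0/1 is not stated in the transcript.
father, mother = "0/0", "0/1"
# The eleven offspring labels, in transcript order.
offspring = ["0/0", "0/1", "0/1", "0/1", "0/1", "0/0",
             "0/1", "0/0", "0/1", "0/0", "0/0"]

all_ok = all(mendelian_consistent(father, mother, gt) for gt in offspring)
n_carriers = count_segregating(offspring)
print(all_ok, n_carriers)
```

With these inputs every offspring genotype is compatible with the parents, and six of the eleven offspring carry the alternate allele.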
  7. Segregation of heterozygous variants to offspring
     [Four histogram panels: SNV, All Variants, MNP, indel. x-axis: # of offspring segregating (1 to 11); y-axis: SNV count, variant count, MNP count, indel count]
  8. Steps for haplotype phasing in large family
     • Identify crossovers
     • Phase contiguity extension
     • Connect haplotype islands
     • Check calls vs haplotype framework
  9. Phasing labels given parent and child genotypes
     [Table. Parents: fa/fb, ma/mb. Children: fa/ma, fa/mb, fb/ma, fb/mb. Cells as extracted:]
     0/0 0/1 fa/mb fb/ma fb/mb 0/0 0/1 1/1 fa/ma 0/1 0/1 fa/ma 0/1 0/0 fb/ma fb/mb fa/mb 0/0 2/3 fa/mb fb/mb 0/1 0/2 1/1 1/2 fa/ma 0/1 0/2 fb/ma 1/2 0/1 fa/ma 0/1 1/2 fa/mb fb/ma fb/mb 0/2 0/3 1/2 1/3 fa/ma fa/mb fb/ma fb/mb
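
The idea behind the phasing-label table can be sketched as follows. This is a hedged illustration rather than the rtgVariant implementation: `phasing_labels` is a hypothetical helper, and it assumes the father's haplotypes are named fa and fb and the mother's ma and mb, as on the slide. For an unphased child genotype it lists every paternal/maternal haplotype pairing that reproduces that genotype.

```python
from itertools import product

def phasing_labels(father, mother, child):
    """Return the haplotype pairings (e.g. 'fa/mb') that yield the child genotype.

    father = (allele on fa, allele on fb); mother = (allele on ma, allele on mb);
    child  = unphased genotype as a pair of alleles.
    """
    target = tuple(sorted(child))
    labels = []
    paternal = zip(("fa", "fb"), father)
    maternal = zip(("ma", "mb"), mother)
    for (f_name, f_allele), (m_name, m_allele) in product(paternal, maternal):
        if tuple(sorted((f_allele, m_allele))) == target:
            labels.append(f"{f_name}/{m_name}")
    return labels

# Father 0/0, mother 0/1: a 0/1 child must carry the mother's mb haplotype.
het_child = phasing_labels((0, 0), (0, 1), (0, 1))
# Father 0/1, mother 2/3: all four alleles differ, so the label is unique.
unique_child = phasing_labels((0, 1), (2, 3), (1, 2))
print(het_child, unique_child)
```

When both parents carry distinct alleles, each child genotype maps to exactly one label, which is what makes multi-allelic sites especially informative for phasing.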
  10. Identification of recombination crossovers
      [Figure panels: Chr 1, Mother; Chr 6, Mother]
  11. Recombination crossovers statistics
      [Bar chart: crossovers per chromosome (1 to 22), Father and Mother series; y-axis 0 to 45. Total: 686]
  12. Linking of phased regions
      [Figure panels: Chr 1, Mother; Chr 6, Mother]
  13. Testing for Phase Consistency
      Example with 4 offspring
      [Table of phasing labels (fa, fb, ma, mb), genotypes and candidate phasings for Father, Mother and Offspring 1 to 4. Cells as extracted:]
      Father Phasing Labels fa Phasings Genotypes fb ma 0/1 0 0 1 1 Genotypes Phasings Mother mb Offspring 2 Offspring 3 Offspring 4 fa fa fb fb 0/1 1 1 0 0 0 1 0 1 0/0 0 0 Offspring 1 0/1 1 0 1 0 0 0 1 1 0/1 0 0 0 1 ma 0/0 0 1 0 1 0 0 1 1 0/0 1 0 0 0 mb 1/1 1 0 1 0 1 1 0 0 0/1 0 1 0 0 ma 0/1 0 1 0 1 1 1 0 0 0/0 1 0 0 0 mb 1 0 1 0 0/1 0 1 0 0 1 0
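
One way to read the phase-consistency test is sketched below. It is an interpretation, not the published algorithm: it assumes that the haplotype framework already tells us which paternal (fa or fb) and maternal (ma or mb) haplotype each child inherited at the site, and it calls the site phase-consistent if some assignment of the parents' alleles to those four haplotypes reproduces every child's genotype. The function name and the input layout are hypothetical.

```python
from itertools import permutations

def phase_consistent(father_gt, mother_gt, children):
    """Search for an allele-to-haplotype assignment that explains all genotypes.

    father_gt, mother_gt: unphased parental genotypes as allele pairs.
    children: list of ((paternal_label, maternal_label), genotype) entries, where
              the labels come from the (assumed) haplotype framework.
    Returns the assignment as a dict, or None if no assignment fits.
    """
    for fa, fb in set(permutations(father_gt)):
        for ma, mb in set(permutations(mother_gt)):
            hap = {"fa": fa, "fb": fb, "ma": ma, "mb": mb}
            if all(sorted((hap[p], hap[m])) == sorted(gt)
                   for (p, m), gt in children):
                return hap
    return None

# Four offspring, father 0/1, mother 0/0 (illustrative values, not the slide's table).
kids = [(("fa", "ma"), (0, 1)), (("fa", "mb"), (0, 1)),
        (("fb", "ma"), (0, 0)), (("fb", "mb"), (0, 0))]
consistent = phase_consistent((0, 1), (0, 0), kids)

# Changing the last child to 0/1 forces fb to be both 0 and 1: inconsistent.
bad_kids = kids[:3] + [(("fb", "mb"), (0, 1))]
inconsistent = phase_consistent((0, 1), (0, 0), bad_kids)
print(consistent, inconsistent)
```

The exhaustive search tries at most four orderings per site, so it stays cheap even when applied genome-wide.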
  14. Probability of a set of genotypes being phase-consistent by chance
      Given that there are d different genotypes across both the parents and children and that the number of times each of these genotypes occurs is n_i and [term missing], then the probability is: [formula not preserved in the transcript]
      Cleary, J. G., et al. Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data. bioRxiv (2014). doi:10.1101/001958
  15. Probability of a set of genotypes being phase-consistent by chance – some examples
      [Table. Genotype count columns: 0/0, 0/1, 1/1, 0/2, 1/2; final column: Probability. Cells as extracted:]
      13 Probability 1 13 3.01×10^-1 6 7 1.01×10^-2 1 12 1.11×10^-1 1 11 1 1.36×10^-2 4 4 5 5.53×10^-4 3 3 3 4 6.13×10^-5 1 3 3 12 3.68×10^-1 1 5 6 1 2.75×10^-4 1 11 13 1 7.46×10^-2
  16. Phasing consistent variants
      Illumina 2x100 bp 50X WGS Data, RTG Trio Calls

                                     Raw Call Set          AVR >0.15
                                     n            %        n            %
      Phase consistent               5,224,138    77.35    4,606,574    99.28
      Phase inconsistent             1,329,189    19.68       13,951     0.30
      Repaired                         200,450     2.96       19,197     0.41
      Calls inside phased segments   6,753,777    99.99    4,639,722    99.99

      Y-chromosome excluded
  17. Phasing consistent variants
      Illumina 2x100 bp 50X WGS Data, BWA/GATK UG v1.7 Calls

                                     Raw Call Set          VQSR 1st Tranche
                                     n            %        n            %
      Phase consistent                6,941,213   68.34    5,863,035    96.00
      Phase inconsistent              2,263,975   22.29      184,169     3.01
      Repaired                          951,682    9.36       59,592     0.97
      Calls inside phased segments   10,156,870   99.53    6,106,796    99.98

      Y-chromosome excluded
  18. ROC curve: NA12878 vs Phased-Consistent
      [ROC plot. x-axis: False Positive (0 to 400,000); y-axis: True Positive (0 to 4,000,000). Series: singleton, trio, trio-cohort, gatk]
      RTG sorted by AVR; GATK sorted by VQSLOD (1st tranche)
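
A count-based ROC curve of the kind shown on this slide can be built by sorting calls by score and accumulating true and false positives. The sketch below is a generic illustration under stated assumptions, not the vcfeval implementation: `roc_points` is a hypothetical function, the scores stand in for AVR or VQSLOD, and each call is assumed to have already been labelled true or false against the phase-consistent set. Calls with tied scores are not grouped.

```python
def roc_points(calls):
    """Cumulative (false positive, true positive) counts as the score cutoff is lowered.

    calls: iterable of (score, is_true_positive) pairs.
    """
    points = [(0, 0)]
    tp = fp = 0
    # Visit the highest-scoring calls first, as when tightening then relaxing a filter.
    for _score, is_tp in sorted(calls, key=lambda c: c[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points

# Toy data: five calls with made-up scores and truth labels.
calls = [(0.99, True), (0.95, True), (0.80, False), (0.60, True), (0.10, False)]
points = roc_points(calls)
print(points)
```

Plotting the second coordinate against the first gives absolute counts on both axes, which matches the axes of the slide rather than the rate-based axes of a conventional ROC curve.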
  19. NIST GiaB arbitration vs Phase-Consistent
      [Figure panels: Confident regions; Genome-wide]
  20. Assessment of score recalibration models
      rtgVariant v 1.1; NA12878
  21. Assessment of MNP & indel calling (rtgVariant 1.0)
      [Figure panels: Deletions; Insertions; SNV/MNPs (0.5%). y-axis: Percentage of phase inconsistent calls. rtgVariant v 1.0; NA12878]
      • In rtgVariant 1.0, longer insertions have higher FP than small insertions and deletions.
      • More FP in MNP
      • Improvements in aligner for v1.2
  22. Summary & Perspectives
      • Genetic segregation in a large family offers a unique opportunity to identify “true” sets of variants
      • Requires collecting data for whole family as new chemistries and platforms become available (e.g. 2x250bp, Moleculo reads)
      • Data from multiple platforms can be merged to create a comprehensive phase-consistent ground truth
      • Allows rational assessment of variant pipelines and improvement of algorithms
      • Some issues that need to be dealt with: cell line artifacts, CNVs, systematic errors, SVs.
  23. rtgTools v1.0
      A toolkit to compare and analyze VCFs
      • vcfeval – comparison of VCFs for ROC curves
      • rocplot – draw ROC curves from vcfeval output
      • mendelian – counts of Mendelian inheritance errors in pedigrees
      • vcfstats – basic statistics of VCF files
      • vcffilter – filtering of VCFs by scores, etc.
      • vcfannotate – annotation of VCF files
      • vcfmerge – merge VCF files
      Java compiled code freely available at GiaB repository: ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/
  24. http://biorxiv.org/content/early/2014/01/24/001958
  25. Acknowledgements
      RTG, Hamilton, New Zealand
        • John Cleary
        • Ross Braithwaite
        • Len Trigg
      RTG, San Bruno, CA
        • Sahar Malakshah
        • Minita Shah
      Michael Eberle, Illumina, Inc. – Platinum Project data
      Complete Genomics, Inc. – CEPH pedigree data
      Justin Zook – NIST
      Data and tools to compare with phased standard released publicly at NIST Genome-in-a-Bottle repository (s3://giab)
      This work was done while the presenter was employed by Real Time Genomics Inc., San Bruno, CA.
      © 2014 Real Time Genomics, Inc. All rights reserved.
