140127 rtg phased pedigree analyses

Published in: Technology
  • The lengths of the female and male genetic maps are 1,817 cM and 1,386 cM, respectively

    1. Development & applications of a segregation-phasing ground truth. Genome-in-a-Bottle Workshop. Francisco M. De La Vega, D.Sc., Visiting Scholar, Department of Genetics, Stanford University School of Medicine. In collaboration with Real Time Genomics, Inc.
    2. Evaluating Variant Calls. O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Medicine 5, 28 (2013).
    3. Beyond Venn Diagrams
       • Experimental validation (e.g. Sanger, qPCR): expensive; limited by platform success; statistical sample
       • Reference orthogonal data available for some genomes: SNP array data; sparse fosmid sequencing data; incomplete
       • Reference genomes sequenced by multiple platforms: arbitration methods (e.g. NIST, Genome-in-a-Bottle); low FP, but unknown FN (genome-wide); biases?
    4. Mendelian segregation as “ground truth”
    5. CEPH/Utah Pedigree 1463
       • Sequenced by CGI and Illumina (Platinum Genomes)
       • Started with 2x100 bp 50X WGS Illumina Platinum data
       • Aligned and variant called with rtgVariant 1.1; filtered by quality score (AVR ≥ 0.15) across the samples, excluding problematic sites
       • [Figure: pedigree diagram. Samples: NA12889, NA12890, NA12891, NA12877, NA12879, NA12880, NA12881, NA12882, NA12892, NA12878, NA12883, NA12884, NA12885, NA12886, NA12887, NA12888, NA12893]
    6. Example: Heterozygous variant segregation [Figure: pedigree diagram of samples NA12877 to NA12893 annotated with trio-calling genotypes, each sample shown as 0/0 or 0/1]
    7. Segregation of heterozygous variants to offspring [Figure: four histograms (SNV, All Variants, MNP, indel) of variant count against number of offspring segregating (1 to 11)]
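The quantity on the x-axis of these histograms can be tallied per site as sketched below. This is a minimal illustration only: the function names `parse_gt`, `offspring_segregating` and `segregation_histogram` are hypothetical, the genotype strings are made-up examples in VCF-style `0/1` notation, and nothing here is taken from rtgVariant itself.

```python
from collections import Counter

def parse_gt(gt):
    """Turn a VCF-style genotype string such as '0/1' or '0|1' into a tuple of ints."""
    return tuple(int(a) for a in gt.replace("|", "/").split("/"))

def offspring_segregating(child_gts, allele=1):
    """Count the children whose genotype carries the given allele."""
    return sum(allele in parse_gt(gt) for gt in child_gts)

def segregation_histogram(sites, allele=1):
    """Map 'number of offspring segregating' to the number of sites with that count."""
    return Counter(offspring_segregating(child_gts, allele) for child_gts in sites)

# Illustrative data (assumed, not from the talk): two sites, 11 children each.
site_a = ["0/0", "0/1", "0/1", "0/1", "0/1", "0/0", "0/1", "0/0", "0/1", "0/0", "0/0"]
site_b = ["0/1"] + ["0/0"] * 10
print(offspring_segregating(site_a))
print(segregation_histogram([site_a, site_b]))
```

Sites where a variant that is heterozygous in one parent reaches only one child, or all eleven, sit in the tails of such a histogram.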
    8. Steps for haplotype phasing in large family
       • Identify crossovers
       • Phase contiguity extension
       • Connect haplotype islands
       • Check calls vs haplotype framework
    9. Phasing labels given parent and child genotypes [Table: for parents with haplotypes fa/fb and ma/mb, each child genotype is assigned the phasing labels (fa/ma, fa/mb, fb/ma, fb/mb) that can produce it; cell layout lost in extraction]
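The labelling idea can be sketched as follows. This is an assumption-laden illustration, not the RTG implementation: the function `phasing_labels` is hypothetical, and it assumes an arbitrary but fixed ordering of each parent's two alleles onto the haplotypes fa, fb (father) and ma, mb (mother).

```python
from itertools import product

def phasing_labels(father, mother, child):
    """List the haplotype pairs that can produce the child's unordered genotype.

    father = (allele on fa, allele on fb); mother = (allele on ma, allele on mb).
    An empty result means the child genotype is not Mendelian-consistent.
    """
    target = tuple(sorted(child))
    labels = []
    for (f_name, f_allele), (m_name, m_allele) in product(
            zip(("fa", "fb"), father), zip(("ma", "mb"), mother)):
        if tuple(sorted((f_allele, m_allele))) == target:
            labels.append(f"{f_name}/{m_name}")
    return labels

print(phasing_labels((0, 0), (0, 1), (0, 1)))  # father uninformative, mother informative
print(phasing_labels((0, 1), (2, 3), (1, 2)))  # fully informative site
print(phasing_labels((0, 0), (0, 0), (0, 1)))  # no label fits
```

A site is most useful for phasing when the list has a single entry, since the child genotype then pins down which haplotype came from each parent.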
    10. Identification of recombination crossovers [Figure panels: Chr 1, Mother; Chr 6, Mother]
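One simple way to locate crossovers, once each informative site in a child has been labelled with the parental haplotype it inherited, is to look for label switches along the chromosome. The sketch below assumes that input format; `find_crossovers` is a hypothetical helper, and a real pipeline would also need to guard against isolated genotyping errors that mimic a double crossover.

```python
def find_crossovers(positions, labels):
    """Return (left, right) position intervals where the inherited haplotype switches.

    positions: site coordinates in ascending order.
    labels: inherited haplotype of one parent at each site ('a' or 'b'),
            or None where the site is uninformative for that parent.
    """
    crossovers = []
    prev_pos = prev_label = None
    for pos, label in zip(positions, labels):
        if label is None:
            continue  # uninformative site: carries no phase evidence
        if prev_label is not None and label != prev_label:
            crossovers.append((prev_pos, pos))
        prev_pos, prev_label = pos, label
    return crossovers

# Illustrative input (assumed): one switch between positions 300 and 400.
print(find_crossovers([100, 200, 300, 400, 500], ["a", None, "a", "b", "b"]))
```

Each returned interval brackets the crossover between the last site on the old haplotype and the first site on the new one.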
    11. Recombination crossovers statistics [Figure: bar chart of crossover counts per chromosome (1 to 22) for Father and Mother; Total: 686]
    12. Linking of phased regions [Figure panels: Chr 1, Mother; Chr 6, Mother]
    13. Testing for Phase Consistency. Example with 4 offspring [Table: genotypes and candidate phasings of the father (fa, fb) and mother (ma, mb) set against the phasing labels and genotypes of Offspring 1 to 4; cell layout lost in extraction]
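A brute-force version of such a test is sketched below under a stated assumption: a site is treated as phase-consistent when some assignment of each parent's two alleles to its haplotypes reproduces every child's genotype, given the haplotype pair that the framework says each child inherited at that position. The function `is_phase_consistent` and its input layout are hypothetical and are not taken from the RTG code.

```python
from itertools import permutations

def is_phase_consistent(father_gt, mother_gt, children):
    """Check whether any phasing of the parents explains all child genotypes.

    father_gt, mother_gt: unordered genotypes, e.g. (0, 1).
    children: list of ((paternal_label, maternal_label), genotype) entries,
              e.g. (("fa", "mb"), (0, 1)), with labels from the haplotype framework.
    """
    for fa, fb in set(permutations(father_gt)):
        for ma, mb in set(permutations(mother_gt)):
            hap = {"fa": fa, "fb": fb, "ma": ma, "mb": mb}
            if all(sorted((hap[p], hap[m])) == sorted(gt)
                   for (p, m), gt in children):
                return True
    return False

# Illustrative example with 4 offspring (assumed data).
kids = [(("fa", "ma"), (0, 1)), (("fb", "ma"), (0, 0)),
        (("fa", "mb"), (0, 1)), (("fb", "mb"), (0, 0))]
print(is_phase_consistent((0, 1), (0, 0), kids))
```

With two parents there are at most four parental phasings to try, so exhaustive search is cheap at each site.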
    14. Probability of a set of genotypes being phase-consistent by chance. Given that there are d different genotypes across both the parents and children and that the number of times each of these genotypes occurs is n_i and [expression not preserved in transcript], then the probability is: [equation not preserved in transcript]. Cleary, J. G., et al. Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data. bioRxiv (2014). doi:10.1101/001958
    15. Probability of a set of genotypes being phase-consistent by chance: some examples [Table: genotype counts (columns 0/0, 0/1, 1/1, 0/2, 1/2) against probability; column alignment lost in extraction. Values in reading order: 13, 1; 13, 3.01×10^-1; 6, 7, 1.01×10^-2; 1, 12, 1.11×10^-1; 1, 11, 1, 1.36×10^-2; 4, 4, 5, 5.53×10^-4; 3, 3, 3, 4, 6.13×10^-5; 1, 3, 3, 12, 3.68×10^-1; 1, 5, 6, 1, 2.75×10^-4; 1, 11, 13, 1, 7.46×10^-2]
    16. Phasing consistent variants. Illumina 2x100 bp 50X WGS data, RTG trio calls

                                        Raw Call Set           AVR > 0.15
                                        n             %        n             %
        Phase consistent                5,224,138     77.35    4,606,574     99.28
        Phase inconsistent              1,329,189     19.68       13,951      0.30
        Repaired                          200,450      2.96       19,197      0.41
        Calls inside phased segments    6,753,777     99.99    4,639,722     99.99

        Y-chromosome excluded.
    17. Phasing consistent variants. Illumina 2x100 bp 50X WGS data, BWA/GATK UG v1.7 calls

                                        Raw Call Set           VQSR 1st Tranche
                                        n             %        n             %
        Phase consistent                 6,941,213    68.34    5,863,035     96.00
        Phase inconsistent               2,263,975    22.29      184,169      3.01
        Repaired                           951,682     9.36       59,592      0.97
        Calls inside phased segments    10,156,870    99.53    6,106,796     99.98

        Y-chromosome excluded.
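The arithmetic behind the two "Phasing consistent variants" tables can be checked with a few lines. The counts below are transcribed from the slides; the dictionary keys and the helper `summarize` are hypothetical names chosen for this sketch. In each column the last count equals the sum of the three counts above it.

```python
# Counts per call set: (phase consistent, phase inconsistent, repaired).
TABLES = {
    "RTG raw": (5_224_138, 1_329_189, 200_450),
    "RTG AVR>0.15": (4_606_574, 13_951, 19_197),
    "GATK raw": (6_941_213, 2_263_975, 951_682),
    "GATK VQSR 1st tranche": (5_863_035, 184_169, 59_592),
}

def summarize(consistent, inconsistent, repaired):
    """Return the column total and the phase-inconsistent percentage."""
    total = consistent + inconsistent + repaired
    return total, 100.0 * inconsistent / total

for name, counts in TABLES.items():
    total, pct = summarize(*counts)
    print(f"{name}: total={total:,} phase-inconsistent={pct:.2f}%")
```

The printed percentages may differ from the slide values in the last digit, depending on how the slide values were rounded.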
    18. ROC curve: NA12878 vs Phased-Consistent [Figure: true positives (0 to 4,000,000) against false positives (0 to 400,000) for the singleton, trio, trio-cohort and gatk call sets. RTG sorted by AVR; GATK sorted by VQSLOD (1st tranche)]
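A count-based ROC curve of this kind can be built by sorting calls from highest to lowest score (AVR or VQSLOD in the talk) and accumulating true and false positives. The sketch below assumes each call has already been labelled as true or false against the phase-consistent set; `roc_points` is a hypothetical helper and is not the vcfeval implementation.

```python
def roc_points(calls):
    """Return cumulative (false_positives, true_positives) points.

    calls: iterable of (score, is_true_positive) pairs; higher score = more confident.
    """
    points = []
    tp = fp = 0
    for _score, is_tp in sorted(calls, key=lambda c: c[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points

# Illustrative scored calls (assumed data).
calls = [(0.9, True), (0.8, True), (0.4, False), (0.3, True), (0.1, False)]
print(roc_points(calls))
```

Plotting the second coordinate against the first gives a curve with the same axes as the figure; a call set whose curve rises higher for a given number of false positives ranks its calls better.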
    19. NIST GiaB arbitration vs Phase-Consistent [Figure panels: Confident regions; Genome-wide]
    20. Assessment of score recalibration models [Figure: rtgVariant v1.1; NA12878]
    21. Assessment of MNP & indel calling (rtgVariant 1.0)
       • In rtgVariant 1.0, longer insertions have higher FP than small ones and deletions
       • More FP in MNP
       • Improvements in aligner for v1.2
       [Figure panels: Deletions; Insertions; SNV/MNPs. Axis: percentage of phase-inconsistent calls (0.5% marked). rtgVariant v1.0; NA12878]
    22. Summary & Perspectives
       • Genetic segregation in a large family offers a unique opportunity to identify “true” sets of variants
       • Requires collecting data for whole family as new chemistries and platforms become available (e.g. 2x250 bp, Moleculo reads)
       • Data from multiple platforms can be merged to create a comprehensive phase-consistent ground truth
       • Allows rational assessment of variant pipelines and improvement of algorithms
       • Some issues that need to be dealt with: cell line artifacts, CNVs, systematic errors, SVs
    23. rtgTools v1.0: a toolkit to compare and analyze VCFs
       • vcfeval: comparison of VCFs for ROC curves
       • rocplot: draw ROC curves from vcfeval output
       • mendelian: counts of Mendelian inheritance errors in pedigrees
       • vcfstats: basic statistics of VCF files
       • vcffilter: filtering of VCFs by scores, etc.
       • vcfannotate: annotation of VCF files
       • vcfmerge: merge VCF files
       Java compiled code freely available at GiaB repository: ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/
    24. http://biorxiv.org/content/early/2014/01/24/001958
    25. Acknowledgements
       • RTG, Hamilton, New Zealand: John Cleary, Ross Braithwaite, Len Trigg
       • RTG, San Bruno, CA: Sahar Malakshah, Minita Shah
       • Michael Eberle, Illumina, Inc.: Platinum Project data
       • Complete Genomics, Inc.: CEPH pedigree data
       • Justin Zook, NIST
       Data and tools to compare with phased standard released publicly at NIST Genome-in-a-Bottle repository (s3://giab). This work was done while the presenter was employed by Real Time Genomics Inc., San Bruno, CA. © 2014 Real Time Genomics, Inc. All rights reserved.
