Mar2013 RM Characterization Working Group

  1. Platinum Genomes: Towards a comprehensive truth data set
     Michael A. Eberle, Morten Kallberg, Han-Yu Chuang
     © 2010 Illumina, Inc. All rights reserved.
  2. Platinum Genome project: Goals
     Problem: No comprehensive truth set of variant calls for validation
     Solution: Sequence and analyze a large family pedigree
     - Use Mendelian inheritance to identify good / bad variant calls
       - Including SNPs, indels & SVs
     - Aggressively incorporate variant calls
       - Incorporate multiple algorithms and sequencing technologies
       - Do not limit this just to what is currently easy to call
     - Make the data available publicly
       - Both raw data and processed calls with accuracy assessment
     - Re-assess algorithms against better truth data
       - Better and more comprehensive truth data will allow for rapid advances in software
  3. Using inheritance to detect conflicts: trio analysis
     [Figure: chromosomes of MOM, DAD and CHILD (mother's chromosomes, father's chromosomes). The child receives the blue chromosome from the mother and the green chromosome from the father, e.g. a typical trio analysis.]
     - In a trio analysis like this, only 50% of the parents' DNA is passed on to the child, so many of the variants will only be called in one parent
       - Have no power to detect false positives in the parents
     - A trio analysis is also not very sensitive for detecting errors
       - For example, if father is AC and mother is AC then the child can be AA, AC or CC and still be consistent with Mendelian inheritance
       - Many errors occur at sites that are systematically het, but trio analysis assumes that these are correct
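The AC x AC example on this slide can be checked with a few lines of Python. This is a minimal sketch, not the tooling used for the project; the function name `mendelian_consistent` is hypothetical and genotypes are written as two-letter strings.

```python
from itertools import product

def mendelian_consistent(mother, father, child):
    """True if the unordered child genotype can be built from one
    allele of the mother and one allele of the father."""
    return any(sorted((m, f)) == sorted(child)
               for m, f in product(mother, father))

# Both parents het: every biallelic child genotype passes the check,
# so a miscalled child genotype cannot be detected at such a site.
for child in ("AA", "AC", "CC"):
    print(child, mendelian_consistent("AC", "AC", child))  # True each time

# A conflict is only visible when the parents constrain the child.
print(mendelian_consistent("AA", "AA", "AC"))  # False
```

The check is per site and per trio, which is why it has no way to flag a site where all three samples share the same systematic error.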
  4. Using inheritance to determine accuracy: larger pedigree
     Possible GT patterns (two alleles per sample; columns 1-7 are the children):
       MOM  DAD   1    2    3    4    5    6    7
       A T  A A  A A  T A  A A  T A  A A  T A  A A
       T A  A A  T A  A A  T A  A A  T A  A A  T A
       A A  A T  A A  A T  A T  A A  A A  A A  A T
       A A  T A  A T  A A  A A  A T  A T  A T  A A
     Observed genotypes:
       A A  A T  A A  A T  A T  A A  A A  A A  A T
  5. Using inheritance to determine accuracy: larger pedigree
     Same patterns and observed genotypes as slide 4, now scored by # errors / Hamming distance to the observed genotypes. First pattern: 6.
  6. Using inheritance to determine accuracy: larger pedigree
     Same patterns and observed genotypes. First pattern: 6. Second pattern: 5.
  7. Using inheritance to determine accuracy: larger pedigree
     Same patterns and observed genotypes. First pattern: 6. Second pattern: 5. Third pattern: 0.
  8. Using inheritance to determine accuracy: larger pedigree
     Same patterns and observed genotypes. First pattern: 6. Second pattern: 5. Third pattern: 0. Fourth pattern: 7.
  9. Using inheritance to determine accuracy: larger pedigree
       MOM  DAD   1    2    3    4    5    6    7    # Errors / Hamming distance
       A T  A A  A A  T A  A A  T A  A A  T A  A A   6
       T A  A A  T A  A A  T A  A A  T A  A A  T A   5
       A A  A T  A A  A T  A T  A A  A A  A A  A T   0
       A A  T A  A T  A A  A A  A T  A T  A T  A A   7
     Observed genotypes:
       A A  A T  A A  A T  A T  A A  A A  A A  A T
     100% consistent, therefore we predict that all genotypes are correct
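The scoring in this toy example can be reproduced directly from the allele rows on the slides. The sketch below assumes that the Hamming distance counts samples whose unordered genotype differs from the observed one, which matches the values 6, 5, 0 and 7 shown; the helper names are hypothetical.

```python
def genotypes(row):
    """Split a row of space-separated alleles into unordered
    per-sample genotypes (two alleles per sample)."""
    alleles = row.split()
    return [tuple(sorted(alleles[i:i + 2])) for i in range(0, len(alleles), 2)]

def hamming(pattern, observed):
    """Number of samples whose genotype differs between two rows."""
    return sum(p != o for p, o in zip(genotypes(pattern), genotypes(observed)))

# The four possible patterns and the observed genotypes from the slides
# (order: MOM, DAD, children 1-7).
PATTERNS = [
    "A T A A A A T A A A T A A A T A A A",
    "T A A A T A A A T A A A T A A A T A",
    "A A A T A A A T A T A A A A A A A T",
    "A A T A A T A A A A A T A T A T A A",
]
OBSERVED = "A A A T A A A T A T A A A A A A A T"

distances = [hamming(p, OBSERVED) for p in PATTERNS]
print(distances)  # [6, 5, 0, 7]
```

A distance of 0 to one pattern means the observed calls match a valid transmission exactly; the next-smallest distance indicates how many specific genotype errors would be needed to land on a different valid pattern by accident.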
  10. Platinum Genomes - CEPH/Utah Pedigree 1463
      [Pedigree figure. Grandparents: 12889, 12890, 12891, 12892. Parents: 12877, 12878. Children: 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893. Annotation: analysis of SNPs in the parents and 11 children.]
      - All 17 members sequenced to at least 50x depth (PCR-Free protocol)
        - SNPs & indels called using BWA + GATK + VQSR
      - Each member of the trio highlighted in bold is sequenced to 200x
      - An additional 200x technical replicate was done for NA12882
  11. Analysis of the data
      - 50x raw data was aligned and variants called using BWA + GATK + VQSR
        - Accurate calls were supplemented with accurate variant calls made by Cortex using the same sequence data and accurate CGI calls made across the same pedigree
      - First step is to define the inheritance of the parental chromosomes to the eleven children everywhere in the genome
        - Identified 709 crossover events between the parents and eleven children
      - Define accurate variants as those where the genotypes are 100% consistent with the transmission of the parental haplotypes
        - At any position of the genome there are only 16 possible combinations of genotypes (biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
        - Out of 3^13 (~1.6M) possible genotype combinations
      - Subsequent analysis mostly excludes all variants that are homozygous alternative across the last two generations of this pedigree (~750k)
        - Mostly these will be accurate, but for these “trivially consistent” sites we cannot differentiate accurate calls from systematic errors or validate ploidy
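The count of 16 follows from the four parental haplotypes each carrying one of two alleles (2^4 = 16), while 13 samples with three possible biallelic genotypes each give 3^13 combinations. The sketch below illustrates this on the seven-child toy example from the earlier slides rather than the real pedigree; the inheritance vector `TOY_INHERITANCE` is an assumption read off the four example patterns, and `consistent_patterns` is a hypothetical helper.

```python
from itertools import product

# Assumed inheritance vector for the seven-child toy example: for each child,
# (index of the maternal haplotype, index of the paternal haplotype).
# Inferred from the example patterns on the slides, not from the real pedigree.
TOY_INHERITANCE = [(0, 0), (1, 1), (0, 1), (1, 0), (0, 0), (1, 0), (0, 1)]

def consistent_patterns(inheritance, alleles="AT"):
    """All genotype patterns (mom, dad, children) compatible with `inheritance`."""
    patterns = set()
    # Assign one allele to each of the four parental haplotypes.
    for m0, m1, p0, p1 in product(alleles, repeat=4):
        mom, dad = (m0, m1), (p0, p1)
        row = [tuple(sorted(mom)), tuple(sorted(dad))]
        row += [tuple(sorted((mom[i], dad[j]))) for i, j in inheritance]
        patterns.add(tuple(row))
    return patterns

patterns = consistent_patterns(TOY_INHERITANCE)
print(len(patterns))  # 16 consistent patterns
print(3 ** 13)        # 1594323 genotype combinations for 13 samples
```

Because only 16 of roughly 1.6M combinations are valid at any position, a set of calls that matches one of them exactly is unlikely to have done so by chance.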
  12. Input all possible data and use the inheritance to separate good from bad
      Variants are unlikely to accidentally match inheritance
      [Flow diagram: Set A, Set B, Set C -> Compare Against Inheritance
         NO CONFLICTS -> Score (plat./gold) -> db w/score
         CONFLICTS -> Assess Problem
           BIOLOGY -> Score (gold/silver) -> db w/comments
           BAD -> Comment -> db w/comments]
  13. Cataloging the accurate SNPs
  14. Accurate SNP positions based on the pedigree analysis
      [Bar chart: Counts (Millions) by GATK site description* (All Pass, Filtered), split by pedigree analysis result (Correct, Problematic). Bar labels: 3,217,748 and 408,915.]
      - Filtered sites: normally might exclude these from our analysis because the variant caller filtered some of the calls
      - Additional 754,014 SNPs are “trivially consistent”, i.e. all 13 samples are hom alt.
      *Filtered means that at least one variant call was made but was quality filtered
  15. Hamming distance for the “accurate” SNPs to the 2nd best solution
      [Histogram: Percent (0 to 60) by Hamming Distance (0 to 13).]
      - At these sites, >85% of the positions would require at least four (very specific) genotype errors to have erroneously ended up with the observed predicted-accurate calls
  16. Using other call sets for a more comprehensive catalogue
      [Bar chart: Counts (x1000) for Cortex and CGI, split by pedigree analysis result (Unique, Common). Labelled values: 57,270 (1.6%) and 22,922 (0.6%).]
  17. Concordance between “pedigree-accurate” GTs
      Comparison*      # Sites     # Diff GTs   # Same GTs   Concordance
      GATK & Cortex    2,053,136   5            26,690,763   99.99998%
      GATK & CGI       3,146,399   19           40,903,168   99.99995%
      Cortex & CGI     1,890,718   7            24,579,327   99.99997%
      *Excluding sites where alleles did not match or all samples homozygous alternative
      Includes 763,085 GT calls and 264,771 positions quality filtered by GATK
      - Attempting to validate a sample of the sites that are unique to a single call set
        - Targeting ~300 per call set
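The concordance column can be recomputed from the counts in the table. The sketch below assumes concordance is the share of identical genotypes among all compared genotypes, and uses 13 genotypes per site (the parents plus 11 children) as an arithmetic cross-check; the function name `concordance` is hypothetical.

```python
N_SAMPLES = 13  # parents plus 11 children, as analysed on the earlier slides

# (sites, differing genotypes, identical genotypes) from the SNP table
ROWS = {
    "GATK & Cortex": (2_053_136, 5, 26_690_763),
    "GATK & CGI": (3_146_399, 19, 40_903_168),
    "Cortex & CGI": (1_890_718, 7, 24_579_327),
}

def concordance(same, diff):
    """Percentage of compared genotypes that are identical."""
    return 100.0 * same / (same + diff)

for name, (sites, diff, same) in ROWS.items():
    # Every site contributes one genotype per sample.
    assert sites * N_SAMPLES == same + diff
    print(f"{name}: {concordance(same, diff):.5f}%")
```

Run as written, the three printed values round to the percentages listed in the table.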
  18. Indel analysis
  19. Accurate GATK indel positions based on pedigree
      [Bar chart: Counts (thousands) by site description (All Pass, Filtered), split by pedigree analysis result (Correct, Problematic). Bar labels: 240,490 and 141,508.]
      - Additional 115,587 indels are “trivially consistent”, i.e. all 13 samples are hom alt.
  20. Using other call sets for a more comprehensive catalogue
      [Bar chart: Counts (x1000) for Cortex and CGI, split by pedigree analysis result (Unique, Common). Labelled values: 39,335 (10%) and 9,637 (2.4%).]
  21. Concordance between overlapping “accurate” indels
      Comparison*      # Sites     # Diff GTs   # Same GTs   Concordance
      GATK & Cortex    96,228      43           1,250,921    99.997%
      GATK & CGI       219,445     2,817        2,514,785    99.901%
      Cortex & CGI     78,050      198          1,014,650    99.981%
      *Excluding sites where alleles did not match or all samples homozygous alternative
      - Attempting to validate a sample of the sites that are unique to a single call set
        - Targeting ~300 per call set
  22. CNVs
  23. Conflict mode: Hemizygous deletions
        MOM  DAD   1    2    3    4    5    6    7    # Errors / Hamming distance
        A T  A A  A A  T A  A A  T A  A A  T A  A A   6
        T A  A A  T A  A A  T A  A A  T A  A A  T A   7
        A A  A T  A A  A T  A T  A A  A A  A A  A T   2
        A A  T A  A T  A A  A A  A T  A T  A T  A A   7
      Observed genotypes:
        A A  A T  A A  A T  T T  A A  A A  A A  T T
      “Best” solution still indicates multiple errors
  24. Conflict mode: Hemizygous deletions
        MOM  DAD   1    2    3    4    5    6    7    # Errors / Hamming distance
        A -  A T  A A  - T  A T  - A  A A  - A  A T   6
        - A  T A  - T  A A  - A  A T  - T  A T  - A   5
        - A  A T  - A  A T  - T  A A  - A  A A  - T   0
        A -  T A  A T  - A  A A  - T  A T  - T  A A   7
      Observed genotypes:
        A A  A T  A A  A T  T T  A A  A A  A A  T T
      - 100% consistent, therefore we predict that there is a deletion
      - Hamming distance will be less when including deletions, so need to be careful
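The effect of allowing a deleted haplotype can be sketched as follows. The assumption here is that a diploid caller reports a hemizygous sample (for example `- T`) as homozygous for the remaining allele (`T T`); the helper names are hypothetical and the rows are the zero-distance deletion pattern and the best diploid pattern from the two slides above.

```python
def apparent(pair):
    """Genotype a diploid caller would report: a deleted haplotype ('-')
    is invisible, so a hemizygous sample looks homozygous."""
    present = [a for a in pair if a != "-"]
    if len(present) == 1:
        present = present * 2
    return tuple(sorted(present))

def rows(text):
    """Split a row of space-separated alleles into per-sample pairs."""
    alleles = text.split()
    return [tuple(alleles[i:i + 2]) for i in range(0, len(alleles), 2)]

def hamming(pattern, observed):
    """Number of samples whose apparent genotype differs."""
    return sum(apparent(p) != apparent(o)
               for p, o in zip(rows(pattern), rows(observed)))

OBSERVED = "A A A T A A A T T T A A A A A A T T"
DIPLOID_BEST = "A A A T A A A T A T A A A A A A A T"   # best pattern without a deletion
WITH_DELETION = "- A A T - A A T - T A A - A A A - T"  # mother carries a deleted haplotype

print(hamming(DIPLOID_BEST, OBSERVED))   # 2
print(hamming(WITH_DELETION, OBSERVED))  # 0
```

Adding a deletion allele enlarges the set of valid patterns, so a zero-distance match is easier to reach by chance than in the purely diploid case, which is the caution raised on the slide.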
  25. Read depth of 5,180 SNPs predicted to overlap deletions
      [Histogram: Counts (0 to 5000) by Depth (0 to 100), with regions marked Hom Del, Haploid and Diploid and series for the genotype classes A-, AA, AB, -B and BB.]
      - Depth shown for positions where the genotypes indicate that the SNP overlaps a deletion
      - Large number of children allows us to more reliably separate errors from deletions
  26. Have many potential large deletions to validate…
      - 5,180 SNPs are predicted to overlap a hemizygous deletion
      - These SNPs cluster into ~902 unique events
        - Clusters show evidence for ~279 deletions >1kb segregating in this pedigree
        - Largest event is >152kb with 274 SNPs supporting the call
      - Have begun validating these events beyond just visual inspection
        - 132 overlap with previously reported events (1kGP)
        - Working to define the breakpoints for wet lab validation
      - Incorporating other calling methods (Cortex, breakdancer…)
      - Some SNPs also support the presence of duplications in a single parent
  27. Summary
      - We have sequenced a large pedigree and used the inheritance information to create a catalogue of ~4.45M accurate SNP calls
        - Over 3.7M biallelic SNPs agree with transmission of parental chromosomes
        - Over 750k homozygous alternative SNPs are trivially accurate across the pedigree
      - Have also called indels using four different methods to produce over 550k “accurate” indel calls across the pedigree
        - Over 428k bi-allelic indels agree with transmission of parental chromosomes
        - Over 110k homozygous alternative indels are trivially accurate across the pedigree
      - Concordance for the bi-allelic, pedigree-accurate calls is >99.9999% for SNPs and 99.9% for indels between call sets
      - SVs are in progress (just deletions right now)
      - The SNP and indel results presented here can be used for comparison
        - Incorporating homozygous reference calls across the pedigree for completeness
        - May see immediate gains by testing new algorithms against a better truth set
  28. Acknowledgements
      - Morten Kallberg: alignment & variant calling
      - Han-Yu Chuang: analysis of SNP calls
      - Phil Tedder: validation of de novo SNPs
      - Sean Humphray
      - Epameinondas Fritzilas
      - Wendy Wong
      - David Bentley
      - Elliott Margulies