Mar2013 RM Characterization Working Group
Mar2013 RM Characterization Working Group Presentation Transcript

  • Platinum Genomes: Towards a comprehensive truth data set. Michael A. Eberle, Morten Kallberg, Han-Yu Chuang. © 2010 Illumina, Inc. All rights reserved.
  • Platinum Genome project: Goals
    Problem: No comprehensive truth set of variant calls for validation
    Solution: Sequence and analyze a large family pedigree
    Use Mendelian inheritance to identify good / bad variant calls
      – Including SNPs, indels & SVs
    Aggressively incorporate variant calls
      – Incorporate multiple algorithms and sequencing technologies
      – Do not limit this just to what is currently easy to call
    Make the data available publicly
      – Both raw data and processed calls with accuracy assessment
    Re-assess algorithms against better truth data
      – Better and more comprehensive truth data will allow for rapid advances in software
  • Using inheritance to detect conflicts: trio analysis
    Figure: MOM, DAD and CHILD with the father's and mother's chromosomes; the child receives the blue chromosome from the mother and the green chromosome from the father (a typical trio analysis)
    In a trio analysis like this, only 50% of each parent's DNA is passed on to the child, so many of the variants will only be called in one parent
      – Have no power to detect false positives in the parents
    A trio analysis is also not very sensitive for detecting errors
      – For example, if father is AC and mother is AC then the child can be AA, AC or CC and still be consistent with Mendelian inheritance
      – Many errors occur at sites that are systematically het, but trio analysis assumes that these are correct
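The AC x AC example can be checked with a few lines of Python. This is a minimal sketch, not the presenters' code: the function name `trio_consistent` and the two-letter genotype strings are assumptions made for illustration.

```python
from itertools import product


def trio_consistent(mother: str, father: str, child: str) -> bool:
    """True if the child genotype can be built from one maternal and one paternal allele."""
    # Every child genotype obtainable by taking one allele from each parent,
    # written with alleles sorted so that "CA" and "AC" compare equal.
    possible = {"".join(sorted(m + f)) for m, f in product(mother, father)}
    return "".join(sorted(child)) in possible


# Both parents heterozygous AC: every child genotype passes, so a trio
# check cannot flag an error at this site.
for child in ("AA", "AC", "CC"):
    print(child, trio_consistent("AC", "AC", child))

# A homozygous parent does constrain the child.
print("CC from AA x AC:", trio_consistent("AA", "AC", "CC"))
```

With two heterozygous parents the check accepts all three child genotypes, which is why a systematically het site passes a trio test regardless of what is called in the child.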
  • Using inheritance to determine accuracy: larger pedigree
    Possible GT patterns for the parents and seven children, each with its number of errors (Hamming distance) from the observed genotypes:
      Pattern    MOM  DAD  1   2   3   4   5   6   7   # Errors
      1          AT   AA   AA  TA  AA  TA  AA  TA  AA  6
      2          TA   AA   TA  AA  TA  AA  TA  AA  TA  5
      3          AA   AT   AA  AT  AT  AA  AA  AA  AT  0
      4          AA   TA   AT  AA  AA  AT  AT  AT  AA  7
      OBSERVED   AA   AT   AA  AT  AT  AA  AA  AA  AT
    100% consistent therefore we predict that all genotypes are correct
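The error counts on this slide can be reproduced by comparing the observed genotypes against each candidate pattern, sample by sample. The sketch below uses the genotypes as transcribed from the slide; the names `PATTERNS`, `OBSERVED`, `normalise` and `hamming` are illustrative assumptions, not part of the original analysis.

```python
# Column order: MOM, DAD, then children 1 to 7 (as on the slide).
PATTERNS = [
    "AT AA AA TA AA TA AA TA AA",
    "TA AA TA AA TA AA TA AA TA",
    "AA AT AA AT AT AA AA AA AT",
    "AA TA AT AA AA AT AT AT AA",
]
OBSERVED = "AA AT AA AT AT AA AA AA AT"


def normalise(row: str) -> list[str]:
    """Split a row into genotypes and sort alleles so "TA" equals "AT"."""
    return ["".join(sorted(gt)) for gt in row.split()]


def hamming(pattern: str, observed: str) -> int:
    """Number of samples whose genotype differs between pattern and observation."""
    return sum(p != o for p, o in zip(normalise(pattern), normalise(observed)))


distances = [hamming(p, OBSERVED) for p in PATTERNS]
print(distances)  # [6, 5, 0, 7]
```

A distance of 0 for one pattern means the observed calls match a valid transmission of the parental haplotypes exactly; the other three patterns would each require five or more genotype errors.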
  • Platinum Genomes - CEPH/Utah Pedigree 1463
    Figure: pedigree with three generations; top row 12889, 12890, 12891, 12892; middle row 12877, 12878; bottom row 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893. Annotation: analysis of SNPs in the parents and 11 children
    All 17 members sequenced to at least 50x depth (PCR-Free protocol)
      – SNPs & indels called using BWA + GATK + VQSR
    Each member of the trio highlighted in bold in the figure is sequenced to 200x
    An additional 200x technical replicate was done for NA12882
  • Analysis of the data
    50x raw data was aligned and variants called using BWA + GATK + VQSR
      – Accurate calls were supplemented with accurate variant calls made by Cortex using the same sequence data and accurate CGI calls made across the same pedigree
    First step is to define the inheritance of the parental chromosomes to the eleven children everywhere in the genome
      – Identified 709 crossover events between the parents and eleven children
    Define accurate variants as those where the genotypes are 100% consistent with the transmission of the parental haplotypes
      – At any position of the genome there are only 16 possible combinations of genotypes (biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
      – 3^13 (~1.6M) possible genotype combinations
    Subsequent analysis mostly excludes variants that are homozygous alternative across the last two generations of this pedigree (~750k)
      – Most will be accurate, but for these “trivially consistent” sites we cannot differentiate accurate calls from systematic errors or validate ploidy
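The 16 versus 3^13 contrast follows from counting: each of the four parental haplotypes carries either the reference or the alternate allele (2^4 = 16 assignments), while 13 samples with three possible biallelic genotypes each give 3^13 combinations. The sketch below enumerates the consistent patterns for a made-up inheritance vector; `INHERITANCE` and `consistent_patterns` are hypothetical names, and the real vector would come from the crossover analysis described on the slide.

```python
from itertools import product

# Hypothetical inheritance vector at one genomic position: for each of the
# 11 children, which maternal and which paternal haplotype (0 or 1) was
# transmitted. This is an illustrative assumption, not data from the slides.
INHERITANCE = [(i % 2, (i // 2) % 2) for i in range(11)]


def consistent_patterns(inheritance):
    """Enumerate genotype patterns (alt-allele counts) consistent with inheritance."""
    patterns = set()
    # Each parental haplotype carries the reference (0) or alternate (1) allele.
    for m0, m1, d0, d1 in product((0, 1), repeat=4):
        mom, dad = (m0, m1), (d0, d1)
        children = tuple(mom[hm] + dad[hd] for hm, hd in inheritance)
        patterns.add((m0 + m1, d0 + d1) + children)
    return patterns


n_consistent = len(consistent_patterns(INHERITANCE))
n_total = 3 ** 13  # three genotypes (0, 1 or 2 alt alleles) for each of 13 samples
print(n_consistent, n_total)  # 16 1594323
```

Because only 16 of roughly 1.6 million combinations are consistent, a set of calls is unlikely to match the inheritance pattern by accident.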
  • Input all possible data and use the inheritance to separate good from bad: variants are unlikely to accidentally match inheritance
    Figure: workflow. Set A, Set B, Set C → Compare Against Inheritance
      – NO CONFLICTS → Score (plat./gold) → db w/score
      – CONFLICTS → Assess Problem
          BIOLOGY → Score (gold/silver) → db w/comments
          BAD → Comment → db w/comments
  • Cataloging the accurate SNPs
  • Accurate SNP positions based on the pedigree analysis
    Figure: bar chart of counts (millions) by GATK site description (All Pass, Filtered*), with pedigree-analysis categories Correct and Problematic; labelled values 3,217,748 and 408,915
    Filtered sites: normally might exclude these from our analysis because the variant caller filtered some of the calls
    Additional 754,014 SNPs are “trivially consistent”, i.e. all 13 samples are hom alt.
    *Filtered means that at least one variant call was made but quality filtered
  • Hamming distance for the “accurate” SNPs to the 2nd best solution
    Figure: histogram of percent (0 to 60) vs Hamming distance (0 to 13)
    At these sites >85% of the positions would require at least four (very specific) genotype errors to have erroneously ended up with the observed predicted-accurate calls
  • Using other call sets for a more comprehensive catalogue
    Figure: bar chart of counts (x1000) for the Cortex and CGI call sets against the pedigree analysis, with legend Unique / Common; labelled values 57,270 (1.6%) and 22,922 (0.6%)
  • Concordance between “pedigree-accurate” GTs
      Comparison*      # Sites     # Diff GTs   # Same GTs    Concordance
      GATK & Cortex    2,053,136   5            26,690,763    99.99998%
      GATK & CGI       3,146,399   19           40,903,168    99.99995%
      Cortex & CGI     1,890,718   7            24,579,327    99.99997%
    *Excluding sites where alleles did not match or all samples homozygous alternative
    Includes 763,085 GT calls and 264,771 positions quality filtered by GATK
    Attempting to validate a sample of the sites that are unique to a single call set
      – Targeting ~300 per call set
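The concordance column can be recomputed from the other columns. The sketch below assumes concordance is the share of matching genotypes among all compared genotypes, and that each site contributes 13 genotypes (the two parents plus eleven children); both assumptions agree with the SNP figures in this table. The names `ROWS` and `concordance_pct` are illustrative.

```python
# (sites, differing genotypes, matching genotypes) as listed on the slide.
ROWS = {
    "GATK & Cortex": (2_053_136, 5, 26_690_763),
    "GATK & CGI": (3_146_399, 19, 40_903_168),
    "Cortex & CGI": (1_890_718, 7, 24_579_327),
}
N_SAMPLES = 13  # assumed: two parents plus eleven children per site


def concordance_pct(n_diff: int, n_same: int) -> float:
    """Percentage of compared genotypes that agree between two call sets."""
    return 100.0 * n_same / (n_same + n_diff)


for name, (sites, diff, same) in ROWS.items():
    # Sanity check: every site contributes one genotype per sample.
    assert same + diff == sites * N_SAMPLES
    print(f"{name}: {concordance_pct(diff, same):.5f}%")
```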
  • Indel analysis
  • Accurate GATK indel positions based on pedigree
    Figure: bar chart of counts (thousands) by site description (All Pass, Filtered), with pedigree-analysis categories Correct and Problematic; labelled values 240,490 and 141,508
    Additional 115,587 indels are “trivially consistent”, i.e. all 13 samples are hom alt.
  • Using other call sets for a more comprehensive catalogue
    Figure: bar chart of counts (x1000) for the Cortex and CGI call sets against the pedigree analysis, with legend Unique / Common; labelled values 39,335 (10%) and 9,637 (2.4%)
  • Concordance between overlapping “accurate” indels
      Comparison*      # Sites    # Diff GTs   # Same GTs   Concordance
      GATK & Cortex    96,228     43           1,250,921    99.997%
      GATK & CGI       219,445    2,817        2,514,785    99.901%
      Cortex & CGI     78,050     198          1,014,650    99.981%
    *Excluding sites where alleles did not match or all samples homozygous alternative
    Attempting to validate a sample of the sites that are unique to a single call set
      – Targeting ~300 per call set
  • CNVs
  • Conflict mode: Hemizygous deletions
      Pattern    MOM  DAD  1   2   3   4   5   6   7   # Errors
      1          AT   AA   AA  TA  AA  TA  AA  TA  AA  6
      2          TA   AA   TA  AA  TA  AA  TA  AA  TA  7
      3          AA   AT   AA  AT  AT  AA  AA  AA  AT  2
      4          AA   TA   AT  AA  AA  AT  AT  AT  AA  7
      OBSERVED   AA   AT   AA  AT  TT  AA  AA  AA  TT
    “Best” solution still indicates multiple errors
  • Conflict mode: Hemizygous deletions
      Pattern    MOM  DAD  1   2   3   4   5   6   7   # Errors
      1          A-   AT   AA  -T  AT  -A  AA  -A  AT  6
      2          -A   TA   -T  AA  -A  AT  -T  AT  -A  5
      3          -A   AT   -A  AT  -T  AA  -A  AA  -T  0
      4          A-   TA   AT  -A  AA  -T  AT  -T  AA  7
      OBSERVED   AA   AT   AA  AT  TT  AA  AA  AA  TT
    100% consistent therefore we predict that there is a deletion
    Hamming distance will be less when including deletions so need to be careful
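One way to see why the deletion pattern fits is to ask what a caller that assumes two copies would report for a hemizygous sample: a single remaining T is reported as TT. The sketch below applies that collapse to the zero-error pattern from the slide. The function `apparent_genotype` and the collapse rule are assumptions made for illustration, not the presenters' implementation.

```python
def apparent_genotype(gt: str) -> str:
    """Genotype a diploid-assuming caller would report; "-" marks a deleted copy."""
    alleles = [a for a in gt if a != "-"]
    if len(alleles) == 1:
        # Hemizygous: the one remaining allele looks homozygous.
        alleles *= 2
    # A sample with both copies deleted ("--") yields an empty string (no call).
    return "".join(sorted(alleles))


# Column order: MOM, DAD, then children 1 to 7 (as on the slide).
DELETION_PATTERN = "-A AT -A AT -T AA -A AA -T".split()
OBSERVED = "AA AT AA AT TT AA AA AA TT".split()

apparent = [apparent_genotype(gt) for gt in DELETION_PATTERN]
mismatches = sum(a != o for a, o in zip(apparent, OBSERVED))
print(apparent, mismatches)
```

Allowing a deleted copy adds candidate patterns for every site, which is the reason for the slide's caution that Hamming distances shrink once deletions are considered.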
  • Read depth of 5,180 SNPs predicted to overlap deletions
    Figure: histogram of counts vs depth (0 to 100), with depth ranges marked Hom Del, Haploid and Diploid, and genotype classes A-, AA, AB, -B, BB
    Depth shown for positions where the genotypes indicate that the SNP overlaps a deletion. Large number of children allows us to more reliably separate errors from deletions.
  • Have many potential large deletions to validate…
    5,180 SNPs are predicted to overlap a hemizygous deletion
    These SNPs cluster into ~902 unique events
      – Clusters show evidence for ~279 deletions >1kb segregating in this pedigree
      – Largest event is >152kb with 274 SNPs supporting the call
    Have begun validating these events beyond just visual inspection
      – 132 overlap with previously reported events (1kGP)
      – Working to define the breakpoints for wet lab validation
    Incorporating other calling methods (Cortex, BreakDancer…)
    Some SNPs also support the presence of duplications in a single parent
  • Summary
    We have sequenced a large pedigree and used the inheritance information to create a catalogue of ~4.45M accurate SNP calls
      – Over 3.7M biallelic SNPs agree with transmission of parental chromosomes
      – Over 750k homozygous alternative SNPs are trivially accurate across the pedigree
    We have also called indels using four different methods, producing over 550k “accurate” indel calls across the pedigree
      – Over 428k biallelic indels agree with transmission of parental chromosomes
      – Over 110k homozygous alternative indels are trivially accurate across the pedigree
    Concordance for the biallelic, pedigree-accurate calls is >99.9999% for SNPs and 99.9% for indels between call sets
    SVs are in progress (just deletions right now)
    The SNP and indel results presented here can be used for comparison
      – Incorporating homozygous reference calls across the pedigree for completeness
      – May see immediate gains by testing new algorithms against a better truth set
  • Acknowledgements
    Morten Kallberg – alignment & variant calling
    Han-Yu Chuang – analysis of SNP calls
    Phil Tedder – validation of de novo SNPs
    Sean Humphray
    Epameinondas Fritzilas
    Wendy Wong
    David Bentley
    Elliott Margulies