Aug2013 illumina platinum genomes
  • Thank you Tanya and thanks to everyone for attending this seminar.
  • This project grew out of the observation that there is no comprehensive truth set of variant calls, a gap that is becoming increasingly problematic as sequencing moves into the clinic. Additionally, the validation that has been done using trio conflicts or orthogonal technologies usually assesses only a relatively small percentage of the variants. We are working to solve this by sequencing a large pedigree and using parental inheritance to assess the accuracy of variant calls. Our goals are to deliver a set of highly accurate variant calls, to make the data publicly available as a community resource, and to demonstrate a framework for validating variant calls and improving variant callers, especially for more complicated variants such as indels and structural variants.
  • To demonstrate the utility of analyzing a full pedigree, we have sequenced all 17 members of a well-characterized CEPH pedigree to 50x depth. In addition, we have sequenced the trio highlighted in bold to 200x each and performed a technical replicate of the child of this trio (NA12882), again to 200x, so that we have a total of 400x sequence depth on this child. For the work I’m presenting today we will concentrate on SNP analysis in the parents and 11 children of the last two generations, but we are already looking at indels and larger variants.
  • We gain power for error detection by calculating the inheritance of the parental haplotypes. With a large number of children we will observe all 4 possible pairings of the parental haplotypes, and when that occurs we have much greater power to identify genotype errors. Because there are 11 siblings we gain additional power, since some inherited parental haplotype pairings occur in more than one child and act as built-in replicates. In this figure, I’ve highlighted the inheritance pattern for six of the children in a small region of chromosome 22 where a single inheritance pattern occurs, i.e. a region bounded by detected crossover events. Within this region we can convert genotypes to haplotypes as I’ve illustrated above.
  • If we just look at the haplotypes in blue, we can immediately detect conflicts. For example, one child is the “odd man out”, showing a T rather than a G at the fourth site, which indicates an error in this genotype. This also illustrates the power of the method: each genotype call is supported or not supported by the surrounding genotype calls across the pedigree. In practice, when we calculate conflict rates we choose the parsimonious solution that agrees most closely with the observed genotypes, so we will underestimate the true error rate, though this effect is likely small. This method allows us to assign an error to a sample, impute missing calls and, in some cases, correct errors.
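
The consistency check described in these notes can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: it assumes biallelic, diploid sites, genotypes coded as alternate-allele counts (0, 1 or 2), and an inheritance vector that is already known for the region. The function names, the inheritance vector and the genotype calls are all hypothetical. Each of the four parental haplotypes carries either the reference or the alternate allele, so only 2^4 = 16 genotype patterns fit a given inheritance vector; the sketch picks the pattern with the fewest mismatches and reports which samples disagree.

```python
from itertools import product

def consistent_patterns(inheritance):
    """Return the 2**4 = 16 genotype patterns that fit an inheritance vector.

    inheritance: one (paternal, maternal) pair of haplotype indices (0 or 1)
    per child. Genotypes are alt-allele counts: 0, 1 or 2.
    """
    patterns = []
    for p0, p1, m0, m1 in product((0, 1), repeat=4):
        father, mother = p0 + p1, m0 + m1
        children = [(p0, p1)[pi] + (m0, m1)[mi] for pi, mi in inheritance]
        patterns.append(((p0, p1, m0, m1), [father, mother] + children))
    return patterns

def most_parsimonious(observed, inheritance):
    """Pick the consistent pattern with the fewest mismatches.

    observed: genotypes ordered father, mother, children; None marks a
    missing call, which never counts as a mismatch and is imputed.
    """
    def mismatches(expected):
        return [i for i, (o, e) in enumerate(zip(observed, expected))
                if o is not None and o != e]

    alleles, expected = min(consistent_patterns(inheritance),
                            key=lambda p: len(mismatches(p[1])))
    return alleles, expected, mismatches(expected)

# Hypothetical region: 11 children covering all four haplotype pairings.
inheritance = [(0, 0), (0, 1), (1, 0), (1, 1)] * 2 + [(0, 0), (0, 1), (1, 0)]
# Made-up calls: the alt allele sits on paternal haplotype 0 only, except
# that the third child (sample index 4) is miscalled as het and the last
# child's call is missing.
observed = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, None]

alleles, expected, conflicts = most_parsimonious(observed, inheritance)
print(alleles, conflicts, expected[-1])
```

With these made-up calls the best-fitting pattern places the alternate allele on one paternal haplotype, flags the single conflicting sample (index 4), and imputes the missing call as homozygous reference. Because the closest consistent pattern is always chosen, an approach like this can only undercount errors, which matches the caveat in the notes.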

Presentation Transcript

  • Platinum Genomes: Identifying variants using a large pedigree. Michael A. Eberle, GIAB, August 2013.
    © 2010 Illumina, Inc. All rights reserved. Illumina, illuminaDx, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, GoldenGate Indexing, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, GenomeStudio, Genetic Energy, HiSeq, and HiScan are registered trademarks or trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
  • 2 Platinum Genome project: Improving technology & tools
    Create a catalogue of highly accurate whole-genome variant calls within a well-characterized pedigree
      – SNPs, indels & CNVs
      – Including highly confident reference positions
      – Provide direct supporting evidence for every variant call
    Develop a framework to assess variant callers
    Provide a path to improve variant callers by providing better truth data for assessing sensitivity and precision
      – Modifying the SNP filters to maximize accuracy
    [Figure: truth vs. test call sets, with regions labeled Correct, FP and FN]
  • 3 NIST GIAB – Pedigree analysis
    [Pedigree figure, sample IDs as listed on the slide: 12889, 12890, 12891, 12892; 12877, 12878; 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12887, 12886, 12888, 12893]
    All 17 members sequenced to at least 50x depth (PCR-Free protocol)
    Variants are called across the pedigree using different software & technology
    Inheritance information provides high-confidence, direct validation of variant calls
    Analysis of SNPs in the parents and 11 children
  • 4 Pedigree Analysis – Using haplotypes to detect conflicts
    [Figure: six-site haplotypes for the parents and children, e.g. A C A G T A, A T C T G A and G C A T T A; one copy reads A C A T T A]
    With a sufficiently large pedigree all four possible inheritance patterns will be observed and most of the genotypes can be phased into haplotypes
  • 5 Using haplotypes to detect conflicts
    [Figure: the same parent and child haplotypes, with the A C A T T A copy marked "Error at this sample/position"]
    Individual GT accuracy is assessed using surrounding genotype calls across the pedigree
    Genotypes are parsimoniously phased to minimize the number of conflicts across the pedigree
    Facilitates assigning conflicts to sample, imputation of missing data and error correction
  • 6 Analysis of variant calls within the pedigree structure
    First step is to define the inheritance of the parental chromosomes to the eleven children everywhere in the genome
      – Identified 709 crossover events between the parents and eleven children
    Variants called across the pedigree using multiple callers
      – E.g. GATK, Cortex, Isaac & CGI for SNPs
    Define accurate variants as those where the genotypes are 100% consistent with the transmission of the parental haplotypes
      – At any position of the genome there are only 16 possible combinations of genotypes (biallelic & diploid) across the pedigree that are consistent with the inheritance pattern
      – Out of 3^13 (~1.6M) possible genotype combinations
  • 7 Current state
    Homozygous positions (GATK)
      – ~2.6B positions identified as homozygous reference across the pedigree
    SNPs (GATK, Cortex, Isaac & CGI)
      – ~4.7M positions where SNPs agree with transmission of parental chromosomes
      – >95% (4.5M) called consistent with transmission by multiple algorithms/technologies
      – >98% (4.6M) with supporting evidence from other call sets (i.e. same variant called in at least one of the samples)
    Indels (GATK, Cortex & CGI)
      – ~640k indels consistent with transmission of parental chromosomes
      – Events range in size from 1 to 350bp
    CNVs (BreakDancer & Grouper)
      – ~772 CNVs, mostly deletions though a couple of duplications
      – Events range from 1kb to 322kb though still refining break points
  • 8 CNVs
  • 9 Incorporating larger variants
    SNPs and small indels work well because the genotypes are highly accurate
      – A single genotyping error in any of the 13 samples will almost never be consistent with the haplotype transmission
    Developing approaches for other variant types that have lower calling accuracy
      – Many CNV callers do not provide GT information
      – Accuracy is too low to use pedigree-consistency
  • 10 Incorporating CNVs into this framework
    Make breakpoint calls within each sample using BreakDancer & Grouper
    Identify regions of overlap between samples (keeping singletons)
    Corroborate based on read counts within the putative CNV events
    Refine to breakpoint resolution
    [Figure: calls in NA12877, NA12878, NA12879, NA12880, NA12881 and NA12882 merged into test regions]
      – Count the uniquely aligned reads within the defined break points for the test regions for each sample & identify events where the read counts are consistent with a deletion or duplication
      – For internally-consistent events, follow up with targeted analysis to identify bp resolution of events
      – On average ~150x depth for every event
  • 11 Using read counts to confirm deletions – 8.5kb deletion
    [Figure: read counts (axis 0 to 2000) per sample, with samples labeled by haplotype pair: AB CD CB DA CB DB DA CB CA DB CB CA DA; copy-number levels marked zero-ploid (0), haploid (1) and diploid (2)]
    Best Sol’n: A=0 ; B=1 ; C=1 ; D=1
    All samples with haplotype A are consistent with haploid based on read counts
  • 12 Breakdown of 772 “accurate” CNVs (1kb to 322kb in size)
    [Figure: overlap of BreakDancer and Grouper calls; region counts 266, 408 and 98]
  • 13 Next steps
    Assembling breakpoints for the 772 CNVs
      – Reassessing the “failed” calls where applicable
    Incorporating different calling algorithms / methods
      – E.g. SNP inheritance can help identify CNVs that are missed by other methods
      – Including mate pair data (~2kb insert size)
    Working on different methods to improve our catalogue of ~30bp to 2kb events & incorporating different callers
    Assigning error modes for “failed” SNPs
      – Many look like cell line mutations & alignment errors
    Comparing our call set to other datasets to assess accuracy and completeness
      – Other GIAB call sets
      – Fosmid data (Jaffe & Kidd)
  • 14 Acknowledgements
    Illumina: Morten Kallberg, Xiaoyu Chen, Han-Yu Chuang, Phil Tedder, Sean Humphray, Elliott Margulies, David Bentley
    Oxford: Zamin Iqbal, Gil McVean
    This data and more available at
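
As a supplement to slide 11, the read-count check for a deletion can be sketched as a small search over haplotype copy numbers. This is an assumed formulation rather than the method used in the project: each of the four parental haplotypes (A, B, C, D) is given a copy number of 0 or 1, each sample's predicted copy number is the sum over its two haplotypes, and the assignment with the smallest squared error against the normalized read depth wins. The sample labels come from the slide; the depth values and the function name are made up for illustration.

```python
from itertools import product

def fit_haplotype_copy_numbers(sample_haplotypes, depth, states=(0, 1)):
    """Find per-haplotype copy numbers that best explain per-sample depth.

    sample_haplotypes: two-letter haplotype pairs, e.g. "AB".
    depth: per-sample read depth scaled so that 1.0 is about one copy.
    states: allowed copy numbers per haplotype (0/1 models a deletion).
    """
    labels = sorted(set("".join(sample_haplotypes)))
    best_error, best_fit = None, None
    for copies in product(states, repeat=len(labels)):
        cn = dict(zip(labels, copies))
        # Squared error between predicted and observed copy number.
        error = sum((cn[h[0]] + cn[h[1]] - d) ** 2
                    for h, d in zip(sample_haplotypes, depth))
        if best_error is None or error < best_error:
            best_error, best_fit = error, cn
    return best_fit

# Haplotype pairs as labeled on the slide (father, mother, 11 children).
samples = ["AB", "CD", "CB", "DA", "CB", "DB", "DA",
           "CB", "CA", "DB", "CB", "CA", "DA"]
# Made-up normalized depths: roughly 1 copy where haplotype A is present.
depth = [1.1, 1.9, 2.0, 0.9, 2.1, 2.0, 1.0, 1.9, 1.1, 2.0, 2.1, 0.9, 1.0]

solution = fit_haplotype_copy_numbers(samples, depth)
print(solution)
```

With these illustrative depths the search recovers the solution shown on the slide (A=0, B=1, C=1, D=1). Passing `states=(0, 1, 2)` would extend the same search to duplications.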