Aug2014 nist structural variant integration

NIST SVs

Published in: Health & Medicine

  1. Analysis of Structural Variants from Next-Generation Sequencing. Hemang Parikh, Ph.D., NIST
  2. Challenges for identifying true SVs
     This Venn diagram shows the numbers of unique and shared structural variants (SVs) found by the different sequencing-based discovery approaches used in the 1000 Genomes Project (from Alkan et al., 2011). Hence we decided to develop methods that look for evidence of SVs in mapped sequencing reads from multiple sequencing technologies.
  3. Validation parameters for each SV
     • Coverage (mean and standard deviation)
     • Paired-end distance/insert size (mean and standard deviation)
     • Number of discordant paired-end reads
     • Soft clipping of the reads (mean and standard deviation)
     • Mapping quality (mean and standard deviation)
     • Number of heterozygous and homozygous SNP genotype calls
     • GC content (%)
  4. [Diagram] Inputs: reference sequence; RepeatMasker data; aligned sequence data (BAM file); list of structural variants (BED file). A Perl script produces about 180 annotations per SV.
  5. NA12878 data sets (RM for GIAB)
     • Illumina (250 bp reads, 50X coverage)
     • Illumina NIST (150 bp reads, 300X coverage)
     • Illumina Platinum Genome (100 bp reads, 200X coverage)
     • Illumina Moleculo
     • Pacific Biosciences
  6. Deletion gold sets for NA12878
     • Personalis (n=2,306)
     • The 1000 Genomes pilot (n=2,773)
     • Complete Genomics (n=2,032)
     • Conrad et al. (n=515)
     • Kidd et al. (n=317)
     • McCarroll et al. (n=128)
     • The 1000 Genomes, aCGH array-based (n=3,901)
     • Roche NimbleGen 42 million, aCGH array-based (n=719)
     • Randomly generated (n=2,306)
  7. Personalis deletions call set (n=2,306)
     [Figure: histogram of counts against log10(SV size)]
     • BAM-level evidence in the vicinity of each SV, in most of the 19 CEPH pedigree samples
     • SV breakpoints were identified
     • Some SVs were validated with PCR
  8. Illumina NIST
     [Figure: two histograms of counts against log10(M coverage), one for the Personalis set and one for the random genome set]
  9. Identifying likely SVs and likely non-SVs
     [Figure: histogram of counts against log10(M coverage) for the random genome set]
     • Identify the 99th percentile value of an annotation parameter in the random genome set
     • Compare this value with the same annotation parameter from the SV gold set
  10. Annotating with Illumina NIST and Illumina Moleculo
     Personalis SV gold set, Illumina NIST annotation parameters:
     • L Insert size
     • L Soft clipped
     • L Number of discordant paired-end reads
     • M Coverage
     • M Coverage SD
     • M Mapping quality
     • M Insert size
     • M Soft clipped
     • M Number of discordant paired-end reads
     Personalis SV gold set, Illumina Moleculo annotation parameters:
     • L Soft clipped
     • M Coverage
     • M Coverage SD
     • M Mapping quality
     • M Soft clipped
  11. Illumina NIST (columns, 0 to 10) against Moleculo (rows, 0 to 5)

     (A) Personalis
             0     1     2     3     4     5     6     7     8     9    10
       0    21    96   323   350   231   126    80    40    10     2     1
       1     4    19    45    59    61    29    16     9     9     0     1
       2     1    22   108   200   214   111    69    36     8     3     0
       3     0     0     0     1     1     0     0     0     0     0     0
       4     0     0     0     0     0     0     0     0     0     0     0
       5     0     0     0     0     0     0     0     0     0     0     0

     (B) Random genome
             0     1     2     3     4     5     6     7     8     9    10
       0  2059    94    18     6     2     3     1     0     0     0     0
       1    62    15    12     5     1     3     2     0     0     1     0
       2    13     3     5     0     0     0     0     1     0     0     0
       3     0     0     0     0     0     0     0     0     0     0     0
       4     0     0     0     0     0     0     0     0     0     0     0
       5     0     0     0     0     0     0     0     0     0     0     0
  12. Conclusions
     • Graphical visualization of the annotation parameters shows a clear distinction between true positive and false positive SVs
     • A key advantage of the proposed method is its simplicity and its flexibility to generate various annotation parameters from aligned sequence data, based on different sequencing datasets from the same genome
     • This allows integration of multiple sequencing datasets to identify high-confidence SV and non-SV calls that can be used as a benchmark to assess false positive and false negative rates
     • We are currently testing classification methods based on the annotation parameters to generate both high-confidence SV calls and high-confidence non-SV calls for NA12878
  13. Acknowledgements
     • NIST: Marc Salit, Justin Zook, Hariharan Iyer, Desu Chen, Sumona Sarkar, Jennifer McDaniel, Lindsay Vang, David Catoe, Nathanael Olson
     • Genome in a Bottle Consortium
     • Personalis Inc.: Mark Pratt, Gabor Bartha, Jason Harris
     • Illumina Inc.: Michael Eberle
     • Stanford University: Michael Snyder, Amin Zia, Somalee Datta, Cuiping Pan, Sean Michael Boyle, Rajini Haraksingh, Natalie Jaeger
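
Slide 3 lists per-SV validation parameters such as coverage (mean and standard deviation) and GC content. The following is a minimal Python sketch of how two of those parameters might be computed for a single region. It is not the Perl script mentioned on slide 4. The function names, the input format (a list of per-base depths and a sequence string), and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def coverage_stats(depths):
    """Return (mean, standard deviation) of per-base read depth in a region.

    Assumption: the population standard deviation is used; the slides do not
    say which estimator the original pipeline applies.
    """
    if not depths:
        return 0.0, 0.0
    return mean(depths), pstdev(depths)


def gc_percent(sequence):
    """Return the percentage of G and C among called bases (A, C, G, T)."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


# Hypothetical region: depth drops to zero in the middle, as it might
# across a homozygous deletion.
depths = [30, 32, 28, 0, 0, 0, 0, 31, 29, 30]
cov_mean, cov_sd = coverage_stats(depths)

# Hypothetical reference sequence for the same region; N bases are ignored.
gc = gc_percent("ACGTNNGGCC")

print(cov_mean, round(cov_sd, 2), gc)
```

In practice the depths would come from the BAM file and the sequence from the reference, both restricted to the coordinates of one SV in the BED file.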
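
Slides 9 to 11 describe taking the 99th percentile of each annotation parameter in the random genome set, comparing gold-set SVs against that value, and tabulating results for Illumina NIST against Moleculo. The sketch below shows one way this could look in Python. It rests on several assumptions not stated in the slides: that the tables count how many parameters per SV pass the threshold, that "pass" means "greater than" (some parameters, such as coverage over a deletion, may need the lower tail instead), and that a nearest-rank percentile is acceptable. All names and values are hypothetical.

```python
from collections import Counter


def percentile_99(values):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer ceiling of 0.99 * n, giving a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def thresholds_from_random(random_annotations):
    """Map each parameter name to the 99th percentile of its random-set values."""
    return {name: percentile_99(vals) for name, vals in random_annotations.items()}


def count_exceeding(sv_annotations, thresholds):
    """Count the parameters of one SV that exceed their threshold.

    Assumption: only the upper tail is tested.
    """
    return sum(
        1
        for name, value in sv_annotations.items()
        if name in thresholds and value > thresholds[name]
    )


def cross_tabulate(counts_a, counts_b):
    """Tally SVs by (count in dataset A, count in dataset B)."""
    return Counter(zip(counts_a, counts_b))


# Hypothetical random-set values for two parameters.
random_set = {
    "soft_clipped": list(range(1, 101)),
    "discordant_pairs": [0] * 99 + [5],
}
thresholds = thresholds_from_random(random_set)

# Hypothetical gold-set SVs annotated from one dataset.
svs = [
    {"soft_clipped": 120, "discordant_pairs": 3},
    {"soft_clipped": 50, "discordant_pairs": 0},
]
sv_counts = [count_exceeding(sv, thresholds) for sv in svs]

# Hypothetical per-SV counts from a second dataset, in the same SV order.
other_counts = [1, 0]
table = cross_tabulate(sv_counts, other_counts)

print(thresholds, sv_counts, dict(table))
```

Keeping the thresholds in a plain dictionary keyed by parameter name makes it straightforward to derive a separate set of thresholds for each sequencing dataset before cross-tabulating.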
