Ryan Poplin - Sources of Bias

Published in: Technology, Business

  1. Understanding sources of bias and error from a prospective Reference Material (NA12878). Ryan Poplin, on behalf of the Genome Sequencing and Analysis Group, Program in Medical and Population Genetics. August 16, 2012
  2. NA12878 is a wonderful reference sample
       • Unrestricted cell lines
       • Extensive pedigree available
       • Extensively sequenced and genotyped at the Broad and elsewhere
           – All Broad techs (both production and experimental)
           – Fosmids
           – Many library designs and sample prep protocols
  3. Our framework for variation discovery
       • Phase 1: NGS data processing (typically by lane)
           – Input: raw reads
           – Steps: mapping, local realignment, duplicate marking, base quality recalibration
           – Output: analysis-ready reads
       • Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously, but can be a single sample alone)
           – Input: reads from Sample 1 through Sample N
           – Discovery of SNPs, indels, and structural variation (SV)
           – Output: raw variants
       • Phase 3: Integrative analysis
           – Input: raw SNPs, indels, and SVs
           – External data: pedigrees, known variation, population structure, known genotypes
           – Steps: variant quality recalibration, genotype refinement
           – Output: analysis-ready variants
     DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
  4. Lots of work required to turn raw sequencing reads into something that is useful
       • Phase 1: NGS data processing
           – Input: raw reads
           – Steps: mapping, local realignment, duplicate marking, base quality recalibration
           – Output: analysis-ready reads
       • Desired properties of analysis-ready reads:
           – Unbiased sampling of alleles
           – Calibrated mapping quality scores
           – Indels have correct and consistent alignment in reads
           – Duplicate molecules shouldn't count as extra evidence for an event
           – Calibrated base quality scores for base substitutions, base insertions, and base deletions
  5. Indels have correct and consistent alignment in reads through multiple sequence local realignment
       • Phase 1: NGS data processing (input: raw reads; steps: mapping, local realignment, duplicate marking, base quality recalibration; output: analysis-ready reads)
       • [Figure: Effect of MSA on alignments, NA12878, chr1:1,510,530-1,510,589. Panels: 1,000 Genomes Pilot 2 data, raw MAQ alignments; 1,000 Genomes Pilot 2 data, after MSA; HiSeq data, raw BWA alignments; HiSeq data, after MSA. Labeled variants: rs28782535, rs28783181, rs28788974, rs34877486]
     DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
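
Slide 5's point, that indels need a correct and consistent alignment across reads, can be illustrated with a toy example. The sketch below is not the multiple sequence realignment shown on the slide; it only demonstrates why aligning reads one at a time can place the same deletion at different coordinates inside a repeat, and shows left-alignment as one common way to pick a single canonical placement. The reference string and all function names are made up for illustration.

```python
from typing import List


def apply_deletion(ref: str, pos: int, length: int) -> str:
    """Return ref with `length` bases removed starting at 0-based `pos`."""
    return ref[:pos] + ref[pos + length:]


def equivalent_deletion_positions(ref: str, pos: int, length: int) -> List[int]:
    """List every start position whose deletion yields the same haplotype."""
    target = apply_deletion(ref, pos, length)
    return [
        p
        for p in range(len(ref) - length + 1)
        if apply_deletion(ref, p, length) == target
    ]


def left_align_deletion(ref: str, pos: int, length: int) -> int:
    """Pick the leftmost equivalent placement as the canonical one."""
    return min(equivalent_deletion_positions(ref, pos, length))


# Hypothetical reference containing a short AG repeat.
reference = "TTCAGAGAGAGCC"

# A 2-base deletion reported at position 5 could equally be reported
# at any of these positions; all give the same resulting sequence.
placements = equivalent_deletion_positions(reference, 5, 2)
canonical = left_align_deletion(reference, 5, 2)
print(placements, canonical)
```

When different reads report the same event at different placements, the evidence for it is split across positions and the mismatched bases can look like spurious SNPs, which is the effect the before/after panels on the slide contrast.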

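
Slide 4 lists calibrated base quality scores among the desired properties of analysis-ready reads. As a minimal sketch of what "calibrated" means, the code below compares a reported Phred quality with the empirical quality implied by the observed mismatch rate, using the standard definition Q = -10 log10(error rate). The input format, the add-one smoothing, and the toy counts are assumptions made for illustration; this is not the recalibration procedure from the talk, which would also need to exclude known variant sites and condition on more than the reported quality.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def phred(error_rate: float) -> float:
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(error_rate)


def empirical_quality(
    observations: Iterable[Tuple[int, bool]]
) -> Dict[int, float]:
    """Map each reported quality to the quality implied by observed mismatches.

    `observations` holds (reported_quality, is_mismatch) pairs, one per base.
    Add-one smoothing (an assumption of this sketch) avoids log10(0) when a
    quality bin has no mismatches.
    """
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for reported, is_mismatch in observations:
        totals[reported] += 1
        if is_mismatch:
            errors[reported] += 1
    return {
        q: phred((errors[q] + 1) / (totals[q] + 2))
        for q in totals
    }


# Toy data: 1000 bases reported as Q30, of which 10 mismatch the reference.
toy = [(30, i < 10) for i in range(1000)]
result = empirical_quality(toy)
print(round(result[30], 1))
```

In this toy case the bases claim Q30 (one error in 1000) but behave closer to Q20 (one error in 100), so the reported scores are overconfident and would need to be lowered to be calibrated.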