Understanding sources of bias and error from a prospective Reference Material (NA12878)
Ryan Poplin, on behalf of the Genome Sequencing and Analysis Group
Program in Medical and Population Genetics
August 16, 2012
NA12878 is a wonderful reference sample!
• Unrestricted cell lines
• Extensive pedigree available
• Extensively sequenced and genotyped at the Broad and elsewhere
 – All Broad techs (both production and experimental)
 – Fosmids
 – Many library designs and sample prep protocols
Our framework for variation discovery
[Figure: three-phase pipeline diagram]
Phase 1: NGS data processing (typically by lane)
 – Input: raw reads → Mapping → Local realignment → Duplicate marking → Base quality recalibration → Output: analysis-ready reads
Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously, but can be a single sample alone)
 – Input: analysis-ready reads from Sample 1 ... Sample N → SNPs, Indels, Structural variation (SV) → Output: raw variants
Phase 3: Integrative analysis
 – Input: raw indels, raw SNPs, raw SVs, plus external data (pedigrees, population structure, known variation, known genotypes) → Variant quality recalibration → Genotype refinement → Output: analysis-ready variants
DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
Lots of work required to turn raw sequencing reads into something that is useful
Phase 1: NGS data processing
Input: raw reads → Mapping → Local realignment → Duplicate marking → Base quality recalibration → Output: analysis-ready reads
Desired properties of analysis-ready reads:
• Mapping
 – Unbiased sampling of alleles
 – Calibrated mapping quality scores
• Local realignment
 – Indels have correct and consistent alignment in reads
• Duplicate marking
 – Duplicate molecules shouldn't count as extra evidence for an event
• Base quality recalibration
 – Calibrated base quality scores for base substitutions, base insertions, and base deletions
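To make "calibrated base quality scores" concrete, here is a minimal sketch (not the recalibration procedure used in the framework) that compares the quality a sequencer reports with the quality actually observed. It assumes the standard Phred definition Q = -10 log10(p_error) and a hypothetical input of (reported quality, mismatch flag) pairs, with mismatches counted against the reference away from known variant sites. The function names and the add-one smoothing are illustrative choices.

```python
import math
from collections import defaultdict


def phred(p_error):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(p_error)


def empirical_qualities(observations):
    """Estimate the observed quality for each reported quality value.

    `observations` is an iterable of (reported_quality, is_mismatch) pairs.
    This input format is hypothetical and used only for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [n_bases, n_mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)

    table = {}
    for reported_q, (n_bases, n_mismatches) in counts.items():
        # Add-one smoothing (an assumption of this sketch) keeps the
        # estimate finite when no mismatches are observed.
        p_error = (n_mismatches + 1) / (n_bases + 2)
        table[reported_q] = phred(p_error)
    return table


# 1,000 bases reported at Q30, of which 9 disagree with the reference.
observations = [(30, True)] * 9 + [(30, False)] * 991
table = empirical_qualities(observations)
```

In this toy input the bases reported as Q30 behave like roughly Q20 bases, so the reported scores are overconfident. Scores are calibrated when the reported and empirical values agree.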
Indels have correct and consistent alignment in reads through multiple sequence local realignment
Phase 1: NGS data processing
[Figure: Effect of MSA on alignments, NA12878, chr1:1,510,530-1,510,589. Panels: 1,000 Genomes Pilot 2 data, raw MAQ alignments; 1,000 Genomes Pilot 2 data, after MSA; HiSeq data, raw BWA alignments; HiSeq data, after MSA. Labeled variants: rs28782535, rs28783181, rs28788974, rs34877486.]
DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
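One reason consistency matters is that an indel inside a repeat has several equally valid placements, so an aligner that handles reads one at a time can put the same event at different positions in different reads. The toy sketch below illustrates only that ambiguity. It is not the multiple sequence realignment algorithm from the slide, and the reference string, function names, and leftmost-placement convention are assumptions made for illustration.

```python
def delete_at(ref, pos, length):
    """Return the haplotype obtained by deleting `length` bases starting at `pos`."""
    return ref[:pos] + ref[pos + length:]


def equivalent_deletion_positions(ref, pos, length):
    """List every start position whose deletion gives the same haplotype as `pos`."""
    target = delete_at(ref, pos, length)
    return [
        p
        for p in range(len(ref) - length + 1)
        if delete_at(ref, p, length) == target
    ]


def consistent_position(ref, pos, length):
    """Pick one placement for all reads; here, the leftmost (a common convention)."""
    return min(equivalent_deletion_positions(ref, pos, length))


# Hypothetical reference with a CA repeat; a 2-base deletion observed at
# position 4 is indistinguishable from the same deletion elsewhere in the repeat.
ref = "GGCACACATT"
positions = equivalent_deletion_positions(ref, 4, 2)
chosen = consistent_position(ref, 4, 2)
```

Here `positions` holds five interchangeable placements for a single deletion. Representing all reads with one agreed placement stops the same event from looking like several different indels plus spurious mismatches.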