150219 agbt giab_poster_marc


General GIAB poster

Genome in a Bottle: So you’ve sequenced a genome, how well did you do?
Marc Salit, Justin Zook, Genome in a Bottle Consortium
Genome-scale Measurements Group, National Institute of Standards and Technology, Gaithersburg, MD 20899

Overview

In 2012, NIST convened the Genome in a Bottle Consortium to develop the metrology infrastructure needed to enable confidence in human whole genome variant calls. Consortium products will include:
• Well-characterized whole genome and synthetic DNA Reference Materials (RMs)
• Reference data associated with the RMs
• Reference methods (comparison tools, documentary standards)

These Genome in a Bottle products will help enable translation of whole genome sequencing to clinical applications. Expected use cases of these products include:
• Enable regulated applications
• Validation, QC, proficiency testing
• Identify and quantify sources of bias & variability
• Optimize measurement technologies
• Resolve structural variants
• Improve reference assembly
• Integrate data from multiple platforms

[Figure: measurement process. Labels: Reference Materials; Sample Preparation; Sequencing; Bioinformatics; Variant List, Performance metrics]

Reference Material Selection and Design Group

• Personal Genome Project samples (consent for commercialization)
  • Ashkenazi Jewish trio
  • East Asian trio
  • Additional diversity and a large family?
• Supporting inter-laboratory analysis of potential commercial reference materials (recruiting labs now)
  • Are synthetic spike-ins a good surrogate for real somatic mutations?
  • Spike-ins vs. FFPE engineered cell lines vs. FFPE tissue

[Figure: Point Mutation Control Plasmids from M. Williams et al., Frederick National Laboratory for Cancer Research. Labels: Mutation of Interest; Alien Barcode]

Measurements for Reference Material Characterization Group

Dataset | Characteristics | Coverage | Availability | Good for…
Illumina | Paired-end 150x150bp | ~300x/individual | Fastq on FTP | SNPs/indels/some SVs
Illumina | Long Mate pair, ~6000 bp insert | ~40x/individual | Feb-Mar 2015 | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | Feb-Mar 2015 | SVs/phasing/assembly
Complete Genomics | Paired end | ~100x/individual | On FTP | SNPs/indels/some SVs
Complete Genomics | LFR | | Mar 2015 | SNPs/indels/phasing
Ion Proton | Exome | 1000x/individual | On FTP/SRA | SNPs/indels in exome
BioNano Genomics | Optical mapping | | Feb 2015 | SVs/assembly
PacBio | ~10kb reads | ~120-150x on AJ trio | 50% on FTP; finished ~Mar 2015 | SVs/phasing/assembly/STRs

Performance Metrics Group

• Performance Metrics Specification
  • Available on GIAB blog
• Global Alliance for Genomics & Health
  • Formed Benchmarking Task Team to develop methods and tools for comparing variant calls to a benchmark
  • Developed standardized definitions for performance metrics like TP, FP, and FN
  • Developing benchmarking tool in 3 parts: Comparison, Reporting, and Visualization
• NCBI/CDC GeT-RM Genome Browser
  • Visualization of data

How you can get involved:
• Join Analysis Group for Personal Genome Project trios
  • Help with Structural Variant calls and difficult regions of the genome
  • Help with analyzing data from long-read technologies
• Attend our biannual workshops (January in CA, August in MD)
• Help develop definitions and methods to measure performance using our well-characterized genomes with Global Alliance for Genomics & Health Benchmarking Working Group
• Use our integrated SNP/indel/homozygous reference genotypes for NA12878 and give us feedback

Genome in a Bottle Consortium: New members welcome! Sign up for newsletters.

Bioinformatics, Data Integration, and Data Representation Group

Developing Benchmark Genotypes

• Developed data integration methods and benchmark genotype calls for NA12878
  • Multi-platform method
  • Published by Zook et al. (2014) in Nature Biotechnology
• Newest calls integrate Pedigree methods
  • Real Time Genomics (RTG)
  • Illumina Platinum Genomes
• NCBI hosts FTP with raw data and calls
• Mirrored to AWS S3

[Figure: Overlap of SNP calls between three variant call files and proposed methods to arbitrate between multiple datasets and produce high-confidence integrated SNP, indel, and homozygous reference genotypes.] A similar integration process has been applied to our pilot genome based on NA12878 (see Zook et al., Nat. Biotech., 2014), and we plan to use these methods to produce high-confidence calls for the Ashkenazim and Asian trios from the Personal Genome Project.

Integration flowchart:
1. Normalize and take union of calls
2. Simple SNPs/indels
   • Illumina/SOLiD: GATK HC force calls
   • Ion: TVC force calls
   • CG: use Ref file
   • If all biased or low qual, uncertain
     Else if all concordant, high-conf
     Else if all unbiased are concordant, high-conf
     Else uncertain
3. Complex Variants
   • Use GA4GH methods for sequential pairwise comparison

Structural Variants

• We are developing similar methods for SVs (see Zook et al. poster)
• Methodology development to annotate each SV using coverage, insert size, discordant paired ends, mapping quality, soft-clipping …
• How to use long-read technologies?

Forming an analysis group:
• Using long-reads
• SV analysis
• De novo assembly
• Complex variants
• All data is public
• Now recruiting members
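The arbitration rules the poster lists for simple SNPs/indels (all biased or low quality, all concordant, all unbiased concordant, otherwise uncertain) can be illustrated with a small sketch. This is not the consortium's implementation: the `DatasetCall` record, the `biased` and `low_qual` flags, and the `arbitrate` function are hypothetical names introduced here, and the sketch assumes that genotypes have already been normalized to comparable strings and that "unbiased" means a call carrying neither flag.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DatasetCall:
    """One dataset's forced genotype call at a single site (hypothetical record)."""
    dataset: str             # e.g. "Illumina", "Ion", "CG"
    genotype: str            # assumed already normalized, e.g. "0/1"
    biased: bool = False     # assumed flag: evidence of systematic bias
    low_qual: bool = False   # assumed flag: below some quality cutoff


def arbitrate(calls: List[DatasetCall]) -> Tuple[str, Optional[str]]:
    """Return (status, genotype); status is "high-conf" or "uncertain"."""
    if not calls:
        return ("uncertain", None)

    # Calls with neither flag set are treated as "unbiased" (an assumption).
    unflagged = [c for c in calls if not (c.biased or c.low_qual)]

    # Rule 1: every dataset is biased or low quality -> uncertain.
    if not unflagged:
        return ("uncertain", None)

    # Rule 2: all datasets agree -> high confidence.
    if len({c.genotype for c in calls}) == 1:
        return ("high-conf", calls[0].genotype)

    # Rule 3: all unflagged datasets agree -> high confidence.
    if len({c.genotype for c in unflagged}) == 1:
        return ("high-conf", unflagged[0].genotype)

    # Rule 4: anything else stays uncertain.
    return ("uncertain", None)


if __name__ == "__main__":
    site = [
        DatasetCall("Illumina", "0/1"),
        DatasetCall("CG", "0/1"),
        DatasetCall("Ion", "1/1", biased=True),
    ]
    print(arbitrate(site))
```

In this example the flagged Ion call disagrees with the two unflagged calls, so the third rule applies and the site is reported as high confidence with the genotype the unflagged datasets share. How bias and low quality are actually detected is outside the scope of this sketch.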