Who are we?
• Justin Johnson
  – Managing Director of Services
  – Director of Bioinformatics
  – 10 Years at JCVI before EdgeBio
  – Project Manager - Archon Genomics XPrize
• EdgeBio
  – CLIA Lab
  – Illumina HiSeq & MiSeq, Ion Proton & PGM
Overview – GIAB as I See It
• Which genomes?
• How do we sequence them?
• How do we analyze them?
• How do we enable their usage?
Overview Bioinformatics
• Experimental Data
• Data Integration
  – Sequence Data & Variation
  – Metadata / Representation
• Database
  – RM vs. Reference
  – Every Base
• Refine and Feedback
• Compare and Report; Visualize and Filter
  – Single Genome Browser
  – Browser over DB
  – ValidationProtocol.org
  – Query by Experiment Data
Experimental Data = Combination of Prep / Sequencing / Analysis
Experimental Data
• GetRM Model for Collection
  – http://www.ncbi.nlm.nih.gov/projects/variation/get-rm/
• Preparation
  – Link to published prep protocol
  – ROI in Bed/GFF/GBK Format
• Sequencing
  – Platform Information (Minimally - Name)
  – Chemistry (Minimally - Version)
• Analysis
  – Link to published analysis protocol or best practices
  – Read Data (fastq, sra, hdf5, others)
  – Alignment/Assembly Data (bam)
    • Minimal Tag Set TBD
  – Variation (VCF or gVCF)
    • Minimal Tag Set TBD in INFO field of VCF or define external XSD
    • https://sites.google.com/site/gvcftools/home/about-gvcf
Meta Data
• All required fields in VCF 4.1
• Others (Examples)
  – AA : ancestral allele
  – AC : allele count in genotypes, for each ALT allele, in the same order as listed
  – AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
  – AN : total number of alleles in called genotypes
  – BQ : RMS base quality at this position
  – CIGAR : cigar string describing how to align an alternate allele to the reference allele
  – DB : dbSNP membership
  – DP : combined depth across samples, e.g. DP=154
  – END : end position of the variant described in this record (for use with symbolic alleles)
  – H2 : membership in hapmap2
  – VALIDATED : validated by follow-up experiment
• Reference Block Implementations
• Handle Indel Conflicts and Resolution
• Genotype Quality for non-variant sites (GQX)
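As an illustration of how the INFO keys listed above travel in a VCF record, here is a minimal Python sketch that splits an INFO column into a dictionary. The function name `parse_info` and the example string are hypothetical; values are kept as strings (or lists of strings for comma-separated values) rather than typed according to the VCF header, which a real parser would do.

```python
def parse_info(info):
    """Parse a VCF INFO column into a dict.

    Flag keys (no '=') map to True; comma-separated values become lists.
    All values stay as strings; typing from the header is not attempted.
    """
    fields = {}
    if info in ("", "."):  # '.' marks an empty INFO column
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            fields[key] = True  # flag such as DB or VALIDATED
        else:
            parts = value.split(",")
            fields[key] = parts if len(parts) > 1 else value
    return fields


# Hypothetical INFO string using keys from the slide
example = parse_info("DP=154;AF=0.25,0.5;DB;VALIDATED")
```

Flags such as DB and VALIDATED carry no value, so a filter can test for them simply by key presence.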
Database
• Store Each Base + Meta of RM versus Reference for each Experiment from gVCF
  – Distinguish missing versus homozygous reference
  – Include copy number and phasing when available, not required
• Engine that drives front end visualization (Genome Browser)
• Build on GetRM/NCBI Database Work
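The distinction between missing and homozygous reference can be sketched in Python as follows. This is only an assumption-laden illustration: the records are simplified `(start, end, genotype)` tuples standing in for gVCF reference blocks and variant lines on a single chromosome, and `classify_bases` is a hypothetical helper, not part of any GetRM or gvcftools code.

```python
def classify_bases(records, start, end):
    """Return {pos: status} for the 1-based inclusive range [start, end].

    Positions not covered by any record, or covered by a no-call ('./.'),
    are 'missing'; 0/0 blocks are 'hom_ref'; anything else is 'variant'.
    """
    status = {pos: "missing" for pos in range(start, end + 1)}
    for rec_start, rec_end, genotype in records:
        alleles = genotype.replace("|", "/").split("/")
        if "." in alleles:
            call = "missing"
        elif all(a == "0" for a in alleles):
            call = "hom_ref"
        else:
            call = "variant"
        for pos in range(max(rec_start, start), min(rec_end, end) + 1):
            status[pos] = call
    return status


# Hypothetical records: a reference block, a het call, and a no-call block
records = [(100, 104, "0/0"), (105, 105, "0/1"), (108, 109, "./.")]
calls = classify_bases(records, 100, 110)
```

Storing an explicit status for every base is what lets a later comparison tell "confidently reference" apart from "no data".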
Visualize and Filter
• Build on GetRM/NCBI Browser Work
• Single RM -> Many Experiments
• Not all metadata will be visual, but most/all will be filterable
• Filter data to generate ROI or VOI
  – Canned: e.g. Intersect of All Platforms + Analysis, All OMIM SNPs, Clinical Cert SNV List, etc.
  – Dynamic: allowing people to explore prep, sequence, or analysis bias
• Slice, Dice, Export VOI to compare and reporting SW
• Allow user defined tracks
• Byproduct is a community educational resource
  – I have a ROI for a test and want to know what platform, prep, exome kit version, etc. covers it best. What do I do?
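A canned filter such as "Intersect of All Platforms" amounts to an interval intersection over per-platform regions. The Python sketch below assumes BED-style half-open `(start, end)` intervals on a single chromosome, already sorted and non-overlapping within each track; the platform tracks and function names are hypothetical.

```python
from functools import reduce


def intersect(a, b):
    """Intersect two sorted, non-overlapping lists of half-open intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # the two intervals overlap
            out.append((start, end))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(tracks):
    """Regions present in every track (requires at least one track)."""
    return reduce(intersect, tracks)


# Hypothetical covered regions for three platforms
platform_a = [(100, 200), (300, 400)]
platform_b = [(150, 350)]
platform_c = [(0, 180), (320, 500)]
shared = intersect_all([platform_a, platform_b, platform_c])
```

The resulting `shared` list is the kind of ROI that could then be exported to the compare and reporting step.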
Parallel Database, Filter Effort (Gemini)
Quinlan Lab at UVA - https://github.com/arq5x/gemini
• Gemini – simple, flexible, and powerful framework for exploring genetic variation
• Basic browser capabilities being developed
• Flexible custom annotation and metadata addition to DB
• Leverage the expressive power of SQL while overcoming fundamental challenges associated with using databases for very large datasets
Compare and Reporting
• Take in ROI or VOI from the visualize and filter stage
• Take in user defined VOI or VOI + ROI
• Leverage SW under ValidationProtocol.org to generate reports and files including, but not limited to:
  – Summary of completeness, accuracy, phasing
  – Discordant variants in VCF
  – Concordant variants in VCF
  – Phasing errors in VCF
• Provide an intuitive way to feed these results into downstream analysis SW (VariantViz, IO8) or back into the browser (User Defined Track)
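The concordant/discordant split can be illustrated with a small Python sketch. It is not the ValidationProtocol.org software: `compare_calls` is a hypothetical helper, call sets are assumed to be dictionaries keyed by `(chrom, pos, ref, alt)` with a genotype string as the value, genotypes are compared without regard to allele order, and phasing errors are left out of scope.

```python
def compare_calls(truth, query):
    """Split sites into concordant and discordant lists.

    A site is concordant when it appears in both call sets with the same
    unordered genotype; any other site present in either set is discordant.
    """
    report = {"concordant": [], "discordant": []}
    for site in sorted(set(truth) | set(query)):
        t, q = truth.get(site), query.get(site)
        if t is None or q is None:
            report["discordant"].append(site)  # called in only one set
            continue
        t_alleles = sorted(t.replace("|", "/").split("/"))
        q_alleles = sorted(q.replace("|", "/").split("/"))
        key = "concordant" if t_alleles == q_alleles else "discordant"
        report[key].append(site)
    return report


# Hypothetical reference (truth) and submitted (query) call sets
truth = {
    ("chr1", 100, "A", "G"): "0/1",
    ("chr1", 200, "C", "T"): "1/1",
    ("chr1", 300, "G", "A"): "0/1",
}
query = {
    ("chr1", 100, "A", "G"): "1/0",
    ("chr1", 200, "C", "T"): "0/1",
    ("chr1", 400, "T", "C"): "0/1",
}
report = compare_calls(truth, query)
```

Each list could then be written back out as a VCF or loaded as a user defined track.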
• $10 million prize competition to showcase whole genome sequencing technology
• Award to the team(s) who can most completely, accurately and affordably sequence 100 human genomes in 30 days or less
• Competing Teams will sequence the genomes of the 100 centenarians who have evaded the usual diseases of aging such as heart disease, diabetes, cancer and Alzheimer’s
AGXP Validation Study Analysis
• 2 Major Phases using NA19239 and NA12878
  – Develop Reference Standards
    • Fosmid Reconstruction, Variation Discovery
    • Technology Comparison and Bias Removal
  – Develop Performance Metrics
    • Software Development
    • Help labs use the data
Compare and Report
• The validationprotocol.org website provides a simple way for anyone to compare their variant calls against the public reference genomes.
• Encourages submission and analysis in public tools like Galaxy through transparent interoperability with GenomeSpace.
Follow On
• Export variants in different categories (Concordant/Discordant/Phasing Error) to VariantViz IO8
• Visualize Quality, Allele Frequencies, Depth, etc. to detect patterns in and between variant categories
Xprize Team
• Justin H. Johnson and Team, EdgeBio
• Brad Chapman, Harvard: automated high-throughput analysis pipelines with custom visualization and processing tools
• Gabor Marth, Boston College: read mapping, single-nucleotide and insertion-deletion polymorphism detection, and discovery of structural variants
• Aaron Quinlan, University of Virginia: structural variation (SV)
• Granger Sutton, JCVI: Oversight Committee
• Victor Jongeneel, University of Illinois and NCSA: Oversight Committee
• Larry Kedes, UCLA: Oversight Committee
EdgeBio Team
• LAB and IFX
  – Joy Adigun
  – David Jenkins
  – Ryan Mease
  – Anju Varadarajan
  – Jennifer Sheffield
  – Vani Rajan
  – Aaron Johnson
  – Karthik Kota
  – Jackie Jackson
  – Phil Dagasto
• Adam Bennett
• Isabel Llorente
Thank You!
More info available at http://bit.ly/agxpval and http://www.genomeinabottle.org