Bioinformatics, Data Integration, and Data Representation
Steve Sherry, NCBI, WG Lead
Justin Zook, Presenter

WG Charge
• Develop strategy to analyze each data set
• Develop plan for integrating data and forming consensus variant calls and confidence estimates
• Develop consensus plan for data representation
• Building on NCBI-CDC efforts in GeT-RM Project
  – developing repository
  – developing browser
• Shared work with Performance Metrics WG
  – will be scaled for GiaB
• Building on NIST work for integration and confidence estimation

Task 1: Inventory existing NA12878 data
• Assignees: NIST & GeT-RM (NCBI)
• Timeline: Evolving document with version 1 by August 31
• Prepare a project document (e.g. Google doc) with matrix of all sources to include:
  – Submitter details
  – DNA source (cell line DNA / source collection DNA)
  – Coverage characteristics
  – Instrument platform
  – Library design: fosmid, WGS, WES, with ascertainment characteristics (fosmid) and target design (WES) where known
  – Source of data on internet
  – Data release / availability date for new data sets & platforms
  – Priority rank for data consolidation, filtering, and analysis

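The matrix described above could be kept as one record per data source. The following is a minimal sketch, assuming a flat CSV export; the class name `SourceEntry`, the field names, and the helper `to_csv` are hypothetical choices for illustration, not an agreed consortium schema.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class SourceEntry:
    """One row of the inventory matrix (hypothetical field names)."""
    submitter: str
    dna_source: str        # "cell line DNA" or "source collection DNA"
    coverage: str          # free text, e.g. depth and breadth notes
    platform: str          # instrument platform
    library_design: str    # fosmid, WGS, or WES (plus design notes)
    url: str               # where the data can be found
    release_date: str      # release / availability date
    priority: int          # rank for consolidation; 1 = highest


def to_csv(entries):
    """Return the matrix as CSV text, sorted by priority rank."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(SourceEntry)])
    writer.writeheader()
    for entry in sorted(entries, key=lambda e: e.priority):
        writer.writerow(asdict(entry))
    return buf.getvalue()
```

Rows exported this way can be pasted into a shared spreadsheet, so the priority ordering used for consolidation stays in one place.
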
Task 1: NA12878 source discussion
Source | Notes
GeT-RM | Several available sources have been inventoried.
Broad | Many lab protocols & designs. NA12878 HiSeq 60x and WES 100x. Need to check whether an assembly or integrated analysis is available. BED available for WES design, but GATK reverse engineers the target set.
SRA trace | Chromatograms for Eichler’s fosmids. Made from DNA (not cell line), assembled at Wash U, and submitted to NCBI. Debate about which to use. Some fosmids were picked because they are tricky. G1K should have a usable set available in 1 month. Note that the set may have 1-2% contamination from other fosmid libraries.
Xprize | Fosmids covering 3-5% of genome and whole genome at 30x. Not all fosmids are done. NA19239 available at GenomeSpace and NA12878 coming in 2 months.
Illumina | Platinum reads at ENA, 200x.
CG | 50-60x, on CG FTP site.
OpGen | Check if data are publicly available.

Task 2: define quality for reads, runs, lanes
• Assignees: Real Time Genomics and NCBI
• Timeline: Proposal for comment by August 31
• The consortium should define a protocol for quality filters on reads, runs, and lanes.
• A prototype can be taken from 1000 Genomes and other current large-scale studies.

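To make the idea of a read-level quality filter concrete, here is a minimal sketch, assuming FASTQ quality strings in the standard Phred+33 encoding. The function names and both thresholds (`min_mean_q`, `max_n_fraction`) are illustrative placeholders, not values proposed by the consortium or taken from 1000 Genomes.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 by default)."""
    if not quality_string:
        return 0.0
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


def passes_read_filter(sequence, quality_string,
                       min_mean_q=20.0, max_n_fraction=0.05):
    """Keep a read only if its mean quality is high enough and it has few Ns.

    Both thresholds are placeholder assumptions for illustration.
    """
    if not sequence or len(sequence) != len(quality_string):
        return False  # malformed record: empty, or lengths disagree
    n_fraction = sequence.upper().count("N") / len(sequence)
    return mean_phred(quality_string) >= min_mean_q and n_fraction <= max_n_fraction
```

Run- and lane-level filters could follow the same pattern by aggregating these per-read results, for example rejecting a lane when too small a share of its reads pass.
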
Task 3: compile data
• Assignees: NIST and NCBI
• Timeline: First set filtered by end of Sept.
• Iterative process to follow matrix priority
• Identify data hosting sites: Google, AWS, NCBI, EBI?
• Hosted data would provide centralized and synchronized stores for filtered reads to use in reanalysis.
• Separate areas clearly labeled as “working” and “released”.
• All areas publicly accessible.
• Results from pipelines would be posted for group analysis.

Task 4: run pipelines
• Volunteers to run existing pipelines:
  – Real Time Genomics (Illumina & CG)
  – NCBI (Illumina, SOLiD, and CG)
  – Edge Genomics (Illumina and SOLiD)
• Timeline: As data are staged and callers installed
• References:
  – GRCh37 with baits (G1K standard), no chr Y
  – GRCh38 to assess effects of alt haplotypes / fixes
• Run modules for all mutation types including mobile elements:
  – FreeBayes, GATK2, CG
  – lobSTR for LINEs, Alus, CA repeats
  – Hydra, Pindel, proprietary for structural variation

Task 5: consensus call integration
• Recalibration: All or subset of dbSNP, e.g. G1K plus GO-ESP.
• Analysis to produce
  – BAM, and archive compressed versions in cSRA & CRAM
  – VCF, or
  – Novel compressed genome format: variants and probability that a region matches the reference, i.e. confident nothing but reference in region. [gVCF?]
  – Quantile the genome by difficulty to align / call variants
  – Should analysis include WES on/off target specificity?
• Single file per individual.
• Future references may include tissue-specific DNA/RNA samples
• Think about epigenomic markup to future-proof resource

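As an illustration of what integrating call sets can mean at its simplest, the sketch below counts how many pipelines report each variant and keeps those supported by more than a chosen fraction. This is a naive concordance vote, not the NIST integration method; the function name `integrate_calls`, the `(chrom, pos, ref, alt)` tuple representation, and the `min_fraction` cutoff are all assumptions, and the supporting fraction is only a crude stand-in for a confidence estimate.

```python
from collections import defaultdict


def integrate_calls(call_sets, min_fraction=0.5):
    """Naive consensus across pipelines.

    call_sets maps a pipeline name to a set of (chrom, pos, ref, alt) tuples.
    Returns {variant: fraction of pipelines reporting it} for variants whose
    support is strictly greater than min_fraction.
    """
    if not call_sets:
        return {}
    support = defaultdict(set)
    for name, calls in call_sets.items():
        for variant in calls:
            support[variant].add(name)
    total = len(call_sets)
    consensus = {}
    for variant, names in support.items():
        fraction = len(names) / total
        if fraction > min_fraction:
            consensus[variant] = fraction
    return consensus
```

A real integration would also weigh platform-specific biases and record regions confidently matching the reference, which is where a gVCF-style representation would come in.
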
Working as a consortium
• Analysis group will need a listserv and periodic discussions via standing conference calls.
• Google Doc area
• Data staging areas
• Further group tasks to be discussed via blogs, telecons, and mailing lists

Archival of Data & Pipelines
• Ongoing discussions
  – Cloud data availability
  – Data formats
• Pipeline archival
  – Startup commercial services in this space (robustness?)
  – Amazon
  – Google
  – Federal Resources?

Data Representation/Data Standards
• CDC taking lead in convening standards proposal
  – focusing on VCF and gVCF
  – assembling working team from stakeholder communities
  – workshops?
  – telecons?
• Alternate approaches
  – representation as assemblies, not variant calls