Bioinformatics, Data Integration, and Data Representation Working Group Summary, Aug 2012
Bioinformatics, data integration, and data representation Breakout group C
Task 1: Inventory existing NA12878 data
Assignees: NIST & GET-RM (NCBI)
Timeline: Evolving document with version 1 by August 31
Prepare a project document (e.g. Google doc) with a matrix of all sources to include:
- Submitter details
- DNA source (cell line DNA / source collection DNA)
- Coverage characteristics
- Instrument platform
- Library design: fosmid, WGS, WES, with ascertainment characteristics (fosmid) and target design (WES) where known
- Source of data on the internet
- Data release / availability date for new data sets & platforms
- Priority rank for data consolidation, filtering, and analysis
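One way the source matrix could be kept machine-readable alongside the project document is sketched below. This is only an illustration: the `SourceEntry` class, its field names, and the `to_csv` helper are hypothetical names chosen here to mirror the columns listed above, and the group has not agreed on any schema.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class SourceEntry:
    """One row of the inventory matrix (hypothetical schema)."""
    submitter: str
    dna_source: str       # "cell line DNA" or "source collection DNA"
    coverage: str         # coverage characteristics, e.g. "60x"
    platform: str         # instrument platform
    library_design: str   # fosmid, WGS, or WES
    design_notes: str     # ascertainment (fosmid) or target design (WES)
    url: str              # source of data on the internet
    release_date: str     # data release / availability date
    priority: int         # rank for consolidation; 1 = highest (assumption)


def to_csv(entries):
    """Serialize entries to CSV text, ordered by priority rank."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(SourceEntry)],
        lineterminator="\n",
    )
    writer.writeheader()
    for entry in sorted(entries, key=lambda e: e.priority):
        writer.writerow(asdict(entry))
    return buf.getvalue()
```

Sorting by `priority` on output means the exported matrix lists the data sets in the order they would be consolidated and filtered.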
Task 1: NA12878 source discussion
Source: Notes
- GET-RM: Several available sources have been inventoried.
- Broad: Many lab protocols & designs; NA12878 HiSeq 60x and WES 100x; to check if assembly or integrated analysis is available. BED available for WES design, but GATK reverse engineers the target set.
- SRA trace: Chromatograms for Eichler's fosmids. Made from DNA (not cell line), assembled at WashU, and submitted to NCBI. Debate about which to use. Some fosmids were picked because they are tricky. G1K should have a usable set available in 1 month. Note that the set may have 1-2% contamination from other fosmid libraries.
- Xprize: Fosmids covering 3-5% of the genome and whole genome at 30x. Not all fosmids are done. NA19239 is available at GenomeSpace and NA12878 is coming in 2 months.
- Illumina: Platinum reads at ENA, 200x.
- CG: 50-60x, on the CG FTP site.
- OpGen: Check if data are publicly available.
Task 2: Define quality for reads, runs, lanes
Assignees: Real Time Genomics and NCBI
Timeline: Proposal for comment by August 31.
The consortium should define a protocol for quality filters on reads, runs, and lanes.
A prototype can be taken from 1000 Genomes and other current large-scale studies.
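To make the discussion concrete, a read-level filter and a lane-level summary might look like the sketch below. Everything specific in it is an assumption for illustration, not a consortium decision: the Phred+33 quality encoding, the thresholds `min_mean_q=20.0` and `max_n_fraction=0.1`, and the function names themselves.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def mean_quality(quality, offset=33):
    """Mean Phred score of a read; 0.0 for an empty quality string."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


def passes_read_filter(sequence, quality, min_mean_q=20.0, max_n_fraction=0.1):
    """Keep a read if its mean quality is high enough and it has few N calls.

    Both thresholds are placeholder values, not agreed cutoffs.
    """
    if not sequence or len(sequence) != len(quality):
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    return mean_quality(quality) >= min_mean_q and n_fraction <= max_n_fraction


def lane_pass_rate(reads, **thresholds):
    """Fraction of (sequence, quality) pairs in a lane that pass the read filter."""
    reads = list(reads)
    if not reads:
        return 0.0
    passed = sum(1 for seq, qual in reads if passes_read_filter(seq, qual, **thresholds))
    return passed / len(reads)
```

A lane or run criterion could then be expressed as a minimum `lane_pass_rate`, so that the read, lane, and run definitions all derive from one set of read-level rules.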
Task 3: Compile data
Assignees: NIST and NCBI
Timeline: First set filtered by end of September. Iterative process to follow matrix priority.
- Identify data hosting sites: Google, AWS, NCBI, EBI?
- Hosted data would provide centralized and synchronized stores for filtered reads to use in reanalysis.
- Separate areas clearly labeled as “working” and “released”.
- All areas publicly accessible.
- Results from pipelines would be posted for group analysis.
Task 4: Run pipelines
Volunteers to run existing pipelines:
- Real Time Genomics (Illumina & CG)
- NCBI (Illumina, SOLiD, and CG)
- Edge Genomics (Illumina and SOLiD)
Timeline: As data are staged and callers installed
References:
- GRCh37 with baits (G1K standard), no chr Y
- GRCh38 to assess effects of alt haplotypes / fixes
Run modules for all mutation types including mobile elements:
- Freebayes, GATK2, CG
- Lobster for LINEs, Alus, CA repeats
- Hydra, pindel, proprietary for structural variation
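The volunteer and reference lists above imply a small grid of pipeline runs. A minimal sketch of enumerating that grid follows; it assumes every volunteer runs each of its platforms against both references, which the group did not state explicitly, and the names `VOLUNTEER_PLATFORMS`, `REFERENCES`, and `planned_runs` are hypothetical.

```python
from itertools import product

# Volunteers and the platforms they offered to process (from the list above).
VOLUNTEER_PLATFORMS = {
    "Real Time Genomics": ["Illumina", "CG"],
    "NCBI": ["Illumina", "SOLiD", "CG"],
    "Edge Genomics": ["Illumina", "SOLiD"],
}

# Reference assemblies named for the runs.
REFERENCES = ["GRCh37", "GRCh38"]


def planned_runs(volunteers=VOLUNTEER_PLATFORMS, references=REFERENCES):
    """List (volunteer, platform, reference) for every combination.

    Assumes each volunteer runs each of its platforms on every reference.
    """
    runs = []
    for group, platforms in volunteers.items():
        for platform, reference in product(platforms, references):
            runs.append((group, platform, reference))
    return runs
```

Such a list could serve as a checklist for tracking which results have been posted for group analysis.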
Task 5: Consensus call integration
Recalibration: All or subset of dbSNP, e.g. G1K plus GoESP.
Analysis to produce BAM, and archive compressed versions in cSRA & CRAM.
VCF, or a novel compressed genome format: variants and the probability that a region matches the reference, i.e. confident nothing but reference in region. [gVCF?]
Quantile the genome by difficulty to align / call variants.
Should analysis include WES on/off target specificity?
Single file per individual.
Future references may include tissue-specific DNA/RNA samples.
Think about epigenomic markup to future-proof the resource.
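The idea of quantiling the genome by difficulty can be illustrated with a short sketch. It assumes each region already has a numeric difficulty score where higher means harder to align or call; how that score would be computed was not decided, so the score, the default of four bins, and the name `difficulty_quantiles` are all placeholders.

```python
import statistics


def difficulty_quantiles(region_scores, n_bins=4):
    """Assign each region a bin from 0 (easiest) to n_bins - 1 (hardest).

    region_scores maps a region label to a hypothetical difficulty score.
    Cut points are the quantiles of the observed scores; at least two
    regions are needed to compute them.
    """
    cuts = statistics.quantiles(region_scores.values(), n=n_bins, method="inclusive")
    # A region's bin is the number of cut points its score strictly exceeds.
    return {
        region: sum(score > cut for cut in cuts)
        for region, score in region_scores.items()
    }
```

Calls and reference-confidence estimates could then be reported per bin, so that performance in easy regions is not averaged together with performance in difficult ones.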
Working as a consortium
Analysis group will need a listserv and periodic discussions via standing conference calls.
- Google Doc area
- Data staging areas
Further group tasks to be discussed via blogs, telecon, and mailing lists.