Aug2013 bioinformatics working group

Published in: Technology
  1. Bioinformatics working group, GIAB, August 2013
  2. 1. Review of data release policy
     As a community resource project, the Genome in a Bottle Consortium publicly releases data on a regular basis. Anyone who sequences the candidate Reference Materials can submit raw reads and/or aligned reads to the SRA. The Bioinformatics Working Group may also release additional alignments, re-calibrated error rates, and variant calls on individual samples. Data that passed quality filters are the most visible, but data that did not pass quality filters are also available from the SRA.
     A goal of the Consortium is to generate highly confident phased genotype calls for all types of variants across the entire genome of each sample. These genotype calls and their associated supporting data will improve over time as sequencing and analysis methods improve, and regions of low confidence will be differentiated from confident homozygous reference calls. Data formats and analysis software developed by the Project are also made publicly available.
  3. 2. What analyses have people already performed for NA12878?
     • NIST – integrating multiple datasets; some discordance analysis.
     • X Prize – integrated Illumina and SOLiD data for fosmids covering ~5% of the genome.
     • Illumina Platinum Genomes – pedigree analysis of NA12878 and offspring.
     • 1000 Genomes – high coverage of the trio (NA12878 as daughter); 250bp sequencing.
     • Real Time Genomics – pedigree-based analysis of the entire pedigree 1463: parents, NA12878, spouse, children. Illumina data; possibly Complete Genomics data.
     • Broad – curated call set of chr20 and exome.
  4. 3. Analysis interests and timelines
     • Mt. Sinai – may have PacBio data to analyze; target is 50x coverage.
     • Ion Torrent – exome data.
     • Real Time Genomics – parental exome data?
     • Sanger – de novo local SGA alignments.
     • Cortex – NCBI to run for the trio; NIST has data too.
     • STR analysis – RepeatSeq, run by D. Mittelman.
     • CNV calls – Mike Eberle, NIST, CG data, Personalis.
     • mtDNA – is anybody doing this?
     • Moleculo – perhaps coming from Illumina in the future.
  5. 4. Data integration
     • Combine pedigree methods like Platinum Genomes and RTG with multi-platform methods like NIST’s.
       – NIST model as a pilot
       – Consider growth in the number of data sets
     • Different products: high- and moderate-confidence products to reflect the certainty or degree of difficulty of regions (easy, difficult). Products should recognize the heterogeneity in the quality of data across the genome.
     • HeLa-like characterization of NA12878 (e.g. Shendure or Steinmetz methods) to identify regions of cell-line rearrangement.
     • Integrate small variants with CNVs/SVs?
     • Validation.
     • What standard metrics of quality do we need for calls and placements?
     • What fraction of the genome is arbitrated by the NIST method?
  6. 5. GIAB ftp site @ NCBI
     • Organization – follows the template of 1000 Genomes in terms of layout: /data for releases and /technical for contributions.
     • Data from parents – NA12891 and NA12892 have been added for data, but no RM will be made from them.
     • Cloud – site mirrored to Amazon at s3://giab. Need to check consistency of metadata for read permissions.
     • Data upload for bam or vcf – contact NCBI (Dr. Chulin Xiao) for an Aspera upload account. Data uploaded to the dropbox will be moved into the project /technical area for community use.
     • New sequence data should be submitted directly to SRA and cite the GIAB BioProject ID (PRJNA200694) to link the data to the project. NCBI will copy the data into the FTP area. Submissions also need to be added to the Google Doc tracking document.
  7. 6. Data integration from RM sequence
     • Treat new SRA submissions as inputs using the same workflows. Individual RMs would need to be reanalyzed when a significant amount of new data has accumulated, when the reference changes, etc.
     • The roster of active analysts and submitters is likely to change over time.
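
The data integration slide asks what fraction of the genome is arbitrated by the NIST method. One way to put a number on a question like that is to sum the bases covered by a BED file of confident regions and divide by the total reference length. The following is a minimal sketch, not the Consortium's actual procedure: the function names, the toy chromosome names, and the lengths are all hypothetical, and it assumes standard BED conventions (0-based, half-open intervals).

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(bed_lines, chrom_lengths):
    """Return the fraction of the reference covered by the BED intervals.

    bed_lines: iterable of BED-formatted text lines (chrom, start, end, ...).
    chrom_lengths: dict mapping chromosome name to length in bases
                   (hypothetical input; intervals on other chromosomes
                   are ignored).
    """
    by_chrom = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        if chrom in chrom_lengths:
            by_chrom[chrom].append((int(start), int(end)))

    covered = sum(
        end - start
        for intervals in by_chrom.values()
        for start, end in merge_intervals(intervals)
    )
    return covered / sum(chrom_lengths.values())


# Toy example with made-up chromosomes, not real GIAB data.
toy_lengths = {"chrA": 1000, "chrB": 500}
toy_bed = ["chrA\t0\t100", "chrA\t50\t200", "chrB\t100\t200"]
fraction = covered_fraction(toy_bed, toy_lengths)
```

Merging before summing matters because overlapping intervals would otherwise be counted twice; in the toy example the two chrA intervals collapse to a single 200-base region.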

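
The ftp site slide asks that new sequence data cite the GIAB BioProject ID (PRJNA200694) so that submissions are linked to the project. As an illustration of how that link could later be used to find submissions, here is a hedged sketch that builds an NCBI E-utilities `esearch` query against the SRA database and pulls record IDs out of a JSON response. The helper names are hypothetical, and the `[BioProject]` field tag is an assumption that should be checked against the Entrez documentation; the sketch only constructs the URL and parses a response, it does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
# BioProject ID given on the slide.
GIAB_BIOPROJECT = "PRJNA200694"


def sra_search_url(bioproject=GIAB_BIOPROJECT, retmax=100):
    """Build an esearch URL for SRA records linked to a BioProject.

    The "[BioProject]" field tag is an assumption, not confirmed by the slides.
    """
    params = {
        "db": "sra",
        "term": f"{bioproject}[BioProject]",
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def extract_ids(esearch_json):
    """Return the list of record IDs from a decoded esearch JSON response."""
    return esearch_json.get("esearchresult", {}).get("idlist", [])


url = sra_search_url()
```

To run the query, the URL could be fetched with `urllib.request.urlopen(url)` and the body decoded with `json.load`, then passed to `extract_ids`. The returned IDs are internal Entrez UIDs, so a further lookup would be needed to map them to SRA run accessions.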