1. Review of data release policy
As a community resource project, the Genome in a Bottle Consortium
publicly releases data on a regular basis. Anyone who sequences the
candidate Reference Materials can submit raw reads and/or aligned
reads to the SRA.
The Bioinformatics Working Group may also release additional
alignments, re-calibrated error rates, and variant calls on individual
samples. Data that passed quality filters are the most visible, but data
that did not pass quality filters are also available from the SRA.
A goal of the Consortium is to generate highly confident phased
genotype calls for all types of variants across the entire genome of
each sample. These genotype calls and their associated supporting
data will improve over time as sequencing and analysis methods
improve, and regions of low confidence will be differentiated from
confident homozygous reference calls.
Data formats and analysis software developed by the Project are also
made publicly available.
2. What analyses have people already performed for NA12878?
• NIST – integrating multiple datasets; some discordance.
• X Prize – integrated Illumina and SOLiD data for fosmids
covering ~5% of genome.
• Illumina Platinum Genomes – pedigree analysis of
NA12878 and offspring.
• 1000 Genomes – high coverage of trio (NA12878 as
daughter). 250bp sequencing.
• Real Time Genomics – Pedigree-based analysis of entire
pedigree 1463: parents, NA12878, spouse, children.
Illumina data. Possibly Complete Genomics data.
• Broad – curated call set of chr20 and exome.
3. Analysis interests and timelines
• Mt. Sinai – may have PacBio data to analyze. Target
• Ion Torrent – exome data
• Real Time Genomics – parental exome data?
• Sanger – de novo local SGA alignments
• Cortex – NCBI to run for trio, NIST has data too
• STR analysis – RepeatSeq run by D. Mittelman
• CNV calls – Mike Eberle, NIST, CG data, Personalis
• mtDNA – anybody doing this?
• Moleculo – perhaps coming from Illumina in future
4. Data integration
• Combine pedigree methods like Platinum Genomes and RTG with multi-platform
methods like NIST’s.
– NIST model as a pilot
– Consider growth in # of data sets
• Different products: high- and moderate-confidence products to reflect
the certainty of calls or the difficulty of regions (easy, difficult).
Products should recognize the heterogeneity in the quality of data across
• HeLa-like characterization of NA12878 (e.g. Shendure or Steinmetz
methods) to identify regions of cell-line rearrangement
• Small variants with CNVs/SVs?
• What standard metrics of quality do we need for calls and placements?
• What fraction of the genome is arbitrated by the NIST method?
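One way to picture the high/moderate confidence products discussed above is a simple concordance count across call sets. The sketch below is only an illustration under stated assumptions: the call set names, the toy genotypes, the thresholds, and the `tier_calls` and `tier_fractions` helpers are all hypothetical, and this is not the NIST integration method.

```python
from collections import Counter, defaultdict

# Hypothetical toy call sets keyed by (chrom, pos, ref, alt) -> genotype.
# Names and values are made up for illustration only.
SITE_1 = ("chr20", 1000, "A", "G")
SITE_2 = ("chr20", 2000, "C", "T")
SITE_3 = ("chr20", 3000, "G", "A")
SITE_4 = ("chr20", 4000, "T", "C")

callsets = {
    "platform_a": {SITE_1: "0/1", SITE_2: "1/1", SITE_3: "0/1"},
    "platform_b": {SITE_1: "0/1", SITE_2: "1/1", SITE_3: "1/1"},
    "platform_c": {SITE_1: "0/1", SITE_4: "0/1"},
}


def tier_calls(callsets, min_high=3, min_moderate=2):
    """Assign each site a tier from how many call sets report the same genotype.

    Thresholds are assumptions chosen for this toy example.
    """
    votes = defaultdict(Counter)
    for calls in callsets.values():
        for site, genotype in calls.items():
            votes[site][genotype] += 1

    tiers = {}
    for site, counts in votes.items():
        if len(counts) > 1:
            # Call sets disagree: flag the site for arbitration.
            tiers[site] = ("discordant", None)
            continue
        genotype, support = next(iter(counts.items()))
        if support >= min_high:
            tiers[site] = ("high", genotype)
        elif support >= min_moderate:
            tiers[site] = ("moderate", genotype)
        else:
            tiers[site] = ("low", genotype)
    return tiers


def tier_fractions(tiers):
    """Return the fraction of sites that fall in each tier."""
    counts = Counter(tier for tier, _ in tiers.values())
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()}


tiers = tier_calls(callsets)
fractions = tier_fractions(tiers)
for site, (tier, genotype) in sorted(tiers.items()):
    print(site, tier, genotype)
```

The fraction of sites flagged as discordant gives a rough, site-level stand-in for the question of how much of the genome needs arbitration; a real answer would be computed over genomic intervals, not a list of sites.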
5. GIAB FTP site @ NCBI
• Organization – following template of 1000 genomes in terms of
layout. /data for release and /technical for contributions
• Data from parents – data for NA12891 and NA12892 have been added,
but no RM will be made for them.
• Cloud – site mirrored to Amazon at s3://giab. Need to check
consistency of metadata for read permissions.
• Data upload for BAM or VCF – contact NCBI (Dr. Chulin Xiao) for
Aspera upload account. Data uploaded to dropbox will be moved
into project /technical area for community use.
• New sequence data should be submitted directly to SRA and should cite
the GIAB BioProject ID (PRJNA200694) to link the data to the
project. NCBI will copy the data into the FTP area. Submissions also
need to be added to the Google Doc tracking document.
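Because submissions are linked through the BioProject ID, the project's SRA records can in principle be listed with an NCBI E-utilities search. The sketch below only builds the query URL and does not contact NCBI; the `sra_search_url` helper and the `retmax` default are assumptions for illustration, and the `[BioProject]` field tag should be checked against the current Entrez documentation before relying on it.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# GIAB BioProject ID as given in the notes above.
GIAB_BIOPROJECT = "PRJNA200694"


def sra_search_url(bioproject, retmax=20):
    """Build an ESearch URL for SRA records linked to a BioProject.

    Hypothetical helper: it only formats the query string and makes
    no network request.
    """
    params = {
        "db": "sra",
        "term": f"{bioproject}[BioProject]",
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = sra_search_url(GIAB_BIOPROJECT)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) should return a JSON list of matching record IDs, which could then be compared against the tracking document.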
6. Data integration from RM sequence
• Treat new SRA submissions as inputs using the
same workflows. Individual RMs would need
to be reanalyzed when a significant amount of
new data is accumulated or reference
• Membership roster of active analysts and
submitters is likely to change over time.