BioHDF: Open binary file formats for large scale data management
Todd Smith (1), Christian Chilan (2), Rishi Sinha (3), Elena Pourmal (2), Mike Folk (2)
1. Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle, WA 98119
2. The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820
3. Microsoft Corporation, Redmond, WA

Overview
  Driver: Next Generation DNA Sequencing
  What is HDF5
  BioHDF Project
Laboratory and data workflow management for genetic analysis

Next Generation DNA Sequencing
  Next Gen sequencing platforms produce ~1500x more data than CE (Sanger)
  A single Next Gen instrument can produce 20 times more data in a single run than a day’s operation of a genome center with 100 CE instruments
  In Sequence quotes - July 2007
    Toby Bloom, Broad Institute: “Next-gen sequencing impacts all aspects of informatics.”
    Phil Butcher, Sanger: “The best way to move terabytes of data is still disk.” Want to process data closer to the machine.
    Eugen Clark, Harvard: “[community] needs to start talks about data retention.”
    Kelly Carpenter, Wash U: “these sequencers are going to totally screw you.”
  Nature Methods, July 2008: “Byte-ing off more than you can chew”

Three Phases of Data Production
  Primary Data Analysis - Images to bases
    Sequences + Quality values
    Run quality
  Secondary Data Analysis
    Ref Seq + Aligner
    Gene lists, Read Density, Variant list
    Sample, run quality
  Tertiary Data Analysis
    One or more Data sets
    Differential expression, Methylation sites, Gene association, Genomic structure
    Experiment, science
  Secondary Data Production
    De novo assembly => Assembler
    Contigs + Annotation

Proliferation of files, formats, formatters
  Example: MAQ - http://maq.sourceforge.net
  Secondary Analysis for:
    Tag profiling
    ChIP-Seq
    Resequencing
  Additional files and formats needed for tertiary analysis

Challenges
  Complexity
    Numerous programs, scripts, files, and formats
    Redundant data
  Computational overhead
    All data typically reside in RAM during computation
    Output and input formats differ, so data must be frequently reprocessed
  Space, time, and bandwidth efficiency
    Increased storage
    Computation times increase disproportionately
    Large data sets must be transported for processing

What Needs to Be Done
  Reduce complexity
    Decrease numbers and kinds of files
    Eliminate data duplication (performance)
    API and tools for data access
  Improve resource utilization
    Reduce redundancy, work with compressed data
    Improve program access to data: random reads and writes, map disk to computer memory
    Parallel I/O, remote access
  Facilitate data sharing, preservation
    Adopt a standard from other data intensive fields
    Benefit from history and experience
    Benefit from refinement
  Build on a proven, widely accepted platform

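The goal of working with compressed data while keeping random reads can be illustrated with a small sketch. It does not use HDF5 itself. It uses Python's standard `zlib` module as a stand-in to show the general idea of compressing data in independent chunks, so that fetching one record only requires decompressing the chunk that holds it. The names `write_chunks`, `read_record`, and `CHUNK_ROWS`, and the toy reads, are hypothetical and chosen for illustration only.

```python
import zlib

CHUNK_ROWS = 4  # hypothetical chunk size, in records per chunk


def write_chunks(records, chunk_rows=CHUNK_ROWS):
    """Compress fixed-size groups of records independently.

    Returns a list of compressed chunks; the list position acts as the index.
    """
    chunks = []
    for start in range(0, len(records), chunk_rows):
        block = "\n".join(records[start:start + chunk_rows]).encode("ascii")
        chunks.append(zlib.compress(block))
    return chunks


def read_record(chunks, i, chunk_rows=CHUNK_ROWS):
    """Fetch record i by decompressing only the chunk that contains it."""
    block = zlib.decompress(chunks[i // chunk_rows]).decode("ascii")
    return block.split("\n")[i % chunk_rows]


# Toy stand-ins for sequence reads (not real data).
reads = ["ACGT" * 8 + str(n) for n in range(10)]
store = write_chunks(reads)
print(read_record(store, 7) == reads[7])
```

The chunk size is the main design choice in a layout like this: larger chunks usually compress better, while smaller chunks reduce the amount of data decompressed per random read.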
HDF5: Single Platform / Multiple Uses
  A file format for managing any kind of data
  Software system to manage data in the format
  Designed for high volume or complex data
  Designed for every size and type of system
  Open format and software
  One library, with
    Options to adapt I/O and storage to data needs
    Layers on top and below
    Ability to interact well with other technologies
  Attention to past, present, future compatibility

HDF5 - 20 yrs in Physical Sciences
  Gain multiple “working with data efficiencies”: slice, recombine …
  Arrays, sets, organizations, compression already there
  Server and remote access
  Quick access to data via HDFView, MATLAB, other tools
  Widely used - MATLAB, Mathematica, IDL, NASA-EOS
  Significantly reduce programming efforts needed to develop and maintain formats and software to explore scientific questions in your data

HDF Software
  HDF I/O Library
  Tools, Applications, Libraries (e.g. BioHDF)
  HDF File

BioHDF SBIR Funded Project
  Phase I - Feasibility for genotyping
  Phase II - Open source technologies to support computation in Next Gen DNA sequencing applications
    Support diverse types of data from multiple sequencing technologies by extending the BioHDF data model
    Develop prototype BioHDF software applications that support common activities utilizing DNA
    Develop methods for incorporating BioHDF into enterprise applications for clinical research and diagnostics

Phase I - Pilot Project
  Combined view of HapMap, chromosome LD, PolyPhred details
  A 53,000 x 53,000 LD array
  BioHDF file structure
  53,000 row, 100+ column HapMap table
  PolyPhred data table, graphs, and chromats

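To give a feel for the scale of a 53,000 x 53,000 array, the following sketch counts its cells and estimates the size of a dense, uncompressed copy. The slides do not state how each LD value is stored, so the element widths tried here (1, 2, 4, and 8 bytes) are assumptions for illustration, and `dense_size_gib` is a hypothetical helper.

```python
rows = cols = 53_000
cells = rows * cols  # number of pairwise LD values in a full square array


def dense_size_gib(bytes_per_value):
    """Size of a dense, uncompressed array in GiB for an assumed element width."""
    return cells * bytes_per_value / 2**30


print(f"{cells:,} cells")
for width in (1, 2, 4, 8):  # assumed storage widths, not taken from the slides
    print(f"{width} byte(s) per value: {dense_size_gib(width):.1f} GiB")
```

Whatever the actual element width, an array of this size is too large to handle comfortably as a flat text file, which is the motivation for storing it in a binary format that supports compression and partial reads.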
Benefits
  Separated the model, implementation, and view of the data
  Multiple levels of data in a single view
  HapMap: convert, display, and scroll 100,000s of genotypes
  Compressed 5.2 GB LD data into 300 MB (17x)
  Quickly and randomly access subsets of data
  Made use of standard features and a data viewer (HDFView)
  Only had to build the model and data importer

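The 17x figure follows directly from the two sizes quoted above. A quick check, assuming decimal units (1 GB = 1000 MB), which the slides do not specify:

```python
raw_mb = 5.2 * 1000     # 5.2 GB expressed in MB, assuming 1 GB = 1000 MB
compressed_mb = 300.0   # compressed size quoted on the slide

ratio = raw_mb / compressed_mb
saved_fraction = 1 - compressed_mb / raw_mb

print(f"compression ratio: {ratio:.1f}x, space saved: {saved_fraction:.1%}")
```

The ratio comes out slightly above 17, consistent with the rounded value on the slide.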
Phase II
  Primary Data Analysis
    Models for storing and accessing primary data
    Implement and test models, develop compression methods
    Create research tools to access and work with the data
  Secondary Data Analysis
    Models for storing common data structures (assembly graphs, density plots, variants)
    APIs to work with programs, enable out-of-core processing
    Develop research level applications utilizing HDFView, current and emerging genome browsers

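Out-of-core processing means computing over data that stays on disk, reading only a bounded block into memory at a time. The sketch below is not the BioHDF API. It is a minimal illustration using only the Python standard library, in which read start positions stored in a flat binary file are streamed block by block to build a read-density histogram. The file layout, `BIN_SIZE`, `BLOCK_VALUES`, and the function `read_density` are all hypothetical.

```python
import array
import os
import tempfile

BIN_SIZE = 100       # hypothetical bin width, in bases
BLOCK_VALUES = 1024  # values read per block; this bounds memory use


def read_density(path, n_bins, bin_size=BIN_SIZE, block_values=BLOCK_VALUES):
    """Count read start positions per bin, reading the file block by block."""
    counts = [0] * n_bins
    itemsize = array.array("I").itemsize
    with open(path, "rb") as fh:
        while True:
            raw = fh.read(block_values * itemsize)
            if not raw:
                break
            block = array.array("I")
            block.frombytes(raw)
            for pos in block:
                # Positions beyond the last bin are clamped into it.
                counts[min(pos // bin_size, n_bins - 1)] += 1
    return counts


# Toy input: 5000 start positions written as unsigned ints in native byte order.
positions = array.array("I", [(i * 37) % 1000 for i in range(5000)])
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    positions.tofile(tmp)

density = read_density(tmp.name, n_bins=10)
os.remove(tmp.name)
print(density)
```

Only `BLOCK_VALUES` positions are held in memory at any time, so the same loop works whether the file holds thousands or billions of positions.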
Collaborations
  Planned
    Software: SRF working group (A. Siddiqui), AMOS project (M. Pop), Assembly formats (G. Marth), Consed (D. Gordon)
    Applications and data: University of Washington, University of Florida, Johns Hopkins University, Applied Biosystems
  Emerging
    Additional sequencing vendors, Microsoft Research, Intel, Institute for Systems Biology
  Seeking
    Algorithm developers
    Application developers
    Frameworks, Bio*
    Data sets

Summary
  Data challenges for Next Gen sequencing
    Manage high volumes of data
    Workflow complexity
    Computational performance
  BioHDF will be built on existing, available, and proven HDF5 technology
  Geospiza and The HDF Group are seeking collaborations
  Funding - NIH STTR 1R41HG003792-02
  Interested? Contact todd@geospiza.com

Smith T Bio Hdf Bosc2008
