Smith T Bio Hdf Bosc2008

BioHDF: Open binary file formats for large scale data management. Todd Smith (1), Christian Chilan (2), Rishi Sinha (3), Elena Pourmal (2), Mike Folk (2). 1. Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle, WA 98119. 2. The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820. 3. Microsoft Corporation, Redmond, WA.

Overview. Driver: Next Generation DNA Sequencing. What is HDF5. BioHDF Project. Laboratory and data workflow management for genetic analysis.

Next Generation DNA Sequencing. Next Gen sequencing platforms produce ~1500x more data than CE (Sanger). A single Next Gen instrument can produce 20 times more data in a single run than a day’s operation of a genome center with 100 CE instruments. In Sequence quotes, July 2007. Toby Bloom, Broad Institute: “Next-gen sequencing impacts all aspects of informatics.” Phil Butcher, Sanger: “The best way to move terabytes of data is still disk.” Want to process data closer to the machine. Eugen Clark, Harvard: “[community] needs to start talks about data retention.” Kelly Carpenter, Wash U: “these sequencers are going to totally screw you.” Nature Methods, July 2008: “Byte-ing off more than you can chew.”

Three Phases of Data Production (diagram). Primary Data Analysis: images to bases; output is sequences + quality values (run quality). Secondary Data Analysis: Ref Seq + Aligner, or de novo assembly => Assembler; output is contigs + annotation, gene lists, read density, variant list (sample, run quality). Tertiary Data Analysis: one or more data sets; differential expression, methylation sites, gene association, genomic structure (experiment, science).

Proliferation of files, formats, formatters. Example: MAQ - http://maq.sourceforge.net. Secondary analysis for: tag profiling, ChIP-Seq, resequencing. Additional files and formats needed for tertiary analysis.

Challenges. Complexity: numerous programs, scripts, files, and formats; redundant data. Computational overhead: all data typically reside in RAM during computation; output and input formats differ, so data must be frequently reprocessed. Space, time, and bandwidth efficiency: increased storage; computation times increase disproportionately; large data sets must be transported for processing.

What Needs to Be Done. Reduce complexity: decrease numbers and kinds of files; eliminate data duplication (performance); API and tools for data access. Improve resource utilization: reduce redundancy, work with compressed data; improve program access to data (random reads and writes, map disk to computer memory); parallel I/O, remote access. Facilitate data sharing, preservation. Adopt a standard from other data intensive fields: benefit from history and experience; benefit from refinement; build on a proven, widely accepted platform.
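The "random reads and writes, map disk to computer memory" item can be illustrated with a minimal sketch. It uses only the Python standard library (`mmap`, `struct`), not HDF5 or BioHDF, and the record layout (a read id plus a mean quality value) and the file name are hypothetical, chosen only for illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: read id (uint32) and mean quality (float32).
RECORD = struct.Struct("<If")


def write_records(path, records):
    """Write (read_id, mean_quality) pairs as fixed-width binary records."""
    with open(path, "wb") as fh:
        for read_id, quality in records:
            fh.write(RECORD.pack(read_id, quality))


def read_record(path, index):
    """Return record `index` by mapping the file instead of reading it all."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, index * RECORD.size)


def update_quality(path, index, quality):
    """Overwrite one record's quality in place (a random write)."""
    with open(path, "r+b") as fh:
        with mmap.mmap(fh.fileno(), 0) as mm:
            offset = index * RECORD.size
            read_id, _ = RECORD.unpack_from(mm, offset)
            RECORD.pack_into(mm, offset, read_id, quality)
            mm.flush()


# Demo on a temporary file of 10,000 synthetic records.
demo_path = os.path.join(tempfile.mkdtemp(), "reads.bin")
write_records(demo_path, [(i, float(i % 40)) for i in range(10_000)])
update_quality(demo_path, 7_500, 12.5)
print(read_record(demo_path, 7_500))
```

Because every record has the same width, the byte offset of record `index` is a simple multiplication, so a single record can be read or changed without loading the rest of the file into RAM.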

HDF5: Single Platform / Multiple Uses. A file format for managing any kind of data. Software system to manage data in the format. Designed for high volume or complex data. Designed for every size and type of system. Open format and software. One library, with: options to adapt I/O and storage to data needs; layers on top and below; ability to interact well with other technologies; attention to past, present, future compatibility.

HDF5 - 20 yrs in Physical Sciences. Gain multiple “working with data efficiencies”: slice, recombine … Arrays, sets, organizations, compression already there. Server and remote access. Quick access to data via HDFView, MATLAB, other tools. Widely used: MATLAB, Mathematica, IDL, NASA-EOS. Significantly reduce programming efforts needed to develop and maintain formats and software to explore scientific questions in your data.

HDF Software (diagram): HDF I/O Library; Tools, Applications, Libraries (e.g. BioHDF); HDF File.

BioHDF: SBIR Funded Project. Phase I: feasibility for genotyping. Phase II: open source technologies to support computation in Next Gen DNA sequencing applications. Support diverse types of data from multiple sequencing technologies by extending the BioHDF data model. Develop prototype BioHDF software applications that support common activities utilizing DNA. Develop methods for incorporating BioHDF into enterprise applications for clinical research and diagnostics.

Phase I - Pilot Project. Combined view of HapMap, chromosome LD, PolyPhred details. A 53,000 x 53,000 LD array. BioHDF file structure. 53,000 row, 100+ column HapMap table. PolyPhred data table, graphs, and chromats.

Benefits. Separated the model, implementation, and view of the data. Multiple levels of data in a single view. HapMap: convert, display, and scroll 100,000s of genotypes. Compressed 5.2 GB of LD data into 300 MB (17x). Quickly and randomly access subsets of data. Made use of standard features and a data viewer (HDFView). Only had to build the model and data importer.
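The two benefits "compressed 5.2 GB into 300 MB (17x)" and "quickly and randomly access subsets of data" can coexist when data is compressed in independent chunks, so that reading one row only requires decompressing the chunk that holds it. The sketch below illustrates that idea with Python's standard `zlib` module. It is not the HDF5 library or the BioHDF implementation; the function names, the chunk size, and the synthetic banded matrix are assumptions made for the example, and the compression achieved on this toy data says nothing about the real LD data.

```python
import zlib

ROWS_PER_CHUNK = 64  # hypothetical chunk size, chosen for illustration


def build_store(rows, rows_per_chunk=ROWS_PER_CHUNK):
    """Compress equal-length byte rows in independent chunks."""
    rows = [bytes(r) for r in rows]
    row_len = len(rows[0])
    if any(len(r) != row_len for r in rows):
        raise ValueError("all rows must have the same length")
    chunks = [
        zlib.compress(b"".join(rows[i:i + rows_per_chunk]))
        for i in range(0, len(rows), rows_per_chunk)
    ]
    return {"row_len": row_len, "rows_per_chunk": rows_per_chunk, "chunks": chunks}


def read_row(store, index):
    """Decompress only the chunk holding row `index`, then slice the row out."""
    chunk_no, within = divmod(index, store["rows_per_chunk"])
    raw = zlib.decompress(store["chunks"][chunk_no])
    start = within * store["row_len"]
    return raw[start:start + store["row_len"]]


# Synthetic stand-in for an LD matrix: high values near the diagonal, zero elsewhere.
N = 512
matrix = [bytes(max(0, 255 - 8 * abs(i - j)) for j in range(N)) for i in range(N)]

store = build_store(matrix)
raw_size = N * N
packed_size = sum(len(chunk) for chunk in store["chunks"])
print(f"raw {raw_size} bytes, compressed {packed_size} bytes")
print(read_row(store, 300) == matrix[300])
```

The trade-off is in the chunk size: larger chunks usually compress better, while smaller chunks mean less data to decompress for each random read.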

Phase II. Primary Data Analysis: models for storing and accessing primary data; implement and test models, develop compression methods; create research tools to access and work with the data. Secondary Data Analysis: models for storing common data structures (assembly graphs, density plots, variants); APIs to work with programs, enable out-of-core processing; develop research level applications utilizing HDFView, current and emerging genome browsers.
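"Out-of-core processing" means computing over a data set that is larger than memory by streaming it from disk in blocks. As a rough sketch under stated assumptions, the example below computes a read density (counts of alignment start positions per bin) from a binary file while holding only one block of positions in memory at a time. It uses only the Python standard library; the file layout, bin width, block size, and function name are hypothetical and are not part of any BioHDF API.

```python
import array
import os
import tempfile

BIN_WIDTH = 1_000      # hypothetical bin size in bases
BLOCK_ITEMS = 4_096    # number of positions held in memory at any one time


def read_density(path, genome_length, bin_width=BIN_WIDTH, block_items=BLOCK_ITEMS):
    """Count alignment start positions per bin, reading the file block by block."""
    n_bins = (genome_length + bin_width - 1) // bin_width
    counts = [0] * n_bins
    item_size = array.array("I").itemsize
    with open(path, "rb") as fh:
        while True:
            data = fh.read(block_items * item_size)
            if not data:
                break
            block = array.array("I")
            block.frombytes(data)
            for pos in block:
                counts[pos // bin_width] += 1
    return counts


# Demo: 100,000 synthetic start positions on a 50,000-base reference.
GENOME_LENGTH = 50_000
positions = array.array("I", ((i * 37) % GENOME_LENGTH for i in range(100_000)))
positions_path = os.path.join(tempfile.mkdtemp(), "positions.bin")
with open(positions_path, "wb") as fh:
    positions.tofile(fh)

density = read_density(positions_path, GENOME_LENGTH)
print(len(density), sum(density))
```

Only the per-bin counts and one block of positions are ever resident, so memory use depends on the number of bins and the block size rather than on the number of reads.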

Collaborations. Planned: software: SRF working group (A. Siddiqui), AMOS project (M. Pop), assembly formats (G. Marth), Consed (D. Gordon); applications and data: University of Washington, University of Florida, Johns Hopkins University, Applied Biosystems. Emerging: additional sequencing vendors, Microsoft Research, Intel, Institute for Systems Biology. Seeking: algorithm developers; application developers; frameworks, Bio*; data sets.

Summary. Data challenges for Next Gen sequencing: manage high volumes of data; workflow complexity; computational performance. BioHDF will be built on existing, available, and proven HDF5 technology. Geospiza and The HDF Group are seeking collaborations. Funding: NIH STTR 1R41HG003792-02. Interested? Contact todd@geospiza.com
