NGS Data Challenges
Very large quantities of data
(100s of GB)
"Drinking from the firehose"
Analysis methods vary greatly, so a flexible yet unified
data store would be useful.
July 9, 2010 2 www.hdfgroup.org
What is Needed
A Data Model
A data model which accurately describes the data and can
be expanded to contain new types of data
A Data Store
A file format or data store which is efficient in access time
and storage size and which scales well
A flexible software toolkit that can be used to create tools
and pipelines based on the data model and file format
What is BioHDF?
An open-source, community-driven project, funded by an NIH
SBIR grant and led by Geospiza, Inc. in collaboration with
The HDF Group.
BioHDF is a particular arrangement of objects in an HDF5
file (similar to a database schema)
BioHDF is a library and C API which can be used to write
applications (coming soon)
BioHDF is a set of command line tools for
storing, retrieving and manipulating data in BioHDF files
HDF = Hierarchical Data Format
An example of how data is stored in HDF5
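The figure for this slide is not reproduced here. As a rough stand-in, the sketch below models the hierarchical idea with plain Python dictionaries: groups contain other groups or datasets, and every object is addressed by a slash-separated path. It does not call the HDF5 library, and every group and dataset name in it is a hypothetical placeholder rather than the BioHDF layout.

```python
# Toy model of an HDF5-style hierarchy: groups are dicts, datasets are lists.
# This does NOT use the HDF5 library; all names below are hypothetical.
hdf_like_file = {
    "reads": {
        "sequences": ["ACGT", "GGCA"],
        "qualities": ["IIII", "HHHH"],
    },
    "alignments": {
        "chr1": {"positions": [100, 250]},
    },
}


def lookup(root, path):
    """Resolve a slash-separated path such as '/alignments/chr1/positions'."""
    node = root
    for part in path.strip("/").split("/"):
        node = node[part]  # raises KeyError if the path does not exist
    return node


def walk(root, prefix=""):
    """Yield the full path of every dataset (leaf list) in the hierarchy."""
    for name, child in root.items():
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            yield from walk(child, path)
        else:
            yield path
```

Calling `lookup(hdf_like_file, "/reads/sequences")` returns the list stored at that path, and `walk(hdf_like_file)` enumerates the dataset paths in the toy file.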
Benefits of BioHDF
• Portability and data sharing:
Platform independent, endian independent, self
describing, common data models.
• High performance:
Fast random access and efficient, scalable storage up to the petabyte level.
• Widespread adoption:
MATLAB, IDL, NASA Earth Observing System, Pacific
Biosciences, SOLiD, hundreds of products.
• 20-year history:
Robust, performance tuned, and well supported by The HDF
Group, an independent non-profit entity.
HDF in Bioinformatics
• Baylor Imaging Group
• Life Technologies
• Pacific Biosciences
• Oxford Nanopore
• GenomeData (UW)
The prototype BioHDF stores
Clusters of Aligned Reads
Indexes (NCList or simple)
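To make the role of an index over aligned reads concrete, here is a minimal sketch of a "simple" index: reads sorted by start position, with a binary search to bound the candidates for a region query. It is not an NCList and not the BioHDF implementation; the class name, method name, and the `(start, end, read_id)` record layout are assumptions made for illustration.

```python
from bisect import bisect_left


class SimpleReadIndex:
    """Hypothetical index over reads stored as half-open intervals [start, end)."""

    def __init__(self, reads):
        # reads: iterable of (start, end, read_id) tuples
        self._reads = sorted(reads)
        self._starts = [read[0] for read in self._reads]

    def overlapping(self, q_start, q_end):
        """Return ids of reads that overlap the query region [q_start, q_end)."""
        # Only reads that start before q_end can overlap the query,
        # so binary search gives an upper bound on the candidates.
        hi = bisect_left(self._starts, q_end)
        # Of those candidates, keep the reads that end after q_start.
        return [rid for start, end, rid in self._reads[:hi] if end > q_start]
```

The filter step is still linear in the number of candidates, which is the weakness that nested containment lists are meant to address when reads of very different lengths overlap one another.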
Additional user-specific data can be stored without breaking
the library or tools.
This is similar to how adding additional data tables to a
database does not invalidate the existing ones.
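The point that user-specific additions should not break existing tools can be sketched in a few lines: a reader that asks only for the objects it knows about is unaffected by anything else stored alongside them. The container here is a plain dictionary rather than an HDF5 file, and the table names are hypothetical.

```python
# Hypothetical names for the objects a tool depends on.
REQUIRED_TABLES = ("reads", "alignments")


def load_core_tables(container):
    """Return only the tables this reader knows about; extras are ignored."""
    missing = [name for name in REQUIRED_TABLES if name not in container]
    if missing:
        raise KeyError(f"missing required tables: {missing}")
    return {name: container[name] for name in REQUIRED_TABLES}


original = {"reads": ["ACGT"], "alignments": [(0, 100)]}

# A user adds their own table next to the standard ones.
extended = dict(original, my_lab_annotations=["flagged for review"])
```

Both `load_core_tables(original)` and `load_core_tables(extended)` return the same core tables, so a tool written against the original layout keeps working after the extra table is added.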
A "pipeline prototype" set of tools to demonstrate the
suitability of HDF5 for NGS data storage.
A version 1.0 release of a BioHDF library and C API targeting
the functionality of samtools.
A higher-level C API that abstracts out and hides the
underlying storage technology.
HDF5 API and Applications
BioHDF Applications and
Wrappers (e.g. Perl, Python)
A Higher-Level API
A high-level API will encapsulate and hide the underlying
storage technology.
[Figure labels: C APIs, samtools, API, high-level tool]
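The encapsulation idea on this slide can be illustrated with a small sketch in which a tool is written against an abstract interface and never touches the storage layer directly. This is not the BioHDF API: the class names, the `reads_in` method, and the in-memory backend are all assumptions chosen for illustration, and a real backend would wrap the storage-specific C API instead.

```python
from abc import ABC, abstractmethod


class AlignmentStore(ABC):
    """Hypothetical high-level interface; tools depend only on this."""

    @abstractmethod
    def reads_in(self, reference, start, end):
        """Return ids of reads overlapping [start, end) on the given reference."""


class InMemoryStore(AlignmentStore):
    """Stand-in backend; a real one would call into the storage library."""

    def __init__(self, records):
        # records: iterable of (reference, start, end, read_id) tuples
        self._records = list(records)

    def reads_in(self, reference, start, end):
        return [
            rid
            for ref, r_start, r_end, rid in self._records
            if ref == reference and r_start < end and r_end > start
        ]


def count_reads(store, reference, start, end):
    """A 'high-level tool' that works with any AlignmentStore backend."""
    return len(store.reads_in(reference, start, end))
```

Because `count_reads` sees only the abstract interface, swapping the backend for one based on a different file format would not require changing the tool.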
The HDF Group
BioHDF is supported by NIH SBIR Phase II grant HG003792
awarded to Geospiza, Inc.
The HDF Group
Thank you for your time!
If you are interested in using or contributing to
BioHDF, please contact us!
Dana Robinson (firstname.lastname@example.org)
BOSC BoF: Friday 5:10-6:00
ISMB Poster J18: Monday, July 12: 12:40-2:30
ISMB BoF: Tuesday, July 13 1-2 pm, room 306