Robinson bosc2010 bio_hdf



Published in: Technology, Education
  • My goal here is to show people how data is stored in HDF5 (groups, datasets, attributes), not to speak about NGS data storage in BioHDF. I get the impression that people have little understanding of what HDF5 is so I'd like to give them a bare-bones overview.
  • People will be discouraged from using the HDF5 API directly because doing so would encourage them to meddle with low-level data elements that can change. This would make their software more brittle.
  • A first implementation of this will probably be at the linker level (e.g. samtools-biohdf and samtools-bam). Further down the road, we might implement a plugin architecture to handle this.
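The notes above describe HDF5 storage in terms of groups, datasets, and attributes. The following is a minimal sketch of that object model in plain Python, not the HDF5 or BioHDF API. The `Group` and `Dataset` classes, the `sequences` and `records` dataset names, the sample values, and the placement of `is_sorted` on the `Alignments` group are all assumptions made for illustration; only the names `Reads`, `Alignments`, `References`, and `is_sorted` come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """Stand-in for an HDF5 dataset: stored values plus attached attributes."""
    data: List[object]
    attrs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    """Stand-in for an HDF5 group: a container of named groups and datasets."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, object] = field(default_factory=dict)

    def require_group(self, name: str) -> "Group":
        """Return the child group called `name`, creating it if it is missing."""
        child = self.children.setdefault(name, Group())
        if not isinstance(child, Group):
            raise TypeError(f"{name!r} exists but is not a group")
        return child

    def lookup(self, path: str) -> Union["Group", Dataset]:
        """Resolve a slash-separated path such as '/Reads/sequences'."""
        node: Union[Group, Dataset] = self
        for part in path.strip("/").split("/"):
            if not part:
                continue  # an empty path ('/') refers to this group itself
            if not isinstance(node, Group):
                raise KeyError(f"cannot descend into a dataset at {part!r}")
            node = node.children[part]
        return node


# Hypothetical layout loosely following the slide's somefile.h5 example.
root = Group()
reads = root.require_group("Reads")
reads.children["sequences"] = Dataset(data=["ACGT", "TTGA"])      # assumed name
alignments = root.require_group("Alignments")
alignments.attrs["is_sorted"] = True                              # assumed placement
alignments.children["records"] = Dataset(data=[(0, 100), (1, 250)])  # assumed name
root.children["References"] = Dataset(data=["chr1"])

print(sorted(root.children))                       # names directly under '/'
print(root.lookup("/Alignments").attrs["is_sorted"])
```

In a real HDF5 file the same three ideas apply: groups nest like directories, datasets hold the arrays, and attributes carry small pieces of metadata on either.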

    1. BioHDF: Open Binary File Formats for Next-Generation Sequencing Data. Current Status and Future Directions. Dana Robinson, The HDF Group. Copyright © 2010 The HDF Group. All Rights Reserved. July 9, 2010
    2. NGS Data Challenges. Very large quantities of data (100s of GB): "drinking from the firehose." Analysis methods vary greatly, so a flexible yet unified data store would be useful.
    3. What is Needed. A Data Model: a data model that accurately describes the data and can be expanded to contain new types of data. A Data Store: a file format or data store that is efficient in access time and storage size and that scales well. A Toolkit: a flexible software toolkit that can be used to create tools and pipelines based on the data model and file format.
    4. What is BioHDF? An open-source, community-driven project, funded by an NIH SBIR grant and led by Geospiza, Inc. in collaboration with The HDF Group. BioHDF is a particular arrangement of objects in an HDF5 file (similar to a database schema). BioHDF is a library and C API that can be used to write applications (coming soon). BioHDF is a set of command-line tools for storing, retrieving, and manipulating data in BioHDF files.
    5. HDF = Hierarchical Data Format. An example of how data is stored in HDF5. [Diagram of the file somefile.h5. Object names shown: /, Reads/, Alignments/, References, is_sorted. Callout labels: groups, datasets, attributes.]
    6. Benefits of BioHDF • Portability and data sharing: Platform independent, endian independent, self-describing, common data models. • High performance: Fast random access and efficient, scalable, petabyte-level compressed storage. • Widespread adoption: MATLAB, IDL, NASA Earth Observing System, Pacific Biosciences, SOLiD, 100s of products. • 20-year history: Robust, performance tuned, and well supported by The HDF Group, an independent non-profit entity.
    7. HDF in Bioinformatics • Baylor Imaging Group • Life Technologies • Pacific Biosciences • Oxford Nanopore • GenomeData (UW) • Geospiza • Others
    8. Data Stored. The prototype BioHDF stores: Reads, Alignments, Annotations, Clusters of Aligned Reads, Reference Sequences, Indexes (NCList or simple).
    9. Data Stored. Additional user-specific data can be stored without breaking the library or tools. This is similar to how adding tables to a database schema does not invalidate existing queries. [Diagram labels: BioHDF Data, User-Specific Data.]
    10. Project Stages. A "pipeline prototype" set of tools to demonstrate the suitability of HDF5 for NGS data storage. A version 1.0 release of a BioHDF library and C API targeting the functionality of samtools. A higher-level C API that abstracts out and hides the underlying storage technology.
    11. HDF5 API and Applications. [Layer diagram. Labels shown: BioHDF Applications and Wrappers (e.g. Perl, Python); High-Level API; BioHDF API; HDF5 API; Physical Storage.]
    12. A Higher-Level API. A high-level API will encapsulate and hide the underlying storage technology. [Diagram labels: low-level C APIs, samtools, BioHDF API, high-level tool, C API, BAM, wrapper API.]
    13. Acknowledgements. Geospiza: Todd Smith, Mark Welsh. The HDF Group: Mike Folk. BioHDF is supported by NIH SBIR Phase II grant HG003792 awarded to Geospiza, Inc.
    14. The HDF Group. Thank you for your time! If you are interested in using or contributing to BioHDF, please contact us! Dana Robinson. BOSC BoF: Friday, 5:10-6:00. ISMB Poster J18: Monday, July 12, 12:40-2:30. ISMB BoF: Tuesday, July 13, 1-2 pm, room 306.
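The higher-level API described in the slides is meant to hide the underlying storage technology from tools. The following is a rough sketch of that idea in Python rather than the planned C API. Every name in it (`AlignmentStore`, `GroupedStore`, `FlatStore`, `count_in_region`, and the `(read name, position)` record shape) is hypothetical, and the two in-memory backends only stand in for real storage such as BioHDF or BAM files.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Alignment = Tuple[str, int]  # hypothetical record: (read name, start position)


class AlignmentStore(ABC):
    """Hypothetical high-level interface; tools depend only on this."""

    @abstractmethod
    def add(self, reference: str, alignment: Alignment) -> None:
        """Store one alignment against the named reference sequence."""

    @abstractmethod
    def fetch(self, reference: str, start: int, end: int) -> List[Alignment]:
        """Return alignments with start <= position < end, sorted by position."""


class GroupedStore(AlignmentStore):
    """Backend that keeps one list per reference (a hierarchical layout)."""

    def __init__(self) -> None:
        self._by_reference: Dict[str, List[Alignment]] = {}

    def add(self, reference: str, alignment: Alignment) -> None:
        self._by_reference.setdefault(reference, []).append(alignment)

    def fetch(self, reference: str, start: int, end: int) -> List[Alignment]:
        hits = [a for a in self._by_reference.get(reference, [])
                if start <= a[1] < end]
        return sorted(hits, key=lambda a: a[1])


class FlatStore(AlignmentStore):
    """Backend that keeps every record in one flat list and scans it."""

    def __init__(self) -> None:
        self._records: List[Tuple[str, Alignment]] = []

    def add(self, reference: str, alignment: Alignment) -> None:
        self._records.append((reference, alignment))

    def fetch(self, reference: str, start: int, end: int) -> List[Alignment]:
        hits = [a for ref, a in self._records
                if ref == reference and start <= a[1] < end]
        return sorted(hits, key=lambda a: a[1])


def count_in_region(store: AlignmentStore, reference: str,
                    start: int, end: int) -> int:
    """A 'tool' written against the interface, unaware of the backend."""
    return len(store.fetch(reference, start, end))


for backend in (GroupedStore(), FlatStore()):
    backend.add("chr1", ("r1", 10))
    backend.add("chr1", ("r2", 500))
    backend.add("chr2", ("r3", 20))
    print(type(backend).__name__, count_in_region(backend, "chr1", 0, 100))
```

Because `count_in_region` touches only the abstract interface, the backend can be swapped without changing the tool, which is the property the slides attribute to the higher-level API.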