Robinson bosc2010 bio_hdf
Published in: Technology, Education
  • My goal here is to show people how data is stored in HDF5 (groups, datasets, attributes), not to speak about NGS data storage in BioHDF. I get the impression that people have little understanding of what HDF5 is so I'd like to give them a bare-bones overview.
  • The reason people will be discouraged from using the HDF5 API directly is that doing so would encourage them to meddle with low-level data elements that can change. This would make their software more brittle.
  • A first implementation of this will probably be at the linker level (e.g. samtools-biohdf and samtools-bam). Further down the road, we might implement a plugin architecture to handle this.
  • Transcript

    • 1. The HDF Group. BioHDF: Open Binary File Formats for Next-Generation Sequencing Data. Current Status and Future Directions. Dana Robinson, The HDF Group, derobins@hdfgroup.org. Copyright © 2010 The HDF Group. All Rights Reserved. July 9, 2010. www.hdfgroup.org
    • 2. NGS Data Challenges. Very large quantities of data (100s of GB): "drinking from the firehose". Analysis methods vary greatly, so a flexible yet unified data store would be useful.
    • 3. What is Needed. A Data Model: a data model which accurately describes the data and can be expanded to contain new types of data. A Data Store: a file format or data store which is efficient in access time and storage size and which scales well. A Toolkit: a flexible software toolkit that can be used to create tools and pipelines based on the data model and file format.
    • 4. What is BioHDF? An open-source, community-driven project, funded by an NIH SBIR grant and led by Geospiza, Inc. in collaboration with The HDF Group. BioHDF is a particular arrangement of objects in an HDF5 file (similar to a database schema). BioHDF is a library and C API which can be used to write applications (coming soon). BioHDF is a set of command line tools for storing, retrieving and manipulating data in BioHDF files.
    • 5. HDF = Hierarchical Data Format. An example of how data is stored in HDF5. [Diagram of somefile.h5 containing /, Reads/, Alignments/, References and is_sorted, labeled as groups, datasets and attributes.]
    • 6. Benefits of BioHDF • Portability and data sharing: Platform independent, endian independent, self describing, common data models. • High performance: Fast random access and efficient, scalable, petabyte level compressed storage. • Widespread adoption: MATLAB, IDL, NASA-Earth Observing System, Pacific Biosciences, SOLiD, 100s of products. • 20 year history: Robust, performance tuned, and well supported by The HDF Group, an independent non-profit entity.
    • 7. HDF in Bioinformatics • Baylor Imaging Group • Life Technologies • Pacific Biosciences • Oxford Nanopore • GenomeData (UW) • Geospiza • Others
    • 8. Data Stored. The prototype BioHDF stores: Reads, Alignments, Annotations, Clusters of Aligned Reads, Reference Sequences, Indexes (NCList or simple).
    • 9. Data Stored. Additional user-specific data can be stored without breaking the library or tools, similar to how adding additional tables to a database schema does not invalidate existing queries. [Diagram: BioHDF Data alongside User-Specific Data.]
    • 10. Project Stages. A "pipeline prototype" set of tools to demonstrate the suitability of HDF5 for NGS data storage. A version 1.0 release of a BioHDF library and C API targeting the functionality of samtools. A higher-level C API that abstracts out and hides the underlying storage technology.
    • 11. HDF5 API and Applications. [Layer diagram: BioHDF Applications and Wrappers (e.g. Perl, Python); High-Level API; BioHDF API; HDF5 API; Physical Storage.]
    • 12. A Higher-Level API. A high-level API will encapsulate and hide the underlying storage technology. [Diagram labels: low-level C APIs; samtools; BioHDF API; high-level C API; tool; BAM wrapper API.]
    • 13. Acknowledgements. Geospiza: Todd Smith, Mark Welsh. The HDF Group: Mike Folk. BioHDF is supported by NIH SBIR Phase II grant HG003792 awarded to Geospiza, Inc.
    • 14. The HDF Group. Thank you for your time! If you are interested in using or contributing to BioHDF, please contact us! Dana Robinson (derobins@hdfgroup.org), http://www.biohdf.org. BOSC BoF: Friday 5:10-6:00. ISMB Poster J18: Monday, July 12, 12:40-2:30. ISMB BoF: Tuesday, July 13, 1-2 pm, room 306.
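
Slide 5 names three kinds of HDF5 objects: groups, datasets and attributes. The sketch below is a toy in-memory model of that hierarchy in plain Python, meant only to illustrate how the three relate. It does not use the HDF5 library or the BioHDF API, and the class names, the `positions` dataset, the sample values and the choice to attach `is_sorted` to the `Alignments` group are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


@dataclass
class Dataset:
    """A named array of values plus optional attributes (metadata)."""
    data: List[Any]
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Group:
    """A container of named groups and datasets, much like a directory."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, Any] = field(default_factory=dict)

    def create_group(self, name: str) -> "Group":
        group = Group()
        self.children[name] = group
        return group

    def create_dataset(self, name: str, data: List[Any]) -> Dataset:
        dataset = Dataset(list(data))
        self.children[name] = dataset
        return dataset

    def lookup(self, path: str) -> Union["Group", Dataset]:
        """Resolve a slash-separated path such as '/Alignments/positions'."""
        node: Union[Group, Dataset] = self
        for part in path.strip("/").split("/"):
            if part:
                if not isinstance(node, Group):
                    raise KeyError(path)
                node = node.children[part]
        return node


# Root group "/" of a hypothetical somefile.h5.
root = Group()
reads = root.create_group("Reads")
alignments = root.create_group("Alignments")
root.create_dataset("References", ["chr1", "chr2"])  # illustrative values

# Assumption: is_sorted is an attribute attached to the Alignments group.
alignments.attrs["is_sorted"] = True
alignments.create_dataset("positions", [101, 250, 977])  # hypothetical dataset

print(root.lookup("/Alignments/positions").data)
print(root.lookup("/Alignments").attrs["is_sorted"])
```

Groups nest like directories, datasets hold the bulk data, and attributes carry small pieces of metadata on either, which is why a file laid out this way can be described as self-describing.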

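
Slides 11 and 12 and the speaker notes describe a high-level API that hides the storage technology, with the backend possibly selected through a plugin architecture later on. A minimal sketch of that idea in Python follows. Every name in it (`AlignmentStore`, `register`, the two stand-in stores) is hypothetical, the stand-ins keep data in memory rather than reading or writing HDF5 or BAM, and the real BioHDF design is a C API that may look quite different.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple, Type


class AlignmentStore(ABC):
    """Hypothetical high-level interface that tools would program against."""

    @abstractmethod
    def add_alignment(self, read_id: str, position: int) -> None:
        """Record that read_id aligns at position."""

    @abstractmethod
    def positions(self) -> List[int]:
        """Return all alignment positions in insertion order."""


# Registry mapping a backend name to its implementation (the plugin idea).
BACKENDS: Dict[str, Type[AlignmentStore]] = {}


def register(name: str) -> Callable[[Type[AlignmentStore]], Type[AlignmentStore]]:
    def decorator(cls: Type[AlignmentStore]) -> Type[AlignmentStore]:
        BACKENDS[name] = cls
        return cls
    return decorator


@register("biohdf")
class FakeBioHDFStore(AlignmentStore):
    """Stand-in only: column-style storage in memory, no HDF5 involved."""

    def __init__(self) -> None:
        self._ids: List[str] = []
        self._pos: List[int] = []

    def add_alignment(self, read_id: str, position: int) -> None:
        self._ids.append(read_id)
        self._pos.append(position)

    def positions(self) -> List[int]:
        return list(self._pos)


@register("bam")
class FakeBAMStore(AlignmentStore):
    """Stand-in only: record-style storage in memory, no BAM involved."""

    def __init__(self) -> None:
        self._records: List[Tuple[str, int]] = []

    def add_alignment(self, read_id: str, position: int) -> None:
        self._records.append((read_id, position))

    def positions(self) -> List[int]:
        return [pos for _, pos in self._records]


def count_alignments_before(store: AlignmentStore, cutoff: int) -> int:
    """A 'tool' that sees only the high-level interface, never the backend."""
    return sum(1 for pos in store.positions() if pos < cutoff)


results = {}
for name in sorted(BACKENDS):
    store = BACKENDS[name]()
    for read_id, pos in [("r1", 10), ("r2", 250), ("r3", 40)]:
        store.add_alignment(read_id, pos)
    results[name] = count_alignments_before(store, 100)

print(results)
```

Because the tool function touches only the abstract interface, the internal layout of either backend can change without breaking it, which is the brittleness concern raised in the notes.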