Robinson bosc2010 bio_hdf
Published in Technology, Education
  • My goal here is to show people how data is stored in HDF5 (groups, datasets, attributes), not to speak about NGS data storage in BioHDF. I get the impression that people have little understanding of what HDF5 is, so I'd like to give them a bare-bones overview.
  • People will be discouraged from using the HDF5 API directly because doing so would encourage them to meddle with low-level data elements that can change. This would make their software more brittle.
  • A first implementation of this will probably be at the linker level (e.g. samtools-biohdf and samtools-bam). Further down the road, we might implement a plugin architecture to handle this.

Transcript

  • 1. The HDF Group. BioHDF: Open Binary File Formats for Next-Generation Sequencing Data. Current Status and Future Directions. Dana Robinson, The HDF Group, derobins@hdfgroup.org. Copyright © 2010 The HDF Group. All Rights Reserved. July 9, 2010. www.hdfgroup.org
  • 2. NGS Data Challenges. Very large quantities of data (100s of GB): "drinking from the firehose." Analysis methods vary greatly, so a flexible yet unified data store would be useful.
  • 3. What is Needed. A Data Model: a data model which accurately describes the data and can be expanded to contain new types of data. A Data Store: a file format or data store which is efficient in access time and storage size and which scales well. A Toolkit: a flexible software toolkit that can be used to create tools and pipelines based on the data model and file format.
  • 4. What is BioHDF? An open-source, community-driven project, funded by an NIH SBIR grant and led by Geospiza, Inc. in collaboration with The HDF Group. BioHDF is a particular arrangement of objects in an HDF5 file (similar to a database schema). BioHDF is a library and C API which can be used to write applications (coming soon). BioHDF is a set of command line tools for storing, retrieving and manipulating data in BioHDF files.
  • 5. HDF = Hierarchical Data Format. An example of how data is stored in HDF5. [Diagram of somefile.h5 showing the objects /, Reads/, Alignments/, References and is_sorted, with callouts labeling groups, datasets and attributes.]
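
The group/dataset/attribute hierarchy on slide 5 can be pictured with a small in-memory model. This is only a sketch in plain Python, not the HDF5 or BioHDF API. The classes `Group` and `Dataset`, the `lookup` helper, the `positions` dataset and all stored values are hypothetical, and treating `Reads/` and `Alignments/` as groups, `References` as a dataset and `is_sorted` as an attribute of `Alignments/` is an assumption read off the slide's labels.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """Toy stand-in for a dataset: an array of values plus attributes."""
    data: List
    attrs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    """Toy stand-in for a group: a container of named groups and datasets."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, object] = field(default_factory=dict)

    def lookup(self, path: str):
        """Resolve a slash-separated path such as '/Alignments/positions'."""
        node = self
        for part in path.strip("/").split("/"):
            if part:  # an empty part means the path was just '/'
                node = node.children[part]
        return node


# Assumed layout, loosely following the slide: two groups, one dataset,
# and one attribute attached to the Alignments group.
root = Group()
root.children["Reads"] = Group()
root.children["Alignments"] = Group(attrs={"is_sorted": True})
root.children["References"] = Dataset(data=["chr1", "chr2"])
# 'positions' is a made-up dataset name used only for illustration.
root.children["Alignments"].children["positions"] = Dataset(data=[100, 250, 900])

print(root.lookup("/Alignments").attrs["is_sorted"])
print(root.lookup("/Alignments/positions").data)
```

Groups behave like directories, datasets like files holding arrays, and attributes like small pieces of metadata attached to either one.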
  • 6. Benefits of BioHDF. Portability and data sharing: platform independent, endian independent, self describing, common data models. High performance: fast random access and efficient, scalable, petabyte-level compressed storage. Widespread adoption: MATLAB, IDL, NASA Earth Observing System, Pacific Biosciences, SOLiD, hundreds of products. 20-year history: robust, performance tuned, and well supported by The HDF Group, an independent non-profit entity.
  • 7. HDF in Bioinformatics. Baylor Imaging Group, Life Technologies, Pacific Biosciences, Oxford Nanopore, GenomeData (UW), Geospiza, and others.
  • 8. Data Stored. The prototype BioHDF stores reads, alignments, annotations, clusters of aligned reads, reference sequences, and indexes (NCList or simple).
  • 9. Data Stored. Additional user-specific data can be stored without breaking the library or tools, similar to how adding additional tables to a database schema does not invalidate existing queries. [Diagram: BioHDF Data alongside User-Specific Data.]
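
The database analogy on slide 9 can be tried directly with Python's built-in `sqlite3` module. This is a minimal sketch of the analogy only; it does not touch BioHDF or HDF5, and the table names `reads` and `user_notes`, the helper `count_reads`, and the sample rows are made up for illustration.

```python
import sqlite3

# An in-memory database stands in for the shared schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (id INTEGER PRIMARY KEY, sequence TEXT)")
conn.executemany(
    "INSERT INTO reads (sequence) VALUES (?)",
    [("ACGT",), ("GGCA",)],
)


def count_reads(connection):
    """An 'existing query' that only knows about the reads table."""
    return connection.execute("SELECT COUNT(*) FROM reads").fetchone()[0]


before = count_reads(conn)

# A user adds their own table; nothing the existing query relies on changes.
conn.execute("CREATE TABLE user_notes (read_id INTEGER, note TEXT)")
conn.execute("INSERT INTO user_notes VALUES (1, 'low quality')")

after = count_reads(conn)
print(before, after)
```

The existing query returns the same count before and after the new table is added, which is the property the slide claims for user-specific data stored next to the BioHDF objects.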
  • 10. Project Stages. A "pipeline prototype" set of tools to demonstrate the suitability of HDF5 for NGS data storage. A version 1.0 release of a BioHDF library and C API targeting the functionality of samtools. A higher-level C API that abstracts out and hides the underlying storage technology.
  • 11. HDF5 API and Applications. [Layer diagram: BioHDF Applications and Wrappers (e.g. Perl, Python) / High-Level API / BioHDF API / HDF5 API / Physical Storage.]
  • 12. A Higher-Level API. A high-level API will encapsulate and hide the underlying storage technology. [Diagram labels: tool, wrapper, high-level C API, low-level C APIs (samtools BAM API, BioHDF API).]
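
The idea on slide 12, a high-level API that hides which storage technology sits underneath, can be sketched with an abstract interface and two interchangeable backends. Everything here is hypothetical: `AlignmentStore`, `ListBackend`, `DictBackend` and `count_alignments` are illustrative names, the backends are plain Python containers rather than BAM or HDF5 files, and the planned API is in C rather than Python.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# Hypothetical record shape: (reference name, position).
Alignment = Tuple[str, int]


class AlignmentStore(ABC):
    """High-level interface; tools are written against this only."""

    @abstractmethod
    def add(self, reference: str, position: int) -> None: ...

    @abstractmethod
    def fetch(self, reference: str) -> List[Alignment]: ...


class ListBackend(AlignmentStore):
    """Stand-in for one storage technology: a flat list scanned on fetch."""

    def __init__(self) -> None:
        self._rows: List[Alignment] = []

    def add(self, reference: str, position: int) -> None:
        self._rows.append((reference, position))

    def fetch(self, reference: str) -> List[Alignment]:
        return sorted(row for row in self._rows if row[0] == reference)


class DictBackend(AlignmentStore):
    """Stand-in for another: positions bucketed by reference name."""

    def __init__(self) -> None:
        self._buckets: Dict[str, List[int]] = {}

    def add(self, reference: str, position: int) -> None:
        self._buckets.setdefault(reference, []).append(position)

    def fetch(self, reference: str) -> List[Alignment]:
        return [(reference, p) for p in sorted(self._buckets.get(reference, []))]


def count_alignments(store: AlignmentStore, reference: str) -> int:
    """A 'tool' that never learns which backend it is talking to."""
    return len(store.fetch(reference))


stores = [ListBackend(), DictBackend()]
for store in stores:
    store.add("chr1", 900)
    store.add("chr2", 40)
    store.add("chr1", 100)

counts = [count_alignments(store, "chr1") for store in stores]
print(counts)
```

Because the tool depends only on the interface, swapping one backend for the other requires no change to the tool, and the low-level layout is free to change without breaking it.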
  • 13. Acknowledgements. Geospiza: Todd Smith, Mark Welsh. The HDF Group: Mike Folk. BioHDF is supported by NIH SBIR Phase II grant HG003792 awarded to Geospiza, Inc.
  • 14. The HDF Group. Thank you for your time! If you are interested in using or contributing to BioHDF, please contact us! Dana Robinson (derobins@hdfgroup.org), http://www.biohdf.org. BOSC BoF: Friday, 5:10-6:00. ISMB Poster J18: Monday, July 12, 12:40-2:30. ISMB BoF: Tuesday, July 13, 1-2 pm, room 306.