D Robinson - Using HDF5 to work with large quantities of rich biological data


Presentation at BOSC2012 by D Robinson - Using HDF5 to work with large quantities of rich biological data

Published in: Technology, Education

  • Add Sony Pictures
  • The second statement is what we mean by "rich"
  • High-level view, point out that the file format is NOT "HDF5" (mention VOL). Gerd is a little unhappy with "structured", but it should be ok for this audience.
  • HDF5 has the characteristics of other formats that are out there. It's hard to store metadata in a binary flat file and it is not scalable.
  • Gerd points out that a library is properly a part of the self-describing representation
  • High performance can have many meanings
  • Again, note that links are named, not objects
  • Much more low-level than, say, an RDBMS, though the ease of use of a database can come at a performance cost. "Easy" access via Python, Gerd's PowerShell snap-in, etc. Can write your own data access API to create queries, etc.
  • Need to reword this! "These are called dataspaces" = bad.
  • Add resource links to this slide
  • Why should you listen to my talk?
  • Note that links are named, not objects! Gerd thinks of names as NAVIGATORS.
  • Wide variety of integer and floating point types, enum types, etc. Need to point out that variable-length strings have compression issues (fixable, with $$$).
  • Might mention sparsity for chunks here. Mike suggests not mentioning chunks, so perhaps that could be replaced with a note about sparse data.

    1. Using HDF5 To Work With Large Quantities of Rich Biological Data
       Dana Robinson (derobins@hdfgroup.org), The HDF Group
       July 13, 2012, BOSC 2012
    2. Today's Goal
       Today's goal is that you walk away from this talk with a basic understanding of the HDF5 technology stack.
    3. Where is HDF5 used?
    4. What is HDF5?
       HDF5 is a highly scalable way to organize and store heterogeneous, multidimensional data of user-defined types.
       HDF5 also allows data relationships and context to be stored using annotation and linking.
    5. HDF5
       The HDF5 technology suite includes:
       • A structured binary file format
       • An abstract data model for describing your data
       • A data access library, written in C (with bindings for C++, Fortran 95/2003, and Java)
    6. HDF5 has characteristics of …
       • Directories and Files: hierarchical; collections of related information
       • PDF: standard exchange format; heterogeneous information
       • Databases: subsetting; random access
       • XML: self-describing; extensible types; rich metadata
       • Binary Flat File: high-performance
    7. Advantages of HDF5
       • Platform and architecture-independent
       • Scalable in space and time
         • File size only limited by OS and filesystem
         • Data access time (esp. parallel) scales well
       • Flexible (user-defined types and organization)
       • Files are self-describing
    8. Advantages of HDF5 (2)
       • High-performance
       • Parallel I/O via MPI-IO
       • Supports compression and other filters
       • Open source (BSD license)
       • THG committed to provide long-term support
    9. HDF5 Data Objects
       • Groups
       • Datasets
       • Datatypes
       • Metadata (Attributes)
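The four object kinds on this slide can be sketched with h5py, one of the Python packages named later in the deck. This is a minimal illustration, not taken from the talk: the file name, group and dataset names, the compound type, and all values are made up, and the file lives only in memory (core driver).

```python
import numpy as np
import h5py

# In-memory file: the "core" driver with backing_store=False writes nothing to disk.
demo_file = h5py.File("objects_demo.h5", "w", driver="core", backing_store=False)

# Group: a container for other objects, similar to a directory.
run = demo_file.create_group("run_001")

# Datatype: a user-defined compound type, described here with a NumPy dtype.
peak_t = np.dtype([("mz", "f8"), ("intensity", "f4"), ("charge", "i1")])

# Dataset: an array whose elements all share one datatype (values are invented).
peak_values = np.array(
    [(445.12, 1.0e4, 2), (512.30, 3.5e3, 1), (733.41, 8.1e2, 3)], dtype=peak_t
)
peaks = run.create_dataset("peaks", data=peak_values)

# Attribute: a small piece of metadata attached to an object.
peaks.attrs["units_mz"] = "Th"

# Link: a second name for the same dataset. The link carries the name,
# so one object can be reached through several paths.
demo_file["latest_peaks"] = peaks
```

Reading `demo_file["latest_peaks"]` and `demo_file["run_001/peaks"]` reaches the same stored object, which is one way to see the "links are named, not objects" point from the speaker notes.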
    10. Example: LCMS Data
        • sample name
        • chromatography parameters
        • ms parameters
        • ms/ms parameters
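One possible way to lay out the LCMS items from this slide is a group per sample, with parameters stored as attributes on subgroups. The slide does not show the actual layout, so everything below is an assumption: the function name `write_lcms_sample`, the group names, the attribute names, and the placeholder values are all hypothetical.

```python
import numpy as np
import h5py

def write_lcms_sample(h5file, sample_name):
    """Create a hypothetical group layout for one LCMS sample."""
    sample = h5file.create_group(sample_name)
    sample.attrs["sample_name"] = sample_name

    # Chromatography parameters as attributes (placeholder values).
    chrom = sample.create_group("chromatography")
    chrom.attrs["column"] = "C18"
    chrom.attrs["flow_rate_ul_per_min"] = 300.0

    # MS parameters plus one placeholder spectrum of (m/z, intensity) pairs.
    ms = sample.create_group("ms")
    ms.attrs["polarity"] = "positive"
    ms.create_dataset("spectrum_0001", data=np.zeros((100, 2), dtype="f4"))

    # "/" separates path components in HDF5, so "ms/ms" cannot be a single
    # link name; "msms" is used here instead.
    msms = sample.create_group("msms")
    msms.attrs["collision_energy_ev"] = 35.0
    return sample

lcms = h5py.File("lcms_demo.h5", "w", driver="core", backing_store=False)
sample = write_lcms_sample(lcms, "sample_A")
```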
    11. HDF5 Data Access
        Unlike many data storage systems, HDF5 has no built-in query engine or indexes.
        You will have to write your own data access code, usually using the HDF5 API.
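As a rough idea of what "your own data access code" can look like, the sketch below walks every object in a file and keeps the datasets whose attribute matches a value. It uses h5py rather than the C API; the helper `find_datasets`, the file contents, and the `polarity` attribute are invented for illustration. Note that this is a full scan, since there is no index to consult.

```python
import numpy as np
import h5py

def find_datasets(h5file, attr_name, attr_value):
    """Return the paths of datasets whose attribute attr_name equals attr_value."""
    matches = []

    def visitor(name, obj):
        # visititems calls this once per object; returning None continues the walk.
        if isinstance(obj, h5py.Dataset) and obj.attrs.get(attr_name) == attr_value:
            matches.append(name)

    h5file.visititems(visitor)
    return matches

# Build a small in-memory file to query (contents are made up).
query_file = h5py.File("query_demo.h5", "w", driver="core", backing_store=False)
for i, polarity in enumerate(["positive", "negative", "positive"]):
    scan = query_file.create_dataset(f"scans/scan_{i}", data=np.arange(4))
    scan.attrs["polarity"] = polarity

positive_scans = sorted(find_datasets(query_file, "polarity", "positive"))
```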
    12. Dataspaces
        HDF5 has a rich set of data subsetting functionality.
        Example: displaying a thumbnail of a high-resolution image.
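The thumbnail example can be approximated in h5py with a strided selection, which reads only every Nth element instead of the whole array. This is a sketch under assumptions: the image is synthetic, the dataset name and the step of 8 are arbitrary, and plain subsampling stands in for whatever thumbnailing the talk had in mind.

```python
import numpy as np
import h5py

# A synthetic 1024 x 1024 "image" stored in an in-memory file.
image = np.arange(1024 * 1024, dtype="u4").reshape(1024, 1024)
image_file = h5py.File("image_demo.h5", "w", driver="core", backing_store=False)
dset = image_file.create_dataset("image", data=image)

# Strided selection: every 8th row and column, giving a 128 x 128 result.
thumbnail = dset[::8, ::8]

# Contiguous selection: a 100 x 100 region of interest.
region = dset[100:200, 300:400]
```

Both reads return ordinary NumPy arrays containing only the selected elements.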
    13. Filters and Compression
        HDF5 supports data filters, including compression, which transform data as it enters or leaves the file.
        Diagram: compressed data in the file <-> compression filter <-> uncompressed data in user's buffer
        Note that HDF5 data objects are filtered individually, not the entire file!
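Per-object filtering can be illustrated by storing the same array twice in one file, once unfiltered and once with gzip compression. This is a minimal h5py sketch; the dataset names, chunk shape, compression level, and the highly repetitive test data are all assumptions chosen to make the effect visible.

```python
import numpy as np
import h5py

# Repetitive data compresses well; real data will behave differently.
data = np.tile(np.arange(1000, dtype="f8"), (1000, 1))

filter_file = h5py.File("filter_demo.h5", "w", driver="core", backing_store=False)

# Same data, two datasets: only "packed" goes through the filter pipeline.
raw = filter_file.create_dataset("raw", data=data)
packed = filter_file.create_dataset(
    "packed",
    data=data,
    chunks=(100, 1000),      # filters require chunked storage
    compression="gzip",
    compression_opts=4,      # gzip level, 0-9
    shuffle=True,            # byte-shuffle filter, often helps gzip
)
filter_file.flush()

# Bytes each dataset occupies in the file.
raw_bytes = raw.id.get_storage_size()
packed_bytes = packed.id.get_storage_size()

# Reading decompresses transparently; the caller sees the original values.
roundtrip = packed[...]
```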
    14. Higher Language Bindings
        C++, Fortran (95 & 2003), Java, .NET, Python
        • C++ & Fortran distributed with library
        • Java distributed separately
        • .NET distributed separately, not supported by THG (as-is)
        • Python (PyTables, h5py) not distributed by THG
        NOTE: HDF5 bindings are thin wrappers over the C API.
        • There is no object-oriented interface to HDF5
        • Not pure Java, .NET, etc.
    15. Questions? Helpful links
        • THG: www.hdfgroup.org
        • Downloads: www.hdfgroup.org/HDF5/release/obtain5.html
        • Documentation: www.hdfgroup.org/HDF5/doc/index.html
        • Bioinformatics: www.hdfgroup.org/projects/bioinformatics/
        • Tutorials: www.hdfgroup.org/HDF5/Tutor/index.html
        • Contact/help desk: www.hdfgroup.org/about/contact.html