D Robinson - Using HDF5 to work with large quantities of rich biological data
Presentation at BOSC 2012 by D Robinson - Using HDF5 to work with large quantities of rich biological data


Published in: Technology, Education
  • HDF is an ADJECTIVE
  • Add Sony Pictures
  • The second statement is what we mean by "rich"
  • High-level view; point out that the file format is NOT "HDF5" (mention VOL). Gerd is a little unhappy with "structured", but it should be OK for this audience.
  • HDF5 has the characteristics of other formats that are out there. It's hard to store metadata in a binary flat file, and it is not scalable.
  • Gerd points out that a library is properly a part of the self-describing representation
  • High performance can have many meanings
  • Again, note that links are named, not objects
  • Much more low-level than, say, an RDBMS, though the ease of use of a database can come at a performance cost. "Easy" access via Python, Gerd's PowerShell snap-in, etc. Can write your own data access API to create queries, etc.
  • Need to reword this! "These are called dataspaces" = bad.
  • Add resource links to this slide
  • Why should you listen to my talk?
  • Note that links are named, not objects! Gerd thinks of names as NAVIGATORS.
  • Wide variety of integer and floating point types, enum types, etc. Need to point out that variable-length strings have compression issues (fixable, with $$$).
  • Might mention sparsity for chunks here. Mike suggests not mentioning chunks, so perhaps that could be replaced with a note about sparse data.
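The note that links are named, not objects, can be illustrated with a minimal sketch. It assumes the h5py binding and NumPy are installed; the file name and the names `raw` and `alias` are hypothetical. Two names are linked to one dataset, a write through one name is visible through the other, and deleting one name leaves the object reachable through the remaining link.

```python
import h5py
import numpy as np

# Assumption: an in-memory file (driver="core", backing_store=False),
# so nothing is written to disk. The file name is a placeholder.
f = h5py.File("links_sketch.h5", "w", driver="core", backing_store=False)

# Create one dataset; "raw" is the name of the link, not of the object.
f.create_dataset("raw", data=np.zeros(3, dtype="int64"))

# Add a second hard link that points at the same object.
f["alias"] = f["raw"]

# Write through one name, read through the other.
f["alias"][0] = 42
first_via_raw = int(f["raw"][0])

# Removing a link removes only that name; the object stays reachable
# through the remaining link.
del f["raw"]
still_there = int(f["alias"][0])
names_left = list(f.keys())
```

This sketch uses hard links only; HDF5 also offers soft and external links, which are not shown here.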
  • Transcript

    • 1. Using HDF5 To Work With Large Quantities of Rich Biological Data
      Dana Robinson (derobins@hdfgroup.org), The HDF Group
      July 13, 2012, BOSC 2012
    • 2. Today's Goal
      Is that you walk away from this talk with a basic understanding of the HDF5 technology stack.
    • 3. Where is HDF5 used?
    • 4. What is HDF5?
      HDF5 is a highly scalable way to organize and store heterogeneous, multidimensional data of user-defined types.
      HDF5 also allows data relationships and context to be stored using annotation and linking.
    • 5. HDF5
      The HDF5 technology suite includes:
        • A structured binary file format
        • An abstract data model for describing your data
        • A data access library, written in C (with bindings for C++, Fortran 95/2003, and Java)
    • 6. HDF5 has characteristics of …
        • Directories and Files: hierarchical; collections of related information
        • PDF: standard exchange format; heterogeneous information
        • Databases: subsetting; random access
        • XML: self-describing; extensible types; rich metadata
        • Binary Flat File: high-performance
    • 7. Advantages of HDF5
        • Platform and architecture-independent
        • Scalable in space and time
            • File size only limited by OS and filesystem
            • Data access time (esp. parallel) scales well
        • Flexible (user-defined types and organization)
        • Files are self-describing
    • 8. Advantages of HDF5 (2)
        • High-performance
        • Parallel I/O via MPI-IO
        • Supports compression and other filters
        • Open source (BSD license)
        • THG committed to provide long-term support
    • 9. HDF5 Data Objects
        • Groups
        • Datasets
        • Datatypes
        • Metadata (Attributes)
    • 10. Example: LCMS Data
      (Diagram labels: sample name, chromatography parameters, ms parameters, ms/ms parameters)
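As a rough illustration of how the four object kinds from slide 9 might hold the LCMS information named on slide 10, here is a minimal sketch. It assumes the h5py binding and NumPy; every group name, attribute name, value, and array shape below is hypothetical and is not taken from the talk.

```python
import h5py
import numpy as np

def build_lcms_file():
    """Create an in-memory HDF5 file with a hypothetical LC-MS layout."""
    # Assumption: driver="core" with backing_store=False keeps the file
    # in memory only. The file name is a placeholder.
    f = h5py.File("lcms_sketch.h5", "w", driver="core", backing_store=False)

    # Group: a container for other objects, similar to a directory.
    run = f.create_group("sample_001")
    # Attribute: small metadata attached to an object.
    run.attrs["sample_name"] = "hypothetical sample A"

    chrom = run.create_group("chromatography")
    chrom.attrs["flow_rate_ul_per_min"] = 300.0

    ms = run.create_group("ms")
    ms.attrs["polarity"] = "positive"
    # Dataset: a multidimensional array (here scans x m/z bins).
    ms.create_dataset("intensities", data=np.zeros((4, 16), dtype="float32"))

    msms = run.create_group("msms")
    msms.attrs["collision_energy_ev"] = 35.0
    # Datatype: a user-defined compound type with two named fields.
    peak_type = np.dtype([("mz", "f8"), ("intensity", "f4")])
    msms.create_dataset("peaks", shape=(5,), dtype=peak_type)
    return f

lcms = build_lcms_file()
```

Whether parameters belong in attributes or in their own datasets is a design choice; attributes suit small values that describe the object they are attached to.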
    • 11. HDF5 Data Access
      Unlike many data storage systems, HDF5 has no built-in query engine or indexes.
      You will have to write your own data access code, usually using the HDF5 API.
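One way such hand-written access code might look is sketched below, assuming the h5py binding and NumPy. The helper `find_datasets`, the attribute `ms_level`, and the dataset paths are all hypothetical. With no index available, the helper walks every object in the file and applies a caller-supplied test.

```python
import h5py
import numpy as np

def find_datasets(h5file, predicate):
    """Return sorted paths of all datasets for which predicate(dataset) is true."""
    matches = []

    def visitor(name, obj):
        # visititems calls this once per object below the root.
        if isinstance(obj, h5py.Dataset) and predicate(obj):
            matches.append(name)
        # Returning None tells visititems to keep walking.

    h5file.visititems(visitor)
    return sorted(matches)

def make_demo_file():
    """Build a small in-memory file with hypothetical scan datasets."""
    f = h5py.File("query_sketch.h5", "w", driver="core", backing_store=False)
    for path, level in [("run1/scan_a", 1), ("run1/scan_b", 2), ("run2/scan_c", 2)]:
        # Intermediate groups (run1, run2) are created automatically.
        dset = f.create_dataset(path, data=np.arange(4))
        dset.attrs["ms_level"] = level
    return f

demo = make_demo_file()
ms2_paths = find_datasets(demo, lambda d: d.attrs.get("ms_level") == 2)
```

A full traversal like this touches every object, so for large files it may be worth keeping a separate lookup table; that is outside the scope of this sketch.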
    • 12. Dataspaces
      HDF5 has a rich set of data subsetting functionality.
      Example: displaying a thumbnail of a high-resolution image.
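The thumbnail case can be sketched with a strided selection, assuming the h5py binding and NumPy. The image contents, its size, and the step of 8 are made up for illustration. Slicing the dataset object, rather than an in-memory array, asks the library for only the selected elements.

```python
import h5py
import numpy as np

def read_thumbnail(dset, step):
    """Return every `step`-th pixel along both axes of a 2-D dataset."""
    # Slicing an h5py dataset with a step builds a strided selection,
    # so only the selected elements are returned to the caller.
    return dset[::step, ::step]

# Assumption: an in-memory file holding a hypothetical 64 x 48 image.
f = h5py.File("image_sketch.h5", "w", driver="core", backing_store=False)
image = np.arange(64 * 48, dtype="uint16").reshape(64, 48)
dset = f.create_dataset("image", data=image)

thumb = read_thumbnail(dset, 8)
```

A plain stride only subsamples; a display-quality thumbnail would normally average neighbouring pixels as well, which is left out here.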
    • 13. Filters and Compression
      HDF5 supports data filters, including compression, which transform data as it enters or leaves the file.
      (Diagram: compressed data in the file ↔ compression filter ↔ uncompressed data in user's buffer)
      Note that HDF5 data objects are filtered individually, not the entire file!
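The per-object nature of filters can be sketched as follows, assuming the h5py binding and NumPy. The dataset names, the chunk shape, and the gzip level are arbitrary choices for illustration. One dataset is stored with a gzip filter and another in the same file is stored without one, while reads return ordinary uncompressed arrays in both cases.

```python
import h5py
import numpy as np

# Assumption: an in-memory file; the file name is a placeholder.
f = h5py.File("filter_sketch.h5", "w", driver="core", backing_store=False)
data = np.zeros((256, 256), dtype="int32")

# No filter: the values are stored as-is.
plain = f.create_dataset("plain", data=data)

# gzip filter applied to this dataset only. Filters work on chunks,
# so a chunk shape is given explicitly (64 x 64 is an arbitrary choice).
packed = f.create_dataset(
    "packed",
    data=data,
    chunks=(64, 64),
    compression="gzip",
    compression_opts=4,
)

# Reading decompresses transparently into the caller's buffer.
roundtrip = packed[...]
```

Because the filter is set per dataset, a file can mix compressed and uncompressed objects, which matches the slide's point that objects are filtered individually.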
    • 14. Higher Language Bindings
      C++, Fortran (95 & 2003), Java, .NET, Python
        • C++ & Fortran distributed with library
        • Java distributed separately
        • .NET distributed separately, not supported by THG (as-is)
        • Python (PyTables, h5py) not distributed by THG
      NOTE: HDF5 bindings are thin wrappers over the C API.
        • There is no object-oriented interface to HDF5
        • Not pure Java, .NET, etc.
    • 15. Questions? Helpful links
        • THG: www.hdfgroup.org
        • Downloads: www.hdfgroup.org/HDF5/release/obtain5.html
        • Documentation: www.hdfgroup.org/HDF5/doc/index.html
        • Bioinformatics: www.hdfgroup.org/projects/bioinformatics/
        • Tutorials: www.hdfgroup.org/HDF5/Tutor/index.html
        • Contact/help desk: www.hdfgroup.org/about/contact.html
