Python and HDF5: Overview

6,168 views

Published on

From a talk by Andrew Collette to the Boulder Earth and Space Science Informatics Group (BESSIG) on November 20, 2013.

This talk explores how researchers can use the scalable, self-describing HDF5 data format together with the Python programming language to improve the analysis pipeline, easily archive and share large datasets, and improve confidence in scientific results. The discussion will focus on real-world applications of HDF5 in experimental physics at two multimillion-dollar research facilities: the Large Plasma Device at UCLA, and the NASA-funded hypervelocity dust accelerator at CU Boulder. This event coincides with the launch of a new O’Reilly book, Python and HDF5: Unlocking Scientific Data.

As scientific datasets grow from gigabytes to terabytes and beyond, the use of standard formats for data storage and communication becomes critical. HDF5, the most recent version of the Hierarchical Data Format originally developed at the National Center for Supercomputing Applications (NCSA), has rapidly emerged as the mechanism of choice for storing and sharing large datasets. At the same time, many researchers who routinely deal with large numerical datasets have been drawn to the Python by its ease of use and rapid development capabilities.

Over the past several years, Python has emerged as a credible alternative to scientific analysis environments like IDL or MATLAB. In addition to stable core packages for handling numerical arrays, analysis, and plotting, the Python ecosystem provides a huge selection of more specialized software, reducing the amount of work necessary to write scientific code while also increasing the quality of results. Python’s excellent support for standard data formats allows scientists to interact seamlessly with colleagues using other platforms.

Published in: Technology

Python and HDF5: Overview

  1. 1. Python and HDF5 Andrew Collette University of Colorado
  2. 2. What makes scientific data special?
  3. 3. What makes scientific data special? It’s meant to be shared - collaborative Ad-hoc or changing structure - flexible Archived and preserved - robust Python and HDF5 together address all three
  4. 4. High-level language Fully object-oriented Almost no “boilerplate” code Readable Free (the language) “Exception” error handling Self-documenting First-class module/namespace support
  5. 5. (the platform) Mature numerical, plotting and scientific modules Hundreds of specialized science packages Thousands more general-purpose Python itself is “batteries included”
  6. 6. Core analysis packages NumPy - Array objects and basic operations SciPy - Advanced science & engineering library Matplotlib - Publication-quality plots
 (both rendered and interactive)
  7. 7. Thousands of others Distribution - distutils/pip single-command installs Unit testing - unittest module in stdlib Interface: F2PY (Fortran), Cython (C), ctypes, others Web servers and development - literally hundreds Only need to write code for your problem
  8. 8. Python highlights
  9. 9. Readable
  10. 10. Iteration C IDL Python
  11. 11. Speed
  12. 12. Speed FFTs and optimized routines built in to NumPy/Scipy
  13. 13. Speed FFTs and optimized routines built in to NumPy/Scipy ctypes and Cython
  14. 14. ctypes Advanced foreign function interface Call C libraries from pure Python code
  15. 15. Cython Example from the HDF5 C Library:
  16. 16. HDF5
  17. 17. HDF5
  18. 18. Hierarchical Data Format 3 things: File specification and object model C library Ecosystem of users and developers
  19. 19. Objects Datasets - Homogenous arrays of data Groups: containers holding datasets and groups Attributes: arbitrary metadata on groups & datasets Standard constructs using these, or make your own!
  20. 20. Dataset features Partial I/O: read and write just what you want (In Python, we even use the array-access syntax!) Automatic type conversion On-the-fly compression Parallel reads & writes with MPI (Directly from Python!)
  21. 21. Metadata & Organization Groups form a POSIX-style “filesystem” in the file Attributes can store arbitrary data on arbitrary objects How should the file be organized? You decide! ! Thousands of domain-specific “application formats” Anyone can read them because HDF5 is self-describing!
  22. 22. Example
  23. 23. Open an HDF5 file Extract a particular dataset Read the data Make an interactive plot Close the file
  24. 24. Open an HDF5 file Extract a particular dataset Read the data Make an interactive plot Close the file
  25. 25. Open an HDF5 file Extract a particular dataset Read the data Make an interactive plot Close the file
  26. 26. Open an HDF5 file Extract a particular dataset Read the data Make an interactive plot Close the file
  27. 27. Open an HDF5 file Extract a particular dataset Read the data Make an interactive plot Close the file
  28. 28. Open an HDF5 file Extract a particular dataset Read the data Make an interactive plot Close the file
  29. 29. Demo
  30. 30. Real-world use
  31. 31. UCLA Large Plasma Device
  32. 32. UCLA Large Plasma Device Image credit: Basic Plasma Science Facility
  33. 33. Laser Experiment Image credit: Basic Plasma Science Facility
  34. 34. LAPD Data Products Acquisition file - “Planes” of data in HDF5 Metadata:
 timestamps, digitizer settings, probe positions, background plasma conditions… Packaged into HDF5 following “lab layout” Users take their data back home and analyze
  35. 35. Visualization
  36. 36. Python 2D plotting A. Collette et al. Phys. Rev. Lett 105, 195003 (2010)
  37. 37. Only 160 lines of code! A. Collette et al. Phys. Rev. Lett 105, 195003 (2010)
  38. 38. Python does 3D too! “MayaVi” 3D visualizer Development sponsored by Enthought Both offline (scripted) and interactive modes A. Collette et al. Phys. Plasmas 18, 055705 (2011)
  39. 39. CU Accelerator
  40. 40. CU Accelerator
  41. 41. CU Accelerator
  42. 42. CU Accelerator
  43. 43. CU Accelerator Raw data HDF5 Shot file Automated speed/mass calculation Data search HDF5 file for user MySQL
  44. 44. Where to get Python
  45. 45. Where to get Python Distributions are the best way to get started (they include HDF5/h5py!) Anaconda (Windows, Mac, Linux): http://continuum.io PythonXY (Windows) http://pythonxy.googlecode.com
  46. 46. Questions?

×