Python and HDF5
Andrew Collette
University of Colorado
What makes scientific data special?

It’s meant to be shared - collaborative
Ad-hoc or changing structure - flexible
Archived and preserved - robust

Python and HDF5 together address all three
High-level language
Fully object-oriented
Almost no “boilerplate” code

Readable
Free
(the language)
“Exception” error handling

Self-documenting

First-class module/namespace support
(the platform)

Mature numerical, plotting and scientific modules
Hundreds of specialized science packages
Thousands more general-purpose
Python itself is “batteries included”
Core analysis packages
NumPy - Array objects and basic operations
SciPy - Advanced science & engineering library
Matplotlib - Publication-quality plots

(both rendered and interactive)
Thousands of others
Distribution - distutils/pip single-command installs
Unit testing - unittest module in stdlib
Interface: F2PY (Fortran), Cython (C), ctypes, others
Web servers and development - literally hundreds
Only need to write code for your problem
Python highlights
Readable
Iteration
C

IDL

Python
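The code shown on the original slide is not preserved here. As a stand-in, this is a minimal sketch of iteration in Python, using made-up sample data (the names `names` and `energies` are hypothetical):

```python
# Hypothetical sample data, for illustration only.
names = ["shot_001", "shot_002", "shot_003"]
energies = [1.5, 2.0, 2.5]

# Loop over both lists in step: no index variable, bounds, or type declarations.
lines = []
for name, energy in zip(names, energies):
    lines.append(f"{name}: {energy:.1f} J")

total = sum(energies)
print("\n".join(lines))
print("total:", total)
```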
Speed
FFTs and optimized routines built into NumPy/SciPy
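A rough sketch of what "built in" means in practice: one NumPy call runs the whole transform in compiled code. The test signal (a 50 Hz sine sampled at 1 kHz) is an assumption chosen for illustration.

```python
import numpy as np

# Assumed test signal: a 50 Hz sine sampled at 1 kHz for one second.
sample_rate = 1000                                   # samples per second
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 50 * t)

# Real-input FFT and the matching frequency axis, both vectorized.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1 / sample_rate)

# Frequency bin with the largest magnitude.
peak_freq = freqs[np.argmax(np.abs(spectrum))]
print(peak_freq)
```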
ctypes and Cython
ctypes
Advanced foreign function interface
Call C libraries from pure Python code
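A minimal sketch of calling a C library from pure Python with ctypes. It calls `strlen` from the C standard library rather than anything from HDF5, and the library lookup assumes a Linux or macOS system.

```python
import ctypes
import ctypes.util

# Locate the C standard library. If the lookup returns None, CDLL(None) exposes
# the symbols already loaded into this process (Linux/macOS behaviour).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"HDF5")
print(length)
```

Declaring `argtypes` and `restype` is optional for simple cases, but it lets ctypes check and convert the arguments instead of guessing.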
Cython
Example from the HDF5 C Library:
HDF5
Hierarchical Data Format
3 things:
File specification and object model
C library
Ecosystem of users and developers
Objects
Datasets: homogeneous arrays of data
Groups: containers holding datasets and groups
Attributes: arbitrary metadata on groups & datasets

Standard constructs using these, or make your own!
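The three object types can be sketched with h5py as follows. The group, dataset, and attribute names are illustrative, and the file lives only in memory (core driver with no backing store) so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file: the "core" driver with backing_store=False never touches disk.
with h5py.File("objects_demo.h5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("experiment")              # group: a container
    dset = grp.create_dataset("temperature",        # dataset: homogeneous array
                              data=np.linspace(20.0, 25.0, 6))
    dset.attrs["units"] = "degC"                    # attribute on a dataset
    grp.attrs["operator"] = "hypothetical user"     # attribute on a group

    # Collect a few facts while the file is still open.
    dataset_shape = dset.shape
    units = dset.attrs["units"]
    members = list(grp.keys())

print(dataset_shape, units, members)
```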
Dataset features
Partial I/O: read and write just what you want
(In Python, we even use the array-access syntax!)
Automatic type conversion
On-the-fly compression
Parallel reads & writes with MPI
(Directly from Python!)
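A sketch of partial I/O, type conversion, and compression with h5py, using an assumed 1000 x 1000 dataset named `image` in a temporary file. Parallel MPI access is not shown, since it needs an MPI-enabled build of HDF5 and h5py plus the mpi4py package.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "features_demo.h5")  # throwaway location

with h5py.File(path, "w") as f:
    # On-the-fly compression: chunks are gzip-compressed as they are written.
    dset = f.create_dataset("image", shape=(1000, 1000), dtype="f4",
                            compression="gzip")
    # Partial write with array-access syntax; float64 input is stored as float32.
    dset[0:100, 0:100] = np.ones((100, 100))

with h5py.File(path, "r") as f:
    dset = f["image"]
    # Partial read: only this 100 x 100 window is fetched from the file.
    block = dset[50:150, 50:150]
    # Type conversion on read: float32 on disk lands in a float64 array.
    out = np.empty((100, 100), dtype="f8")
    dset.read_direct(out, source_sel=np.s_[0:100, 0:100])
    stored_dtype = dset.dtype
    compression = dset.compression

print(block.shape, block.sum(), out.dtype, stored_dtype, compression)
```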
Metadata & Organization
Groups form a POSIX-style “filesystem” in the file
Attributes can store arbitrary data on arbitrary objects
How should the file be organized?
You decide!

Thousands of domain-specific “application formats”
Anyone can read them because HDF5 is self-describing!
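Because the file describes itself, a reader can discover the layout without knowing it in advance. A minimal sketch with h5py, using a made-up layout (`run01/probe_a/voltage`) held in memory:

```python
import h5py
import numpy as np

found = []

def describe(name, obj):
    """Record each object's path and kind; visititems calls this once per object."""
    kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
    found.append((name, kind))

with h5py.File("layout_demo.h5", "w", driver="core", backing_store=False) as f:
    # Intermediate groups in a path are created automatically.
    f.create_dataset("run01/probe_a/voltage", data=np.zeros(8))
    f["run01"].attrs["comment"] = "hypothetical run"

    # Step-by-step indexing and a filesystem-style path reach the same object.
    full_name = f["run01"]["probe_a"]["voltage"].name

    # Walk the whole hierarchy, as a reader new to the file might.
    f.visititems(describe)
    comment = f["run01"].attrs["comment"]

for name, kind in found:
    print(kind, name)
```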
Example
Open an HDF5 file
Extract a particular dataset
Read the data
Make an interactive plot
Close the file
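The five steps above can be sketched with h5py and Matplotlib. The sample file and the dataset name `signal` are assumptions, created here only so the sketch is self-contained.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small sample file (assumed name and contents, for illustration only).
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("signal", data=np.sin(np.linspace(0, 2 * np.pi, 200)))


def read_dataset(path, name):
    """Open the file, pick out one dataset, read it, and close the file."""
    with h5py.File(path, "r") as f:   # open; the file is closed when the block exits
        dset = f[name]                # extract a dataset; no data is read yet
        return dset[...]              # read the data into a NumPy array


def plot_interactive(data):
    """Show the data in an interactive Matplotlib window."""
    import matplotlib.pyplot as plt   # imported here so reading works without it
    fig, ax = plt.subplots()
    ax.plot(data)
    ax.set_xlabel("sample index")
    plt.show()                        # blocks until the window is closed


data = read_dataset(path, "signal")
```

Calling `plot_interactive(data)` then opens the plot window. It is kept separate so the read step can run on a machine without a display.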
Demo
Real-world use
UCLA Large Plasma Device

Image credit: Basic Plasma Science Facility
Laser Experiment

Image credit: Basic Plasma Science Facility
LAPD Data Products
Acquisition file - “Planes” of data in HDF5
Metadata: timestamps, digitizer settings, probe positions, background plasma conditions…
Packaged into HDF5 following “lab layout”
Users take their data back home and analyze
Visualization
Python 2D plotting

A. Collette et al., Phys. Rev. Lett. 105, 195003 (2010)
Only 160 lines of code!

A. Collette et al., Phys. Rev. Lett. 105, 195003 (2010)
Python does 3D too!
“MayaVi” 3D visualizer
Development sponsored by Enthought
Both offline (scripted) and interactive modes

A. Collette et al. Phys. Plasmas 18, 055705 (2011)
CU Accelerator
Raw data

HDF5 Shot file
Automated speed/mass calculation

Data search
HDF5 file for user

MySQL
Where to get Python
Distributions are the best way to get started
(they include HDF5/h5py!)
Anaconda (Windows, Mac, Linux):
http://continuum.io
PythonXY (Windows)
http://pythonxy.googlecode.com
Questions?
