PyData SV 2013 - Tutorials




HDF5 is for Lovers
March 18th, 2013, PyData, Silicon Valley
Anthony Scopatz
The FLASH Center
The University of Chicago
scopatz@gmail.com
What is HDF5?
HDF5 stands for (H)ierarchical (D)ata (F)ormat (5)ive.
It is supported by the lovely people at The HDF Group.
At its core, HDF5 is a binary file type specification.
However, what makes HDF5 great is the numerous
libraries written to interact with files of this type and their
extremely rich feature set.

               Which you will learn today!
A Note on the Format
Intermixed, there will be:

    • Slides
    • Interactive Hacking
    • Exercises

Feel free to:

    • Ask questions at any time
    • Explore at your own pace
Class Makeup
By a show of hands, how many people have used:

    • HDF5 before?
    • PyTables?
    • h5py?
    • the HDF5 C API?
    • SQL?
    • Other binary data formats?

Setup

Please clone the repo:


git clone git://github.com/scopatz/hdf5-is-for-lovers.git



Or download a tarball from:

    https://github.com/scopatz/hdf5-is-for-lovers




Warm up exercise
In IPython:

import numpy as np
import tables as tb

f = tb.openFile('temp.h5', 'a')
heart = np.ones(42, dtype=[('rate', int), ('beat', float)])
f.createTable('/', 'heart', heart)
f.close()


Or run python exer/warmup.py



Warm up exercise
You should see in ViTables:

[figure: screenshot of temp.h5 with the heart table open in ViTables]
A Brief Introduction
For persisting structured numerical data, binary formats
are superior to plaintext.
For one thing, they are often smaller:

# small ints             # med ints
42   (4 bytes)           123456   (4 bytes)
'42' (2 bytes)           '123456' (6 bytes)

# near-int floats        # e-notation floats
12.34   (8 bytes)        42.424242E+42   (8 bytes)
'12.34' (5 bytes)        '42.424242E+42' (13 bytes)
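A quick way to check these sizes yourself is with Python's struct module (an illustrative sketch, not from the original deck):

import struct
len(struct.pack('i', 42))      # 4 bytes as a binary int
len('42')                      # 2 bytes as text
len(struct.pack('d', 12.34))   # 8 bytes as a binary double
len('12.34')                   # 5 bytes as text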
A Brief Introduction
For another, binary formats are often faster for I/O because
atoi() and atof() are expensive.
However, you often want something more than a binary
chunk of data in a file.

                           Note

    This is the mechanism behind numpy.save() and
    numpy.savez().
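For example, a round trip through NumPy's raw-binary format (a minimal sketch; 'temp.npy' is an illustrative filename):

import numpy as np
a = np.arange(10)
np.save('temp.npy', a)    # writes a small header plus the raw bytes
b = np.load('temp.npy')   # reads it straight back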
A Brief Introduction
Instead, you want a real database with the ability to store
many datasets, user-defined metadata, optimized I/O, and
the ability to query its contents.
Unlike SQL, where every dataset lives in a flat namespace,
HDF allows datasets to live in a nested tree structure.
In effect, HDF5 is a file system within a file.
(More on this later.)




A Brief Introduction
Basic dataset classes include:

     • Array
     • CArray (chunked array)
     • EArray (extendable array)
     • VLArray (variable length array)
     • Table (structured array w/ named fields)

All of these must be composed of atomic types.

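A quick taste of two of these classes (a sketch against the PyTables 2.x API; 'classes.h5' and the node names are illustrative):

import numpy as np
import tables as tb

f = tb.openFile('classes.h5', 'w')
ea = f.createEArray('/', 'growing', tb.Float64Atom(), (0,))  # extendable dim
ea.append(np.arange(10.0))
vl = f.createVLArray('/', 'ragged', tb.Int64Atom())          # variable length rows
vl.append([1, 2, 3])
vl.append([4, 5])
f.close()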
A Brief Introduction
There are six kinds of types supported by PyTables:

 • bool: Boolean (true/false) types. 8 bits.
 • int: Signed integer types. 8, 16, 32 (default) and 64 bits.
 • uint: Unsigned integers. 8, 16, 32 (default) and 64 bits.
 • float: Floating point types. 16, 32 and 64 (default) bits.
 • complex: Complex number. 64 and 128 (default) bits.
 • string: Raw string types. 8-bit positive multiples.

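In PyTables, these kinds map onto Atom classes (an illustrative sketch):

import numpy as np
import tables as tb

tb.BoolAtom()                               # bool, 8 bits
tb.Int32Atom()                              # int, 32 bits (default)
tb.UInt64Atom()                             # uint, 64 bits
tb.Float64Atom()                            # float, 64 bits (default)
tb.StringAtom(itemsize=10)                  # raw string, 10 bytes
tb.Atom.from_dtype(np.dtype('complex128'))  # from a NumPy dtype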
A Brief Introduction
Other elements of the hierarchy may include:

     • Groups (dirs)
     • Links
     • File Nodes
     • Hidden Nodes

PyTables docs may be found at http://pytables.github.com/


Opening Files

import tables as tb
f = tb.openFile('/path/to/file', 'a')

 • 'r': Read-only; no data can be modified.
 • 'w': Write; a new file is created (an existing file with the
   same name is deleted).
 • 'a': Append; an existing file is opened for reading and
   writing, and if the file does not exist it is created.
 • 'r+': Similar to 'a', but the file must already exist.
Using the Hierarchy
In HDF5, all nodes stem from a root ("/" or f.root).
In PyTables, you may access nodes as attributes on a
Python object (f.root.a_group.some_data).
This is known as natural naming.
Creating new nodes must be done on the file handle:

f.createGroup('/', 'a_group', "My Group")
f.root.a_group



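Natural naming and path-based lookup are interchangeable (a sketch, using the group created above):

f.root.a_group             # natural naming
f.getNode('/a_group')      # the same node, by path
f.getNode('/', 'a_group')  # where + name form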
Creating Datasets
The two most common datasets are Tables & Arrays.
Appropriate create methods live on the file handle:
# integer array
f.createArray('/a_group', 'arthur_count', [1, 2, 5, 3])


# tables, need descriptions
dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
f.createTable('/', 'knights', dt)
f.root.knights.append(knights)




Reading Datasets
Arrays and Tables try to preserve the original flavor that
they were created with.

>>> print f.root.a_group.arthur_count[:]
[1, 2, 5, 3]

>>> type(f.root.a_group.arthur_count[:])
list

>>> type(f.root.a_group.arthur_count)
tables.array.Array

Reading Datasets
So if they come from NumPy arrays, they may be accessed
in a numpy-like fashion (slicing, fancy indexing, masking).

>>> f.root.knights[1]
(12, 'Bedivere')

>>> f.root.knights[:1]
array([(42, 'Lancelot')], dtype=[('id', '<i8'), ('name', 'S10')])

>>> mask = (f.root.knights.cols.id[:] < 28)
>>> f.root.knights[mask]
array([(12, 'Bedivere')], dtype=[('id', '<i8'), ('name', 'S10')])

>>> f.root.knights[([1, 0],)]
array([(12, 'Bedivere'), (42, 'Lancelot')], dtype=[('id', '<i8'), ('name', 'S10')])




Data accessed in this way is memory mapped.

Exercise
     exer/peaks_of_kilimanjaro.py




Exercise
     sol/peaks_of_kilimanjaro.py




Hierarchy Layout
Suppose there is a big table of like-things:
# people: name,               profession,      home
people = [('Arthur',          'King',          'Camelot'),
          ('Lancelot',        'Knight',        'Lake'),
          ('Bedevere',        'Knight',        'Wales'),
          ('Witch',           'Witch',         'Village'),
          ('Guard',           'Man-at-Arms',   'Swamp Castle'),
          ('Ni',              'Knight',        'Shrubbery'),
          ('Strange Woman',   'Lady',          'Lake'),
          ...
          ]

It is tempting to throw everyone into a big people table.


Hierarchy Layout
However, a search over a class of people can be eliminated
by splitting these tables up:
knight = [('Lancelot',        'Knight',        'Lake'),
          ('Bedevere',        'Knight',        'Wales'),
          ('Ni',              'Knight',        'Shrubbery'),
          ]

others = [('Arthur',          'King',          'Camelot'),
          ('Witch',           'Witch',         'Village'),
          ('Guard',           'Man-at-Arms',   'Swamp Castle'),
          ('Strange Woman',   'Lady',          'Lake'),
          ...
          ]




Hierarchy Layout
The profession column is now redundant:
knight = [('Lancelot', 'Lake'),
          ('Bedevere', 'Wales'),
          ('Ni',       'Shrubbery'),
          ]

others = [('Arthur',          'King',          'Camelot'),
          ('Witch',           'Witch',         'Village'),
          ('Guard',           'Man-at-Arms',   'Swamp Castle'),
          ('Strange Woman',   'Lady',          'Lake'),
          ...
          ]




Hierarchy Layout
Information can be embedded implicitly in the hierarchy as
well:

root
  | - England
  |     | - knight
  |     | - others
  |
  | - France
  |     | - knight
  |     | - others

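One way to build that layout (a sketch; the dtypes here are hypothetical descriptions):

knight_desc = np.dtype([('name', 'S16'), ('home', 'S16')])
others_desc = np.dtype([('name', 'S16'), ('profession', 'S16'), ('home', 'S16')])
for country in ['England', 'France']:
    group = f.createGroup('/', country)
    f.createTable(group, 'knight', knight_desc)
    f.createTable(group, 'others', others_desc)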
Hierarchy Layout
Why bother pivoting the data like this at all?

     • Fewer rows to search over.
     • Fewer rows to pull from disk.
     • Fewer columns in description.

Ultimately, it is all about speed, especially for big tables.




Access Time Analogy
If a processor's access of L1 cache is analogous to you finding
a word on a computer screen (3 seconds), then
Accessing L2 cache is getting a book from a bookshelf (15 s).
Accessing main memory is going to the break room, getting a
candy bar, and chatting with your co-worker (4 min).
Accessing a (mechanical) HDD is leaving your office, leaving
your building, and wandering the planet for a year and four
months before returning to your desk with the information
finally in hand.
Thanks K. Smith & http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait
Starving CPU Problem
Waiting around for access times prior to computation is
known as the Starving CPU Problem.




Francesc Alted. 2010. Why Modern CPUs Are Starving and What Can Be Done about It.
Computing in Science & Engineering 12, 2 (March 2010), 68-71. DOI=10.1109/MCSE.2010.51
http://dx.doi.org/10.1109/MCSE.2010.51
Tables
Tables are a high-level interface to extendable arrays of
structs.
Sort-of.
In fact, the struct / dtype / description concept is only a
convenient way to assign meaning to bytes:

| ids   |       first       |       last        |
|-------|-------------------|-------------------|
| | | | | | | | | | | | | | | | | | | | | | | | |
Tables
Data types may be nested (though they are stored in a
flattened way).

dt = np.dtype([('id', int),
               ('first', 'S5'),
               ('last', 'S5'),
               ('parents', [
                     ('mom_id', int),
                     ('dad_id', int),
                 ]),
              ])

people = np.fromstring(np.random.bytes(dt.itemsize * 10000), dt)
f.createTable('/', 'random_peeps', people)
Tables
Python already has the ability to dynamically declare the
size of descriptions.
This is accomplished in compiled languages through
normal memory allocation and careful byte counting:

typedef struct mat {
  double mass;
  int atoms_per_mol;
  double comp [];
} mat;


Tables

typedef struct mat {
  double mass;
  int atoms_per_mol;
  double comp [];
} mat;

size_t mat_size = sizeof(mat) + sizeof(double)*comp_size;
hid_t desc = H5Tcreate(H5T_COMPOUND, mat_size);
hid_t comp_type = H5Tarray_create2(H5T_NATIVE_DOUBLE, 1, nuc_dims);

// make the data table type
H5Tinsert(desc, "mass", HOFFSET(mat, mass), H5T_NATIVE_DOUBLE);
H5Tinsert(desc, "atoms_per_mol", HOFFSET(mat, atoms_per_mol), H5T_NATIVE_INT);
H5Tinsert(desc, "comp", HOFFSET(mat, comp), comp_type);

// make the data array for a single row; have to over-allocate
// to leave room for the flexible comp member
mat * mat_data = (mat *) malloc(mat_size);

// ...fill in data array...

// Write the row
H5Dwrite(data_set, desc, mem_space, data_hyperslab, H5P_DEFAULT, mat_data);
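For comparison, the same row layout takes one nested dtype in PyTables (a sketch; comp_size is assumed from context):

comp_size = 3  # illustrative
dt = np.dtype([('mass', float),
               ('atoms_per_mol', int),
               ('comp', float, (comp_size,))])  # fixed-size array field
f.createTable('/', 'mat', dt)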
Exercise
     exer/boatload.py




Exercise
     sol/boatload.py




Chunking
Chunking is a feature with no direct analogy in NumPy.

Chunking is the ability to split up a dataset into smaller
blocks of equal or lesser rank.

Extra metadata pointing to the location of the chunk in the
file and in dataspace must be stored.
By chunking, sparse data may be stored efficiently and
datasets may extend infinitely in all dimensions.
Note: Currently, PyTables only allows one extendable dim.
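For instance, a large, mostly-empty CArray only writes the chunks that are touched (a sketch; names and shapes are illustrative):

ca = f.createCArray('/', 'sparse', tb.Float64Atom(), (1000, 1000),
                    chunkshape=(100, 100))
ca[42, 42] = 1.0   # only the one touched chunk goes to disk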
Chunking

[figure: Contiguous Dataset]

[figure: Chunked Dataset]
Chunking
All I/O happens by chunk. This is important because:

     • edge chunks may extend beyond the dataset
     • default fill values are set in unallocated space
     • reading and writing happen in parallel by chunk
     • small chunks are good for accessing some of the data
     • large chunks are good for accessing lots of data
Chunking
Any chunked dataset allows you to set the chunksize.
f.createTable('/', 'omnomnom', data, chunkshape=(42,42))

For example, a 4x4 chunked array could have a 3x3
chunksize.
However, it could not have a 12x12 chunksize, since the
ranks must be less than or equal to that of the array.
Manipulating the chunksize is a great way to fine-tune an
application.


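The chunk shape chosen (or computed automatically) can be read back from any chunked leaf (a sketch):

f.root.omnomnom.chunkshape   # e.g. (42, 42)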
Chunking

[figure: Contiguous 4x4 Dataset]

[figure: Chunked 4x4 Dataset]
Chunking
Note that the addresses of chunks in dataspace (memory)
have no bearing on their arrangement in the actual file.

[figure: Dataspace (top) vs File (bottom) Chunk Locations]
In-Core vs Out-of-Core
Calculations depend on the current memory layout.
Recall access time analogy (wander Earth for 16 months).
Definitions:

     • Operations which require all data to be in memory are
       in-core and may be memory bound (NumPy).




41
In-Core vs Out-of-Core
Calculations depend on the current memory layout.
Recall access time analogy (wander Earth for 16 months).
Definitions:

     • Operations which require all data to be in memory are
       in-core and may be memory bound (NumPy).
     • Operations where the dataset is external to memory are
       out-of-core (or in-kernel) and may be CPU bound.


In-Core Operations
Say a and b are arrays sitting in memory:

a = np.array(...)
b = np.array(...)
c = 42 * a + 28 * b + 6

The expression for c creates three temporary arrays!
For N operations, N-1 temporaries are made.
This wastes memory and is slow. Pulling from disk is slower.
In-Core Operations
A less memory intensive implementation would be an
element-wise evaluation:

c = np.empty(...)
for i in range(len(c)):
    c[i] = 42 * a[i] + 28 * b[i] + 6

But if a and b were HDF5 arrays on disk, individual element
access time would kill you.

Even with in-memory NumPy arrays, there are problems with
gratuitous Python type checking.
Out-of-Core Operations
Say there was a virtual machine (or kernel) which could be fed
arrays and perform specified operations.

Giving this machine only chunks of data at a time, it could function
on infinite-length data using only finite memory.

# r0-r3 are preallocated registers (buffers) of the block size
for i in range(0, len(a), 256):
    r0, r1 = a[i:i+256], b[i:i+256]
    multiply(r0, 42, r2)             # r2 = 42 * r0
    multiply(r1, 28, r3)             # r3 = 28 * r1
    add(r2, r3, r2); add(r2, 6, r2)  # r2 = r2 + r3 + 6
    c[i:i+256] = r2
Out-of-Core Operations
This is the basic idea behind numexpr, which provides a
general virtual machine for NumPy arrays.
This problem lends itself nicely to parallelism.
Numexpr has low-level multithreading, avoiding the GIL.
PyTables implements a tb.Expr class which uses the
numexpr VM as a backend but adds optimizations for
disk reading and writing.
The full array need never be in memory.
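In plain numexpr, the earlier in-memory expression looks like this (a minimal sketch):

import numexpr as ne

a = np.arange(1e6)
b = np.arange(1e6)
c = ne.evaluate("42*a + 28*b + 6")  # evaluated in blocks, multithreaded, no big temporaries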
Out-of-Core Operations
Fully out-of-core expression example:

shape = (10, 10000)
f = tb.openFile("/tmp/expression.h5", "w")

a = f.createCArray(f.root, 'a', tb.Float32Atom(dflt=1.), shape)
b = f.createCArray(f.root, 'b', tb.Float32Atom(dflt=2.), shape)
c = f.createCArray(f.root, 'c', tb.Float32Atom(dflt=3.), shape)
out = f.createCArray(f.root, 'out', tb.Float32Atom(dflt=3.), shape)

expr = tb.Expr("a*b+c")
expr.setOutput(out)
d = expr.eval()

print "returned-->", repr(d)
f.close()
Querying
The most common operation is asking an existing dataset
whether its elements satisfy some criteria. This is known
as querying.
Because querying is so common, PyTables defines special
methods on Tables:

tb.Table.where(cond)
tb.Table.getWhereList(cond)
tb.Table.readWhere(cond)
tb.Table.whereAppend(dest, cond)
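For instance, against the knights table from earlier (a sketch):

hits = f.root.knights.readWhere('id < 28')      # structured array of matching rows
idxs = f.root.knights.getWhereList('id < 28')   # just the matching row indices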
Querying
The conditions used in where() calls are strings which
are evaluated by numexpr. These expressions must return
boolean values.
They are executed in the context of the table itself combined
with locals() and globals().
The where() method itself returns an iterator over all
matched (hit) rows:

for row in table.where('(col1 < 42) & (col2 == col3)'):
    # do something with row
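Variables from the calling scope may be used directly, or passed explicitly through the condvars keyword (a sketch):

lim = 42
for row in table.where('col1 < lim'):                        # lim found via locals()
    pass
for row in table.where('col1 < lim', condvars={'lim': 42}):  # explicit binding
    pass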
Querying
For a speed comparison, here is a complex query using
regular Python:

result = [row['col2'] for row in table if (
          ((row['col4'] >= lim1 and row['col4'] < lim2) or
           (row['col2'] > lim3 and row['col2'] < lim4)) and
          ((row['col1'] + 3.1*row['col2'] + row['col3']*row['col4']) > lim5)
          )]

And this is the equivalent out-of-core search:

result = [row['col2'] for row in table.where(
            '(((col4 >= lim1) & (col4 < lim2)) |  '
            ' ((col2 > lim3) & (col2 < lim4))) &  '
            '((col1 + 3.1*col2 + col3*col4) > lim5)')]
Querying

[figure: Complex query with 10 million rows. Data fits in memory.]

[figure: Complex query with 1 billion rows. Too big for memory.]
Exercise
     exer/crono.py




Exercise
     sol/crono.py




Compression
A more general way to solve the starving CPU problem is
through compression.
Compression is when the dataset is piped through a
zipping algorithm on write and the inverse unzipping
algorithm on read.
Each chunk is compressed independently, so chunks end
up with a varying number of bytes.
This has some storage overhead, but may drastically reduce
file sizes for very regular data.
Compression
At first glance this is counter-intuitive. (Why?)
Compression/decompression is clearly more CPU
intensive than simply blitting an array into memory.
However, because there is less total information to
transfer, the time spent unpacking the array can be far less
than moving the array around wholesale.
This is kind of like power steering: you can either tell the
wheels how to turn manually, or you can tell the car how
you want the wheels turned.
Compression
Compression is a guaranteed feature of HDF5 itself.
At minimum, HDF5 requires zlib.
The compression capabilities feature a plugin architecture
which allows for a variety of different algorithms, including
user-defined ones!
PyTables supports:

     • zlib (default)
     • lzo
     • bzip2
     • blosc
Compression
Compression is enabled in PyTables through filters.

# complevel goes from [0,9]
filters = tb.Filters(complevel=5, complib='blosc', ...)

# filters may be set on the whole file,
f = tb.openFile('/path/to/file', 'a', filters=filters)
f.filters = filters

# filters may also be set on most other nodes
f.createTable('/', 'table', desc, filters=filters)
f.root.group._v_filters = filters

Filters only act on chunked datasets.

Compression
Tips for choosing compression parameters:

     • A mid-level (5) compression is sufficient. No need to
       go all the way up (9).
     • Use zlib if you must guarantee complete portability.
     • Use blosc all other times. It is optimized for HDF5.

But why? (I don't have time to go into the details of blosc.
However, here are some justifications...)
Compression

[figure: Comparison of different compression levels of zlib.]

[figure: Creation time per element for a 15 GB EArray and
different chunksizes.]

[figure: File sizes for a 15 GB EArray and different chunksizes.]

[figure: Sequential access time per element for a 15 GB EArray
and different chunksizes.]

[figure: Random access time per element for a 15 GB EArray
and different chunksizes.]
Exercise
     exer/spam_filter.py




Exercise
     sol/spam_filter.py




Other Python Data Structures
Overwhelmingly, numpy arrays have been the in-memory
data structure of choice.
Using lists or tuples instead of arrays follows analogously.
It is data structures like sets and dictionaries which do not
quite map.
However, as long as all elements may be cast into the same
atomic type, these structures can be stored in HDF5 with
relative ease.


Sets
Example of serializing and deserializing sets:
>>> s = {1.0, 42, 77.7, 6E+01, True}

>>> f.createArray('/', 's', [float(x) for x in s])
/s (Array(4,)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'python'
  byteorder := 'little'
  chunkshape := None

>>> set(f.root.s)
set([1.0, 42.0, 77.7, 60.0])

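A dictionary can be handled similarly, e.g. as parallel key and value arrays (a sketch; the dict_table exercise below works through a table-based version):

d = {'mass': 1.0, 'charge': -1.0}
f.createArray('/', 'd_keys', [str(k) for k in d.keys()])
f.createArray('/', 'd_values', [float(v) for v in d.values()])
rebuilt = dict(zip(f.root.d_keys[:], f.root.d_values[:]))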
Exercise
     exer/dict_table.py




Exercise
     sol/dict_table.py




What Was Missed

 • Walking Nodes
 • File Nodes
 • Indexing
 • Migrating to / from SQL
 • HDF5 in other database formats
 • Other Databases in HDF5
 • HDF5 as a File System

Acknowledgements
Many thanks to everyone who made this possible!

     • The HDF Group
     • The PyTables Governance Team:
       Josh Moore, Antonio Valentino, Josh Ayers
Acknowledgements
(Cont.)

     • The NumPy Developers
     • h5py, the symbiotic project
     • Francesc Alted

 Shameless Plug: We are always looking for more hands.
                     Join Now!



Questions





Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Recently uploaded

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

Hdf5 is for Lovers (PyData SV 2013)

  • 1. PyData SV 2013 - Tutorials HDF5 is for Lovers March 18th, 2013, PyData, Silicon Valley Anthony Scopatz The FLASH Center The University of Chicago scopatz@gmail.com 1
  • 2. What is HDF5? HDF5 stands for (H)ierarchical (D)ata (F)ormat (5)ive. 2
  • 3. What is HDF5? HDF5 stands for (H)ierarchical (D)ata (F)ormat (5)ive. It is supported by the lovely people at The HDF Group. 2
  • 4. What is HDF5? HDF5 stands for (H)ierarchical (D)ata (F)ormat (5)ive. It is supported by the lovely people at The HDF Group. At its core HDF5 is a binary file type specification. 2
  • 5. What is HDF5? HDF5 stands for (H)ierarchical (D)ata (F)ormat (5)ive. It is supported by the lovely people at The HDF Group. At its core HDF5 is a binary file type specification. However, what makes HDF5 great is the numerous libraries written to interact with files of this type and their extremely rich feature set. 2
  • 6. What is HDF5? HDF5 stands for (H)ierarchical (D)ata (F)ormat (5)ive. It is supported by the lovely people at The HDF Group. At its core HDF5 is a binary file type specification. However, what makes HDF5 great is the numerous libraries written to interact with files of this type and their extremely rich feature set. Which you will learn today! 2
  • 7. A Note on the Format Intermixed, there will be: • Slides • Interactive Hacking • Exercises 3
  • 8. A Note on the Format Intermixed, there will be: • Slides • Interactive Hacking • Exercises Feel free to: • Ask questions at any time • Explore at your own pace. 3
  • 9. Class Makeup By a show of hands, how many people have used: • HDF5 before? 4
  • 10. Class Makeup By a show of hands, how many people have used: • HDF5 before? • PyTables? 4
  • 11. Class Makeup By a show of hands, how many people have used: • HDF5 before? • PyTables? • h5py? 4
  • 12. Class Makeup By a show of hands, how many people have used: • HDF5 before? • PyTables? • h5py? • the HDF5 C API? 4
  • 13. Class Makeup By a show of hands, how many people have used: • HDF5 before? • PyTables? • h5py? • the HDF5 C API? • SQL? 4
  • 14. Class Makeup By a show of hands, how many people have used: • HDF5 before? • PyTables? • h5py? • the HDF5 C API? • SQL? • Other binary data formats? 4
  • 15. Setup Please clone the repo: git clone git://github.com/scopatz/hdf5-is-for-lovers.git Or download a tarball from: https://github.com/scopatz/hdf5-is-for-lovers 5
  • 16. Warm up exercise In IPython:

        import numpy as np
        import tables as tb

        f = tb.openFile('temp.h5', 'a')
        heart = np.ones(42, dtype=[('rate', int), ('beat', float)])
        f.createTable('/', 'heart', heart)
        f.close()

    Or run python exer/warmup.py
  • 17. Warm up exercise You should see in ViTables: 7
  • 18. A Brief Introduction For persisting structured numerical data, binary formats are superior to plaintext. 8
  • 19. A Brief Introduction For persisting structured numerical data, binary formats are superior to plaintext. For one thing, they are often smaller:

        # small ints             # med ints
        42   (4 bytes)           123456   (4 bytes)
        '42' (2 bytes)           '123456' (6 bytes)

        # near-int floats        # e-notation floats
        12.34   (8 bytes)        42.424242E+42   (8 bytes)
        '12.34' (5 bytes)        '42.424242E+42' (13 bytes)
  • 20. A Brief Introduction For another, binary formats are often faster for I/O because atoi() and atof() are expensive. 9
  • 21. A Brief Introduction For another, binary formats are often faster for I/O because atoi() and atof() are expensive. However, you often want something more than a binary chunk of data in a file. 9
  • 22. A Brief Introduction For another, binary formats are often faster for I/O because atoi() and atof() are expensive. However, you often want something more than a binary chunk of data in a file. Note: This is the mechanism behind numpy.save() and numpy.savez(). 9
  • 23. A Brief Introduction Instead, you want a real database with the ability to store many datasets, user-defined metadata, optimized I/O, and the ability to query its contents. 10
  • 24. A Brief Introduction Instead, you want a real database with the ability to store many datasets, user-defined metadata, optimized I/O, and the ability to query its contents. Unlike SQL, where every dataset lives in a flat namespace, HDF allows datasets to live in a nested tree structure. 10
  • 25. A Brief Introduction Instead, you want a real database with the ability to store many datasets, user-defined metadata, optimized I/O, and the ability to query its contents. Unlike SQL, where every dataset lives in a flat namespace, HDF allows datasets to live in a nested tree structure. In effect, HDF5 is a file system within a file. 10
  • 26. A Brief Introduction Instead, you want a real database with the ability to store many datasets, user-defined metadata, optimized I/O, and the ability to query its contents. Unlike SQL, where every dataset lives in a flat namespace, HDF allows datasets to live in a nested tree structure. In effect, HDF5 is a file system within a file. (More on this later.) 10
  • 27. A Brief Introduction Basic dataset classes include: • Array 11
  • 28. A Brief Introduction Basic dataset classes include: • Array • CArray (chunked array) 11
  • 29. A Brief Introduction Basic dataset classes include: • Array • CArray (chunked array) • EArray (extendable array) 11
  • 30. A Brief Introduction Basic dataset classes include: • Array • CArray (chunked array) • EArray (extendable array) • VLArray (variable length array) 11
  • 31. A Brief Introduction Basic dataset classes include: • Array • CArray (chunked array) • EArray (extendable array) • VLArray (variable length array) • Table (structured array w/ named fields) 11
  • 32. A Brief Introduction Basic dataset classes include: • Array • CArray (chunked array) • EArray (extendable array) • VLArray (variable length array) • Table (structured array w/ named fields) All of these must be composed of atomic types. 11
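To make a couple of the classes above concrete, here is a small sketch of creating an EArray and a VLArray. The file name, shapes, and data are invented for illustration; the camelCase calls are the PyTables 2.x-era API used throughout these slides.

        import numpy as np
        import tables as tb

        f = tb.openFile('datasets.h5', 'w')

        # EArray: extendable along its zero-length first dimension
        ea = f.createEArray('/', 'measurements', tb.Float64Atom(), (0, 3))
        ea.append(np.random.random((100, 3)))   # grow by 100 rows
        ea.append(np.random.random((50, 3)))    # ...and then 50 more

        # VLArray: every row may have a different length
        vla = f.createVLArray('/', 'ragged', tb.Int64Atom())
        vla.append([1, 2, 3])
        vla.append([4, 5])

        f.close()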
  • 33. A Brief Introduction There are six kinds of types supported by PyTables: • bool: Boolean (true/false) types. 8 bits. 12
  • 34. A Brief Introduction There are six kinds of types supported by PyTables: • bool: Boolean (true/false) types. 8 bits. • int: Signed integer types. 8, 16, 32 (default) and 64 bits. 12
  • 35. A Brief Introduction There are six kinds of types supported by PyTables: • bool: Boolean (true/false) types. 8 bits. • int: Signed integer types. 8, 16, 32 (default) and 64 bits. • uint: Unsigned integers. 8, 16, 32 (default) and 64 bits. 12
  • 36. A Brief Introduction There are six kinds of types supported by PyTables: • bool: Boolean (true/false) types. 8 bits. • int: Signed integer types. 8, 16, 32 (default) and 64 bits. • uint: Unsigned integers. 8, 16, 32 (default) and 64 bits. • float: Floating point types. 16, 32 and 64 (default) bits. 12
  • 37. A Brief Introduction There are six kinds of types supported by PyTables: • bool: Boolean (true/false) types. 8 bits. • int: Signed integer types. 8, 16, 32 (default) and 64 bits. • uint: Unsigned integers. 8, 16, 32 (default) and 64 bits. • float: Floating point types. 16, 32 and 64 (default) bits. • complex: Complex number. 64 and 128 (default) bits. 12
  • 38. A Brief Introduction There are six kinds of types supported by PyTables: • bool: Boolean (true/false) types. 8 bits. • int: Signed integer types. 8, 16, 32 (default) and 64 bits. • uint: Unsigned integers. 8, 16, 32 (default) and 64 bits. • float: Floating point types. 16, 32 and 64 (default) bits. • complex: Complex numbers. 64 and 128 (default) bits. • string: Raw string types. Sizes are positive multiples of 8 bits. 12
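In PyTables these kinds surface as Atom classes, which pair a kind with a size and an optional default value. A few examples (a sketch; the class names follow the PyTables 2.x convention):

        import tables as tb

        tb.BoolAtom()               # bool, 8 bits
        tb.Int32Atom(dflt=0)        # signed int, 32 bits
        tb.UInt64Atom()             # unsigned int, 64 bits
        tb.Float32Atom(dflt=1.0)    # float, 32 bits
        tb.Complex128Atom()         # complex, 128 bits
        tb.StringAtom(itemsize=10)  # raw string, 10 bytes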
  • 39. A Brief Introduction Other elements of the hierarchy may include: • Groups (dirs) 13
  • 40. A Brief Introduction Other elements of the hierarchy may include: • Groups (dirs) • Links 13
  • 41. A Brief Introduction Other elements of the hierarchy may include: • Groups (dirs) • Links • File Nodes 13
  • 42. A Brief Introduction Other elements of the hierarchy may include: • Groups (dirs) • Links • File Nodes • Hidden Nodes 13
  • 43. A Brief Introduction Other elements of the hierarchy may include: • Groups (dirs) • Links • File Nodes • Hidden Nodes PyTables docs may be found at http://pytables.github.com/ 13
  • 44. Opening Files

        import tables as tb
        f = tb.openFile('/path/to/file', 'a')

  • 45. Opening Files

        import tables as tb
        f = tb.openFile('/path/to/file', 'a')

    • 'r': Read-only; no data can be modified.
    • 'w': Write; a new file is created (an existing file with the same name would be deleted).
    • 'a': Append; an existing file is opened for reading and writing, and if the file does not exist it is created.
    • 'r+': It is similar to 'a', but the file must already exist.
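For instance, a file written once and then reopened read-only (a small sketch; 'modes.h5' is a throwaway name):

        import tables as tb

        f = tb.openFile('modes.h5', 'w')   # create (or clobber) the file
        f.close()

        f = tb.openFile('modes.h5', 'r')   # read-only: writes now raise an error
        print f.mode                       # 'r'
        f.close()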
  • 46. Using the Hierarchy In HDF5, all nodes stem from a root ("/" or f.root). 15
  • 47. Using the Hierarchy In HDF5, all nodes stem from a root ("/" or f.root). In PyTables, you may access nodes as attributes on a Python object (f.root.a_group.some_data). 15
  • 48. Using the Hierarchy In HDF5, all nodes stem from a root ("/" or f.root). In PyTables, you may access nodes as attributes on a Python object (f.root.a_group.some_data). This is known as natural naming. 15
  • 49. Using the Hierarchy In HDF5, all nodes stem from a root ("/" or f.root). In PyTables, you may access nodes as attributes on a Python object (f.root.a_group.some_data). This is known as natural naming. Creating new nodes must be done on the file handle:

        f.createGroup('/', 'a_group', "My Group")
        f.root.a_group
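Natural naming and path strings are interchangeable; when node names are computed at runtime, the path-based getNode() call is handy (a sketch, assuming the group created above):

        grp = f.root.a_group           # natural naming
        same = f.getNode('/a_group')   # path-based access to the same node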
  • 50. Creating Datasets The two most common datasets are Tables & Arrays. 16
  • 51. Creating Datasets The two most common datasets are Tables & Arrays. Appropriate create methods live on the file handle:

        # integer array
        f.createArray('/a_group', 'arthur_count', [1, 2, 5, 3])

  • 52. Creating Datasets The two most common datasets are Tables & Arrays. Appropriate create methods live on the file handle:

        # integer array
        f.createArray('/a_group', 'arthur_count', [1, 2, 5, 3])

        # tables, need descriptions
        dt = np.dtype([('id', int), ('name', 'S10')])
        knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
        f.createTable('/', 'knights', dt)
        f.root.knights.append(knights)
  • 53. Reading Datasets Arrays and Tables try to preserve the original flavor that they were created with. 17
  • 54. Reading Datasets Arrays and Tables try to preserve the original flavor that they were created with.

        >>> print f.root.a_group.arthur_count[:]
        [1, 2, 5, 3]
        >>> type(f.root.a_group.arthur_count[:])
        list
        >>> type(f.root.a_group.arthur_count)
        tables.array.Array
  • 55. Reading Datasets So if they come from NumPy arrays, they may be accessed in a numpy-like fashion (slicing, fancy indexing, masking). 18
  • 56. Reading Datasets So if they come from NumPy arrays, they may be accessed in a numpy-like fashion (slicing, fancy indexing, masking).

        >>> f.root.knights[1]
        (12, 'Bedivere')
        >>> f.root.knights[:1]
        array([(42, 'Lancelot')], dtype=[('id', '<i8'), ('name', 'S10')])
        >>> mask = (f.root.knights.cols.id[:] < 28)
        >>> f.root.knights[mask]
        array([(12, 'Bedivere')], dtype=[('id', '<i8'), ('name', 'S10')])
        >>> f.root.knights[([1, 0],)]
        array([(12, 'Bedivere'), (42, 'Lancelot')], dtype=[('id', '<i8'), ('name', 'S10')])

  • 57. Reading Datasets So if they come from NumPy arrays, they may be accessed in a numpy-like fashion (slicing, fancy indexing, masking).

        >>> f.root.knights[1]
        (12, 'Bedivere')
        >>> f.root.knights[:1]
        array([(42, 'Lancelot')], dtype=[('id', '<i8'), ('name', 'S10')])
        >>> mask = (f.root.knights.cols.id[:] < 28)
        >>> f.root.knights[mask]
        array([(12, 'Bedivere')], dtype=[('id', '<i8'), ('name', 'S10')])
        >>> f.root.knights[([1, 0],)]
        array([(12, 'Bedivere'), (42, 'Lancelot')], dtype=[('id', '<i8'), ('name', 'S10')])

    Data accessed in this way is memory mapped.
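Reads can also be requested explicitly. A sketch against the knights table from before (read() and its field argument are the PyTables 2.x spellings):

        everyone = f.root.knights.read()           # whole table as a structured array
        names = f.root.knights.read(field='name')  # a single column
        ids = f.root.knights.cols.id[:]            # equivalent column access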
  • 58. Exercise exer/peaks_of_kilimanjaro.py 19
  • 59. Exercise sol/peaks_of_kilimanjaro.py 20
  • 60. Hierarchy Layout Suppose there is a big table of like-things:

        # people: name, profession, home
        people = [('Arthur', 'King', 'Camelot'),
                  ('Lancelot', 'Knight', 'Lake'),
                  ('Bedevere', 'Knight', 'Wales'),
                  ('Witch', 'Witch', 'Village'),
                  ('Guard', 'Man-at-Arms', 'Swamp Castle'),
                  ('Ni', 'Knight', 'Shrubbery'),
                  ('Strange Woman', 'Lady', 'Lake'),
                  ...
                  ]

  • 61. Hierarchy Layout Suppose there is a big table of like-things:

        # people: name, profession, home
        people = [('Arthur', 'King', 'Camelot'),
                  ('Lancelot', 'Knight', 'Lake'),
                  ('Bedevere', 'Knight', 'Wales'),
                  ('Witch', 'Witch', 'Village'),
                  ('Guard', 'Man-at-Arms', 'Swamp Castle'),
                  ('Ni', 'Knight', 'Shrubbery'),
                  ('Strange Woman', 'Lady', 'Lake'),
                  ...
                  ]

    It is tempting to throw everyone into a big people table.
  • 62. Hierarchy Layout However, a search over a class of people can be eliminated by splitting these tables up:

        knight = [('Lancelot', 'Knight', 'Lake'),
                  ('Bedevere', 'Knight', 'Wales'),
                  ('Ni', 'Knight', 'Shrubbery'),
                  ]

        others = [('Arthur', 'King', 'Camelot'),
                  ('Witch', 'Witch', 'Village'),
                  ('Guard', 'Man-at-Arms', 'Swamp Castle'),
                  ('Strange Woman', 'Lady', 'Lake'),
                  ...
                  ]

  • 63. Hierarchy Layout The profession column is now redundant:

        knight = [('Lancelot', 'Lake'),
                  ('Bedevere', 'Wales'),
                  ('Ni', 'Shrubbery'),
                  ]

        others = [('Arthur', 'King', 'Camelot'),
                  ('Witch', 'Witch', 'Village'),
                  ('Guard', 'Man-at-Arms', 'Swamp Castle'),
                  ('Strange Woman', 'Lady', 'Lake'),
                  ...
                  ]
  • 64. Hierarchy Layout Information can be embedded implicitly in the hierarchy as well:

        root
         |- England
         |    |- knight
         |    |- others
         |- France
              |- knight
              |- others
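One way such a layout might be built (a sketch; the descriptions and field sizes are invented):

        import numpy as np

        dt_knight = np.dtype([('name', 'S16'), ('home', 'S16')])
        dt_others = np.dtype([('name', 'S16'), ('profession', 'S16'), ('home', 'S16')])

        for country in ['England', 'France']:
            grp = f.createGroup('/', country)
            f.createTable(grp, 'knight', dt_knight)
            f.createTable(grp, 'others', dt_others)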
  • 65. Hierarchy Layout Why bother pivoting the data like this at all? 25
  • 66. Hierarchy Layout Why bother pivoting the data like this at all? • Fewer rows to search over. 25
  • 67. Hierarchy Layout Why bother pivoting the data like this at all? • Fewer rows to search over. • Fewer rows to pull from disk. 25
  • 68. Hierarchy Layout Why bother pivoting the data like this at all? • Fewer rows to search over. • Fewer rows to pull from disk. • Fewer columns in description. 25
  • 69. Hierarchy Layout Why bother pivoting the data like this at all? • Fewer rows to search over. • Fewer rows to pull from disk. • Fewer columns in description. Ultimately, it is all about speed, especially for big tables. 25
  • 70. Access Time Analogy If a processor's access of L1 cache is analogous to you finding a word on a computer screen (3 seconds), then 26
  • 71. Access Time Analogy If a processor's access of L1 cache is analogous to you finding a word on a computer screen (3 seconds), then Accessing L2 cache is getting a book from a bookshelf (15 s). 26
  • 72. Access Time Analogy If a processor's access of L1 cache is analogous to you finding a word on a computer screen (3 seconds), then Accessing L2 cache is getting a book from a bookshelf (15 s). Accessing main memory is going to the break room, getting a candy bar, and chatting with your co-worker (4 min). 26
  • 73. Access Time Analogy If a processor's access of L1 cache is analogous to you finding a word on a computer screen (3 seconds), then Accessing L2 cache is getting a book from a bookshelf (15 s). Accessing main memory is going to the break room, getting a candy bar, and chatting with your co-worker (4 min). Accessing a (mechanical) HDD is leaving your office, leaving your building, and wandering the planet for a year and four months before returning to your desk with the information finally available. Thanks K. Smith & http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait 26
  • 74. Starving CPU Problem Waiting around for access times prior to computation is known as the Starving CPU Problem. Francesc Alted. 2010. Why Modern CPUs Are Starving and What Can Be Done about It. IEEE Des. Test 12, 2 (March 2010), 68-71. DOI=10.1109/MCSE.2010.51 http://dx.doi.org/10.1109/MCSE.2010.51 27
  • 75. Tables Tables are a high-level interface to extendable arrays of structs. 28
  • 76. Tables Tables are a high-level interface to extendable arrays of structs. Sort-of. 28
  • 77. Tables Tables are a high-level interface to extendable arrays of structs. Sort-of. In fact, the struct / dtype / description concept is only a convenient way to assign meaning to bytes:

        | ids   | first             | last              |
        |-------|-------------------|-------------------|
        |       |                   |                   |
        |       |                   |                   |
        |       |                   |                   |
        |       |                   |                   |
  • 78. Tables Data types may be nested (though they are stored in a flattened way).

        dt = np.dtype([('id', int),
                       ('first', 'S5'),
                       ('last', 'S5'),
                       ('parents', [('mom_id', int),
                                    ('dad_id', int)]),
                       ])

        people = np.fromstring(np.random.bytes(dt.itemsize * 10000), dt)
        f.createTable('/', 'random_peeps', people)
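Nested fields come back as structured NumPy data, so subfields can be pulled out after the read (a sketch against the table above):

        peeps = f.root.random_peeps[:10]    # first ten rows
        print peeps['first']                # a top-level field
        print peeps['parents']['mom_id']    # a nested field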
  • 80. Tables Python already has the ability to dynamically declare the size of descriptions. 31
  • 81. Tables Python already has the ability to dynamically declare the size of descriptions. This is accomplished in compiled languages through normal memory allocation and careful byte counting:

        typedef struct mat {
          double mass;
          int atoms_per_mol;
          double comp[];
        } mat;

  • 82. Tables

        typedef struct mat {
          double mass;
          int atoms_per_mol;
          double comp[];
        } mat;

        size_t mat_size = sizeof(mat) + sizeof(double)*comp_size;
        hid_t desc = H5Tcreate(H5T_COMPOUND, mat_size);
        hid_t comp_type = H5Tarray_create2(H5T_NATIVE_DOUBLE, 1, nuc_dims);

        // make the data table type
        H5Tinsert(desc, "mass", HOFFSET(mat, mass), H5T_NATIVE_DOUBLE);
        H5Tinsert(desc, "atoms_per_mol", HOFFSET(mat, atoms_per_mol), H5T_NATIVE_INT);
        H5Tinsert(desc, "comp", HOFFSET(mat, comp), comp_type);

        // make the data array for a single row; allocate mat_size bytes,
        // not sizeof(mat), to leave room for the flexible comp member
        mat *mat_data = (mat *) malloc(mat_size);

        // ...fill in data array...

        // Write the row
        H5Dwrite(data_set, desc, mem_space, data_hyperslab, H5P_DEFAULT, mat_data);
  • 83. Exercise exer/boatload.py 33
  • 84. Exercise sol/boatload.py 34
  • 85. Chunking Chunking is a feature with no direct analogy in NumPy. 35
  • 86. Chunking Chunking is a feature with no direct analogy in NumPy. Chunking is the ability to split up a dataset into smaller blocks of equal or lesser rank. 35
  • 87. Chunking Chunking is a feature with no direct analogy in NumPy. Chunking is the ability to split up a dataset into smaller blocks of equal or lesser rank. Extra metadata pointing to the location of the chunk in the file and in dataspace must be stored. 35
  • 88. Chunking Chunking is a feature with no direct analogy in NumPy. Chunking is the ability to split up a dataset into smaller blocks of equal or lesser rank. Extra metadata pointing to the location of the chunk in the file and in dataspace must be stored. By chunking, sparse data may be stored efficiently and datasets may extend infinitely in all dimensions. 35
  • 89. Chunking Chunking is a feature with no direct analogy in NumPy. Chunking is the ability to split up a dataset into smaller blocks of equal or lesser rank. Extra metadata pointing to the location of the chunk in the file and in dataspace must be stored. By chunking, sparse data may be stored efficiently and datasets may extend infinitely in all dimensions. Note: Currently, PyTables only allows one extendable dim. 35
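When no chunkshape is given, PyTables computes one automatically; it can be inspected after creation (a sketch with an invented array name):

        ca = f.createCArray('/', 'big', tb.Float64Atom(), (1000, 1000))
        print ca.chunkshape    # whatever chunk shape PyTables picked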
  • 90. Chunking Contiguous Dataset 36
  • 92. Chunking All I/O happens by chunk. This is important for: • edge chunks may extend beyond the dataset 37
  • 93. Chunking All I/O happens by chunk. This is important for: • edge chunks may extend beyond the dataset • default fill values are set in unallocated space 37
  • 94. Chunking All I/O happens by chunk. This is important for: • edge chunks may extend beyond the dataset • default fill values are set in unallocated space • reading and writing in parallel 37
  • 95. Chunking All I/O happens by chunk. This is important for: • edge chunks may extend beyond the dataset • default fill values are set in unallocated space • reading and writing in parallel • small chunks are good for accessing some of the data 37
  • 96. Chunking All I/O happens by chunk. This is important for: • edge chunks may extend beyond the dataset • default fill values are set in unallocated space • reading and writing in parallel • small chunks are good for accessing some of the data • large chunks are good for accessing lots of data 37
  • 97. Chunking Any chunked dataset allows you to set the chunksize.

        f.createTable('/', 'omnomnom', data, chunkshape=(42,42))

  • 98. Chunking Any chunked dataset allows you to set the chunksize.

        f.createTable('/', 'omnomnom', data, chunkshape=(42,42))

    For example, a 4x4 chunked array could have a 3x3 chunksize.
  • 99. Chunking Any chunked dataset allows you to set the chunksize.

        f.createTable('/', 'omnomnom', data, chunkshape=(42,42))

    For example, a 4x4 chunked array could have a 3x3 chunksize. However, it could not have a 12x12 chunksize, since no chunk dimension may be larger than the corresponding dimension of the array.
  • 100. Chunking Any chunked dataset allows you to set the chunksize.

        f.createTable('/', 'omnomnom', data, chunkshape=(42,42))

    For example, a 4x4 chunked array could have a 3x3 chunksize. However, it could not have a 12x12 chunksize, since no chunk dimension may be larger than the corresponding dimension of the array. Manipulating the chunksize is a great way to fine-tune an application.
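For instance, the same data could be stored twice with different chunk shapes to compare access patterns (a sketch; the names and shapes are invented):

        small = f.createCArray('/', 'small_chunks', tb.Float64Atom(),
                               (1024, 1024), chunkshape=(32, 32))
        wide = f.createCArray('/', 'wide_chunks', tb.Float64Atom(),
                              (1024, 1024), chunkshape=(1024, 64))
        # point reads favor the small chunks; whole-row reads favor the wide ones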
  • 101. Chunking Contiguous 4x4 Dataset 39
  • 103. Chunking Note that the addresses of chunks in dataspace (memory) have no bearing on their arrangement in the actual file. Dataspace (top) vs File (bottom) Chunk Locations 40
  • 104. In-Core vs Out-of-Core Calculations depend on the current memory layout. 41
  • 105. In-Core vs Out-of-Core Calculations depend on the current memory layout. Recall access time analogy (wander Earth for 16 months). 41
  • 106. In-Core vs Out-of-Core Calculations depend on the current memory layout. Recall access time analogy (wander Earth for 16 months). Definitions: 41
  • 107. In-Core vs Out-of-Core Calculations depend on the current memory layout. Recall access time analogy (wander Earth for 16 months). Definitions: • Operations which require all data to be in memory are in-core and may be memory bound (NumPy). 41
  • 108. In-Core vs Out-of-Core Calculations depend on the current memory layout. Recall access time analogy (wander Earth for 16 months). Definitions: • Operations which require all data to be in memory are in-core and may be memory bound (NumPy). • Operations where the dataset is external to memory are out-of-core (or in-kernel) and may be CPU bound. 41
  • 109. In-Core Operations Say a and b are arrays sitting in memory:

        a = np.array(...)
        b = np.array(...)
        c = 42 * a + 28 * b + 6

  • 110. In-Core Operations Say a and b are arrays sitting in memory:

        a = np.array(...)
        b = np.array(...)
        c = 42 * a + 28 * b + 6

    The expression for c creates three temporary arrays!
  • 111. In-Core Operations Say a and b are arrays sitting in memory:

        a = np.array(...)
        b = np.array(...)
        c = 42 * a + 28 * b + 6

    The expression for c creates three temporary arrays! For N operations, N-1 temporaries are made.
  • 112. In-Core Operations Say a and b are arrays sitting in memory:

        a = np.array(...)
        b = np.array(...)
        c = 42 * a + 28 * b + 6

    The expression for c creates three temporary arrays! For N operations, N-1 temporaries are made. Wastes memory and is slow. Pulling from disk is slower.
  • 113. In-Core Operations A less memory intensive implementation would be an element-wise evaluation:

        c = np.empty(...)
        for i in range(len(c)):
            c[i] = 42 * a[i] + 28 * b[i] + 6

  • 114. In-Core Operations A less memory intensive implementation would be an element-wise evaluation:

        c = np.empty(...)
        for i in range(len(c)):
            c[i] = 42 * a[i] + 28 * b[i] + 6

    But if a and b were HDF5 arrays on disk, individual element access time would kill you.
  • 115. In-Core Operations A less memory intensive implementation would be an element-wise evaluation:

        c = np.empty(...)
        for i in range(len(c)):
            c[i] = 42 * a[i] + 28 * b[i] + 6

    But if a and b were HDF5 arrays on disk, individual element access time would kill you. Even with in-memory NumPy arrays, there are problems with gratuitous Python type checking.
  • 116. Out-of-Core Operations Say there was a virtual machine (or kernel) which could be fed arrays and perform specified operations. 44
  • 117. Out-of-Core Operations Say there was a virtual machine (or kernel) which could be fed arrays and perform specified operations. Giving this machine only chunks of data at a time, it could function on infinite-length data using only finite memory. 44
  • 118. Out-of-Core Operations Say there was a virtual machine (or kernel) which could be fed arrays and perform specified operations. Giving this machine only chunks of data at a time, it could function on infinite-length data using only finite memory.

        # r2 and r3 here are preallocated scratch buffers of length 256
        for i in range(0, len(a), 256):
            r0, r1 = a[i:i+256], b[i:i+256]
            multiply(r0, 42, r2)
            multiply(r1, 28, r3)
            add(r2, r3, r2); add(r2, 6, r2)
            c[i:i+256] = r2
  • 119. Out-of-Core Operations This is the basic idea behind numexpr, which provides a general virtual machine for NumPy arrays. 45
  • 120. Out-of-Core Operations This is the basic idea behind numexpr, which provides a general virtual machine for NumPy arrays. This problem lends itself nicely to parallelism. 45
  • 121. Out-of-Core Operations This is the basic idea behind numexpr, which provides a general virtual machine for NumPy arrays. This problem lends itself nicely to parallelism. Numexpr has low-level multithreading, avoiding the GIL. 45
  • 122. Out-of-Core Operations This is the basic idea behind numexpr, which provides a general virtual machine for NumPy arrays. This problem lends itself nicely to parallelism. Numexpr has low-level multithreading, avoiding the GIL. PyTables implements a tb.Expr class which is backed by the numexpr VM but has additional optimizations for disk reading and writing. 45
  • 123. Out-of-Core Operations This is the basic idea behind numexpr, which provides a general virtual machine for NumPy arrays. This problem lends itself nicely to parallelism. Numexpr has low-level multithreading, avoiding the GIL. PyTables implements a tb.Expr class which is backed by the numexpr VM but has additional optimizations for disk reading and writing. The full array need never be in memory. 45
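For arrays that do fit in memory, the same expression can be handed to numexpr directly (a sketch, assuming numexpr is installed):

        import numexpr as ne
        import numpy as np

        a = np.random.random(10000)
        b = np.random.random(10000)
        c = ne.evaluate("42*a + 28*b + 6")   # one pass, no large temporaries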
  • 124. Out-of-Core Operations Fully out-of-core expression example:

        shape = (10, 10000)
        f = tb.openFile("/tmp/expression.h5", "w")
        a = f.createCArray(f.root, 'a', tb.Float32Atom(dflt=1.), shape)
        b = f.createCArray(f.root, 'b', tb.Float32Atom(dflt=2.), shape)
        c = f.createCArray(f.root, 'c', tb.Float32Atom(dflt=3.), shape)
        out = f.createCArray(f.root, 'out', tb.Float32Atom(dflt=3.), shape)

        expr = tb.Expr("a*b+c")
        expr.setOutput(out)
        d = expr.eval()

        print "returned-->", repr(d)
        f.close()
  • 125. Querying The most common operation is asking an existing dataset whether its elements satisfy some criteria. This is known as querying. 47
  • 126. Querying The most common operation is asking an existing dataset whether its elements satisfy some criteria. This is known as querying. Because querying is so common, PyTables defines special methods on Tables. 47
  • 127. Querying The most common operation is asking an existing dataset whether its elements satisfy some criteria. This is known as querying. Because querying is so common, PyTables defines special methods on Tables:

        tb.Table.where(cond)
        tb.Table.getWhereList(cond)
        tb.Table.readWhere(cond)
        tb.Table.whereAppend(dest, cond)
  • 128. Querying The conditions used in where() calls are strings which are evaluated by numexpr. These expressions must return boolean values. 48
  • 129. Querying The conditions used in where() calls are strings which are evaluated by numexpr. These expressions must return boolean values. They are executed in the context of the table itself combined with locals() and globals(). 48
  • 130. Querying The conditions used in where() calls are strings which are evaluated by numexpr. These expressions must return boolean values. They are executed in the context of the table itself combined with locals() and globals(). The where() method itself returns an iterator over all matched (hit) rows:

        for row in table.where('(col1 < 42) & (col2 == col3)'):
            # do something with row
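The same kind of condition works through the other query methods as well. A sketch against the knights table from earlier (lim is an ordinary Python variable, picked up from locals()):

        lim = 28
        hits = [row['name'] for row in f.root.knights.where('id < lim')]
        idx = f.root.knights.getWhereList('id < lim')   # matching row indices
        recs = f.root.knights.readWhere('id < lim')     # matching rows as a structured array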
  • 131. Querying For a speed comparison, here is a complex query using regular Python:

        result = [row['col2'] for row in table if (
                  ((row['col4'] >= lim1 and row['col4'] < lim2) or
                   (row['col2'] > lim3 and row['col2'] < lim4)) and
                  ((row['col1'] + 3.1*row['col2'] + row['col3']*row['col4']) > lim5))]

  • 132. Querying For a speed comparison, here is a complex query using regular Python:

        result = [row['col2'] for row in table if (
                  ((row['col4'] >= lim1 and row['col4'] < lim2) or
                   (row['col2'] > lim3 and row['col2'] < lim4)) and
                  ((row['col1'] + 3.1*row['col2'] + row['col3']*row['col4']) > lim5))]

    And this is the equivalent out-of-core search:

        result = [row['col2'] for row in table.where(
                  '((((col4 >= lim1) & (col4 < lim2)) | '
                  '  ((col2 > lim3) & (col2 < lim4))) & '
                  ' ((col1 + 3.1*col2 + col3*col4) > lim5))')]
  • 133. Querying Complex query with 10 million rows. Data fits in memory. 50
  • 135. Querying Complex query with 1 billion rows. Too big for memory. 51
  • 137. Exercise exer/crono.py 52
  • 138. Exercise sol/crono.py 53
  • 139. Compression A more general way to solve the starving CPU problem is through compression. 54
  • 140. Compression A more general way to solve the starving CPU problem is through compression. Compression is when the dataset is piped through a zipping algorithm on write and the inverse unzipping algorithm on read. 54
  • 141. Compression A more general way to solve the starving CPU problem is through compression. Compression is when the dataset is piped through a zipping algorithm on write and the inverse unzipping algorithm on read. Each chunk is compressed independently, so chunks end up with a varying number of bytes. 54
  • 142. Compression A more general way to solve the starving CPU problem is through compression. Compression is when the dataset is piped through a zipping algorithm on write and the inverse unzipping algorithm on read. Each chunk is compressed independently, so chunks end up with a varying number of bytes. Has some storage overhead, but may drastically reduce file sizes for very regular data. 54
  • 143. Compression At first glance this is counter-intuitive. (Why?) 55
  • 144. Compression At first glance this is counter-intuitive. (Why?) Compression/Decompression is clearly more CPU intensive than simply blitting an array into memory. 55
  • 145. Compression At first glance this is counter-intuitive. (Why?) Compression/Decompression is clearly more CPU intensive than simply blitting an array into memory. However, because there is less total information to transfer, the time spent unpacking the array can be far less than moving the array around wholesale. 55
  • 146. Compression At first glance this is counter-intuitive. (Why?) Compression/Decompression is clearly more CPU intensive than simply blitting an array into memory. However, because there is less total information to transfer, the time spent unpacking the array can be far less than moving the array around wholesale. This is kind of like power steering: you can either turn the wheels manually, or you can tell the car how you want the wheels turned. 55
  • 147. Compression Compression is a guaranteed feature of HDF5 itself. 56
  • 148. Compression Compression is a guaranteed feature of HDF5 itself. At minimum, HDF5 requires zlib. 56
  • 149. Compression Compression is a guaranteed feature of HDF5 itself. At minimum, HDF5 requires zlib. The compression capabilities feature a plugin architecture which allows for a variety of different algorithms, including user-defined ones! 56
  • 150. Compression Compression is a guaranteed feature of HDF5 itself. At minimum, HDF5 requires zlib. The compression capabilities feature a plugin architecture which allows for a variety of different algorithms, including user-defined ones! PyTables supports: • zlib (default), • lzo, • bzip2, and • blosc. 56
  • 151. Compression Compression is enabled in PyTables through filters. 57
  • 152. Compression Compression is enabled in PyTables through filters.

        # complevel goes from [0,9]
        filters = tb.Filters(complevel=5, complib='blosc', ...)

  • 153. Compression Compression is enabled in PyTables through filters.

        # complevel goes from [0,9]
        filters = tb.Filters(complevel=5, complib='blosc', ...)

        # filters may be set on the whole file,
        f = tb.openFile('/path/to/file', 'a', filters=filters)
        f.filters = filters

  • 154. Compression Compression is enabled in PyTables through filters.

        # complevel goes from [0,9]
        filters = tb.Filters(complevel=5, complib='blosc', ...)

        # filters may be set on the whole file,
        f = tb.openFile('/path/to/file', 'a', filters=filters)
        f.filters = filters

        # filters may also be set on most other nodes
        f.createTable('/', 'table', desc, filters=filters)
        f.root.group._v_filters = filters

  • 155. Compression Compression is enabled in PyTables through filters.

        # complevel goes from [0,9]
        filters = tb.Filters(complevel=5, complib='blosc', ...)

        # filters may be set on the whole file,
        f = tb.openFile('/path/to/file', 'a', filters=filters)
        f.filters = filters

        # filters may also be set on most other nodes
        f.createTable('/', 'table', desc, filters=filters)
        f.root.group._v_filters = filters

    Filters only act on chunked datasets.
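Putting the pieces together, a compressed chunked array might look like this (a sketch; the names are invented, and f is an open file handle):

        import numpy as np

        filters = tb.Filters(complevel=5, complib='blosc')
        ca = f.createCArray('/', 'compressed', tb.Float64Atom(),
                            (1000, 1000), filters=filters)
        ca[:] = np.ones((1000, 1000))   # very regular data compresses extremely well
        print ca.filters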
  • 156. Compression Tips for choosing compression parameters: 58
  • 157. Compression Tips for choosing compression parameters: • A mid-level (5) compression is sufficient. No need to go all the way up (9). 58
  • 158. Compression Tips for choosing compression parameters: • A mid-level (5) compression is sufficient. No need to go all the way up (9). • Use zlib if you must guarantee complete portability. 58
  • 159. Compression Tips for choosing compression parameters: • A mid-level (5) compression is sufficient. No need to go all the way up (9). • Use zlib if you must guarantee complete portability. • Use blosc all other times. It is optimized for HDF5. 58
  • 160. Compression Tips for choosing compression parameters: • A mid-level (5) compression is sufficient. No need to go all the way up (9). • Use zlib if you must guarantee complete portability. • Use blosc all other times. It is optimized for HDF5. But why? (I don't have time to go into the details of blosc. However here are some justifications...) 58
  • 161. Compression Comparison of different compression levels of zlib. 59
  • 163. Compression Creation time per element for a 15 GB EArray and different chunksizes. 60
  • 165. Compression File sizes for a 15 GB EArray and different chunksizes. 61
  • 167. Compression Sequential access time per element for a 15 GB EArray and different chunksizes. 62
  • 169. Compression Random access time per element for a 15 GB EArray and different chunksizes. 63
  • 171. Exercise exer/spam_filter.py 64
  • 172. Exercise sol/spam_filter.py 65
  • 173. Other Python Data Structures Overwhelmingly, numpy arrays have been the in-memory data structure of choice. 66
  • 174. Other Python Data Structures Overwhelmingly, numpy arrays have been the in-memory data structure of choice. Using lists or tuples instead of arrays follows analogously. 66
  • 175. Other Python Data Structures Overwhelmingly, numpy arrays have been the in-memory data structure of choice. Using lists or tuples instead of arrays follows analogously. It is data structures like sets and dictionaries which do not quite map. 66
  • 176. Other Python Data Structures Overwhelmingly, numpy arrays have been the in-memory data structure of choice. Using lists or tuples instead of arrays follows analogously. It is data structures like sets and dictionaries which do not quite map. However, as long as all elements may be cast into the same atomic type, these structures can be stored in HDF5 with relative ease. 66
  • 177. Sets Example of serializing and deserializing sets:

        >>> s = {1.0, 42, 77.7, 6E+01, True}
        >>> f.createArray('/', 's', [float(x) for x in s])
        /s (Array(4,)) ''
          atom := Float64Atom(shape=(), dflt=0.0)
          maindim := 0
          flavor := 'python'
          byteorder := 'little'
          chunkshape := None
        >>> set(f.root.s)
        set([1.0, 42.0, 77.7, 60.0])
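Dictionaries follow the same idea. One approach, sketched here under the assumption that all keys and all values can each be cast to a single atomic type, is a two-column table:

        import numpy as np

        d = {'one': 1, 'two': 2, 'forty-two': 42}
        dt = np.dtype([('key', 'S16'), ('value', int)])
        f.createTable('/', 'd', np.array(d.items(), dtype=dt))

        # and back again:
        d2 = dict((row['key'], row['value']) for row in f.root.d)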
  • 178. Exercise exer/dict_table.py 68
  • 179. Exercise sol/dict_table.py 69
  • 180. What Was Missed • Walking Nodes • File Nodes • Indexing • Migrating to / from SQL • HDF5 in other database formats • Other Databases in HDF5 • HDF5 as a File System 70
  • 181. Acknowledgements Many thanks to everyone who made this possible! 71
  • 182. Acknowledgements Many thanks to everyone who made this possible! • The HDF Group 71
  • 183. Acknowledgements Many thanks to everyone who made this possible! • The HDF Group • The PyTables Governance Team: • Josh Moore, • Antonio Valentino, • Josh Ayers 71
  • 184. Acknowledgements (Cont.) • The NumPy Developers 72
  • 185. Acknowledgements (Cont.) • The NumPy Developers • h5py, the symbiotic project 72
  • 186. Acknowledgements (Cont.) • The NumPy Developers • h5py, the symbiotic project • Francesc Alted 72
  • 187. Acknowledgements (Cont.) • The NumPy Developers • h5py, the symbiotic project • Francesc Alted Shameless Plug: We are always looking for more hands. Join Now! 72