Caching and Buffering in HDF5
The HDF Group

Nov. 6, 2007

HDF-EOS Workshop XI Tutorial

In this talk we will discuss caching and buffering strategies in HDF5. The information presented will help developers write more efficient applications and avoid performance bottlenecks.


  1. Caching and Buffering in HDF5. The HDF Group. Nov. 6, 2007, HDF-EOS Workshop XI Tutorial.
  2. Software stack and the “magic box”
     • Life cycle: what happens to data when it is transferred from an application buffer to an HDF5 file?
     • Diagram (top to bottom): Application (data buffer); Object API (H5Dwrite); Library internals (the magic box); Virtual file I/O; Unbuffered I/O; File or other “storage” (data in a file)
  3. Inside the magic box
     • Understanding what happens to data inside the magic box helps in writing efficient applications
     • The HDF5 library has mechanisms to control behavior inside the magic box
     • Goals of this talk:
       • Describe some basic operations and data structures, and explain how they affect performance and storage sizes
       • Give some “recipes” for improving performance
  4. Topics
     • Dataset metadata and array data storage layouts
     • Types of dataset storage layouts
     • Factors affecting I/O performance
       • I/O with compact datasets
       • I/O with contiguous datasets
       • I/O with chunked datasets
       • Variable length data and I/O
  5. HDF5 dataset metadata and array data storage layouts
  6. HDF5 Dataset
     • Data array
       • Ordered collection of identically typed data items distinguished by their indices
     • Metadata
       • Dataspace: rank and dimensions of the dataset array
       • Datatype: information on how to interpret the data
       • Storage properties: how the array is organized on disk
       • Attributes: user-defined metadata (optional)
  7. Separate Components of a Dataset
     • Diagram: the header and the data array are separate components
     • Dataspace: rank 3; Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
     • Datatype: IEEE 32-bit float
     • Storage info: chunked, compressed
     • Attributes: Time = 32.4, Pressure = 987, Temp = 56
  8. Metadata cache and array data
     • Dataset array data is typically kept in application memory
     • Dataset header is kept in a separate space, the metadata cache
     • Diagram: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes, …) alongside the dataset array data; the file holds the HDF5 metadata and the dataset array data
  9. Metadata and metadata cache
     • HDF5 metadata
       • Information about HDF5 objects used by the library
       • Examples: object headers, B-tree nodes for groups, B-tree nodes for chunks, heaps, superblock, etc.
       • Usually small compared to raw data sizes (KB vs. MB-GB)
     • Metadata cache
       • Space allocated to handle pieces of the HDF5 metadata
       • Allocated by the HDF5 library in the application’s memory space
       • Cache behavior affects overall performance
       • The metadata cache implementation prior to HDF5 1.6.5 could cause performance degradation for some applications
  10. Types of data storage layouts
  11. HDF5 dataset storage layouts
     • Contiguous
     • Chunked
     • Compact
  12. Contiguous storage layout
     • Metadata header is separate from raw data
     • Raw data is stored in one contiguous block on disk
     • Diagram: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes, …) alongside the dataset array data; the file holds the header and the array data as one block
  13. Chunked storage
     • Chunking: a storage layout where a dataset is partitioned into fixed-size multi-dimensional tiles, or chunks
     • Used for extendible datasets and datasets with filters applied (checksum, compression)
     • HDF5 library treats each chunk as an atomic object
     • Greatly affects performance and file sizes
  14. Chunked storage layout
     • Raw data is divided into equal-sized blocks (chunks)
     • Each chunk is stored separately as a contiguous block on disk
     • Diagram: in application memory, the dataset array data is divided into chunks A, B, C, D and the metadata cache holds the dataset header and the chunk index; the file holds the header, the chunk index, and the chunks A, C, D, B as separate blocks
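The partitioning into fixed-size tiles described above can be sketched as plain index arithmetic. This is a minimal illustration, not library code: the function names `chunks_along` and `chunk_index` are hypothetical, the dataset is assumed to be 2-D, and the row-by-row chunk numbering is an assumption made for the example rather than a description of how the HDF5 chunk index is organized.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed to cover `dim` elements along one dimension
 * when each chunk holds `chunk` elements (the last chunk may be only
 * partially filled). */
size_t chunks_along(size_t dim, size_t chunk)
{
    assert(chunk > 0);
    return (dim + chunk - 1) / chunk;
}

/* Number of the chunk that holds element (row, col) of a 2-D dataset
 * with `ncols` columns, tiled with chunk_rows x chunk_cols chunks.
 * Chunks are numbered row by row, starting at 0 (an assumption of this
 * sketch). */
size_t chunk_index(size_t row, size_t col, size_t ncols,
                   size_t chunk_rows, size_t chunk_cols)
{
    size_t chunks_per_row = chunks_along(ncols, chunk_cols);
    return (row / chunk_rows) * chunks_per_row + (col / chunk_cols);
}
```

For example, a dataset with 10 columns and 4 x 4 chunks needs `chunks_along(10, 4)` = 3 chunks per row of chunks, and the last of those is only half used, which is one way chunk overhead shows up in file size.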
  15. Compact storage layout
     • Data array and metadata are stored together in the header
     • Diagram: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes, …) together with the array data; the file* holds the same header with the array data inside it
     • * “File” may in fact be a collection of files, memory, or other storage destination.
  16. Factors affecting I/O performance
  17. What goes on inside the magic box?
     • Operations on data inside the magic box
       • Copying to/from internal buffers
       • Datatype conversion
       • Scattering and gathering
       • Data transformation (filters, compression)
     • Data structures used
       • B-trees (groups, dataset chunks)
       • Hash tables
       • Local and global heaps (variable length data: link names, strings, etc.)
     • Other concepts
       • HDF5 metadata, metadata cache
       • Chunking, chunk cache
  18. Operations on data inside the magic box
     • Copying to/from internal buffers
     • Datatype conversion, such as
       • float ↔ integer
       • LE ↔ BE
       • 64-bit integer to 16-bit integer
     • Scattering and gathering
       • Data is scattered/gathered from/to application buffers into internal buffers for datatype conversion and partial I/O
     • Data transformation (filters, compression)
       • Checksum on raw data and metadata (in 1.8.0)
       • Algebraic transform
       • GZIP and SZIP compression
       • User-defined filters
  19. I/O performance depends on
     • Storage layouts
     • Dataset storage properties
     • Chunking strategy
     • Metadata cache performance
     • Datatype conversion performance
     • Other filters, such as compression
     • Access patterns
  20. I/O with different storage layouts
  21. Writing compact dataset
     • Diagram: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes, …) together with the array data
     • One write stores the header and the data array in the file
  22. Writing contiguous dataset: no conversion
     • Diagram: the dataset header (datatype, dataspace, attributes, …) sits in the metadata cache; the dataset array data goes from application memory to the file
  23. Writing a contiguous dataset with datatype conversion
     • Diagram: the dataset header (datatype, dataspace, attribute 1, attribute 2, …) sits in the metadata cache; the dataset array data passes through a 1MB conversion buffer in application memory on its way to the file
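A fixed-size conversion buffer means that a large array is converted in several passes rather than all at once. The sketch below imitates that idea in plain C for the 64-bit to 16-bit integer case mentioned earlier in the talk. It is not HDF5's conversion code: the function names are hypothetical, the buffer size is expressed in elements for simplicity, and clamping out-of-range values to the destination range is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one value, clamping to the int16_t range on overflow
 * (overflow handling is an assumption of this sketch). */
static int16_t clamp_to_i16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Convert n 64-bit integers to 16-bit integers in passes of at most
 * buf_elems elements, imitating a bounded conversion buffer.
 * Returns the number of passes that were needed. */
size_t convert_in_passes(const int64_t *src, int16_t *dst, size_t n,
                         size_t buf_elems)
{
    assert(buf_elems > 0);
    size_t passes = 0;
    for (size_t start = 0; start < n; start += buf_elems) {
        size_t left = n - start;
        size_t count = (left < buf_elems) ? left : buf_elems;
        for (size_t i = 0; i < count; i++)
            dst[start + i] = clamp_to_i16(src[start + i]);
        passes++;
    }
    return passes;
}
```

The pass count grows as the buffer shrinks relative to the data, which is one reason the size of the conversion buffer can matter for large transfers.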
  24. Partial I/O with contiguous datasets
  25. Writing whole dataset: contiguous rows
     • Application data in memory: M rows of N elements
     • One I/O operation
     • Data is contiguous in the file
  26. Sub-setting of contiguous dataset: series of adjacent rows
     • Application data in memory: M rows of N elements
     • One I/O operation
     • The subset is contiguous in the file, as is the entire dataset
  27. Sub-setting of contiguous dataset: adjacent, partial rows
     • Application data in memory: M rows of N elements
     • Several small I/O operations
     • Data is scattered in the file in M contiguous blocks of N elements each
  28. Sub-setting of contiguous dataset: extreme case, writing a column
     • Application data in memory: M rows of 1 element
     • Several small I/O operations
     • Subset data is scattered in the file in M different locations
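The three cases above (whole rows, partial rows, a single column) differ only in how many contiguous runs the selection forms in the file. A minimal sketch of that count for a 2-D, row-major, contiguous dataset follows; the function name `contiguous_runs` is hypothetical and the model ignores any buffering the library may add.

```c
#include <assert.h>
#include <stddef.h>

/* Number of contiguous blocks in the file covered by a selection of
 * sel_rows adjacent rows, each sel_cols elements wide, in a row-major
 * dataset whose rows are file_cols elements long. */
size_t contiguous_runs(size_t file_cols, size_t sel_rows, size_t sel_cols)
{
    assert(sel_cols <= file_cols);
    if (sel_rows == 0 || sel_cols == 0)
        return 0;
    if (sel_cols == file_cols)
        return 1;          /* whole adjacent rows form a single block */
    return sel_rows;       /* partial rows: one block per selected row */
}
```

With rows of 100 elements, selecting 10 whole rows gives one run, while selecting 20 elements or a single element from each of the 10 rows gives 10 runs.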
  29. Sub-setting of contiguous dataset: data sieve buffer
     • Application data in memory: M rows of 1 element
     • Data is gathered in a 64K sieve buffer in memory (memcopy)
     • Data is scattered in the file in M contiguous blocks
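To see why a sieve buffer helps with a column selection, one can count file accesses under a deliberately simplified model: each access fills the buffer with one contiguous range of the file, and every selected element that lies entirely inside that range is served from memory. The function name `sieve_accesses` and the model itself are assumptions of this sketch and do not reproduce the library's actual sieve logic.

```c
#include <assert.h>
#include <stddef.h>

/* Count file accesses needed to touch one element in each of m_rows
 * rows, where consecutive selected elements are row_bytes apart in the
 * file and the sieve buffer holds sieve_bytes contiguous bytes. */
size_t sieve_accesses(size_t m_rows, size_t row_bytes, size_t elem_bytes,
                      size_t sieve_bytes)
{
    assert(elem_bytes > 0 && row_bytes >= elem_bytes);
    if (sieve_bytes < elem_bytes)
        return m_rows;     /* buffer too small: one access per element */

    size_t accesses = 0;
    size_t i = 0;
    while (i < m_rows) {
        /* The buffer is filled starting at the next unserved element. */
        size_t end = i * row_bytes + sieve_bytes;
        /* Skip every selected element that fits inside this fill. */
        while (i < m_rows && i * row_bytes + elem_bytes <= end)
            i++;
        accesses++;
    }
    return accesses;
}
```

Under this model, a column of 1000 four-byte elements in rows of 4000 bytes takes 1000 accesses with a 4-byte buffer and 59 accesses with a 64K (65536-byte) buffer.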
  30. Performance tuning for contiguous dataset
     • Datatype conversion
       • Avoid for better performance
       • Use the H5Pset_buffer function to customize the conversion buffer size
     • Partial I/O
       • Write/read in big contiguous blocks
       • Use H5Pset_sieve_buf_size to improve performance for complex subsetting
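The two calls named above are set on different property lists: `H5Pset_buffer` on a dataset transfer property list and `H5Pset_sieve_buf_size` on a file access property list. The sketch below shows one way to wrap them; it needs the HDF5 library to build, the helper names are hypothetical, and suitable sizes depend on the application and are not recommended here.

```c
#include <hdf5.h>

/* Create a dataset transfer property list whose conversion buffer is
 * conv_buf_bytes large. Returns a negative value on failure. */
hid_t make_transfer_plist(size_t conv_buf_bytes)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    if (dxpl < 0)
        return dxpl;
    /* Passing NULL for both buffers lets the library allocate the
     * conversion and background buffers itself. */
    if (H5Pset_buffer(dxpl, conv_buf_bytes, NULL, NULL) < 0) {
        H5Pclose(dxpl);
        return -1;
    }
    return dxpl;
}

/* Create a file access property list with a sieve buffer of
 * sieve_bytes. Returns a negative value on failure. */
hid_t make_file_access_plist(size_t sieve_bytes)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    if (fapl < 0)
        return fapl;
    if (H5Pset_sieve_buf_size(fapl, sieve_bytes) < 0) {
        H5Pclose(fapl);
        return -1;
    }
    return fapl;
}
```

The transfer property list would then be passed to `H5Dwrite` or `H5Dread`, and the file access property list to `H5Fcreate` or `H5Fopen`; both should be released with `H5Pclose` when no longer needed.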
  31. I/O with Chunking
  32. Reminder: chunked storage layout
     • Diagram: in application memory, the dataset array data is divided into chunks A, B, C, D and the metadata cache holds the dataset header and the chunk index; the file holds the header, the chunk index, and the chunks A, C, D, B as separate blocks
  33. Information about chunking
     • HDF5 library treats each chunk as an atomic object
     • Compression is applied to each chunk
     • Datatype conversion and other filters are applied per chunk
     • Chunk size greatly affects performance
     • Chunk overhead adds to file size
     • Chunk processing involves many steps
     • Chunk cache
       • Caches chunks for better performance
       • Created for each chunked dataset
       • Size of the chunk cache is set for the file (default size 1MB)
       • Each chunked dataset has its own chunk cache
       • A chunk may be too big to fit into the cache
       • Memory may grow if the application keeps opening datasets
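Since the chunk cache size is set for the file, it is configured through a file access property list with `H5Pset_cache`. The sketch below changes only the cache size in bytes and keeps the library's other cache settings by reading them back first. It needs the HDF5 library to build, the helper name is hypothetical, and the parameter types follow current releases of the C API, which may differ from the 1.6 series in use when this talk was given.

```c
#include <hdf5.h>

/* Create a file access property list whose raw data chunk cache is
 * cache_bytes large. Returns a negative value on failure. */
hid_t make_fapl_with_chunk_cache(size_t cache_bytes)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    if (fapl < 0)
        return fapl;

    int    mdc_nelmts;   /* metadata cache setting, kept unchanged */
    size_t nslots;       /* number of chunk slots, kept unchanged  */
    size_t nbytes;       /* current chunk cache size in bytes      */
    double w0;           /* preemption policy, kept unchanged      */

    if (H5Pget_cache(fapl, &mdc_nelmts, &nslots, &nbytes, &w0) < 0 ||
        H5Pset_cache(fapl, mdc_nelmts, nslots, cache_bytes, w0) < 0) {
        H5Pclose(fapl);
        return -1;
    }
    return fapl;
}
```

A reasonable starting point, under the points above, is a cache large enough to hold the chunks an access pattern revisits, because a chunk that does not fit in the cache cannot benefit from it.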
  34. Chunk cache
     • Diagram: in application memory, the metadata cache holds the Dataset_1 … Dataset_N headers and the chunking B-tree nodes; the chunk cache sits next to it
     • Default chunk cache size is 1MB
  35. Writing chunked dataset
     • Diagram: chunks A, B, C of the chunked dataset pass through the chunk cache and the filter pipeline into the file
     • Compression is performed when a chunk is evicted from the chunk cache
     • Other filters are applied as data goes through the filter pipeline
  36. Partial I/O with Chunking
  37. Partial I/O for chunked dataset
     • Example: write the green subset of the dataset, converting the data
     • Dataset is stored as six chunks in the file.
     • The subset spans four chunks, numbered 1-4 in the figure.
     • Hence four chunks must be written to the file.
     • But first, the four chunks must be read from the file, to preserve those parts of each chunk that are not to be overwritten.
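How many chunks a selection spans, and therefore how many chunks must be read and rewritten, follows from the selection's start and extent in each dimension. A minimal sketch for a rectangular 2-D selection is given below; the function names are hypothetical, and the concrete dimensions in the example afterwards are assumptions chosen to reproduce the six-chunk, four-chunk situation on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks overlapped along one dimension by `count` elements
 * starting at index `start`, with chunks of `chunk` elements. */
static size_t chunks_spanned(size_t start, size_t count, size_t chunk)
{
    assert(chunk > 0);
    if (count == 0)
        return 0;
    size_t first = start / chunk;
    size_t last = (start + count - 1) / chunk;
    return last - first + 1;
}

/* Number of chunks touched by a rectangular selection of nrows x ncols
 * elements whose upper-left corner is (row0, col0). */
size_t chunks_touched_2d(size_t row0, size_t nrows, size_t col0, size_t ncols,
                         size_t chunk_rows, size_t chunk_cols)
{
    return chunks_spanned(row0, nrows, chunk_rows) *
           chunks_spanned(col0, ncols, chunk_cols);
}
```

For instance, a 20 x 30 dataset with 10 x 10 chunks is stored as six chunks, and a 10 x 10 selection starting at (5, 5) touches four of them, whereas the same selection aligned at (0, 0) touches only one.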
  38. Partial I/O for chunked dataset
     • For each of the four chunks (numbered 1-4 in the figure):
       • Read the chunk from the file into the chunk cache, unless it is already there.
       • Determine which part of the chunk will be replaced by the selection.
       • Replace that part of the chunk in the cache with the corresponding elements from the application’s array.
       • Move those elements to the conversion buffer and perform the conversion.
       • Move those elements back from the conversion buffer to the chunk cache.
       • Apply filters (compression) when the chunk is flushed from the chunk cache.
     • Three memcopies are performed for each element
  39. Partial I/O for chunked dataset
     • Diagram (application memory): elements participating in I/O are gathered by memcopy from the application buffer into the corresponding chunk (chunk 3 in the figure) in the chunk cache
  40. Partial I/O for chunked dataset
     • Diagram (application memory): elements of chunk 3 are moved by memcopy from the chunk cache to the conversion buffer and by a second memcopy back to the chunk cache
     • The chunk is then compressed and written to the file
  41. Variable length data and I/O
  42. Examples of variable length data
     • Strings
       A[0] “the first string we want to write”
       …
       A[N-1] “the N-th string we want to write”
     • Each element is a record of variable length
       A[0] (1,1,0,0,0,5,6,7,8,9) [length = 10]
       A[1] (0,0,110,2005) [length = 4]
       …
       A[N] (1,2,3,4,5,6,7,8,9,10,11,12,…,M) [length = M]
  43. Variable length data in HDF5
     • Variable length description in an HDF5 application:
         typedef struct {
             size_t len;
             void  *p;
         } hvl_t;
     • Base type can be any HDF5 type: H5Tvlen_create(base_type)
     • About 20 bytes of overhead for each element
     • Data cannot be compressed
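The per-element overhead quoted above can be turned into a rough size estimate. The sketch below uses a stand-in struct with the same two fields as HDF5's `hvl_t` (`len` and `p`, as named in the library headers) so that it builds without the library. The names `vl_elem` and `vl_estimated_bytes` are hypothetical, and the overhead is passed in as a parameter because the figure of about 20 bytes is an approximation from the talk, not a constant defined by the API.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in with the same shape as HDF5's hvl_t. */
typedef struct {
    size_t len;  /* number of base-type elements in this record */
    void  *p;    /* pointer to the record's elements            */
} vl_elem;

/* Rough storage estimate for n variable length records: the payload of
 * each record plus a fixed per-record overhead. */
size_t vl_estimated_bytes(const vl_elem *a, size_t n, size_t base_size,
                          size_t overhead_per_elem)
{
    assert(a != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i].len * base_size + overhead_per_elem;
    return total;
}
```

Taking the two records from the earlier example (lengths 10 and 4) with a 4-byte base type and 20 bytes of overhead each, the estimate is 14 x 4 + 2 x 20 = 96 bytes, so for short records the overhead is a large share of the total.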
  44. How variable length data is stored in HDF5
     • Diagram (file): the dataset header describes a dataset with variable length elements; each element is a pointer into a global heap, which holds the actual variable length data
  45. Variable length datasets and I/O
     • When writing variable length data, elements in the application buffer point to global heaps in the metadata cache, where the actual data is stored.
     • Diagram: raw data, application buffer, global heap
  46. There may be more than one global heap
     • Diagram: raw data, application buffer, two global heaps
  47. Variable length datasets and I/O
     • Diagram: the raw data and the global heaps are written to the file
  48. VL chunked dataset in a file
     • Diagram (file): dataset header, chunk B-tree, dataset chunks, heaps with VL data
  49. Writing chunked VL datasets
     • Diagram (application memory): the metadata cache holds the B-tree nodes, the dataset header, and the global heap; the raw data of the selected region of the VL chunked dataset passes through the chunk cache, the conversion buffer, and the filter pipeline into the file
  50. Hints for variable length data I/O
     • Avoid closing/opening a file while writing VL datasets
       • Global heap information is lost
       • Global heaps may have unused space
     • Avoid alternately writing different VL datasets
       • Data from different datasets will go into the same heap
     • If the maximum length of the record is known, consider using fixed-length records and compression
  51. Thank you! Questions?
