Parallel HDF5

Albert Cheng
The HDF Group

Sep 28-30, 2010

HDF and HDF-EOS Workshop XIV

Advantage of Parallel HDF5

            cpu time    cpu ratio     wall time   wall ratio
            (seconds)   (pp/serial)   (seconds)   (pp/serial)
  serial      10.78        1.00         21.92        1.00
  pp -n 2     19.80        1.84         15.03        0.69
  pp -n 4     24.72        2.29          8.42        0.38
  pp -n 8     64.62        5.99         12.69        0.58

Outline
• Overview of Parallel HDF5 design
• Parallel Environment Requirements
• Performance Analysis
• Parallel tools
• PHDF5 Programming Model

Overview of Parallel HDF5 Design

PHDF5 Requirements
• Support Message Passing Interface (MPI) programming
• PHDF5 files compatible with serial HDF5 files
  • Shareable between different serial or parallel platforms
• Single file image to all processes
  • One file per process design is undesirable
    • Expensive post processing
    • Not usable by different number of processes
• Standard parallel I/O interface
  • Must be portable to different platforms

PHDF5 Implementation Layers
• Application
• Parallel computing system (Linux cluster): compute nodes
• I/O library (HDF5): PHDF5 is built on top of the standard MPI-IO API
• Parallel I/O library (MPI-I/O)
• Parallel file system (GPFS): holds the PHDF5 file
• Switch network / I/O servers
• Disk architecture and layout of data on disk

Parallel Environment Requirements
• MPI with MPI-IO. E.g.,
  • MPICH2 ROMIO
  • Vendor's MPI-IO
• POSIX compliant parallel file system. E.g.,
  • GPFS (General Parallel File System)
  • Lustre

POSIX Compliant Requirement
• IEEE Std 1003.1-2008 definition of the write operation specifies that:
  … After a write() to a regular file has successfully returned:
  • Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.
  • Any subsequent successful write() to the same byte position in the file shall overwrite that file data.

Again in English
For all processes of communicator comm that have opened a file together:

  When one process does
    lseek(fd, 1000)             == success
    write(fd, writebuf, nbytes) == success
  and all processes do
    MPI_Barrier(comm)           == success
  and then all processes do
    lseek(fd, 1000)             == success
    read(fd, readbuf, nbytes)   == success
  then all processes have writebuf == readbuf.

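A minimal C sketch (not from the original slides) of the consistency check described above; the file name, offset, and buffer size are illustrative assumptions:

  #include <mpi.h>
  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      char writebuf[8] = "PHDF5ok";        /* data written by one process */
      char readbuf[8]  = {0};
      int  rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* all processes open the same file on the parallel file system */
      int fd = open("posix_test.dat", O_CREAT | O_RDWR, 0644);

      if (rank == 0) {                     /* one process writes ...        */
          lseek(fd, 1000, SEEK_SET);
          write(fd, writebuf, sizeof(writebuf));
      }

      MPI_Barrier(MPI_COMM_WORLD);         /* ... all processes synchronize */

      lseek(fd, 1000, SEEK_SET);           /* ... then all processes read   */
      read(fd, readbuf, sizeof(readbuf));

      /* on a POSIX compliant parallel file system this must hold */
      int ok = (memcmp(writebuf, readbuf, sizeof(readbuf)) == 0);

      close(fd);
      MPI_Finalize();
      return ok ? 0 : 1;
  }
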
MPI-IO vs. HDF5
• MPI-IO is an Input/Output API.
• It treats the data file as a "linear byte stream"; each MPI application needs to provide its own file view and data representations to interpret those bytes.
• All data stored are machine dependent except for the "external32" representation.
  • External32 is defined in big-endian byte order.
  • Little-endian machines have to do the data conversion in both read and write operations.
  • 64-bit data types may lose information.

MPI-IO vs. HDF5 (cont.)
• HDF5 is data management software.
• It stores the data and metadata according to the HDF5 data format definition.
• An HDF5 file is self-describing.
• Each machine can store the data in its own native representation for efficient I/O without loss of data precision.
• Any necessary data representation conversion is done automatically by the HDF5 library.

Performance Analysis
• Some common causes of poor performance
• Possible solutions

My PHDF5 Application I/O is slow
• Use larger I/O data sizes
• Independent vs. Collective I/O
• Specific I/O system hints
• Increase Parallel File System capacity

Write Speed vs. Block Size
[Chart: TFLOPS, HDF5 Write vs. MPIO Write (file size 3200 MB, 8 nodes); write speed in MB/sec (0-120) versus block sizes of 1, 2, 4, 8, 16, and 32 MB.]

Independent vs. Collective Access
• A user reported that independent data transfer mode was much slower than collective data transfer mode.
• The data array was tall and thin: 230,000 rows by 6 columns.

Collective vs. Independent Calls
• MPI definition of collective calls:
  • All processes of the communicator must participate in the right order. E.g.,

        Process 1              Process 2
        call A(); call B();    call A(); call B();   **right**
        call A(); call B();    call B(); call A();   **wrong**

• Independent means not collective
• Collective is not necessarily synchronous

Debug Slow Parallel I/O Speed (1)
• Writing to one dataset
  • Using 4 processes == 4 columns
  • Data type is 8-byte doubles
  • 4 processes, 1000 rows == 4 x 1000 x 8 = 32,000 bytes
• % mpirun -np 4 ./a.out i t 1000
  • Execution time: 1.783798 s.
• % mpirun -np 4 ./a.out i t 2000
  • Execution time: 3.838858 s.
• # Difference of 2 seconds for 1000 more rows = 32,000 bytes.
• # A speed of 16 KB/sec!!! Way too slow.

Debug Slow Parallel I/O Speed (2)
• Build a version of PHDF5 with
  • ./configure --enable-debug --enable-parallel …
• This allows the tracing of MPI-IO calls in the HDF5 library.
• E.g., to trace MPI_File_read_xx and MPI_File_write_xx calls:
  • % setenv H5FD_mpio_Debug "rw"

Debug Slow Parallel I/O Speed (3)
% setenv H5FD_mpio_Debug 'rw'
% mpirun -np 4 ./a.out i t 1000    # Indep.; contiguous.
in H5FD_mpio_write mpi_off=0    size_i=96
in H5FD_mpio_write mpi_off=0    size_i=96
in H5FD_mpio_write mpi_off=0    size_i=96
in H5FD_mpio_write mpi_off=0    size_i=96
in H5FD_mpio_write mpi_off=2056 size_i=8
in H5FD_mpio_write mpi_off=2048 size_i=8
in H5FD_mpio_write mpi_off=2072 size_i=8
in H5FD_mpio_write mpi_off=2064 size_i=8
in H5FD_mpio_write mpi_off=2088 size_i=8
in H5FD_mpio_write mpi_off=2080 size_i=8
…
# A total of 4000 of these little 8-byte writes == 32,000 bytes.

Independent calls are many and small
• Each process writes one element of one row, skips to the next row, writes one element, and so on.
• Each process issues 230,000 writes of 8 bytes each.
• Not good == just like many independent cars driving to work: wasted gas, wasted time, total traffic jam.

Debug Slow Parallel I/O Speed (4)
% setenv H5FD_mpio_Debug 'rw'
% mpirun -np 4 ./a.out i h 1000    # Indep., chunked by column.
in H5FD_mpio_write mpi_off=0     size_i=96
in H5FD_mpio_write mpi_off=0     size_i=96
in H5FD_mpio_write mpi_off=0     size_i=96
in H5FD_mpio_write mpi_off=0     size_i=96
in H5FD_mpio_write mpi_off=3688  size_i=8000
in H5FD_mpio_write mpi_off=11688 size_i=8000
in H5FD_mpio_write mpi_off=27688 size_i=8000
in H5FD_mpio_write mpi_off=19688 size_i=8000
in H5FD_mpio_write mpi_off=96    size_i=40
in H5FD_mpio_write mpi_off=136   size_i=544
in H5FD_mpio_write mpi_off=680   size_i=120
in H5FD_mpio_write mpi_off=800   size_i=272
…
Execution time: 0.011599 s.

Use Collective Mode or Chunked Storage
• Collective mode will combine many small independent calls into a few bigger calls == like people going to work together by train.
• Chunking by columns speeds things up too == like people living and working in the suburbs to reduce overlapping traffic.

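A minimal C sketch (not from the original slides) of the two remedies, using the same HDF5 1.6-style H5Dcreate signature the later example slides use; the file, dataspaces, chunk size, and variable names are illustrative assumptions:

  #include "hdf5.h"

  void write_fast(hid_t file_id, hid_t filespace, hid_t memspace,
                  const double *data)
  {
      /* Remedy 1: chunked storage, e.g. one 1000x1 chunk per column */
      hsize_t chunk_dims[2] = {1000, 1};
      hid_t   dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
      H5Pset_chunk(dcpl_id, 2, chunk_dims);

      hid_t dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_DOUBLE,
                                filespace, dcpl_id);

      /* Remedy 2: collective data transfer instead of the default
       * independent mode, so many small writes are combined */
      hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE);

      H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, memspace, filespace,
               dxpl_id, data);

      H5Pclose(dxpl_id);
      H5Pclose(dcpl_id);
      H5Dclose(dset_id);
  }
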
Independent vs. Collective write
6 processes, IBM p-690, AIX, GPFS

  # of Rows   Data Size (MB)   Independent (Sec.)   Collective (Sec.)
     16384         0.25                8.26                1.72
     32768         0.50               65.12                1.80
     65536         1.00              108.20                2.68
    122918         1.88              276.57                3.11
    150000         2.29              528.15                3.63
    180300         2.75              881.39                4.12

Independent vs. Collective write (cont.)
[Chart: Performance (non-contiguous); time in seconds (0-1000) versus data space size in MB (0.00-3.00) for Independent and Collective writes.]

Effects of I/O Hints: IBM_largeblock_io
• GPFS at LLNL Blue
• 4 nodes, 16 tasks
• Total data size 1024 MB
• I/O buffer size 1 MB

                    IBM_largeblock_io=false    IBM_largeblock_io=true
  Tasks = 16         MPI-IO       PHDF5         MPI-IO       PHDF5
  write (MB/S)         60           48            354          294
  read  (MB/S)         44           39            256          248

Effects of I/O Hints: IBM_largeblock_io
• GPFS at LLNL ASCI Blue machine
• 4 nodes, 16 tasks
• Total data size 1024 MB
• I/O buffer size 1 MB
[Bar chart: 16-task write and read bandwidth in MB/s (0-400) for MPI-IO and PHDF5, with IBM_largeblock_io=false versus IBM_largeblock_io=true.]

My PHDF5 Application I/O is slow
• If my application I/O performance is slow, what can I do?
  • Use larger I/O data sizes
  • Independent vs. Collective I/O
  • Specific I/O system hints
  • Increase Parallel File System capacity

Parallel Tools
• h5perf
  • Performance measuring tool showing I/O performance for different I/O APIs

h5perf
• An I/O performance measurement tool
• Tests 3 file I/O APIs:
  • POSIX I/O (open/write/read/close…)
  • MPIO (MPI_File_{open,write,read,close})
  • PHDF5
    • H5Pset_fapl_mpio (using MPI-IO)
    • H5Pset_fapl_mpiposix (using POSIX I/O)
• An indication of I/O speed upper limits

h5perf: Some features
• Check (-c): verify data correctness
• Added 2-D chunk patterns in v1.8
• -h shows the help page.

h5perf: example output 1/3
% mpirun -np 4 h5perf     # Ran in a Linux system
Number of processors = 4
Transfer Buffer Size: 131072 bytes, File size: 1.00 MBs
# of files: 1, # of datasets: 1, dataset size: 1.00 MBs

IO API = POSIX
Write (1 iteration(s)):
  Maximum Throughput: 18.75 MB/s
  Average Throughput: 18.75 MB/s
  Minimum Throughput: 18.75 MB/s
Write Open-Close (1 iteration(s)):
  Maximum Throughput: 10.79 MB/s
  Average Throughput: 10.79 MB/s
  Minimum Throughput: 10.79 MB/s
Read (1 iteration(s)):
  Maximum Throughput: 2241.74 MB/s
  Average Throughput: 2241.74 MB/s
  Minimum Throughput: 2241.74 MB/s
Read Open-Close (1 iteration(s)):
  Maximum Throughput: 756.41 MB/s
  Average Throughput: 756.41 MB/s
  Minimum Throughput: 756.41 MB/s

h5perf: example output 2/3
% mpirun -np 4 h5perf
…
IO API = MPIO
Write (1 iteration(s)):
  Maximum Throughput: 611.95 MB/s
  Average Throughput: 611.95 MB/s
  Minimum Throughput: 611.95 MB/s
Write Open-Close (1 iteration(s)):
  Maximum Throughput: 16.89 MB/s
  Average Throughput: 16.89 MB/s
  Minimum Throughput: 16.89 MB/s
Read (1 iteration(s)):
  Maximum Throughput: 421.75 MB/s
  Average Throughput: 421.75 MB/s
  Minimum Throughput: 421.75 MB/s
Read Open-Close (1 iteration(s)):
  Maximum Throughput: 109.22 MB/s
  Average Throughput: 109.22 MB/s
  Minimum Throughput: 109.22 MB/s

h5perf: example output 3/3
% mpirun -np 4 h5perf
…
IO API = PHDF5 (w/MPI-I/O driver)
Write (1 iteration(s)):
  Maximum Throughput: 304.40 MB/s
  Average Throughput: 304.40 MB/s
  Minimum Throughput: 304.40 MB/s
Write Open-Close (1 iteration(s)):
  Maximum Throughput: 15.14 MB/s
  Average Throughput: 15.14 MB/s
  Minimum Throughput: 15.14 MB/s
Read (1 iteration(s)):
  Maximum Throughput: 1718.27 MB/s
  Average Throughput: 1718.27 MB/s
  Minimum Throughput: 1718.27 MB/s
Read Open-Close (1 iteration(s)):
  Maximum Throughput: 78.06 MB/s
  Average Throughput: 78.06 MB/s
  Minimum Throughput: 78.06 MB/s
Transfer Buffer Size: 262144 bytes, File size: 1.00 MBs
# of files: 1, # of datasets: 1, dataset size: 1.00 MBs

Useful Parallel HDF Links
• Parallel HDF information site
  http://www.hdfgroup.org/HDF5/PHDF5/
• Parallel HDF5 tutorial available at
  http://www.hdfgroup.org/HDF5/Tutor/
• HDF Help email address
  help@hdfgroup.org

Questions?

How to Compile PHDF5 Applications
• h5pcc – HDF5 C compiler command
  • Similar to mpicc
• h5pfc – HDF5 F90 compiler command
  • Similar to mpif90
• To compile:
  • % h5pcc h5prog.c
  • % h5pfc h5prog.f90

h5pcc/h5pfc -show option
• -show displays the compiler commands and options without executing them, i.e., a dry run

  % h5pcc -show Sample_mpio.c
  mpicc -I/home/packages/phdf5/include -D_LARGEFILE_SOURCE
    -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -D_POSIX_SOURCE
    -D_BSD_SOURCE -std=c99 -c Sample_mpio.c
  mpicc -std=c99 Sample_mpio.o -L/home/packages/phdf5/lib
    /home/packages/phdf5/lib/libhdf5_hl.a
    /home/packages/phdf5/lib/libhdf5.a -lz -lm -Wl,-rpath
    -Wl,/home/packages/phdf5/lib

Programming Restrictions
• Most PHDF5 APIs are collective
• PHDF5 opens a parallel file with a communicator
  • Returns a file-handle
  • Future access to the file via the file-handle
  • All processes must participate in collective PHDF5 APIs
  • Different files can be opened via different communicators (see the sketch below)

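A minimal C sketch (not from the original slides) of that last bullet: splitting MPI_COMM_WORLD into two sub-communicators, each of which opens its own file. The file names and the even/odd split are illustrative assumptions.

  #include <mpi.h>
  #include "hdf5.h"

  int main(int argc, char **argv)
  {
      int      rank;
      MPI_Comm subcomm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* color 0 = even ranks, color 1 = odd ranks */
      int color = rank % 2;
      MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

      /* every process of a sub-communicator opens that group's file */
      hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl_id, subcomm, MPI_INFO_NULL);

      const char *name = (color == 0) ? "even.h5" : "odd.h5";
      hid_t file_id = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);

      /* ... collective PHDF5 calls happen within each sub-communicator ... */

      H5Fclose(file_id);          /* collective over the sub-communicator */
      H5Pclose(fapl_id);
      MPI_Comm_free(&subcomm);
      MPI_Finalize();
      return 0;
  }
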
Examples of PHDF5 API
• Examples of PHDF5 collective API
  • File operations: H5Fcreate, H5Fopen, H5Fclose
  • Object creation: H5Dcreate, H5Dclose
  • Object structure: H5Dextend (increase dimension sizes)
• Array data transfer can be collective or independent
  • Dataset operations: H5Dwrite, H5Dread
• Collectiveness is indicated by function parameters, not by function names as in the MPI API (see the sketch below)

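A minimal sketch (not from the original slides) of that last point: the same H5Dwrite call is independent or collective depending only on the transfer property list passed to it; dset_id, memspace, filespace, and data are assumed to exist already.

  /* independent transfer: the default property list */
  hid_t dxpl_indep = H5P_DEFAULT;

  /* collective transfer: an explicit transfer property list */
  hid_t dxpl_coll = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dxpl_coll, H5FD_MPIO_COLLECTIVE);

  /* same function name, different collectiveness */
  H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace, dxpl_indep, data);
  H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace, dxpl_coll,  data);

  H5Pclose(dxpl_coll);
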
What Does PHDF5 Support?
• After a file is opened by the processes of a communicator:
  • All parts of the file are accessible by all processes
  • All objects in the file are accessible by all processes
  • Multiple processes may write to the same data array
  • Each process may write to an individual data array

PHDF5 API Languages
• C and F90 language interfaces
• Platforms supported:
  • Most platforms with MPI-IO supported. E.g.,
    • IBM AIX
    • Linux clusters
    • SGI Altrix
    • Cray XT

Programming model for creating and accessing a file
• HDF5 uses an access template object (property list) to control the file access mechanism
• General model to access an HDF5 file in parallel:
  • Set up MPI-IO access template (access property list)
  • Open file
  • Access data
  • Close file

Setup MPI-IO access template
Each process of the MPI communicator creates an access template and sets it up with MPI parallel access information.

C:
  herr_t H5Pset_fapl_mpio(hid_t plist_id, MPI_Comm comm, MPI_Info info);

F90:
  h5pset_fapl_mpio_f(plist_id, comm, info)
    integer(hid_t) :: plist_id
    integer        :: comm, info

plist_id is a file access property list identifier.

C Example: Parallel File Create

  comm = MPI_COMM_WORLD;
  info = MPI_INFO_NULL;
  /*
   * Initialize MPI
   */
  MPI_Init(&argc, &argv);
  /*
   * Set up file access property list for MPI-IO access
   */
  plist_id = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(plist_id, comm, info);

  file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC,
                      H5P_DEFAULT, plist_id);
  /*
   * Close the file.
   */
  H5Fclose(file_id);

  MPI_Finalize();

F90 Example: Parallel File Create

  comm = MPI_COMM_WORLD
  info = MPI_INFO_NULL
  CALL MPI_INIT(mpierror)
  !
  ! Initialize FORTRAN predefined datatypes
  CALL h5open_f(error)
  !
  ! Setup file access property list for MPI-IO access.
  CALL h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, error)
  CALL h5pset_fapl_mpio_f(plist_id, comm, info, error)
  !
  ! Create the file collectively.
  CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error, access_prp = plist_id)
  !
  ! Close the file.
  CALL h5fclose_f(file_id, error)
  !
  ! Close FORTRAN interface
  CALL h5close_f(error)
  CALL MPI_FINALIZE(mpierror)

Creating and Opening Dataset
• All processes of the communicator open/close a dataset by a collective call
  • C: H5Dcreate or H5Dopen; H5Dclose
  • F90: h5dcreate_f or h5dopen_f; h5dclose_f
• All processes of the communicator must extend an unlimited dimension dataset before writing to it
  • C: H5Dextend
  • F90: h5dextend_f

C Example: Create Dataset

  file_id = H5Fcreate(…);
  /*
   * Create the dataspace for the dataset.
   */
  dimsf[0] = NX;
  dimsf[1] = NY;
  filespace = H5Screate_simple(RANK, dimsf, NULL);

  /*
   * Create the dataset with default properties, collectively.
   */
  dset_id = H5Dcreate(file_id, "dataset1", H5T_NATIVE_INT,
                      filespace, H5P_DEFAULT);

  H5Dclose(dset_id);
  /*
   * Close the file.
   */
  H5Fclose(file_id);

F90 Example: Create Dataset

  CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error, access_prp = plist_id)

  CALL h5screate_simple_f(rank, dimsf, filespace, error)
  !
  ! Create the dataset with default properties.
  !
  CALL h5dcreate_f(file_id, "dataset1", H5T_NATIVE_INTEGER, filespace, dset_id, error)
  !
  ! Close the dataset.
  CALL h5dclose_f(dset_id, error)
  !
  ! Close the file.
  CALL h5fclose_f(file_id, error)

Accessing a Dataset
• All processes that have opened the dataset may do collective I/O
• Each process may do an arbitrary number of independent data I/O access calls
  • C: H5Dwrite and H5Dread
  • F90: h5dwrite_f and h5dread_f

Programming model for dataset access
• Create and set dataset transfer property
  • C: H5Pset_dxpl_mpio
    • H5FD_MPIO_COLLECTIVE
    • H5FD_MPIO_INDEPENDENT (default)
  • F90: h5pset_dxpl_mpio_f
    • H5FD_MPIO_COLLECTIVE_F
    • H5FD_MPIO_INDEPENDENT_F (default)
• Access dataset with the defined transfer property

C Example: Collective write

  /*
   * Create property list for collective dataset write.
   */
  plist_id = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

  status = H5Dwrite(dset_id, H5T_NATIVE_INT,
                    memspace, filespace,
                    plist_id, data);

F90 Example: Collective write

  ! Create property list for collective dataset write
  !
  CALL h5pcreate_f(H5P_DATASET_XFER_F, plist_id, error)
  CALL h5pset_dxpl_mpio_f(plist_id, &
                          H5FD_MPIO_COLLECTIVE_F, error)
  !
  ! Write the dataset collectively.
  !
  CALL h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, data, &
                  error, &
                  file_space_id = filespace, &
                  mem_space_id = memspace, &
                  xfer_prp = plist_id)

Writing and Reading Hyperslabs
• Distributed memory model: data is split among processes
• PHDF5 uses the HDF5 hyperslab model
  • Each process defines memory and file hyperslabs
  • Each process executes a partial write/read call
    • Collective calls
    • Independent calls

Set up the Hyperslab for Read/Write

  H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                      offset, stride, count, block)

Example 1: Writing dataset by rows
[Figure: the file is divided into four horizontal strips of rows, written by P0, P1, P2, and P3 respectively.]

Writing by rows: Output of h5dump

  HDF5 "SDS_row.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE  H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
        DATA {
           10, 10, 10, 10, 10,
           10, 10, 10, 10, 10,
           11, 11, 11, 11, 11,
           11, 11, 11, 11, 11,
           12, 12, 12, 12, 12,
           12, 12, 12, 12, 12,
           13, 13, 13, 13, 13,
           13, 13, 13, 13, 13
        }
     }
  }
  }

Example 1: Writing dataset by rows
[Figure: process P1's block in memory maps to rows offset[0] through offset[0]+count[0]-1 of the file.]

  count[0]  = dimsf[0]/mpi_size;    /* = 2 */
  count[1]  = dimsf[1];
  offset[0] = mpi_rank * count[0];
  offset[1] = 0;

Example 1: Writing dataset by rows

  /*
   * Each process defines dataset in memory and
   * writes it to the hyperslab in the file.
   */
  count[0] = dimsf[0]/mpi_size;
  count[1] = dimsf[1];
  offset[0] = mpi_rank * count[0];
  offset[1] = 0;
  memspace = H5Screate_simple(RANK, count, NULL);

  /*
   * Select hyperslab in the file.
   */
  filespace = H5Dget_space(dset_id);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                      offset, NULL, count, NULL);

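The slide stops at the hyperslab selection; a minimal sketch (not from the original slides) of the write step that would typically follow, reusing the collective transfer property list from the earlier "C Example: Collective write" slide, with data assumed to be each process's count[0] x count[1] block of integers:

  plist_id = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

  status = H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace,
                    plist_id, data);

  H5Sclose(filespace);
  H5Sclose(memspace);
  H5Pclose(plist_id);
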
Example 2: Writing dataset by columns
[Figure: P0 and P1 each write full-height columns of the file, in alternating column positions.]

Writing by columns: Output of h5dump

  HDF5 "SDS_col.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE  H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
        DATA {
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200
        }
     }
  }
  }

Example 2: Writing dataset by column
[Figure: each process's memory block (dimsm[0] x dimsm[1]) maps to full-height file columns of width block[1], starting at that process's offset[1] and spaced by stride[1].]

Example 2: Writing dataset by column

  /*
   * Each process defines a hyperslab in
   * the file.
   */
  count[0] = 1;
  count[1] = dimsm[1];
  offset[0] = 0;
  offset[1] = mpi_rank;
  stride[0] = 1;
  stride[1] = 2;
  block[0] = dimsf[0];
  block[1] = 1;

  /*
   * Each process selects a hyperslab.
   */
  filespace = H5Dget_space(dset_id);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                      offset, stride, count, block);

Example 3: Writing dataset by pattern
[Figure: each process's memory block maps to a strided pattern of elements in the file (stride 2 in both dimensions), interleaving P0 through P3.]

Writing by Pattern: Output of h5dump

  HDF5 "SDS_pat.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE  H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
        DATA {
           1, 3, 1, 3,
           2, 4, 2, 4,
           1, 3, 1, 3,
           2, 4, 2, 4,
           1, 3, 1, 3,
           2, 4, 2, 4,
           1, 3, 1, 3,
           2, 4, 2, 4
        }
     }
  }
  }

Example 3: Writing dataset by pattern
[Figure: process P2's memory block maps to a strided pattern of elements in the file.]

  Process P2:
  offset[0] = 0;
  offset[1] = 1;
  count[0]  = 4;
  count[1]  = 2;
  stride[0] = 2;
  stride[1] = 2;

Example 3: Writing by pattern

  /* Each process defines dataset in memory and
   * writes it to the hyperslab in the file.
   */
  count[0] = 4;
  count[1] = 2;
  stride[0] = 2;
  stride[1] = 2;
  if (mpi_rank == 0) { offset[0] = 0; offset[1] = 0; }
  if (mpi_rank == 1) { offset[0] = 1; offset[1] = 0; }
  if (mpi_rank == 2) { offset[0] = 0; offset[1] = 1; }
  if (mpi_rank == 3) { offset[0] = 1; offset[1] = 1; }

Example 4: Writing dataset by chunks
[Figure: the file is divided into four chunks in a 2x2 layout; P0, P1, P2, and P3 each write one chunk.]

Writing by Chunks: Output of h5dump

  HDF5 "SDS_chnk.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE  H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
        DATA {
           1, 1, 2, 2,
           1, 1, 2, 2,
           1, 1, 2, 2,
           1, 1, 2, 2,
           3, 3, 4, 4,
           3, 3, 4, 4,
           3, 3, 4, 4,
           3, 3, 4, 4
        }
     }
  }
  }

Example 4: Writing dataset by chunks
[Figure: process P2's memory block maps to one chunk_dims[0] x chunk_dims[1] chunk of the file.]

  Process P2:
  block[0]  = chunk_dims[0];
  block[1]  = chunk_dims[1];
  offset[0] = chunk_dims[0];
  offset[1] = 0;

Example 4: Writing by chunks

  count[0] = 1;
  count[1] = 1;
  stride[0] = 1;
  stride[1] = 1;
  block[0] = chunk_dims[0];
  block[1] = chunk_dims[1];
  if (mpi_rank == 0) { offset[0] = 0;             offset[1] = 0; }
  if (mpi_rank == 1) { offset[0] = 0;             offset[1] = chunk_dims[1]; }
  if (mpi_rank == 2) { offset[0] = chunk_dims[0]; offset[1] = 0; }
  if (mpi_rank == 3) { offset[0] = chunk_dims[0]; offset[1] = chunk_dims[1]; }