Burst Buffer: From Alpha
to Omega
George Markomanolis
Computational Scientist
KAUST Supercomputing Laboratory
georgios.markomanolis@kaust.edu.sa
24 January 2017
Outline
— Introduction to Parallel I/O
— Understanding the I/O performance on Lustre
— Introduction to Burst Buffer
— Transition to Burst Buffer, which mode to use and how?
— Accelerating the performance
Shaheen II Supercomputer
— Compute Node: Intel Haswell processors, 2 CPU sockets per node @ 2.3 GHz, 16 cores per CPU; 6,174 nodes (197,568 cores); 128 GB of memory per node, over 790 TB of total memory
— Power: up to 3.5 MW, water cooled
— Weight/Size: more than 100 metric tons; 36 XC40 compute cabinets, plus disk, blowers and management nodes
— Speed: 7.2 PFLOPS peak performance; 5.53 PFLOPS sustained LINPACK, ranked 15th in the latest Top500 list
— Network: Cray Aries interconnect with Dragonfly topology; 57% of the maximum global bandwidth between the 18 groups of two cabinets
— Storage: Sonexion 2000 Lustre appliance; 17.6 PB of usable storage; over 500 GB/s bandwidth
— Burst Buffer: DataWarp with Intel solid state drives (SSDs) as a fast data cache; over 1.5 TB/s bandwidth
— Archive: Tiered Adaptive Storage (TAS); hierarchical storage with a 200 TB disk cache and 20 PB of tape storage using a Spectra Logic tape library (upgradable to 100 PB)
Operations: CS Team
• Application Software
• Weather & Environment: WRF, WRF-Chem, HIRAM, MITgcm
• Big Data: Mizan (in-house)
• Biology & MD: Amber, Gromacs, LAMMPS, NAMD, VEP, BLAST, Infernal
• Combustion: NGA, S3D, KARFS
• CFD & Plasma: Ansys, Fluent, OpenFOAM, Plasmoid (in-house)
• Chemistry & Materials Science: VASP, Materials Studio, Gaussian,
WEIN2k, Quantum Espresso, ADF, CP2K
• Electromagnetism: Ansys, In-house developed code
• Oil & Gas: Madagascar, sofi2D, sofi3D, In-house developed codes
• Seismology: SORD, SeisSol, SPECFEM_3D_GLOBE
• Development Tools
• Compiler: Cray, Intel and GNU with MPICH library
• Optimized Math Library: Cray-libsci, Intel-MKL, PETSc, FFTW, ParMetis
• I/O library: HDF5, NetCDF, PNetCDF, ADIOS
• Performance tools: CrayPat, Reveal, Extrae, Allinea Map
• Debugger: Totalview, Allinea DDT
Software
Introduction to parallel I/O
— I/O can create bottlenecks
— I/O components are much slower than the compute parts of a supercomputer
— If the bandwidth is saturated, running at larger scale cannot improve the I/O performance
— Parallel I/O is needed to
— Spend time on science instead of waiting for files to be read/written
— Avoid wasting resources
— Avoid stressing the file system and affecting other users
I/O Performance
— There is no magic solution
— I/O performance depends on the pattern
— Of course a bottleneck can occur from any part of an application
— Increasing computation and decreasing I/O is a good solution but not
always possible
Serial I/O
— Only one process performs I/O (default option for WRF)
— Data Aggregation or Duplication
— Limited by single I/O process
— Simple solution but does not scale
— Time increases with amount of data
and also with number of processes
Parallel I/O: File-per-Process
— All processes read/write their own separate file
— The number of files can be limited by the file system
— Significant contention can be observed
Parallel I/O: Shared File
— Shared File
— One file is accessed by all the processes
— The performance depends on the data layout
— A large number of processes can cause contention
Pattern Combinations
— Subset of processes perform I/O
— Aggregation of a group of processes data
— I/O process may access independent files
— Group of processes perform parallel I/O to a shared file
Lustre
— The Lustre file system is made up of:
— A set of I/O servers called Object Storage Servers (OSSs)
— Disks called Object Storage Targets (OSTs), which store the file data (chunks of files). We have 144 OSTs on Shaheen II
— The file metadata is controlled by a Metadata Server (MDS) and stored on a Metadata Target (MDT)
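— As a quick check of this layout, the lfs utility can list the MDT and the OSTs behind a mount point and show the striping of a directory; a minimal sketch, assuming /scratch is the Lustre mount point and the folder name is illustrative:
lfs df -h /scratch
# lists the MDT and every OST of the file system, with their usage
lfs getstripe -d /scratch/$USER/my_run_dir
# shows the stripe count and stripe size set on the directory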
Lustre Operation
Lessons learned from Lustre
— Important factors:
— Striping
— Aligned data
— But… how parallel is the I/O?
Collective Buffering – MPI I/O aggregators
— During a collective write, data is first aggregated on a subset of nodes (the MPI I/O aggregators) through MPI; these nodes then write the data to the I/O servers.
— Example: 8 MPI processes, 2 MPI I/O aggregators
How many MPI processes are writing a
shared file?
— With CRAY-MPICH, we execute an application with 1216 MPI processes; it supports parallel I/O through Parallel NetCDF and the file size is 361 GB:
— First case (no striping):
— mkdir execution_folder
— copy necessary files in the folder
— cd execution_folder
— run the application
— Timing for Writing restart for domain 1: 674.26 elapsed seconds
— Answer: 1 MPI process
How many MPI processes are writing a
shared file?
— With CRAY-MPICH, we execute an application with 1216 MPI processes; it supports parallel I/O through Parallel NetCDF and the file size is 361 GB:
— Second case:
— mkdir execution_folder
— lfs setstripe -c 144 execution_folder
— copy necessary files in the folder
— cd execution_folder
— Run the application
— Timing for Writing restart for domain 1: 10.35 elapsed seconds
— Answer: 144 MPI processes
Extract the list of the MPI I/O aggregators
nodes
— export MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY=1
— First case:
AGG Rank nid
---- ------ --------
0 0 nid04184
— Second case:
AGG Rank nid
---- ------ --------
0 0 nid00292
1 8 nid00294
…
143 1144 nid04592
I/O performance on Lustre while
increasing OSTs
Figure: Time (in seconds) to write the 361 GB WRF restart file on Lustre for 5, 10, 20, 40, 80, 100 and 144 OSTs.
How to declare the number of MPI I/O
aggregators
— By default with the current version of Lustre, the number of MPI
I/O aggregators is the number of OSTs.
— There are two ways to declare the striping (number of OSTs).
— Execute the following command on an empty folder
— lfs setstripe -c X empty_folder
where X is between 2 and 144, depending on the size of the used files.
— Use the environment variable MPICH_MPIIO_HINTS to declare striping per file
export MPICH_MPIIO_HINTS="wrfinput*:striping_factor=64,wrfrst*:striping_factor=144,wrfout*:striping_factor=144"
Using Darshan tool to visualize I/O
performance
Using Darshan tool
— Have you ever used the Darshan tool?
— If the answer is "I don't know, probably not", then maybe you have used it, as it is enabled automatically for most applications.
— A KSL suite of tools (beta) that provides easy access to performance data from Darshan:
— Connect to ShaheenII with -X (or -Y) option during ssh
— Execute: source /project/k01/markomg/scripts/darshan/load.sh
— To load the performance data of the last simulation which was executed today,
execute:
— open_darshan.sh
— To load the performance data from an application that is not the last simulation, or was not executed the same day, you need to pass the job id as an argument
— open_darshan.sh job_id
— You do not have access to Darshan data from other users
— Example: open_darshan.sh 2707981
git clone https://gmarkom@bitbucket.org/gmarkom/burst_buffer.git
Compare two executions through
Darshan tool
— If you want to compare the executions of two applications, execute:
— compare_darshan.sh job1_id job2_id
— One PDF file, with the Darshan performance data of both executions, is created
Discussion about Lustre
— There are many parameters to optimize Lustre; a quite interesting one is the stripe size, which declares the number of bytes to store on an OST before moving to the next OST
— lfs setstripe -s X empty_folder (X in bytes)
— export MPICH_MPIIO_HINTS="wrfinput*:striping_factor=64,wrfrst*:striping_factor=144:striping_unit=4194304,wrfout*:striping_factor=144:striping_unit=2097152"
MPI Options to use in general
— export MPICH_ENV_DISPLAY=1
— Displays all settings used by the MPI during execution
— export MPICH_VERSION_DISPLAY=1
— Displays MPI version
— export MPICH_MPIIO_HINTS_DISPLAY=1
— Displays all the available I/O hints and their values
— export MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY=1
— Displays the ranks that are performing aggregation when using MPI-I/O collective buffering
— export MPICH_MPIIO_STATS=2
— Statistics on the actual read/write operations after collective buffering
— export MPICH_MPIIO_HINTS=“…”
— Declare I/O hints
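— In practice these variables are set in the job script right before the application launch; a minimal sketch (the task count and executable name are illustrative):
export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1
export MPICH_MPIIO_HINTS_DISPLAY=1
export MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY=1
export MPICH_MPIIO_STATS=2
srun --ntasks=1024 ./my_app
# the requested diagnostics appear in the job output file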
Burst Buffer
— Shaheen II: 268 Burst Buffer (BB) nodes, 536 SSDs, 1.52 PB in total; each node has 2 SSDs
— BB adds a layer between the compute nodes and the parallel file system
— Burst Buffer is the concept; Cray DataWarp (DW) is the technology that implements it
Burst Buffer Architecture
Burst Buffer Architecture
Burst Buffer – Use cases
— Periodic burst
— Transfer to PFS between bursts
— I/O improvements
— Accessed via POSIX I/O requests
— Stage-in/stage-out
— Shared BB allocation for multiple jobs
— Coupling applications
Burst Buffer - Status
• 268 DataWarp (DW) nodes, total 1.52PB with granularity 397.44GB
> dwstat most
pool units quantity free gran
wlm_pool bytes 1.52PiB 1.52PiB 397.44GiB
did not find any of [sessions, instances, configurations, registrations,
activations]
> dwstat nodes
node pool online drain gran capacity insts activs
nid00002 wlm_pool true false 16MiB 5.82TiB 0 0
…
nid07618 wlm_pool true false 16MiB 5.82TiB 0 0
Check if there are jobs using BB
> scontrol show burst
Name=cray DefaultPool=wlm_pool Granularity=406976M
TotalSpace=1636043520M UsedSpace=0
Flags=EnablePersistent
StageInTimeout=1800 StageOutTimeout=1800 ValidateTimeout=5
OtherTimeout=300
AllowUsers=…markomg…
GetSysState=/opt/cray/dw_wlm/default/bin/dw_wlm_cli
If your username is not in the AllowUsers list even though you have applied for BB early access, send an email to help@hpc.kaust.edu.sa
scontrol show burst
Name=cray DefaultPool=wlm_pool Granularity=406976M
TotalSpace=1636043520M UsedSpace=813952M
Flags=EnablePersistent
StageInTimeout=1800 StageOutTimeout=1800 ValidateTimeout=5
OtherTimeout=300
AllowUsers=…,markomg…
GetSysState=/opt/cray/dw_wlm/default/bin/dw_wlm_cli
Allocated Buffers:
JobID=2729000 CreateTime=2017-01-20T17:15:31 Pool=wlm_pool
Size=813952M State=allocated UserID=markomg(137767)
Per User Buffer Use:
UserID=markomg(137767) Used=813952M
Burst Buffer Nodes Allocation
• How many DW instances per node?
DW_instances_per_node = 5.82*1024/397.44 = 14.995
A DW node can accommodate up to 14*397.44/1024 = 5.43 TB
• If a user requests 60 TB of DW space, how many DW nodes will be reserved (for the striped mode explained later)?
We have 268 DW nodes; each node initially provides one DW instance, and once all of them are in use, allocation starts again from the first DW node. The allocation is done on a round-robin basis.
Requested_DW_nodes = 60*1024/397.44 = 154.58, so 155 DW nodes will be reserved.
Important: If you reserve more than 268 * 397.44/1024 = 104 TB, then some DW nodes will be used twice and this can cause I/O performance issues
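• The same calculation in shell form for any requested capacity; a sketch, where 397.44 GiB is the granularity reported by dwstat above:
capacity_tib=60
# number of DW nodes reserved for a striped allocation of capacity_tib TiB
echo "scale=2; $capacity_tib*1024/397.44" | bc
# = 154.58, i.e. 155 DW nodes after rounding up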
Burst Buffer Modes
• DW supports two access modes
• Private
Each compute job has its own private space on BB, which will not be visible to other compute jobs. For now, data is not striped over BB nodes in private mode (not tested). Each compute node has access to a BB allocation equal to the granularity size.
• Striped
The data is striped over several Burst Buffer nodes. BB nodes are allocated on a round-robin basis. We mainly use this mode.
• BB supports two reservation modes
• Scratch is a temporary space allocation that is removed when the job finishes
• Persistent is for cases where many jobs need to access the same files; this mode creates a DW space that persists after a job finishes and is available to your other DW jobs.
Important: Persistent space is not a backup solution; you could lose your data in case of any BB problem
Burst Buffer Workflow
• Initially the files are located on the Lustre file system
• Files that need to be accessed multiple times, as well as any big files, should be moved to BB before your job reservation. This phase is called stage-in. You can stage-in either a file or a folder. We discuss small shared-file access later.
• When the job finishes, the created files are returned to the folder that the user declared in the script; this is called stage-out.
• The files on BB are located under the path given by the environment variable $DW_JOB_STRIPED (for striped mode)
Note: Stage-in and stage-out are not mandatory; it depends on what the user needs. Maybe there are no input files, or the user just wants to measure the execution time.
Modify SLURM script
• Lustre reservation
#!/bin/bash
#SBATCH --partition=workq
#SBATCH -t 10:00:00
#SBATCH -A k01
#SBATCH --nodes=32
#SBATCH --ntasks=1024
#SBATCH -J slurm_test
Comment: Insert the DW directives immediately after the SBATCH directives; do not include any unrelated commands between the SBATCH and DW declarations.
Modify SLURM script
• BB reservation (2TB of DW space)
#!/bin/bash
#SBATCH --partition=workq
#SBATCH -t 10:00:00
#SBATCH -A k01
#SBATCH --nodes=32
#SBATCH --ntasks=1024
#SBATCH -J slurm_test
#DW jobdw type=scratch access_mode=striped capacity=2TiB
#DW stage_in type=directory source=/scratch/markomg/for_bb destination=$DW_JOB_STRIPED
#DW stage_out type=directory destination=/scratch/markomg/back_up source=$DW_JOB_STRIPED/
cd $DW_JOB_STRIPED
chmod +x executable
Note: You can also stage-in/out individual files instead of a directory, as in the sketch below
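A hedged sketch of the per-file directives (the file names are illustrative; only type=file replaces type=directory):
#DW stage_in type=file source=/scratch/markomg/for_bb/wrfinput_d01 destination=$DW_JOB_STRIPED/wrfinput_d01
#DW stage_out type=file source=$DW_JOB_STRIPED/wrfout_d01 destination=/scratch/markomg/back_up/wrfout_d01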
DataWarp – Restrictions
• When you stage-in executables, you must set the appropriate permissions on BB so that the file is executable (chmod +x executable)
• Symbolic and hard links are lost during stage-in
• It seems that using the /projects folder with DW now works
BB – Check status of reservation
— Execute: dwstat most
pool units quantity free gran
wlm_pool bytes 1.52PiB 1.52PiB 397.44GiB
sess state token creator owner created expiration nodes
975 CA--- 2728666 SLURM 137767 2017-01-19T23:49:54 never 40
inst state sess bytes nodes created expiration intact label public confs
967 CA--- 975 794.88GiB 2 2017-01-19T23:49:54 never true I975-0 false 1
conf state inst type access_type activs
975 CA--- 967 scratch stripe 1
reg state sess conf wait
945 CA--- 975 975 true
activ state sess conf nodes mount
884 CA--- 975 975 40 /var/opt/cray/dws/mounts/batch/2728666/ss
— You can see only your jobs from this command
Profiling MPI I/O on BB
Question: Using 160 nodes with 1 MPI process per node and 2TB of DW space (6 DW nodes) with
MPI I/O through PnetCDF, how many MPI I/O aggregators are saving the NetCDF file on BB?
Answer: 6!
Table 6: File Output Stats by Filename
Write Time | Write MBytes | Write Rate | Writes | Bytes/ Call |File Name
| | MBytes/sec | | | PE
710.752322 | 988,990.668619 | 1,391.470191 | 671,160.0 | 1,545,133.62 |Total
|-----------------------------------------------------------------------------
| 263.824253 | 369,720.282763 | 1,401.388533 | 46,690.0 | 8,303,272.98 |wrfrst_d01_2009-12-18_00_30_00
||----------------------------------------------------------------------------
|| 45.299442 | 61,624.000000 | 1,360.369943 | 7,798.0 | 8,286,412.85 |pe.96
|| 44.410365 | 61,616.000000 | 1,387.423860 | 7,795.0 | 8,288,525.83 |pe.160
|| 43.762797 | 61,623.999999 | 1,408.136675 | 7,763.0 | 8,323,772.69 |pe.32
|| 43.708663 | 61,616.148647 | 1,409.701068 | 7,762.0 | 8,323,784.42 |pe.0
|| 43.532686 | 61,616.134117 | 1,415.399323 | 7,764.0 | 8,321,638.26 |pe.128
|| 43.110299 | 61,624.000000 | 1,429.449598 | 7,808.0 | 8,275,800.13 |pe.64
|| 0.000000 | 0.000000 | -- | 0.0 | -- |pe.1
|| 0.000000 | 0.000000 | -- | 0.0 | -- |pe.2
|| 0.000000 | 0.000000 | -- | 0.0 | -- |pe.3
|| 0.000000 | 0.000000 | -- | 0.0 | -- |pe.4
|| 0.000000 | 0.000000 | -- | 0.0 | -- |pe.5
|| 0.000000 | 0.000000 | -- | 0.0 | -- |pe.6
|| 0.000000 | 0.000000 | -- | 0.0 | -- |pe.7
How do we choose the number of MPI I/O
aggregators on BB?
— In this example we have parallel I/O and we can adjust the number of MPI processes performing I/O for the application
— MPICH_MPIIO_HINTS
— export MPICH_MPIIO_HINTS="wrfrst*:cb_nodes=80,wrfout*:cb_nodes=40"
— This environment variable is supported on DataWarp since Cray-MPICH v7.4
— In this case we select 80 MPI I/O aggregators for the files whose names start with wrfrst*, and 40 MPI I/O aggregators for the files whose names start with wrfout*.
— Although this depends on the application, in our experience, with one MPI I/O aggregator per DW node (the default behavior) the performance is not always good. In order to stress the SSDs of a DW node, more than one MPI process should write data per DW node, and this is achieved with the MPI I/O aggregators.
— Depending on the size of the file, sometimes we need to use a different number of MPI I/O aggregators per file.
How do we choose the number of MPI I/O
aggregators on BB?
— Tips:
— The number of MPI I/O aggregators should divide the total number of MPI processes, for better load balancing. For example, if you have 1024 MPI processes, do not declare 100 MPI I/O aggregators, but 128 or 64. However, in some cases it does not matter.
— The number of requested DW nodes should divide the number of MPI I/O aggregators, also for better load balancing.
— Of course the requested DW nodes should provide enough space for all of your experiments, so there is a minimum number of DW nodes needed.
— MPI_IO_aggregators = DW_nodes, if we use one aggregator per DW node, or k * DW_nodes, where k ∈ ℕ, 2 ≤ k ≤ 16
— Number_of_total_MPI_processes = l * MPI_IO_aggregators, where l ∈ ℕ, l ≥ 2
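— As a worked example of these rules (the numbers are illustrative, not a recommendation): with 1280 MPI processes and 8 DW nodes, k = 5 gives 40 MPI I/O aggregators (8 divides 40, and 40 divides 1280 with l = 32), which could be requested per file as:
export MPICH_MPIIO_HINTS="wrfrst*:cb_nodes=40,wrfout*:cb_nodes=40"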
Compute the required DW space
— As already mentioned, we need to have enough space for our experiments
— If the experiments are about DW scalability and the number of MPI/OpenMP processes remains constant, then you can vary the MPI I/O aggregators and the number of DW nodes.
— If you need, for example, 64 DW nodes, then you should calculate the requested space as follows:
— Multiply by the DW granularity:
— 64*397.44 = 25436.16
— Round down and request this value in the submission script, for example:
— 25436GiB
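— The reverse calculation in shell form for any number of DW nodes; a sketch, where 397.44 is the granularity in GiB:
dw_nodes=64
# capacity to request so that the allocation spans dw_nodes DW nodes (rounded down)
echo "$dw_nodes*397.44/1" | bc
# = 25436, i.e. request 25436GiB in the submission script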
How to compile your code for better
performance on DW
— To use the MPICH_MPIIO_HINTS environment variable on DataWarp, Cray-MPICH v7.4 or later is required.
— Load the appropriate versions through cdt module:
— module load cdt/16.08
Switching to atp/2.0.2.
Switching to cce/8.5.2.
Switching to cray-libsci/16.07.1.
Switching to cray-mpich/7.4.2.
Switching to craype/2.5.6.
Switching to modules/3.2.10.4.
Switching to pmi/5.0.10-1.0000.11050.0.0.ari.
— Then compile your application
— Be careful: sometimes the gateway nodes have different versions of the default libraries
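— A minimal compile sketch under these assumptions (the source file name is illustrative; cc and ftn are the standard Cray compiler wrappers and link cray-mpich automatically):
module load cdt/16.08
cc -o my_app my_app.c
# or ftn for Fortran sources; check with MPICH_VERSION_DISPLAY=1 that the expected MPI version is used at runtime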
Create persistent DW space I
— Create persistent DW space
#!/bin/bash -x
#SBATCH --partition=workq
#SBATCH -t 1
#SBATCH -A k01
#SBATCH --nodes=1
#SBATCH -J create_persistent_space
#BB create_persistent name=george_test capacity=600G access=striped type=scratch
exit 0
Checking the status of the persistent DW
reservation
> dwstat most
sess state token creator owner created expiration nodes
985 CA--- george_test CLI 137767 2017-01-20T18:01:00 never 0
inst state sess bytes nodes created expiration intact label public confs
977 CA--- 985 794.88GiB 2 2017-01-20T18:01:01 never true george_test true 1
> dwstat nodes
node pool online drain gran capacity insts activs
nid01349 wlm_pool true false 16MiB 5.82TiB 1 0
nid01410 wlm_pool true false 16MiB 5.82TiB 1 0
Use DW persistent space I
#!/bin/bash
#SBATCH --partition=workq
#SBATCH -t 10
#SBATCH -A k01
#SBATCH --nodes=1
#DW persistentdw name=george_test
#DW stage_in type=directory source=/project/k01/markomg/wrf destination=$DW_PERSISTENT_STRIPED_george_test
cd $DW_PERSISTENT_STRIPED_george_test/
…
exit 0
Use DW persistent space II
— Now you can execute the second job on the persistent space; however, do not stage-in the same files:
— squeue -u markomg
JOBID USER ACCOUNT NAME ST REASON START_TIME TIME TIME_LEFT NODES
2729358 markomg k01 test PD burst_buf N/A 0:00 3:00 40
— scontrol show job 2729358
…
JobState=PENDING
Reason=burst_buffer/cray:_dws_data_in:_Error_creating_staging_object_for_file_(/
scratch/markomg/burst_buffer_early_access/wrfchem/wrfchem-
3.7.1_burst/test/em_real/forburst)_-2_Staging_failures_reported_
Dependency=(null)
...
— scancel 2729358
If the problem is not solved, send us an email immediately at help@hpc.kaust.edu.sa, and also inform the BB users through bb_users@hpc.lists.kaust.edu.sa
Use DW persistent space III
— Submit other jobs, either staging-in different files or without stage-in
— If you want to connect interactively to a compute node in order to access BB and check the files, follow these instructions:
— Create a file, for example called persistent.conf, with the following content:
— #DW persistentdw name=george_test
— Execute:
salloc -N 1 -t 00:10:00 --bbf="persistent.conf"
srun -N 1 bash -i
cd $DW_PERSISTENT_STRIPED_george_test
— markomg@nid00024:/var/opt/cray/dws/mounts/batch/george_test/ss
Use DW persistent space IV
— Three jobs were executed on the persistent DW space and each created a folder with its job id as the name:
nid00024:/var/opt/cray/dws/mounts/batch/george_test/ss/ ls -l 2729*
2729356:
-rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 18:49 wrfout_d01_2007-04-03_00_00_00
-rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 18:49 wrfout_d01_2007-04-03_01_00_00
2729361:
-rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 18:57 wrfout_d01_2007-04-03_00_00_00
-rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 18:57 wrfout_d01_2007-04-03_01_00_00
2729362:
-rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 19:09 wrfout_d01_2007-04-03_00_00_00
-rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 19:09 wrfout_d01_2007-04-03_01_00_00
Finalize DW persistent reservation
— When the experiments are finished, stage-out the files (if required). Do not copy the files back to Lustre from the interactive mode, as this can be much slower, depending on the file sizes.
— Finally delete the DW space
#!/bin/bash
#SBATCH --partition=workq
#SBATCH -t 1
#SBATCH -A k01
#SBATCH --nodes=1
#SBATCH -J delete_persistent_space
#BB destroy_persistent name=george_test
exit 0
Fuse Blown error – just in case…
> dwstat most
pool units quantity free gran
wlm_pool bytes 1.52PiB 1.52PiB 397.44GiB
sess state token creator owner created expiration nodes
1007 D---- 2729546 SLURM 137767 2017-01-21T00:27:50 never 0
1008 D---- 2729547 SLURM 137767 2017-01-21T00:27:57 never 0
inst state sess bytes nodes created expiration intact label public confs
991 D---- 1007 397.44GiB 1 2017-01-21T00:27:50 never false I1007-0 false 1
992 D---- 1008 397.44GiB 1 2017-01-21T00:27:58 never false I1008-0 false 1
conf state inst type access_type activs
999 DAF-- 991 scratch stripe 0
1000 DAF-- 992 scratch stripe 0
Do not submit any more experiments and report to us!
Applications/Benchmarks
Early Results I
• 8 May 2016
8192 MPI processes, for Lustre 120 OSTs, 256 DataWarp nodes, and 256 compute nodes. We have a total of 256 * 32 tiles and the size of each tile ranges from 16 to 1024 MB, so we save a file of size from 128 GB up to 8 TB. The collective mode behaves worse (for Lustre, as we increase the file size), and for a 1024 MB element size we get the same bandwidth for non-collective Lustre and collective DataWarp.
Early Results II
• 12 May 2016
Shaheen II power uncapped, 2048 compute nodes and 256 DataWarp nodes.
It seems that with 65536 MPI processes, as we increase the MB per element, some nodes do not perform well. In the following figure the MB per element goes from 16 up to 1024. This means that the 32768 MPI processes save a single shared file of size from 512 GB up to 32 TB, while the 65536 MPI processes save a file from 1 TB up to 64 TB. The maximum write bandwidth is ~749 GB/s for 32768 MPI processes and 512 MB per element.
Study-case WRF-CHEM
WRF-CHEM on Burst Buffer
— Weather Research and Forecasting model coupled with Chemistry
— Small domain: 330 x 275
— Size of input file: 804 MB
— Size of output file: 2.9 GB, saved every hour of simulation
— Output file quite small
— For all the WRF-CHEM experiments we use 1280 MPI processes (40 nodes), as this is the optimum for computation/communication
— For the default case, we stage-in all the files and execute the simulation from BB
Total execution time and I/O on BB
without MPICH_MPIIO_HINTS (default)
Figure: Total time, input-file read time and output-file write time (in seconds) on Burst Buffer for 1, 2, 4 and 8 DW nodes, with Lustre on 64 OSTs as reference.
• The best total execution time is provided when we have 4 DW nodes
• On Lustre the best execution time is with 64 OSTs and it is 15% faster than BB
Darshan – 4 DW nodes
Understand the MPI I/O statistics on BB
(MPICH_MPIIO_STATS=2) I
+--------------------------------------------------------+
| MPIIO read access patterns for wrfinput_d01
| independent reads = 1
| collective reads = 527360
| independent readers = 1
| aggregators = 4
| stripe count = 1
| stripe size = 8388608
| system reads = 762
| stripe sized reads = 108
| total bytes for reads = 2930104643 = 2794 MiB = 2 GiB
| ave system read size = 3845281
| number of read gaps = 2
| ave read gap size = 0
+--------------------------------------------------------+
Timing for processing wrfinput file (stream 0) for domain 1: 13.19130 elapsed seconds
We have 4 MPI I/O aggregators on 1 DW node
The default stripe size is 8 MB
Only 14.17% of the reads are stripe-sized
The average system read size is less than 4 MB
Understand the MPI I/O statistics on BB
(MPICH_MPIIO_STATS=2) II
+--------------------------------------------------------+
| MPIIO write access patterns for wrfout_d01_2007-04-03_00_00_00
| independent writes = 2
| collective writes = 552960
| independent writers = 1
| aggregators = 4
| stripe count = 1
| stripe size = 8388608
| system writes = 797
| stripe sized writes = 114
| total bytes for writes = 3045341799 = 2904 MiB = 2 GiB
| ave system write size = 3821006
| read-modify-write count = 0
| read-modify-write bytes = 0
| number of write gaps = 2
| ave write gap size = 4194300
+--------------------------------------------------------+
Timing for Writing wrfout_d01_2007-04-03_00_00_00 for domain 1: 18.83090 elapsed seconds
Only 14.3% of the writes are stripe-sized
There are 2 gaps of almost 4 MB
Compare the total execution time on single DW
nodes across various MPI I/O aggregators
Figure: Total execution time (in seconds) on a single DW node for 1, 2, 4 and 8 MPI I/O aggregators: 188, 136, 135 and 131 seconds respectively.
• Example for declaring 2 MPI I/O aggregators
export MPICH_MPIIO_HINTS="wrfinput*:cb_nodes=2,wrfout*:cb_nodes=2"
• Tip: You can declare different MPI I/O aggregators per file
Studying the influence of the stripe size on the
performance of application for 8 MPI I/O aggregators
on single DW node
Figure: Total execution time (in seconds) with 8 MPI I/O aggregators on a single DW node for stripe sizes of 8, 2 and 1 MB: 131, 102 and 105 seconds respectively.
• Changing the stripe size from 8 MB to 2 MB improved the total execution time by 22%
• Example of 8 MPI I/O aggregators and a 2 MB stripe size (striping_unit flag):
export MPICH_MPIIO_HINTS="wrfinput*:cb_nodes=8:striping_unit=2097152,wrfout*:cb_nodes=8:striping_unit=2097152"
• Why is the performance not improved for a stripe size of 1 MB?
MPI I/O statistics for stripe size equal to 1MB
+--------------------------------------------------------+
| MPIIO write access patterns for wrfout_d01_2007-04-03_00_00_00
| independent writes = 2
| collective writes = 552960
| independent writers = 1
| aggregators = 8
| stripe count = 1
| stripe size = 1048576
| system writes = 3338
| stripe sized writes = 2606
| total bytes for writes = 3045341799 = 2904 MiB = 2 GiB
| ave system write size = 912325
| read-modify-write count = 0
| read-modify-write bytes = 0
| number of write gaps = 2
| ave write gap size = 524284
+--------------------------------------------------------+
Timing for Writing wrfout_d01_2007-04-03_00_00_00 for domain 1: 9.09640 elapsed seconds
78% of the writes are stripe-sized, but the system now does 3338 writes instead of 797
Let's increase the number of MPI I/O aggregators, because there are many more write calls now
Increasing the number of the MPI I/O
aggregators for stripe size equal to 1MB
+--------------------------------------------------------+
| MPIIO write access patterns for wrfout_d01_2007-04-03_00_00_00
| independent writes = 2
| collective writes = 552960
| independent writers = 1
| aggregators = 20
| stripe count = 1
| stripe size = 1048576
| system writes = 3338
| stripe sized writes = 2606
| total bytes for writes = 3045341799 = 2904 MiB = 2 GiB
| ave system write size = 912325
| read-modify-write count = 0
| read-modify-write bytes = 0
| number of write gaps = 2
| ave write gap size = 524284
+--------------------------------------------------------+
Timing for Writing wrfout_d01_2007-04-03_00_00_00 for domain 1: 6.40781 elapsed seconds
Finally, by increasing the MPI I/O aggregators to 40 and decreasing the stripe size to 256 KB, we achieve the best result: a total execution time of 97 seconds.
Comparison between BB and Lustre
Figure: Total execution time (in seconds) with default and optimized parameters: BB 188 (default) vs. 97 (optimized), Lustre 114 (default) vs. 112 (optimized).
The total execution time on BB is 13.4% faster than on Lustre for one hour of simulated WRF-CHEM. For 24 hours of simulation, the execution time on BB is 14.8% faster than on Lustre.
We achieved this better performance with a single BB node, compared to 64 OSTs of Lustre.
WRF-CHEM – Split output to one file per
process
Figure: Reported "I/O" time (in seconds) from WRF-CHEM for the 2.9 GB output file across 1, 2, 4, 8, 16, 32 and 64 DW nodes.
WRF-CHEM – Split output to one file per
process II
Figure: I/O efficiency (in %) based on the reported "I/O" time from WRF-CHEM for the 2.9 GB output file across 1, 2, 4, 8, 16, 32 and 64 DW nodes.
WRF – Split output to one file per process
– Large cases
Figure: Reported "I/O" time (in seconds) from WRF for the 81 GB and 361 GB files across 4, 8, 16, 32, 64 and 128 DW nodes.
WRF – Split output to one file per process
II
Figure: I/O efficiency (in %) based on the reported "I/O" time from WRF for the 81 GB and 361 GB files across 4, 8, 16, 32, 64 and 128 DW nodes.
What happens if we try to scale on BB
with this example?
For this experiment we use 64 DW nodes. The file namelist.input is 12 KB (Darshan reports a wrong file size), but 1280 MPI processes access it for reading and there is a metadata issue.
Solution
Tip: You need to declare the path of the files in MPICH_MPIIO_HINTS
export MPICH_MPIIO_HINTS="$DW_JOB_STRIPED/wrfinput*:cb_nodes=40 … $DW_JOB_STRIPED/wrfout*:cb_nodes=40…"
— Hybrid execution: we save small files on Lustre and large ones on BB
Multiple runs
— Submitting 3 jobs of 20 compute nodes, each requesting 64 DW nodes
— used_bb_nodes.sh
192 BB nodes are used by at least one BB job
0 BB nodes are used by more than one BB job
— Variation of 2-3%
DataWarp vs Lustre for same number of
nodes (OSTs)
Figure: I/O time (in seconds) for the 361 GB WRF restart file on Lustre and DataWarp for 5, 10, 20, 40, 80, 100 and 144 OSTs or DW nodes.
WRF reports an I/O time, but it includes functionality beyond pure I/O
DataWarp vs Lustre, percentage of
performance difference
Figure: Performance difference (in %) of DataWarp over Lustre for the same number of nodes (5, 10, 20, 40, 80, 100 and 144 OSTs or DW nodes).
WRF – Lustre vs DW
Figure: Total execution time (in seconds) of WRF on Lustre and DW for 4, 8, 16, 32, 64 and 128 DW nodes.
NAS – BT I/O Benchmark - PNetCDF
Figure: I/O bandwidth (GB/s) of the NAS BT-IO benchmark with PnetCDF on DW (2 to 256 DW nodes), compared to Lustre with 128 and 512 compute nodes.
NAS – BT I/O Benchmark - PNetCDF
Figure: I/O efficiency (in %) of the NAS BT-IO benchmark with PnetCDF across 2 to 256 DW nodes.
PIDX
— PIDX is an efficient parallel I/O library that reads and writes
multiresolution IDX data files
— It can provide high scalability up to 768k cores
— Successful integration with several simulation codes
— KARFS (KAUST Adaptive Reacting Flow Solvers) on Shaheen II
— Uintah with production runs on Mira
— S3D
Figure: PIDX write I/O bandwidth (GB/s) on BB and Lustre for 16, 32, 64, 128, 144 and 256 BB nodes/OSTs.
Figure: PIDX I/O efficiency (in %), based on the IOR peak, for 16, 32, 64, 128, 144 and 256 BB nodes.
DataWarp API
— libdatawarp
— dw_get_stripe_configuration
— dw_query_directory_stage
— dw_query_file_stage
— dw_set_stage_concurrency
— dw_stage_file_out
— dw_wait_directory_stage
BB Policy and Calendar
— All projects should schedule time slots to execute experiments
— Send me your Gmail account or another email address for the Google calendar
— Up to half-day slots
— Training is mandatory before you get access to the Burst Buffer
Conclusions
— We explained how Lustre and BB operate
— Many parameters need to be investigated for optimum performance
— Be patient; you will probably not achieve the best performance immediately
— Be careful to compile your application with the appropriate Cray MPICH
— CLE 6.0 will solve some BB issues
— Be open, be collaborative
— Be active at bb_users@lists.hpc.kaust.edu.sa
— Always check the output of the application, the MPI version used, etc.
Thank you!
Questions?
georgios.markomanolis@kaust.edu.sa

More Related Content

What's hot

Real-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiReal-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiKarel Ha
 
FARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
FARIS: Fast and Memory-efficient URL Filter by Domain Specific MachineFARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
FARIS: Fast and Memory-efficient URL Filter by Domain Specific MachineYuuki Takano
 
Performance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4SPerformance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4SGanesan Narayanasamy
 
Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC EcosystemPost-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC EcosystemLinaro
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveNetronome
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFoholiab
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
BPF - All your packets belong to me
BPF - All your packets belong to meBPF - All your packets belong to me
BPF - All your packets belong to me_xhr_
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVELinaro
 
Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportLinaro
 
IPC mechanisms in windows
IPC mechanisms in windowsIPC mechanisms in windows
IPC mechanisms in windowsVinoth Raj
 
Os7 2
Os7 2Os7 2
Os7 2issbp
 
eBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureeBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureNetronome
 
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerSolving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerKarel Ha
 
Containerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor toolContainerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor toolGanesan Narayanasamy
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Anne Nicolas
 

What's hot (20)

Real-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/PhiReal-time applications on IntelXeon/Phi
Real-time applications on IntelXeon/Phi
 
FARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
FARIS: Fast and Memory-efficient URL Filter by Domain Specific MachineFARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
FARIS: Fast and Memory-efficient URL Filter by Domain Specific Machine
 
Performance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4SPerformance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4S
 
1.prallelism
1.prallelism1.prallelism
1.prallelism
 
Post-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC EcosystemPost-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep Dive
 
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPFA Kernel of Truth: Intrusion Detection and Attestation with eBPF
A Kernel of Truth: Intrusion Detection and Attestation with eBPF
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
BPF - All your packets belong to me
BPF - All your packets belong to meBPF - All your packets belong to me
BPF - All your packets belong to me
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVE
 
Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler support
 
IPC mechanisms in windows
IPC mechanisms in windowsIPC mechanisms in windows
IPC mechanisms in windows
 
Os7 2
Os7 2Os7 2
Os7 2
 
eBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureeBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging Infrastructure
 
IBM AI at Scale
IBM AI at ScaleIBM AI at Scale
IBM AI at Scale
 
Solving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as PokerSolving Endgames in Large Imperfect-Information Games such as Poker
Solving Endgames in Large Imperfect-Information Games such as Poker
 
main memory
main memorymain memory
main memory
 
Containerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor toolContainerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor tool
 
Chapter 8 - Main Memory
Chapter 8 - Main MemoryChapter 8 - Main Memory
Chapter 8 - Main Memory
 
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
Kernel Recipes 2018 - XDP: a new fast and programmable network layer - Jesper...
 

Similar to Burst Buffer: From Alpha to Omega

Advanced computer architecture
Advanced computer architectureAdvanced computer architecture
Advanced computer architectureAjithaSomasundaram
 
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...EUDAT
 
Best Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing ClustersBest Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing ClustersIntel® Software
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterSudhang Shankar
 
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8AbdullahMunir32
 
Building A Linux Cluster Using Raspberry PI #2!
Building A Linux Cluster Using Raspberry PI #2!Building A Linux Cluster Using Raspberry PI #2!
Building A Linux Cluster Using Raspberry PI #2!A Jorge Garcia
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2Giovanna Roda
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @JavaPeter Lawrey
 
maXbox Starter 42 Multiprocessing Programming
maXbox Starter 42 Multiprocessing Programming maXbox Starter 42 Multiprocessing Programming
maXbox Starter 42 Multiprocessing Programming Max Kleiner
 
Adios Api Scidac Tutorialv2
Adios Api Scidac Tutorialv2Adios Api Scidac Tutorialv2
Adios Api Scidac Tutorialv2fanc1985
 
PVFS: A Parallel File System for Linux Clusters
PVFS: A Parallel File System for Linux ClustersPVFS: A Parallel File System for Linux Clusters
PVFS: A Parallel File System for Linux ClustersTawose Olamide Timothy
 
Sector Sphere 2009
Sector Sphere 2009Sector Sphere 2009
Sector Sphere 2009lilyco
 
sector-sphere
sector-spheresector-sphere
sector-spherexlight
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systemsdairsie
 
Debugging Python with gdb
Debugging Python with gdbDebugging Python with gdb
Debugging Python with gdbRoman Podoliaka
 

Similar to Burst Buffer: From Alpha to Omega (20)

Advanced computer architecture
Advanced computer architectureAdvanced computer architecture
Advanced computer architecture
 
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
Introduction to HPC Programming Models - EUDAT Summer School (Stefano Markidi...
 
Multicore
MulticoreMulticore
Multicore
 
Basic Linux Internals
Basic Linux InternalsBasic Linux Internals
Basic Linux Internals
 
Best Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing ClustersBest Practices and Performance Studies for High-Performance Computing Clusters
Best Practices and Performance Studies for High-Performance Computing Clusters
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC cluster
 
Lab 1 Essay
Lab 1 EssayLab 1 Essay
Lab 1 Essay
 
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
 
Building A Linux Cluster Using Raspberry PI #2!
Building A Linux Cluster Using Raspberry PI #2!Building A Linux Cluster Using Raspberry PI #2!
Building A Linux Cluster Using Raspberry PI #2!
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
Open HFT libraries in @Java
Open HFT libraries in @JavaOpen HFT libraries in @Java
Open HFT libraries in @Java
 
maXbox Starter 42 Multiprocessing Programming
maXbox Starter 42 Multiprocessing Programming maXbox Starter 42 Multiprocessing Programming
maXbox Starter 42 Multiprocessing Programming
 
Parallel Processing.pptx
Parallel Processing.pptxParallel Processing.pptx
Parallel Processing.pptx
 
Adios Api Scidac Tutorialv2
Adios Api Scidac Tutorialv2Adios Api Scidac Tutorialv2
Adios Api Scidac Tutorialv2
 
Interpreting the Data:Parallel Analysis with Sawzall
Interpreting the Data:Parallel Analysis with SawzallInterpreting the Data:Parallel Analysis with Sawzall
Interpreting the Data:Parallel Analysis with Sawzall
 
PVFS: A Parallel File System for Linux Clusters
PVFS: A Parallel File System for Linux ClustersPVFS: A Parallel File System for Linux Clusters
PVFS: A Parallel File System for Linux Clusters
 
Sector Sphere 2009
Sector Sphere 2009Sector Sphere 2009
Sector Sphere 2009
 
sector-sphere
sector-spheresector-sphere
sector-sphere
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
Debugging Python with gdb
Debugging Python with gdbDebugging Python with gdb
Debugging Python with gdb
 

More from George Markomanolis

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerGeorge Markomanolis
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PGeorge Markomanolis
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IGeorge Markomanolis
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIGeorge Markomanolis
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IGeorge Markomanolis
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIGeorge Markomanolis
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerGeorge Markomanolis
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...George Markomanolis
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIGeorge Markomanolis
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 

More from George Markomanolis (15)

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part I
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part II
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit Supercomputer
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

Recently uploaded

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Burst Buffer: From Alpha to Omega

• 11. Lustre — The Lustre file system is made up of: — A set of I/O servers called Object Storage Servers (OSSs) — Disks called Object Storage Targets (OSTs), which store the file data (chunks of files). We have 144 OSTs on Shaheen — The file metadata is handled by a Metadata Server (MDS) and stored on a Metadata Target (MDT)
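Before tuning anything, it can help to inspect this layout from a login node; a minimal sketch, assuming the Lustre scratch file system is mounted at /scratch and using standard lfs commands (the file path is a placeholder):
# List the OSTs of the file system and their usage (144 OSTs on Shaheen)
lfs df -h /scratch
# Show over which OSTs a particular file or directory is striped
lfs getstripe /scratch/$USER/some_file   # some_file is a placeholder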
  • 13. Lessons learned from Lustre — Important factors: — Striping — Aligned data — But… how parallel is the I/O?
• 14. Collective Buffering – MPI I/O aggregators — During a collective write, data is gathered into buffers on the aggregator nodes through MPI, and then these nodes write the data to the I/O servers. — Example: 8 MPI processes, 2 MPI I/O aggregators
• 15. How many MPI processes are writing a shared file? — With Cray-MPICH, we execute an application with 1216 MPI processes; it supports parallel I/O through Parallel NetCDF and the file size is 361GB: — First case (no striping): — mkdir execution_folder — copy the necessary files into the folder — cd execution_folder — run the application — Timing for Writing restart for domain 1: 674.26 elapsed seconds — Answer: 1 MPI process
• 16. How many MPI processes are writing a shared file? — With Cray-MPICH, we execute an application with 1216 MPI processes; it supports parallel I/O through Parallel NetCDF and the file size is 361GB: — Second case: — mkdir execution_folder — lfs setstripe -c 144 execution_folder — copy the necessary files into the folder — cd execution_folder — run the application — Timing for Writing restart for domain 1: 10.35 elapsed seconds — Answer: 144 MPI processes
  • 17. Extract the list of the MPI I/O aggregators nodes — export MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY=1 — First case: AGG Rank nid ---- ------ -------- 0 0 nid04184 — Second case: AGG Rank nid ---- ------ -------- 0 0 nid00292 1 8 nid00294 … 143 1144 nid04592
• 18. I/O performance on Lustre while increasing OSTs — [Chart: time in seconds to write the WRF 361GB restart file on Lustre vs. number of OSTs (5, 10, 20, 40, 80, 100, 144)]
• 19. How to declare the number of MPI I/O aggregators — By default, the number of MPI I/O aggregators equals the number of OSTs (the striping factor). — There are two ways to declare the striping (number of OSTs). — Execute the following command on an empty folder — lfs setstripe -c X empty_folder, where X is between 2 and 144, depending on the size of the files used. — Use the environment variable MPICH_MPIIO_HINTS to declare the striping per file: export MPICH_MPIIO_HINTS="wrfinput*:striping_factor=64,wrfrst*:striping_factor=144,wrfout*:striping_factor=144"
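A minimal sketch combining the two approaches above; run_dir is a placeholder directory on Lustre, and lfs getstripe is used only to verify the resulting layout:
# Approach 1: set the striping on an empty directory; new files inherit it
mkdir run_dir
lfs setstripe -c 144 run_dir     # stripe new files over 144 OSTs
lfs getstripe -d run_dir         # verify the directory's default striping

# Approach 2: let MPI-IO apply the striping per file-name pattern
export MPICH_MPIIO_HINTS="wrfinput*:striping_factor=64,wrfrst*:striping_factor=144,wrfout*:striping_factor=144"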
  • 20. Using Darshan tool to visualize I/O performance
• 21. Using the Darshan tool — Have you ever used the Darshan tool? — If the answer is "I don't know, probably not", then you may have used it anyway, as it is enabled automatically for most applications. — A KSL suite of tools (beta) provides easy access to the performance data from Darshan: — Connect to Shaheen II with the -X (or -Y) option of ssh — Execute: source /project/k01/markomg/scripts/darshan/load.sh — To load the performance data of the last simulation executed today, execute: — open_darshan.sh — To load the performance data of an application that is not the last simulation, or was not executed on the same day, pass the job id as an argument — open_darshan.sh job_id — You do not have access to Darshan data from other users — Example: open_darshan.sh 2707981 — git clone https://gmarkom@bitbucket.org/gmarkom/burst_buffer.git
• 22. Compare two executions through the Darshan tool — If you want to compare the executions of two applications, execute: — compare_darshan.sh job1_id job2_id — A single PDF file with the Darshan performance data of both executions is created — a consolidated sketch of the whole workflow follows
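For convenience, the KSL Darshan helper commands from the two slides above can be chained as follows; the login address and the second job id are placeholders:
ssh -X username@<shaheen_login_node>                 # X forwarding for the viewer
source /project/k01/markomg/scripts/darshan/load.sh  # load the KSL Darshan helpers
open_darshan.sh                      # last job executed today
open_darshan.sh 2707981              # a specific job id
compare_darshan.sh 2707981 <job2_id> # both reports in one PDF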
• 23. Discussion about Lustre — There are many parameters for optimizing Lustre; a quite interesting one is the stripe size, which declares the number of bytes to store on an OST before moving to the next OST — lfs setstripe -s X empty_folder, with X in bytes — export MPICH_MPIIO_HINTS="wrfinput*:striping_factor=64,wrfrst*:striping_factor=144:striping_unit=4194304,wrfout*:striping_factor=144:striping_unit=2097152"
• 24. MPI Options to use in general — export MPICH_ENV_DISPLAY=1 — Displays all settings used by MPI during execution — export MPICH_VERSION_DISPLAY=1 — Displays the MPI version — export MPICH_MPIIO_HINTS_DISPLAY=1 — Displays all the available I/O hints and their values — export MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY=1 — Displays the ranks that perform aggregation when using MPI I/O collective buffering — export MPICH_MPIIO_STATS=2 — Statistics on the actual read/write operations after collective buffering — export MPICH_MPIIO_HINTS="…" — Declares the I/O hints
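A minimal sketch of where these variables can sit in a batch script; the node count, time limit and executable name are placeholders:
#!/bin/bash
#SBATCH --partition=workq
#SBATCH --nodes=32
#SBATCH --ntasks=1024
#SBATCH -t 01:00:00

# Make the MPI and MPI-IO configuration visible in the job output
export MPICH_ENV_DISPLAY=1
export MPICH_VERSION_DISPLAY=1
export MPICH_MPIIO_HINTS_DISPLAY=1
export MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY=1
export MPICH_MPIIO_STATS=2

srun --ntasks=1024 ./my_app   # my_app is a placeholder executable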
• 25. Burst Buffer — Shaheen II: 268 Burst Buffer (BB) nodes, 536 SSDs, 1.52 PB in total; each node has 2 SSDs — BB adds a layer between the compute nodes and the parallel file system — Burst Buffer is the concept; Cray DataWarp (DW) is the I/O technology that implements it
  • 28. Burst Buffer – Use cases — Periodic burst — Transfer to PFS between bursts — I/O improvements — Accessed via POSIX I/O requests — Stage-in/stage-out — Shared BB allocation for multiple jobs — Coupling applications
  • 29. Burst Buffer - Status • 268 DataWarp (DW) nodes, total 1.52PB with granularity 397.44GB > dwstat most pool units quantity free gran wlm_pool bytes 1.52PiB 1.52PiB 397.44GiB did not find any of [sessions, instances, configurations, registrations, activations] > dwstat nodes node pool online drain gran capacity insts activs nid00002 wlm_pool true false 16MiB 5.82TiB 0 0 … nid07618 wlm_pool true false 16MiB 5.82TiB 0 0
  • 30. Check if there are jobs using BB > scontrol show burst Name=cray DefaultPool=wlm_pool Granularity=406976M TotalSpace=1636043520M UsedSpace=0 Flags=EnablePersistent StageInTimeout=1800 StageOutTimeout=1800 ValidateTimeout=5 OtherTimeout=300 AllowUsers=…markomg… GetSysState=/opt/cray/dw_wlm/default/bin/dw_wlm_cli If your username is not in the list of AllowUsers while you have applied for BB early access, send email to help@hpc.kaust.edu.sa scontrol show burst Name=cray DefaultPool=wlm_pool Granularity=406976M TotalSpace=1636043520M UsedSpace=813952M Flags=EnablePersistent StageInTimeout=1800 StageOutTimeout=1800 ValidateTimeout=5 OtherTimeout=300 AllowUsers=…,markomg… GetSysState=/opt/cray/dw_wlm/default/bin/dw_wlm_cli Allocated Buffers: JobID=2729000 CreateTime=2017-01-20T17:15:31 Pool=wlm_pool Size=813952M State=allocated UserID=markomg(137767) Per User Buffer Use: UserID=markomg(137767) Used=813952M
• 31. Burst Buffer Nodes Allocation • How many DW instances per node? DW_instances_per_node = 5.82*1024/397.44 = 14.995, so a DW node can accommodate up to 14*397.44/1024 = 5.43 TB • If a user requests 60TB of DW space, how many DW nodes will be reserved (for the striped mode, explained later)? We have 268 DW nodes; each node initially provides one DW instance and, when all of them are in use, the allocation starts again from the first DW node. The allocation is done on a round-robin basis. Requested_DW_nodes = 60*1024/397.44 = 154.58, so 155 DW nodes will be reserved. Important: If you reserve more than 268*397.44/1024 = 104TB, then some DW nodes will be used twice and this can cause I/O performance issues
• 32. Burst Buffer Modes • DW supports two access modes • Private: each compute job has its own private space on BB, which is not visible to other compute jobs. For now, data is not striped over BB nodes in private mode (not tested). Each compute node has access to a BB allocation equal to the granularity size. • Striped: the data is striped over several Burst Buffer nodes. BB nodes are allocated on a round-robin basis. We mainly use this mode. • BB supports two reservation modes • Scratch is a temporary space allocation that is removed when the job finishes • Persistent is for when you have many jobs that need to access the same files: this mode creates a DW space that persists after a job finishes and is available to your other DW jobs. Important: Persistent space is not a backup solution; you could lose your data in case of any BB problem
• 33. Burst Buffer Workflow • Initially the files are located on the Lustre file system • Files that need to be accessed multiple times, as well as any large files, should be moved to BB before your job starts. This phase is called stage-in; you can stage in either files or folders. We discuss access to small shared files later. • When the job finishes, the created files are copied back to the folder that the user declared in the script; this is called stage-out. • On BB, the files are located inside the path given by the environment variable $DW_JOB_STRIPED (for striped mode) Note: Stage-in and stage-out are not mandatory; it depends on what the user needs. Maybe there are no input files, or the user only wants to measure the execution time.
• 34. Modify the SLURM script • Lustre reservation
#!/bin/bash
#SBATCH --partition=workq
#SBATCH -t 10:00:00
#SBATCH -A k01
#SBATCH --nodes=32
#SBATCH --ntasks=1024
#SBATCH -J slurm_test
Comment: Insert the DW directives immediately after the SBATCH directives; do not place any unrelated commands between the SBATCH and DW declarations.
• 35. Modify the SLURM script • BB reservation (2TB of DW space)
#!/bin/bash
#SBATCH --partition=workq
#SBATCH -t 10:00:00
#SBATCH -A k01
#SBATCH --nodes=32
#SBATCH --ntasks=1024
#SBATCH -J slurm_test
#DW jobdw type=scratch access_mode=striped capacity=2TiB
#DW stage_in type=directory source=/scratch/markomg/for_bb destination=$DW_JOB_STRIPED
#DW stage_out type=directory destination=/scratch/markomg/back_up source=$DW_JOB_STRIPED/
cd $DW_JOB_STRIPED
chmod +x executable
Note: You can also stage in/out individual files instead of a directory
• 36. DataWarp – Restrictions • When you stage in executables, you must set the appropriate permissions on BB so that the file is executable (chmod +x executable) • Symbolic and hard links are lost during stage-in • Using the /projects folder with DW now appears to work
  • 37. BB – Check status of reservation — Execute: dwstat most pool units quantity free gran wlm_pool bytes 1.52PiB 1.52PiB 397.44GiB sess state token creator owner created expiration nodes 975 CA--- 2728666 SLURM 137767 2017-01-19T23:49:54 never 40 inst state sess bytes nodes created expiration intact label public confs 967 CA--- 975 794.88GiB 2 2017-01-19T23:49:54 never true I975-0 false 1 conf state inst type access_type activs 975 CA--- 967 scratch stripe 1 reg state sess conf wait 945 CA--- 975 975 true activ state sess conf nodes mount 884 CA--- 975 975 40 /var/opt/cray/dws/mounts/batch/2728666/ss — You can see only your jobs from this command
  • 38. Profiling MPI I/O on BB Question: Using 160 nodes with 1 MPI process per node and 2TB of DW space (6 DW nodes) with MPI I/O through PnetCDF, how many MPI I/O aggregators are saving the NetCDF file on BB? Answer: 6! Table 6: File Output Stats by Filename Write Time | Write MBytes | Write Rate | Writes | Bytes/ Call |File Name | | MBytes/sec | | | PE 710.752322 | 988,990.668619 | 1,391.470191 | 671,160.0 | 1,545,133.62 |Total |----------------------------------------------------------------------------- | 263.824253 | 369,720.282763 | 1,401.388533 | 46,690.0 | 8,303,272.98 |wrfrst_d01_2009- 12-18_00_30_00 ||---------------------------------------------------------------------------- || 45.299442 | 61,624.000000 | 1,360.369943 | 7,798.0 | 8,286,412.85 |pe.96 || 44.410365 | 61,616.000000 | 1,387.423860 | 7,795.0 | 8,288,525.83 |pe.160 || 43.762797 | 61,623.999999 | 1,408.136675 | 7,763.0 | 8,323,772.69 |pe.32 || 43.708663 | 61,616.148647 | 1,409.701068 | 7,762.0 | 8,323,784.42 |pe.0 || 43.532686 | 61,616.134117 | 1,415.399323 | 7,764.0 | 8,321,638.26 |pe.128 || 43.110299 | 61,624.000000 | 1,429.449598 | 7,808.0 | 8,275,800.13 |pe.64 || 0.000000 | 0.000000 | -- | 0.0 | -- |pe.1 || 0.000000 | 0.000000 | -- | 0.0 | -- |pe.2 || 0.000000 | 0.000000 | -- | 0.0 | -- |pe.3 || 0.000000 | 0.000000 | -- | 0.0 | -- |pe.4 || 0.000000 | 0.000000 | -- | 0.0 | -- |pe.5 || 0.000000 | 0.000000 | -- | 0.0 | -- |pe.6 || 0.000000 | 0.000000 | -- | 0.0 | -- |pe.7
• 39. How do we choose the number of MPI I/O aggregators on BB? — In this example we have parallel I/O and we can adjust the number of MPI I/O aggregators independently of the number of MPI processes used by the application — MPICH_MPIIO_HINTS — export MPICH_MPIIO_HINTS="wrfrst*:cb_nodes=80,wrfout*:cb_nodes=40" — This environment variable is supported on DataWarp since Cray-MPICH v7.4 — In this case we select 80 MPI I/O aggregators for the files whose names start with wrfrst*, and 40 MPI I/O aggregators for the files whose names start with wrfout*. — Although this depends on the application, in our experience the performance is not always good with one MPI I/O aggregator per DW node (the default behavior). To stress the SSDs of a DW node, more than one MPI process should write data per DW node, and this is achieved through the MPI I/O aggregators. — Depending on the file size, sometimes we need a different number of MPI I/O aggregators per file.
• 40. How do we choose the number of MPI I/O aggregators on BB? — Tips: — The number of MPI I/O aggregators should divide the total number of MPI processes, for better load balancing. For example, if you have 1024 MPI processes, do not declare 100 MPI I/O aggregators, but 128 or 64. However, in some cases it does not matter. — The number of requested DW nodes should in turn divide the number of MPI I/O aggregators, again for better load balancing. — Of course, the requested DW nodes should provide enough space for all of your experiments, so there is a minimum number of DW nodes needed. — MPI_IO_aggregators = DW_nodes (if we use one aggregator per DW node) or k * DW_nodes, where k ∈ ℕ, 2 ≤ k ≤ 16 — Number_of_total_MPI_processes = l * MPI_IO_aggregators, where l ∈ ℕ, l ≥ 2 — a sketch of how to turn these rules into hint values follows
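A small sketch that applies these two rules and builds a hint string; the DW node count, rank count, multiplier k and the wrfout* file pattern are illustrative values, not prescriptions:
DW_NODES=8            # number of DataWarp nodes requested
TOTAL_RANKS=1024      # total MPI processes of the job
K=8                   # aggregators per DW node (2 <= K <= 16, as suggested above)

AGGR=$((DW_NODES * K))                      # candidate number of MPI I/O aggregators
if [ $((TOTAL_RANKS % AGGR)) -ne 0 ]; then
  echo "Warning: $AGGR aggregators do not divide $TOTAL_RANKS ranks evenly"
fi
export MPICH_MPIIO_HINTS="wrfout*:cb_nodes=$AGGR"
echo "Using $AGGR MPI I/O aggregators for wrfout* files"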
• 41. Compute the required DW space — As already mentioned, we need to have enough space for our experiments — If the experiments are about DW scalability and the number of MPI/OpenMP processes remains constant, then you can vary the MPI I/O aggregators and the number of DW nodes. — If you need, for example, 64 DW nodes, calculate the requested space as follows: — Multiply by the DW granularity: — 64*397.44 = 25436.16 — Round down and request this value in the submission script, for example: — 25436GiB
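A small sketch of this arithmetic in both directions, using the 397.44 GiB granularity reported by dwstat; awk is used only because the granularity is not an integer:
GRAN=397.44   # DataWarp granularity in GiB (from dwstat)

# Capacity to request so that exactly N DW nodes are allocated (round down, as above)
N=64
awk -v n="$N" -v g="$GRAN" 'BEGIN { printf "capacity to request: %dGiB\n", int(n*g) }'
# -> 25436GiB for 64 DW nodes

# Number of DW nodes obtained for a requested capacity (round up)
CAP_TB=60
awk -v c="$CAP_TB" -v g="$GRAN" 'BEGIN { n = c*1024/g; printf "%d DW nodes\n", (n == int(n)) ? n : int(n)+1 }'
# -> 155 DW nodes for 60 TB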
• 42. How to compile your code for better performance on DW — Because of the MPICH_MPIIO_HINTS environment variable, Cray-MPICH v7.4 or later is required. — Load the appropriate versions through the cdt module: — module load cdt/16.08 Switching to atp/2.0.2. Switching to cce/8.5.2. Switching to cray-libsci/16.07.1. Switching to cray-mpich/7.4.2. Switching to craype/2.5.6. Switching to modules/3.2.10.4. Switching to pmi/5.0.10-1.0000.11050.0.0.ari. — Then compile your application — Be careful: the gateway nodes sometimes have different default library versions
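A minimal compile sketch using the standard Cray compiler wrappers once the cdt module is loaded; the source and executable names are placeholders:
module load cdt/16.08             # brings in cray-mpich/7.4.2, as shown above
cc  -o my_io_app my_io_app.c      # the cc wrapper links Cray MPICH automatically
ftn -o my_io_app my_io_app.f90    # or the ftn wrapper for Fortran codes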
• 43. Create persistent DW space I — Create persistent DW space
#!/bin/bash -x
#SBATCH --partition=workq
#SBATCH -t 1
#SBATCH -A k01
#SBATCH --nodes=1
#SBATCH -J create_persistent_space
#BB create_persistent name=george_test capacity=600G access=striped type=scratch
exit 0
  • 44. Checking the status of the persistent DW reservation > dwstat most sess state token creator owner created expiration nodes 985 CA--- george_test CLI 137767 2017-01-20T18:01:00 never 0 inst state sess bytes nodes created expiration intact label public confs 977 CA--- 985 794.88GiB 2 2017-01-20T18:01:01 never true george_test true 1 > dwstat nodes node pool online drain gran capacity insts activs nid01349 wlm_pool true false 16MiB 5.82TiB 1 0 nid01410 wlm_pool true false 16MiB 5.82TiB 1 0
• 45. Use DW persistent space I
#!/bin/bash
#SBATCH --partition=workq
#SBATCH -t 10
#SBATCH -A k01
#SBATCH --nodes=1
#DW persistentdw name=george_test
#DW stage_in type=directory source=/project/k01/markomg/wrf destination=$DW_PERSISTENT_STRIPED_george_test
cd $DW_PERSISTENT_STRIPED_george_test/
…
exit 0
  • 46. Use DW persistent space II — Now, you can execute the second job on the persistent space, however, do not stage-in the same files: — squeue -u markomg JOBID USER ACCOUNT NAME ST REASON START_TIME TIME TIME_LEFT NODES 2729358 markomg k01 test PD burst_buf N/A 0:00 3:00 40 — scontrol show job 2729358 … JobState=PENDING Reason=burst_buffer/cray:_dws_data_in:_Error_creating_staging_object_for_file_(/ scratch/markomg/burst_buffer_early_access/wrfchem/wrfchem- 3.7.1_burst/test/em_real/forburst)_-2_Staging_failures_reported_ Dependency=(null) ... — scancel 2729358 If the problem is not solved send us email immediately! help@hpc.kaust.edu.sa, inform also the BB users through bb_users@hpc.lists.kaust.edu.sa
• 47. Use DW persistent space III — Submit other jobs that either stage in different files or skip stage-in altogether — If you want to connect interactively to a compute node to access BB and check the files, follow these instructions: — Create a file, called for example persistent.conf, with the following content: — #DW persistentdw name=george_test — Execute:
salloc -N 1 -t 00:10:00 --bbf="persistent.conf"
srun -N 1 bash -i
cd $DW_PERSISTENT_STRIPED_george_test
— markomg@nid00024:/var/opt/cray/dws/mounts/batch/george_test/ss
  • 48. Use DW persistent space IV — Three jobs were executed on persistent DW space and created a job folder with the job id as their name: nid00024:/var/opt/cray/dws/mounts/batch/george_test/ss/ ls -l 2729* 2729356: -rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 18:49 wrfout_d01_2007-04-03_00_00_00 -rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 18:49 wrfout_d01_2007-04-03_01_00_00 2729361: -rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 18:57 wrfout_d01_2007-04-03_00_00_00 -rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 18:57 wrfout_d01_2007-04-03_01_00_00 2729362: -rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 19:09 wrfout_d01_2007-04-03_00_00_00 -rw-r--r-- 1 markomg g-markomg 3053617664 Jan 20 19:09 wrfout_d01_2007-04-03_01_00_00
• 49. Finalize the DW persistent reservation — When the experiments are finished, stage out the files (if required). Do not copy the files back to Lustre from the interactive session, as this can be much slower, depending on the file sizes. — Finally, delete the DW space:
#!/bin/bash
#SBATCH --partition=workq
#SBATCH -t 1
#SBATCH -A k01
#SBATCH --nodes=1
#SBATCH -J delete_persistent_space
#BB destroy_persistent name=george_test
exit 0
  • 50. Fuse Blown error – just in case… > dwstat most pool units quantity free gran wlm_pool bytes 1.52PiB 1.52PiB 397.44GiB sess state token creator owner created expiration nodes 1007 D---- 2729546 SLURM 137767 2017-01-21T00:27:50 never 0 1008 D---- 2729547 SLURM 137767 2017-01-21T00:27:57 never 0 inst state sess bytes nodes created expiration intact label public confs 991 D---- 1007 397.44GiB 1 2017-01-21T00:27:50 never false I1007- 0 false 1 992 D---- 1008 397.44GiB 1 2017-01-21T00:27:58 never false I1008- 0 false 1 conf state inst type access_type activs 999 DAF-- 991 scratch stripe 0 1000 DAF-- 992 scratch stripe 0 Do not submit any more experiments and report to us!
• 52. Early Results I • 8 May 2016: 8192 MPI processes, 120 OSTs for Lustre, 256 DataWarp nodes, and 256 compute nodes. We have a total of 256 * 32 tiles and the size of each tile ranges from 16 to 1024MB, so we save a file whose size ranges from 128GB to 8TB. The collective mode behaves worse for Lustre as the file size increases, and for a 1024MB element size the bandwidth of non-collective Lustre and collective DataWarp is the same.
• 53. Early Results II • 12 May 2016: Shaheen II power uncapped, 2048 compute nodes and 256 DataWarp nodes. It seems that with 65536 MPI processes, as we increase the MB per element, some nodes do not perform well. In the following figure the MB per element ranges from 16 up to 1024. This means that the 32768 MPI processes save one single shared file with a size from 512GB up to 32TB, while the 65536 MPI processes save a file from 1TB up to 64TB. The maximum write bandwidth is ~749GB/s for 32768 MPI processes with 512 MB per element.
• 55. WRF-CHEM on Burst Buffer — Weather Research and Forecasting model coupled with Chemistry — Small domain: 330 x 275 — Size of the input file: 804 MB — Size of the output file: 2.9GB, saved every hour of simulated time — The output file is quite small — For all the WRF-CHEM experiments we use 1280 MPI processes (40 nodes), as this is the optimum for computation/communication — For the default case, we stage in all the files and execute the simulation from BB
• 56. Total execution time and I/O on BB without MPICH_MPIIO_HINTS (default) — [Chart: total time, input-file read time and output-file write time in seconds vs. number of DW nodes (1, 2, 4, 8), with the Lustre 64-OST result as reference] • The best total execution time is provided when we have 4 DW nodes • On Lustre the best execution time is with 64 OSTs and it is 15% faster than BB
  • 57. Darshan – 4 DW nodes
  • 58. Understand the MPI I/O statistics on BB (MPICH_MPIIO_STATS=2) I +--------------------------------------------------------+ | MPIIO read access patterns for wrfinput_d01 | independent reads = 1 | collective reads = 527360 | independent readers = 1 | aggregators = 4 | stripe count = 1 | stripe size = 8388608 | system reads = 762 | stripe sized reads = 108 | total bytes for reads = 2930104643 = 2794 MiB = 2 GiB | ave system read size = 3845281 | number of read gaps = 2 | ave read gap size = 0 +--------------------------------------------------------+ Timing for processing wrfinput file (stream 0) for domain 1: 13.19130 elapsed seconds We have 4 MPI I/O aggregators on 1 DW node Default stripe size is 8MB Only the 14.17% of the reads are striped Average system read size is less than 4MB
  • 59. Understand the MPI I/O statistics on BB (MPICH_MPIIO_STATS=2) II +--------------------------------------------------------+ | MPIIO write access patterns for wrfout_d01_2007-04-03_00_00_00 | independent writes = 2 | collective writes = 552960 | independent writers = 1 | aggregators = 4 | stripe count = 1 | stripe size = 8388608 | system writes = 797 | stripe sized writes = 114 | total bytes for writes = 3045341799 = 2904 MiB = 2 GiB | ave system write size = 3821006 | read-modify-write count = 0 | read-modify-write bytes = 0 | number of write gaps = 2 | ave write gap size = 4194300 +--------------------------------------------------------+ Timing for Writing wrfout_d01_2007-04-03_00_00_00 for domain 1: 18.83090 elapsed seconds Only 14.3% of the writes are striped There are 2 gaps of almost 4MB
• 60. Comparison of the total execution time on a single DW node across various numbers of MPI I/O aggregators — [Chart: total time of 188, 136, 135 and 131 seconds for 1, 2, 4 and 8 MPI I/O aggregators, respectively] • Example of declaring 2 MPI I/O aggregators: export MPICH_MPIIO_HINTS="wrfinput*:cb_nodes=2,wrfout*:cb_nodes=2" • Tip: You can declare a different number of MPI I/O aggregators per file
• 61. Studying the influence of the stripe size on the performance of the application, for 8 MPI I/O aggregators on a single DW node — [Chart: total time of 131, 102 and 105 seconds for stripe sizes of 8, 2 and 1 MB, respectively] • Changing the stripe size from 8 to 2 MB improved the total execution time by 22% • Example of 8 MPI I/O aggregators and a 2MB stripe size (striping_unit flag): export MPICH_MPIIO_HINTS="wrfinput*:cb_nodes=8:striping_unit=2097152,wrfout*:cb_nodes=8:striping_unit=2097152" • Why is the performance not improved for a stripe size of 1 MB?
  • 62. MPI I/O statistics for stripe size equal to 1MB +--------------------------------------------------------+ | MPIIO write access patterns for wrfout_d01_2007-04-03_00_00_00 | independent writes = 2 | collective writes = 552960 | independent writers = 1 | aggregators = 8 | stripe count = 1 | stripe size = 1048576 | system writes = 3338 | stripe sized writes = 2606 | total bytes for writes = 3045341799 = 2904 MiB = 2 GiB | ave system write size = 912325 | read-modify-write count = 0 | read-modify-write bytes = 0 | number of write gaps = 2 | ave write gap size = 524284 +--------------------------------------------------------+ Timing for Writing wrfout_d01_2007-04-03_00_00_00 for domain 1: 9.09640 elapsed seconds 78% of the writes are striped but the system does now 3338 writes instead of 797 Let’s increase the MPI I/O aggregators because there are many write calls now
  • 63. Increasing the number of the MPI I/O aggregators for stripe size equal to 1MB +--------------------------------------------------------+ | MPIIO write access patterns for wrfout_d01_2007-04-03_00_00_00 | independent writes = 2 | collective writes = 552960 | independent writers = 1 | aggregators = 20 | stripe count = 1 | stripe size = 1048576 | system writes = 3338 | stripe sized writes = 2606 | total bytes for writes = 3045341799 = 2904 MiB = 2 GiB | ave system write size = 912325 | read-modify-write count = 0 | read-modify-write bytes = 0 | number of write gaps = 2 | ave write gap size = 524284 +--------------------------------------------------------+ Timing for Writing wrfout_d01_2007-04-03_00_00_00 for domain 1: 6.40781 elapsed seconds Finally, by increasing the MPI I/O aggregators to 40 and decreasing the stripe size to 256KB, we achieve the best result, of total execution time 97 seconds.
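The slide does not show the final hint string; a sketch consistent with the configuration described here (40 MPI I/O aggregators and a 256 KB stripe size, i.e. striping_unit=262144 bytes) might look like:
# 40 MPI I/O aggregators and a 256 KB stripe size for the WRF output files
export MPICH_MPIIO_HINTS="wrfout*:cb_nodes=40:striping_unit=262144"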
• 64. Comparison between BB and Lustre — [Chart: total execution time with default and optimized parameters; BB: 188 s (default), 97 s (optimized); Lustre: 114 s (default), 112 s (optimized)] — The total execution time on BB is 13.4% faster than Lustre for one hour of simulation of WRF-CHEM. For 24 hours of simulation, the execution time on BB is faster than Lustre by 14.8%. — We achieved better performance with BB by using one single BB node in comparison to 64 OSTs of Lustre
• 65. WRF-CHEM – Split output to one file per process — [Chart: reported "I/O" time from WRF-CHEM in seconds vs. number of DW nodes (1, 2, 4, 8, 16, 32, 64), for the 2.9GB output file]
• 66. WRF-CHEM – Split output to one file per process II — [Chart: I/O efficiency in %, based on the reported "I/O" time from WRF-CHEM, vs. number of DW nodes (1, 2, 4, 8, 16, 32, 64), for the 2.9GB output file]
• 67. WRF – Split output to one file per process – Large cases — [Chart: reported "I/O" time from WRF in seconds vs. number of DW nodes (4, 8, 16, 32, 64, 128), for the 81GB and 361GB files]
• 68. WRF – Split output to one file per process II — [Chart: I/O efficiency in %, based on the reported "I/O" time from WRF, vs. number of DW nodes (4, 8, 16, 32, 64, 128), for the 81GB and 361GB files]
• 69. What happens if we try to scale on BB with this example? For this experiment we use 64 DW nodes; the file namelist.input is only 12KB (Darshan reports a wrong file size), but 1280 MPI processes access it for reading, so there is a metadata issue
• 70. Solution — Tip: You need to declare the path of the files in MPICH_MPIIO_HINTS: export MPICH_MPIIO_HINTS="$DW_JOB_STRIPED/wrfinput*:cb_nodes=40 … $DW_JOB_STRIPED/wrfout*:cb_nodes=40…" — Hybrid execution: we save the small files on Lustre and the large ones on BB
• 71. Multiple runs — Submitting 3 jobs of 20 compute nodes each, with each job requesting 64 DW nodes — used_bb_nodes.sh: 192 BB nodes are used by at least one BB job; 0 BB nodes are used by more than one BB job — Variation 2-3%
• 72. DataWarp vs Lustre for the same number of nodes (OSTs) — [Chart: I/O time in seconds for the WRF restart file, size 361 GB, vs. number of OSTs or DW nodes (5, 10, 20, 40, 80, 100, 144), for Lustre and DataWarp] — WRF reports I/O time but it includes other functionalities which are beyond I/O
• 73. DataWarp vs Lustre, percentage of performance difference — [Chart: performance difference in % between DataWarp and Lustre for the same number of nodes, vs. number of OSTs or DW nodes (5, 10, 20, 40, 80, 100, 144)]
• 74. WRF – Lustre vs DW — [Chart: total execution time in seconds vs. number of DW nodes (4, 8, 16, 32, 64, 128), for Lustre and DW]
• 75. NAS – BT I/O Benchmark – PNetCDF — [Chart: I/O bandwidth in GB/s vs. number of DW nodes (2, 4, 8, 16, 32, 64, 128, 256), for DW, Lustre with 128 compute nodes, and Lustre with 512 compute nodes]
• 76. NAS – BT I/O Benchmark – PNetCDF — [Chart: I/O efficiency in % vs. number of DW nodes (2, 4, 8, 16, 32, 64, 128, 256)]
  • 77. PIDX — PIDX is an efficient parallel I/O library that reads and writes multiresolution IDX data files — It can provide high scalability up to 768k cores — Successful integration with several simulation codes — KARFS (KAUST Adaptive Reacting Flow Solvers) on Shaheen II — Uintah with production runs on Mira — S3D
  • 78.
• 79. PIDX on BB — [Chart: write I/O bandwidth in GB/s vs. number of BB nodes or OSTs (16, 32, 64, 128, 144, 256), for BB and Lustre]
• 80. Efficiency based on the IOR peak — [Chart: efficiency in % vs. number of BB nodes (16, 32, 64, 128, 144, 256)]
• 81. DataWarp API — libdatawarp — dw_get_stripe_configuration — dw_query_directory_stage — dw_query_file_stage — dw_set_stage_concurrency — dw_stage_file_out — dw_wait_directory_stage
• 82. BB Policy and Calendar — All the projects should schedule time slots to execute their experiments — Send me your gmail accounts, or other email addresses, for the Google calendar — Slots of up to half a day — It is mandatory to be trained before you get access to the Burst Buffer
• 83. Conclusions — We explained how Lustre and BB operate — Many parameters need to be investigated for optimum performance — Be patient; you will probably not achieve the best performance immediately — Be careful to compile your application with the appropriate Cray MPICH — CLE 6.0 will solve some BB issues — Be open, be collaborative — Be active on bb_users@lists.hpc.kaust.edu.sa — Always check the output of the application, the MPI version used, etc.