Jialin Liu
Data Analytics & Service Group
NERSC/LBNL

Parallel IO
- 1 -
June 30, 2017
Scaling to Petascale Institute
Outline
-	2	-	
- I/O Challenges in 2020/2025
- HPC I/O Stack
  - Hardware: HDD, SSD
  - Software: Lustre, MPI-IO, HDF5, H5py
  - Profiling I/O with Darshan
- Optimizing and Scaling I/O
- HPC I/O & Storage Trend
  - Burst Buffer
  - Object Store
Introduction: I/O Challenges in 2020/2025
-	3	-	
- Scientific applications/simulations generate massive quantities of data.
- Example: BES (Basic Energy Sciences) requirements review, 2015
  - 19 projects reviewed
  - Example projects: quantum materials, soft matter, combustion, ...
(Chart: average increasing ratio across the projects.)
Common I/O Issues
- Bandwidth
  - "The peak bandwidth is XXX GB/s, why can I only get 1% of that?"
- Scalability
  - "I have used more I/O processes, why doesn't the performance scale?"
- Metadata
  - "File closing is so slow in my test..."
- Pain of productivity
  - "I like to use Python/Spark, but the I/O seems slow"
4
What does Parallel I/O Mean?
5	
- At the program level:
  - Concurrent reads or writes from multiple processes to a common file
- At the system level:
  - A parallel file system and hardware that support such concurrent access
- William Gropp
HPC I/O Software Stack
(Figure: the HPC I/O software stack, top to bottom: Application, Productive Interface, High-Level I/O Library, I/O Middleware, I/O Forwarding, Parallel File System, I/O Hardware.)

Productive Interface builds a thin layer on top of an existing high-performance I/O library for productive big data analytics.
  Examples: H5py, H5Spark, Julia, Pandas, Fits

High-Level I/O Libraries map application abstractions onto storage abstractions and provide data portability.
  Examples: HDF5, Parallel netCDF, ADIOS

I/O Middleware organizes accesses from many processes, especially those using collective I/O.
  Examples: MPI-IO, GLEAN, PLFS

I/O Forwarding transforms I/O from many clients into fewer, larger requests; reduces lock contention; and bridges between the HPC system and external storage.
  Examples: IBM ciod, IOFSL, Cray DVS, Cray DataWarp

The parallel file system maintains the logical file model and provides efficient access to data.
  Examples: PVFS, PanFS, GPFS, Lustre
6
Data Complexity in Computational Science
- Applications use advanced data models to fit the problem at hand
  - Multidimensional typed arrays, images composed of scan lines, ...
  - Headers, attributes on data
- I/O systems have very simple data models
  - Tree-based hierarchy of containers
  - Some containers have streams of bytes (files)
  - Others hold collections of other containers (directories or folders)
Effective mapping from application data models to I/O system data models is the key to I/O performance.

(Figure captions: Right interior carotid artery, platelet aggregation. Model complexity: spectral element mesh (top) for a thermal hydraulics computation coupled with a finite element mesh (bottom) for a neutronics calculation. Scale complexity: spatial range from the reactor core in meters to fuel pellets in millimeters. Images from T. Tautges (ANL) (upper left), M. Smith (ANL) (lower left), and K. Smith (MIT) (right).)
7
I/O Hardware
- Storage side
  - Hard Disk Drive (traditional)
  - Solid State Drive (future)
- Compute side
  - DRAM, cache (traditional)
  - HBM (e.g., MCDRAM), NVRAM (e.g., 3D XPoint)
8
(Figure: DRAM, HDD, SSD, and on-package MCDRAM. Courtesy of Tweaktown.)
I/O Hardware: HDD
9	
Contiguous I/O
- read time, ~0.1 ms
Noncontiguous I/O
- seek time, ~4 ms
- rotation time, ~3 ms
- read time, ~0.1 ms
A noncontiguous access therefore costs roughly 7.1 ms versus 0.1 ms for a contiguous one, about a 70x penalty per access.
SSD: no moving parts
HPC I/O Hierarchy
(Figure: the HPC I/O hierarchy, past vs. future. Past: CPU, memory (DRAM), storage (HDD). Future: CPU, near memory (HBM), far memory (DRAM), near storage (SSD), far storage (HDD), with the on-chip/off-chip boundary shifting accordingly.)
- 10 -
Parallel File System
- Store application data persistently
  - Usually extremely large datasets that can't fit in memory
- Provide a global shared namespace (files, directories)
- Designed for parallelism
  - Concurrent (often coordinated) access from many clients
- Designed for high performance
  - Operate over high-speed networks (IB, Myrinet, Portals)
  - Optimized I/O path for maximum bandwidth
- Examples
  - Lustre: most leadership supercomputers have deployed Lustre
  - PVFS -> OrangeFS
  - GPFS -> IBM Spectrum Scale, commercial & HPC
11
Parallel File System: Lustre
12	
(Figure from Lustre.org)
OSS: Object Storage Server
OST: Object Storage Target
MDT: Metadata Target (served by the MDS, the Metadata Server)
Parallel File System: Cori FS
13	
(Figure: Cori's Lustre file system. 1 primary MDS plus 4 additional MDSs (MDS1-MDS4, ADU1/ADU2); 248 OSSs with their OSTs reached over InfiniBand; LNET router nodes (130) bridging the Haswell partition (2004 compute nodes) and the KNL partition (9688 compute nodes) on the Aries network to the storage side.)
I/O Forwarding
- A layer between the computing system and the storage system
- The compute-node kernel ships I/O to dedicated I/O nodes [1]
- Examples
  - Cray DVS
  - IOFSL
  - Cray DataWarp
14
1. Accelerating I/O Forwarding in IBM Blue Gene/P Systems
2. http://www.glennklockwood.com/data-intensive/storage/io-forwarding.html
I/O Forwarding: Cray DVS
15	
DVS on a parallel file system, e.g., GPFS or Lustre
- The DVS clients can spread their I/O traffic across the DVS servers using a deterministic mapping.
- Configurable number of DVS clients
- Reduces the number of clients that communicate with the backing file system (GPFS supports a limited number of clients)
Stephen Sugiyama et al., Cray DVS: Data Virtualization Service, CUG 2008
I/O Middleware
- Why additional I/O software?
  - Additional I/O software provides improved performance and usability over directly accessing the parallel file system.
  - Reduces or (ideally) eliminates the need for optimization in application codes.
- MPI-IO
  - I/O interface specification for use in MPI apps
  - Data model is the same as POSIX: a stream of bytes in a file
- MPI-IO features
  - Collective I/O
  - Noncontiguous I/O with MPI datatypes and file views
  - Nonblocking I/O
  - Fortran bindings (and additional languages)
  - System for encoding files in a portable format (external32)
16
What’s Wrong with POSIX?
- It's a useful, ubiquitous interface for basic I/O
- It lacks constructs useful for parallel I/O
  - A cluster application is really one program running on N nodes, but it looks like N programs to the file system
  - No support for noncontiguous I/O
  - No hinting/prefetching
- Its rules hurt performance for parallel apps
  - Atomic writes, read-after-write consistency
  - Attribute freshness
- POSIX should not have to be used (directly) in parallel applications that want good performance
  - But developers use it anyway
17
Independent and Collective I/O
- Independent I/O operations specify only what a single process will do
  - Independent I/O calls do not convey relationships between the I/O done on different processes
- Why use independent I/O?
  - Sometimes the synchronization of collective calls is not natural
  - Sometimes the overhead of collective calls outweighs their benefits
    - Example: very small I/O during metadata operations
18
(Figure: processes P0-P5 each issuing independent I/O.)
Independent and Collective I/O
- Collective I/O is coordinated access to storage by a group of processes
  - Collective I/O functions are called by all processes participating in I/O
- Why use collective I/O?
  - It lets the I/O layers know more about the access as a whole, which gives the lower software layers more opportunities for optimization and yields better performance
  - Combined with noncontiguous accesses, it yields the highest performance
19
(Figure: processes P0-P5 performing one coordinated, collective I/O operation. A minimal code sketch contrasting the two modes follows.)
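As a concrete illustration of the two modes, here is a minimal sketch (not from the original slides) in which each rank writes its own block of a shared file, first independently and then collectively; the file name and block size are arbitrary choices.

#include <mpi.h>
#include <stdlib.h>

#define BLOCK 1048576                     /* ints per rank (arbitrary) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *buf = malloc(BLOCK * sizeof(int));
    for (int i = 0; i < BLOCK; i++) buf[i] = rank;

    MPI_File fh;
    MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(int);
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Independent: each rank writes on its own, with no coordination. */
    MPI_File_write_at(fh, offset, buf, BLOCK, MPI_INT, MPI_STATUS_IGNORE);

    /* Collective: every rank calls together, so the MPI-IO layer can see the
       whole access and aggregate it (the same region is rewritten here only
       to show both call forms side by side). */
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}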
Two Key Optimizations in ROMIO (MPI-IO)
- MPI-IO has many implementations
  - ROMIO
  - Cray, IBM, and OpenMPI all have their own implementations/variants.
- Data sieving
  - For independent noncontiguous requests
  - ROMIO makes large I/O requests to the file system and, in memory, extracts the data requested
  - For writing, a read-modify-write is required
- Two-phase collective I/O
  - A communication phase to merge data into large chunks
  - An I/O phase to write the large chunks in parallel
20
Contiguous and Noncontiguous I/O
- Contiguous I/O moves data from a single memory block into a single file region
- Noncontiguous I/O has three forms:
  - Noncontiguous in memory, noncontiguous in file, or noncontiguous in both
- Structured data leads naturally to noncontiguous I/O (e.g., block decomposition)
- Describing noncontiguous accesses with a single operation passes more knowledge to the I/O system
21
(Figure: Process 0 with data noncontiguous in the file and noncontiguous in memory; ghost cells vs. stored elements; Vars 0, 1, 2, 3, ... 23. Extracting variables from a block and skipping ghost cells results in noncontiguous I/O.)
Example: Collective I/O for Noncontiguous I/O
22
Courtesy of William Gropp
(Figure: a large array distributed among 16 processes; each square represents a subarray in the memory of a single process. The access pattern in the file is row major.)
Example: Collective I/O for Noncontiguous I/O
23
Row-by-row independent I/O: each process makes one independent read request for each row in its local array.

MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
for (i = 0; i < n_local_rows; i++) {
    MPI_File_seek(fh, ...);
    MPI_File_read(fh, &(A[i][0]), ...);
}
MPI_File_close(&fh);

Datatype, file view, and collective I/O: each process creates a derived datatype to describe its noncontiguous access pattern, defines a file view, and calls the collective read function (a fuller, runnable sketch follows below).

MPI_Type_create_subarray(ndims, ..., &subarray);
MPI_Type_commit(&subarray);
MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
MPI_File_set_view(fh, ..., subarray, ...);
MPI_File_read_all(fh, A, ...);
MPI_File_close(&fh);
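A fuller, runnable sketch of the collective variant (not from the slides); the global array shape, the row-block decomposition, and the file name are assumptions chosen for illustration, and 1024 is assumed to be divisible by the number of ranks.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Assume a 1024x1024 global array of ints split into row blocks. */
    int gsizes[2] = {1024, 1024};
    int lsizes[2] = {1024 / nprocs, 1024};
    int starts[2] = {rank * lsizes[0], 0};

    MPI_Datatype subarray;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &subarray);
    MPI_Type_commit(&subarray);

    int *A = malloc((size_t)lsizes[0] * lsizes[1] * sizeof(int));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    /* The file view tells MPI-IO which file regions belong to this rank... */
    MPI_File_set_view(fh, 0, MPI_INT, subarray, "native", MPI_INFO_NULL);
    /* ...and the collective read lets ROMIO aggregate all ranks' requests. */
    MPI_File_read_all(fh, A, lsizes[0] * lsizes[1], MPI_INT,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&subarray);
    free(A);
    MPI_Finalize();
    return 0;
}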
Example: Collective I/O + Noncontiguous I/O
24
Example: Collective I/O + Noncontiguous I/O
25
http://wgropp.cs.illinois.edu/
High Level I/O Libraries
- Take advantage of high-performance parallel I/O while reducing complexity
- Add a well-defined layer to the I/O stack
- Allow users to specify complex data relationships and dependencies
- Come with machine-independent, self-describing data formats suitable for array-oriented scientific data
- Examples
  - HDF5: the HDF Group, since 1989, among the top 5 libraries at NERSC
  - Parallel netCDF: NWU and ANL, since 2001
  - ADIOS: ORNL, since 2009
26
High Level I/O Libraries: HDF5
MPI_Init(&argc, &argv);
/* File access property list: use the MPI-IO (parallel) file driver. */
fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, comm, info);
file_id = H5Fcreate(FNAME, …, fapl_id);
space_id = H5Screate_simple(…);
dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id, …);
/* Dataset transfer property list: request collective I/O for the write. */
xf_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);
status = H5Dwrite(dset_id, H5T_NATIVE_INT, …, xf_id, …);
MPI_Finalize();
27
- A parallel HDF5 program has only a few more calls than a serial one
Productive I/O Interface
- Big data analytics stacks
  - Spark
  - TensorFlow
  - Caffe
- Science data needs to be loaded efficiently into these engines, e.g., via:
  - H5py
  - H5Spark
  - Fitsio
28
Productive I/O Interface: H5py
29
HDF5 C API (libhdf5):
    hsize_t H5Dget_storage_size(hid_t dset_id)
Cython layer (h5d.pyx):
    cdef class DatasetID(ObjectID):
        def get_storage_size(self):
            return H5Dget_storage_size(self.id)
Python high-level layer (_hl/dataset.py):
    class Dataset(HLObject):
        @property
        def storagesize(self):
            return self.id.get_storage_size()
(Figure: the high-level File, Group, and Dataset objects wrap the low-level FileID and DatasetID identifiers.)
Productive I/O Interface: H5py
30	
(The slide shows side-by-side h5py code for independent I/O and collective I/O; a sketch of both is given below.)
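For reference, a minimal sketch of what such h5py code looks like (assumptions: h5py built against parallel HDF5 with mpi4py available; the launcher command, file name, dataset name, and sizes are illustrative):

# run with something like: mpirun -n 4 python write_h5py.py
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.rank, comm.size
n = 1024 * 1024                         # elements per rank (arbitrary)

# Open the file with the MPI-IO ("mpio") driver so all ranks share it.
f = h5py.File('parallel.h5', 'w', driver='mpio', comm=comm)
dset = f.create_dataset('x', (nprocs * n,), dtype='f8')

data = np.full(n, rank, dtype='f8')

# Independent I/O: each rank writes its slice on its own.
dset[rank * n:(rank + 1) * n] = data

# Collective I/O: all ranks participate in one coordinated write
# (the same slice is rewritten here only to show both forms).
with dset.collective:
    dset[rank * n:(rank + 1) * n] = data

f.close()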
-	31	-	
Coding Efforts
-	32	-	
H5py vs. HDF5 Performance
H5py performance as a fraction of HDF5 (C) performance:

                        Single node    Multi-node
Metadata
  1k file creation         63.8%
  1k object scanning       60.0%
Independent I/O
  Weak scaling             97.8%          100%
  Strong scaling           100%           97.1%
Collective I/O
  Weak scaling             100%           90%
  Strong scaling           98.6%          87%

Question: when you gain productivity, how much performance can you afford to lose?
HPC I/O Software Stack
(Figure: the HPC I/O software stack, top to bottom: Application, Productive Interface, High-Level I/O Library, I/O Middleware, I/O Forwarding, Parallel File System, I/O Hardware.)

Productive Interface builds a thin layer on top of an existing high-performance I/O library for productive big data analytics.
  Examples: H5py, H5Spark, Julia, Pandas, Fits

High-Level I/O Libraries map application abstractions onto storage abstractions and provide data portability.
  Examples: HDF5, Parallel netCDF, ADIOS

I/O Middleware organizes accesses from many processes, especially those using collective I/O.
  Examples: MPI-IO, GLEAN, PLFS

I/O Forwarding transforms I/O from many clients into fewer, larger requests; reduces lock contention; and bridges between the HPC system and external storage.
  Examples: IBM ciod, IOFSL, Cray DVS, Cray DataWarp

The parallel file system maintains the logical file model and provides efficient access to data.
  Examples: PVFS, PanFS, GPFS, Lustre
33
Get to Know Your I/O: Warp IO
- Characteristics:
  - Number of files
  - Size per file
  - Number of processes
  - I/O API
34
(Figure: the Warp I/O pattern over iteration 0 and iteration 1; 172 - 600 MB per file.)
Leverage I/O Profiling Tool: Darshan
- A lightweight, scalable I/O profiling tool
- Goal: to observe the I/O patterns of the majority of applications running on production HPC platforms, without perturbing their execution, and with enough detail to gain insight and aid in performance debugging.
  - Majority of applications: transparent integration with the system build environment
  - Without perturbation: bounded use of resources (memory, network, storage); no communication or I/O prior to job termination; compression.
  - Adequate detail:
    - Basic job statistics
    - File access information from multiple APIs
35
The Technology behind Darshan
- Intercepts I/O functions using link-time wrappers
  - No code modification
  - Can be transparently enabled in MPI compiler scripts
  - Compatible with all major C, C++, and Fortran compilers
- Records statistics independently at each process, for each file
  - Bounded memory consumption
  - Compact summary rather than a verbatim record
- Collects, compresses, and stores results at shutdown time
  - Aggregates shared-file data using a custom MPI reduction operator
  - Compresses the remaining data in parallel with zlib
  - Writes results with collective MPI-IO
  - The result is a single gzip-compatible file containing the characterization information
- Works on Linux clusters, Blue Gene, and Cray systems
36
Darshan Analysis Example
37	
Example job: hdf5writeTest 2 (2/4/2016), page 1 of 3; jobid: 42375, uid: 58179, nprocs: 64, runtime: 84 seconds
(Figure: four summary charts from the report - average I/O cost per process as a percentage of runtime (POSIX and MPI-IO; read, write, metadata, other including application compute); I/O operation counts for POSIX, MPI-IO independent, and MPI-IO collective (read, write, open, stat, seek, mmap, fsync); an I/O access-size histogram for reads and writes (buckets from 0-100 bytes up to 1 GB+); and I/O pattern counts (total, sequential, consecutive).)
Most Common Access Sizes
  access size    count
  1048576        65331
  272            1
  544            1
  328            1

File Count Summary (estimated by I/O access offsets)
  type                number of files    avg. size    max size
  total opened        1                  64G          64G
  read-only files     0                  0            0
  write-only files    1                  64G          64G
  read/write files    0                  0            0
  created files       1                  64G          64G

The darshan-job-summary tool produces a 3-page PDF file that summarizes job I/O behavior. The numbered callouts above mark:
1. Run time
2. Percentage of runtime in I/O
3. Access type histograms
4. Access size histogram
5. File usage
Darshan Analysis Example (page 2)
- This graph (and others like it) appears on the second page of the darshan-job-summary.pl output. This example shows intervals of I/O activity from each MPI process (MPI rank vs. time). In this case we see that different ranks completed I/O at very different times.
How to Use Darshan
- How to link with Darshan
  - Compile a C, C++, or Fortran program that uses MPI
- Run the application
- Look for the Darshan log file
  - It will be in a particular directory (depending on your system's configuration):
    <dir>/<year>/<month>/<day>/<username>_<appname>*.darshan*
  - Mira: see /projects/logs/darshan/
  - Edison: see /scratch1/scratchdirs/darshanlogs/
  - Cori: see /global/cscratch1/sd/darshanlogs/
- Use the Darshan command-line tools (e.g., darshan-job-summary.pl) to analyze the log file
- The application must run to completion and call MPI_Finalize() to generate a log file
39
Optimizing I/O from the File System Layer: Lustre
- File striping is a way to increase I/O performance
  - Bandwidth increases because multiple processes can access the file simultaneously
  - Striping also makes it possible to store large files that would take more space than a single OST.
40  (Figure courtesy of NICS)
Optimizing I/O from the File System Layer: Lustre
- Default striping: 1 MB stripe size, 1 OST
- lfs getstripe <file or dir>
- lfs setstripe -c 100 -S 8m <dir>
  - Chunks each file into 8 MB blocks distributed across 100 OSTs
41
Optimizing I/O from the File System Layer: Lustre
- More OSTs generally helps
- Striping a relatively small file over too many OSTs is not good:
  - Communication overhead
  - I/O bandwidth saturated at another layer, e.g., the compute nodes
  - Storage stragglers or a bad OST
42
(Figure: write bandwidth vs. stripe count when writing a 5 GB file and a 100 GB file.)
Optimizing I/O from the File System Layer: Lustre
- Empirical recommendations on Cori @ NERSC
  - File per process -> use the default striping
  - Single shared file -> stripe more aggressively; see the NERSC Lustre striping recommendations [1]
43
Optimizing Lustre on Cori
- Increasing the Lustre readahead
44
Optimizing I/O from the I/O Middleware Layer: MPI-IO
- Aggregate small and/or noncontiguous I/O into larger contiguous I/O
  - Collective buffer size
  - Collective buffer nodes (cb_nodes): the actual number of I/O aggregator processes
- Use a customized MPI-IO that understands the underlying file system
  - Cray's MPI-IO knows Lustre better, e.g., it reduces contention by exposing the Lustre data layout to the MPI-IO layer
- How to pass the hints (a complete sketch follows after this slide)
  - Through an environment variable:
    setenv MPICH_MPIIO_HINTS "*:romio_cb_write=enable:romio_ds_write=disable"
  - With MPI_Info_set:
    MPI_Info_set(info, "striping_factor", "4");
    MPI_Info_set(info, "cb_nodes", "4");
45
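Putting the pieces together, here is a minimal sketch (not from the slides) of passing ROMIO/Lustre hints through an MPI_Info object and then writing collectively; the hint values and the file name are illustrative rather than tuned recommendations, and the striping hints only take effect when the file is first created.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    /* Illustrative values only: tune for your system and file size. */
    MPI_Info_set(info, "striping_factor", "4");      /* Lustre stripe count   */
    MPI_Info_set(info, "striping_unit", "8388608");  /* 8 MB stripe size      */
    MPI_Info_set(info, "cb_nodes", "4");             /* number of aggregators */
    MPI_Info_set(info, "romio_cb_write", "enable");  /* collective buffering  */
    MPI_Info_set(info, "romio_ds_write", "disable"); /* no data sieving       */

    int buf[1024];
    for (int i = 0; i < 1024; i++) buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "hints.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(buf),
                          buf, 1024, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}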
Optimizing I/O from the I/O Middleware Layer: MPI-IO
46	
cb_buffer_size = 16777216
romio_cb_read = automatic
romio_cb_write = automatic
cb_nodes = 61
cb_align = 2
romio_no_indep_rw = false
romio_cb_pfr = disable
romio_cb_fr_types = aar
romio_cb_fr_alignment = 1
romio_cb_ds_threshold = 0
romio_cb_alltoall = automatic
ind_rd_buffer_size = 4194304
ind_wr_buffer_size = 524288
romio_ds_read = disable
romio_ds_write = disable
striping_factor = 61
striping_unit = 8388608
direct_io = false
aggregator_placement_stride = -1
abort_on_rw_error = disable
cb_config_list = *:*
romio_filesystem_type = CRAY ADIO	
- The listing above is the summary of MPI-IO hints that is dumped when the MPICH_MPIIO_HINTS_DISPLAY environment variable is set
Optimizing MPI-IO on Cori: Haswell vs. KNL
47
(Figure: write bandwidth for different numbers of aggregators (colors) as a function of the collective buffer size (x-axis).)
About this test:
- MPI-IO, collective I/O
- 486 GB file
- 32 processes per node
- 32 nodes
We recommend:
- 4 aggregators per node on Haswell
- 8 aggregators per node on KNL
- See our CUG'17 paper [2] for details
J. Liu et al., Understanding the IO Performance Gap Between Cori KNL and Haswell, CUG'17
Optimizing I/O from the I/O Middleware Layer: Guidelines
48
- Limit the number of files (less metadata and easier to post-process)
- Make large and contiguous requests
  - Avoid small accesses
  - Avoid noncontiguous accesses
  - Avoid random accesses
- Prefer collective I/O to independent I/O (especially if the operations can be aggregated into single large contiguous requests)
- Use derived datatypes and file views to ease the MPI-IO collective work
- Try MPI-IO hints (especially the collective buffering optimization; disabling data sieving is also very often a good idea; hints are also useful for libraries built on MPI-IO)
Credits: Philippe Wautelet @ IDRIS
Optimizing I/O from the High-Level Interface: HDF5
- The MPI-IO layer's optimization guidelines generally apply to the HDF5 layer as well
- Collective metadata operations, available since HDF5 1.10 (see the sketch below)
  - For reads, the library can have a single rank read the metadata and broadcast it to all other ranks
  - For writes, it constructs an MPI derived datatype and writes the metadata collectively in a single call
- Increase page buffering
  - H5Pset_page_buffer_size
- Stay tuned for the next HDF5 talk at 10am
49
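A minimal sketch of enabling these collective metadata options on a file access property list (assumes HDF5 >= 1.10; error checking omitted):

#include <hdf5.h>
#include <mpi.h>

/* Build a file access property list configured for parallel HDF5 with
   collective metadata reads and writes enabled; the caller passes the
   returned fapl to H5Fcreate/H5Fopen and releases it with H5Pclose. */
hid_t make_parallel_fapl(MPI_Comm comm, MPI_Info info)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);      /* MPI-IO file driver          */
    H5Pset_all_coll_metadata_ops(fapl, 1);   /* collective metadata reads   */
    H5Pset_coll_metadata_write(fapl, 1);     /* collective metadata writes  */
    return fapl;
}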
Optimizing I/O from the Python Interface: H5py
- Optimal HDF5 file creation
- Use the low-level API in h5py to get closer to the HDF5 C library for fine tuning (see the sketch below)
50
(Figure: 2.25x speedup.)
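As a rough illustration of what dropping to the low-level API looks like, here is a sketch contrasting high-level and low-level file/dataset creation; the file and dataset names are arbitrary, and the exact low-level signatures should be checked against your installed h5py version:

import h5py
from h5py import h5f, h5s, h5d, h5t, h5p

# High-level, convenient:
with h5py.File('hl.h5', 'w') as f:
    f.create_dataset('x', (1024,), dtype='f8')

# Low-level, closer to the C API shown earlier
# (H5Fcreate / H5Screate_simple / H5Dcreate), which exposes the
# property lists and other knobs used for fine tuning:
fapl = h5p.create(h5p.FILE_ACCESS)                    # file access property list
fid = h5f.create(b'll.h5', h5f.ACC_TRUNC, fapl=fapl)  # returns a FileID
space = h5s.create_simple((1024,))
dset = h5d.create(fid, b'x', h5t.NATIVE_DOUBLE, space)
del dset, space, fapl                                 # drop low-level identifiers
fid.close()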
Optimizing I/O from the Python Interface: H5py
- Speed up the I/O with collective I/O
51
Using 1k processes to write a 1 TB file, collective I/O achieved a 2x speedup on Cori
Optimizing I/O from the Python Interface: H5py
- Avoid type casting in h5py (see the sketch below)
52
Avoiding the cast reduced the I/O time from 527 seconds to 1.3 seconds
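A sketch of the kind of mismatch that triggers casting; the array and dataset names are arbitrary, and the point is simply that the in-memory dtype should match the dataset dtype:

import numpy as np
import h5py

data = np.random.rand(1024, 1024)                 # float64 in memory

with h5py.File('cast.h5', 'w') as f:
    # dtype mismatch: every element is converted from float64 to float32
    # inside the write path, which can dominate the I/O time.
    slow = f.create_dataset('slow', data.shape, dtype='f4')
    slow[...] = data

    # matching dtype: the write is a straight copy with no per-element cast.
    fast = f.create_dataset('fast', data.shape, dtype=data.dtype)
    fast[...] = data

    # on the read side, read_direct() into a matching-dtype array
    # likewise avoids an extra conversion/copy.
    out = np.empty(data.shape, dtype=fast.dtype)
    fast.read_direct(out)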
Object Store for HPC
- Amazon S3 has been very successful in supporting a wide range of applications, e.g., Instagram and Dropbox
- HPC file systems rely on strong POSIX semantics, which hinders performance, e.g., scalability.
- Object store operations:
  - Put: creates a new object and fills it with data
  - Get: retrieves the data based on the object ID
- Benefit: scalability (a toy sketch follows below)
  - Lockless: objects are immutable and write-once, so there is no need to lock them before a read
  - Fast lookup, based on a simple hash of the object ID
53
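A toy, in-memory illustration of these semantics (purely illustrative; it does not correspond to any real object-store API):

import hashlib

class ToyObjectStore:
    """Write-once put/get keyed by a content hash: no updates, so no locks."""
    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        oid = hashlib.sha256(data).hexdigest()   # object ID from a simple hash
        self._objects.setdefault(oid, data)      # immutable: never overwritten
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]                # constant-time lookup by ID

store = ToyObjectStore()
oid = store.put(b"simulation checkpoint bytes")
assert store.get(oid) == b"simulation checkpoint bytes"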
Object Store for HPC 
- Disadvantages
  - Data cannot be modified in place, while most HPC applications want to both read and write their data
  - Limited metadata support (e.g., user info, access permissions); this requires an additional database layer
54
(Figure courtesy of Glenn Lockwood)
DDN WOS, an object store testbed at NERSC (to be announced soon)
Contact me or Damian Hazen if you are interested
Burst Buffer
- Burst Buffer on Cori
  - 1.7 TB/s of peak I/O performance with 28M IOPS
  - 1.8 PB of storage
- NVRAM-based "Burst Buffer" (BB) as an intermediate layer
  - Handles I/O spikes without a huge PFS (stages to the PFS asynchronously)
  - The underlying media supports challenging I/O patterns
  - Software for on-demand file systems scales better than a large POSIX PFS
- Cray DataWarp software allocates storage to users per job
  - Users see a POSIX file system on demand, striped across BB nodes
  - Users can specify data to stage in/out from Lustre while the job is in the queue
55
NERSC Burst Buffer Architecture 
-	56	-	
(Figure: compute nodes (CN) on the Aries high-speed network, Burst Buffer nodes (BB) with SSDs, and I/O nodes (ION, 2x InfiniBand HCA) connecting through the InfiniBand storage fabric to the Lustre OSSs/OSTs on the storage servers. A Burst Buffer blade = 2 Burst Buffer nodes with 4 Intel P3608 3.2 TB SSDs; 1.8 PB in total on 144 BB nodes.)
(Figure: cost in seconds of three workflow steps - file open, fiber object copy, and catalog query & copy - on Cori Lustre vs. the Burst Buffer.)
Burst Buffer Use Case: H5Boss in Astronomy
- BOSS: Baryon Oscillation Spectroscopic Survey, from SDSS
- Perform typical randomly generated queries to extract a small number of stars/galaxies from millions
- Workflows involve thousands of file opens/closes and random, small read/write I/O
- Run on the final release of the complete SDSS-III BOSS dataset
  - 2393 HDF5 files, ~3.2 TB in total
- 57 -
- 4.4 TB Burst Buffer allocation across 22 BB nodes
- Lower I/O times on the Burst Buffer
- 5.5x speedup for the entire workflow
Thanks to Rob Latham, Quincey Koziol, Phil Carns, and Wahid Bhimji for sharing their slides.
	
Thank	You	
	
Email:	Jalnliu@lbl.gov	
58
[1] Lustre striping recommendations on Cori:
http://www.nersc.gov/users/storage-and-file-systems/i-o-resources-for-scientific-applications/optimizing-io-performance-for-lustre/
[2] J.L. Liu, Q. Koziol, H.J. Tang, F. Tessier, W. Bhimji, B. Cook, B. Austin, S. Byna, B. Thakur, G. Lockwood, J. Deslippe, and Prabhat, Understanding the IO Performance Gap Between Cori KNL and Haswell, CUG'17
59

More Related Content

What's hot

HPC Midlands Launch - Introduction to HPC Midlands
HPC Midlands Launch - Introduction to HPC MidlandsHPC Midlands Launch - Introduction to HPC Midlands
HPC Midlands Launch - Introduction to HPC MidlandsMartin Hamilton
 
High Performance Interconnects: Assessment & Rankings
High Performance Interconnects: Assessment & RankingsHigh Performance Interconnects: Assessment & Rankings
High Performance Interconnects: Assessment & Rankingsinside-BigData.com
 
Scale Out Your Graph Across Servers and Clouds with OrientDB
Scale Out Your Graph Across Servers and Clouds  with OrientDBScale Out Your Graph Across Servers and Clouds  with OrientDB
Scale Out Your Graph Across Servers and Clouds with OrientDBLuca Garulli
 
クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術Ryousei Takano
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsGanesan Narayanasamy
 
High Performance Compute: NextGen Silicon Photonics Storage Solution
High Performance Compute: NextGen Silicon Photonics Storage SolutionHigh Performance Compute: NextGen Silicon Photonics Storage Solution
High Performance Compute: NextGen Silicon Photonics Storage SolutionRohan Hubli
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERinside-BigData.com
 
Elastic multicore scheduling with the XiTAO runtime
Elastic multicore scheduling with the XiTAO runtimeElastic multicore scheduling with the XiTAO runtime
Elastic multicore scheduling with the XiTAO runtimeMiquel Pericas
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_ProcessingKohei KaiGai
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwJan Holčapek
 
Introduction to Ocean Observation1
Introduction to Ocean Observation1Introduction to Ocean Observation1
Introduction to Ocean Observation1Jose Rodriguez
 
Cisco UCS: meeting the growing need for bandwidth
Cisco UCS: meeting the growing need for bandwidthCisco UCS: meeting the growing need for bandwidth
Cisco UCS: meeting the growing need for bandwidthPrincipled Technologies
 
A new storage architecture for a flash memory video server
A new storage architecture for a flash memory video serverA new storage architecture for a flash memory video server
A new storage architecture for a flash memory video servereSAT Journals
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_ENKohei KaiGai
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCGanesan Narayanasamy
 

What's hot (20)

HPC Midlands Launch - Introduction to HPC Midlands
HPC Midlands Launch - Introduction to HPC MidlandsHPC Midlands Launch - Introduction to HPC Midlands
HPC Midlands Launch - Introduction to HPC Midlands
 
High Performance Interconnects: Assessment & Rankings
High Performance Interconnects: Assessment & RankingsHigh Performance Interconnects: Assessment & Rankings
High Performance Interconnects: Assessment & Rankings
 
OpenPOWER System Marconi100
OpenPOWER System Marconi100OpenPOWER System Marconi100
OpenPOWER System Marconi100
 
Scale Out Your Graph Across Servers and Clouds with OrientDB
Scale Out Your Graph Across Servers and Clouds  with OrientDBScale Out Your Graph Across Servers and Clouds  with OrientDB
Scale Out Your Graph Across Servers and Clouds with OrientDB
 
クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術
 
Ac922 cdac webinar
Ac922 cdac webinarAc922 cdac webinar
Ac922 cdac webinar
 
HDF-EOS to GeoTIFF Conversion Tool & HDF-EOS Plug-in for HDFView
HDF-EOS to GeoTIFF Conversion Tool & HDF-EOS Plug-in for HDFViewHDF-EOS to GeoTIFF Conversion Tool & HDF-EOS Plug-in for HDFView
HDF-EOS to GeoTIFF Conversion Tool & HDF-EOS Plug-in for HDFView
 
Xilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systemsXilinx Edge Compute using Power 9 /OpenPOWER systems
Xilinx Edge Compute using Power 9 /OpenPOWER systems
 
High Performance Compute: NextGen Silicon Photonics Storage Solution
High Performance Compute: NextGen Silicon Photonics Storage SolutionHigh Performance Compute: NextGen Silicon Photonics Storage Solution
High Performance Compute: NextGen Silicon Photonics Storage Solution
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWER
 
Elastic multicore scheduling with the XiTAO runtime
Elastic multicore scheduling with the XiTAO runtimeElastic multicore scheduling with the XiTAO runtime
Elastic multicore scheduling with the XiTAO runtime
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing
 
HDF5 Advanced Topics - Chunking
HDF5 Advanced Topics - ChunkingHDF5 Advanced Topics - Chunking
HDF5 Advanced Topics - Chunking
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdw
 
Introduction to Ocean Observation1
Introduction to Ocean Observation1Introduction to Ocean Observation1
Introduction to Ocean Observation1
 
POWER9 for AI & HPC
POWER9 for AI & HPCPOWER9 for AI & HPC
POWER9 for AI & HPC
 
Cisco UCS: meeting the growing need for bandwidth
Cisco UCS: meeting the growing need for bandwidthCisco UCS: meeting the growing need for bandwidth
Cisco UCS: meeting the growing need for bandwidth
 
A new storage architecture for a flash memory video server
A new storage architecture for a flash memory video serverA new storage architecture for a flash memory video server
A new storage architecture for a flash memory video server
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
 

Similar to Optimizing HPC I/O Performance and Trends

HKG15-The Machine: A new kind of computer- Keynote by Dejan Milojicic
HKG15-The Machine: A new kind of computer- Keynote by Dejan MilojicicHKG15-The Machine: A new kind of computer- Keynote by Dejan Milojicic
HKG15-The Machine: A new kind of computer- Keynote by Dejan MilojicicLinaro
 
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...Facultad de Informática UCM
 
HPE Solutions for Challenges in AI and Big Data
HPE Solutions for Challenges in AI and Big DataHPE Solutions for Challenges in AI and Big Data
HPE Solutions for Challenges in AI and Big DataLviv Startup Club
 
Saviak lviv ai-2019-e-mail (1)
Saviak lviv ai-2019-e-mail (1)Saviak lviv ai-2019-e-mail (1)
Saviak lviv ai-2019-e-mail (1)Lviv Startup Club
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkAhsan Javed Awan
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Spark Summit
 
Ceph in 2023 and Beyond.pdf
Ceph in 2023 and Beyond.pdfCeph in 2023 and Beyond.pdf
Ceph in 2023 and Beyond.pdfClyso GmbH
 
TUT18972: Unleash the power of Ceph across the Data Center
TUT18972: Unleash the power of Ceph across the Data CenterTUT18972: Unleash the power of Ceph across the Data Center
TUT18972: Unleash the power of Ceph across the Data CenterEttore Simone
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
 
MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1blewington
 
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...StampedeCon
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Community
 
CloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning
 
The convergence of HPC and BigData: What does it mean for HPC sysadmins?
The convergence of HPC and BigData: What does it mean for HPC sysadmins?The convergence of HPC and BigData: What does it mean for HPC sysadmins?
The convergence of HPC and BigData: What does it mean for HPC sysadmins?inside-BigData.com
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weitingWei Ting Chen
 
IMCSummit 2015 - Day 2 IT Business Track - Drive IMC Efficiency with Flash E...
IMCSummit 2015 - Day 2  IT Business Track - Drive IMC Efficiency with Flash E...IMCSummit 2015 - Day 2  IT Business Track - Drive IMC Efficiency with Flash E...
IMCSummit 2015 - Day 2 IT Business Track - Drive IMC Efficiency with Flash E...In-Memory Computing Summit
 

Similar to Optimizing HPC I/O Performance and Trends (20)

HKG15-The Machine: A new kind of computer- Keynote by Dejan Milojicic
HKG15-The Machine: A new kind of computer- Keynote by Dejan MilojicicHKG15-The Machine: A new kind of computer- Keynote by Dejan Milojicic
HKG15-The Machine: A new kind of computer- Keynote by Dejan Milojicic
 
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
 
HPE Solutions for Challenges in AI and Big Data
HPE Solutions for Challenges in AI and Big DataHPE Solutions for Challenges in AI and Big Data
HPE Solutions for Challenges in AI and Big Data
 
Saviak lviv ai-2019-e-mail (1)
Saviak lviv ai-2019-e-mail (1)Saviak lviv ai-2019-e-mail (1)
Saviak lviv ai-2019-e-mail (1)
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
Manycores for the Masses
Manycores for the MassesManycores for the Masses
Manycores for the Masses
 
Cloud, Fog, or Edge: Where and When to Compute?
Cloud, Fog, or Edge: Where and When to Compute?Cloud, Fog, or Edge: Where and When to Compute?
Cloud, Fog, or Edge: Where and When to Compute?
 
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache SparkNear Data Computing Architectures: Opportunities and Challenges for Apache Spark
Near Data Computing Architectures: Opportunities and Challenges for Apache Spark
 
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Ceph in 2023 and Beyond.pdf
Ceph in 2023 and Beyond.pdfCeph in 2023 and Beyond.pdf
Ceph in 2023 and Beyond.pdf
 
TUT18972: Unleash the power of Ceph across the Data Center
TUT18972: Unleash the power of Ceph across the Data CenterTUT18972: Unleash the power of Ceph across the Data Center
TUT18972: Unleash the power of Ceph across the Data Center
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
 
MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1
 
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
CloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use Case
 
The convergence of HPC and BigData: What does it mean for HPC sysadmins?
The convergence of HPC and BigData: What does it mean for HPC sysadmins?The convergence of HPC and BigData: What does it mean for HPC sysadmins?
The convergence of HPC and BigData: What does it mean for HPC sysadmins?
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
 
IMCSummit 2015 - Day 2 IT Business Track - Drive IMC Efficiency with Flash E...
IMCSummit 2015 - Day 2  IT Business Track - Drive IMC Efficiency with Flash E...IMCSummit 2015 - Day 2  IT Business Track - Drive IMC Efficiency with Flash E...
IMCSummit 2015 - Day 2 IT Business Track - Drive IMC Efficiency with Flash E...
 

Recently uploaded

UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 

Recently uploaded (20)

UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 

Optimizing HPC I/O Performance and Trends

  • 1. Jialin Liu! Data Analytics & Service Group! NERSC/LBNL! Parallel IO - 1 - June 30, 2017 Scaling to Petascale Ins/tute
  • 2. Outline - 2 - Ø  I/O Challenges in 2020/2025 Ø  HPC I/O Stack q  Hardware: HDD, SSD q  SoDware: Lustre, MPI-IO, HDF5, H5py q  Profiling I/O with Darshan Ø  OpKmizing and Scaling I/O Ø  HPC I/O & Storage Trend q  Burst Buffer q  Object Store
  • 3. Introduction: I/O Challenges in 2020/2025 - 3 - Ø  ScienKfic applicaKons/simulaKons generate massive quanKKes of data. Ø  Example, BES: Basic Energy Science, Requirement Review, 2015 Ø  19 projects review Ø  Example projects: Quantum Materials, SoD Ma]ers, CombusKon, Average Increasing RaKo
  • 4. Common I/O Issues Ø  Bandwidth Ø  “The peak bandwidth is XXX GB/s, why I could only get 1% of that?” Ø  Scalability Ø  “I have used more IO processes, why the performance is not scalable?” Ø  Metadata Ø  “File closing is so slow in my test…” Ø  Pain of ProducKvity Ø  “I like to use Python/Spark, but the I/O seems slow” 4
  • 5. What does Parallel I/O Mean? 5 Ø  At the program level: Ø  Concurrent reads or writes from mulKple processes to a common file Ø  At the system level: Ø  A parallel file system and hardware that support such concurrent access -William Gropp
  • 6. HPC I/O Software Stack High Level I/O Libraries map applicaKon abstracKons onto storage abstracKons and provide data portability. HDF5, Parallel netCDF, ADIOS I/O Middleware organizes accesses from many processes, especially those using collecKve I/O. MPI-IO, GLEAN, PLFS I/O Forwarding transforms I/O from many clients into fewer, larger request; reduces lock contenKon; and bridges between the HPC system and external storage. IBM ciod, IOFSL, Cray DVS, Cray Datawarp Parallel file system maintains logical file model and provides efficient access to data. PVFS, PanFS, GPFS, Lustre 6 I/O Hardware Application Parallel File System High-Level I/O Library I/O Middleware I/O Forwarding Productive Interface Produc/ve Interface builds a thin layer on top of exisKng high performance I/O library for producKve big data analyKcs H5py, H5Spark, Julia, Pandas, Fits
  • 7. Data Complexity in Computational Science Ø  ApplicaKons use advanced data models to fit the problem at hand Ø  MulKdimensional typed arrays, images composed of scan lines, … Ø  Headers, a]ributes on data Ø  I/O systems have very simple data models Ø  Tree-based hierarchy of containers Ø  Some containers have streams of bytes (files) Ø  Others hold collecKons of other containers (directories or folders) EffecKve mapping from applicaKon data models to I/O system data models is the key to I/O performance. Right Interior Carotid Artery Platelet Aggregation Model complexity: Spectral element mesh (top) for thermal hydraulics computaKon coupled with finite element mesh (bo]om) for neutronics calculaKon. Scale complexity: SpaKal range from the reactor core in meters to fuel pellets in millimeters. Images from T. Tautges (ANL) (upper leD), M. Smith (ANL) (lower leD), and K. Smith (MIT) (right). 7
  • 8. I/O Hardware Ø  Storage Side Ø  Hard Disk Drive (TradiKonal) Ø  Solid State Drive (Future) Ø  Compute Side Ø  DRAM, Cache (TradiKonal) Ø  HBM(e.g., MCDRAM), NVRAM(e.g., 3D Xpoint) 8 DRAM HDD SSD On-package MCDRAM Courtesy of Tweaktown
  • 9. I/O Hardware: HDD 9 ConKguous IO •  read Kme, 0.1 ms NonconKguous IO •  seek Kme, 4ms •  rotaKon Kme, 3ms •  read Kme, 0.1 ms SSD: No moving parts
  • 11. Parallel File System Ø  Store applicaKon data persistently Ø  Usually extremely large datasets that can’t fit in memory Ø  Provide global shared-namespace (files, directories) Ø  Designed for parallelism Ø  Concurrent (oDen coordinated) access from many clients Ø  Designed for high performance Ø  Operate over high speed networks (IB, Myrinet, Portals) Ø  OpKmized I/O path for maximum bandwidth Ø  Examples Ø  Lustre: Most leadership supercomputers have deployed Lustre Ø  PVFS-> OrangeFS Ø  GPFS-> IBM Spectrum Scale, Commercial & HPC 11
  • 12. Parallel File System: Lustre 12 Lustre.org OSS: Object Storage Server OST: Object Storage Target MDT: Metadata Servers
  • 13. Parallel File System: Cori FS 13 248 OSS OSS … … OST OST OST OST OST OST … … MDS4 MDS3 MDS1 MDS2 1 primary MDS, 4 additional MDS ADU1 ADU2 MDS 248 OSS OSS OSS OSS Infiniband 130 Haswell with Aries Network … CMP CMP … … CMP … CMP LNET … … … CMP CMP LNET LNET KNL with Aries Network … CMP CMP … … CMP … CMP LNET … … … CMP CMP LNET LNET … 2004 9688
  • 14. I/O Forwarding Ø  A layer between compuKng system and storage system Ø  Compute nodes kernel ships I/O to dedicated I/O nodes [1] Ø  Examples Ø  Cray DVS Ø  IOFSL Ø  Cray Datawarp 14 1. AcceleraKng I/O Forwarding in IBM Blue Gene/P Systems 2. h]p://www.glennklockwood.com/data-intensive/storage/io- forwarding.html
  • 15. I/O Forwarding: Cray DVS 15 DVS on parallel file system, e.g., GPFS, Lustre Ø  The DVS clients can spread their I/O traffic between the DVS servers using a determinisKc mapping. Ø  Configurable number DVS clients Ø  Reduces the number of clients that communicate with the backing file system (GPFS supports limited number of clients) Stephen Sugiyama, etc, Cray DVS: Data VirtualizaKon Service, CUG 2008
  • 16. I/O Middleware Ø  Why addiKonal I/O SoDware? Ø  AddiKonal I/O soDware provides improved performance and usability over directly accessing the parallel file system. Ø  Reduces or (ideally) eliminates need for opKmizaKon in applicaKon codes. Ø  MPI-IO Ø  I/O interface specificaKon for use in MPI apps Ø  Data model is same as POSIX: Stream of bytes in a file Ø  MPI-IO Features Ø  CollecKve I/O Ø  NonconKguous I/O with MPI datatypes and file views Ø  Nonblocking I/O Ø  Fortran bindings (and addiKonal languages) Ø  System for encoding files in a portable format (external32) 16
  • 17. What’s Wrong with POSIX? Ø  It’s a useful, ubiquitous interface for basic I/O Ø  It lacks constructs useful for parallel I/O Ø  Cluster applicaKon is really one program running on N nodes, but looks like N programs to the filesystem Ø  No support for nonconKguous I/O Ø  No hinKng/prefetching Ø  Its rules hurt performance for parallel apps Ø  Atomic writes, read-aDer-write consistency Ø  A]ribute freshness Ø  POSIX should not have to be used (directly) in parallel applicaKons that want good performance Ø  But developers use it anyway 17
  • 18. Independent and Collective I/O
  - Independent I/O operations specify only what a single process will do
    - Independent I/O calls carry no information about the I/O happening on other processes
  - Why use independent I/O?
    - Sometimes the synchronization of collective calls is not natural
    - Sometimes the overhead of collective calls outweighs their benefits
      - Example: very small I/O during metadata operations
  (Figure: six processes P0-P5, each issuing its own independent I/O request)
  • 19. Independent and Collective I/O
  - Collective I/O is coordinated access to storage by a group of processes
    - Collective I/O functions are called by all processes participating in the I/O
  - Why use collective I/O?
    - The I/O layers know more about the access as a whole, so lower software layers have more opportunities to optimize, giving better performance
    - Combined with noncontiguous accesses, it yields the highest performance
  (Figure: six processes P0-P5 making one coordinated, collective request; a minimal sketch of the two modes follows below)
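  To make the contrast concrete, here is a minimal sketch in Python using mpi4py (the slides use the C API; mpi4py, the file name, and the data layout are my own choices for illustration). Each rank owns a disjoint byte range of one shared file:

      from mpi4py import MPI
      import numpy as np

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()

      buf = np.full(1024, rank, dtype=np.int32)   # this rank's block of data
      offset = rank * buf.nbytes                  # disjoint byte ranges, one per rank

      fh = MPI.File.Open(comm, "out.dat", MPI.MODE_CREATE | MPI.MODE_WRONLY)

      # Independent I/O: each rank issues its own request, with no coordination.
      # fh.Write_at(offset, buf)

      # Collective I/O: every rank in the communicator makes the call together,
      # so the MPI-IO layer can see the whole access and aggregate it.
      fh.Write_at_all(offset, buf)

      fh.Close()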
  • 20. Two Key Optimizations in ROMIO (MPI-IO)
  - MPI-IO has many implementations
    - ROMIO
    - Cray, IBM, and OpenMPI all have their own implementations/variants
  - Data sieving
    - For independent noncontiguous requests
    - ROMIO makes large I/O requests to the file system and extracts the requested data in memory
    - For writing, a read-modify-write is required
  - Two-phase collective I/O
    - A communication phase merges data into large chunks
    - An I/O phase writes the large chunks in parallel
  • 21. Contiguous and Noncontiguous I/O
  - Contiguous I/O moves data from a single memory block into a single file region
  - Noncontiguous I/O has three forms:
    - Noncontiguous in memory, noncontiguous in file, or noncontiguous in both
  - Structured data leads naturally to noncontiguous I/O (e.g., block decomposition)
  - Describing a noncontiguous access with a single operation passes more knowledge to the I/O system
  (Figure: a process whose access is noncontiguous in file vs. noncontiguous in memory; extracting variables from a block and skipping ghost cells results in noncontiguous I/O)
  • 22. Example: Collective I/O for Noncontiguous I/O
  (Figure, courtesy of William Gropp: a large array distributed among 16 processes; each square is the subarray held in one process's memory, and the access pattern in the file is row-major)
  • 23. Example: Collective I/O for Noncontiguous I/O
  Independent version — each process makes one independent read request for each row in its local array:
      MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
      for (i = 0; i < n_local_rows; i++) {
          MPI_File_seek(fh, ...);
          MPI_File_read(fh, &(A[i][0]), ...);
      }
      MPI_File_close(&fh);
  Collective version — each process creates a derived datatype to describe its noncontiguous access pattern, defines a file view, and calls collective I/O (MPI_File_read_all):
      MPI_Type_create_subarray(ndims, ..., &subarray);
      MPI_Type_commit(&subarray);
      MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
      MPI_File_set_view(fh, ..., subarray, ...);
      MPI_File_read_all(fh, A, ...);
      MPI_File_close(&fh);
  • 24. Example: Collective I/O + Noncontiguous I/O (image-only slide)
  • 25. Example: Collective I/O + Noncontiguous I/O (image-only slide; source: http://wgropp.cs.illinois.edu/)
  • 26. High Level I/O Libraries
  - Take advantage of high-performance parallel I/O while reducing complexity
    - Add a well-defined layer to the I/O stack
    - Allow users to specify complex data relationships and dependencies
    - Come with machine-independent, self-describing data formats suitable for array-oriented scientific data
  - Examples
    - HDF5: The HDF Group, since 1989; among the top 5 libraries at NERSC
    - Parallel netCDF: NWU, ANL, since 2001
    - ADIOS: ORNL, since 2009
  • 27. High Level I/O Libraries: HDF5
  - A parallel HDF5 program needs only a few extra calls compared with a serial one:
      MPI_Init(&argc, &argv);
      fapl_id = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl_id, comm, info);
      file_id = H5Fcreate(FNAME, ..., fapl_id);
      space_id = H5Screate_simple(...);
      dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id, ...);
      xf_id = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);
      status = H5Dwrite(dset_id, H5T_NATIVE_INT, ..., xf_id, ...);
      MPI_Finalize();
  • 28. Productive I/O Interface
  - Big data analytics stacks
    - Spark
    - TensorFlow
    - Caffe
  - Science data needs to be loaded efficiently into these engines
  - Productive I/O interfaces
    - h5py
    - H5Spark
    - fitsio
  • 29. Productive I/O Interface: H5py
  How a high-level h5py call maps down to the HDF5 C library:
      HDF5 C API (libhdf5):
          hsize_t H5Dget_storage_size(hid_t dset_id)
      Cython layer (h5d.pyx):
          cdef class DatasetID(ObjectID):
              def get_storage_size(self):
                  return H5Dget_storage_size(self.id)
      Python layer (_hl/dataset.py):
          class Dataset(HLObject):
              @property
              def storagesize(self):
                  return self.id.get_storage_size()
  (High-level File/Group/Dataset objects wrap low-level FileID/DatasetID handles.)
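  One practical consequence of this layering: every high-level h5py object exposes its low-level identifier through .id, so you can reach the C-level call when you need it. A small sketch (the file and dataset names are invented):

      import h5py
      import numpy as np

      with h5py.File("example.h5", "w") as f:
          dset = f.create_dataset("x", data=np.arange(100))
          # Dataset (high level) wraps a DatasetID (Cython layer); .id exposes it,
          # and get_storage_size() calls H5Dget_storage_size() in libhdf5.
          print(dset.id.get_storage_size())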
  • 30. Productive I/O Interface: H5py
  (Code examples on the slide contrast independent I/O and collective I/O in h5py; see the sketch below.)
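  Since the slide's code is not reproduced in this transcript, here is a minimal sketch of an independent parallel write through h5py (it assumes h5py built against parallel HDF5 plus mpi4py; the file name and sizes are invented):

      from mpi4py import MPI
      import h5py
      import numpy as np

      comm = MPI.COMM_WORLD
      rank, nprocs = comm.Get_rank(), comm.Get_size()

      # One shared file, opened with the MPI-IO driver by all ranks.
      f = h5py.File("parallel.h5", "w", driver="mpio", comm=comm)
      dset = f.create_dataset("data", (nprocs, 1024), dtype="i4")

      # Independent I/O: each rank writes its own row with no coordination.
      dset[rank, :] = np.full(1024, rank, dtype="i4")

      f.close()

  The collective variant of the same write appears in the sketch after slide 51.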
  • 32. H5py vs. HDF5 Performance
  Question: when you gain productivity, how much performance can you afford to lose?
  H5py performance as a fraction of HDF5 (C) performance:
      Metric                              Single node    Multi-node
      Metadata: 1k file creation             63.8%
      Metadata: 1k object scanning           60.0%
      Independent I/O: weak scaling          97.8%         100%
      Independent I/O: strong scaling       100%            97.1%
      Collective I/O: weak scaling          100%            90%
      Collective I/O: strong scaling         98.6%          87%
  • 33. HPC I/O Software Stack (recap of the layered stack shown at the start of the talk, repeated here as a roadmap for the profiling and optimization sections that follow)
  • 34. Get to Know Your I/O: Warp I/O
  - Characteristics to look at:
    - Number of files
    - Size per file
    - Number of processes
    - I/O API
  (Figure: Warp I/O pattern across iteration 0 and iteration 1; each file is 172-600 MB)
  • 35. Leverage an I/O Profiling Tool: Darshan
  - A lightweight, scalable I/O profiling tool
  - Goal: observe the I/O patterns of the majority of applications running on production HPC platforms, without perturbing their execution, and with enough detail to gain insight and aid performance debugging
    - Majority of applications: transparent integration with the system build environment
    - Without perturbation: bounded use of resources (memory, network, storage); no communication or I/O prior to job termination; compression
    - Adequate detail:
      - Basic job statistics
      - File access information from multiple APIs
  • 36. The Technology behind Darshan
  - Intercepts I/O functions using link-time wrappers
    - No code modification
    - Can be transparently enabled in MPI compiler scripts
    - Compatible with all major C, C++, and Fortran compilers
  - Records statistics independently at each process, for each file
    - Bounded memory consumption
    - Compact summary rather than a verbatim record
  - Collects, compresses, and stores results at shutdown time
    - Aggregates shared-file data using a custom MPI reduction operator
    - Compresses the remaining data in parallel with zlib
    - Writes results with collective MPI-IO
    - The result is a single gzip-compatible file containing the characterization information
  - Works on Linux clusters, Blue Gene, and Cray systems
  • 37. Darshan Analysis Example
  - The darshan-job-summary tool produces a 3-page PDF that summarizes job I/O behavior, including:
    1. Run time
    2. Percentage of runtime spent in I/O
    3. Access-type histograms (POSIX, MPI-IO independent, MPI-IO collective operation counts)
    4. Access-size histogram
    5. File usage summary
  (Figure: sample summary for an hdf5writeTest run — jobid 42375, 64 processes, 84-second runtime — dominated by 1 MiB writes to a single 64 GB write-only shared file)
  • 39. How to Use Darshan
  - How to link with Darshan
    - Compile a C, C++, or Fortran program that uses MPI
    - Run the application
    - Look for the Darshan log file
      - It lands in a directory that depends on your system's configuration: <dir>/<year>/<month>/<day>/<username>_<appname>*.darshan*
      - Mira: see /projects/logs/darshan/
      - Edison: see /scratch1/scratchdirs/darshanlogs/
      - Cori: see /global/cscratch1/sd/darshanlogs/
    - Use the Darshan command-line tools to analyze the log file (see the example below)
  - The application must run to completion and call MPI_Finalize() to generate a log file
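  A hypothetical Cori session tying these steps together (the program name is invented; the log directory is the one listed above, and darshan-job-summary is the tool named on the earlier slide):

      cc my_io_app.c -o my_io_app      # Darshan is linked in transparently by the compiler wrapper
      srun -n 64 ./my_io_app           # the app must reach MPI_Finalize() to produce a log
      darshan-job-summary.pl /global/cscratch1/sd/darshanlogs/<year>/<month>/<day>/<username>_my_io_app*.darshan*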
  • 40. Optimizing I/O at the File System Layer: Lustre
  - File striping is a way to increase I/O performance
    - Bandwidth increases because multiple processes can access multiple OSTs simultaneously
    - Striping also lets you store files larger than a single OST
  (Figure courtesy of NICS: a file striped across several OSTs)
  • 41. Optimizing I/O at the File System Layer: Lustre
  - Default striping: 1 MB stripe size, 1 OST
  - lfs getstripe — query the current striping
  - lfs setstripe -c 100 -S 8m — chunk the file into 8 MB blocks distributed across 100 OSTs
  • 42. Optimizing I/O at the File System Layer: Lustre
  - More OSTs generally helps
  - But striping a relatively small file over too many OSTs is not good:
    - Communication overhead
    - I/O bandwidth already saturated at another layer, e.g., the compute nodes
    - Storage stragglers or a bad OST
  (Figures: write bandwidth vs. stripe count for a 5 GB file and for a 100 GB file)
  • 43. Optimizing I/O at the File System Layer: Lustre
  - Empirical recommendations on Cori @ NERSC:
    - File per process -> use the default striping
    - Single shared file -> (recommended stripe settings shown on the slide, not captured in this transcript)
  • 44. Optimizing Lustre on Cori
  - Increasing Lustre readahead (results shown in the slide's figure)
  • 45. Optimizing I/O at the Middleware Layer: MPI-IO
  - Aggregate small and/or noncontiguous I/O into larger contiguous I/O
    - Collective buffer size
    - Collective buffer nodes: the actual number of I/O aggregator processes
  - Use the vendor's customized MPI-IO to better exploit the underlying file system
    - Cray's MPI-IO knows its Lustre better, e.g., it reduces contention by exposing the Lustre data layout to the MPI-IO layer
  - How to pass the hints (a Python sketch follows below)
    - Through an environment variable:
        setenv MPICH_MPIIO_HINTS "*:romio_cb_write=enable:romio_ds_write=disable"
    - Through MPI_Info_set:
        MPI_Info_set(info, "striping_factor", "4");
        MPI_Info_set(info, "cb_nodes", "4");
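  The same hints can be set programmatically. A sketch in Python with mpi4py (the hint names are the ones shown above; the values and file name are illustrative):

      from mpi4py import MPI

      info = MPI.Info.Create()
      info.Set("romio_cb_write", "enable")    # force collective buffering for writes
      info.Set("romio_ds_write", "disable")   # turn off data sieving for writes
      info.Set("cb_nodes", "4")               # number of collective-buffering aggregators

      fh = MPI.File.Open(MPI.COMM_WORLD, "out.dat",
                         MPI.MODE_CREATE | MPI.MODE_WRONLY, info)
      fh.Close()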
  • 46. Optimizing I/O at the Middleware Layer: MPI-IO
  - Set MPICH_MPIIO_HINTS_DISPLAY to dump a summary of the MPI-IO hints in effect, for example:
      cb_buffer_size = 16777216
      romio_cb_read = automatic
      romio_cb_write = automatic
      cb_nodes = 61
      cb_align = 2
      romio_no_indep_rw = false
      romio_cb_pfr = disable
      romio_cb_fr_types = aar
      romio_cb_fr_alignment = 1
      romio_cb_ds_threshold = 0
      romio_cb_alltoall = automatic
      ind_rd_buffer_size = 4194304
      ind_wr_buffer_size = 524288
      romio_ds_read = disable
      romio_ds_write = disable
      striping_factor = 61
      striping_unit = 8388608
      direct_io = false
      aggregator_placement_stride = -1
      abort_on_rw_error = disable
      cb_config_list = *:*
      romio_filesystem_type = CRAY ADIO
  • 47. Optimizing MPI-IO on Cori: Haswell vs. KNL
  About this test:
  - MPI-IO, collective I/O
  - 486 GB file
  - 32 processes per node, 32 nodes
  (Figure: bandwidth vs. collective buffer size on the x-axis; different colors show different numbers of aggregators)
  We recommend:
  - 4 aggregators per node on Haswell
  - 8 aggregators per node on KNL
  - See our paper: J. Liu et al., "Understanding the I/O Performance Gap Between Cori KNL and Haswell," CUG'17
  • 48. Optimizing I/O at the Middleware Layer: Guidelines
  - Limit the number of files (less metadata, easier to post-process)
  - Make large and contiguous requests
    - Avoid small accesses
    - Avoid noncontiguous accesses
    - Avoid random accesses
  - Prefer collective I/O to independent I/O (especially if the operations can be aggregated into single large contiguous requests)
  - Use derived datatypes and file views to let the MPI-IO collective machinery do the work
  - Try MPI-IO hints (especially the collective buffering optimization; disabling data sieving is also very often a good idea; hints are also useful for libraries built on MPI-IO)
  Credits: Philippe Wautelet @ IDRIS
  • 49. Optimizing I/O at the High-Level Interface: HDF5
  - The MPI-IO optimization guidelines generally apply at the HDF5 layer as well
  - Collective metadata operations (available since HDF5 1.10)
    - Lets the library use one rank to read the metadata and broadcast it to all other ranks
    - For writes, it constructs an MPI derived datatype and writes the metadata collectively in a single call
  - Increase page buffering
    - H5Pset_page_buffer_size
  - Stay tuned for the next HDF5 talk at 10am
  • 50. Optimizing I/O from the Python Interface: H5py
  - Optimal HDF5 file creation
    - Use the low-level API in h5py to get closer to the HDF5 C library and allow fine tuning (2.25x speedup in this test; a sketch follows below)
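  A hedged sketch of what "dropping to the low-level API" can look like (this is not the exact benchmark code behind the 2.25x figure; the file name and sizes are invented):

      import h5py
      import numpy as np

      # High-level route: h5py.File(...).create_dataset(...) is convenient but adds overhead.
      # Low-level route: drive libhdf5 through h5py.h5f / h5py.h5s / h5py.h5d directly.
      fid = h5py.h5f.create(b"lowlevel.h5", h5py.h5f.ACC_TRUNC)
      space = h5py.h5s.create_simple((1024,))
      dset = h5py.h5d.create(fid, b"data", h5py.h5t.NATIVE_INT32, space)
      dset.write(h5py.h5s.ALL, h5py.h5s.ALL, np.arange(1024, dtype=np.int32))
      fid.close()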
  • 51. Optimizing I/O from the Python Interface: H5py
  - Speed up I/O with collective I/O
    - Using 1k processes to write a 1 TB file, collective I/O achieved a 2x speedup on Cori (see the sketch below)
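  A minimal sketch of a collective write in h5py (parallel HDF5 build plus mpi4py assumed; the names and sizes are invented, not the 1 TB benchmark itself):

      from mpi4py import MPI
      import h5py
      import numpy as np

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()

      f = h5py.File("parallel.h5", "w", driver="mpio", comm=comm)
      dset = f.create_dataset("data", (comm.Get_size(), 1024), dtype="i4")

      # Inside this context every rank participates in the write together, so
      # HDF5/MPI-IO can aggregate the requests instead of issuing one small
      # write per rank.
      with dset.collective:
          dset[rank, :] = np.full(1024, rank, dtype="i4")

      f.close()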
  • 52. Optimizing I/O from the Python Interface: H5py
  - Avoid type casting in h5py
    - In this test, avoiding an implicit dtype conversion reduced the read time from 527 seconds to 1.3 seconds (illustrated below)
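  The pitfall, sketched (illustrative only, not the exact code behind the numbers above): if the in-memory dtype does not match the on-disk dtype, HDF5 converts every element during the read.

      import h5py
      import numpy as np

      with h5py.File("parallel.h5", "r") as f:     # file name reused from the sketches above
          dset = f["data"]                         # stored as 32-bit integers ("i4")

          # Mismatched dtype: every element is converted during the read (slow path).
          wrong = np.empty(dset.shape, dtype="f8")
          dset.read_direct(wrong)

          # Matching dtype: a straight copy from file to buffer, no casting in the I/O path.
          right = np.empty(dset.shape, dtype=dset.dtype)
          dset.read_direct(right)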
  • 53. Object Store for HPC
  - Amazon S3 has been very successful at supporting a wide range of applications, e.g., Instagram, Dropbox
  - HPC file systems rely on strong POSIX semantics, which hinder performance, e.g., scalability
  - Object store operations:
    - Put: creates a new object and fills it with data
    - Get: retrieves the data based on the object ID
  - Benefit: scalability
    - Lockless: objects are immutable and write-once, so there is no need to lock before a read
    - Fast lookup, based on a simple hash of the object ID (a toy sketch follows below)
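  A toy sketch of the put/get model described above — purely illustrative, not any real object-store API — showing why immutable, hash-addressed objects avoid locking:

      import hashlib

      class ToyObjectStore:
          """Illustrative only: write-once objects addressed by a hash of their content."""
          def __init__(self):
              self._objects = {}

          def put(self, data: bytes) -> str:
              oid = hashlib.sha256(data).hexdigest()   # object ID from a simple hash
              self._objects.setdefault(oid, data)      # write-once: never overwritten
              return oid

          def get(self, oid: str) -> bytes:
              return self._objects[oid]                # fast lookup, no locks needed

      store = ToyObjectStore()
      oid = store.put(b"checkpoint bytes")
      assert store.get(oid) == b"checkpoint bytes"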
  • 54. Object Store for HPC
  - Disadvantages
    - Data cannot be modified in place, while most HPC applications want to read and update their data
    - Limited metadata support (e.g., user info, access permissions), which requires an extra database layer
  (Figure courtesy of Glenn Lockwood)
  - DDN WOS object store testbed at NERSC (to be announced soon) — contact me or Damian Hazen if you are interested
  • 55. Burst Buffer
  - Burst Buffer on Cori
    - 1.7 TB/s of peak I/O performance with 28M IOPS
    - 1.8 PB of storage
  - An NVRAM-based "Burst Buffer" (BB) acts as an intermediate layer
    - Handles I/O spikes without a huge PFS (stages to the PFS asynchronously)
    - The underlying media supports challenging I/O patterns
    - Software provides file systems on demand, which scales better than a large POSIX PFS
  - Cray DataWarp software allocates storage to users per job
    - Users see a POSIX file system on demand, striped across BB nodes
    - Users can specify data to stage in/out from Lustre while the job is in the queue (see the example batch directives below)
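  As an example of the per-job allocation and staging, here is a hypothetical Cori batch script using DataWarp #DW directives (the paths, capacity, node count, and application name are invented):

      #!/bin/bash
      #SBATCH -N 32 -t 00:30:00
      #DW jobdw capacity=200GB access_mode=striped type=scratch
      #DW stage_in  source=/global/cscratch1/sd/username/input  destination=$DW_JOB_STRIPED/input  type=directory
      #DW stage_out source=$DW_JOB_STRIPED/output destination=/global/cscratch1/sd/username/output type=directory

      srun -n 1024 ./my_app --in $DW_JOB_STRIPED/input --out $DW_JOB_STRIPED/output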
  • 56. NERSC Burst Buffer Architecture
  (Diagram: compute nodes on the Aries high-speed network; Burst Buffer blades, each blade holding 2 BB nodes with 4 Intel P3608 3.2 TB SSDs per node; I/O nodes with 2x InfiniBand HCAs bridging to the InfiniBand storage fabric and the Lustre OSSs/OSTs; 1.8 PB total on 144 BB nodes)
  • 57. 0" 100" 200" 300" 400" 500" 600" File"open" Fiber"object"copy" Catalog"query"&"copy" Cost%(s)% Steps%in%Workflow% Lustre>Cori" BB" Burst Buffer Use Case: H5Boss in Astronomy •  BOSS Baryon OscillaKon Spectroscopic Survey – from SDSS •  Perform typical randomly generated query to extract small amount of stars/galaxies from millions •  Workflows involve 1000s of file open/ close and random and small read/ write I/O •  Run on final release of SDSS-III complete BOSS dataset –  2393 HDF5 files - total ~3.2TB - 57 - •  4.4 TB Burst Buffer - 22 nodes •  Lower I/O Kmes on Burst Buffer •  5.5x speedup for enKre workflow