UC Merced, Oct 2016
Bridging the I/O Gap between Spark and Scientific Data Formats on Supercomputer
	
Jialin Liu
Jalnliu@lbl.gov
National Energy Research Scientific Computing Center (NERSC)
Lawrence Berkeley National Lab (LBNL)
-	1	-
Outline
•  Data Analytics Stack in Industry and HPC
•  Related work: SciSpark
•  HDF5 and Spark Data Model
•  H5Spark Design
•  H5Spark Evaluation and Science Driver
-	2	-
Data-centric Analytics in Industry and Science
•  Science: data collected from instruments is increasing rapidly
–  The Large Synoptic Survey Telescope captures ultra-high-resolution images of the sky every 15 seconds, every night, for at least 10 years: more than 100 petabytes of data (about 20 million DVDs at 4.7 GB each), 2022
•  Industry: self-driving cars
–  The car's sensors generate 1 gigabyte every second
–  2 petabytes of data per car per year
–  1 billion cars worldwide
Images: LSST, Google
-	3	-
Spark: A Powerful Big Data Analytics Tool
•  A fast and general engine for large-scale data processing
–  Similar to Hadoop, except that it keeps data in memory for fast processing
•  Developed at UC Berkeley AMPLab; v1.0 in 2014, v2.0 in 2016
–  Actively developed, 1000+ contributors in 2015
•  Productive programming interface
–  6 vs 28 lines of code compared to Hadoop MapReduce
•  Implicit data parallelism
•  Fault tolerance
•  Rich libraries: stream processing, SQL, machine learning (MLlib), graph processing
-	4	-
Spark: A Powerful Big Data Analytics Engine
Berkeley Data Analytics Stack
-	5	-
Porting Spark onto HPC
Cori @ NERSC
•  Phase 1 Haswell: 1.92 PFlop/s, 1600 nodes, 52160 cores
•  Phase 2 KNL: 27.9 PFlop/s, 9304 nodes, 632672 cores
•  Burst buffer
•  16 GB MCDRAM
Science @ LBNL
Huge potential for science and computer science research
-	6	-
HPC Data Analytics Stack?
•  HPC programming model:
–  MPI, UPC, OpenMP, CUDA, etc.
–  High performance, low latency
–  Bare metal
•  Data analytics and management
–  HPL, PETSc, ScaLAPACK, etc.
–  Python, IPython, MATLAB, R
–  HDF5/netCDF, ROOT
Applications
High Level Libs
I/O Middleware
Network
Parallel File Systems
RAID
HPC Software Stack
Big data analytics stack in HPC: Mostly non-existent.
    -- Bruce A. Hendrickson, Director, SNL
-	7	-
Can	we	use	Spark	to	enable	faster	science	
discovery	on	HPC	architecture?	
-	8	-
This project and this work…
-	9	-	
Cray/AMPLab/NERSC Collaborations
Prabhat,	NERSC/LBNL
Porting Spark onto HPC
•  Advantages of Porting Spark onto HPC
–  A more productive API for data-intensive computing
–  Relieves users of the concurrency control, communication, and memory management required by the traditional MPI model
–  Embarrassingly parallel computing, data.map(f)
–  Fault tolerance, recompute()
•  Challenges
–  Spark was initially designed for commodity clusters
–  The programming model is map-reduce; does it apply to HPC workloads?
–  Programming languages are Scala/Java and Python, running on the JVM
–  Communication via TCP/IP, no RDMA support (cf. RDMA-Spark, Ohio State University)
–  I/O: Parquet/JSON/text
–  Storage: HDFS
-	10	-
Running Spark directly on HPC
NASA SciSpark
https://github.com/SciSpark/SciSpark
Loading a single large file on Lustre is not scalable
-	11	-
File System Matters
•  SciSpark used Spark's 'binaryFiles' function to load all data into memory
–  Assumed HDFS as the underlying file system
•  Running SciSpark on HPC lacks the proper file system support
–  HDFS handles parallel I/O, block size = 128 MB
•  hdfs://hdfs-path/part-00000
•  hdfs://hdfs-path/part-00001
•  ...
•  hdfs://hdfs-path/part-nnnnn
-	12	-
File System Matters
http://www.nersc.gov/users/storage-and-file-systems/file-systems/ngfdrawings/
-	13	-
I/O Formats Matter
•  Scientific data formats in HPC are not natively supported in Spark
–  HDF5/netCDF are among the top 5 libraries at NERSC, 2015
•  750+ unique users @ NERSC, millions of users worldwide
–  HDF dates from 1987 (NCSA, UIUC). NASA sends HDF-EOS data to 2.4 million end users
Brian Austin, NERSC
-	14	-
Data in HDF5
•  Hierarchical Structure
•  Multi-dimensional Array data model
Group
HDF5	
Dataset
1 2 3
4 5 6
7 8 9
Dataset
Attributes
Group
-	15	-
Data in Spark
•  RDD: Resilient Distributed Datasets
–  Read-only, partitioned collection of records in Spark
–  An RDD can contain any type of Python/Java/Scala objects
–  Fault tolerant
•  Transformations on RDD
–  filter, map, join, etc.
•  Actions on RDD
–  reduce, collect, etc.
•  Spark transformations are lazy; they run only when an action is invoked
•  RDD allows in-memory processing
–  rdd.cache() or rdd.persist()
–  Good for iterative or interactive processing
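
The laziness of transformations can be illustrated with a toy stand-in for an RDD, written in plain Python. This is only a sketch: `ToyRDD` and its methods are hypothetical names that mimic the shape of Spark's API. It is not Spark code, and it models neither distribution nor fault tolerance.

```python
from functools import reduce as _reduce


class ToyRDD:
    """Toy stand-in for an RDD: a list of partitions plus pending steps.

    Hypothetical illustration only; not part of Spark or H5Spark.
    """

    def __init__(self, partitions, steps=()):
        self._partitions = partitions      # list of lists, one per partition
        self._steps = tuple(steps)         # recorded (kind, function) pairs

    # Transformations: record the step and return a new ToyRDD; do no work.
    def map(self, f):
        return ToyRDD(self._partitions, self._steps + (("map", f),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._steps + (("filter", pred),))

    def _compute(self, part):
        # Replay the recorded steps over one partition.
        for kind, fn in self._steps:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    # Actions: trigger the computation.
    def collect(self):
        return [x for part in self._partitions for x in self._compute(part)]

    def reduce(self, op):
        return _reduce(op, self.collect())


calls = []                                  # records when the mapped function runs


def square(x):
    calls.append(x)
    return x * x


rdd = ToyRDD([[1, 2, 3], [4, 5, 6], [7, 8, 9]])    # three partitions
squares = rdd.map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)            # nothing has run yet
result = squares.collect()                  # the action triggers the work
```

In this toy model every action replays all recorded steps from the start; avoiding that repeated work is what rdd.cache() is for in Spark.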
-	16	-
Data in Spark
myRDD	:	RDD	
Partition
Partition
Partition
Partition
Array	
--Tony	Duarte	
-	17	-
Data in Spark
Transformation
-	18	-
H5Spark: Support HDF5 in Spark
•  What does Spark offer for reading various data formats?
–  Text file, sc.textFile()
–  Parquet, sqlContext.read.parquet()
–  JSON, sqlContext.read.json()
•  Challenges: Functionality and Performance
–  How to transform an HDF5 dataset into an RDD?
–  How to utilize the HDF5 I/O libraries in Spark?
–  How to enable parallel I/O on HPC?
–  What is the impact of Lustre striping?
HDF5 → Parquet?
-	19	-
H5Spark: Software Overview
•  Scala/Python implementation
–  Spark favors Scala and Python
–  H5Spark uses the HDF5 Java library
–  Underneath is the HDF5 C POSIX library
–  No MPI-IO support
•  H5Spark as a standalone package
–  Users can load it in their Spark applications
–  H5Spark module on Cori
–  sbt package → h5spark_2.10-1.0.jar
•  Open source
–  GitHub: https://github.com/valiantljk/h5spark
	
[Figure: software stack with the layers H5Spark (Scala/Python), JHI5 (Java), H5Py (Python), HDF5 (C), and HDF5 (MPI); version label 1.8.14]
-	20	-
H5Spark: Design
[Figure: H5Spark design. A user application calls H5Spark, whose Metadata Analyzer, Hyperslab Partitioner, RDD Seeder, and RDD Constructor read an HDF5 dataset from the Lustre file system with parallel I/O and produce an RDD.]
•  RDD Seeder
•  Metadata Analyzer
•  Hyperslab Partitioner
•  RDD Constructor
-	21	-
H5Spark: Design
•  Metadata Analyzer
–  Single I/O call to parse the HDF5 header and get the dimensions, data type, etc.
•  Hyperslab Partitioner
–  Balance between Spark partitions and HDF5 dataset size
•  RDD Seeder
–  A lightweight RDD seed
•  RDD Constructor
–  Direct transformation
-	22	-
H5Spark: From HDF5 to RDD
•  Input:
–  HDF5 File Path: f
–  Dataset Name: v
–  SparkContext: sc
–  *Spark Partition: p
*Spark Partition determines the degree of parallelism (comparable to MPI processes + OpenMP); p > number of cores
•  Output: RDD: r
•  Under the Hood: reading HDF5 into RDD
–  Adjust partitions:     p = p > dim[sid] ? dim[sid] : p
–  Determine hyperslab:   offset[i] = dim[sid]/p * i
–  Seed RDD:              r_seed = sc.parallelize(offset, p)
–  Perform parallel I/O:  r_seed.flatMap(h5read(f, v))
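
The partitioning steps above can be sketched in plain Python. This is a minimal sketch, not H5Spark's code: the function name `hyperslab_offsets` is hypothetical, the split is along the first dimension only, and giving the remainder rows to the last partition is an assumption borrowed from the MPI sample code rather than something the slide states for H5Spark.

```python
def hyperslab_offsets(dim, p):
    """Split `dim` rows into at most `p` contiguous (offset, count) hyperslabs.

    Hypothetical helper for illustration; `dim` plays the role of dim[sid].
    """
    if dim <= 0 or p <= 0:
        raise ValueError("dim and p must be positive")
    p = dim if p > dim else p          # adjust partitions: p = p > dim ? dim : p
    rows = dim // p                    # rows per partition (integer division)
    slabs = []
    for i in range(p):
        offset = rows * i              # offset[i] = dim/p * i
        count = rows
        if i == p - 1:
            count += dim % p           # assumption: last slab takes the remainder
        slabs.append((offset, count))
    return slabs


slabs = hyperslab_offsets(10, 3)       # [(0, 3), (3, 3), (6, 4)]
```

In Spark, a list like this would seed the RDD (for example with sc.parallelize(slabs, len(slabs))), and each task would then read only its own slab from the file.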
-	23	-
H5Spark: How to Use
•  H5Spark APIs
–  Correspond to the Spark MLlib interface
   import org.apache.spark.mllib.linalg
–  Data types: vector, labeled point, matrix, indexed row matrix, etc.
–  Input: sc, f, v, p

Function        Output
h5read          An RDD of double arrays
h5read_point    An RDD of (key, value) pairs
h5read_vec      An RDD of vectors
h5read_irow     An RDD of indexed rows
h5read_imat     An RDD of indexed row matrix
-	24	-
H5Spark: How to Use
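•  Sample codes, H5Spark vs MPI

H5Spark Parallel Read

    val sc = new SparkContext()
    val rdd = h5read(sc, f, v, p)   // p: number of partitions (parallelism)
    sc.stop()

MPI Parallel Read

    #include <stdlib.h>
    #include <mpi.h>
    #include <hdf5.h>

    /* Each MPI rank reads a contiguous block of rows of dataset v in file f. */
    int main(int argc, char **argv) {
      MPI_Comm comm = MPI_COMM_WORLD;
      MPI_Info info = MPI_INFO_NULL;
      int mpi_size, mpi_rank, i;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(comm, &mpi_size);
      MPI_Comm_rank(comm, &mpi_rank);
      const char *f = argv[1];   /* HDF5 file path */
      const char *v = argv[2];   /* dataset name */
      /* Open the file through the MPI-IO driver. */
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, comm, info);
      hid_t file = H5Fopen(f, H5F_ACC_RDONLY, fapl);
      hid_t dataset = H5Dopen(file, v, H5P_DEFAULT);
      hid_t dataspace = H5Dget_space(dataset);
      /* Query the dataset shape. */
      int rank = H5Sget_simple_extent_ndims(dataspace);
      hsize_t dims_out[rank], offset[rank], count[rank];
      H5Sget_simple_extent_dims(dataspace, dims_out, NULL);
      /* Parallelism: split the first dimension evenly across ranks;
         the last rank also takes the remainder. */
      hsize_t rest = dims_out[0] % mpi_size;
      count[0] = dims_out[0] / mpi_size;
      if (mpi_rank == mpi_size - 1) count[0] += rest;
      offset[0] = dims_out[0] / mpi_size * mpi_rank;
      for (i = 1; i < rank; i++) {
        offset[i] = 0;
        count[i] = dims_out[i];
      }
      H5Sselect_hyperslab(dataspace, H5S_SELECT_SET, offset, NULL, count, NULL);
      hsize_t rankmemsize = 1;
      for (i = 0; i < rank; i++) rankmemsize *= count[i];
      hid_t memspace = H5Screate_simple(rank, count, NULL);
      double *data_t = (double *)malloc(sizeof(double) * rankmemsize);
      H5Dread(dataset, H5T_NATIVE_DOUBLE, memspace, dataspace, H5P_DEFAULT, data_t);
      /* Release resources. */
      free(data_t);
      H5Sclose(memspace);
      H5Sclose(dataspace);
      H5Dclose(dataset);
      H5Fclose(file);
      H5Pclose(fapl);
      MPI_Finalize();
      return 0;
    }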
-	25	-
H5Spark: Evaluation
•  About the System
–  Cori Phase 1: Cray XC40 supercomputer, 1600 compute nodes, 248 Lustre OSTs
–  Each compute node has 32 cores and 128 GB of RAM

•  Experimental Setup
–  Datasets: 2.2 TB global ocean temperature data and 16 TB CAM5 atmosphere data
–  Both in HDF5 format, double precision
–  Number of nodes: 45, 90, 135, 1600
–  Stripe counts: 1, 8, 24, 72, 144, 248
-	26	-
H5Spark: Evaluation
•  Scaling/Profiling H5Spark with Lustre Striping
–  45 nodes, 1440 cores, 3000 partitions, 2.2 TB data, 1 MB stripe size

[Figures: I/O Bandwidth with Lustre Striping; H5Spark Task Launching Delay]
H5Spark scales with the number of Lustre OSTs
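
To see why the stripe count matters for a parallel read, the sketch below maps byte ranges to stripe slots. It assumes the plain round-robin (RAID 0 style) striping layout; the function names are hypothetical, and a slot is an index into the file's own list of OSTs, not a physical OST id.

```python
from collections import Counter

MIB = 1024 * 1024


def ost_slot(offset, stripe_size, stripe_count):
    """Return which of the file's stripe slots holds byte `offset`.

    Assumes round-robin striping; hypothetical helper for illustration.
    """
    return (offset // stripe_size) % stripe_count


def slots_touched(offset, length, stripe_size, stripe_count):
    """Count bytes read from each slot for the range [offset, offset + length)."""
    touched = Counter()
    pos = offset
    end = offset + length
    while pos < end:
        # End of the stripe that contains `pos`.
        stripe_end = (pos // stripe_size + 1) * stripe_size
        chunk = min(end, stripe_end) - pos
        touched[ost_slot(pos, stripe_size, stripe_count)] += chunk
        pos += chunk
    return touched


# A 16 MiB read starting at 3 MiB, with 1 MiB stripes spread over 8 slots.
usage = slots_touched(3 * MIB, 16 * MIB, MIB, 8)
```

With 8 slots this read is spread evenly, 2 MiB per slot. With a stripe count of 1, the same read (and every other partition's read) lands on a single slot, which is consistent with bandwidth improving as more OSTs are used.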
	
-	27	-
H5Spark: Evaluation
•  Scaling H5Spark with Partitions
–  45 nodes, 2.2 TB

The number of partitions can be tuned based on the workload and resources
Partitions = 2 x Cores
-	28	-
H5Spark: Evaluation
•  Scaling H5Spark with Executors and/or Partitions
–  2.2 TB; 45, 95, 135 nodes

Increase the number of executors and partitions at the same time
-	29	-
H5Spark: Evaluation
•  H5Spark has been tested at full scale on Cori Phase 1

Tests        Size (TB)   I/O (s)   B/W (GB/s)   OSTs   Executors   Partitions
135 nodes    2.2         37        59.7         144    135         9000
Full scale   16          120       136.5        144    1522        52100
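
The bandwidth column is the data size divided by the I/O time. A minimal sketch of that arithmetic follows; the function name is hypothetical, and counting 1 TB as 1024 GB is an assumption (the slide does not state the unit convention).

```python
def bandwidth_gb_per_s(size_tb, seconds, gb_per_tb=1024):
    """Aggregate I/O bandwidth in GB/s for `size_tb` terabytes read in `seconds`.

    gb_per_tb=1024 is an assumed convention, not taken from the slides.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_tb * gb_per_tb / seconds


full_scale = bandwidth_gb_per_s(16, 120)   # full-scale row of the table
```

Under that assumption the full-scale row rounds to the 136.5 GB/s shown in the table. The 135-node row comes out slightly higher than 59.7 GB/s when computed from 2.2 TB, presumably because the size in the table is rounded.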
-	30	-	
The	largest	run	to	date,	not	only	in	industry	but	also	in	HPC
H5Spark: Evaluation
•  H5Spark Python vs Scala

Version   I/O (s)   B/W (GB/s)   Speedup   Mem (GB)   Ratio
Python    162       13.65        1         479        1
Scala     90        24.56        1.8       2210       4.61

Scala is 1.8x faster than Python but uses 4.61x the memory
-	31	-
H5Spark: Evaluation
•  H5Spark vs MPI-IO

MPI scales better with increasing OSTs
H5Spark scales well with nodes (while MPI saturates the I/O)
Partitions are also increased
-	32	-	
2-3x I/O gap; MPI is still the winner
-	33	-	
H5Spark for Science
Daya Bay: neutrino sensor array measurements; used for NMF

Ocean and Atmosphere: climate variables (ocean temperature, atmospheric humidity) measured on a 3D grid at 3 or 6 hour intervals over about 30 years; used for PCA
-	34	-	
PCA Results
MPI	
Spark	
Conclusion
•  Porting Spark onto HPC
–  I/O and Storage
–  Formats
•  H5Spark:
–  An efficient HDF5 file loader for Spark
–  Conduct big data analysis on scientific data sets
•  Is Spark a good fit in HPC?
–  Productivity
–  Auto-parallelism
–  Stragglers, scheduling
–  Redesign
-	35	-
-	36	-	
Thanks
•  Cray: Jim Harrell, Venkat Krishnamurthy, Michael Ringenburg, Kristyn Maschhoff, Pramod Sharma
•  AMPLab/UCB: Michael W. Mahoney, Alex Gittens, Aditya Devarakonda, Jey Kottalam, James Demmel
•  NERSC/LBNL: Evan Racah, Lisa Gerhardt, Jialin Liu, Shane Canon, Prabhat

1.  "Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies", IEEE BigData'16
2.  "H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems", CUG'16
-	37	-	
Time Waiting Until Stage End (Stragglers)
Time from a task finishing to when its stage finishes, spent waiting for other, slower tasks
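
The quantity defined above can be computed directly from per-task finish times. A minimal sketch, with a hypothetical function name and made-up finish times, assuming all tasks belong to one stage that ends when its slowest task ends:

```python
def waiting_times(finish_times):
    """Per-task idle time between a task's finish and the end of its stage."""
    if not finish_times:
        return []
    stage_end = max(finish_times)      # the stage ends with its slowest task
    return [stage_end - t for t in finish_times]


finish = [4.0, 5.0, 4.5, 12.0]         # hypothetical task finish times, in seconds
waits = waiting_times(finish)          # [8.0, 7.0, 7.5, 0.0]
```

Here a single straggler finishing at 12 s leaves the other three tasks idle for 22.5 task-seconds in total.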

Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 

H5spark

  • 2. Outline
      •  Data Analytics Stack in Industry and HPC
      •  Related work: SciSpark
      •  HDF5 and Spark Data Model
      •  H5Spark Design
      •  H5Spark Evaluation and Science Driver
  • 3. Data-centric Analytics in Industry and Science
      •  Science: data collected from instruments is increasing rapidly
          –  The Large Synoptic Survey Telescope (LSST) captures ultra-high-resolution images of the sky every 15 seconds, every night, for at least 10 years: more than 100 petabytes of data (about 20 million DVDs at 4.7 GB each), 2022
      •  Industry: self-driving cars (Google)
          –  The car's sensors generate 1 gigabyte every second
          –  2 petabytes of data per car per year
          –  1 billion cars worldwide
  • 4. Spark: A Powerful Big Data Analytics Tool
      •  A fast and general engine for large-scale data processing
          –  Similar to Hadoop, except that it keeps data in memory for fast processing
      •  Developed at UCB AMPLab; v1.0 in 2014, v2.0 in 2016
          –  Actively developed, 1000+ contributors in 2015
      •  Productive programming interface
          –  6 vs. 28 lines of code compared to Hadoop MapReduce
      •  Implicit data parallelism
      •  Fault tolerance
      •  Rich libraries: stream processing, SQL, machine learning (MLlib), graph processing
  • 5. Spark: A Powerful Big Data Analytics Engine
      [Figure: Berkeley Data Analytics Stack]
  • 6. Porting Spark onto HPC
      Cori @ NERSC
      •  Phase 1 Haswell: 1.92 PFlops/sec, 1600 nodes, 52160 cores
      •  Phase 2 KNL: 27.9 PFlops/sec, 9304 nodes, 632672 cores
      •  Burst buffer
      •  16 GB MCDRAM
      Science @ LBNL
      •  Huge potential for science and computer science research
  • 7. HPC Data Analytics Stack?
      •  HPC programming model:
          –  MPI, UPC, OpenMP, CUDA, etc.
          –  High performance, low latency
          –  Bare metal
      •  Data analytics and management
          –  HPL, PETSc, ScaLAPACK, etc.
          –  Python, IPython, MATLAB, R
          –  HDF5/netCDF, ROOT
      •  HPC Software Stack: Applications, High Level Libs, I/O Middleware, Network, Parallel File Systems, RAID
      •  "Big data analytics stack in HPC: mostly non-existent." (Bruce A. Hendrickson, Director, SNL)
  • 9. This project and this work…
      •  Cray/AMPLab/NERSC collaborations
      •  Prabhat, NERSC/LBNL
  • 10. Porting Spark onto HPC
      •  Advantages of porting Spark onto HPC
          –  A more productive API for data-intensive computing
          –  Relieves users of the concurrency control, communication, and memory management required by the traditional MPI model
          –  Embarrassingly parallel computing: data.map(f)
          –  Fault tolerance: recompute()
      •  Challenges
          –  Spark was initially designed for commodity clusters
          –  The programming model is map-reduce; does it apply to HPC workloads?
          –  Programming languages are Scala/Java and Python, running on the JVM
          –  Communication over TCP/IP, no RDMA support (RDMA-Spark, Ohio State University)
          –  I/O: Parquet/JSON/text
          –  Storage: HDFS
  • 11. Running Spark directly on HPC
      •  NASA SciSpark: https://github.com/SciSpark/SciSpark
      •  Loading a single large file on Lustre is not scalable
  • 12. File System Matters
      •  SciSpark used Spark's binaryFiles function to load all data into memory
          –  Assumed HDFS as the underlying file system
      •  Running SciSpark on HPC lacks the proper file system support
          –  HDFS handles parallel I/O, block size = 128 MB
              •  hdfs://hdfs-path/part-00000
              •  hdfs://hdfs-path/part-00001
              •  ...
              •  hdfs://hdfs-path/part-nnnnn
  • 14. I/O Formats Matter
      •  Scientific data formats in HPC are not natively supported in Spark
          –  HDF5/netCDF are among the top 5 libraries at NERSC, 2015 (Brian Austin, NERSC)
              •  750+ unique users at NERSC, millions of users worldwide
          –  HDF dates from 1987 at NCSA/UIUC; NASA sends HDF-EOS data to 2.4 million end users
  • 15. Data in HDF5
      •  Hierarchical structure
      •  Multi-dimensional array data model
      [Figure labels: HDF5, Group, Dataset (3 x 3 arrays holding the values 1 to 9), Dataset Attributes]
  • 16. Data in Spark
      •  RDD: Resilient Distributed Dataset
          –  Read-only, partitioned collection of records in Spark
          –  An RDD can contain any type of Python/Java/Scala object
          –  Fault tolerant
      •  Transformations on RDDs
          –  filter, map, join, etc.
      •  Actions on RDDs
          –  reduce, collect, etc.
      •  Spark transformations are lazy: they run only when an action is called
      •  RDDs allow in-memory processing
          –  rdd.cache() or rdd.persist()
          –  Good for iterative or interactive processing
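The lazy transformation/action split on this slide can be illustrated without a Spark installation. The sketch below is a pure-Python analogy, not Spark's API: the class name `LazyDataset` and its methods are hypothetical stand-ins that only mimic how an RDD records transformations and defers the work until an action such as `collect` or `reduce` is called.

```python
import functools


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, source, ops=()):
        self._source = source    # any re-iterable collection of records
        self._ops = tuple(ops)   # recorded transformations, not yet executed

    # Transformations: return a new dataset, do no work.
    def map(self, f):
        return LazyDataset(self._source, self._ops + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _evaluate(self):
        # Replay the recorded transformations from the source.
        records = iter(self._source)
        for kind, fn in self._ops:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return records

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._evaluate())

    def reduce(self, f):
        return functools.reduce(f, self._evaluate())


calls = []


def square(x):
    calls.append(x)  # record that the function really ran
    return x * x


odd_squares = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)                 # transformations only: nothing ran
total = odd_squares.reduce(lambda a, b: a + b)   # the action runs the pipeline
```

Each transformation returns a new object and leaves the original untouched, which mirrors the read-only nature of RDDs; replaying the recorded steps from the source is also, loosely, how a lost partition can be recomputed.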
  • 19. H5Spark: Support HDF5 in Spark
      •  What does Spark offer for reading various data formats?
          –  Text file: sc.textFile()
          –  Parquet: sqlContext.read.parquet() (spark.read.parquet() in Spark 2.x)
          –  JSON: sqlContext.read.json() (spark.read.json() in Spark 2.x)
      •  Challenges: functionality and performance
          –  How to transform an HDF5 dataset into an RDD?
          –  How to utilize the HDF5 I/O libraries in Spark?
          –  How to enable parallel I/O on HPC?
          –  What is the impact of Lustre striping? HDF5 → Parquet?
  • 20. H5Spark: Software Overview
      •  Scala/Python implementation
          –  Spark favors Scala and Python
          –  H5Spark uses the HDF5 Java library
          –  Underneath is the HDF5 C POSIX library
          –  No MPI-IO support
      •  H5Spark as a standalone package
          –  Users can load it in their Spark applications
          –  H5Spark module on Cori
          –  sbt package produces h5spark_2.10-1.0.jar
      •  Open source
          –  GitHub: https://github.com/valiantljk/h5spark
      [Figure labels: H5Spark (Scala/Python), JHI5 (Java), H5Py (Python), HDF5 (C), HDF5 MPI, 1.8.14]
  • 21. H5Spark: Design
      •  RDD Seeder
      •  Metadata Analyzer
      •  Hyperslab Partitioner
      •  RDD Constructor
      [Figure labels: User App, H5Spark Metadata Analyzer, H5Spark Hyperslab Partitioner, H5Spark RDD Seeder, H5Spark RDD Constructor, Parallel I/O, RDD, HDF5 Group and Dataset, Lustre File System]
  • 22. H5Spark: Design
      •  Metadata Analyzer
          –  A single I/O call parses the HDF5 header to get the dimensions, data type, etc.
      •  Hyperslab Partitioner
          –  Balances the number of Spark partitions against the HDF5 dataset size
      •  RDD Seeder
          –  A lightweight RDD seed
      •  RDD Constructor
          –  Direct transformation
  • 23. H5Spark: From HDF5 to RDD
      •  Input:
          –  HDF5 file path: f
          –  Dataset name: v
          –  SparkContext: sc
          –  Spark partitions*: p
      •  Output: RDD r
      •  Under the hood: reading HDF5 into an RDD
          –  Adjust partitions: p = p > dim[sid] ? dim[sid] : p
          –  Determine hyperslab offsets: offset[i] = dim[sid]/p * i
          –  Seed the RDD: r_seed = sc.parallelize(offset, p)
          –  Perform parallel I/O: r_seed.flatMap(h5read(f, v))
      * The number of Spark partitions determines the degree of parallelism (analogous to MPI processes + OpenMP); p > number of cores
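The partition adjustment and offset formulas on this slide can be traced with a small pure-Python sketch. It is not H5Spark's implementation: `hyperslab_partitions` and `read_slab` are hypothetical helpers, the dataset is an in-memory list rather than an HDF5 file, and a plain loop stands in for `sc.parallelize(...).flatMap(...)`. The slide gives only the offsets; letting the last slab absorb the remainder is an assumption borrowed from the MPI sample on the "How to Use" slide.

```python
def hyperslab_partitions(dim, p):
    """Split `dim` rows into at most `p` contiguous (offset, count) slabs."""
    if dim <= 0 or p <= 0:
        raise ValueError("dim and p must be positive")
    p = min(p, dim)          # p = p > dim[sid] ? dim[sid] : p
    step = dim // p          # rows per partition (integer division)
    slabs = []
    for i in range(p):
        offset = step * i    # offset[i] = dim[sid]/p * i
        # Assumption: the last slab takes any leftover rows.
        count = dim - offset if i == p - 1 else step
        slabs.append((offset, count))
    return slabs


def read_slab(dataset, slab):
    """Stand-in for reading one hyperslab; `dataset` is a list of rows."""
    offset, count = slab
    return dataset[offset:offset + count]


# A toy 10 x 2 "dataset" in place of an HDF5 file.
dataset = [[float(r), float(r) + 0.5] for r in range(10)]
slabs = hyperslab_partitions(len(dataset), 4)

# In Spark, each slab would be one seed element and the read would run
# inside flatMap on the executors; here a loop plays that role.
rows = [row for slab in slabs for row in read_slab(dataset, slab)]
```

With 10 rows and 4 partitions the slabs start at rows 0, 2, 4, and 6, and the last one covers the remaining 4 rows, so concatenating the slabs reproduces the dataset exactly once.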
  • 24. H5Spark: How to Use
      •  H5Spark APIs
      •  Correspond to the Spark MLlib interface
          –  import org.apache.spark.mllib.linalg
          –  Data types: vector, labeled point, matrix, indexed row matrix, etc.
      •  Input: sc, f, v, p

      Function        Output
      h5read          An RDD of double arrays
      h5read_point    An RDD of (key, value) pairs
      h5read_vec      An RDD of vectors
      h5read_irow     An RDD of indexed rows
      h5read_imat     An RDD of indexed row matrix
  • 25. H5Spark: How to Use
      •  Sample codes, H5Spark vs MPI (parallelism)

      H5Spark parallel read (Scala):

          val sc = new SparkContext()
          val rdd = h5read(sc, f, v, p)
          sc.stop()

      MPI parallel read (C):

          /* Excerpt: comm, info, rank, dims_out, and i are assumed to be set up earlier */
          MPI_Init(&argc, &argv);
          MPI_Comm_size(comm, &mpi_size);
          MPI_Comm_rank(comm, &mpi_rank);
          /* Open the file and dataset with the MPI-IO file driver */
          hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
          H5Pset_fapl_mpio(fapl, comm, info);
          hid_t file = H5Fopen(f, H5F_ACC_RDONLY, fapl);
          hid_t dataset = H5Dopen(file, v, H5P_DEFAULT);
          hid_t dataspace = H5Dget_space(dataset);
          /* Split the first dimension evenly; the last rank takes the remainder */
          hsize_t offset[rank];
          hsize_t count[rank];
          hsize_t rest = dims_out[0] % mpi_size;
          if (mpi_rank != (mpi_size - 1)) {
              count[0] = dims_out[0] / mpi_size;
          } else {
              count[0] = dims_out[0] / mpi_size + rest;
          }
          offset[0] = dims_out[0] / mpi_size * mpi_rank;
          for (i = 1; i < rank; i++) {
              offset[i] = 0;
              count[i] = dims_out[i];
          }
          /* Select this rank's hyperslab and read it into memory */
          herr_t status = H5Sselect_hyperslab(dataspace, H5S_SELECT_SET, offset, NULL, count, NULL);
          hsize_t rankmemsize = 1;
          for (i = 0; i < rank; i++) rankmemsize *= count[i];
          hid_t memspace = H5Screate_simple(rank, count, NULL);
          double *data_t = (double *)malloc(sizeof(double) * rankmemsize);
          H5Dread(dataset, H5T_NATIVE_DOUBLE, memspace, dataspace, H5P_DEFAULT, data_t);
          MPI_Finalize();
  • 26. H5Spark: Evaluation
      •  About the system
          –  Cori Phase 1, Cray XC40 supercomputer, 1600 compute nodes, 248 Lustre OSTs
          –  Each compute node has 32 cores and 128 GB of RAM
      •  Experimental setup
          –  Data: 2.2 TB of global ocean temperature data and 16 TB of CAM5 atmosphere data, both in HDF5 format, double precision
          –  Number of nodes: 45, 90, 135, 1600
          –  Stripe counts: 1, 8, 24, 72, 144, 248
  • 27. H5Spark: Evaluation
      •  Scaling/profiling H5Spark with Lustre striping
          –  45 nodes, 1440 cores, 3000 partitions, 2.2 TB data, 1 MB stripe size
      •  H5Spark is scalable with Lustre OSTs
      [Figures: I/O Bandwidth with Lustre Striping; H5Spark Tasks Launching Delay]
  • 28. H5Spark: Evaluation
      •  Scaling H5Spark with partitions
          –  45 nodes, 2.2 TB
      •  The number of partitions can be tuned based on the workload and resources
      •  Partitions = 2 x cores
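As a quick check of the "Partitions = 2 x cores" rule against the 45-node setup (32 cores per node, per the system description), the arithmetic can be written out as follows. The function name `suggested_partitions` is a hypothetical helper, and the factor of 2 is the slide's rule of thumb rather than a fixed requirement.

```python
def suggested_partitions(nodes, cores_per_node, factor=2):
    """Rule of thumb from the slide: partitions = factor x total cores."""
    if nodes <= 0 or cores_per_node <= 0 or factor <= 0:
        raise ValueError("all arguments must be positive")
    return factor * nodes * cores_per_node


total_cores = 45 * 32                      # 45 nodes, 32 cores each
partitions = suggested_partitions(45, 32)  # twice the core count
```

This gives 1440 cores and 2880 partitions, close to the 3000 partitions used in the Lustre striping experiment on the same 45 nodes.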
  • 29. H5Spark: Evaluation
      •  Scaling H5Spark with executors and/or partitions
          –  2.2 TB; 45, 95, 135 nodes
      •  Increase the number of executors and partitions at the same time
  • 30. H5Spark: Evaluation
      •  H5Spark has been tested at full scale on Cori Phase 1

      Tests        Size (TB)   I/O (s)   B/W (GB/s)   OSTs   Executors   Partitions
      135 nodes    2.2         37        59.7         144    135         9000
      Full scale   16          120       136.5        144    1522        52100

      •  The largest run to date, not only in industry but also in HPC
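The bandwidth column follows from size divided by I/O time. The sketch below recomputes it under the assumption that 1 TB is counted as 1024 GB, which is not stated on the slide but reproduces the full-scale figure; `bandwidth_gb_per_s` is a hypothetical helper name.

```python
def bandwidth_gb_per_s(size_tb, seconds, gb_per_tb=1024):
    """Aggregate read bandwidth, assuming 1 TB = 1024 GB."""
    return size_tb * gb_per_tb / seconds


full_scale = bandwidth_gb_per_s(16, 120)   # full-scale run: 16 TB in 120 s
nodes_135 = bandwidth_gb_per_s(2.2, 37)    # 135-node run: 2.2 TB in 37 s

# Data volume implied by the reported 59.7 GB/s over 37 s, in GB.
implied_size_gb = 59.7 * 37
```

The full-scale row comes out at about 136.5 GB/s, matching the table. The 135-node row comes out at about 60.9 GB/s rather than 59.7, which suggests the "2.2 TB" figure is rounded: 59.7 GB/s over 37 s corresponds to roughly 2209 GB.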
  • 31. H5Spark: Evaluation
      •  H5Spark Python vs Scala

      Version   I/O (s)   B/W (GB/s)   Speedup   Mem (GB)   Ratio
      Python    162       13.65        1         479        1
      Scala     90        24.56        1.8       2210       4.61

      •  Scala is faster than Python
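The derived columns in this table can be recomputed from the measured ones. The sketch below uses only the numbers on the slide; the dictionary layout and variable names are illustrative.

```python
# Measured values from the table: I/O time (s), bandwidth (GB/s), memory (GB).
python_run = {"io_s": 162, "bw_gb_s": 13.65, "mem_gb": 479}
scala_run = {"io_s": 90, "bw_gb_s": 24.56, "mem_gb": 2210}

# Speedup of Scala over Python is the ratio of I/O times.
speedup = python_run["io_s"] / scala_run["io_s"]

# Memory ratio is Scala's footprint relative to Python's.
mem_ratio = scala_run["mem_gb"] / python_run["mem_gb"]

# Data volume read by each version: time x bandwidth, in GB.
python_read_gb = python_run["io_s"] * python_run["bw_gb_s"]
scala_read_gb = scala_run["io_s"] * scala_run["bw_gb_s"]
```

The results agree with the table: a speedup of 1.8 and a memory ratio of about 4.61. Both versions read roughly 2210 GB, so the comparison is over the same 2.2 TB input, and the Scala version's memory footprint is about the size of that input.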
  • 33. H5Spark for Science
      •  Daya Bay: neutrino sensor array measurements; used for NMF
      •  Ocean and Atmosphere: climate variables (ocean temperature, atmospheric humidity) measured on a 3D grid at 3 or 6 hour intervals over about 30 years; used for PCA
  • 35. Conclusion
      •  Porting Spark onto HPC
          –  I/O and storage
          –  Formats
      •  H5Spark:
          –  An efficient HDF5 file loader for Spark
          –  Enables big data analysis on scientific data sets
      •  Is Spark a good fit in HPC?
          –  Productivity
          –  Auto-parallelism
          –  Stragglers, scheduling
          –  Redesign
  • 36. Thanks
      •  Cray: Jim Harrell, Venkat Krishnamurthy, Michael Ringenburg, Kristyn Maschhoff, Pramod Sharma
      •  AMPLab/UCB: Michael W. Mahoney, Alex Gittens, Aditya Devarakonda, Jey Kottalam, James Demmel
      •  NERSC/LBNL: Evan Racah, Lisa Gerhardt, Jialin Liu, Shane Canon, Prabhat
      1.  "Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies", IEEE BigData'16
      2.  "H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems", CUG'16
  • 37. Time Waiting Until Stage End (Stragglers)
      •  Time from when a task finishes to when its stage finishes, spent waiting for other, slower tasks