3. Data-centric Analytics in Industry and Science
• Science: data collected from instruments increases rapidly
– The Large Synoptic Survey Telescope will capture ultra-high-resolution images of the sky every 15 seconds, every night, for at least 10 years: more than 100 petabytes (about 20 million DVDs at 4.7 GB each) of data by 2022
• Industry: self-driving cars
– A car’s sensors generate 1 gigabyte of data every second
– About 2 petabytes of data per car per year
– 1 billion cars worldwide
(Images: LSST; Google)
4. Spark: A Powerful Big Data Analytics Tool
• A fast and general engine for large-scale data processing
– Similar to Hadoop, except that it keeps data in memory for fast processing
• Developed at UCB AMPLab; v1.0 in 2014, v2.0 in 2016
– Actively developed: 1000+ contributors in 2015
• Productive programming interface
– 6 lines of code vs. 28 for Hadoop MapReduce
• Implicit data parallelism
• Fault-tolerance
• Rich libraries: stream processing, SQL, machine learning (MLlib), graph processing
5. Spark: A Powerful Big Data Analytics Engine
Berkeley Data Analytics Stack
9. This project and this work…
Cray/AMPLab/NERSC collaborations
Prabhat, NERSC/LBNL
10. Porting Spark onto HPC
• Advantages of porting Spark onto HPC
– A more productive API for data-intensive computing
– Relieves users of the concurrency control, communication, and memory management required by the traditional MPI model
– Embarrassingly parallel computing: data.map(f)
– Fault tolerance: recompute()
• Challenges
– Spark was initially designed for commodity clusters
– The programming model is map-reduce; does it apply to HPC workloads?
– The programming languages are Scala/Java and Python, running on the JVM
– Communication goes over TCP/IP, with no RDMA support (cf. RDMA-Spark, Ohio State)
– I/O formats: Parquet/JSON/text
– Storage: HDFS
11. Running Spark directly on HPC
NASA SciSpark
https://github.com/SciSpark/SciSpark
Loading a single large file on Lustre is not scalable.
12. File System Matters
• SciSpark used Spark’s binaryFiles() function to load all data into memory
– It assumed HDFS as the underlying file system
• Running SciSpark on HPC lacks proper file-system support
– HDFS handles parallel I/O, with block size = 128 MB:
• hdfs://hdfs-path/part-00000
• hdfs://hdfs-path/part-00001
• ...
• hdfs://hdfs-path/part-nnnnn
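Since HDFS exposes a file as fixed-size blocks, the number of part-files (and hence potential parallel readers) follows directly from the file size. A back-of-the-envelope sketch in plain Python, using the 128 MB block size from the slide (the function name is illustrative):

```python
BLOCK_SIZE = 128 * 1024 ** 2  # HDFS block size from the slide: 128 MB

def num_hdfs_blocks(file_bytes, block_size=BLOCK_SIZE):
    # Ceiling division: each block is one part-file, i.e. one unit of parallel I/O
    return -(-file_bytes // block_size)

# A 10 GB file maps to 80 blocks, so up to 80 readers can work in parallel
print(num_hdfs_blocks(10 * 1024 ** 3))  # 80
```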
14. I/O Formats Matter
• Scientific data formats used in HPC are not natively supported in Spark
– HDF5 and netCDF were among the top 5 libraries at NERSC in 2015
• 750+ unique users at NERSC; millions of users worldwide
– Started in 1987 at NCSA/UIUC; NASA ships HDF-EOS to 2.4 million end users
Brian Austin, NERSC
15. Data in HDF5
• Hierarchical structure: an HDF5 file contains groups; groups contain datasets and other groups
• Multi-dimensional array data model: each dataset is an array (e.g. a 3×3 grid of the values 1-9) with attached attributes
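The two bullets can be pictured with a toy plain-Python model (an illustration of the data model only, not the HDF5 API; all names and the attribute are made up):

```python
# Toy model of an HDF5 file: a tree of groups whose leaves are datasets,
# each holding a multi-dimensional array plus attributes.
hdf5_file = {
    "/": {                                        # root group
        "group1": {
            "dataset1": {
                "shape": (3, 3),
                "data": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                "attrs": {"units": "counts"},     # illustrative attribute
            },
        },
    },
}

ds = hdf5_file["/"]["group1"]["dataset1"]
print(ds["shape"], ds["data"][1][2])  # (3, 3) 6
```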
16. Data in Spark
• RDD: Resilient Distributed Dataset
– A read-only, partitioned collection of records in Spark
– An RDD can contain any type of Python/Java/Scala object
– Fault tolerant
• Transformations on RDDs
– filter, map, join, etc.
• Actions on RDDs
– reduce, collect, etc.
• Spark operations are lazy
• RDDs allow in-memory processing
– rdd.cache() or rdd.persist()
– Good for iterative or interactive processing
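The laziness and the transformation/action split can be illustrated with a toy plain-Python class (a sketch of the idea only, not Spark's API or implementation):

```python
from functools import reduce as _reduce

class ToyRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions force work."""
    def __init__(self, iterable):
        self._it = iterable

    # Transformations: build a new lazy pipeline stage; no data is touched yet
    def map(self, f):
        return ToyRDD(f(x) for x in self._it)

    def filter(self, pred):
        return ToyRDD(x for x in self._it if pred(x))

    # Action: actually pulls every record through the pipeline
    def reduce(self, f):
        return _reduce(f, self._it)

rdd = ToyRDD(range(1, 10)).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(rdd.reduce(lambda a, b: a + b))  # 1 + 9 + 25 + 49 + 81 = 165
```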
19. H5Spark: Support HDF5 in Spark
• What does Spark already offer for reading various data formats?
– Text files: sc.textFile()
– Parquet: spark.read.parquet()
– JSON: spark.read.json()
• Challenges: functionality and performance
– How to transform an HDF5 dataset into an RDD?
– How to utilize the HDF5 I/O libraries in Spark?
– How to enable parallel I/O on HPC?
– What is the impact of Lustre striping?
– Convert HDF5 → Parquet?
20. H5Spark: Software Overview
• Scala/Python implementation
– Spark favors Scala and Python
– H5Spark uses the HDF5 Java library
– Underneath is the HDF5 C POSIX library
– No MPI-IO support
• H5Spark is a standalone package
– Users can load it into their Spark applications
– An H5Spark module is provided on Cori
– sbt package → h5spark_2.10-1.0.jar
• Open source
– GitHub: https://github.com/valiantljk/h5spark
(Figure: software stacks — H5Spark (Scala/Python) on JHI5 (Java) on the HDF5 C library, vs. H5Py (Python) on HDF5 1.8.14 with MPI)
22. H5Spark: Design
• Metadata Analyzer
– A single I/O call parses the HDF5 header to obtain the dimensions, data type, etc.
• Hyperslab Partitioner
– Balances the Spark partition count against the HDF5 dataset size
• RDD Seeder
– A lightweight RDD seed
• RDD Constructor
– A direct transformation
23. H5Spark: From HDF5 to RDD
• Input: HDF5 file path f, dataset name v, SparkContext sc, Spark partition count p
– The Spark partition count determines the degree of parallelism; it plays the role of MPI processes + OpenMP threads when p > the number of cores
• Output: an RDD r
• Under the hood: reading HDF5 into an RDD
– Adjust the partitions: p = (p > dim[sid]) ? dim[sid] : p
– Determine the hyperslab offsets: offset[i] = dim[sid]/p * i
– Seed the RDD: r_seed = sc.parallelize(offset, p)
– Perform parallel I/O: r = r_seed.flatMap(h5read(f, v))
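The partition adjustment and offset arithmetic above can be sketched in plain Python (h5read itself is omitted; the helper name and the remainder handling are illustrative, not H5Spark's exact code):

```python
def plan_hyperslabs(rows, p):
    """Plan row-wise hyperslabs: 'rows' is dim[sid], 'p' the requested partitions."""
    # Adjust partitions: never more partitions than rows
    p = rows if p > rows else p
    chunk = rows // p                       # offset[i] = dim[sid]/p * i
    offsets = [chunk * i for i in range(p)]
    # Each partition reads 'chunk' rows; the last one absorbs the remainder
    counts = [chunk] * (p - 1) + [rows - chunk * (p - 1)]
    # Each (offset, count) pair would seed one RDD element for sc.parallelize,
    # and the subsequent flatMap would read that hyperslab via HDF5
    return list(zip(offsets, counts))

# 9 rows over 4 partitions: offsets 0, 2, 4, 6; the last partition reads 3 rows
print(plan_hyperslabs(9, 4))  # [(0, 2), (2, 2), (4, 2), (6, 3)]
```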
24. H5Spark: How to Use
• H5Spark APIs
– Correspond to the Spark MLlib interfaces: import org.apache.spark.mllib.linalg
– Data types: Vector, LabeledPoint, Matrix, IndexedRowMatrix, etc.
– Input: sc, f, v, p
Function        Output
h5read          An RDD of double arrays
h5read_point    An RDD of (key, value) pairs
h5read_vec      An RDD of vectors
h5read_irow     An RDD of indexed rows
h5read_imat     An RDD of indexed row matrices
35. Conclusion
• Porting Spark onto HPC
– I/O and storage
– Data formats
• H5Spark:
– An efficient HDF5 file loader for Spark
– Enables big data analysis on scientific data sets
• Is Spark a good fit for HPC?
– Productivity
– Auto-parallelism
– Stragglers and scheduling
– Redesign?