3. Data-centric Analytics in Industry and Science
• Science: data collected from instruments increases rapidly
– The Large Synoptic Survey Telescope will capture ultra-high-resolution images of the sky every 15 seconds, every night, for at least 10 years: more than 100 petabytes (about 20 million DVDs at 4.7 GB each) of data by 2022
• Industry: self-driving cars
– A car’s sensors generate 1 gigabyte of data every second
– About 2 petabytes of data per car per year
– 1 billion cars worldwide
(Images: LSST; Google)
4. Spark: A Powerful Big Data Analytics Tool
• A fast and general engine for large-scale data processing
– Similar to Hadoop, except that it keeps data in memory for fast processing
• Developed at UCB AMPLab; v1.0 in 2014, v2.0 in 2016
– Actively developed: 1000+ contributors in 2015
• Productive programming interface
– 6 lines of code vs. 28 for Hadoop MapReduce
• Implicit data parallelism
• Fault-tolerance
• Rich libraries: stream processing, SQL, machine learning (MLlib), graph processing
5. Spark: A Powerful Big Data Analytics Engine
Berkeley Data Analytics Stack
9. This project and this work…
Cray/AMPLab/NERSC collaborations
Prabhat, NERSC/LBNL
10. Porting Spark onto HPC
• Advantages of porting Spark onto HPC
– A more productive API for data-intensive computing
– Relieves users of the concurrency control, communication, and memory management required by the traditional MPI model
– Embarrassingly parallel computing: data.map(f)
– Fault tolerance: recompute()
• Challenges
– Spark was initially designed for commodity clusters
– The programming model is map-reduce; does it apply to HPC workloads?
– The programming languages are Scala/Java and Python, running on the JVM
– Communication goes over TCP/IP, with no RDMA support (cf. RDMA-Spark, Ohio State)
– I/O formats: Parquet/JSON/text
– Storage: HDFS
11. Running Spark directly on HPC
NASA SciSpark
https://github.com/SciSpark/SciSpark
Loading a single large file on Lustre is not scalable.
12. File System Matters
• SciSpark used Spark’s binaryFiles() function to load all data into memory
– It assumed HDFS as the underlying file system
• Running SciSpark on HPC lacks proper file-system support
– HDFS handles parallel I/O, with block size = 128 MB:
• hdfs://hdfs-path/part-00000
• hdfs://hdfs-path/part-00001
• ...
• hdfs://hdfs-path/part-nnnnn
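Since HDFS exposes a file as fixed-size blocks, the number of part-files (and hence potential parallel readers) follows directly from the file size. A back-of-the-envelope sketch in plain Python, using the 128 MB block size from the slide (the function name is illustrative):

```python
BLOCK_SIZE = 128 * 1024 ** 2  # HDFS block size from the slide: 128 MB

def num_hdfs_blocks(file_bytes, block_size=BLOCK_SIZE):
    # Ceiling division: each block is one part-file, i.e. one unit of parallel I/O
    return -(-file_bytes // block_size)

# A 10 GB file maps to 80 blocks, so up to 80 readers can work in parallel
print(num_hdfs_blocks(10 * 1024 ** 3))  # 80
```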
14. I/O Formats Matter
• Scientific data formats used in HPC are not natively supported in Spark
– HDF5 and netCDF were among the top 5 libraries at NERSC in 2015
• 750+ unique users at NERSC; millions of users worldwide
– Started in 1987 at NCSA/UIUC; NASA ships HDF-EOS to 2.4 million end users
Brian Austin, NERSC
15. Data in HDF5
• Hierarchical structure: an HDF5 file contains groups; groups contain datasets and other groups
• Multi-dimensional array data model: each dataset is an array (e.g. a 3×3 grid of the values 1-9) with attached attributes
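The two bullets can be pictured with a toy plain-Python model (an illustration of the data model only, not the HDF5 API; all names and the attribute are made up):

```python
# Toy model of an HDF5 file: a tree of groups whose leaves are datasets,
# each holding a multi-dimensional array plus attributes.
hdf5_file = {
    "/": {                                        # root group
        "group1": {
            "dataset1": {
                "shape": (3, 3),
                "data": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                "attrs": {"units": "counts"},     # illustrative attribute
            },
        },
    },
}

ds = hdf5_file["/"]["group1"]["dataset1"]
print(ds["shape"], ds["data"][1][2])  # (3, 3) 6
```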
16. Data in Spark
• RDD: Resilient Distributed Dataset
– A read-only, partitioned collection of records in Spark
– An RDD can contain any type of Python/Java/Scala object
– Fault tolerant
• Transformations on RDDs
– filter, map, join, etc.
• Actions on RDDs
– reduce, collect, etc.
• Spark operations are lazy
• RDDs allow in-memory processing
– rdd.cache() or rdd.persist()
– Good for iterative or interactive processing
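The laziness and the transformation/action split can be illustrated with a toy plain-Python class (a sketch of the idea only, not Spark's API or implementation):

```python
from functools import reduce as _reduce

class ToyRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions force work."""
    def __init__(self, iterable):
        self._it = iterable

    # Transformations: build a new lazy pipeline stage; no data is touched yet
    def map(self, f):
        return ToyRDD(f(x) for x in self._it)

    def filter(self, pred):
        return ToyRDD(x for x in self._it if pred(x))

    # Action: actually pulls every record through the pipeline
    def reduce(self, f):
        return _reduce(f, self._it)

rdd = ToyRDD(range(1, 10)).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(rdd.reduce(lambda a, b: a + b))  # 1 + 9 + 25 + 49 + 81 = 165
```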
19. H5Spark: Support HDF5 in Spark
• What does Spark already offer for reading various data formats?
– Text files: sc.textFile()
– Parquet: spark.read.parquet()
– JSON: spark.read.json()
• Challenges: functionality and performance
– How to transform an HDF5 dataset into an RDD?
– How to utilize the HDF5 I/O libraries in Spark?
– How to enable parallel I/O on HPC?
– What is the impact of Lustre striping?
– Convert HDF5 → Parquet?
20. H5Spark: Software Overview
• Scala/Python implementation
– Spark favors Scala and Python
– H5Spark uses the HDF5 Java library
– Underneath is the HDF5 C POSIX library
– No MPI-IO support
• H5Spark is a standalone package
– Users can load it into their Spark applications
– An H5Spark module is provided on Cori
– sbt package → h5spark_2.10-1.0.jar
• Open source
– GitHub: https://github.com/valiantljk/h5spark
(Figure: software stacks — H5Spark (Scala/Python) on JHI5 (Java) on the HDF5 C library, vs. H5Py (Python) on HDF5 1.8.14 with MPI)
22. H5Spark: Design
• Metadata Analyzer
– A single I/O call parses the HDF5 header to obtain the dimensions, data type, etc.
• Hyperslab Partitioner
– Balances the Spark partition count against the HDF5 dataset size
• RDD Seeder
– A lightweight RDD seed
• RDD Constructor
– A direct transformation
23. H5Spark: From HDF5 to RDD
• Input: HDF5 file path f, dataset name v, SparkContext sc, Spark partition count p
– The Spark partition count determines the degree of parallelism; it plays the role of MPI processes + OpenMP threads when p > the number of cores
• Output: an RDD r
• Under the hood: reading HDF5 into an RDD
– Adjust the partitions: p = (p > dim[sid]) ? dim[sid] : p
– Determine the hyperslab offsets: offset[i] = dim[sid]/p * i
– Seed the RDD: r_seed = sc.parallelize(offset, p)
– Perform parallel I/O: r = r_seed.flatMap(h5read(f, v))
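The partition adjustment and offset arithmetic above can be sketched in plain Python (h5read itself is omitted; the helper name and the remainder handling are illustrative, not H5Spark's exact code):

```python
def plan_hyperslabs(rows, p):
    """Plan row-wise hyperslabs: 'rows' is dim[sid], 'p' the requested partitions."""
    # Adjust partitions: never more partitions than rows
    p = rows if p > rows else p
    chunk = rows // p                       # offset[i] = dim[sid]/p * i
    offsets = [chunk * i for i in range(p)]
    # Each partition reads 'chunk' rows; the last one absorbs the remainder
    counts = [chunk] * (p - 1) + [rows - chunk * (p - 1)]
    # Each (offset, count) pair would seed one RDD element for sc.parallelize,
    # and the subsequent flatMap would read that hyperslab via HDF5
    return list(zip(offsets, counts))

# 9 rows over 4 partitions: offsets 0, 2, 4, 6; the last partition reads 3 rows
print(plan_hyperslabs(9, 4))  # [(0, 2), (2, 2), (4, 2), (6, 3)]
```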
24. H5Spark: How to Use
• H5Spark APIs
– Correspond to the Spark MLlib interfaces: import org.apache.spark.mllib.linalg
– Data types: Vector, LabeledPoint, Matrix, IndexedRowMatrix, etc.
– Input: sc, f, v, p
Function        Output
h5read          An RDD of double arrays
h5read_point    An RDD of (key, value) pairs
h5read_vec      An RDD of vectors
h5read_irow     An RDD of indexed rows
h5read_imat     An RDD of indexed row matrices
35. Conclusion
• Porting Spark onto HPC
– I/O and storage
– Data formats
• H5Spark:
– An efficient HDF5 file loader for Spark
– Enables big data analysis on scientific data sets
• Is Spark a good fit for HPC?
– Productivity
– Auto-parallelism
– Stragglers and scheduling
– Redesign?