Combining Big Data and HPC in a GRIDScaler Environment

Dr. Stephan Schenk
Dr. Frank Heilmann

BASF’s segments
Chemicals: Petrochemicals, Intermediates
Materials: Performance Materials, Monomers
Industrial Solutions: Dispersions & Pigments, Performance Chemicals
Surface Technologies: Catalysts, Coatings, Construction Chemicals*
Nutrition & Care: Nutrition & Health, Care Chemicals
Agricultural Solutions
* We are considering the possibility of merging our construction chemicals business with a strong partner, as well as the option of divesting this business. The outcome of this review is open. The Construction Chemicals division will be reported under the Surface Technologies segment until signing of a transaction agreement.

Integrating digital technologies into BASF’s R&D operations will boost innovative power
Digital Capabilities
• Data and knowledge management
• Algorithms and statistical applications
• Scientific modeling and simulation
• Machine Learning
Research & Development
• Hypothesis
• Experiments
• Analysis
• Validation of models

Supercomputing at BASF
[Figure: BASF HPC history. Peak performance (GFLOPS, logarithmic axis from 10^0 to 10^9) over the years 1996 to 2019, comparing the #1 among TOP500 computers with the largest computer system in BASF; Quriosity is marked.]
Quriosity Specifications
• Quriosity debuted at #65 on the TOP500 list in June 2017 with Rmax = 1.75 PFLOPS
• HPE Apollo 6000 Gen10, 888 nodes
• 2x Intel® Xeon Gold 6148 ("Skylake")
• 192/384/768/3072 GB RAM
• Intel® Omni-Path interconnect
• DDN GRIDScaler 5 PByte (GPFS)
• Red Hat Enterprise Linux 7
• Altair PBSPro scheduler
Significant opportunity for BASF to establish leadership in R&D supercomputing

Apache Spark on Quriosity and Spectrum Scale: Big-Data workflows to complement HPC
Example I: Image classification
• Train classifier (HPC/AI)
• Use classifier in a Spark job on a huge number of images
• Apache Spark job can use complete API
• Spark job is scheduled and runs like any other job
• Job uses existing global filesystem
Example II: Full-text indexing and text mining
• Machine learning, e.g. document clustering
• Full-text indexing
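
Example I can be sketched in PySpark as follows. This is only an illustration under stated assumptions, not the code used on Quriosity: `load_classifier`, its placeholder rule, and the function names are hypothetical, and a real job would load the trained model from the shared filesystem instead. The sketch shows the usual pattern of loading the model once per partition and then classifying every (path, bytes) record that `SparkContext.binaryFiles` delivers.

```python
from typing import Callable, Iterable, Iterator, Tuple


def load_classifier() -> Callable[[bytes], str]:
    """Hypothetical stand-in for loading a trained model from the global filesystem."""
    def classify(image_bytes: bytes) -> str:
        # Placeholder rule for illustration only: label by the PNG file signature.
        return "png" if image_bytes.startswith(b"\x89PNG") else "other"
    return classify


def classify_partition(records: Iterable[Tuple[str, bytes]]) -> Iterator[Tuple[str, str]]:
    """Classify all (path, content) records of one partition."""
    classifier = load_classifier()  # loaded once per partition, not once per image
    for path, content in records:
        yield path, classifier(content)


def run_spark_job(image_dir: str, output_dir: str) -> None:
    """Driver part; requires PySpark and is meant to be started via spark-submit."""
    from pyspark import SparkContext  # imported here so the helpers work without Spark

    sc = SparkContext(appName="image-classification")
    try:
        images = sc.binaryFiles(image_dir)  # RDD of (path, file content as bytes)
        (images.mapPartitions(classify_partition)
               .map(lambda kv: kv[0] + "\t" + kv[1])
               .saveAsTextFile(output_dir))
    finally:
        sc.stop()
```

Inside a job, `run_spark_job` would be called with directories on the global filesystem (the concrete paths are left to the reader); the master URL is taken from `spark-submit --master`.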

Deploying Apache Spark on an HPC system
• Deploy Spark in standalone mode (untar)
• Spin up Spark cluster at beginning of HPC job
• Integration with PBS by setting appropriate environment variables
• Spark job has complete API available (Python, Scala, libraries)
• Files can be accessed directly:
  rdd = sc.textFile("/gpfs/big_data")
  rdd.saveAsTextFile("/gpfs/results")
• Multi-node jobs require a global filesystem of your choice
#!/bin/bash
#PBS -l select=2:ncpus=40:mem=160GB
#PBS -l place=scatter:excl
#PBS -N spark-on-hpc

module load spark

# Spawn the Spark cluster
export SPARK_MASTER_HOST="$(hostname -f)"
export SPARK_MASTER_PORT="7077"
export SPARK_SLAVES="${PBS_NODEFILE}"
"${SPARK_HOME}/sbin/start-all.sh"
sparkmaster="spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT}"

# Run the Spark script
"${SPARK_HOME}/bin/spark-submit" --master "${sparkmaster}" script.py

# Tear down the Spark cluster
"${SPARK_HOME}/sbin/stop-all.sh" --wait
• Inspired by https://github.com/glennklockwood/hpchadoop
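
The slide does not show the contents of `script.py`. As a hedged sketch of what such a script might look like, the following builds a small full-text index (in the spirit of Example II) from text files under `/gpfs/big_data` and writes it to `/gpfs/results`. The function names, the tokenization rule, and the output format are assumptions made for illustration; only the two paths come from the slide.

```python
import re
from typing import Iterator, List, Tuple

TOKEN = re.compile(r"[a-z0-9]+")  # assumed tokenization: lowercase alphanumeric runs


def tokens_with_source(record: Tuple[str, str]) -> Iterator[Tuple[str, str]]:
    """Emit one (word, path) pair per distinct word in a document."""
    path, text = record
    for word in set(TOKEN.findall(text.lower())):
        yield word, path


def format_posting(entry: Tuple[str, List[str]]) -> str:
    """Render one index entry as 'word<TAB>path1,path2,...'."""
    word, paths = entry
    return word + "\t" + ",".join(sorted(set(paths)))


def main(input_dir: str = "/gpfs/big_data", output_dir: str = "/gpfs/results") -> None:
    """Driver part; requires PySpark and a running cluster (see the job script)."""
    from pyspark import SparkContext  # imported here so the helpers work without Spark

    sc = SparkContext(appName="spark-on-hpc")
    try:
        docs = sc.wholeTextFiles(input_dir)  # RDD of (path, full file content)
        index = (docs.flatMap(tokens_with_source)
                     .groupByKey()
                     .map(lambda kv: format_posting((kv[0], list(kv[1])))))
        index.saveAsTextFile(output_dir)  # output directory must not exist yet
    finally:
        sc.stop()
```

To use this as `script.py`, a call to `main()` would be added as the last line. Because both paths live on the global filesystem, every worker node sees the same input and output directories.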

Experimenting with HDFS Transparency in Spectrum Scale
• HDFS Transparency integrated with Hortonworks HDP
[Diagram: Hadoop applications (Spark, MapReduce, Hive, HBase, ...) use two namespaces that are combined through ViewFS.]
• Namespace hdfs://quriosity-hdfs:8020: block management using Spectrum Scale HDFS NameNode; Spectrum Scale DataNode1, Spectrum Scale DataNode2
• Namespace hdfs://native-hdfs:8020: block management using native HDFS NameNode; Native HDFS DataNode1, Native HDFS DataNode2, Native HDFS DataNode3
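
ViewFS works as a client-side mount table that maps path prefixes to namespaces. The following is a simplified model of that idea, not Hadoop code: the mount points `/scale` and `/native` are hypothetical, and only the two namespace URIs are taken from the diagram.

```python
from typing import Dict

# Hypothetical mount table: view path prefix -> target namespace (URIs from the diagram).
MOUNT_TABLE: Dict[str, str] = {
    "/scale": "hdfs://quriosity-hdfs:8020/",
    "/native": "hdfs://native-hdfs:8020/",
}


def resolve(view_path: str, mounts: Dict[str, str] = MOUNT_TABLE) -> str:
    """Translate a path in the combined view into a URI in the backing namespace."""
    for mount_point, target in mounts.items():
        # Match the mount point itself or anything below it.
        if view_path == mount_point or view_path.startswith(mount_point + "/"):
            remainder = view_path[len(mount_point):].lstrip("/")
            return target.rstrip("/") + "/" + remainder
    raise KeyError("no mount point covers " + view_path)


if __name__ == "__main__":
    print(resolve("/scale/images/cat.png"))
    print(resolve("/native/logs/app.log"))
```

In this model an application addresses a single tree and never needs to know which NameNode serves a given path, which is what lets Spectrum Scale and native HDFS sit side by side.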

Benchmarking HDFS Transparency on Quriosity
• Benchmark TestDFSIO executed on a single compute node
• Consistent performance across all test data sizes
• I/O rate essentially limited by the 10G network used for communication
Avg I/O Rate TestDFSIO (I/O rate in MB/s by size of test files)

Size of test files     10 GB    20 GB    30 GB    40 GB    50 GB
Avg I/O rate, write   854.38   861.69   860.52   862.00   866.59
Avg I/O rate, read    906.70   904.39   890.99   876.82   892.98
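
To put the measured rates next to the network limit, the numbers can be compared with the nominal line rate of a 10 Gbit/s link. The sketch below is a back-of-the-envelope calculation that assumes decimal megabytes (10 Gbit/s = 1250 MB/s) and ignores protocol overhead; the unit convention of the reported values is an assumption.

```python
from statistics import mean
from typing import Dict, List

# Values as reported for TestDFSIO (MB/s) at 10, 20, 30, 40 and 50 GB.
WRITE_MB_S: List[float] = [854.38, 861.69, 860.52, 862.0, 866.59]
READ_MB_S: List[float] = [906.7, 904.39, 890.99, 876.82, 892.98]
LINK_GBIT_S = 10  # "10G network" from the slide


def line_rate_mb_s(gbit_per_s: float) -> float:
    """Nominal link rate in decimal MB/s (assumption: 1 Gbit = 1000 Mbit, 8 bit = 1 byte)."""
    return gbit_per_s * 1000 / 8


def summarize(rates: List[float]) -> Dict[str, float]:
    """Mean rate, spread across file sizes, and share of the nominal line rate."""
    avg = mean(rates)
    return {
        "mean": avg,
        "spread": max(rates) - min(rates),
        "fraction_of_line_rate": avg / line_rate_mb_s(LINK_GBIT_S),
    }


if __name__ == "__main__":
    for label, rates in (("write", WRITE_MB_S), ("read", READ_MB_S)):
        s = summarize(rates)
        print(f"{label}: mean {s['mean']:.1f} MB/s, spread {s['spread']:.1f} MB/s, "
              f"{s['fraction_of_line_rate']:.0%} of nominal line rate")
```

Under these assumptions the mean rates come out at roughly 69% (write) and 72% (read) of the nominal 1250 MB/s, with only small variation across file sizes.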

Further information
https://www.basf.com/supercomputer
