Performance Benchmarking of the R Programming Environment on the Stampede 1.5 Supercomputer
James McCombs and Scott Michael
Pervasive Technology Institute, Indiana University
Acknowledgements
IU
• Eric Wernert
• Esen Tuna
TACC
• Bill Barth
• Tommy Minyard
• Doug James
• Weijia Xu
• David Walling
National Science Foundation Award ACI-1134872
Introduction
• Data analysts need software environments optimized for modern HPC machines, as growing problem sizes increasingly push analyses onto HPC platforms
• We developed an R HPC benchmark to assess single-node
performance and expose opportunities for improvement
• We present benchmark results from the Stampede and the
Xeon Phi-based interim Stampede 1.5 system at the Texas
Advanced Computing Center
• We identify a few standard R packages that should be restructured to take full advantage of vectorization and many-core architectures
Motivation
• Data analytics is becoming more dependent on HPC as problem size and complexity continue to grow
• However, there is no robust benchmark for evaluating the
performance of the R programming environment on HPC systems
• Vectorized, many-integrated-core architectures like Xeon Phi are
increasingly common in HPC
• R has historically been a high-productivity environment rather than a high-performance environment
– Need to determine which aspects of the R programming environment are already optimized or can be restructured to reuse optimized kernels
Current R benchmarks
• R Benchmark is the most robust benchmark publicly available
– 15 different microbenchmarks
• matrix formation
• matrix factorization
• solving linear systems
• sorting
• R Benchmark lacks flexibility in many respects
– Problem sizes are fixed, can't test scalability for large numbers
of threads
– Can't automate strong scaling studies over successively larger
problem dimensions
– Monolithic structure prevents individual microbenchmarks from being run in isolation
• Other benchmarks only focus on a small number of microbenchmarks
(e.g. bench)
R HPC Benchmark
We developed a new R benchmark for HPC that offers four improvements over R Benchmark (a usage sketch follows this list):
1. Users can specify problem sizes as input parameters
2. Specific microbenchmarks can be selected for execution
3. Output is stored in CSV files and data frames specified by the user
4. Users can supply their own microbenchmarks to be run
alongside the package microbenchmarks
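A minimal sketch of the kind of run the package automates: user-specified problem sizes, a selectable kernel, and CSV output. The function and file names below are illustrative, not the RHPCBenchmark API; see the CRAN documentation for the actual interface.

```r
# Illustrative only: time a user-supplied dense kernel over user-specified
# problem sizes and write the results to CSV, mimicking what RHPCBenchmark
# automates (names here are hypothetical, not the package's API).
run_microbenchmark <- function(kernel, sizes, trials = 3) {
  results <- data.frame()
  for (N in sizes) {
    A <- matrix(rnorm(N * N), nrow = N)
    elapsed <- replicate(trials, system.time(kernel(A))["elapsed"])
    results <- rbind(results,
                     data.frame(N = N, mean_sec = mean(elapsed), sd_sec = sd(elapsed)))
  }
  results
}

res <- run_microbenchmark(crossprod, sizes = c(1000, 2000, 4000))
write.csv(res, "crossprod_times.csv", row.names = FALSE)
```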
Supported microbenchmarks
• R HPC microbenchmarks include:
– Dense linear algebra kernels (the corresponding base R calls are sketched after this list)
• Operators
• Factorizations
• Solution of linear systems
– Machine learning functionality
• Clustering
• Neural networking
– Sparse matrix kernels (New)
• Operators
• Factorizations
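For reference, the dense kernels above map onto the base R calls listed in the supplemental table; a minimal, self-contained sketch (the matrix size is illustrative):

```r
N <- 2000
A <- matrix(rnorm(N * N), nrow = N)
S <- crossprod(A)                    # symmetric positive-definite input
b <- matrix(rnorm(N * 4), nrow = N)  # multiple right-hand sides

chol(S)           # Cholesky factorization
solve(S, b)       # linear solve with multiple r.h.s.
A %*% A           # matrix-matrix multiplication
A %*% b[, 1]      # matrix-vector multiplication
crossprod(A)      # matrix cross product, t(A) %*% A
qr(A)             # QR decomposition
svd(A)            # singular value decomposition
eigen(S)          # eigendecomposition
determinant(S)    # matrix determinant
lsfit(A, b[, 1])  # linear least squares fit
```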
Description of tested systems
• Sandy Bridge (SNB) nodes
– Two 8-core, 2.7 GHz Intel Xeon E5-2680 CPUs, 20 MB shared L3 cache each
• Configured with no hyperthreading
– 32 GB 1600 MHz DDR3 RAM
– 61-core Intel Xeon Phi SE10P coprocessor (Knights Corner)
• Not included in performance tests
• Knights Landing (KNL) nodes
– Single 68-core, 1.4GHz Intel Xeon Phi 7250 CPU
• Configured with 4 hyperthreads/core
– 2 512-bit vector processing units per core
– Fused multiply-add (FMA) support
– Cores divided amongst 34 tiles
• 2 cores per tile
• Shared 1MB L2 cache
– 16GB of Multi-Channel Dynamic Random Access Memory
(MCDRAM)
– 96GB of 2400 MHz DDR4 RAM
Knights Landing architecture
KNL MCDRAM and tile clustering modes
• KNL-supported MCDRAM configurations
– flat mode - Operates as a separate RAM from main
memory
– cache mode - Operates as a direct-mapped cache
– hybrid mode - Divided into flat-mode region and cache-
mode region
• KNL tile clustering modes take advantage of data locality
– all-to-all mode - Tile cache tag directories can map to any memory controller
– quadrant/hemisphere mode - Tile cache tag directories map to a memory controller in the same quadrant/hemisphere
– sub-NUMA clustering mode - Quadrants/hemispheres are exposed as NUMA nodes, so NUMA-aware applications can pin threads to the tiles of a specific quadrant/hemisphere
Performance Tests and Results
Tests of dense linear algebra microbenchmarks
• Linked with Intel Math Kernel Library (MKL)
– All dense matrix microbenchmarks call into MKL
• Most tests performed with MCDRAM in cache mode
• Performed strong scaling studies of linear algebra kernels on SNB and KNL nodes (strong scaling is computed as in the sketch after this list)
– Threads on SNB: 1, 2, 4, 8, 12, 16
– Threads on KNL: 1, 2, 4, 8, 16, 34, 66, 68, 136, 204,
250, 272
– Matrix dimensions parameterized by N include
• N = {1000, 2000, 4000, 8000, 10000, 15000, 20000, 40000}
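Strong scaling in the following plots is the single-thread run time divided by the p-thread run time, S(p) = T(1)/T(p). A minimal sketch of the computation; the run times below are hypothetical, and controlling the thread count through MKL/OpenMP settings such as MKL_NUM_THREADS is our assumption rather than a documented detail of the study:

```r
# Hypothetical elapsed times (seconds) for one kernel at N = 20000, indexed by
# thread count; real values come from the benchmark's CSV output.
run_time <- c(`1` = 950, `8` = 140, `16` = 78, `34` = 44, `50` = 38, `66` = 35)

strong_scaling <- run_time["1"] / run_time   # S(p) = T(1) / T(p)
round(strong_scaling, 1)
```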
Performance of Cholesky factorization and linear solve
[Figure: Run time (sec) of Cholesky factorization and linear solve (N = 20000) vs. number of threads, 1–66, on KNL and SNB]
[Figure: Strong scaling of Cholesky factorization and linear solve (N = 20000) vs. number of threads, 1–66, on KNL and SNB]
Performance of eigendecomposition (small matrix)
[Figure: Run time (sec) of eigendecomposition (N = 4000) vs. number of threads, 1–66, on KNL and SNB]
[Figure: Strong scaling of eigendecomposition (N = 4000) vs. number of threads, 1–66, on KNL and SNB]
Performance of eigendecomposition (large matrix)
[Figure: Run time (sec) of eigendecomposition (N = 20000) vs. number of threads, 1–66, on KNL and SNB]
[Figure: Strong scaling of eigendecomposition (N = 20000) vs. number of threads, 1–66, on KNL and SNB]
Performance of matrix-matrix multiplication
[Figure: Run time (sec) of matrix-matrix multiplication (N = 20000) vs. number of threads, 1–66, on KNL and SNB]
[Figure: Strong scaling of matrix-matrix multiplication (N = 20000) vs. number of threads, 1–66, on KNL and SNB]
Strong scaling on KNL using all hyperthreads
[Figure: Strong scaling of eigendecomposition and matrix-matrix multiplication (N = 20000) on KNL using up to 272 hyperthreads]
Test results using MCDRAM flat mode
• Reran tests on KNL nodes with MCDRAM configured in flat
mode
• Tested Cholesky factorization, linear solve, and matrix cross product kernels using the numactl command line utility with the --preferred option (see the sketch below)
• The matrices fit entirely in MCDRAM
• No appreciable performance difference between MCDRAM cache mode and flat mode
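A hedged sketch of such a flat-mode run. In flat mode the MCDRAM appears as a separate NUMA node; the node number used below is an assumption and should be verified on the target system:

```r
# A run that preferentially allocates in MCDRAM (flat mode) might be launched as:
#   numactl --preferred=1 Rscript crossprod_flat.R
# The MCDRAM NUMA node number (1 here) is an assumption; verify with `numactl -H`.

N <- 20000
A <- matrix(rnorm(N * N), nrow = N)    # ~3.2 GB of doubles, fits in 16 GB MCDRAM
print(system.time(C <- crossprod(A)))  # calls into the linked MKL BLAS
```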
Linear algebra kernel overheads
• Developed stand-alone C language matrix cross product, QR
decomposition, and linear solve drivers to call MKL
functionality
• Compared performance of drivers to that of their R internal
function counterparts to determine effect of overheads from
data copying and validity checks
• Results on KNL nodes showed that the performance of the R internal functions is virtually identical to that of the C drivers
Linear algebra kernel overheads (small matrix)
[Figure: Matrix cross product performance using the C driver and R, N = 4,000: run time (sec) and strong scaling vs. number of threads, 1–136]
Linear algebra kernel overheads (large matrix)
[Figure: Matrix cross product performance using the C driver and R, N = 20,000: run time (sec) and strong scaling vs. number of threads, 1–136]
Tests of machine learning microbenchmarks
• Performance tested neural network training using the nnet package and cluster assignment using the cluster package
• nnet microbenchmark trains a neural network to approximate
a multivariate normal probability density function
• cluster microbenchmark uses the partitioning around medoids (pam) function to identify clusters of normally distributed N-dimensional vectors in a real-valued feature space; one cluster mean is at the origin and the remaining cluster means are at -1 and +1 along each axis
The implementations in nnet and cluster are not multithreaded or properly vectorized, but they can be restructured to utilize kernels that are.
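A minimal sketch of the two benchmarked calls on synthetic data; the sample sizes, network width, and training settings are illustrative, not the exact benchmark parameters:

```r
library(nnet)     # single-hidden-layer neural networks
library(cluster)  # partitioning around medoids (pam)

set.seed(1)
d <- 3                                  # number of features
X <- matrix(rnorm(1000 * d), ncol = d)

# nnet microbenchmark: regress onto the density of d independent standard
# normals, i.e. a multivariate normal pdf with identity covariance
y <- apply(X, 1, function(r) prod(dnorm(r)))
fit <- nnet(x = X, y = y, size = 10, linout = TRUE, maxit = 200, trace = FALSE)

# cluster microbenchmark: 1 + 2*d = 7 cluster means (origin, and -1/+1 on each axis)
means <- rbind(rep(0, d), diag(d), -diag(d))
Xc <- do.call(rbind, lapply(seq_len(nrow(means)), function(i)
  sweep(matrix(rnorm(200 * d, sd = 0.2), ncol = d), 2, means[i, ], `+`)))
clusters <- pam(Xc, k = 7)
```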
nnet performance results
[Figure: nnet run times (sec) vs. number of training vectors for three features, SNB and KNL]
[Figure: nnet run times (sec) vs. number of training vectors for five features, SNB and KNL]
cluster performance results
[Figure: cluster (pam) run time (sec) vs. number of training vectors for three features and seven clusters, SNB and KNL]
Conclusions
1. Strong scaling either flattens beyond 68 threads or additional threads degrade performance for the benchmarked linear algebra kernels
2. Large matrices are needed for the linear algebra kernels to make full use of the large core count and wide vector units of KNL
3. The MCDRAM flat mode does not offer a performance benefit over
cache mode
4. The R interpreter overhead is negligible for microbenchmarked
functions
5. Many R packages are not properly structured to take full advantage
of the many-core, vectorized architecture of the Xeon Phi, and they
do not leverage the MKL functionality exposed in the
microbenchmarked functions
Recent and Future work
• Recent Work:
– Developed microbenchmarks of sparse matrix functionality from the Matrix package
– Extended cluster benchmarks to include additional
clustering algorithms
– R core team has started making some source code
changes based on findings
• Future/Current Work:
– Extend the benchmark to include summary statistics: mean, variance, covariance computation, etc. (illustrated in the sketch below)
– Developing package to track package utilization on HPC
clusters so optimization efforts can be prioritized
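The planned summary-statistics microbenchmarks would presumably time base R calls along these lines (illustrative only; not part of the released package):

```r
N <- 100000; d <- 50
X <- matrix(rnorm(N * d), ncol = d)

system.time(colMeans(X))        # means
system.time(apply(X, 2, var))   # per-column variances
system.time(cov(X))             # covariance matrix
```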
R HPC Benchmark availability
The benchmark is now available as a package on the Comprehensive R Archive Network as RHPCBenchmark:
https://cran.r-project.org/package=RHPCBenchmark
Collaboration is welcome! Source repository is available at:
https://github.com/IUResearchAnalytics/RBenchmarking
Contact information:
James McCombs: jmccombs@iu.edu
Scott Michael: scamicha@iu.edu
Supplemental slides
Supported microbenchmarks (cont’d)
Microbenchmark                     Kernel / package function
Cholesky factorization             chol
Eigendecomposition                 eigen
Linear least squares fit           lsfit
Linear solve w/ multiple r.h.s.    solve
Matrix cross product               crossprod
Matrix determinant                 determinant
Matrix-matrix multiplication       %*%
Matrix-vector multiplication       %*%
QR decomposition                   qr
Singular value decomposition       svd
Neural network training            nnet
Cluster identification             pam
Build and environment settings
• Operating systems: CentOS 6 (SNB) / 7 (KNL)
• Software: R version 3.2.1
• Compiled R with Intel XE compiler 15.0 Update 2 and 17.0
Update 1 on SNB and linked with bundled Intel Math Kernel
Library
• Compiled R with Intel XE compiler 17.0 Update 1 on KNL
and linked with the bundled parallel MKL version
• Applied -O3 optimizations in each case