We present results obtained with a new single-node performance benchmark of the R programming environment on the many-core Xeon Phi Knights Landing and standard Xeon-based compute nodes of the Stampede supercomputer cluster at the Texas Advanced Computing Center. The benchmark consists of microbenchmarks of linear algebra kernels and machine learning functionality, including clustering and neural network training, from the R distribution. The standard Xeon-based nodes outperformed their Xeon Phi counterparts for matrices of small to medium dimensions, performing approximately twice as fast on most of the linear algebra microbenchmarks. For matrices of medium to large dimensions, the Knights Landing nodes were competitive with or outperformed the standard Xeon-based nodes on most of the linear algebra microbenchmarks, executing as much as five times faster. For the clustering and neural network training microbenchmarks, the standard Xeon-based nodes performed up to four times faster than their Xeon Phi counterparts on many large data sets, indicating that commonly used R packages may need to be reengineered to take advantage of existing optimized, scalable kernels.
1. Performance Benchmarking of the R Programming Environment on the Stampede 1.5 Supercomputer
James McCombs and Scott Michael
Pervasive Technology Institute, Indiana University
2. Acknowledgements
IU
• Eric Wernert
• Esen Tuna
TACC
• Bill Barth
• Tommy Minyard
• Doug James
• Weijia Xu
• David Walling
National Science Foundation
Award ACI-1134872
3. Introduction
• Data analysts need software environments optimized for modern HPC machines, as increasing problem sizes necessitate the use of HPC platforms
• We developed an R HPC benchmark to assess single-node
performance and expose opportunities for improvement
• We present benchmark results from Stampede and the Xeon Phi-based interim Stampede 1.5 system at the Texas Advanced Computing Center
• We identify a few standard R packages that should be restructured to take full advantage of vectorization and many-core architectures
4. Motivation
• Data analytics is becoming more dependent on HPC as problem size and complexity continue to grow
• However, there is no robust benchmark for evaluating the
performance of the R programming environment on HPC systems
• Vectorized, many-integrated-core architectures like Xeon Phi are
increasingly common in HPC
• R has been a high-productivity environment rather than a high-performance environment
– Need to determine what aspects of the R programming
environment are optimized or can be restructured to
reuse optimized kernels
5. Current R benchmarks
• R Benchmark is the most robust benchmark publicly available
– 15 different microbenchmarks
• matrix formation
• matrix factorization
• solving linear systems
• sorting
• R Benchmark lacks flexibility in many respects
– Problem sizes are fixed, can't test scalability for large numbers
of threads
– Can't automate strong scaling studies over successively larger
problem dimensions
– Monolithic structure prevents specific microbenchmarks from
being executed
• Other benchmarks only focus on a small number of microbenchmarks
(e.g. bench)
6. R HPC Benchmark
We developed a new R benchmark for HPC that has four
improvements over R Benchmark:
1. Users can specify problem sizes as input parameters
2. Specific microbenchmarks can be selected for execution
3. Output stored in CSV files and data frames specified by user
4. Users can supply their own microbenchmarks to be run
alongside the package microbenchmarks
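For illustration, a minimal sketch of the usage pattern these improvements enable. This is generic R showing the loop-over-sizes-and-write-CSV structure only, not the package's actual API; the function and column names are invented for the example:

  # Sketch only: a parameterized microbenchmark loop with CSV output.
  # Generic R, not the RHPCBenchmark API; names are illustrative.
  run_chol_microbenchmark <- function(sizes, trials = 3) {
    results <- data.frame()
    for (N in sizes) {
      A <- crossprod(matrix(rnorm(N * N), N))  # symmetric positive definite
      elapsed <- replicate(trials, system.time(chol(A))["elapsed"])
      results <- rbind(results,
                       data.frame(kernel = "chol", N = N,
                                  best_time = min(elapsed)))
    }
    results
  }

  df <- run_chol_microbenchmark(sizes = c(1000, 2000, 4000))  # user-chosen sizes
  write.csv(df, "cholesky_results.csv", row.names = FALSE)    # user-chosen output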
7. Supported microbenchmarks
• R HPC microbenchmarks include:
– Dense linear algebra kernels
• Operators
• Factorizations
• Solution of linear systems
– Machine learning functionality
• Clustering
• Neural networking
– Sparse matrix kernels (New)
• Operators
• Factorizations
8. Description of tested systems
• Sandy Bridge (SNB) nodes
– Two 8-core, 2.7 GHz Intel Xeon E5-2680 CPUs, 20 MB shared L3 cache
• Configured with no hyperthreading
– 32 GB 1600 MHz DDR3 RAM
– 61-core Intel Xeon Phi SE10P coprocessor (Knights Corner)
• Not included in performance tests
• Knights Landing (KNL) nodes
– Single 68-core, 1.4 GHz Intel Xeon Phi 7250 CPU
• Configured with 4 hyperthreads/core
– 2 512-bit vector processing units per core
– Fused multiply-add (FMA)
– Cores divided amongst 34 tiles
• 2 cores per tile
• Shared 1MB L2 cache
– 16GB of Multi-Channel Dynamic Random Access Memory
(MCDRAM)
– 96GB of 2400 MHz DDR4 RAM
10. KNL MCDRAM and tile clustering modes
• KNL-supported MCDRAM configurations
– flat mode - Operates as a separate RAM from main
memory
– cache mode - Operates as a direct-mapped cache
– hybrid mode - Divided into flat-mode region and cache-
mode region
• KNL tile clustering modes take advantage of data locality
– all-to-all mode - Tile cache tag directories can map to any memory controller
– quadrant/hemisphere - Tile cache tag directories map to the memory controller in their quadrant or hemisphere
– sub-NUMA quadrant/hemisphere - NUMA-aware applications can pin threads to specific quadrants/hemispheres of tiles, which are exposed as NUMA nodes
12. Tests of dense linear algebra microbenchmarks
• Linked with Intel Math Kernel Library (MKL)
– All dense matrix microbenchmarks call into MKL
• Most tests performed with MCDRAM in cache mode
• Performed strong scaling of linear algebra kernels on SNB
and KNL nodes
– Threads on SNB: 1, 2, 4, 8, 12, 16
– Threads on KNL: 1, 2, 4, 8, 16, 34, 66, 68, 136, 204, 250, 272
– Matrix dimensions parameterized by N include
• N = {1000, 2000, 4000, 8000, 10000, 15000, 20000, 40000}
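As a sketch of how one point in such a sweep might be driven (assumption: R is linked against the multithreaded MKL, whose thread count is set through the environment before R starts; the script name and output format are invented for the example):

  # Launch once per thread count, e.g.:
  #   MKL_NUM_THREADS=16 Rscript strong_scaling.R 16
  args <- commandArgs(trailingOnly = TRUE)
  nthreads <- as.integer(args[1])
  for (N in c(1000, 2000, 4000, 8000)) {
    A <- crossprod(matrix(rnorm(N * N), N))  # SPD input for chol
    t <- system.time(chol(A))["elapsed"]
    cat(sprintf("threads=%d N=%d elapsed=%.3f\n", nthreads, N, t))
  }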
13. Performance of Cholesky factorization and linear solve
[Figure: run time in seconds (left panel) and strong scaling with a linear-scaling reference (right panel) versus number of threads (1 to 66) for Cholesky factorization and linear solve, N = 20,000, on KNL and SNB nodes.]
14. Performance of eigendecomposition (small matrix)
[Figure: run time in seconds (left panel) and strong scaling with a linear-scaling reference (right panel) versus number of threads (1 to 66) for eigendecomposition, N = 4,000, on KNL and SNB nodes.]
15. Performance of eigendecomposition (large matrix)
[Figure: run time in seconds (left panel) and strong scaling with a linear-scaling reference (right panel) versus number of threads (1 to 66) for eigendecomposition, N = 20,000, on KNL and SNB nodes.]
17. Strong scaling on KNL using all hyperthreads
[Figure: strong scaling versus number of threads (1 to 272, using all hyperthreads) for eigendecomposition and matrix-matrix multiplication on KNL, N = 20,000, with a linear-scaling reference.]
18. Test results using MCDRAM flat mode
• Reran tests on KNL nodes with MCDRAM configured in flat
mode
• Tested Cholesky factorization, linear solve, and matrix cross
product kernels using the numactl command line utility with
the --preferred option
• The matrices fit entirely in MCDRAM
• No appreciable difference in performance achieved between
MCDRAM cache mode and flat mode
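A sketch of how such a flat-mode run can be launched (assumption: with flat mode and quadrant clustering, MCDRAM appears as NUMA node 1, which should be verified with numactl -H; the script name is invented):

  # Launch under numactl so allocations prefer MCDRAM while it has room:
  #   numactl --preferred=1 Rscript dense_benchmarks.R
  N <- 20000
  x <- matrix(rnorm(N * N), N)   # ~3.2 GB of doubles, fits in 16 GB MCDRAM
  system.time(crossprod(x))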
19. Linear algebra kernel overheads
• Developed stand-alone C language matrix cross product, QR
decomposition, and linear solve drivers to call MKL
functionality
• Compared performance of drivers to that of their R internal
function counterparts to determine effect of overheads from
data copying and validity checks
• Results on KNL nodes showed that the performance of the R internal functions is virtually identical to that of the C drivers
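For reference, the R side of this comparison is just the built-in kernel under a timer, as in the sketch below; the C drivers (not shown) call the corresponding MKL routine directly, e.g. the BLAS dsyrk underlying crossprod, skipping R's data copies and validity checks:

  N <- 4000
  x <- matrix(rnorm(N * N), N)
  # Best of three timings to reduce noise
  elapsed <- replicate(3, system.time(crossprod(x))["elapsed"])
  min(elapsed)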
20. Linear algebra kernel overheads (small matrix)
[Figure: run time in seconds and strong scaling versus number of threads (1 to 136) for the matrix cross product on KNL, comparing the C driver and R, N = 4,000.]
21. Linear algebra kernel overheads (large matrix)
[Figure: run time in seconds and strong scaling versus number of threads (1 to 136) for the matrix cross product on KNL, comparing the C driver and R, N = 20,000.]
22. Tests of machine learning microbenchmarks
• Performance tested neural network training using the nnet package and cluster assignment using the cluster package
• The nnet microbenchmark trains a neural network to approximate a multivariate normal probability density function
• The cluster microbenchmark uses the partitioning around medoids (pam) function to identify clusters of normally distributed N-dimensional vectors in a real-valued feature space, where the mean of one cluster is at the origin and the means of the remaining clusters are at -1 and 1 along each axis
The implementations in nnet and cluster are not multithreaded or
properly vectorized, but can be restructured to utilize kernels that
are.
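A sketch of how such inputs might be generated and passed to the two packages (dimensions, sample counts, and network size here are illustrative choices, not the benchmark's actual parameters):

  library(MASS)     # mvrnorm: multivariate normal samples
  library(nnet)     # single-hidden-layer neural networks
  library(cluster)  # pam: partitioning around medoids

  d <- 3; n <- 5000
  X <- mvrnorm(n, mu = rep(0, d), Sigma = diag(d))
  y <- exp(-rowSums(X^2) / 2) / (2 * pi)^(d / 2)  # standard normal pdf

  # Train a small network with a linear output unit to approximate the pdf
  fit <- nnet(X, y, size = 10, linout = TRUE, maxit = 100)

  # Gaussian clusters centered at the origin and at +/-1 along each axis
  centers <- rbind(rep(0, d), diag(d), -diag(d))
  Z <- do.call(rbind, lapply(seq_len(nrow(centers)), function(i)
         mvrnorm(500, mu = centers[i, ], Sigma = 0.1 * diag(d))))
  cl <- pam(Z, k = nrow(centers))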
23. nnet performance results
[Figure: nnet run times in seconds versus number of training vectors on SNB and KNL; left panel: three features (5,000 and 10,000 vectors); right panel: five features (5,000, 10,000, and 15,000 vectors).]
25. Conclusions
1. Strong scaling flattens, or even degrades performance, past 68 threads for the benchmarked linear algebra kernels
2. Large matrices are needed for the linear algebra kernels to make full use of the large core count and wide vector units of KNL
3. The MCDRAM flat mode does not offer a performance benefit over
cache mode
4. The R interpreter overhead is negligible for microbenchmarked
functions
5. Many R packages are not properly structured to take full advantage
of the many-core, vectorized architecture of the Xeon Phi, and they
do not leverage the MKL functionality exposed in the
microbenchmarked functions
26. Recent and Future work
• Recent Work:
– Developed microbenchmarks of sparse matrix
functionality from the matrix package
– Extended cluster benchmarks to include additional
clustering algorithms
– The R core team has started making source code changes based on these findings
• Future/Current Work:
– Extend benchmark to include summary statistics: mean,
variance, covariance computation, etc.
– Developing a package to track package utilization on HPC clusters so that optimization efforts can be prioritized
27. R HPC Benchmark availability
The benchmark is now available as a package on the
Comprehensive R Archive Network as RHPCBenchmark:
https://cran.r-project.org/package=RHPCBenchmark
Collaboration is welcome! Source repository is available at:
https://github.com/IUResearchAnalytics/RBenchmarking
Contact information:
James McCombs: jmccombs@iu.edu
Scott Michael: scamicha@iu.edu
29. Supported microbenchmarks (cont’d)
Microbenchmark                    Kernel / package function
Cholesky factorization            chol
eigendecomposition                eigen
Linear least squares fit          lsfit
Linear solve w/ multiple r.h.s.   solve
Matrix cross product              crossprod
Matrix determinant                determinant
Matrix-matrix multiplication      %*%
Matrix-vector multiplication      %*%
QR decomposition                  qr
Singular value decomposition      svd
Neural network training           nnet
Cluster identification            pam
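For concreteness, a minimal sketch invoking each dense kernel listed above (tiny N for illustration; the benchmark's drivers parameterize the sizes as described earlier):

  N <- 500
  A <- crossprod(matrix(rnorm(N * N), N))  # symmetric positive definite
  b <- matrix(rnorm(N * 4), N)             # multiple right-hand sides

  R  <- chol(A)          # Cholesky factorization
  ev <- eigen(A)         # eigendecomposition
  lf <- lsfit(A, b[, 1]) # linear least squares fit
  x  <- solve(A, b)      # linear solve with multiple r.h.s.
  Cp <- crossprod(A)     # matrix cross product
  dA <- determinant(A)   # matrix (log-)determinant
  M  <- A %*% A          # matrix-matrix multiplication
  v  <- A %*% b[, 1]     # matrix-vector multiplication
  QR <- qr(A)            # QR decomposition
  sv <- svd(A)           # singular value decomposition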
30. Build and environment settings
• Operating systems: CentOS 6 (SNB) / CentOS 7 (KNL)
• Software: R version 3.2.1
• Compiled R on SNB with Intel XE compilers 15.0 Update 2 and 17.0 Update 1 and linked with the bundled Intel Math Kernel Library (MKL)
• Compiled R with Intel XE compiler 17.0 Update 1 on KNL
and linked with the bundled parallel MKL version
• Applied -O3 optimizations in each case