Interface for Performance Environment Autoconfiguration Framework
Liang Men
Computer Science and Computer Engineering
University of Arkansas
Fayetteville, AR, 72710
Bilel Hadri and Haihang You
National Institute for Computational Science
University of Tennessee
Knoxville, TN, 37996
Abstract—The Performance Environment Autoconfiguration
frameworK (PEAK) is presented to help developers and
users of scientific applications find the optimal
configurations for their applications on a given platform
with rich computational resources and complicated options.
The choices to be made include the compiler and its
compiling options, the numerical libraries and their
parameter settings, and the settings of other environment
variables that take advantage of NUMA systems. A
web-based interface is developed for the users' convenience
in choosing the optimal configuration, yielding a significant
speedup for some scientific applications executed on
different systems.
Keywords—HPC, Numerical Library, Compiler
Autoconfiguration
I. INTRODUCTION
A systematic way of selecting and configuring the
resources on High Performance Computing (HPC) systems is
desired for scientific application developers in order to obtain
optimal application performance. For most application
developers, an exhaustive search of all the possible
configurations and their many parameters is beyond their
available time and effort. For example, on several HPC
supercomputers, such as Kraken and Jaguar, the Cray XT5
systems operated by the National Institute for Computational
Science (NICS) and Oak Ridge National Laboratory (ORNL)
respectively, numerical libraries are among the most heavily
used software [1]. Several vendors supply optimized
BLAS (Basic Linear Algebra Subprograms) [2], LAPACK (the
Linear Algebra PACKage) [3], FFT and ScaLAPACK [4]
routines. These libraries, such as LibSci [5] from Cray,
ACML [6] from AMD and MKL [7] from Intel, also support
the different compilers on the systems, such as PGI, Cray,
GNU, Intel and Pathscale. A default configuration is usually implemented
in the supercomputer with the expectation of handling the
majority of the scientific applications. However, our
preliminary exploration indicates that the default environment
is not the best configuration for many applications. Although
the library and compiler documentation is well maintained
by the supercomputing center, such information is
overwhelming for users and it is difficult to find the optimal
options.
An environment auto-configuration framework is
developed to help users of scientific applications select the
optimal configuration for their applications on
supercomputing platforms with abundant computational
resources. It starts with benchmarks of the popular
numerical routines in the optimized vendor libraries. The
benchmarks are compiled with the available compilers and
linked to the available numerical libraries on the
supercomputing systems. With a wide range of vector or
matrix sizes as input parameters, the compiled programs are
executed in different environments to populate a knowledge
database that preserves the performance data. The database
provides a valuable reference for users to identify the most
beneficial environments and configurations of computational
resources for their applications.
The framework interface provides several distinctive
functions. Initially, it provides scripts to automatically
compile and execute benchmarks of the available numerical
functions under different environments. With the
performance data of the benchmarks, developers can easily
discover the best configurations for their scientific
applications. In addition, developers are encouraged to check
performance data in the existing database before choosing
among the available computational resources. The interface
provides performance diagrams based on the users'
parameters and updates its database when new configurations
are executed. Furthermore, the interface can advise users on
static or dynamic linking paths, compiler options and other
configuration tools based on the chosen platform. The
auto-configuration feature is essential for developers who
port their codes between different supercomputing
environments for compilation and execution.
II. THE DEVELOPMENT OF THE FRAMEWORK
A. Experimental environment
The performance comparison is based on two high
performance platforms, Kraken and Nautilus, at NICS.
Kraken is a Cray XT5 platform, with a peak performance
of 1.17 Pflops/s with 112,896 compute cores. Each node is
composed of two 2.6 GHz six-core AMD Opteron processors
(Istanbul) with 16 GB of memory. All the results in section C
have been performed on one node with twelve cores, leading to
a theoretical peak performance of 124.8 Gflops/s. Three
numerical libraries have been studied: LibSci (10.4.5), ACML
(4.4.0) and Intel MKL (10.2). Each numerical library is built
with the following compilers: PGI (10.6.0), GNU (4.4.3) and
Intel (11.1.038).
Nautilus is an SGI Altix UV 1000 shared-memory
machine featuring 1,024 cores (Intel Nehalem EX processors
with a CPU speed of 2.0 GHz per core) and 4 terabytes of
memory within a single system image. Three compilers, Intel
(12.1.233), PGI (10.3) and GNU (4.3.4), and two numerical
vendor libraries, MKL (10.3.6) and ACML (4.4.0), are
considered in the framework.
Fig. 1. Design Flow of Autoconfiguration Framework
B. Design Flow of the Framework
The framework is built on the performance data of
commonly used numerical routines, which are compiled with
various compilers linking to the available libraries on the
platforms. As shown in Fig. 1, a batch of the most used
scientific functions is selected for benchmarking. The
auto-configuration generator produces the test driver code
and kernel code from the preserved test bench model with the
user configuration, such as matrix sizes, timing functions,
and performance evaluation functions. The compilation
process, which generates applications with various
combinations of compilers and libraries, is performed
automatically by a job script linking to the optimized flags
and options. Performance data is generated by running an
application on the platform with a wide range of input
parameters, scheduled by scripts in the Portable Batch
System (PBS). A performance database is developed along
with external variables, such as matrix size, function name,
compiler, library, number of cores, etc.
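The performance database described above can be sketched in a few lines of Python. The schema, configuration names and Gflops figures below are illustrative assumptions, not the framework's actual layout or measured data:

```python
import sqlite3

# In-memory stand-in for the performance database; one row per
# (function, library, compiler, core count, matrix size) combination.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE perf (
    function TEXT, library TEXT, compiler TEXT,
    cores INTEGER, size INTEGER, gflops REAL)""")

# Rows produced by benchmark runs (placeholder numbers).
runs = [
    ("DGEMM", "LibSci", "PGI",   12, 4000, 110.2),
    ("DGEMM", "MKL",    "Intel", 12, 4000, 105.7),
    ("DSYGV", "ACML",   "PGI",   12, 2000,  35.1),
]
conn.executemany("INSERT INTO perf VALUES (?, ?, ?, ?, ?, ?)", runs)

# A user inquiry: the fastest (library, compiler) pair for DGEMM at size 4000.
row = conn.execute(
    "SELECT library, compiler, MAX(gflops) FROM perf "
    "WHERE function = 'DGEMM' AND size = 4000").fetchone()
print(row)
```

An inquiry then reduces to a query keyed on the same external variables (function, size, compiler, library, cores) that index the database.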
Based on this initial setup, an inquiry interface is developed
for users' access to the framework. It provides suggestions
on the recommended library and compiler for better
performance in scientific applications. The performance data
in the database is plotted as diagrams for reference. If the
inquired information does not exist, the framework is
adaptive for inserting new benchmark functions and reserves
such information for future reference. More options of new
compilers and libraries will be explored and added to the
interface.
C. Performance Comparison for BLAS/LAPACK Libraries
In previous work [8], nine popular subroutines from BLAS
and LAPACK have been benchmarked on different numerical
vendor libraries (LibSci, ACML and MKL) with 3 compilers
(PGI, GNU and Intel). The default programming environment
on Kraken, LibSci with the PGI compiler, provides in most
cases the fastest implementation or comes very close to the
peak performance of ACML for the BLAS subroutines.
However, for computing the eigenvalues and eigenvectors
with DSYGV and the QR factorization with DGELS in
LAPACK, the default programming environment is not a
good choice and can dramatically slow down a scientific
application.
Fig. 2. DSYGV Performance on Kraken with 12 Cores
As the detailed performance results in Fig. 2 show, when
calling the DSYGV function for computing eigenvalues and
eigenvectors, linking with LibSci is not recommended
because of its poor performance. ACML with PGI performs
well for small problem sizes, and MKL with PGI performs
better at larger sizes. Another example is shown in Fig. 3; the
DGELS routine solves a system of equations using the QR
factorization. While MKL and ACML obtain better
performance, LibSci does not perform well, especially when
compiled with Intel. According to Cray scientific developers,
this function has not yet been optimized and they are in the
process of improving its performance in future releases.
Fig. 3. DGELS Performance on Kraken with 12 Cores
On Nautilus, the MKL library outperforms ACML by almost
a factor of two for the LAPACK functions when compiled
with Intel in general cases. One exception is the DSYGV
routine when the number of cores is greater than 16. Fig. 4
shows the performance of this function with 64 cores. For
matrix sizes smaller than 8000, ACML is the fastest
implementation for solving the eigenvalue problems.
Fig. 4. DSYGV Performance on Nautilus with 64 Cores
D. Performance Comparison for FFT Libraries
The Fast Fourier Transform (FFT) computes the Discrete
Fourier Transform (DFT), which forms the basis of many
scientific applications. FFTW is a popular FFT library with a
comprehensive collection of fast C routines for computing the
DFT and related cases [9]. The latest version, 3.x, features a
redesigned implementation with better support for SIMD
instructions on modern CPUs and a distributed-memory
implementation on top of MPI. CRAFFT (Cray Adaptive
FFT) [10], part of the LibSci library, is available on Cray XT
systems. CRAFFT provides a simplified interface that
delegates computations to other FFT kernels (including
FFTW) and can dynamically select among the available FFT
kernels through the CRAFFT_PLANNER setting. Intel MKL
has provided the Discrete Fourier Transforms Interface
(DFTI) since MKL 6.0. Since version 10.2, MKL has fully
integrated the FFTW3.x interface without any extra effort for
building wrappers, so it yields the same benchmark results as
FFTW in the framework exploration. ACML supports highly
efficient FFT routines for complex and real-to-complex data,
handling single and multidimensional transforms. The
multidimensional routines benefit from the use of OpenMP
for good scalability on SMP machines.
Fig. 5. FFT Performance on Kraken with different 2D matrix size
The FFT routines in different libraries do not share a
common interface. Three test benches, developed for ACML,
LibSci and FFTW (the FFTW bench also covering MKL
through its FFTW3 interface), are linked to the corresponding
libraries and compiled with different compilers for
benchmarking on Kraken. The transform is performed on 2D
matrices, real-to-complex, with sizes from 4 to 4096. Each
computation is performed 100 times on randomly generated
matrices with the cache cleared. The performance results for
the PGI, Intel and GNU compilers are close, so the PGI result
is shown in Fig. 5, normalized with respect to the
CRAFFT_PLANNER = 0 group, which is comparable to
FFTW3.3. Setting CRAFFT_PLANNER to 0 indicates that no
online planning is done and a default FFT kernel is used at all
times. If the value is 1, some planning is attempted to find a
faster kernel than the default. If the value is 2, planning is
extensive and attempts to use the fastest kernel available to
CRAFFT. For most of the small matrix sizes, MKL has
better performance than the other libraries.
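The normalization used for Fig. 5 can be expressed directly: each measured rate is divided by the CRAFFT_PLANNER = 0 baseline at the same matrix size. The rates below are placeholder values for illustration, not the measured data behind the figure:

```python
# Placeholder rates (arbitrary units) per library and 2D matrix size.
baseline = {64: 1.2, 1024: 9.5}     # CRAFFT with CRAFFT_PLANNER=0
measured = {
    ("MKL",      64): 1.8, ("MKL",      1024): 10.1,
    ("FFTW3.3",  64): 1.2, ("FFTW3.3",  1024):  9.6,
    ("CRAFFT-2", 64): 1.5, ("CRAFFT-2", 1024): 10.4,  # planner value 2
}

# Normalize every result against the planner=0 group at the same size,
# as done for Fig. 5.
normalized = {(lib, n): rate / baseline[n]
              for (lib, n), rate in measured.items()}
print(round(normalized[("MKL", 64)], 2))  # MKL relative to the default kernel
```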
III. USER INTERFACE FOR ENVIRONMENTAL
CONFIGURATION
A user interface is developed for the framework to
simplify the environment variable settings through automatic
configuration. The automated data generation tool partitions
the process of constructing the framework database into four
steps: compiling the benchmarks with different compilers and
environment settings, executing the applications with various
parameters, extracting data from the performance results, and
building a database for further inquiry. The interface hides
the details of test bench compilation and execution on HPC
platforms. Another advantage of employing the configuration
tools is that users are spared the configuration of complicated
linking flags and compiling options. The interface provides
users with a selection of attributes in the drop-down menu
shown in Fig. 6 to guide the creation of a batch of python
scripts for the workflow.
Fig. 6. Performance Data Generation Interface
A. Autoconfiguration for Linking Flags and Compiler
Options
Linking flags and compiler options for the available
resources on Kraken and Nautilus are abundant and intricate
for nonexpert users. For instance, LibSci is the default
programming environment, with no additional flags needed
for most compilers. When linking with ACML or MKL, the
'_mp' suffix is necessary for optimized performance by taking
full advantage of the multithreaded BLAS. After successful
compilation of the source code, configuration files are
required to execute the user's application on the HPC
systems. In the job script files, the environment variable
OMP_NUM_THREADS is set by the user at runtime to the
desired number of threads. Additionally, the environment
variable MKL_NUM_THREADS must be set to the maximum
number of cores in one node when using the MKL libraries.
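These thread-count settings are exactly the kind of boilerplate the framework's generated job scripts take care of. A minimal sketch of such a generator follows; the template layout, resource line and aprun launcher flags are illustrative assumptions, not the framework's actual templates:

```python
from string import Template

# Assumed PBS template for a Cray XT node; $cores, $threads and
# $binary are filled in per run.
PBS_TEMPLATE = Template("""#PBS -l size=$cores
export OMP_NUM_THREADS=$threads
export MKL_NUM_THREADS=$threads
aprun -n 1 -d $threads ./$binary
""")

def make_job_script(binary, cores_per_node=12):
    # Per the MKL guidance quoted above, MKL_NUM_THREADS is pinned
    # to the number of cores in one node.
    return PBS_TEMPLATE.substitute(cores=cores_per_node,
                                   threads=cores_per_node,
                                   binary=binary)

script = make_job_script("dgemm_bench")
print(script)
```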
On Nautilus, memory placement management and thread
affinity are important for optimizing multithreaded and
OpenMP applications. For optimized memory placement, the
numactl [11] tool can schedule processes on a specific
NUMA architecture, for example by specifying a round-robin
placement policy across nodes. Besides memory locality, the
dplace tool [12] is used to bind a related set of processes to
specific CPUs to prevent process migration. As mentioned in
the dplace manual, the option “-x 2” is used with Intel MKL
to skip placement of the second thread, because Intel OpenMP
jobs use an extra lightweight shepherd thread that is unknown
to the user and need not be placed.
Fig. 7. Performance of DGEMM with Different Environmental Variables
and Tools Configuration
Besides the configuration tools, more environment
variables are available for custom configuration to avoid
performance degradation on NUMA systems. Fig. 7 shows
various performance results for the DGEMM function from
MKL, compiled with Intel and executed on 16 cores with
different environment configurations on Nautilus. Two more
environment variables, MKL_DYNAMIC and
KMP_AFFINITY, are recommended for optimized
performance. When MKL_DYNAMIC is TRUE, Intel MKL
dynamically sets the number of threads; otherwise, Intel
MKL uses the number of threads set by the user. Another
environment variable, KMP_AFFINITY, provides a thread
affinity mechanism for Intel OpenMP programs. If it is
DISABLED, the OpenMP runtime library will not make any
affinity-related system calls [13]. When the option “-x 2” is
passed to dplace and the other environment variables,
KMP_AFFINITY and MKL_DYNAMIC, are set
appropriately, the best performance, shown in the blue curve,
reaches 121 Gflops/s. If KMP_AFFINITY is disabled, as the
bottom curve in green shows, the peak performance of
DGEMM remains at the single-core performance of 7.66
Gflops/s, which is 15.8 times slower than the best
performance. The performance is 90% of the peak without
setting MKL_DYNAMIC to FALSE, and drops more than
60% if dplace is not invoked, regardless of whether the
numactl option is set.
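The slowdown factor quoted above follows directly from the reported rates; checking the arithmetic:

```python
# Rates reported in the text for DGEMM (MKL, Intel, 16 cores on Nautilus).
best = 121.0              # Gflops/s with dplace -x 2, KMP_AFFINITY and
                          # MKL_DYNAMIC set appropriately
affinity_disabled = 7.66  # Gflops/s when KMP_AFFINITY is disabled

slowdown = best / affinity_disabled
print(round(slowdown, 1))   # 15.8, the factor given in the text
```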
B. Framework Interface Implementation
The framework interface is implemented as a webpage for
universal access. Users choose the specified tasks, platform,
libraries and compilers for their applications. After a
computing platform is selected, such as Kraken or Nautilus
in the current interface, the linking flags for compilation and
the environment variable settings for optimized performance
are loaded automatically. The auto-configured files and
scripts relieve users from the complex details of program
execution. Furthermore, the framework makes it possible for
users to port applications from one computing system to
another, avoiding restrictions at the execution level.
Fig. 8. Framework Interface Implementation in Python CGI
The implementation of the interface is shown in Fig. 8. The
configuration information is transferred to the back-end
server, where a python CGI program is invoked. PBS script
templates are preserved on the server with all possible and
recommended configurations for the available platforms. The
background program loads a PBS template based on the
platform selection and tailors it with the optimized
environment variables and recommended tool settings. A new
PBS file is generated along with a python script for program
compilation and application execution. The files can be
conveniently accessed or uploaded to the selected platform
from the interface through python CGI support.
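The template-tailoring step can be sketched as follows. The template body, field names and recommended settings are illustrative assumptions; in the deployed interface the form fields would arrive through Python's CGI support rather than a plain dictionary:

```python
from string import Template

# One assumed PBS template per platform, preserved on the server,
# pre-filled with the recommended tool settings for that platform.
TEMPLATES = {
    "nautilus": Template("#PBS -l ncpus=$cores\n"
                         "export KMP_AFFINITY=$affinity\n"
                         "export MKL_DYNAMIC=FALSE\n"
                         "dplace -x 2 ./$binary\n"),
}

def handle_request(form):
    """Tailor the platform's PBS template with the user's selections."""
    tmpl = TEMPLATES[form["platform"]]
    return tmpl.substitute(cores=form["cores"],
                           affinity=form.get("affinity", "compact"),
                           binary=form["binary"])

pbs = handle_request({"platform": "nautilus", "cores": 16,
                      "binary": "app_bench"})
print(pbs)
```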
IV. CONCLUSIONS
The framework is built to study the performance of the
available computing resources on high performance
computing platforms and to help researchers determine the
best-performing combination of vendor libraries and
compilers, which is essential to achieve better efficiency
when executing time-consuming scientific applications. The
common functions in BLAS and LAPACK have been
benchmarked and the performance data is well maintained.
New development of the framework targets current and
emerging extreme-scale systems, as well as parallel and
distributed computations. The performance of different FFT
implementations in the vendor libraries is benchmarked in
the framework, considering the accuracy of the
implementations as well as the distributed memory resources.
A website interface is developed to help researchers
determine the fastest library choices for their applications
and save them the effort of exploring different libraries. A
knowledge database storing the previous performance data
on the website server provides great convenience for user
inquiries into better performance for their applications.
REFERENCES
[1] Hadri, Bilel, Timothy Robinson, Mark Fahey, and William Renaud.
"Software Usage on Cray Systems across Three Centers (NICS, ORNL
and CSCS)" , CUG 2012, Stuttgart, Germany, 2012.
[2] BLAS, “Basic linear algebra subprograms,” http://www.netlib.org/blas/.
Dongarra, Jack J., James R. Bunch, Cleve B. Moler, and G. W. Stewart.
LINPACK Users' Guide. Society for Industrial and Applied Mathematics, 1979.
[4] Scalapack, “Scalable Linear Algebra PACKage,”
http://www.netlib.org/scalapack/.
[5] Cray. “LibSci,” http://docs.cray.com.
[6] AMD, “Core Math Library,” http://www.amd.com/acml.
[7] Intel, “Math Kernel Library (MKL),”
http://www.intel.com/software/products/mkl/.
[8] Hadri, Bilel, and Haihang You, “A Performance Comparison Framework
for Numerical Libraries on Cray XT5 System,” CUG 2011, Fairbanks,
Alaska, 2011.
[9] FFTW, “Fast Fourier Transform in the West,” http://www.fftw.org/.
[10] J. Bentz, “FFT Libraries on Cray XT: Current Performance and Future
Plans for Adaptive FFT Libraries,” CUG 2008, Helsinki, Finland, 2008.
[11] SGI, “numactl man page,” http://techpubs.sgi.com/library/
[12] SGI, “dplace man page,” http://techpubs.sgi.com/library/
[13] Hadri, Bilel, Haihang You, and Shirley Moore. "Achieve better
performance with PEAK on XSEDE resources." In Proceedings of the
1st Conference of the Extreme Science and Engineering Discovery
Environment: Bridging from the eXtreme to the campus and beyond,
p. 10. ACM, 2012.