Interface for Performance Environment
Autoconfiguration frameworK
Liang Men
Computer Science and Computer Engineering
University of Arkansas
Fayetteville, AR, 72710
Bilel Hadri and Haihang You
National Institute for Computational Science
University of Tennessee
Knoxville, TN, 37996
Abstract—The Performance Environment Autoconfiguration
frameworK (PEAK) is presented to help developers and
users of scientific applications find the optimal
configuration for their application on a given platform
with rich computational resources and complicated options.
The choices to be made include the compiler and its
compiling options, the numerical libraries and their
library parameters, and other environment variables that
take advantage of NUMA systems. A web-based interface is
developed so that users can conveniently choose the optimal
configuration, yielding significant speedups for scientific
applications executed on different systems.
Keywords—HPC, Numerical Library, Compiler
Autoconfiguration
I.   INTRODUCTION
A systematic way of selecting and configuring the
resources on High Performance Computing (HPC) systems is
desired by scientific application developers in order to obtain
optimal application performance. For most application
developers, an exhaustive search of all the possible
configurations and their many parameters is beyond the time
and effort they can afford. For example, on several HPC
supercomputers such as Kraken and Jaguar, the Cray XT5
systems operated by the National Institute for Computational
Science (NICS) and Oak Ridge National Laboratory (ORNL)
respectively, numerical libraries are among the most used
software [1]; several vendors provide optimized
implementations of the BLAS (Basic Linear Algebra
Subprograms) [2], LAPACK (the Linear Algebra PACKage) [3],
FFT and ScaLAPACK [4] functions. These libraries, such as
LibSci [5] from Cray, ACML [6] from AMD and MKL [7] from
Intel, also support the different compilers on the systems,
including PGI, Cray, GNU, Intel and PathScale. A default
configuration is usually provided on a supercomputer with the
expectation of handling the majority of scientific applications.
However, our preliminary exploration indicates that the default
environment is not the best configuration for many applications.
Although the library and compiler documentation is well
maintained at the supercomputing center, such information is
overwhelming for users, and it is difficult to find the optimal
options.
An environment auto-configuration framework is
developed to help users of scientific applications select the
optimal configuration for their applications on supercomputing
platforms with abundant computational resources. It starts with
benchmarks of the popular numerical routines in the optimized
vendor libraries. The benchmarks are compiled by each available
compiler and linked to each available numerical library on the
supercomputing systems. With a wide range of vector or matrix
sizes as input parameters, the compiled programs are executed
in different environments to populate a database that preserves
the performance data. The database provides a valuable reference
for users to find the most beneficial environments and
configurations of computational resources for their applications.
The framework interface provides several distinctive
functions. Initially, it provides scripts to automatically
compile and execute benchmarks of the available numerical
functions under different environments. With the performance
data of the benchmarks, developers can easily discover the best
configurations for their scientific applications. In addition,
developers are encouraged to check performance data from the
existing database when choosing among the available
computational resources. The interface provides performance
diagrams based on the users' parameters and updates its
database when new configurations are executed. Furthermore,
the interface can advise users on static or dynamic linking
paths, compiler options and other configuration tools based on
the user's choice of platform. The auto-configuration feature is
essential for developers who port their codes between the
different compilation and execution environments of
supercomputing systems.
II.   THE DEVELOPMENT OF THE FRAMEWORK
A.   Experimental environment
The performance comparison is based on two high
performance platforms, Kraken and Nautilus, at NICS.
Kraken is a Cray XT5 platform with a peak performance
of 1.17 Pflops/s and 112,896 compute cores. Each node is
composed of two 2.6 GHz six-core AMD Opteron processors
(Istanbul) with 16 GB of memory. All the results in Section C
were obtained on one node with twelve cores, giving a
theoretical peak performance of 124.8 Gflops/s. Three
numerical libraries have been studied: LibSci (10.4.5), ACML
(4.4.0) and Intel MKL (10.2). Each numerical library is built
with the following compilers: PGI (10.6.0), GNU (4.4.3) and
Intel (11.1.038).
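As a quick cross-check of the quoted node peak, the figure is consistent with four double-precision floating-point operations per core per cycle (an assumption inferred from the stated numbers, not given in the text):

```python
# Cross-check of Kraken's quoted per-node peak: 12 Istanbul cores at
# 2.6 GHz, assuming 4 double-precision flops per core per cycle
# (the flops-per-cycle value is inferred from the 124.8 Gflops/s figure).
cores = 12
clock_ghz = 2.6
flops_per_cycle = 4
peak_gflops = cores * clock_ghz * flops_per_cycle  # ~124.8
```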
Nautilus is an SGI Altix UV 1000 shared-memory
machine featuring 1,024 cores (Intel Nehalem EX processors
with a CPU speed of 2.0 GHz on each core) and 4 terabytes of
memory within a single system image. Three compilers, Intel
(12.1.233), PGI (10.3) and GNU (4.3.4), and two numerical
vendor libraries, MKL (10.3.6) and ACML (4.4.0), are
considered in the framework.
[Fig. 1 diagram: user inputs (size, function, ...) flow through an auto-configuration generator producing test driver and kernel code, and a job script generator producing PBS scripts for platform execution via the compiler/library stack; the resulting performance data feeds a performance database, which answers user inquiries with the recommended compiler, library and environment.]
Fig. 1.   Design Flow of Autoconfiguration Framework
B.   Design Flow of the Framework
The framework is built on the performance data of
commonly used numerical routines, which are compiled with
various compilers and linked to the available libraries on the
platforms. As shown in Fig. 1, a batch of the most used scientific
functions is selected for benchmarking. The auto-configuration
generator produces the test driver code and kernel code from a
preserved test bench model together with the user configuration,
such as matrix sizes, timing functions, and performance
evaluation functions. The compilation process, which generates
applications with various combinations of compilers and
libraries, is performed automatically by a job script linking
with optimized flags and options. Performance data is generated
by running each application on the platform with a wide range of
input parameters, scheduled by scripts in the Portable Batch
System (PBS). A performance database is built along with
external variables, such as matrix size, function names,
compilers, libraries, number of cores, etc.
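A minimal sketch of this compile-and-run automation is shown below; the module swap, binary naming scheme and walltime are illustrative placeholders, not the framework's actual scripts:

```python
# Hypothetical generator for one PBS job script per compiler/library pair.
# Names such as bench_<routine>_<compiler>_<library> are illustrative.
def make_pbs_script(compiler, library, routine, sizes, cores=12):
    """Return PBS script text benchmarking `routine` over several sizes."""
    runs = "\n".join(
        f"aprun -n {cores} ./bench_{routine}_{compiler}_{library} {n}"
        for n in sizes
    )
    return (
        "#!/bin/bash\n"
        f"#PBS -N {routine}_{compiler}_{library}\n"
        f"#PBS -l size={cores}\n"
        "#PBS -l walltime=01:00:00\n"
        "cd $PBS_O_WORKDIR\n"
        f"module swap PrgEnv-pgi PrgEnv-{compiler}\n"
        f"{runs}\n"
    )

script = make_pbs_script("gnu", "acml", "dgemm", [1000, 2000, 4000])
```

One script of this shape would be generated per compiler/library combination and submitted through PBS, with the collected timings inserted into the performance database.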
Based on this initial setup, an inquiry interface is developed
to give users access to the framework. It provides suggestions
on the recommended library and compiler for better
performance in scientific applications. The performance data in
the database is plotted as diagrams for reference. If the
inquired information does not exist, the framework can
adaptively insert new benchmark functions and reserves such
information for future reference. More options for new
compilers and libraries will be explored and added to the
interface.
C.   Performance Comparison for BLAS/LAPACK Libraries
In previous work [8], nine popular subroutines from BLAS
and LAPACK were benchmarked with the different numerical
vendor libraries (LibSci, ACML and MKL) and three compilers
(PGI, GNU and Intel). The default programming environment
on Kraken, LibSci with the PGI compiler, provides in most
cases the fastest implementation of the BLAS subroutines, or
comes very close to the peak performance of ACML. However,
for computing eigenvalues and eigenvectors with DSYGV and
computing the QR factorization with DGELS in LAPACK, the
default programming environment performs poorly and can
dramatically slow down a scientific application.
Fig. 2.   DSYGV Performance on Kraken with 12 Cores
As the detailed performance in Fig. 2 shows, when calling
the DSYGV function to compute eigenvalues and
eigenvectors, linking with LibSci is not recommended because
of its poor performance. ACML with PGI performs well for
problems with small sizes, while MKL with PGI performs better
at larger sizes. Another example is shown in Fig. 3: the DGELS
routine solves a system of equations using the QR
factorization. While MKL and ACML obtain better
performance, LibSci does not perform well, especially when
compiled with Intel. According to Cray scientific developers,
this function has not yet been tuned, and they are in the
process of improving its performance for future releases.
Fig. 3.   DGELS Performance on Kraken with 12 Cores
On Nautilus, the MKL library outperforms ACML by almost a
factor of two for the LAPACK functions in general cases when
compiled with Intel. One exception is the DSYGV routine
when the number of cores is greater than 16. Fig. 4 shows the
performance of the function on 64 cores. For matrix sizes
less than 8000, ACML is the fastest implementation for solving
the eigenvalue problems.
Fig. 4.   DSYGV Performance on Nautilus with 64 Cores
D.   Performance Comparison for FFT Libraries
The Fast Fourier Transform (FFT) computes the Discrete
Fourier Transform (DFT), which is at the basis of many
scientific applications. FFTW is a popular FFT library with a
comprehensive collection of fast C routines for computing the
DFT and related transforms [9]. The latest version, 3.x, features
a new design offering better support for SIMD instructions
on modern CPUs and a distributed-memory implementation
on top of MPI. CRAFFT (Cray Adaptive FFT) [10], part of the
LibSci library, is available on Cray XT systems. CRAFFT
provides a simplified interface that delegates computations to
other FFT kernels (including FFTW) and can dynamically
select among the available FFT kernels via the
CRAFFT_PLANNER variable. Intel MKL has provided a Discrete
Fourier Transforms Interface (DFTI) since MKL 6.0. Since
version 10.2, MKL has fully integrated the FFTW 3.x interface
without any extra effort to build wrappers, so it shares
the same benchmark as FFTW in the framework
exploration. With highly efficient FFT algorithms for
computing the DFT, ACML supports FFT routines for complex
and real-to-complex data, handling single- and
multidimensional data. The multidimensional routines benefit
from the use of OpenMP for good scalability on SMP
machines.
Fig. 5.   FFT Performance on Kraken with different 2D matrix size
The FFT routines in the different libraries do not share a
common interface. Three test benches, developed for ACML,
LibSci, and MKL/FFTW (which share an interface), are linked
to the corresponding libraries and compiled with the different
compilers for benchmarking on Kraken. The transform is
performed on 2D matrices, from real to complex, with sizes
from 4 to 4096. Each computation is performed 100 times on
randomly generated matrices with the cache cleared. The
performance results for the PGI, Intel and GNU compilers are
close, so the PGI results are shown in Fig. 5, normalized with
respect to the CRAFFT_PLANNER = 0 group, which is
comparable to FFTW 3.3. Setting CRAFFT_PLANNER to 0
indicates that no online planning is done and a default FFT
kernel is used at all times. If the value is 1, some planning is
attempted to find a faster kernel than the default. If the value
is 2, planning is extensive and attempts to use the fastest
kernel available to CRAFFT. For most of the small matrix
sizes, MKL has better performance than the other libraries.
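The planner levels above are selected by exporting CRAFFT_PLANNER before each run; a hedged sketch of how a benchmark driver might do this (the benchmark binary name is hypothetical, and the command is only built here, not launched):

```python
import os

# Build the environment and launch command for one CRAFFT benchmark run.
# CRAFFT_PLANNER: 0 = default kernel, 1 = some planning, 2 = extensive.
def crafft_run(planner_level, size, cores=12):
    env = dict(os.environ, CRAFFT_PLANNER=str(planner_level))
    # "./bench_crafft" is a hypothetical test-bench binary name.
    cmd = ["aprun", "-n", str(cores), "./bench_crafft", str(size)]
    return env, cmd

env, cmd = crafft_run(planner_level=2, size=4096)
```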
III.   USER INTERFACE FOR ENVIRONMENTAL
CONFIGURATION
A user interface is developed for the framework to
simplify environment variable settings through automatic
configuration. The automated data generation tool partitions
the process of constructing the framework database into four
steps: compiling the benchmarks with different compilers and
environment settings, executing the applications with various
parameters, extracting data from the performance results, and
building a database for further inquiry. The interface hides the
details of test bench compilation and execution on HPC
platforms. Another advantage of the configuration tools is that
they spare users the configuration of complicated linking flags
and compiling options. The interface provides users with a
selection of attributes in the drop-down menu shown in Fig. 6
to guide the creation of a batch of Python scripts for the
workflow.
Fig. 6.   Performance Data Generation Interface
A.   Autoconfiguration for Linking Flags and Compiler
Options
Linking flags and compiler options for the available
resources on Kraken and Nautilus are abundant and intricate
for non-expert users. For instance, LibSci is the default
programming environment and requires no additional flags for
most compilers. When linking with ACML or MKL, the '_mp'
flag is necessary for optimized performance, taking full
advantage of the multithreaded BLAS. After successful
compilation of the source code, configuration files are required
to execute the user's application on the HPC systems. In the
job script files, the environment variable OMP_NUM_THREADS
is set by the user at runtime to the desired number of threads.
When using the MKL libraries, the environment variable
MKL_NUM_THREADS must instead be set to the maximum
number of cores in one node.
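A small sketch of how a generated job script might emit these thread settings, following the rules above (the helper name and defaults are illustrative):

```python
# Emit the thread-count export lines for a generated job script.
# Per the text: OMP_NUM_THREADS is the user's choice; with MKL,
# MKL_NUM_THREADS is set to the number of cores in one node.
def thread_env_lines(library, user_threads, cores_per_node=12):
    lines = [f"export OMP_NUM_THREADS={user_threads}"]
    if library.lower() == "mkl":
        lines.append(f"export MKL_NUM_THREADS={cores_per_node}")
    return lines
```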
On Nautilus, memory placement management and thread
affinity are important for optimizing multithreaded as well as
OpenMP applications. For optimized memory placement, the
numactl [11] tool can schedule processes on a specific
NUMA architecture, for example specifying a round-robin
placement policy across nodes. Besides memory locality, the
dplace tool [12] is used to bind a related set of processes to
specific CPUs to prevent process migration. As mentioned in
the dplace manual, the option “-x 2” is used with Intel MKL to
skip placement of the second thread, since Intel OpenMP jobs
use an extra lightweight shepherd thread, unknown to the user,
which does not need to be placed.
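The placement tooling above can be sketched as a launch-command builder; `--interleave=all` is one way to request round-robin memory placement with numactl, and the `-x 2` mask follows the dplace manual as cited (the function and binary names are illustrative):

```python
# Build a launch command combining numactl memory placement with dplace
# CPU binding. The "-x 2" option skips placement of the Intel OpenMP
# shepherd thread, as the dplace manual recommends for MKL.
def placed_command(binary, intel_mkl=False, round_robin=True):
    cmd = []
    if round_robin:
        cmd += ["numactl", "--interleave=all"]  # round-robin memory policy
    cmd.append("dplace")
    if intel_mkl:
        cmd += ["-x", "2"]                      # skip the shepherd thread
    cmd.append(binary)
    return cmd

cmd = placed_command("./app_mkl", intel_mkl=True)
```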
Fig. 7.   Performance of DGEMM with Different Environmental Variables
and Tools Configuration
Besides the configuration tools, more environment
variables are available for custom configuration to avoid
performance degradation on NUMA systems. Fig. 7 shows an
example of the varying performance of the DGEMM function
from MKL, compiled with Intel and executed on 16 cores with
different environment configurations on Nautilus. Two more
environment variables, MKL_DYNAMIC and
KMP_AFFINITY, are recommended for optimized
performance. When MKL_DYNAMIC is TRUE, MKL
dynamically sets the number of threads; otherwise, Intel MKL
uses the number of threads set by the user. Another
environment variable, KMP_AFFINITY, provides a thread
affinity mechanism for Intel OpenMP programs. If it is
DISABLED, the OpenMP runtime library does not make any
affinity-related system calls [13]. When the option “-x 2” is
configured for dplace and the other environment variables,
KMP_AFFINITY and MKL_DYNAMIC, are set appropriately,
the best performance, shown in the blue curve, reaches
121 Gflops/s. If KMP_AFFINITY is disabled, as the bottom
curve in green shows, the peak performance of DGEMM
remains at the single-core performance of 7.66 Gflops/s, which
is 15.8 times slower than the best performance. The
performance is 90% of the peak without setting
MKL_DYNAMIC to FALSE, and drops by more than 60% if
dplace is not invoked, regardless of whether the numactl
option is set.
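The settings discussed above can be summarized as an environment dictionary (the KMP_AFFINITY value shown is an assumed non-disabled choice; the text only requires that it not be DISABLED), together with a check of the quoted 15.8x slowdown:

```python
# Recommended environment for MKL DGEMM on Nautilus per the discussion.
# "compact" is an assumed affinity type, not specified in the text.
best_env = {
    "MKL_DYNAMIC": "FALSE",      # use exactly the user-set thread count
    "KMP_AFFINITY": "compact",   # any non-DISABLED setting enables affinity
    "OMP_NUM_THREADS": "16",
}

# The quoted slowdown when affinity is disabled: 121 vs. 7.66 Gflops/s.
slowdown = 121 / 7.66  # ~15.8
```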
B.   Framework Interface Implementation
The framework interface is implemented as a webpage for
universal access. Users choose the specific tasks, platform,
libraries and compilers for their applications. After a
computing platform is selected, such as Kraken or Nautilus in
the current interface, the linking flags for compilation and the
environment variable settings for optimized performance are
loaded automatically. The auto-configured files and scripts
relieve users of the complex details of program execution.
Furthermore, the framework makes it possible for users to
port applications from one computing system to another,
avoiding execution-level restrictions.
[Fig. 8 diagram: in the user interface, the user chooses the platform, library, compiler and system architecture and selects tasks (cleanse, compilation, execution); submitting the selections calls a CGI program on the backup server, which gets the parameters from the user interface, reconfigures the PBS file and Python script for execution, uploads the scripts to the configured platform for execution, and reconfigures the user interface for user access.]
Fig. 8.   Framework Interface Implementation in Python CGI
The implementation of the interface is shown in Fig. 8. The
configuration information is transferred to the backup server,
where a Python CGI program is invoked. PBS script templates
are preserved on the server with all possible and recommended
configurations for the available platforms. The background
program loads a PBS template based on the platform
selection and tailors it with the optimized environment
variables and recommended tool settings. A new PBS file is
produced along with a Python script for program compilation
and application execution. Through the Python CGI support,
the files can conveniently be accessed or uploaded to the
selected platform from the interface.
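A minimal sketch of the server-side template-filling step (the template text and form field names are illustrative; the real CGI program and PBS templates preserved on the server are not reproduced here):

```python
# Fill a preserved PBS template from the user's interface selections.
# The template body and field names below are hypothetical examples.
PBS_TEMPLATE = (
    "#!/bin/bash\n"
    "#PBS -N peak_{library}\n"
    "module swap PrgEnv-pgi PrgEnv-{compiler}\n"
    "aprun -n {cores} ./bench_{library}\n"
)

def render_pbs(form):
    """`form` mimics the parsed CGI parameters from the web interface."""
    return PBS_TEMPLATE.format(
        library=form.get("library", "libsci"),
        compiler=form.get("compiler", "pgi"),
        cores=form.get("cores", "12"),
    )

pbs_text = render_pbs({"library": "mkl", "compiler": "intel", "cores": "12"})
```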
IV.   CONCLUSIONS
The framework is built to study the performance of the
available computing resources on high performance computing
platforms, helping researchers determine the highest-performing
combination of vendor libraries and compilers, which is
essential for better utilization efficiency when executing
time-consuming scientific applications. The common functions
in BLAS and LAPACK have been benchmarked and the
performance data is well maintained. The ongoing development
of the framework targets current and emerging extreme-scale
systems, as well as parallel and distributed computations. The
performance of the different FFT implementations in the vendor
libraries is benchmarked in the framework, with regard to both
the accuracy of the implementations and the distributed memory
resources.

A website interface is developed to help researchers
determine the fastest library choices for their applications and
to save them the effort of exploring the different libraries. A
database storing the previous performance data on the website
server provides great convenience for user inquiries into better
performance for their applications.
REFERENCES
[1]   Hadri, Bilel, Timothy Robinson, Mark Fahey, and William Renaud.
"Software Usage on Cray Systems across Three Centers (NICS, ORNL
and CSCS)" , CUG 2012, Stuttgart, Germany, 2012.
[2]   BLAS, “Basic linear algebra subprograms,” http://www.netlib.org/blas/.
Dongarra, Jack J., James R. Bunch, Cleve B. Moler, and G. W. Stewart.
LINPACK Users' Guide. Society for Industrial and Applied Mathematics, 1987.
[4]   Scalapack, “Scalable Linear Algebra PACKage,”
http://www.netlib.org/scalapack/.
[5]   Cray. “LibSci,” http://docs.cray.com.
[6]   AMD, “Core Math Library,” http://www.amd.com/acml.
[7]   Intel, “Math Kernel Library (MKL),”
http://www.intel.com/software/products/mkl/.
[8]   Hadri, Bilel, and Haihang You, “A Performance Comparison Framework
for Numerical Libraries on Cray XT5 System,” CUG 2011, Fairbanks,
Alaska, 2011.
[9]   FFTW, “Fast Fourier Transform in the West,” http://www.fftw.org/.
[10]   J. Bentz, “FFT Libraries on Cray XT: Current Performance and Future
Plans for Adaptive FFT Libraries,” CUG 2008, Helsinki, Finland, 2008.
[11]   SGI, “numactl man page,” http://techpubs.sgi.com/library/
[12]   SGI, “dplace man page,” http://techpubs.sgi.com/library/
[13] Hadri, Bilel, Haihang You, and Shirley Moore. "Achieve better
performance with PEAK on XSEDE resources." In Proceedings of the
1st Conference of the Extreme Science and Engineering Discovery
Environment: Bridging from the eXtreme to the campus and beyond,
p. 10. ACM, 2012.

Concurrent Matrix Multiplication on Multi-core ProcessorsCSCJournals
 

Similar to Interface for Performance Environment Autoconfiguration Framework (20)

OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
 
Performance and Energy evaluation
Performance and Energy evaluationPerformance and Energy evaluation
Performance and Energy evaluation
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 
IBM POWER - An ideal platform for scale-out deployments
IBM POWER - An ideal platform for scale-out deploymentsIBM POWER - An ideal platform for scale-out deployments
IBM POWER - An ideal platform for scale-out deployments
 
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmarkThe Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
1.multicore processors
1.multicore processors1.multicore processors
1.multicore processors
 
AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...
AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...
AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...
 
Synergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architectureSynergistic processing in cell's multicore architecture
Synergistic processing in cell's multicore architecture
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Get higher performance for your MySQL databases with Dell APEX Private Cloud ...
Get higher performance for your MySQL databases with Dell APEX Private Cloud ...Get higher performance for your MySQL databases with Dell APEX Private Cloud ...
Get higher performance for your MySQL databases with Dell APEX Private Cloud ...
 
Streams on wires
Streams on wiresStreams on wires
Streams on wires
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computers
 
A Queue Simulation Tool for a High Performance Scientific Computing Center
A Queue Simulation Tool for a High Performance Scientific Computing CenterA Queue Simulation Tool for a High Performance Scientific Computing Center
A Queue Simulation Tool for a High Performance Scientific Computing Center
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific Applications
 
The Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with SparkThe Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with Spark
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core Processors
 

Interface for Performance Environment Autoconfiguration Framework

Interface for Performance Environment Autoconfiguration frameworK

Liang Men
Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR, 72710

Bilel Hadri and Haihang You
National Institute for Computational Science, University of Tennessee, Knoxville, TN, 37996

Abstract—The Performance Environment Autoconfiguration frameworK (PEAK) is presented to help developers and users of scientific applications find the optimal configuration for their applications on a given platform with rich computational resources and complicated options. The choices to be made include the compiler and its compilation options, the numerical libraries and their parameter settings, and other environment variables that take advantage of NUMA systems. A web-based interface is developed for users' convenience in choosing the optimal configuration, yielding significant speedups for scientific applications executed on different systems.

Keywords—HPC, Numerical Library, Compiler Autoconfiguration

I. INTRODUCTION

A systematic way of selecting and configuring the resources on High Performance Computing (HPC) systems is desirable for scientific application developers who want optimal application performance. For most application developers, an exhaustive search over all possible configurations and their many parameters is beyond their time and effort. For example, on several HPC supercomputers such as Kraken and Jaguar, the Cray XT5 systems operated by the National Institute for Computational Science (NICS) and Oak Ridge National Laboratory (ORNL) respectively, numerical libraries are among the most used software [1]; they are supplied by several vendors through optimized BLAS (Basic Linear Algebra Subprograms) [2], LAPACK (the Linear Algebra PACKage) [3], FFT, and ScaLAPACK [4] routines.
Libraries such as LibSci [5] from Cray, ACML [6] from AMD, and MKL [7] from Intel also support the compilers available on these systems, such as PGI, Cray, GNU, Intel, and PathScale. A default configuration is usually provided on a supercomputer with the expectation of handling the majority of scientific applications. However, our preliminary exploration indicates that the default environment is not the best configuration for many applications. Although the library and compiler documentation is well maintained at the supercomputing centers, such information is overwhelming for users, and it is difficult to find the optimal options.

An environment auto-configuration framework has been developed to help users of scientific applications select the optimal configuration for their applications on supercomputing platforms with abundant computational resources. It starts with benchmarks of popular numerical routines from the optimized vendor libraries. The benchmarks are compiled with each available compiler and linked against each available numerical library on the supercomputing systems. With a wide range of vector and matrix sizes as input parameters, the compiled programs are executed in different environments to build a knowledge database that preserves the performance data. The database provides a valuable reference for users to find the most beneficial environments and configurations of computational resources for their applications.

The framework interface is deployed with several distinctive functions. Initially, it provides scripts to automatically compile and execute the benchmarks of the available numerical functions in the different environments. With the performance data from the benchmarks, developers can easily discover the best configurations for their scientific applications. In addition, developers are strongly encouraged to check the performance data in the existing database when deciding among the available computational resources.
The interface provides performance diagrams based on the users' parameters and updates its database when new configurations are executed. Furthermore, the interface can advise users on static or dynamic linking paths, compiler options, and other configuration tools based on the user's choice of platform. The auto-configuration feature is essential for developers porting their codes between the different environments of the supercomputing systems for compilation and execution.

II. THE DEVELOPMENT OF THE FRAMEWORK

A. Experimental environment

The performance comparison is based on two high performance platforms at NICS, Kraken and Nautilus. Kraken is a Cray XT5 platform with a peak performance of 1.17 Pflop/s over 112,896 compute cores. Each node is composed of two 2.6 GHz six-core AMD Opteron processors (Istanbul) with 16 GB of memory. All the results in Section C were obtained on one node with twelve cores, giving a theoretical peak performance of 124.8 Gflop/s. Three numerical libraries have been studied: LibSci (10.4.5), ACML (4.4.0), and Intel MKL (10.2). Each numerical library is built with the following compilers: PGI (10.6.0), GNU (4.4.3), and Intel (11.1.038).
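The quoted theoretical peak follows directly from the node configuration. As a back-of-the-envelope check (assuming the usual figure of 4 double-precision flops per cycle per Istanbul core, i.e. a 128-bit SSE add and multiply each cycle):

```python
# Back-of-the-envelope check of Kraken's per-node theoretical peak.
# Assumption: each 2.6 GHz Istanbul core retires 4 double-precision
# flops per cycle (128-bit SSE add + multiply).
cores_per_node = 12      # two six-core Opteron processors
clock_ghz = 2.6
flops_per_cycle = 4      # assumed figure for AMD Istanbul

peak_gflops = cores_per_node * clock_ghz * flops_per_cycle
print(f"Theoretical peak: {peak_gflops:.1f} Gflop/s")  # 124.8
```

The result matches the 124.8 Gflop/s stated above.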
Nautilus is an SGI Altix 1000 UltraViolet shared-memory machine featuring 1,024 cores (Intel Nehalem EX processors running at 2.0 GHz) and 4 terabytes of memory within a single system image. Three compilers, Intel (12.1.233), PGI (10.3), and GNU (4.3.4), and two numerical vendor libraries, MKL (10.3.6) and ACML (4.4.0), are considered in the framework.

Fig. 1. Design Flow of Autoconfiguration Framework

B. Design Flow of the Framework

The framework is built on the performance data of commonly used numerical routines, which are compiled with various compilers and linked against the libraries available on the platforms. As shown in Fig. 1, a batch of the most used scientific functions is selected for benchmarking. The auto-configuration generator produces the test driver code and kernel code from the preserved test bench model and the user configuration, such as matrix sizes, timing functions, and performance evaluation functions. The compilation process, which generates executables for the various combinations of compilers and libraries, is performed automatically by a job script that links with the optimized flags and options. Performance data is generated by running each application on the platform with a wide range of input parameters, scheduled by scripts through the Portable Batch System (PBS). A performance database is developed, indexed by variables such as matrix size, function name, compiler, library, and number of cores. On top of this initial setup, an inquiry interface gives users access to the framework. It provides suggestions on the recommended library and compiler for better performance in scientific applications.
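The inquiry step can be illustrated with a minimal sketch: given a routine name and matrix size, return the (library, compiler) pair with the best recorded rate. The table contents below are made-up placeholders for illustration, not measured data, and the `recommend` helper is hypothetical.

```python
# Minimal sketch of the database inquiry: pick the configuration with
# the highest recorded Gflop/s for a given routine and matrix size.
# The numbers below are illustrative placeholders, not measured results.
perf_db = {
    ("DGEMM", 4000): {("LibSci", "PGI"): 110.0, ("MKL", "Intel"): 105.0},
    ("DSYGV", 4000): {("LibSci", "PGI"): 8.0, ("MKL", "PGI"): 30.0},
}

def recommend(function, size):
    """Return the (library, compiler) pair with the best recorded rate."""
    candidates = perf_db[(function, size)]
    return max(candidates, key=candidates.get)

print(recommend("DSYGV", 4000))  # ('MKL', 'PGI')
```

In the real framework the table is populated by the benchmark runs described above rather than hand-written.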
The performance data in the database is plotted as diagrams for reference. If the inquired information does not exist, the framework adaptively inserts new benchmark functions and reserves that information for future reference. More options for new compilers and libraries will be explored and added to the interface.

C. Performance Comparison for BLAS/LAPACK Libraries

In previous work [8], nine popular subroutines from BLAS and LAPACK were benchmarked across the numerical vendor libraries (LibSci, ACML, and MKL) with three compilers (PGI, GNU, and Intel). For the BLAS subroutines, the default programming environment on Kraken, LibSci with the PGI compiler, provides in most cases the fastest implementation or comes very close to the peak performance of ACML. However, for computing eigenvalues and eigenvectors with DSYGV and the QR factorization with DGELS in LAPACK, the default programming environment performs poorly and can dramatically slow down a scientific application.

Fig. 2. DSYGV Performance on Kraken with 12 Cores

As the detailed performance in Fig. 2 shows, when the DSYGV function is called to compute eigenvalues and eigenvectors, linking against LibSci is not recommended because of its poor performance. ACML with PGI performs well for problems of small size, and MKL with PGI performs better at larger sizes. Another example is shown in Fig. 3: the DGELS routine solves a system of equations using the QR factorization. While MKL and ACML obtain better performance, LibSci does not perform well, especially when compiled with Intel. According to the Cray scientific library developers, this function has not been perfected, and they are in the process of improving its performance for future releases.
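For readers unfamiliar with the routine being benchmarked: DSYGV solves the generalized symmetric-definite eigenproblem A x = λ B x by reducing it to a standard eigenproblem through a Cholesky factorization of B. A NumPy sketch of that mathematics (illustrating what the routine computes, not any vendor's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Symmetric A and symmetric positive definite B, as DSYGV requires.
m = rng.standard_normal((n, n))
a = (m + m.T) / 2
b = rng.standard_normal((n, n))
b = b @ b.T + n * np.eye(n)

# Reduce A x = lambda B x to a standard problem via B = L L^T:
# C = L^{-1} A L^{-T}, solve C y = lambda y, then recover x = L^{-T} y.
l = np.linalg.cholesky(b)
linv = np.linalg.inv(l)
c = linv @ a @ linv.T
eigvals, y = np.linalg.eigh(c)
x = linv.T @ y

# Each column x_j satisfies A x_j = lambda_j B x_j.
residual = np.linalg.norm(a @ x - (b @ x) * eigvals)
print(f"residual: {residual:.2e}")
```

The recovered eigenvectors are B-orthonormal (xᵀBx = I), which is the normalization DSYGV also uses.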
Fig. 3. DGELS Performance on Kraken with 12 Cores

On Nautilus, the MKL library outperforms ACML by almost a factor of two for the LAPACK functions in general cases when compiled with Intel. One exception is the DSYGV routine when the number of cores is greater than 16. Fig. 4 shows the performance of this function with 64 cores: for matrix sizes below 8000, ACML is the fastest implementation for solving the eigenvalue problems.

Fig. 4. DSYGV Performance on Nautilus with 64 Cores

D. Performance Comparison for FFT Libraries

The Fast Fourier Transform (FFT) provides the Discrete Fourier Transform (DFT) computation that underlies many scientific applications. FFTW is a popular FFT library with a comprehensive collection of fast C routines for computing the DFT and related cases [9]. The latest 3.x versions feature a brand-new design with better support for SIMD instructions on modern CPUs and a distributed-memory implementation on top of MPI. CRAFFT (Cray Adaptive FFT) [10], part of the LibSci library, is available on Cray XT systems. CRAFFT provides a simplified interface that delegates computation to other FFT kernels (including FFTW) and can dynamically select among the available FFT kernels through the CRAFFT_PLANNER setting. Intel MKL has provided the Discrete Fourier Transform Interface (DFTI) since MKL 6.0. Since version 10.2, MKL has fully integrated the FFTW3.x interface without any extra effort for building wrappers, so it contributes the same benchmark as FFTW in the framework exploration. With highly efficient FFT algorithms for computing the DFT, ACML supports FFT routines for complex and real-to-complex data, handling single- and multidimensional transforms. The multidimensional routines benefit from OpenMP for good scalability on SMP machines.

Fig. 5. FFT Performance on Kraken with different 2D matrix sizes

The FFT routines in the different libraries may not have a common interface.
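A thin wrapper is the usual way to paper over such interface differences. As an illustration, a hypothetical common front end for a 2D real-to-complex transform, with NumPy standing in for the vendor kernels that the framework's test benches wrap in the same spirit:

```python
import numpy as np

def rfft2d(data, backend="numpy"):
    """Common front end for 2D real-to-complex FFTs.

    Only a NumPy backend is wired up here; in the framework, analogous
    shims would dispatch to FFTW, CRAFFT, MKL, or ACML kernels.
    """
    if backend == "numpy":
        return np.fft.rfft2(data)
    raise NotImplementedError(f"backend {backend!r} not wired up")

x = np.random.default_rng(1).standard_normal((64, 64))
spectrum = rfft2d(x)
# Real-to-complex output keeps only the non-redundant half spectrum:
print(spectrum.shape)  # (64, 33)
```

The half-size last dimension (n//2 + 1) is the standard storage convention for real-input transforms, shared by FFTW and the vendor libraries.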
Three test benches, developed to cover ACML, LibSci, MKL, and FFTW, are linked against the corresponding libraries and compiled with the different compilers for benchmarking on Kraken. The transform is performed on 2D matrices, real to complex, with sizes from 4 to 4096. Each computation is performed 100 times on randomly generated matrices with the cache cleared. The performance results for the PGI, Intel, and GNU compilers are close, so the PGI results are shown in Fig. 5, normalized against the CRAFFT_PLANNER = 0 group, which is comparable to FFTW 3.3. Setting CRAFFT_PLANNER to 0 means that no online planning is done and a default FFT kernel is used at all times. If the value is 1, some planning is attempted to find a faster kernel than the default. If the value is 2, planning is extensive and attempts to use the fastest kernel available to CRAFFT. For most of the small matrix sizes, MKL performs better than the other libraries.

III. USER INTERFACE FOR ENVIRONMENTAL CONFIGURATION

A user interface is developed for the framework to simplify environment variable settings through automatic configuration. The automated data generation tool partitions the construction of the framework database into four steps that address the following issues: compiling the
benchmarks with different compilers and environment settings, executing the applications with various parameters, extracting data from the performance results, and building a database for further inquiry. The interface hides the details of test bench compilation and execution on the HPC platforms. Another advantage of the configuration tools is that users are relieved of configuring complicated linking flags and compile options. The interface provides a selection of attributes in drop-down menus, shown in Fig. 6, to guide the creation of a batch of Python scripts for the workflow.

Fig. 6. Performance Data Generation Interface

A. Autoconfiguration for Linking Flags and Compiler Options

The linking flags and compiler options for the available resources on Kraken and Nautilus are abundant and intricate for nonprofessional users. For instance, LibSci is the default programming environment and requires no additional flags for most compilers. When linking against ACML or MKL, the '_mp' flag is necessary for optimized performance, taking full advantage of the multithreaded BLAS. After successful compilation of the source code, configuration files are required to execute the user's application on the HPC systems. In the job script files, the environment variable OMP_NUM_THREADS is set by the user at runtime to the desired number of threads. In addition, the environment variable MKL_NUM_THREADS must be set to the maximum number of cores in one node when using the MKL libraries. On Nautilus, memory placement management and thread affinity are important for optimizing multithreaded as well as OpenMP applications. For optimized memory placement, the numactl [11] tool can schedule processes on a specific NUMA architecture, for example specifying a round-robin placement policy across the node. Besides memory locality, the dplace tool [12] is used to bind a related set of processes to specific CPUs to prevent process migration.
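How such a placement-aware launch line might be assembled can be sketched in Python. The specific flags shown (round-robin memory placement via `numactl --interleave=all`, a skip argument to `dplace`) are illustrative; the SGI man pages [11][12] are the authoritative source for the options.

```python
# Sketch: assemble a NUMA-aware launch command for a threaded binary.
# The flag choices are illustrative; see the numactl/dplace man pages.
def launch_command(binary, threads, interleave=True, skip_mask=None):
    """Build (argv, env) for a placement-aware launch."""
    cmd = []
    if interleave:
        cmd += ["numactl", "--interleave=all"]  # round-robin page placement
    cmd += ["dplace"]
    if skip_mask is not None:
        cmd += ["-x", str(skip_mask)]           # e.g. 2 skips the 2nd thread
    cmd += [binary]
    env = {"OMP_NUM_THREADS": str(threads)}
    return cmd, env

cmd, env = launch_command("./dgemm_bench", threads=16, skip_mask=2)
print(" ".join(cmd))
# numactl --interleave=all dplace -x 2 ./dgemm_bench
```

The framework emits the equivalent lines inside generated job scripts rather than invoking the tools directly.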
As mentioned in the dplace manual, the option "-x 2" is used with the Intel MKL to skip placement of the second thread, since Intel OpenMP jobs use an extra lightweight shepherd thread, unknown to the user, that does not need to be placed.

Fig. 7. Performance of DGEMM with Different Environment Variables and Tool Configurations

Besides the configuration tools, more environment variables are available for custom configuration to avoid performance degradation on the NUMA systems. Fig. 7 shows the range of performance results for the DGEMM function from MKL, compiled with Intel and executed on 16 cores of Nautilus under different environment configurations. Two more environment variables, MKL_DYNAMIC and KMP_AFFINITY, are recommended for optimized performance. When MKL_DYNAMIC is TRUE, MKL dynamically sets the number of threads; otherwise, Intel MKL uses the number of threads set by the user. The KMP_AFFINITY variable provides a thread affinity mechanism for Intel OpenMP programs. If it is DISABLED, the OpenMP runtime library does not make any affinity-related system calls [13]. When the "-x 2" option is passed to dplace and the other environment variables, KMP_AFFINITY and MKL_DYNAMIC, are set appropriately, the best performance, shown in the blue curve, reaches 121 Gflop/s. If KMP_AFFINITY is disabled, as the bottom green curve shows, the peak performance of DGEMM stays at the single-core level of 7.66 Gflop/s, 15.8 times slower than the best performance. The performance is 90% of the peak if MKL_DYNAMIC is not set to FALSE, and drops by more than 60% if dplace is not invoked, whether or not the numactl option is set.

B. Framework Interface Implementation

The framework interface is implemented as a webpage for universal access. Users choose the specific tasks, platform, libraries, and compilers for their applications.
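The step from these selections to a runnable job script can be sketched with a string template. The PBS directives, module names, and affinity value below are illustrative placeholders, not the framework's actual server-side templates:

```python
from string import Template

# Illustrative PBS template; the real templates are kept per platform on
# the framework's server with the recommended settings baked in.
PBS_TEMPLATE = Template("""\
#PBS -N $job_name
#PBS -l size=$cores
module load $library
export OMP_NUM_THREADS=$threads
export KMP_AFFINITY=$affinity
aprun -n 1 -d $threads ./$binary
""")

def tailor(platform, library, cores, binary):
    """Fill the template from the user's interface selections."""
    return PBS_TEMPLATE.substitute(
        job_name=f"peak_{platform}",
        cores=cores,
        library=library,
        threads=cores,
        affinity="granularity=fine,compact",  # stand-in recommended setting
        binary=binary,
    )

script = tailor("kraken", "xt-libsci", cores=12, binary="dgemm_bench")
print(script)
```

Substitution keeps the template inert until every user selection is known, which is why the framework can store one template per platform and specialize it on demand.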
After a computing platform is selected, such as Kraken or Nautilus in the current interface, the linking flags for compilation and the environment variable settings for optimized performance are loaded automatically. The auto-configured files and scripts relieve users from the complex details of program execution. Furthermore, the framework makes it
possible for users to port applications from one computing system to another, avoiding restrictions at the execution level.

Fig. 8. Framework Interface Implementation in Python CGI

The implementation of the interface is shown in Fig. 8. The configuration information is transferred to the backup server, where a Python CGI program is invoked. PBS script templates are preserved on the server with all possible and recommended configurations for the available platforms. The background program loads a PBS template based on the platform selection and tailors it with the optimized environment variables and recommended tool settings. The new PBS file is packaged with a Python script for program compilation and application execution. Through the Python CGI support, the files can be conveniently accessed from the interface or uploaded to the selected platform.

IV. CONCLUSIONS

The framework is built to study the performance of the available computing resources on high performance computing platforms and to help researchers determine the best-performing combination of vendor libraries and compilers, which is essential for the efficient execution of time-consuming scientific applications. The common functions in BLAS and LAPACK have been benchmarked, and the performance data is well maintained. The continuing development of the framework targets current and emerging extreme-scale systems, as well as parallel and distributed computation.
The performance of the different FFT implementations in the vendor libraries is benchmarked in the framework, with regard to the accuracy of the implementations as well as the distributed memory resources. A website interface is developed to help researchers determine the fastest library choices for their applications and spare them the effort of exploring the different libraries themselves. A knowledge database storing the previous performance data on the website server provides great convenience for user inquiries into better performance for their applications.

REFERENCES

[1] B. Hadri, T. Robinson, M. Fahey, and W. Renaud, "Software Usage on Cray Systems across Three Centers (NICS, ORNL and CSCS)," CUG 2012, Stuttgart, Germany, 2012.
[2] BLAS, "Basic Linear Algebra Subprograms," http://www.netlib.org/blas/.
[3] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPACK Users' Guide, No. 8, Society for Industrial Mathematics, 1987.
[4] ScaLAPACK, "Scalable Linear Algebra PACKage," http://www.netlib.org/scalapack/.
[5] Cray, "LibSci," http://docs.cray.com.
[6] AMD, "Core Math Library (ACML)," http://www.amd.com/acml.
[7] Intel, "Math Kernel Library (MKL)," http://www.intel.com/software/products/mkl/.
[8] B. Hadri and H. You, "A Performance Comparison Framework for Numerical Libraries on Cray XT5 System," CUG 2011, Fairbanks, Alaska, 2011.
[9] FFTW, "Fast Fourier Transform in the West," http://www.fftw.org/.
[10] J. Bentz, "FFT Libraries on Cray XT: Current Performance and Future Plans for Adaptive FFT Libraries," CUG 2008, Helsinki, Finland, 2008.
[11] SGI, "numactl man page," http://techpubs.sgi.com/library/.
[12] SGI, "dplace man page," http://techpubs.sgi.com/library/.
[13] B. Hadri, H. You, and S. Moore, "Achieve Better Performance with PEAK on XSEDE Resources,"
in Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the Campus and Beyond, p. 10, ACM, 2012.