Intel® Cluster Poisson Solver Library,
a research project for heterogeneous clusters
Alexander Kalinkin, Ilya Krjukov, Intel Corporation
Introduction
This research explores the Intel® Cluster Poisson Solver Library, which
implements a direct method for solving a grid Laplace problem in a 3D
parallelepiped domain on a cluster of Intel® Xeon® processors. The
method is based on a novel approach to data decomposition and
transportation, which improves performance on large-scale clusters.
Elliptic boundary value problems with separable variables can be
solved in a fast and direct manner. Problems of this type usually
presume a simple computational domain (a rectangle or circle) and
constant coefficients [1], [2]. They can be used to generate
preconditioners for iterative solvers that handle far more complex
problems. For example, high-accuracy models for atmospheric and
oceanic flow simulation, such as those used in numerical weather
simulations, can be solved iteratively using a Helmholtz solver with
constant coefficients as a preconditioner. Because the
preconditioner is applied at every iteration step, the Helmholtz
solver's performance is critical to the overall computation time of
the iterative solver. On a cluster, the size of the initial grid and the
data distribution determine the number of data transfers among
computing processes, as well as the amount of computation needed
by the Helmholtz solver; both can significantly affect its
performance.
This work studies the implementation of a Helmholtz solver on
clusters using 2D memory decomposition, with the objective of
minimizing data transfer and synchronization overhead. It
continues a series of works on Helmholtz solvers for shared- and
distributed-memory machines. Paper [3] compared the
performance of a Poisson solver from the Intel® Math Kernel Library
(Intel® MKL) [6] with the NETLIB* Fishpack solver, and also presented
an implementation of the Intel® Cluster Poisson Solver Library. Paper [4]
demonstrated the performance of Intel® MKL Poisson solvers with
support for periodic boundary conditions.
Experiments
All experiments were performed on a cluster with an Infiniband* interconnect,
consisting of 128 computational nodes, where each node contains two Intel® Xeon®
E5-2670 processors and 64 GB of RAM. We used Intel® MKL version 11.0.1 [6] and
Intel® MPI version 4.1.
Algorithm
The 3D Helmholtz problem is to find an approximate solution of the
Helmholtz equation:
−∂²u/∂x² − ∂²u/∂y² − ∂²u/∂z² + qu = f(x, y, z),   q = const
Problems in a parallelepiped domain with Neumann, Dirichlet, or
periodic boundary conditions can be solved using the standard seven-point
finite difference approximation on the mesh.
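As an illustration (not code from the library), the seven-point approximation replaces each second derivative with a central difference on the mesh. A minimal NumPy sketch applying this discretization of −Δu + qu to the interior of a grid with zero Dirichlet boundaries might look like:

```python
import numpy as np

def apply_seven_point(u, q, h):
    """Apply the seven-point discretization of -laplace(u) + q*u to the
    interior of a 3D grid u with mesh step h.
    Zero Dirichlet values are assumed on the boundary faces.
    (Hypothetical helper for illustration only.)"""
    out = np.zeros_like(u)
    c = 1.0 / h**2
    out[1:-1, 1:-1, 1:-1] = (
        (6.0 * c + q) * u[1:-1, 1:-1, 1:-1]
        - c * (u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1]
               + u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1]
               + u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2])
    )
    return out
```

Applied to a discrete sine mode, this stencil reproduces the mode scaled by the corresponding eigenvalue of the discrete operator, which is the property the fast solver below exploits.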
At a mesh point (x_i, y_j, z_k), if the values of the right-hand side f(x, y,
z) are given and the values of the appropriate boundary functions at
the mesh points are known, then on a shared-memory computer the
equation can be solved using a sequence of 5 steps. Each step works
with one dimension of the data by doing an FFT and an LU
decomposition of a tridiagonal matrix. On a distributed-memory cluster,
this algorithm still applies, but the problem of data distribution arises.
Depending on how the mesh is distributed among the computing
processes, the number of data transfers between these processes
varies and has a significant impact on performance. To minimize the
total number of data transfers, we propose the following initial data
distribution, as depicted in Figure 1:
Elements of the same color along the x-axis are
stored on the same process. They can be processed
independently of elements on other
processes. Then the mesh is transposed as shown
in Figure 2:
After the transposition, elements of the same
color along the y-axis are stored on the same
process, and they can be processed
independently. Following this scheme, we
transpose the mesh at the beginning of each step so that all processes can run
in parallel on independent data. With this approach, the total number of
data transfers is 4×√nproc, where nproc is the number of MPI
processes. Compared to the algorithm in [3], where the total number of
data transfers is 2×nproc, this approach is more efficient when the
number of MPI processes is large, since 4√nproc < 2·nproc whenever nproc > 4.
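To make the per-step structure concrete, here is a small sequential sketch, an illustration under stated assumptions rather than the library's implementation: for zero Dirichlet boundary conditions it diagonalizes the discrete operator with sine transforms along every axis, using dense transform matrices in place of FFTs and a diagonal spectral division in place of the tridiagonal LU step, which is mathematically equivalent for this boundary condition:

```python
import numpy as np

def helmholtz_3d_dirichlet(f, q, h):
    """Solve -laplace(u) + q*u = f on a cube of interior points with zero
    Dirichlet boundary conditions (seven-point discretization, step h).
    Sine transforms diagonalize the operator along each axis.
    (Hypothetical sequential sketch, not the library's cluster code.)"""
    n = f.shape[0]
    k = np.arange(1, n + 1)
    # Dense sine-transform matrix; S @ S.T == (n+1)/2 * I (orthogonality)
    S = np.sin(np.pi * np.outer(k, k) / (n + 1))
    # Eigenvalues of the 1D second-difference operator with step h
    lam = (2.0 - 2.0 * np.cos(np.pi * k / (n + 1))) / h**2
    # Forward sine transform along each of the three axes
    fh = np.einsum('ai,bj,ck,ijk->abc', S, S, S, f)
    # The operator is diagonal in this basis: divide by eigenvalue sums
    denom = lam[:, None, None] + lam[None, :, None] + lam[None, None, :] + q
    uh = fh / denom
    # Inverse transform carries a factor 2/(n+1) per axis
    scale = (2.0 / (n + 1)) ** 3
    return scale * np.einsum('ia,jb,kc,abc->ijk', S, S, S, uh)
```

On a cluster, each transform pass touches one axis of the array, which is why the data must be transposed between steps so that the axis being processed is local to each MPI process.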
For the first set of tests we chose a grid problem with 0.81×10⁹ unknowns
(the "small" problem). The second ("medium") test has about 3×10⁹ unknowns and,
finally, the last ("large") test contains more than 45×10⁹ unknowns. The table below
shows the time results for our algorithm as a function of the number of cores used in
the computation. All results are measured in seconds.
Cores                      64    128   256   512    1024   2048   4096
Small  (0.81×10⁹ unkn.)    X     X     X     2.87   1.56   0.907  0.627
Medium (3×10⁹ unkn.)       X     X     X     X      7.87   1.80   1.34
Large  (45×10⁹ unkn.)      X     X     X     X      X      X      4.13
(X — configuration not measured)
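The timings for the "small" problem imply the following parallel efficiencies relative to the 512-core run (a simple back-of-the-envelope calculation on the numbers above; the helper function is illustrative, not part of the benchmark):

```python
# Timings (seconds) for the "small" problem, taken from the table above.
times = {512: 2.87, 1024: 1.56, 2048: 0.907, 4096: 0.627}

def efficiency(times, base):
    """Parallel efficiency of each run relative to the run on `base` cores:
    speedup divided by the increase in core count."""
    t0 = times[base]
    return {c: (t0 / t) / (c / base) for c, t in times.items()}

for cores, eff in sorted(efficiency(times, 512).items()):
    print(f"{cores:5d} cores: efficiency {eff:5.1%}")
```

The efficiency stays above 90% at 1024 cores and drops below 60% at 4096, consistent with scaling that is almost linear up to a problem-size-dependent core count.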
Configuration Info - Versions: Intel® Math Kernel Library (Intel® MKL) 11.0.5, Intel® Compiler 13.0, Hardware: Intel® Xeon® Processor E5-268 ; Benchmark Source: Intel Corporation.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in
system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are
considering purchasing. For more information on performance tests and on the performance of Intel products, refer to http://www.intel.com/content/www/us/en/benchmarks/resources-benchmarklimitations.html
Refer to our Optimization Notice for more information regarding performance and optimization choices in Intel software products at: http://software.intel.com/en-ru/articles/optimization-notice/
*Other brands and names are the property of their respective owners.
Conclusions
• Performance scales almost linearly up to a
certain number of processes for each problem
size.
• Larger problems can efficiently use a larger
number of processes.
References
1. A.A. Samarskii and E.S. Nikolaev, Methods of Solution of Grid Problems, Nauka,
Moscow (1978) (in Russian).
2. R.W. Hockney, A fast direct solution of Poisson's equation using Fourier analysis, J.
Assoc. Comput. Mach., vol. 12, 1965, pp. 95-113.
3. A. Kalinkin, Y.M. Laevsky, S.V. Gololobov, 2D Fast Poisson Solver for High-Performance
Computing, Parallel Computing Technologies, Lecture Notes in
Computer Science, Vol. 5698, 2009.
4. A. Kalinkin, A. Kuzmin, Intel® MKL Poisson Library for scalable and
efficient solution of elliptic problems with separable variables, Collection of
Works, International Scientific Conference Parallel Computing Technologies 2012,
pp. 336-341.
5. PALM - A PArallelized LES Model, http://palm.muk.uni-hannover.de
6. Intel® Math Kernel Library, http://software.intel.com/en-us/intel-mkl