Intel® Cluster Poisson Solver Library,
a research project for heterogeneous clusters
Alexander Kalinkin, Ilya Krjukov, Intel Corporation
Introduction
This research explores the Intel® Cluster Poisson Solver Library, which
implements a direct method for solving a grid Laplace problem in a 3D
parallelepiped domain on a cluster of Intel® Xeon® processors. The
method is based on a novel approach to data decomposition and
transportation, which improves performance on large-scale clusters.
Elliptic boundary value problems with separable variables can be
solved in a fast and direct manner. Problems of this type usually
presume a single computational domain (a rectangle or a circle) and
constant coefficients [1], [2]. Fast direct solvers for them can serve as
preconditioners for iterative solvers that handle far more complex
problems. For example, high-accuracy models for atmospheric and
oceanic flow simulation, such as those used in numerical weather
simulations, can be solved iteratively using a Helmholtz solver with
constant coefficients as a preconditioner. Because the preconditioner
is applied in every iteration step, the performance of the Helmholtz
solver is critical to the overall computation time of the iterative
solver. On a cluster, the size of the initial grid and the data
distribution determine the number of data transfers among computing
processes, as well as the amount of computation needed by the
Helmholtz solver. Both can significantly affect its performance.

This work studies the implementation of a Helmholtz solver on
clusters using 2D memory decomposition, with the objective of
minimizing data transfer and synchronization overhead. It continues
a series of works on Helmholtz solvers for shared- and
distributed-memory machines. Paper [3] compared the performance of a
Poisson solver from the Intel® Math Kernel Library (Intel® MKL) [6]
with the NETLIB* Fishpack solver and also presented an implementation
of the Intel® Cluster Poisson Solver Library. Paper [4] demonstrated
the performance of the Intel® MKL Poisson Solvers with support for
periodic boundary conditions.

Experiments

All experiments were performed on a cluster with an Infiniband* interconnect,
consisting of 128 computational nodes, each containing two Intel® Xeon®
E5-2670 processors and 64 GB of RAM. We used Intel® MKL version 11.0.1 [6] and
Intel® MPI version 4.1.

Algorithm

The 3D Helmholtz problem is to find an approximate solution of the
Helmholtz equation:

$$-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} - \frac{\partial^2 u}{\partial z^2} + q u = f(x, y, z), \qquad q = \mathrm{const}$$

Problems in a parallelepiped domain with Neumann, Dirichlet, or
periodic boundary conditions can be solved using the standard seven-point finite-difference approximation on the mesh.
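
To make the seven-point approximation concrete, the following minimal sketch applies the discrete operator to the interior values of a mesh function. It is only an illustration: the function name `apply_helmholtz_7pt`, the nested-list storage, the uniform spacing `h`, and the homogeneous Dirichlet boundary are all assumptions of this example, not part of the library.

```python
import math

def apply_helmholtz_7pt(u, h, q):
    """Apply the seven-point operator -Laplace_h(u) + q*u to interior values.

    u[i][j][k] holds interior mesh values with uniform spacing h; points
    outside the array are treated as zero (homogeneous Dirichlet boundary,
    an assumption of this sketch).
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])

    def at(i, j, k):
        inside = 0 <= i < nx and 0 <= j < ny and 0 <= k < nz
        return u[i][j][k] if inside else 0.0

    out = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                neighbours = (at(i - 1, j, k) + at(i + 1, j, k)
                              + at(i, j - 1, k) + at(i, j + 1, k)
                              + at(i, j, k - 1) + at(i, j, k + 1))
                out[i][j][k] = (6.0 * u[i][j][k] - neighbours) / h**2 + q * u[i][j][k]
    return out

# Check on a discrete eigenfunction of the operator on the unit cube.
n, q = 7, 2.0
h = 1.0 / (n + 1)
s = [math.sin(math.pi * (m + 1) * h) for m in range(n)]
u = [[[s[i] * s[j] * s[k] for k in range(n)] for j in range(n)] for i in range(n)]
lam = q + 3 * (4.0 / h**2) * math.sin(math.pi * h / 2) ** 2
Au = apply_helmholtz_7pt(u, h, q)
max_err = max(abs(Au[i][j][k] - lam * u[i][j][k])
              for i in range(n) for j in range(n) for k in range(n))
print(max_err)  # difference at rounding-error level
```

The product of sines is an eigenfunction of the discrete operator, so the result should match `lam * u` up to rounding. This separability is what a fast direct method exploits.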
If the values of the right-hand side f(x, y, z) at the mesh points
(x_i, y_j, z_k) are given and the values of the appropriate boundary
functions are known, then on a shared-memory computer the equation can
be solved in a sequence of 5 steps. Each step works along one dimension
of the data, using an FFT or an LU decomposition of a tridiagonal
matrix. On a distributed-memory cluster this algorithm still applies,
but the problem of data distribution arises.

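
One plausible reading of the five steps, for homogeneous Dirichlet conditions on a cube, is: transform along x, transform along y, solve tridiagonal systems along z, then transform back along y and x. The sketch below follows that reading in plain Python. It is not the library's implementation: a dense sine transform stands in for the FFT, and the names `sine_transform`, `solve_tridiagonal`, `transform_axis`, and `solve_helmholtz` are hypothetical.

```python
import math

def sine_transform(v):
    """Orthonormal discrete sine transform (DST-I); applying it twice gives v.
    A dense O(n^2) sum stands in here for the FFT of a real solver."""
    n = len(v)
    c = math.sqrt(2.0 / (n + 1))
    return [c * sum(v[m] * math.sin(math.pi * (m + 1) * (p + 1) / (n + 1))
                    for m in range(n)) for p in range(n)]

def solve_tridiagonal(diag, off, rhs):
    """Solve tridiag(off, diag, off) x = rhs by LU elimination (no pivoting)."""
    n = len(rhs)
    d, r = [diag] * n, list(rhs)
    for m in range(1, n):
        w = off / d[m - 1]
        d[m] -= w * off
        r[m] -= w * r[m - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for m in range(n - 2, -1, -1):
        x[m] = (r[m] - off * x[m + 1]) / d[m]
    return x

def transform_axis(a, axis):
    """Sine-transform every line of the cube a along axis 0 (x) or 1 (y)."""
    n = len(a)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for s in range(n):
        for k in range(n):
            if axis == 0:
                line = sine_transform([a[i][s][k] for i in range(n)])
                for i in range(n):
                    out[i][s][k] = line[i]
            else:
                line = sine_transform([a[s][j][k] for j in range(n)])
                for j in range(n):
                    out[s][j][k] = line[j]
    return out

def solve_helmholtz(f, h, q):
    """Five passes: DST in x, DST in y, tridiagonal solve in z, DST in y, DST in x."""
    n = len(f)
    lam = [(4.0 / h**2) * math.sin(math.pi * (p + 1) / (2 * (n + 1))) ** 2
           for p in range(n)]
    g = transform_axis(transform_axis(f, 0), 1)      # steps 1 and 2
    for p in range(n):                                # step 3
        for r in range(n):
            diag = 2.0 / h**2 + lam[p] + lam[r] + q
            g[p][r] = solve_tridiagonal(diag, -1.0 / h**2, g[p][r])
    return transform_axis(transform_axis(g, 1), 0)    # steps 4 and 5

# Manufactured test: a discrete eigenfunction with a known right-hand side.
n, q = 6, 1.0
h = 1.0 / (n + 1)
mode = lambda m: [math.sin(m * math.pi * (i + 1) * h) for i in range(n)]
mu = lambda m: (4.0 / h**2) * math.sin(m * math.pi * h / 2) ** 2
a, b, c = mode(1), mode(2), mode(3)
u_exact = [[[a[i] * b[j] * c[k] for k in range(n)] for j in range(n)] for i in range(n)]
scale = q + mu(1) + mu(2) + mu(3)
f = [[[scale * u_exact[i][j][k] for k in range(n)] for j in range(n)] for i in range(n)]
u = solve_helmholtz(f, h, q)
max_err = max(abs(u[i][j][k] - u_exact[i][j][k])
              for i in range(n) for j in range(n) for k in range(n))
print(max_err)  # difference at rounding-error level
```

Each pass touches the data along a single axis only, which is why the way the mesh is split across processes matters so much on a cluster.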
Depending on how the mesh is distributed among the computing
processes, the number of data transfers between them varies, which
has a significant impact on performance. To minimize the total number
of data transfers, we propose the initial data distribution depicted
in Figure 1. Elements of the same color along the x-axis are stored on
the same process and can be processed independently of elements on
other processes. The mesh is then transposed as shown in Figure 2.
After the transposition, elements of the same color along the y-axis
are stored on the same process and can again be processed
independently. Following this scheme, we transpose the mesh at the
beginning of each step so that all processes can run in parallel on
independent data. With this approach, the total number of data
transfers is $4\sqrt{\mathit{nproc}}$, where $\mathit{nproc}$ is the
number of MPI processes. Compared to the algorithm in [3], where the
total number of data transfers is $2\,\mathit{nproc}$, this approach is
more efficient when the number of MPI processes is large.
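
The effect of a 2D decomposition on communication can be illustrated with a small ownership simulation. The sketch assumes a `pr` x `pc` process grid in which each process owns complete mesh lines along the current axis (a "pencil" layout). That layout is an assumption made here to mirror the description of Figures 1 and 2, and the helper names are hypothetical. The code counts how many other processes each process must send data to when the layout changes from lines along x to lines along y.

```python
def block(index, n, parts):
    """Block number (0..parts-1) of a mesh index under an even split of n points."""
    return index * parts // n

def transpose_partners(n, pr, pc):
    """For each process, the set of other processes it sends data to when the
    layout changes from x-pencils to y-pencils on a pr x pc process grid."""
    partners = {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                src = (block(j, n, pr), block(k, n, pc))  # owner while lines run along x
                dst = (block(i, n, pr), block(k, n, pc))  # owner while lines run along y
                partners.setdefault(src, set())
                if dst != src:
                    partners[src].add(dst)
    return partners

partners = transpose_partners(n=8, pr=4, pc=4)
per_process = {len(p) for p in partners.values()}
print(len(partners), per_process)  # 16 processes, each with 3 partners
```

In this toy setup every process exchanges data only with the processes that share its z-block, so it has `pr - 1` partners instead of the `nproc - 1` partners of a full all-to-all exchange.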

For the first set of tests we chose a grid problem with 0.81*10^9 unknowns
(the "small" problem). The second ("medium") test has about 3*10^9 unknowns,
and the last ("large") test has more than 45*10^9 unknowns. The table below
shows the run time of our algorithm as a function of the number of cores used in
the computation. All results are measured in seconds.

| Cores                    | 64 | 128 | 256 | 512  | 1024 | 2048  | 4096  |
| Small (0.81*10^9 unkn.)  | X  | X   | X   | 2.87 | 1.56 | 0.907 | 0.627 |
| Medium (3*10^9 unkn.)    | X  | X   | X   | 7.87 | 1.80 | 1.34  | X     |
| Large (45*10^9 unkn.)    | X  | X   | X   | X    | X    | X     | 4.13  |
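
The timings can be turned into speedup and parallel efficiency figures. The sketch below does this for the small problem only, and it assumes that the four measured times (2.87, 1.56, 0.907, and 0.627 seconds) belong to 512, 1024, 2048, and 4096 cores in that order, which is how the flattened table reads. The function name `scaling` and the choice of the smallest run as the baseline are assumptions of this example.

```python
def scaling(cores, seconds):
    """Speedup and parallel efficiency relative to the first (smallest) run."""
    base_cores, base_time = cores[0], seconds[0]
    rows = []
    for c, t in zip(cores, seconds):
        speedup = base_time / t
        efficiency = speedup / (c / base_cores)
        rows.append((c, speedup, efficiency))
    return rows

# Small problem; the core counts are an assumed reading of the table.
cores = [512, 1024, 2048, 4096]
seconds = [2.87, 1.56, 0.907, 0.627]
rows = scaling(cores, seconds)
for c, speedup, efficiency in rows:
    print(f"{c:5d} cores: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

Under that reading, efficiency relative to 512 cores is about 0.92 at 1024 cores, 0.79 at 2048 cores, and 0.57 at 4096 cores.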
Configuration Info - Versions: Intel® Math Kernel Library (Intel® MKL) 11.0.5, Intel® Compiler 13.0, Hardware: Intel® Xeon® Processor E5-268 ; Benchmark Source: Intel Corporation.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in
system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are
considering purchasing. For more information on performance tests and on the performance of Intel products, refer to http://www.intel.com/content/www/us/en/benchmarks/resources-benchmarklimitations.html
Refer to our Optimization Notice for more information regarding performance and optimization choices in Intel software products at: http://software.intel.com/en-ru/articles/optimization-notice/
*Other brands and names are the property of their respective owners.

• Performance scales almost linearly up to a
certain number of processes for each problem
size.
• Larger problems can efficiently use a larger
number of processes.

References

1. A. A. Samarskii and E. S. Nikolaev, Methods of Solution of Grid Problems, Nauka,
Moscow, 1978 (in Russian).
2. R. W. Hockney, A fast direct solution of Poisson equation using Fourier analysis, J.
Assoc. Comput. Mach., vol. 12, 1965, pp. 95-113.
3. A. Kalinkin, Y. M. Laevsky, S. V. Gololobov, 2D Fast Poisson Solver for
High-Performance Computing, Parallel Computing Technologies, Lecture Notes in
Computer Science, vol. 5698, 2009.
4. A. Kalinkin, A. Kuzmin, Intel® MKL Poisson Library for scalable and
efficient solution of elliptic problems with separable variables, Collection of
Works, International Scientific Conference Parallel Computing Technologies 2012,
pp. 336-341.
5. PALM - A PArallelized LES Model, http://palm.muk.uni-hannover.de
6. Intel® Math Kernel Library, http://software.intel.com/en-us/intel-mkl

More Related Content

What's hot

An experimental evaluation of similarity-based and embedding-based link predi...
An experimental evaluation of similarity-based and embedding-based link predi...An experimental evaluation of similarity-based and embedding-based link predi...
An experimental evaluation of similarity-based and embedding-based link predi...IJDKP
 
Lecture 2 more about parallel computing
Lecture 2   more about parallel computingLecture 2   more about parallel computing
Lecture 2 more about parallel computingVajira Thambawita
 
Performance boosting of discrete cosine transform using parallel programming ...
Performance boosting of discrete cosine transform using parallel programming ...Performance boosting of discrete cosine transform using parallel programming ...
Performance boosting of discrete cosine transform using parallel programming ...IAEME Publication
 
Real Time Face Detection on GPU Using OPENCL
Real Time Face Detection on GPU Using OPENCLReal Time Face Detection on GPU Using OPENCL
Real Time Face Detection on GPU Using OPENCLcsandit
 
Performance comparison of row per slave and rows set
Performance comparison of row per slave and rows setPerformance comparison of row per slave and rows set
Performance comparison of row per slave and rows seteSAT Publishing House
 
Hardware Architecture for Calculating LBP-Based Image Region Descriptors
Hardware Architecture for Calculating LBP-Based Image Region DescriptorsHardware Architecture for Calculating LBP-Based Image Region Descriptors
Hardware Architecture for Calculating LBP-Based Image Region DescriptorsMarek Kraft
 
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...Pooyan Jamshidi
 
PEEC based electromagnetic simulator
PEEC based electromagnetic simulator PEEC based electromagnetic simulator
PEEC based electromagnetic simulator Swapnil Gaul
 
Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...eSAT Journals
 
A Novel Low Complexity Histogram Algorithm for High Performance Image Process...
A Novel Low Complexity Histogram Algorithm for High Performance Image Process...A Novel Low Complexity Histogram Algorithm for High Performance Image Process...
A Novel Low Complexity Histogram Algorithm for High Performance Image Process...IRJET Journal
 
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...Serhan
 
Architecture neural network deep optimizing based on self organizing feature ...
Architecture neural network deep optimizing based on self organizing feature ...Architecture neural network deep optimizing based on self organizing feature ...
Architecture neural network deep optimizing based on self organizing feature ...journalBEEI
 
Performance analysis of real-time and general-purpose operating systems for p...
Performance analysis of real-time and general-purpose operating systems for p...Performance analysis of real-time and general-purpose operating systems for p...
Performance analysis of real-time and general-purpose operating systems for p...IJECEIAES
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 
Nondeterminism is unavoidable, but data races are pure evil
Nondeterminism is unavoidable, but data races are pure evilNondeterminism is unavoidable, but data races are pure evil
Nondeterminism is unavoidable, but data races are pure evilracesworkshop
 

What's hot (20)

An experimental evaluation of similarity-based and embedding-based link predi...
An experimental evaluation of similarity-based and embedding-based link predi...An experimental evaluation of similarity-based and embedding-based link predi...
An experimental evaluation of similarity-based and embedding-based link predi...
 
Lecture 2 more about parallel computing
Lecture 2   more about parallel computingLecture 2   more about parallel computing
Lecture 2 more about parallel computing
 
Performance boosting of discrete cosine transform using parallel programming ...
Performance boosting of discrete cosine transform using parallel programming ...Performance boosting of discrete cosine transform using parallel programming ...
Performance boosting of discrete cosine transform using parallel programming ...
 
Real Time Face Detection on GPU Using OPENCL
Real Time Face Detection on GPU Using OPENCLReal Time Face Detection on GPU Using OPENCL
Real Time Face Detection on GPU Using OPENCL
 
Performance comparison of row per slave and rows set
Performance comparison of row per slave and rows setPerformance comparison of row per slave and rows set
Performance comparison of row per slave and rows set
 
Hardware Architecture for Calculating LBP-Based Image Region Descriptors
Hardware Architecture for Calculating LBP-Based Image Region DescriptorsHardware Architecture for Calculating LBP-Based Image Region Descriptors
Hardware Architecture for Calculating LBP-Based Image Region Descriptors
 
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing S...
 
Chap1 slides
Chap1 slidesChap1 slides
Chap1 slides
 
PEEC based electromagnetic simulator
PEEC based electromagnetic simulator PEEC based electromagnetic simulator
PEEC based electromagnetic simulator
 
Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...
 
A Novel Low Complexity Histogram Algorithm for High Performance Image Process...
A Novel Low Complexity Histogram Algorithm for High Performance Image Process...A Novel Low Complexity Histogram Algorithm for High Performance Image Process...
A Novel Low Complexity Histogram Algorithm for High Performance Image Process...
 
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
 
Solution(1)
Solution(1)Solution(1)
Solution(1)
 
nnUNet
nnUNetnnUNet
nnUNet
 
Architecture neural network deep optimizing based on self organizing feature ...
Architecture neural network deep optimizing based on self organizing feature ...Architecture neural network deep optimizing based on self organizing feature ...
Architecture neural network deep optimizing based on self organizing feature ...
 
Chap5 slides
Chap5 slidesChap5 slides
Chap5 slides
 
Performance analysis of real-time and general-purpose operating systems for p...
Performance analysis of real-time and general-purpose operating systems for p...Performance analysis of real-time and general-purpose operating systems for p...
Performance analysis of real-time and general-purpose operating systems for p...
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 
M017419499
M017419499M017419499
M017419499
 
Nondeterminism is unavoidable, but data races are pure evil
Nondeterminism is unavoidable, but data races are pure evilNondeterminism is unavoidable, but data races are pure evil
Nondeterminism is unavoidable, but data races are pure evil
 

Similar to Intel Cluster Poisson Solver Library

29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)IAESIJEECS
 
29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)IAESIJEECS
 
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...Bomm Kim
 
Parallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MPParallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MPIJSRED
 
Verilog Ams Used In Top Down Methodology For Wireless Integrated Circuits
Verilog Ams Used In Top Down Methodology For Wireless Integrated CircuitsVerilog Ams Used In Top Down Methodology For Wireless Integrated Circuits
Verilog Ams Used In Top Down Methodology For Wireless Integrated CircuitsRégis SANTONJA
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsRevolution Analytics
 
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...IAEME Publication
 
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...eArtius, Inc.
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsIJMER
 
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...rinzindorjej
 
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...rinzindorjej
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...EUDAT
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETScsandit
 
The Principle Of Ultrasound Imaging System
The Principle Of Ultrasound Imaging SystemThe Principle Of Ultrasound Imaging System
The Principle Of Ultrasound Imaging SystemMelissa Luster
 
Chapter 1 - introduction - parallel computing
Chapter  1 - introduction - parallel computingChapter  1 - introduction - parallel computing
Chapter 1 - introduction - parallel computingHeman Pathak
 

Similar to Intel Cluster Poisson Solver Library (20)

FrackingPaper
FrackingPaperFrackingPaper
FrackingPaper
 
29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)
 
29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)29 19 sep17 17may 6637 10140-1-ed(edit)
29 19 sep17 17may 6637 10140-1-ed(edit)
 
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
 
Parallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MPParallelization of Graceful Labeling Using Open MP
Parallelization of Graceful Labeling Using Open MP
 
Verilog Ams Used In Top Down Methodology For Wireless Integrated Circuits
Verilog Ams Used In Top Down Methodology For Wireless Integrated CircuitsVerilog Ams Used In Top Down Methodology For Wireless Integrated Circuits
Verilog Ams Used In Top Down Methodology For Wireless Integrated Circuits
 
Parallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear ModelsParallel External Memory Algorithms Applied to Generalized Linear Models
Parallel External Memory Algorithms Applied to Generalized Linear Models
 
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
 
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...
Multi-Objective Optimization of Solar Cells Thermal Uniformity Using Combined...
 
Gk3611601162
Gk3611601162Gk3611601162
Gk3611601162
 
Accelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous PlatformsAccelerating Real Time Applications on Heterogeneous Platforms
Accelerating Real Time Applications on Heterogeneous Platforms
 
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...
 
6119ijcsitce01
6119ijcsitce016119ijcsitce01
6119ijcsitce01
 
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...
CONTRAST OF RESNET AND DENSENET BASED ON THE RECOGNITION OF SIMPLE FRUIT DATA...
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
 
genalg
genalggenalg
genalg
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
 
The Principle Of Ultrasound Imaging System
The Principle Of Ultrasound Imaging SystemThe Principle Of Ultrasound Imaging System
The Principle Of Ultrasound Imaging System
 
Chapter 1 - introduction - parallel computing
Chapter  1 - introduction - parallel computingChapter  1 - introduction - parallel computing
Chapter 1 - introduction - parallel computing
 

Recently uploaded

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 

Recently uploaded (20)

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
SQL Database Design For Developers at php[tek] 2024

Intel Cluster Poisson Solver Library

Intel® Cluster Poisson Solver Library, a research project for heterogeneous clusters
Alexander Kalinkin, Ilya Krjukov, Intel Corporation

Introduction

This research explores the Intel® Cluster Poisson Solver Library, which implements a direct method for solving a grid Laplace problem in a 3D parallelepiped domain on a cluster of Intel® Xeon® processors. The method is based on a novel approach to data decomposition and transportation, which improves performance on large-scale clusters.

Elliptic boundary value problems with separable variables can be solved quickly by direct methods. Problems of this type usually presume a single computational domain (a rectangle or a circle) and constant coefficients [1], [2]. Solvers for them can be used as preconditioners for iterative solvers that handle far more complex problems. For example, high-accuracy models for atmospheric and oceanic flow simulation, such as those used in numerical weather simulation, can be solved iteratively using a Helmholtz solver with constant coefficients as a preconditioner. Because the preconditioner is applied in every iteration step, the performance of the Helmholtz solver is critical to the overall computation time of the iterative solver. On a cluster, the size of the initial grid and the data distribution determine the number of data transfers among computing processes, as well as the amount of computation needed by the Helmholtz solver. Both can significantly affect its performance.

This work studies the implementation of a Helmholtz solver on clusters using 2D memory decomposition, with the objective of minimizing data transfer and synchronization overhead. It continues a series of works on Helmholtz solvers for shared and distributed memory machines. Paper [3] compared the performance of a Poisson solver from Intel® Math Kernel Library (Intel® MKL) [6] with the NETLIB* Fishpack solver. It also presented an implementation of Intel® Cluster Poisson Solver Library.
Paper [4] demonstrated the performance of Intel® MKL Poisson Solvers with support for periodic boundary conditions.

Algorithm

The 3D Helmholtz problem is to find an approximate solution of the Helmholtz equation:

\[
-\frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} - \frac{\partial^2 u}{\partial z^2} + q u = f(x, y, z), \qquad q = \mathrm{const}
\]

Problems in a parallelepiped domain with Neumann, Dirichlet, or periodic boundary conditions can be solved using the standard seven-point finite difference approximation on the mesh.

If the values of the right-hand side f(x, y, z) are given at the mesh points (x_i, y_j, z_k) and the values of the appropriate boundary functions at the mesh points are known, then on a shared memory computer the equation can be solved using a sequence of 5 steps. Each step processes the data along one dimension, using an FFT or an LU decomposition of a tridiagonal matrix.

On a distributed memory cluster this algorithm still applies, but the problem of data distribution arises. Depending on how the mesh is distributed among the computing processes, the number of data transfers between these processes varies, and this has a significant impact on performance.

To minimize the total number of data transfers, we propose the initial data distribution depicted in Figure 1: elements of the same color along the x-axis are stored on the same process. They can be processed independently of the elements on other processes. Then the mesh is transposed as shown in Figure 2: after the transposition, elements of the same color along the y-axis are stored on the same process, and they can again be processed independently.

Following this scheme, we transpose the mesh at the beginning of each step so that all processes can run in parallel on independent data. With this approach, the total number of data transfers is 4 × nproc, where nproc is the number of MPI processes. Compared with the algorithm in [3], where the total number of data transfers is 2 × nproc, this approach is more efficient when the number of MPI processes is large.

Experiments

All experiments were performed on a cluster with an InfiniBand* interconnect consisting of 128 computational nodes, where each node contains two Intel® Xeon® E5-2670 processors and 64 GB of RAM. We used Intel® MKL version 11.0.1 [6] and Intel® MPI version 4.1.

For the first set of tests we chose a grid problem with 0.81*10^9 unknowns (the "small" problem). The second ("medium") test has about 3*10^9 unknowns, and the last ("large") test contains more than 45*10^9 unknowns. The table below shows the run time of our algorithm as a function of the number of cores used in the computation. All results are measured in seconds.

Cores                          64    128   256   512    1024   2048    4096
Small  (0.81*10^9 unknowns)    X     X     X     2.87   1.56   0.907   0.627
Medium (3*10^9 unknowns)       X     X     X     7.87   1.80   1.34    X
Large  (45*10^9 unknowns)      X     X     X     X      X      X       4.13

Configuration Info - Versions: Intel® Math Kernel Library (Intel® MKL) 11.0.5, Intel® Compiler 13.0; Hardware: Intel® Xeon® Processor E5-268 ; Benchmark Source: Intel Corporation.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.
For more information on performance tests and on the performance of Intel products, refer to http://www.intel.com/content/www/us/en/benchmarks/resources-benchmarklimitations.html

Refer to our Optimization Notice for more information regarding performance and optimization choices in Intel software products at: http://software.intel.com/en-ru/articles/optimization-notice/

*Other brands and names are the property of their respective owners.

• Performance scales almost linearly up to a certain number of processes for each problem size.
• Larger problems can efficiently use a larger number of processes.

References

1. A. A. Samarskii and E. S. Nikolaev, Methods of Solution of Grid Problems, Nauka, Moscow, 1978 (in Russian).
2. R. W. Hockney, A fast direct solution of Poisson's equation using Fourier analysis, J. Assoc. Comput. Mach., vol. 12, 1965, pp. 95-113.
3. A. Kalinkin, Y. M. Laevsky, S. V. Gololobov, 2D Fast Poisson Solver for High-Performance Computing, Parallel Computing Technologies, Lecture Notes in Computer Science, vol. 5698, 2009.
4. A. Kalinkin, A. Kuzmin, Intel® MKL Poisson Library for scalable and efficient solution of elliptic problems with separable variables, Collection of Works, International Scientific Conference Parallel Computing Technologies 2012, pp. 336-341.
5. PALM - A PArallelized LES Model, http://palm.muk.uni-hannover.de
6. Intel® Math Kernel Library, http://software.intel.com/en-us/intel-mkl
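
The five-step direct method outlined in the Algorithm section can be illustrated with a minimal single-process sketch in Python with NumPy. It is not the library's implementation or API. It assumes homogeneous Dirichlet boundary conditions and a uniform mesh step h, and it uses dense sine-transform matrices in place of an FFT to keep the code short. It does no MPI data distribution or mesh transposition. All function names (sine_matrix, laplace_eigenvalues, solve_helmholtz_3d) are hypothetical.

```python
import numpy as np


def sine_matrix(n):
    """Dense sine-transform matrix S[j, k] = sin(pi*(j+1)*(k+1)/(n+1)).

    S is symmetric and S @ S = (n+1)/2 * I, so its inverse is 2/(n+1) * S.
    A production solver would use an FFT-based transform instead.
    """
    idx = np.arange(1, n + 1)
    return np.sin(np.pi * np.outer(idx, idx) / (n + 1))


def laplace_eigenvalues(n, h):
    """Eigenvalues of the 1D operator (-1, 2, -1)/h^2 with Dirichlet ends."""
    k = np.arange(1, n + 1)
    return (4.0 / h**2) * np.sin(np.pi * k / (2 * (n + 1))) ** 2


def solve_helmholtz_3d(f, h, q):
    """Solve the seven-point discretization of -Laplace(u) + q*u = f, u = 0 on the boundary.

    f holds right-hand-side values at interior mesh points, shape (nx, ny, nz).
    """
    nx, ny, nz = f.shape
    sx, sy = sine_matrix(nx), sine_matrix(ny)
    lam = laplace_eigenvalues(nx, h)[:, None] + laplace_eigenvalues(ny, h)[None, :]

    # Steps 1-2: forward sine transforms along x, then along y.
    g = np.einsum("ai,ijk->ajk", sx, f)
    g = np.einsum("bj,ajk->abk", sy, g)

    # Step 3: tridiagonal solve along z for every (x-mode, y-mode) pair
    # (Thomas algorithm, i.e. LU decomposition of a tridiagonal matrix).
    diag = 2.0 / h**2 + lam + q   # main diagonal, shape (nx, ny)
    off = -1.0 / h**2             # sub- and super-diagonal entries
    c = np.empty_like(g)          # modified super-diagonal
    d = np.empty_like(g)          # modified right-hand side
    c[:, :, 0] = off / diag
    d[:, :, 0] = g[:, :, 0] / diag
    for m in range(1, nz):
        denom = diag - off * c[:, :, m - 1]
        c[:, :, m] = off / denom
        d[:, :, m] = (g[:, :, m] - off * d[:, :, m - 1]) / denom
    w = np.empty_like(g)
    w[:, :, -1] = d[:, :, -1]
    for m in range(nz - 2, -1, -1):
        w[:, :, m] = d[:, :, m] - c[:, :, m] * w[:, :, m + 1]

    # Steps 4-5: inverse sine transforms along y, then along x.
    u = np.einsum("bj,abk->ajk", sy, w) * (2.0 / (ny + 1))
    u = np.einsum("ai,ajk->ijk", sx, u) * (2.0 / (nx + 1))
    return u


# Check against a discrete eigenfunction on the unit cube, for which the
# exact solution of the discrete problem is known.
n, q = 15, 1.0
h = 1.0 / (n + 1)
x = np.arange(1, n + 1) * h
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
u_exact = np.sin(np.pi * X) * np.sin(np.pi * Y) * np.sin(np.pi * Z)
f_rhs = (3 * laplace_eigenvalues(n, h)[0] + q) * u_exact
max_error = np.max(np.abs(solve_helmholtz_3d(f_rhs, h, q) - u_exact))
print("max error:", max_error)
```

In this sketch each of the five steps touches the data along a single axis only. That is the property the proposed distribution exploits: if every process holds complete lines along the axis being processed, the step needs no communication, and the mesh is transposed between steps.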