CMES Galley Proof Only Please Return in 48 Hours.
Proof
Copyright © 2013 Tech Science Press CMES, vol.0, no.0, pp.1-22, 2013
Comparison and Performance Analysis of Multiple CPU/GPU Computing Systems – Resin Infusion Flow Modeling Application

R.H. Haney1 and R.V. Mohan2
Abstract: The use of Graphics Processing Units (GPUs) as co-processors for single CPU/GPU computing systems has become pronounced in high performance computing research; however, the solution of truly large scale computationally intensive problems requires the utilization of multiple computing nodes. Multiple CPU/GPU computing systems bring new complexities to the observed performance of computationally intensive applications, the more salient of which is the cost of local CPU-GPU host and intra-nodal communication. This paper compares and analyzes the performance of a computationally intensive application, represented by resin infusion flow during the liquid composite molding process for the manufacture of structural composites, on two distinct multiple CPU/GPU computing system architectures. Resin flow infusion modeling during the liquid composite molding process is the engineering application of interest in the present study. The global domain is partitioned into a series of sub-domains, each of which is solved at the local host and reassembled for the final solution as per the domain decomposition methodology. The candidate application, as with many scientific and engineering applications, uses the Finite Element Method (FEM) to computationally model the governing physics-based mass and momentum conservation equations. FEM discretization results in large sparse linear equation systems that are solved iteratively for this class of free surface, moving boundary value problem. Computational analysis software for the GPU environment has been developed using the CUDA API for the iterative linear equation system solver based on the preconditioned conjugate gradient method. These linear equation systems are solved multiple times with intra-nodal communication utilized, resulting in the converged global system. The interplay of local host CPU/GPU and intra-nodal communication creates mixed performance results for the presented candidate application. The software/hardware factors that affect performance for each architecture are examined and discussed in this paper –
1 North Carolina A&T State University, Greensboro, NC, U.S.A.
2 North Carolina A&T State University, Greensboro, NC, U.S.A. Corresponding author.
understanding how the presented candidate application’s observed performance is affected by both the individual multiple CPU/GPU computing system architecture and algorithmic/software design is critical to optimizing many modern high performance applications which employ the GPU as a hardware accelerator.
Keywords: resin flow infusion modeling, sparse matrix, performance, multiple GPUs.
1 Introduction
The Graphics Processing Unit (GPU) is inexorably becoming commonplace for the acceleration of computationally intensive applications on single CPU/GPU computing systems [Hamada, Narumi et al. (2009); Kuznik, Obrecht et al. (2010); Corrigan, Camelli et al. (2011); Bustamam, Burrage et al. (2012)] as ever larger computing challenges seek to leverage the dramatic computational power expressed by the modern GPU device [Fatahalian and Houston (2008); Garland, Le Grand et al. (2008); Grozea, Bankovic et al. (2010); Bustamam, Burrage et al. (2012)]. However, truly large scale high performance computing applications necessitate the use of multiple computing nodes. The execution of computationally intensive applications within the context of multiple CPU/GPU computing systems has been shown with varying degrees of success [Karunadasa and Ranasinghe (2009); Corrigan, Camelli et al. (2011); Kim, Seo et al. (2012); Lee and Vetter (2012)]. This paper examines the performance of computational analysis software for resin infusion flow modeling during the liquid composite molding process for structural composites on two distinct multiple CPU/GPU computing systems – exposing the effects of different hardware and software influencing factors.
Liquid Composite Molding (LCM) is a popular process used for the manufacture of polymer composite structures. The process involves injection of a polymeric resin inside a mold cavity filled with an oriented, woven dry fiber preform [Mohan, Ngo et al. (1996); Mohan, Shires et al. (2001)]. The resin infusion flow modeling and simulation of the LCM process involves the isothermal process flow solution based on the conservation of resin mass as the governing equation in FEM computational developments. The governing coupled mass and momentum conservation equation is discretized, and the fill factors and pressure values are solved in an iterative manner [Mohan, Ngo et al. (1996); Mohan, Shires et al. (2001)]. The computationally intensive portion of the analysis code is the multiple calls to the linear equation system solver, which uses the Preconditioned Conjugate Gradient (PCG) method.
The PCG iterative linear equation system solver consists of a set of matrix-vector operations [Shewchuk (1994)] that map well to the natural matrix structure of the
GPU using the Nvidia Compute Unified Device Architecture (CUDA) API [Nvidia (2012)]. The performance results of two different multiple CPU/GPU computing systems are examined and discussed, offering possible avenues of further investigation for both rapid and robust code developments using the GPU that can be applied to the candidate application and translate well to other legacy FEM based codes. The present paper is organized as follows.
The GPU and its evolution are briefly discussed in the first section, as well as the details of the CPU and GPU hardware employed in the present work. The next section provides a description of the candidate application and includes the details of the physical problem, model equations, implementation, and the computational solution algorithm involved. A description of the GPU implementation and development for the resin flow infusion modeling is presented in the following section. The final section examines and discusses the computational performance of the presented candidate application, providing analysis, comparisons, and conclusions.
2 Graphics Processing Units
The modern GPU consists of a heavy concentration of transistors in the Arithmetic Logic Units (ALUs) and a wide data bus [Boggan and Pressel (2007); Luebke and Humphreys (2007); Fatahalian and Houston (2008)]. The memory architecture has remained relatively simple to facilitate quicker access to input data. Unlike the CPU [Rumpf and Strzodka (2005); Luebke and Humphreys (2007)], the GPU separates the configuration/instruction from the data to be operated on by the processor’s large number of ALUs.
2.1 GPU evolution
The gaming industry has been a key driving force for the improvement in the computing power of GPUs over the years [Buck, Foley et al. (2004); Boggan and Pressel (2007); Luebke and Humphreys (2007)]. Images are rendered using matrix operations applied simultaneously to large sets of data. Graphics hardware receives the instruction/configuration prior to the influx of the data to be operated on, a natural Single Instruction Multiple Data (SIMD) environment seen in many parallel systems [Wilkinson and Allen (2005); Fatahalian and Houston (2008)]. Newer and more detailed games required faster processing of matrix operations as well as a wider data bus to enable rapid execution. Transistors were concentrated for floating point operations to facilitate the large number of matrix operations [Boggan and Pressel (2007)] inherent in graphical processing hardware. The CPU, on the other hand, needed to maintain flexibility [Patterson and Hennessy (1998); Silberschatz, Galvin et al. (2003)]. Both CPU and GPU must deal with latency, and do so in distinctly different ways based on their architecture. The CPU
maintains a complex memory hierarchy to execute multiple processes in quick succession to deal with latency [Patterson and Hennessy (1998); Silberschatz, Galvin et al. (2003); Boggan and Pressel (2007)]. The GPU, on the other hand, deals with latency by executing large data sets with the same configuration/instruction, which is set up prior to any data processing, using a wide data bus and transistors concentrated in the arithmetic region. Fig. 1 illustrates typical CPU and GPU transistor distribution. The evolution of GPU architecture has resulted in a computational workhorse optimized for data throughput.
Figure 1: Distribution of transistors for CPU and GPU hardware devices
Initially, graphics hardware operated in a fixed pipeline and any programming was low level and very complex [Boggan and Pressel (2007)]. Thus the ability to access the GPU in any realistic way to leverage its computational power was limited. The development of the flexible pipeline as well as high level language constructs paved the way for the use of the GPU as a co-processor [Boggan and Pressel (2007); Luebke and Humphreys (2007)]. GPU software paradigms are further enabling their use as computational processors in computationally intensive engineering analysis applications such as the one employed in the present work.
2.2 CPU and GPU architectures
In this work, we utilize two distinct multiple CPU/GPU computing systems, each consisting of multiple nodes composed of a single commodity GPU and CPU processor, for software development and comparative performance analysis. These two multiple CPU/GPU computing systems are denoted as System A and System B and are defined as follows.

1. System A: The CPU and GPU device architectures for this multiple CPU/GPU computing system can be viewed as the lower-grade of the two presented in this work. Each node in the multiple CPU/GPU computing system contains the following technical specifications.
• CPU: A dual-core Opteron processor with a clock frequency of 2.8 GHz, 64-bit DDR2, an L1 cache of 256 KB, an L2 cache of 2 MB and a 1000 MHz Front Side Bus.
• GPU: Nvidia Quadro FX5600 with 1.5 GB global memory, CUDA compute architecture 1.0, 128 cores, 8192 registers, 16 shared memory banks, a clock frequency of 1.35 GHz and GDDR3 SDRAM using a 384-bit wide bus.
2. System B: The CPU and GPU device architectures for this multiple CPU/GPU computing system can be viewed as the higher-grade of the two presented in this work. Each node in the multiple CPU/GPU computing system contains the following technical specifications.

• CPU: A 6-core Intel X5650 processor with a clock frequency of 2.6 GHz, 64-bit DDR3, an L1 cache of 384 KB, an L2 cache of 1.5 MB, an L3 cache of 12 MB and a 1300 MHz Front Side Bus.
• GPU: Nvidia Tesla M2070 with 6 GB global memory, CUDA compute architecture 2.0, 448 cores, 32768 registers, 32 shared memory banks, a clock frequency of 1.15 GHz and GDDR5 SDRAM using a 384-bit wide bus.
The next section provides a description of the candidate application and includes the details of the physical problem, model equations, implementation, and the computational solution algorithm involved.
3 Candidate engineering analysis application
Resin infusion flow modeling in the liquid composite manufacturing process for the manufacture of woven fabric structural composite parts is the candidate engineering analysis application for the computational code development and GPU/CPU performance analysis employed in the present study. Resin flow infusion modeling involves the tracking of the progressing resin from the injection gates through the woven fabric filled Eulerian mold cavity mesh geometry. For thin shell composite structural configurations employing a 2D flow representation and thickness averaged 2D flow field, three-node triangular finite elements are used to discretize the mold cavity domain [Mohan, Ngo et al. (1996); Mohan, Shires et al. (2001)]. The finite element discretization for the computational problem involves the Galerkin weighted residual finite element formulation of the resin mass conservation equation in conjunction with the Darcian porous media flow field representing the momentum conservation, and an implicit methodology for the transient solution [Mohan, Ngo et al. (1996); Mohan, Shires et al. (2001)].
3.1 Resin infusion flow modeling
Following the discussions in [Mohan, Shires et al. (2001); Mohan, Ngo et al. (1996)], the resin flow through the fiber preform contained within the mold cavity is represented by a transient mass conservation equation. The physical mass conservation equation (formed by coupling the mass conservation equation with the momentum equation via the Darcian velocity field) is given by Eq. (1), with K̄ the permeability tensor, µ the resin viscosity, P the pressure field, and Ψ the state variable representing the resin flow infusion state of the mold cavity region.

∂/∂t ∫_Ω Ψ dΩ ≡ ∫_Ω ∇·( (K̄/µ) ∇P ) dΩ   (1)
Further details of this governing equation are available in [Mohan, Ngo et al. (1996)] and [Mohan, Shires et al. (2001)]. The value of the state variable Ψ is 0 in the regions of the mold where the resin has not infused the fiber preform and 1 where the resin has completely infused the fiber preform in any given region of the Eulerian mold cavity domain Ω in the FEM computations.
As discussed in [Mohan, Ngo et al. (1996)] and [Mohan, Shires et al. (2001)], applying the Galerkin weighted residual formulation and approximating the pressure P and fill factor Ψ with appropriate elements to define the mold cavity resin infusion geometry, and associated shape functions, yields a semi-discrete system of equations given by Eq. (2), with C the lumped mass matrix, K the stiffness matrix, q the load vector, and Ψ̇ the time derivative.

C Ψ̇ + K P = q   (2)

The transient semi-discrete equation is then solved by introducing the finite difference approximation given by Eq. (3) for the time derivative term.

Ψ̇ = (Ψ^{n+1} − Ψ^n) / ∆t   (3)
3.2 Resin flow infusion modeling strategy
The semi-discrete Eq. (2) can be reduced to Eq. (4), as discussed in [Mohan, Ngo et al. (1996)] and [Mohan, Shires et al. (2001)].

C_ii Ψ_i^{n+1} − C_ii Ψ_i^n + ∆t K_ij P_j = ∆t q_i   (4)
The above form of the discretized equation is solved during the resin infusion flow modeling of the transient resin progression at each time step. Equation (4)
defines the implicit form of the process flow modeling in LCM, as shown in references [Mohan, Ngo et al. (1996)] and [Mohan, Shires et al. (2001)]. The generalized algorithm for the finite element computations for the process flow modeling and analysis of the resin progression in the transient, free surface flow problem for each time step is summarized in Algorithm 1. The finite element based solution strategy employed results in a system of linear equations that is solved using an iterative preconditioned conjugate gradient method involving matrix-vector and dot product operations.
Algorithm 1: Implicit Pure FE methodology for Resin Infusion Flow Modeling Computation
(For progression from time step n to time step n+1 and iteration m)
1. REPEAT
2.   SET {Ψi}^{n+1}_m to {Ψi}^n (save previous fill factor values)
3.   CALL assembleC for Cii (assembleC forms the lumped mass matrix C)
4.   CALL assembleK for Kij (assembleK forms the stiffness matrix K)
5.   CALL assembleLoad for qi (assembleLoad forms the load vector q)
6.   REPEAT
7.     SET boundary conditions on Kij (where K̂ij is the K matrix with boundary conditions applied)
8.     SET {gi}_m to Cii{Ψi}^n − Cii{Ψi}^{n+1}_m + ∆t{qi}_m (modified load vector g)
9.     SOLVE [K̂ij]_m {Pj}_m = {gi}_m (compute the new nodal resin fraction field using equation (4))
10.    SET Cii{Ψi}^{n+1}_{m+1} = Cii{Ψi}^n − ∆t[Kij]{Pj}_m + ∆t{qi}_m
11.    IF ‖Cii{Ψi}^{n+1}_{m+1} − Cii{Ψi}^{n+1}_m‖ ≤ ξ THEN
12.      BREAK
13.    ELSE
14.      SET {Ψi}^{n+1}_m to {Ψi}^{n+1}_{m+1}
15.    ENDIF
16.  UNTIL resin mass convergence
17. UNTIL all nodes are filled
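As a loose illustration of the inner iteration of Algorithm 1 (lines 8–11), the sketch below runs one implicit step on a toy assembled system. The matrices are small random stand-ins rather than a real FEM assembly, `np.linalg.solve` stands in for the PCG call of line 9, and the convergence test uses a simple norm — all assumptions for illustration only.

```python
import numpy as np

def implicit_fill_step(C_diag, K, q, psi_n, dt, xi=1e-10, max_iter=50):
    """One implicit time step of the fill-factor update (Algorithm 1, lines 6-16).

    C_diag is the lumped mass matrix diagonal, K the assembled stiffness
    matrix, q the load vector, and psi_n the fill factors at time step n.
    """
    psi = psi_n.copy()
    for _ in range(max_iter):
        # Line 8: modified load vector g
        g = C_diag * psi_n - C_diag * psi + dt * q
        # Line 9: stand-in for the PCG solve of K^ P = g
        P = np.linalg.solve(K, g)
        # Line 10: fill-factor update from equation (4)
        psi_next = psi_n - (dt * (K @ P) - dt * q) / C_diag
        # Line 11: resin mass convergence check (norm form, an assumption)
        if np.linalg.norm(C_diag * psi_next - C_diag * psi) <= xi:
            break
        psi = psi_next
    return psi_next, P

rng = np.random.default_rng(0)
n = 8
B = rng.standard_normal((n, n))
K = B @ B.T + n * np.eye(n)               # SPD stand-in for the stiffness matrix
C_diag = np.abs(rng.standard_normal(n)) + 1.0
q = rng.standard_normal(n)
psi, P = implicit_fill_step(C_diag, K, q, np.zeros(n), dt=0.01)
```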
The interested reader is referred to [Mohan, Ngo et al. (1996)] for further details on the resin infusion flow modeling in the LCM process and the finite element discretization employed for the computational problem.
3.3 CPU and GPU resin flow infusion modeling validation
The validity of the CPU and GPU code developments for multiple computing nodes was determined by careful examination of the numerical results against the analytical solution for a simple 2D circular plate radial injection model (see Fig. 2). The analytical solution for this simple circular plate geometry radial injection condition was determined based on the radial flow front location and injection port pressure at time step t for multiple numbers of sub-domains, derived by partitioning the initial global domain and applying the equations presented in reference [Mohan, Ngo et al. (1996)]. Eq. (5) and Eq. (6) define the radial flow front and injection port pressure respectively, with R0 the inner radius, Q the injection flow rate, µ the resin viscosity, φ the fiber volume fraction, K̄ the permeability, and H the element or mold cavity thickness.

R(t) = √( Qt / (πφH) + R0² )   (5)

P0 = ( µQ / (2π K̄ H) ) ln( R(t) / R0 )   (6)
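Eqs. (5) and (6) are simple to evaluate directly; the sketch below does so with hypothetical process parameters (the values of Q, µ, K, φ, H and R0 are illustrative, not those used in the study):

```python
import math

def flow_front_radius(t, Q, phi, H, R0):
    """Radial flow front location R(t), Eq. (5)."""
    return math.sqrt(Q * t / (math.pi * phi * H) + R0**2)

def injection_pressure(t, Q, mu, K, phi, H, R0):
    """Injection port pressure P0 at time t, Eq. (6)."""
    R = flow_front_radius(t, Q, phi, H, R0)
    return mu * Q / (2.0 * math.pi * K * H) * math.log(R / R0)

# Illustrative (hypothetical) process parameters
Q, mu, K = 1.0e-6, 0.1, 1.0e-9     # m^3/s, Pa.s, m^2
phi, H, R0 = 0.5, 2.0e-3, 5.0e-3   # -, m, m

R_60 = flow_front_radius(60.0, Q, phi, H, R0)    # flow front after 60 s
P_60 = injection_pressure(60.0, Q, mu, K, phi, H, R0)
```

At t = 0 the flow front sits at the inner radius R0 and the pressure drop is zero, which provides a quick sanity check of the implementation.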
Figure 2: 2D circular plate radial injection flow model
The radial flow front and injection port pressures for this thin shell circular plate model from the computational CPU and GPU analysis values for serial and 2-partition runs were compared to the analytical values. Fig. 3, Fig. 4, Fig. 5 and Fig. 6 clearly demonstrate the congruency of the CPU and GPU numerical solutions to the analytical solutions at each time step for serial and multiple partitions for both System A and System B. Additional computational modeling analysis comparisons for other complex composite structures also showed excellent numerical congruency between CPU and GPU analysis code execution runs.
Figure 3: Transient flow front locations comparisons for serial/multiple CPU/GPU
System A
Figure 4: Transient injection pressure comparisons for serial/multiple CPU/GPU
System A
The mapping of the candidate application to the multiple CPU/GPU computing system environments is discussed next.
Figure 5: Transient flow front location comparisons for serial/multiple CPU/GPU
System B
Figure 6: Injection pressure comparisons for serial/multiple CPU/GPU System B
4 GPU implementation and code development strategies
Finite element discretization of the transient, moving boundary value problem involves the solution of a linear system of equations multiple times during the computational analysis and is the computationally intensive portion in the resin flow infusion modeling analysis of complex composite structures. The Preconditioned Conjugate Gradient (PCG) solution methodology for the solution of the linear system of equations (line 9 of Algorithm 1) executes matrix operations, involves significant computational load, and was chosen as the key resin flow infusion modeling kernel to port to the GPU. The PCG solver is composed of sparse matrix-vector multiplication and vector operations, as shown in Algorithm 2.

The PCG solver is executed with an element-centric [Mohan, Ngo et al. (1996); Mohan, Shires et al. (2001)] rather than node-centric focus, which results in a larger
Algorithm 2: Preconditioned conjugate gradient (solves Ax = b)
Input: matrix A and load/force vector b
Output: solution vector x
1. SET r_0 ⇐ b − A x_0
2. SET z_0 ⇐ M⁻¹ r_0
3. SET p_0 ⇐ z_0
4. SET k ⇐ 0
5. DO UNTIL CONVERGENCE
6.   α_k ⇐ (r_kᵀ z_k) / (p_kᵀ A p_k)
7.   x_{k+1} ⇐ x_k + α_k p_k
8.   r_{k+1} ⇐ r_k − α_k A p_k
9.   IF (‖r_k − r_{k+1}‖ ≤ ξ) BREAK
10.  z_{k+1} ⇐ M⁻¹ r_{k+1}
11.  β_k ⇐ (z_{k+1}ᵀ r_{k+1}) / (z_kᵀ r_k)
12.  p_{k+1} ⇐ z_{k+1} + β_k p_k
13.  k ⇐ k + 1
14. END DO
ratio of computation to communication cost [Haney (2006)] – an optimal strategy to leverage the dramatic floating point power of the GPU. The sparse matrix-vector operation (lines 2 and 10 of Algorithm 2) is the largest computational cost of the PCG solver [Buatois, Caumon et al. (2009); Helfenstein and Koko (2011)] and an obvious section for porting to the GPU device. The other major computational portions of the linear system solver are vector updates and scalar dot-products, both of which are handled via CUBLAS library calls [Shewchuk (1994); Nvidia (2007)].
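A minimal NumPy sketch of Algorithm 2 follows, using a Jacobi (diagonal) preconditioner for M — an assumption, since the paper does not state which preconditioner is used — and a residual-norm convergence test:

```python
import numpy as np

def pcg(A, b, x0=None, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient for SPD A, Jacobi preconditioner."""
    x = np.zeros(b.shape[0]) if x0 is None else x0.copy()
    M_inv = 1.0 / np.diag(A)           # Jacobi preconditioner M^{-1}
    r = b - A @ x                      # line 1
    z = M_inv * r                      # line 2
    p = z.copy()                       # line 3
    rz = r @ z
    for _ in range(max_iter):          # line 5
        Ap = A @ p                     # the sparse matrix-vector product
        alpha = rz / (p @ Ap)          # line 6
        x += alpha * p                 # line 7
        r -= alpha * Ap                # line 8
        if np.linalg.norm(r) <= tol:   # line 9 (residual-norm variant)
            break
        z = M_inv * r                  # line 10
        rz_new = r @ z
        beta = rz_new / rz             # line 11
        p = z + beta * p               # line 12
        rz = rz_new
    return x

rng = np.random.default_rng(1)
G = rng.standard_normal((20, 20))
A = G @ G.T + 20 * np.eye(20)          # SPD test matrix
b = rng.standard_normal(20)
x = pcg(A, b)
```

In a CUDA port, the `A @ p` product becomes the custom sparse kernel and the vector updates and dot products map to CUBLAS calls, as described above.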
The presented candidate application utilizes domain decomposition to partition the global problem domain such that each resulting sub-domain is solved on a local CPU/GPU host – communication is handled via MPI. The interplay of the coarse-grained parallelism expressed by MPI and the fine-grained parallelism of the GPU is the source of the often disappointing performance of computationally intensive applications employing multiple CPU/GPU computing systems [Papadrakakis, Stavroulakis et al. (2011); Wang, Huang et al. (2011); Song and Dongarra (2012)].
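The decomposition idea can be sketched independently of MPI: the hypothetical helper below splits a 1-D range of node indices into contiguous sub-domains, each with the neighbouring (halo) indices whose values would be exchanged between hosts after each solver pass. It is an illustration only; the actual mesh partitioner used in the study is not named in the paper.

```python
def partition_nodes(n_nodes, n_parts, halo=1):
    """Split node indices 0..n_nodes-1 into contiguous sub-domains.

    Each sub-domain lists its owned indices plus the neighbouring (halo)
    indices whose values must be exchanged via MPI between solver passes.
    """
    base, rem = divmod(n_nodes, n_parts)
    parts, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < rem else 0)      # spread the remainder
        owned = list(range(start, start + size))
        halo_left = list(range(max(0, start - halo), start))
        end = start + size
        halo_right = list(range(end, min(n_nodes, end + halo)))
        parts.append({"owned": owned, "halo": halo_left + halo_right})
        start = end
    return parts

parts = partition_nodes(10, 3)   # 10 nodes over 3 sub-domains
```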
5 Computational performance analysis and comparison
Figure 7: Compressed Sparse Row (CSR) format
The first half of this section presents the preliminary data for the models being examined and the second half presents the observed performance using the System A and System B multiple CPU/GPU computing systems. The input unstructured mesh models are initially solved using the CSR (Compressed Sparse Row) compression format shown in Fig. 7; then the BCSR2x2 (Block Compressed Sparse Row) compression format is utilized. The CSR compression format is well documented as a potential performance problem, as increased pointer indirection can magnify irregularity of data element access for sparse matrix-vector multiplications [Baskaran and Bordawekar (2008); Hugues and Petiton (2010)]. The BCSR2x2 compression format, shown in Fig. 8, has been shown to increase locality in cases where dense sub-blocks exist in the sparse matrices [Baskaran and Bordawekar (2008); Hugues and Petiton (2010)], such as with the linear equations generated for the presented candidate application – effectively isolating the influence of spatial locality for both System A and System B architectures.
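A minimal sketch of the CSR layout of Fig. 7 and its matrix-vector product makes the pointer indirection visible — the inner loop reads `x[col_idx[k]]`, an irregular access pattern that CSR cannot avoid:

```python
import numpy as np

def to_csr(A):
    """Convert a dense matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))          # end of this row's nonzeros
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x with A in CSR form; note the indirect access x[col_idx[k]]."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

A = np.array([[4.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])
vals, cols, ptr = to_csr(A)
y = csr_matvec(vals, cols, ptr, np.array([1.0, 2.0, 3.0]))   # -> [7., 6., 7.]
```

BCSR2x2 instead stores dense 2x2 sub-blocks, so each fetched index serves four values rather than one — the spatial locality gain discussed above.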
5.1 Preliminaries
The computational performance of the GPU and CPU code developments and their comparative performance were studied employing two separate unstructured finite element mesh models representing resin infusion in a composite structural part configuration. For the computational performance comparisons, this resulted in two computational models composed of differing numbers of nodes with three-node triangular elements but with the same geometries (Fig. 9). The unstructured
Figure 8: Block Compressed Sparse Row with 2x2 sub-blocks (BCSR2x2) format
mesh models are labeled as follows.

• Model MA: Model consisting of 26,936 nodes and 53,148 triangular elements.
• Model MB: Model consisting of 103,196 nodes and 204,970 triangular elements.

Computational performance benchmarking for the GPU and CPU developments and resin flow infusion modeling was accomplished as follows.

• Speedup factor: The ratio of CPU execution time to GPU execution time, whereby the larger the value, the more optimal the performance obtained through the GPU.
5.2 Observed performance and analysis – System A
Executing with the CSR format, we observe that MPI provides a distinct performance boost over the serial versions for this multiple CPU/GPU computing system regardless of the input unstructured mesh. However, when using MPI in combination with the local GPU we get a less distinct benefit over MPI without employing
Figure 9: Unstructured mesh geometry that represents both MA and MB models
the local GPU. Fig. 10 shows the smaller unstructured mesh model MA and Fig. 11 shows the larger unstructured mesh model MB. Table 1 and Table 2 are the observed full solution times for the MA and MB models respectively.
Table 1: System A execution time for mesh MA
Partitions MPI + GPU Time (secs.) MPI + CPU Time (secs.) Speedup Factor
2 1,604.200 1,666.960 1.034
4 1,357.890 825.989 0.608
16 1,461.590 260.669 0.178
The unstructured mesh model MA exposes a nearly constant total solution time when using MPI and the local GPU regardless of the number of partitions, whereas MPI with no GPU displays a nearly linear decrease in this time, as shown in Fig. 10. The unstructured mesh model MB displays a slight performance boost when using MPI and the local GPU for partition counts less than 16, as the larger mesh will inexorably utilize more of the GPU computational resources to mitigate latency – a higher probability of memory address coalescing with the CUDA-enabled hardware.
Table 2: System A execution time for mesh MB
Partitions MPI + GPU Time (secs.) MPI + CPU Time (secs.) Speedup Factor
2 16,985.600 20,169.400 1.187
4 8,570.300 9,633.960 1.124
16 23,956.100 2,495.580 0.104
Figure 10: Total solution time for mesh MA with CSR compression using System
A
Figure 11: Total solution time for mesh MB with CSR compression using System
A
Figure 12: Total solution time for mesh MA with CSR compression using System
B
Figure 13: Total solution time for mesh MB with CSR compression using System
B
Figure 14: Mesh MA and MB with CSR and BCSR2x2 compression - System A
Figure 15: Mesh MA and MB with CSR and BCSR2x2 compression - System B
The dramatic decrease in performance for MPI employing the local host GPU after 4 partitions correlates well with the inefficiency of the fine grained parallelism of the GPU in cooperating with the coarse grained parallelism defined by the domain decomposition methodology as implemented by the MPI standard [Papadrakakis, Stavroulakis et al. (2011); Wang, Potluri et al. (2011)] – i.e., the GPU cannot directly access the communication buffers of the MPI calls and must cross the local CPU-GPU communication bus as defined by PCIe; this is a well documented bottleneck in GPGPU computing [Luebke (2008); Lee, Kim et al. (2010); Rehman (2010)].
Executing with the BCSR2x2 compression format, as shown in Fig. 14, spatial locality appears to be a larger factor in performance for the smaller model MA with partitions less than 16, when intra-nodal communication assumes a larger cost of execution. The larger model MB has a different behavior, with the smaller partition counts yielding lower performance and the higher partition count showing better performance – as seen in Fig. 14.
5.3 Observed performance and analysis – System B
Executing with the CSR compression format, we observe the performance of the input unstructured mesh models for this multiple CPU/GPU computing system to be the relative inverse of that seen with System A. The larger unstructured mesh model MB performs worse for the MPI and local GPU paradigm (Fig. 13) and the smaller mesh model MA performs better using MPI and the local GPU (Fig. 12). This behavior is counter-intuitive, as previous experience with GPU enhanced applications has shown increased performance benefit for problems requiring larger numbers of floating-point operations [Boggan and Pressel (2007); Baskaran and Bordawekar (2008); Grozea, Bankovic et al. (2010); Bustamam, Burrage et al. (2012)]. Table 3 and Table 4 are the observed full solution times for the MA and MB models respectively.
Table 3: System B execution time for mesh MA
Partitions MPI + GPU Time (secs.) MPI + CPU Time (secs.) Speedup Factor
2 220.462 767.364 3.481
4 98.034 448.665 4.577
16 44.392 307.081 6.917
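The speedup factors reported in Table 3 follow directly from the definition given in Section 5.1:

```python
# Speedup factor = (MPI + CPU time) / (MPI + GPU time),
# using the System B, mesh MA timings of Table 3 as (gpu, cpu) pairs.
times = {
    2: (220.462, 767.364),
    4: (98.034, 448.665),
    16: (44.392, 307.081),
}
speedup = {parts: cpu / gpu for parts, (gpu, cpu) in times.items()}
# e.g. round(speedup[16], 3) -> 6.917, matching the table
```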
Given that the input mesh models have the same parameters and geometries for System B and System A, the inverted performance of the candidate application solved with the defined input meshes using System B is likely due to faster memory
Table 4: System B execution time for mesh MB
Partitions MPI + GPU Time (secs.) MPI + CPU Time (secs.) Speedup Factor
2 2,995.870 1,675.850 0.559
4 2,172.520 872.019 0.401
16 1,490.310 318.895 0.214
I/O for the local GPU. Both System A and System B have a 384-bit wide bus, but System B uses a GDDR5 memory I/O device whereas System A employs a GDDR3 memory I/O device. GDDR5 can execute on both edges of the clock pulse and thus push the input data to the processing cores of the System B GPU at a higher rate than System A, exposing the significant latency generated as a product of the local CPU-GPU host and intra-nodal communications.
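The bandwidth difference can be quantified as peak bandwidth = (bus width / 8) × effective data rate, where GDDR5's dual-edge transfers raise the effective rate. The data rates below are illustrative assumptions, not the measured rates of the two boards:

```python
def peak_bandwidth_gb_s(bus_width_bits, effective_rate_gt_s):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfer rate."""
    return bus_width_bits / 8 * effective_rate_gt_s

# Both GPUs use a 384-bit bus; the effective data rates are hypothetical.
gddr3_bw = peak_bandwidth_gb_s(384, 1.5)   # assumed 1.5 GT/s -> 72.0 GB/s
gddr5_bw = peak_bandwidth_gb_s(384, 3.0)   # assumed 3.0 GT/s -> 144.0 GB/s
```

Under these assumed rates the GDDR5 part feeds the cores twice as fast on the same bus width, which is the mechanism invoked above.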
Executing the candidate application using the BCSR2x2 compression format, as shown in Fig. 15, spatial locality appears to provide a nearly consistent advantage with the smaller input unstructured mesh model MA – regardless of the number of partitions. The larger mesh model MB illustrates consistent performance degradation – immune to any number of partitions used.
6 Concluding Remarks
The multiple CPU/GPU computing systems examined in this work illustrate the sometimes deleterious effects that iterative sparse matrix solvers can present when MPI and the local GPU device are combined, illustrating the critical importance of understanding the role of both hardware and software in high performance applications. In the general case, MPI that employs either the CPU or GPU as the local backend device will produce a performance benefit for either System A or System B – as shown in Fig. 10, Fig. 11 and Fig. 12, Fig. 13 respectively. However, the advantage of using MPI with the local GPU over MPI without the local GPU is not as discernible, as this depends heavily on the specific hardware and algorithmic approaches employed.
This work has illustrated that later generations of GPU devices, such as that of System B, have mitigated much of the spatial locality issues that plagued earlier devices executing sparse matrix-vector operations with the necessary data compression, but have revealed other concerns, as shown in Fig. 12. The faster memory I/O employed by System B increases the potential process throughput, the central paradigm of latency mitigation for GPU devices, but can create paradoxical situations in which increased computational loads degrade application performance. Software/algorithmic factors can mitigate poor application performance to a degree, as with increasing spatial locality via a compression format that loads nearby elements of the matrix into on-board GPU device registers, as is the case with the 2x2 blocks of the BCSR2x2 compression.
Critical to the optimal performance of computationally intensive applications on multiple CPU/GPU computing systems is an understanding of both the underlying architecture and the application to be executed.
Acknowledgement: The support in part by contracts/grants from the Office of Naval Research and Clarkson Aerospace is acknowledged.
References
Baskaran, M. M.; Bordawekar, R. (2008): Optimizing Sparse Matrix-Vector Multiplication on GPUs. IBM Research, pp. 11.
Boggan, S. K.; Pressel, D. M. (2007): GPUs: An Emerging Platform for General-Purpose Computation. Aberdeen Proving Ground, MD, USA, Army Research Lab, pp. 50.
Buatois, L.; Caumon, G.; Levy, B. (2009): Concurrent number cruncher: a GPU implementation of a general sparse linear solver. International Journal of Parallel, Emergent and Distributed Systems, vol. 24, no. 3, pp. 18.
Buck, I.; Foley, T.; Horn, D.; Sugerman, J.; Fatahalian, K.; Houston, M.; Hanrahan, P. (2004): Brook for GPUs: Stream Computing on Graphics Hardware. ACM Transactions on Graphics - Proceedings of ACM SIGGRAPH, vol. 23, no. 3, pp. 777-786.
Bustamam, A.; Burrage, K.; Hamilton, N. A. (2012): Fast Parallel Markov Clustering in Bioinformatics Using Massively Parallel Computing on GPU with CUDA and ELLPACK-R Sparse Format. IEEE/ACM Trans Comput Biol Bioinform., vol. 3, no. 9, pp. 13.
Corrigan, A.; Camelli, F. F.; Lohner, R.; Wallin, J. (2011): Running unstructured grid-based CFD solvers on modern graphics hardware. International Journal for Numerical Methods in Fluids, vol. 66, no. 2, pp. 221-229.
Fatahalian, K.; Houston, M. (2008): GPUs: A Closer Look. ACM Queue, vol. 6, no. 2, pp. 10.
Garland, M.; Le Grand, S.; Nickolls, J.; Anderson, J.; Hardwick, J.; Morton, S.; Phillips, E.; Zhang, Y.; Volkov, V. (2008): Parallel Computing Experiences with CUDA. Micro, IEEE, vol. 28, no. 4, pp. 13-27.
Grozea, C.; Bankovic, Z.; Laskov, P. (2010): FPGA vs. Multi-Core CPUs vs. GPUs: Hands-on Experience with a Sorting Application. Facing the multicore-challenge, Springer-Verlag, Berlin, Heidelberg, pp. 105-117.
Hamada, T.; Narumi, T.; Yasuoka, K.; Nitadori, K.; Taiji, M. (2009): 42 TFlops Hierarchical N-body Simulations on GPUs with Applications in both Astrophysics and Turbulence. The International Conference for High Performance Computing, Networking, Storage, and Analysis, ACM New York, NY, USA.
Haney, R. H. (2006): Study and Evaluation of Domain Decomposition Approaches in two Parallel Software Code Developments for Process Flow Modeling in Liquid Composite Molding. Master of Science, North Carolina A&T State University.
Helfenstein, R.; Koko, J. (2011): Parallel preconditioned conjugate gradient algorithm on GPU. Journal of Computational and Applied Mathematics, vol. 236, no. 15, pp. 6.
Hugues, M. R.; Petiton, S. G. (2010): Sparse Matrix Formats Evaluation and Optimization on a GPU. IEEE Computer Society, Washington, DC, USA.
Karunadasa, N. P.; Ranasinghe, D. N. (2009): Accelerating High Performance Applications with CUDA and MPI. 2009 International Conference on Industrial and Information Systems (ICIIS), Sri Lanka, IEEE, pp. 331-336.
Kim, J.; Seo, S.; Lee, J.; Nah, J.; Jo, G.; Lee, J. (2012): SnuCL: An OpenCL Framework for Heterogeneous CPU/GPU Clusters. International Conference on Supercomputing, ACM New York, NY, USA.
Kuznik, F.; Obrecht, C.; Rusaouen, G.; Roux, J.-J. (2010): LBM based flow simulation using GPU computing processor. Computers & Mathematics with Applications, vol. 59, no. 7, pp. 12.
Lee, S.; Vetter, J. S. (2012): Early Evaluation of Directive-Based GPU Programming Models for Productive Exascale Computing. The International Conference for High Performance Computing, Networking, Storage, and Analysis, IEEE Computer Society Press, Los Alamitos, CA, USA.
Lee, V. W.; Kim, C.; Chhugani, J.; Deisher, M.; Kim, D.; Nguyen, A. D.; Satish, N.; Smelyanskiy, M.; Chennupaty, S.; Hammarlund, P.; Singhal, R.; Dubey, P. (2010): Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU.
Luebke, D. (2008): CUDA: Scalable parallel programming for high-performance scientific computing. Biomedical Imaging: From Nano to Macro, 2008, Paris, ACM New York, NY, USA.
Luebke, D.; Humphreys, G. (2007): How GPUs Work. IEEE Computer, 2007.
Mohan, D. R.; Shires, D.; Mark, A. (2001): Scalable Large Scale Process Modeling and Simulations in Liquid Composite Molding. Computational Science - ICCS 2001, Springer Berlin Heidelberg, pp. 1199-1208.
Mohan, R. V.; Ngo, N. D.; Tamma, K. K.; Fickie, K. D. (1996): On a Pure Finite-Element-Based Methodology for Resin Transfer Mold Filling Simulations. Army Research Lab, pp. 18.
Nvidia (2007): CUDA CUBLAS Library: Version 1.1. Nvidia Corporation, pp. 84.
Nvidia (2012): NVIDIA CUDA C Programming Guide: Version 4.2. Nvidia Corporation, pp. 173.
Papadrakakis, M.; Stavroulakis, G.; Karatarakis, A. (2011): A new era in scientific computing: Domain decomposition methods in hybrid CPU–GPU architectures. Computer Methods in Applied Mechanics and Engineering, vol. 200, no. 13-16, pp. 1490-1508.
Patterson, D. A.; Hennessy, J. L. (1998): Computer Organization & Design: The Hardware/Software Interface. San Francisco, California, USA, Morgan Kaufmann Publishers, Inc.
Rehman, M. S. (2010): Exploring Irregular Memory Access Applications on the GPU. Master of Science, International Institute of Information Technology.
Rumpf, M.; Strzodka, R. (2005): Graphics Processor Units: New Prospects for Parallel Computing. Numerical Solution of Partial Differential Equations on Parallel Computers, Springer, vol. 51, pp. 89-134.
Shewchuk, J. R. (1994): An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Pittsburgh, PA, USA, Carnegie Mellon University, pp. 64.
Silberschatz, A.; Galvin, P.; Gagne, G. (2003): Applied Operating System Concepts: Windows XP update. John Wiley & Sons, Inc.
Song, F.; Dongarra, J. (2012): A Scalable Framework for Heterogeneous GPU-Based Clusters. ACM Symposium on Parallel Algorithms and Architectures, ACM New York, NY, USA.
Wang, H.; Potluri, S.; Luo, M.; Singh, A. K.; Sur, S.; Panda, D. K. (2011): MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Computer Science - Research and Development, vol. 26, no. 3-4, pp. 257-266.
Wang, L.; Huang, M.; Narayana, V.; El-Ghazawi, T. A. (2011): Scaling Scientific Applications on Clusters of Hybrid Multicore/GPU Nodes. Computing Frontiers Conference, ACM New York, NY, USA.
Wilkinson, B.; Allen, M. (2005): Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Upper Saddle River, NJ, USA, Pearson Education, Inc.

More Related Content

What's hot

A multi objective hybrid aco-pso optimization algorithm for virtual machine p...
A multi objective hybrid aco-pso optimization algorithm for virtual machine p...A multi objective hybrid aco-pso optimization algorithm for virtual machine p...
A multi objective hybrid aco-pso optimization algorithm for virtual machine p...eSAT Publishing House
 
A novel methodology for task distribution
A novel methodology for task distributionA novel methodology for task distribution
A novel methodology for task distributionijesajournal
 
Effective Sparse Matrix Representation for the GPU Architectures
 Effective Sparse Matrix Representation for the GPU Architectures Effective Sparse Matrix Representation for the GPU Architectures
Effective Sparse Matrix Representation for the GPU ArchitecturesIJCSEA Journal
 
High Dimensionality Structures Selection for Efficient Economic Big data usin...
High Dimensionality Structures Selection for Efficient Economic Big data usin...High Dimensionality Structures Selection for Efficient Economic Big data usin...
High Dimensionality Structures Selection for Efficient Economic Big data usin...IRJET Journal
 
Hybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
Hybrid Task Scheduling Approach using Gravitational and ACO Search AlgorithmHybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
Hybrid Task Scheduling Approach using Gravitational and ACO Search AlgorithmIRJET Journal
 
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...ijcsit
 
Service Request Scheduling in Cloud Computing using Meta-Heuristic Technique:...
Service Request Scheduling in Cloud Computing using Meta-Heuristic Technique:...Service Request Scheduling in Cloud Computing using Meta-Heuristic Technique:...
Service Request Scheduling in Cloud Computing using Meta-Heuristic Technique:...IRJET Journal
 
Improved Max-Min Scheduling Algorithm
Improved Max-Min Scheduling AlgorithmImproved Max-Min Scheduling Algorithm
Improved Max-Min Scheduling Algorithmiosrjce
 
An enhanced adaptive scoring job scheduling algorithm with replication strate...
An enhanced adaptive scoring job scheduling algorithm with replication strate...An enhanced adaptive scoring job scheduling algorithm with replication strate...
An enhanced adaptive scoring job scheduling algorithm with replication strate...eSAT Publishing House
 
Transformation and dynamic visualization of images from computer through an F...
Transformation and dynamic visualization of images from computer through an F...Transformation and dynamic visualization of images from computer through an F...
Transformation and dynamic visualization of images from computer through an F...TELKOMNIKA JOURNAL
 
A Modified GA-based Workflow Scheduling for Cloud Computing Environment
A Modified GA-based Workflow Scheduling for Cloud Computing EnvironmentA Modified GA-based Workflow Scheduling for Cloud Computing Environment
A Modified GA-based Workflow Scheduling for Cloud Computing EnvironmentIJCSIS Research Publications
 
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...ijgca
 

What's hot (18)

A multi objective hybrid aco-pso optimization algorithm for virtual machine p...
A multi objective hybrid aco-pso optimization algorithm for virtual machine p...A multi objective hybrid aco-pso optimization algorithm for virtual machine p...
A multi objective hybrid aco-pso optimization algorithm for virtual machine p...
 
A novel methodology for task distribution
A novel methodology for task distributionA novel methodology for task distribution
A novel methodology for task distribution
 
Effective Sparse Matrix Representation for the GPU Architectures
 Effective Sparse Matrix Representation for the GPU Architectures Effective Sparse Matrix Representation for the GPU Architectures
Effective Sparse Matrix Representation for the GPU Architectures
 
High Dimensionality Structures Selection for Efficient Economic Big data usin...
High Dimensionality Structures Selection for Efficient Economic Big data usin...High Dimensionality Structures Selection for Efficient Economic Big data usin...
High Dimensionality Structures Selection for Efficient Economic Big data usin...
 
Hybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
Hybrid Task Scheduling Approach using Gravitational and ACO Search AlgorithmHybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
Hybrid Task Scheduling Approach using Gravitational and ACO Search Algorithm
 
Omega
OmegaOmega
Omega
 
GRID COMPUTING
GRID COMPUTINGGRID COMPUTING
GRID COMPUTING
 
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...
 
Service Request Scheduling in Cloud Computing using Meta-Heuristic Technique:...
Service Request Scheduling in Cloud Computing using Meta-Heuristic Technique:...Service Request Scheduling in Cloud Computing using Meta-Heuristic Technique:...
Service Request Scheduling in Cloud Computing using Meta-Heuristic Technique:...
 
Improved Max-Min Scheduling Algorithm
Improved Max-Min Scheduling AlgorithmImproved Max-Min Scheduling Algorithm
Improved Max-Min Scheduling Algorithm
 
IMQA Paper
IMQA PaperIMQA Paper
IMQA Paper
 
An enhanced adaptive scoring job scheduling algorithm with replication strate...
An enhanced adaptive scoring job scheduling algorithm with replication strate...An enhanced adaptive scoring job scheduling algorithm with replication strate...
An enhanced adaptive scoring job scheduling algorithm with replication strate...
 
Transformation and dynamic visualization of images from computer through an F...
Transformation and dynamic visualization of images from computer through an F...Transformation and dynamic visualization of images from computer through an F...
Transformation and dynamic visualization of images from computer through an F...
 
A Modified GA-based Workflow Scheduling for Cloud Computing Environment
A Modified GA-based Workflow Scheduling for Cloud Computing EnvironmentA Modified GA-based Workflow Scheduling for Cloud Computing Environment
A Modified GA-based Workflow Scheduling for Cloud Computing Environment
 
20120140505010
2012014050501020120140505010
20120140505010
 
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...
 
F017423643
F017423643F017423643
F017423643
 
Accelerated seam carving using cuda
Accelerated seam carving using cudaAccelerated seam carving using cuda
Accelerated seam carving using cuda
 

Viewers also liked

Viewers also liked (14)

V3I8-0460
V3I8-0460V3I8-0460
V3I8-0460
 
ARL-TR-7315
ARL-TR-7315ARL-TR-7315
ARL-TR-7315
 
Profusion Consult Ltd
Profusion Consult LtdProfusion Consult Ltd
Profusion Consult Ltd
 
JAVA_STEP_V7
JAVA_STEP_V7JAVA_STEP_V7
JAVA_STEP_V7
 
ADA613693
ADA613693ADA613693
ADA613693
 
AWE EU 2016 Major Announcements
AWE EU 2016 Major AnnouncementsAWE EU 2016 Major Announcements
AWE EU 2016 Major Announcements
 
Theo Lagendijk (Platipus) AR in Education
Theo Lagendijk (Platipus) AR in EducationTheo Lagendijk (Platipus) AR in Education
Theo Lagendijk (Platipus) AR in Education
 
Visualizing STEP Files
Visualizing STEP FilesVisualizing STEP Files
Visualizing STEP Files
 
estandarizacion de soluciones y titulacion
 estandarizacion de soluciones y titulacion estandarizacion de soluciones y titulacion
estandarizacion de soluciones y titulacion
 
Daniel Sproll (realities.io) Explore What Was Out of Reach - The Future of VR...
Daniel Sproll (realities.io) Explore What Was Out of Reach - The Future of VR...Daniel Sproll (realities.io) Explore What Was Out of Reach - The Future of VR...
Daniel Sproll (realities.io) Explore What Was Out of Reach - The Future of VR...
 
Carbohidratos
CarbohidratosCarbohidratos
Carbohidratos
 
Gerben Harmsen (TWINKLS) Augmenting Your Value Chain
Gerben Harmsen (TWINKLS) Augmenting Your Value ChainGerben Harmsen (TWINKLS) Augmenting Your Value Chain
Gerben Harmsen (TWINKLS) Augmenting Your Value Chain
 
Lessons from Canada’s First Open Data Exchange
Lessons from Canada’s First Open Data ExchangeLessons from Canada’s First Open Data Exchange
Lessons from Canada’s First Open Data Exchange
 
Bebidas alcoholicas y no alcoholicas
Bebidas alcoholicas y no alcoholicasBebidas alcoholicas y no alcoholicas
Bebidas alcoholicas y no alcoholicas
 

Similar to CMES201308262603_16563

Hybrid Multicore Computing : NOTES
Hybrid Multicore Computing : NOTESHybrid Multicore Computing : NOTES
Hybrid Multicore Computing : NOTESSubhajit Sahu
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Editor IJARCET
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Editor IJARCET
 
Heterogeneous computing with graphical processing unit: improvised back-prop...
Heterogeneous computing with graphical processing unit:  improvised back-prop...Heterogeneous computing with graphical processing unit:  improvised back-prop...
Heterogeneous computing with graphical processing unit: improvised back-prop...IJECEIAES
 
4213ijaia02
4213ijaia024213ijaia02
4213ijaia02ijaia
 
NASA_EPSCoR_poster_2015
NASA_EPSCoR_poster_2015NASA_EPSCoR_poster_2015
NASA_EPSCoR_poster_2015Longyin Cui
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...ijesajournal
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...ijesajournal
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...ijesajournal
 
OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC
 
Accelerating S3D A GPGPU Case Study
Accelerating S3D  A GPGPU Case StudyAccelerating S3D  A GPGPU Case Study
Accelerating S3D A GPGPU Case StudyMartha Brown
 
Performance comparison of row per slave and rows set
Performance comparison of row per slave and rows setPerformance comparison of row per slave and rows set
Performance comparison of row per slave and rows seteSAT Publishing House
 
Run time dynamic partial reconfiguration using microblaze soft core processor...
Run time dynamic partial reconfiguration using microblaze soft core processor...Run time dynamic partial reconfiguration using microblaze soft core processor...
Run time dynamic partial reconfiguration using microblaze soft core processor...eSAT Journals
 
Run time dynamic partial reconfiguration using
Run time dynamic partial reconfiguration usingRun time dynamic partial reconfiguration using
Run time dynamic partial reconfiguration usingeSAT Publishing House
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Enhance similarity searching algorithm with optimized fast population count m...
Enhance similarity searching algorithm with optimized fast population count m...Enhance similarity searching algorithm with optimized fast population count m...
Enhance similarity searching algorithm with optimized fast population count m...IOSR Journals
 
Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...eSAT Journals
 
Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...
Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...
Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...IJECEIAES
 

Similar to CMES201308262603_16563 (20)

Hybrid Multicore Computing : NOTES
Hybrid Multicore Computing : NOTESHybrid Multicore Computing : NOTES
Hybrid Multicore Computing : NOTES
 
1605.08695.pdf
1605.08695.pdf1605.08695.pdf
1605.08695.pdf
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045
 
Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045Volume 2-issue-6-2040-2045
Volume 2-issue-6-2040-2045
 
Heterogeneous computing with graphical processing unit: improvised back-prop...
Heterogeneous computing with graphical processing unit:  improvised back-prop...Heterogeneous computing with graphical processing unit:  improvised back-prop...
Heterogeneous computing with graphical processing unit: improvised back-prop...
 
4213ijaia02
4213ijaia024213ijaia02
4213ijaia02
 
NASA_EPSCoR_poster_2015
NASA_EPSCoR_poster_2015NASA_EPSCoR_poster_2015
NASA_EPSCoR_poster_2015
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
 
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
A NOVEL METHODOLOGY FOR TASK DISTRIBUTION IN HETEROGENEOUS RECONFIGURABLE COM...
 
cuTau Leaping
cuTau LeapingcuTau Leaping
cuTau Leaping
 
OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021OpenACC Monthly Highlights: September 2021
OpenACC Monthly Highlights: September 2021
 
Accelerating S3D A GPGPU Case Study
Accelerating S3D  A GPGPU Case StudyAccelerating S3D  A GPGPU Case Study
Accelerating S3D A GPGPU Case Study
 
Performance comparison of row per slave and rows set
Performance comparison of row per slave and rows setPerformance comparison of row per slave and rows set
Performance comparison of row per slave and rows set
 
Run time dynamic partial reconfiguration using microblaze soft core processor...
Run time dynamic partial reconfiguration using microblaze soft core processor...Run time dynamic partial reconfiguration using microblaze soft core processor...
Run time dynamic partial reconfiguration using microblaze soft core processor...
 
Run time dynamic partial reconfiguration using
Run time dynamic partial reconfiguration usingRun time dynamic partial reconfiguration using
Run time dynamic partial reconfiguration using
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Enhance similarity searching algorithm with optimized fast population count m...
Enhance similarity searching algorithm with optimized fast population count m...Enhance similarity searching algorithm with optimized fast population count m...
Enhance similarity searching algorithm with optimized fast population count m...
 
Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...
 
Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...
Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...
Comparative Study of Neural Networks Algorithms for Cloud Computing CPU Sched...
 

CMES201308262603_16563

  • 1. CMES Galley Proof Only Please Return in 48 Hours. Proof Copyright © 2013 Tech Science Press CMES, vol.0, no.0, pp.1-22, 20131 Comparison and Performance Analysis of Multiple2 CPU/GPU Computing Systems – Resin Infusion Flow3 Modeling Application4 R.H. Haney1 and R.V. Mohan25 Abstract: The use of Graphics Processing Units (GPUs) as co-processors for s-6 ingle CPU/GPU computing systems has become pronounced in high performance7 computing research, however the solution of truly large scale computationally in-8 tensive problems require the utilization of multiple computing nodes. Multiple9 CPU/GPU computing systems bring new complexities to the observed performance10 of computationally intensive applications, the more salient of which is the cost of11 local CPU-GPU host and intra-nodal communication. This paper compares and an-12 alyzes the performance of a computationally intensive application represented by13 resin infusion flow during liquid composite molding process for the manufacture14 of structural composites application via two distinct multiple CPU/GPU comput-15 ing system architectures. Resin flow infusion modeling during liquid composite16 molding process is the engineering application of interest in the present study. The17 global domain is partitioned into a series of sub-domains each of which is solved18 at the local host and reassembled for the final solution as per the domain decom-19 position methodology. The candidate application, as with many scientific and en-20 gineering applications, uses the Finite Element Method (FEM) to computationally21 model the governing physics based mass and momentum conversation equations.22 FEM discretization results in large sparse linear equation systems that are solved23 iteratively for this class of free surface, moving boundary value problem. 
Com-24 putational analysis software for the GPU environment has been developed using25 CUDA API for the iterative linear equation system solver based on the precondi-26 tioned conjugate gradient method for the solution of linear system of equations.27 These linear equation systems are solved multiple times with intra-nodal commu-28 nication utilized, resulting in the converged global system. The interplay of local29 host CPU/GPU and intra-nodal communication creates mixed performance results30 for the presented candidate application. The software/hardware factors that affec-31 t performance for each architecture are examined and discussed in this paper –32 1 North Carolina A&T State University, Greensboro, NC, U.S.A. 2 North Carolina A&T State University, Greensboro, NC, U.S.A, For Correspondence.
  • 2. CMES Galley Proof Only Please Return in 48 Hours. Proof 2 Copyright © 2013 Tech Science Press CMES, vol.0, no.0, pp.1-22, 2013 understanding how the presented candidate application’s observed performance is33 effected by both the individual multiple CPU/GPU computing system architecture34 and algorithmic/software design is critical to optimize many modern high perfor-35 mance applications which employee the GPU as a hardware accelerator.36 Keywords: resin flow infusion modeling, sparse matrix, performance, multiple37 GPUs.38 1 Introduction39 The Graphics Processing Unit (GPU) is inexorably becoming common place for the40 acceleration of computationally intensive applications for single CPU/GPU com-41 puting systems [Hamada, Narumi et al. (2009); Kuznik, Obrecht et al. (2010);42 Corrigan, Camelli et al. (2011); Bustamam, Burrage et al. (2012)] as ever larger43 computing challenges seek to leverage the dramatic computational power expressed44 with the modern GPU device [Fatahalian and Houston (2008); Garland, Le Grand45 et al. (2008); Grozea, Bankovic et al. (2010); Bustamam, Burrage et al. (2012)].46 However, truly large scale high performance computing applications necessitate47 the use of multiple computing nodes. The execution of computationally intensive48 applications within the context of multiple CPU/GPU computing systems has been49 shown with varying degrees of success [Karunadasa and Ranasinghe (2009); Corri-50 gan, Camelli et al. (2011); Kim, Seo et al. (2012); Lee and Vetter (2012)]. This pa-51 per examines the performance of computational analysis software for resin infusion52 flow modeling during liquid composite molding process for structural composites53 via two distinct multiple CPU/GPU computing systems – exposing the effects of54 different hardware and software influencing factors.55 Liquid Composite Molding (LCM) is a popular process used for the manufacture of56 polymer composite structures. 
The process involves injection of a polymeric resin57 inside a mold cavity filled with an oriented, woven dry fiber preform [Mohan, Ngo58 et al. (1996); Mohan, Shires et al. (2001)]. The resin infusion flow modeling and59 simulation of the LCM process involves the isothermal process flow solution based60 on the conservation of resin mass as the governing equation in FEM computational61 developments. The governing coupled mass and momentum conservation equa-62 tion is discretized and the fill factors and pressure values are solved in an iterative63 manner [Mohan, Ngo et al. (1996); Mohan, Shires et al. (2001)]. The computa-64 tionally intensive portion of the analysis code is from the multiple calls to the linear65 equation system solver that is solved using the Preconditioned Conjugate Gradient66 (PCG) method.67 The PCG iterative linear equation system solver consists of a set of matrix-vector68 operations [Shewchuk (1994)] that map well to the natural matrix structure of the69
  • 3. CMES Galley Proof Only Please Return in 48 Hours. Proof Resin Infusion Flow Modeling Application 3 GPU using Nvidia Compute Unified Device Architecture (CUDA) API [Nvidia70 (2012)]. The performance results of two different multiple CPU/GPU computing71 systems are examined and discussed, offering possible avenues of further investiga-72 tion for both rapid and robust code developments using the GPU that can be applied73 for the candidate application, and translate well to other legacy FEM based codes74 as well. The present paper is organized as follows.75 The GPU and its evolution are briefly discussed in the first section, as well as the76 details of the CPU and GPU hardware employed in the present work. The next77 section provides a description of the candidate application and includes the details78 of the physical problem, model equations, implementation, and the computation79 solution algorithm involved. A description of the GPU implementation and devel-80 opment for the resin flow infusion modeling is presented in the following section.81 The final section examines and discusses the computational performance of the82 presented candidate application providing analysis, comparisons, and conclusions.83 2 Graphical Processing Units84 The modern GPU consists of heavy concentration of transistors in the Arithmetic85 Logic Units (ALUs) and wide data bus [Boggan and Pressel (2007); Luebke and86 Humphreys (2007); Fatahalian and Houston (2008)]. The memory architecture has87 remained relatively simple to facilitate quicker access to input data. Unlike the CPU88 [Rumpf and Strzodka (2005); Luebke and Humphreys (2007)], GPU separates the89 configuration/instruction from the data to be operated on by the processor’s large90 number of ALUs.91 2.1 GPU evolution92 The gaming industry had been a key driving force for the improvement in the com-93 puting power of GPUs over the years [Buck, Foley et al. (2004); Boggan and94 Pressel (2007); Luebke and Humphreys (2007)]. 
Images are rendered using matrix operations applied simultaneously to large sets of data. Graphics hardware receives the instruction/configuration prior to the influx of the data to be operated on, a natural Single Instruction Multiple Data (SIMD) environment seen in many parallel systems [Wilkinson and Allen (2005); Fatahalian and Houston (2008)]. Newer and more detailed games required faster processing of matrix operations as well as a wider data bus to enable rapid execution. Transistors were concentrated for floating point operations to facilitate the large number of matrix operations [Boggan and Pressel (2007)] inherent in graphical processing hardware. The CPU, on the other hand, needed to maintain flexibility [Patterson and Hennessy (1998); Silberschatz, Galvin et al. (2003)]. Both CPU and GPU must deal with latency, and do so in distinctly different ways based on their architecture. The CPU
maintains a complex memory hierarchy to execute multiple processes in quick succession to deal with latency [Patterson and Hennessy (1998); Silberschatz, Galvin et al. (2003); Boggan and Pressel (2007)]. The GPU, on the other hand, deals with latency by executing large data sets with the same configuration/instruction, which is set up prior to any data processing, with a wide data bus and transistors concentrated in the arithmetic region. Fig. 1 illustrates typical CPU and GPU transistor distribution. The evolution of GPU architecture has resulted in a computational workhorse optimized for data throughput.

Figure 1: Distribution of transistors for CPU and GPU hardware devices

Initially graphics hardware operated in a fixed pipeline and any programming was low level and very complex [Boggan and Pressel (2007)]. Thus the ability to access the GPU in any realistic way to leverage its computational power was limited. The development of the flexible pipeline as well as high level language constructs paved the way for the use of the GPU as a co-processor [Boggan and Pressel (2007); Luebke and Humphreys (2007)]. GPU software paradigms are further enabling their use as computational processors in computationally intensive engineering analysis applications such as the one employed in the present work.

2.2 CPU and GPU architectures

In this work, we utilize two distinct multiple CPU/GPU computing systems, each consisting of multiple nodes composed of a single commodity GPU and CPU processor, for software development and comparative performance analysis. These two multiple CPU/GPU computing systems are denoted as System A and System B and are defined as follows.

1.
System A: The CPU and GPU device architectures for this multiple CPU/GPU computing system can be viewed as the lower-grade of the two presented in this work. Each node in the multiple CPU/GPU computing system contains the following technical specifications.
• CPU: A dual-core Opteron processor with a clock frequency of 2.8 GHz, 64-bit DDR2, an L1 cache of 256 KB, an L2 cache of 2 MB and a 1000 MHz Front Side Bus.

• GPU: Nvidia Quadro FX5600 with 1.5 GB global memory, CUDA compute architecture 1.0, 128 cores, 8192 registers, 16 shared memory banks, a clock frequency of 1.35 GHz and GDDR3 SDRAM using a 384-bit wide bus.

2. System B: The CPU and GPU device architectures for this multiple CPU/GPU computing system can be viewed as the higher-grade of the two presented in this work. Each node in the multiple CPU/GPU computing system contains the following technical specifications.

• CPU: A 6-core Intel X5650 processor with a clock frequency of 2.6 GHz, 64-bit DDR3, an L1 cache of 384 KB, an L2 cache of 1.5 MB, an L3 cache of 12 MB and a 1300 MHz Front Side Bus.

• GPU: Nvidia Tesla M2070 with 6 GB global memory, CUDA compute architecture 2.0, 448 cores, 32768 registers, 32 shared memory banks, a clock frequency of 1.15 GHz and GDDR5 SDRAM using a 384-bit wide bus.

The next section provides a description of the candidate application and includes the details of the physical problem, model equations, implementation, and the computational solution algorithm involved.

3 Candidate engineering analysis application

Resin infusion flow modeling in the liquid composite manufacturing process for the manufacture of woven fabric structural composite parts is the candidate engineering analysis application for the computational code development and GPU/CPU performance analysis employed in the present study. Resin flow infusion modeling involves the tracking of the progressing resin from the injection gates through the woven fabric filled Eulerian mold cavity mesh geometry.
For thin shell composite structural configurations employing a 2D flow representation and thickness averaged 2D flow field, three-node triangular finite elements are used to discretize the mold cavity domain [Mohan, Ngo et al. (1996); Mohan, Shires et al. (2001)]. The finite element discretization for the computational problem involves the Galerkin weighted residual finite element formulation of the resin mass conservation equation in conjunction with the Darcian porous media flow field representing the momentum conservation, and an implicit methodology for the transient solution [Mohan, Ngo et al. (1996); Mohan, Shires et al. (2001)].
3.1 Resin infusion flow modeling

Following the discussions in [Mohan, Shires et al. (2001); Mohan, Ngo et al. (1996)], the resin flow through the fiber preform contained within the mold cavity is represented by a transient mass conservation equation. The physical mass conservation equation (formed by coupling the mass conservation equation with the momentum equation via the Darcian velocity field) is given by Eq. (1), with \bar{K} the permeability tensor, \mu the resin viscosity, P the pressure field, and \Psi the state variable representing the resin flow infusion state of the mold cavity region.

\frac{\partial}{\partial t}\int_{\Omega} \Psi \, d\Omega \equiv \int_{\Omega} \nabla \cdot \left( \frac{\bar{K}}{\mu}\,\nabla P \right) d\Omega   (1)

Further details of this governing equation are available in [Mohan, Ngo et al. (1996)] and [Mohan, Shires et al. (2001)]. The value of the state variable \Psi is 0 in the regions of the mold where the resin has not infused the fiber preform and 1 where the resin has completely infused the fiber preform in any given region of the Eulerian mold cavity domain \Omega in the FEM computations.

As discussed in [Mohan, Ngo et al. (1996)] and [Mohan, Shires et al. (2001)], the application of the Galerkin weighted residual formulation, approximating the pressure P and fill factor \Psi with appropriate elements to define the mold cavity resin infusion geometry and associated shape functions, yields a semi-discrete system of equations given by Eq. (2), with C the lumped mass matrix, K the stiffness matrix, q the load vector, and \dot{\Psi} the time derivative.

C\dot{\Psi} + KP = q   (2)

The transient semi-discrete equation is then solved by introducing the finite difference approximation given by Eq. (3) for the time derivative term.

\dot{\Psi} = \frac{\Psi^{n+1} - \Psi^{n}}{\Delta t}   (3)

3.2 Resin flow infusion modeling strategy

The semi-discrete Eq. (2) can be reduced to Eq. (4), as discussed in [Mohan, Ngo et al. (1996)] and [Mohan, Shires et al. (2001)].
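As a brief intermediate step (added here for clarity, not present in the original derivation), substituting the finite difference approximation of Eq. (3) into the semi-discrete system of Eq. (2) and multiplying through by \Delta t recovers the discretized form of Eq. (4), with the lumped (diagonal) mass matrix written in indicial form:

```latex
C\,\frac{\Psi^{n+1}-\Psi^{n}}{\Delta t} + KP = q
\quad\Longrightarrow\quad
C_{ii}\Psi_i^{n+1} - C_{ii}\Psi_i^{n} + \Delta t\,K_{ij}P_j = \Delta t\,q_i
```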
C_{ii}\Psi_i^{n+1} - C_{ii}\Psi_i^{n} + \Delta t\,K_{ij}P_j = \Delta t\,q_i   (4)

The above form of the discretized equation is solved during the resin infusion flow modeling at each time-step of the transient resin progression. Equation (4)
defines the implicit form of the process flow modeling in LCM, as shown in references [Mohan, Ngo et al. (1996)] and [Mohan, Shires et al. (2001)]. The generalized algorithm for the finite element computations for the process flow modeling and analysis of the resin progression in the transient, free surface flow problem for each time step is summarized in Algorithm 1. The finite element based solution strategy employed results in a system of linear equations that is solved using an iterative pre-conditioned conjugate gradient method involving matrix-vector and dot product operations.

Algorithm 1: Implicit pure FE methodology for resin infusion flow modeling computation (for progression from time step n to time step n+1 and iteration m)
1.  REPEAT
2.    SET {\Psi_i}_m^{n+1} to {\Psi_i}^n (save previous fill factor values)
3.    CALL assembleC for C_ii (assembleC forms the lumped mass matrix C)
4.    CALL assembleK for K_ij (assembleK forms the stiffness matrix K)
5.    CALL assembleLoad for q_i (assembleLoad forms the load vector q)
6.    REPEAT
7.      SET boundary conditions on K_ij
8.      SET {g_i}_m to C_ii{\Psi_i}^n - C_ii{\Psi_i}_m^{n+1} + \Delta t{q_i}_m (modified load vector g)
9.      SOLVE [\hat{K}_ij]_m {P_j}_m = {g_i}_m (where \hat{K}_ij is the K matrix with boundary conditions applied)
10.     SET C_ii{\Psi_i}_{m+1}^{n+1} = C_ii{\Psi_i}^n - \Delta t[K_ij]{P_j}_m + \Delta t{q_i}_m (compute new nodal resin fraction field using Eq. (4))
11.     IF ||C_ii{\Psi_i}_{m+1}^{n+1} - C_ii{\Psi_i}_m^{n+1}|| \le \xi THEN
12.       BREAK
13.     ELSE
14.       SET {\Psi_i}_m^{n+1} to {\Psi_i}_{m+1}^{n+1}
15.     ENDIF
16.   UNTIL mass resin convergence
17. UNTIL all nodes are filled

The interested reader is referred to [Mohan, Ngo et al. (1996)] for further details on the resin infusion flow modeling in the LCM process and the finite element discretization employed for the computational problem.
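The inner update of Algorithm 1 (step 10) and its convergence test (step 11) can be sketched compactly. The sketch below is written in Python for illustration rather than the CUDA C of the actual development; the function names, system size and all numerical values are assumptions made for the example, not the authors' code:

```python
# Illustrative sketch of Algorithm 1, steps 10-11 (not the production code).
# C is the lumped (diagonal) mass matrix stored as a vector, K the stiffness
# matrix, P the pressure solution from step 9, and q the load vector.

def update_fill_factors(C, K, P, q, psi_old, dt):
    """Step 10: C_ii * psi_new_i = C_ii * psi_old_i - dt * (K P)_i + dt * q_i."""
    n = len(C)
    KP = [sum(K[i][j] * P[j] for j in range(n)) for i in range(n)]
    return [psi_old[i] + dt * (q[i] - KP[i]) / C[i] for i in range(n)]

def mass_converged(C, psi_new, psi_old, xi):
    """Step 11: C-weighted change in fill factors at or below tolerance xi."""
    return max(abs(c * (a - b)) for c, a, b in zip(C, psi_new, psi_old)) <= xi

# Small assumed 3-node system purely for demonstration.
C = [1.0, 1.0, 1.0]
K = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
P = [1.0, 1.0, 1.0]
q = [1.5, 0.5, 1.5]
psi_old = [0.0, 0.0, 0.0]

psi_new = update_fill_factors(C, K, P, q, psi_old, dt=0.1)
print(psi_new, mass_converged(C, psi_new, psi_old, xi=1e-12))
```

In the full analysis this update repeats (steps 6-16) until the mass residual falls below the tolerance, and the outer loop (steps 1-17) advances the resin front until all nodes are filled.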
3.3 CPU and GPU resin flow infusion modeling validation

The validity of the CPU and GPU code developments for multiple computing nodes was determined by careful examination of the numerical results against the analytical solution for a simple 2D circular plate radial injection model (see Fig. 2). The analytical solutions for this simple circular plate geometry radial injection condition were determined based on the radial flow front location and injection port pressure at time step t for multiple numbers of sub-domains derived by partitioning the initial global domain and applying the equations presented in reference [Mohan, Ngo et al. (1996)]. Eq. (5) and Eq. (6) define the radial flow front and injection port pressure respectively, with Q the injection flow rate, R_0 the inner radius, \mu the resin viscosity, \phi the fiber volume fraction, \bar{K} the permeability, and H the element or mold cavity thickness.

R(t) = \sqrt{\frac{Qt}{\pi\phi H} + R_0^2}   (5)

P_0 = \frac{\mu Q}{2\pi \bar{K} H}\ln\frac{R(t)}{R_0}   (6)

Figure 2: 2D circular plate radial injection flow model
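The analytical comparison values of Eq. (5) and Eq. (6) are simple enough to sketch directly. In this illustrative Python fragment the flow rate, material and geometry values are arbitrary assumptions, not the parameters of the reported validation runs:

```python
import math

def radial_flow_front(Q, t, phi, H, R0):
    """Eq. (5): radial flow front location R(t) for constant injection rate Q."""
    return math.sqrt(Q * t / (math.pi * phi * H) + R0 ** 2)

def injection_pressure(Q, mu, K, H, Rt, R0):
    """Eq. (6): injection port pressure P0 for permeability K and viscosity mu."""
    return (mu * Q) / (2.0 * math.pi * K * H) * math.log(Rt / R0)

# Assumed illustrative values (SI units): flow rate, fiber volume fraction,
# thickness, inner radius, viscosity, permeability.
Q, phi, H, R0 = 1.0e-6, 0.5, 0.005, 0.01
mu, K = 0.1, 1.0e-9

Rt = radial_flow_front(Q, t=10.0, phi=phi, H=H, R0=R0)
P0 = injection_pressure(Q, mu, K, H, Rt, R0)
print(Rt, P0)
```

Note the limiting behavior used as a sanity check: with Q = 0 the front stays at R_0, and with R(t) = R_0 the logarithm vanishes and the injection pressure is zero.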
The radial flow front and injection port pressures for this thin shell circular plate model from the computational CPU and GPU analysis values for serial and 2-partition runs were compared to the analytical values. Fig. 3, Fig. 4, Fig. 5 and Fig. 6 clearly demonstrate the congruency of the CPU and GPU numerical solutions to the analytical solutions at each time step for serial and multiple partitions for both System A and System B. Additional computational modeling analysis comparisons for other complex composite structures also showed excellent numerical congruency between CPU and GPU analysis code execution runs.

Figure 3: Transient flow front location comparisons for serial/multiple CPU/GPU System A

Figure 4: Transient injection pressure comparisons for serial/multiple CPU/GPU System A

The mapping of the candidate application to the multiple GPU/CPU computing system environments is discussed next.
Figure 5: Transient flow front location comparisons for serial/multiple CPU/GPU System B

Figure 6: Injection pressure comparisons for serial/multiple CPU/GPU System B

4 GPU implementation and code development strategies

Finite element discretization of the transient, moving boundary value problem involves the solution of a linear system of equations multiple times during the computational analysis and is the computationally intensive portion in the resin flow infusion modeling analysis of complex composite structures. The Preconditioned Conjugate Gradient (PCG) solution methodology for the solution of the linear system of equations (line 9 of Algorithm 1) executes matrix operations, involves significant computational load, and was chosen as the key resin flow infusion modeling kernel to port to the GPU. The PCG solver is composed of sparse matrix-vector multiplication and vector operations, as shown in Algorithm 2.

The PCG solver is executed with an element-centric [Mohan, Ngo et al. (1996); Mohan, Shires et al. (2001)] rather than node-centric focus which results in a larger
ratio of computation to communication cost [Haney (2006)] – an optimal strategy to leverage the dramatic floating point power of the GPU. The sparse matrix-vector operation (lines 2 and 10 of Algorithm 2) is the largest computational cost of the PCG solver [Buatois, Caumon et al. (2009); Helfenstein and Koko (2011)] and an obvious section for porting to the GPU device. The other major computational portions of the linear system solver are vector updates and scalar dot-products, both of which are handled via CUBLAS library calls [Shewchuk (1994); Nvidia (2007)].

Algorithm 2: Preconditioned conjugate gradient (solves Ax = b)
Input: matrix A and load/force vector b
Output: solution vector x
1.  Set r_0 ⇐ b − A x_0
2.  Set z_0 ⇐ M^{-1} r_0
3.  Set p_0 ⇐ z_0
4.  Set k ⇐ 0
5.  DO UNTIL CONVERGENCE
6.    α_k ⇐ (r_k^T z_k) / (p_k^T A p_k)
7.    x_{k+1} ⇐ x_k + α_k p_k
8.    r_{k+1} ⇐ r_k − α_k A p_k
9.    IF (||r_k − r_{k+1}|| ≤ ξ) BREAK
10.   z_{k+1} ⇐ M^{-1} r_{k+1}
11.   β_k ⇐ (z_{k+1}^T r_{k+1}) / (z_k^T r_k)
12.   p_{k+1} ⇐ z_{k+1} + β_k p_k
13.   k ⇐ k + 1
14. END DO

The presented candidate application for this study utilizes domain decomposition to partition a global problem domain such that each resulting sub-domain is solved on a local CPU/GPU host – communication is handled via MPI. The interplay of the coarse-grained parallelism expressed by MPI and the fine-grained parallelism of the GPU is the source of the often disappointing performance of computationally intensive applications employing multiple CPU/GPU computing systems [Papadrakakis, Stavroulakis et al. (2011); Wang, Huang et al. (2011); Song and Dongarra (2012)].
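Algorithm 2 maps directly to code. The sketch below is an illustrative Python version with a simple Jacobi preconditioner (M = diag(A)) on a small assumed dense SPD system; the actual development instead runs this loop on the GPU in CUDA, with CUBLAS handling the vector updates and dot products:

```python
import math

def pcg(A, b, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradient (Algorithm 2), Jacobi M = diag(A)."""
    n = len(b)
    x = [0.0] * n                                        # x_0 = 0
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    r = [b[i] - Ax_i for i, Ax_i in enumerate(matvec(x))]  # r_0 = b - A x_0
    z = [r[i] / A[i][i] for i in range(n)]                 # z_0 = M^-1 r_0
    p = z[:]
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = dot(r, z) / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r_new = [r[i] - alpha * Ap[i] for i in range(n)]
        if math.sqrt(dot(r_new, r_new)) <= tol:            # convergence test
            return x
        z_new = [r_new[i] / A[i][i] for i in range(n)]
        beta = dot(z_new, r_new) / dot(z, r)
        p = [z_new[i] + beta * p[i] for i in range(n)]
        r, z = r_new, z_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # small assumed SPD example system
b = [1.0, 2.0]
x = pcg(A, b)
print(x)   # exact solution is [1/11, 7/11]
```

In the GPU port the matvec becomes the sparse matrix-vector kernel discussed below, and the per-iteration dot products force device-host synchronization, which is one reason the solver's scaling is sensitive to the CPU-GPU communication cost.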
5 Computational performance analysis and comparison

The first half of this section presents the preliminary data for the models being examined and the second half presents the observed performance using the System A and System B multiple CPU/GPU computing systems. The input unstructured mesh models are initially solved using the CSR (Compressed Sparse Row) compression format shown in Fig. 7, then the BCSR2x2 (Block Compressed Sparse Row) compression format is utilized. The CSR compression format is well documented as a potential performance problem, as increased pointer indirection can magnify the irregularity of data element access for sparse matrix-vector multiplications [Baskaran and Bordawekar (2008); Hugues and Petiton (2010)]. The BCSR2x2 compression format, shown in Fig. 8, has been shown to increase locality in cases where dense sub-blocks exist in the sparse matrices [Baskaran and Bordawekar (2008); Hugues and Petiton (2010)], such as with the linear equations generated for the presented candidate application – effectively isolating the influence of spatial locality for both the System A and System B architectures.

Figure 7: Compressed Sparse Row (CSR) format

5.1 Preliminaries

The computational performance of the GPU and CPU code developments and their comparative performance were studied employing two separate unstructured finite element mesh models representing resin infusion in a composite structural part configuration. For the computational performance comparisons, this resulted in two computational models composed of differing numbers of nodes with three-node triangular elements but with the same geometries (Fig. 9). The unstructured
mesh models are labeled as follows.

• Model MA: Model consisting of 26,936 nodes and 53,148 triangular elements.

• Model MB: Model consisting of 103,196 nodes and 204,970 triangular elements.

Figure 8: Block Compressed Sparse Row with 2x2 sub-blocks (BCSR2x2) format

Computational performance benchmarking for the GPU and CPU developments and resin flow infusion modeling was accomplished as follows.

• Speedup factor: The ratio of CPU execution time to GPU execution time, whereby the larger the value, the more optimal the performance obtained through the GPU.

5.2 Observed performance and analysis – System A

Executing with the CSR format, we observe that MPI provides a distinct performance boost over the serial versions for this multiple CPU/GPU computing system regardless of the input unstructured meshes. However, when using MPI in combination with the local GPU we get a less distinct benefit over MPI without employing
the local GPU. Fig. 10 shows the smaller unstructured mesh model MA and Fig. 11 shows the larger unstructured mesh model MB. Table 1 and Table 2 present the observed full solution times for the MA and MB models respectively.

Figure 9: Unstructured mesh geometry that represents both the MA and MB models

Table 1: System A execution time for mesh MA
Partitions | MPI + GPU Time (secs.) | MPI + CPU Time (secs.) | Speedup Factor
2          | 1,604.200              | 1,666.960              | 1.034
4          | 1,357.890              | 825.989                | 0.608
16         | 1,461.590              | 260.669                | 0.178

The unstructured mesh model MA exposes a nearly constant total solution time when using MPI and the local GPU regardless of the number of partitions, whereas MPI with no GPU displays a nearly linear decrease in this time, as shown in Fig. 10. The unstructured mesh model MB displays a slight performance boost when using MPI and the local GPU for partition counts less than 16, as the larger mesh will inexorably utilize more of the GPU computational resources to mitigate latency – a higher probability of memory address coalescing with the CUDA-enabled hardware.
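The CSR layout of Fig. 7 and its matrix-vector product can be sketched compactly. This illustrative Python version mirrors what a per-row GPU kernel computes; the example matrix is an assumption, not taken from the meshes above:

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """y = A x with A in Compressed Sparse Row (CSR) form (Fig. 7):
    row_ptr[i]:row_ptr[i+1] slices the nonzeros of row i in values/col_idx."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):  # on the GPU, one thread typically handles one row
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# Assumed example: the sparse matrix [[10, 0, 2], [0, 5, 0], [3, 0, 8]].
values  = [10.0, 2.0, 5.0, 3.0, 8.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # → [12.0, 5.0, 11.0]
```

The indirection through col_idx is exactly the irregular access pattern cited above: neighboring threads read scattered entries of x, defeating memory coalescing; BCSR2x2 trades some stored zeros for denser, more local 2x2 sub-block accesses.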
Table 2: System A execution time for mesh MB
Partitions | MPI + GPU Time (secs.) | MPI + CPU Time (secs.) | Speedup Factor
2          | 16,985.600             | 20,169.400             | 1.187
4          | 8,570.300              | 9,633.960              | 1.124
16         | 23,956.100             | 2,495.580              | 0.104

Figure 10: Total solution time for mesh MA with CSR compression using System A

Figure 11: Total solution time for mesh MB with CSR compression using System A
Figure 12: Total solution time for mesh MA with CSR compression using System B

Figure 13: Total solution time for mesh MB with CSR compression using System B
Figure 14: Mesh MA and MB with CSR and BCSR2x2 compression - System A

Figure 15: Mesh MA and MB with CSR and BCSR2x2 compression - System B
The dramatic decrease in performance for MPI employing the local host GPU after 4 partitions correlates well with the inefficiency of the fine grained parallelism of the GPU in cooperating with the coarse grained parallelism defined by the domain decomposition methodology as implemented by the MPI standard [Papadrakakis, Stavroulakis et al. (2011); Wang, Potluri et al. (2011)] – i.e., the GPU cannot directly access the communication buffers of the MPI calls and must cross the local CPU-GPU communication bus as defined by PCIe, a well documented bottleneck in GPGPU computing [Luebke (2008); Lee, Kim et al. (2010); Rehman (2010)].

Executing with the BCSR2x2 compression format, as shown in Fig. 14, spatial locality appears to be a larger factor in performance for the smaller model MA with partitions less than 16, when intra-nodal communication assumes a larger cost of execution. The larger model MB has a different behavior, with the smaller partition counts yielding lower performance and the higher partition count showing better performance – as seen in Fig. 14.

5.3 Observed performance and analysis – System B

Executing with the CSR compression format, we observe the performance of the input unstructured mesh models for this multiple CPU/GPU computing system to be the relative inverse of that seen with System A. The larger unstructured mesh model MB performs worse under the MPI and local GPU paradigm (Fig. 13) and the smaller mesh model MA performs better using MPI and the local GPU (Fig. 12). This behavior is counter-intuitive, as previous experience with GPU enhanced applications has shown increased performance benefit for problems requiring larger numbers of floating-point operations [Boggan and Pressel (2007); Baskaran and Bordawekar (2008); Grozea, Bankovic et al.
(2010); Bustamam, Burrage et al. (2012)]. Table 3 and Table 4 present the observed full solution times for the MA and MB models respectively.

Table 3: System B execution time for mesh MA
Partitions | MPI + GPU Time (secs.) | MPI + CPU Time (secs.) | Speedup Factor
2          | 220.462                | 767.364                | 3.481
4          | 98.034                 | 448.665                | 4.577
16         | 44.392                 | 307.081                | 6.917

Given that the input mesh models have the same parameters and geometries for System B and System A, the inverted performance of the candidate application solved with the defined input meshes using System B is likely due to a faster memory
I/O for the local GPU. Both System A and System B have a 384-bit wide bus, but System B uses a GDDR5 memory I/O device whereas System A employs the GDDR3 memory I/O device. The GDDR5 can execute on both edges of the clock pulse and thus push the input data to the processing cores of the System B GPU at a higher rate than System A, exposing the significant latency generated as a product of the local CPU-GPU host and intra-nodal communications.

Table 4: System B execution time for mesh MB
Partitions | MPI + GPU Time (secs.) | MPI + CPU Time (secs.) | Speedup Factor
2          | 2,995.870              | 1,675.850              | 0.559
4          | 2,172.520              | 872.019                | 0.401
16         | 1,490.310              | 318.895                | 0.214

Executing the candidate application using the BCSR2x2 compression format, as shown in Fig. 15, spatial locality appears to provide a nearly consistent advantage with the smaller input unstructured mesh model MA – regardless of the number of partitions. The larger mesh model MB illustrates consistent performance degradation – immune to any number of partitions used.

6 Concluding Remarks

The multiple CPU/GPU computing systems examined in this work illustrate the sometimes deleterious effects that iterative sparse matrix solvers can present when MPI and the local GPU device are combined, illustrating the critical importance of understanding the role of both hardware and software in high performance applications. In the general case, MPI that employs either the CPU or GPU as the local backend device will produce a performance benefit for either System A or System B – as shown in Fig. 10, Fig. 11 and Fig. 12, Fig. 13 respectively. However the
However the325 advantage of using MPI with the local GPU over MPI without the local GPU is326 not as discernible as this depends heavily on the specific hardware and algorithmic327 approaches employed.328 This work has illustrated that later generations of GPU devices, such as System329 B, have mitigated much of the spatial locality issues that plagued earlier devices330 executing sparse matrix-vector operations with necessary data compression but has331 revealed other concerns as shown by Fig. 12. The Faster memory I/O employed by332 System B increases the potential process throughput, defined as the central paradig-333 m of latency mitigation for GPU devices, but can actually create paradoxical situ-334 ations where increased computational loads can degrade application performance.335 Software/algorithmic factors can mitigate poor application performance to a de-336
  • 20. CMES Galley Proof Only Please Return in 48 Hours. Proof 20 Copyright © 2013 Tech Science Press CMES, vol.0, no.0, pp.1-22, 2013 gree, as with increasing spatial locality via a compression format that load near-by337 elements of the matrix to on-board GPU device registers as is the case with the 2x2338 blocks of BCSR2x2 compression.339 Critical to the optimal performance of computationally intensive applications for340 multiple CPU/GPU computing systems is the understanding of the underlying ar-341 chitecture and the application to execute.342 Acknowledgement: The support in part by contracts/grants from Office of Naval343 Research and Clarkson Aerospace is acknowledged.344 References345 Baskaran, M. M.; Bordawekar, R. (2008): Optimizing Sparse Matrix-Vector346 Multiplication on GPUs, IBM Research: 11.347 Boggan, S. K.; Pressel, D. M. (2007): GPUs: An Emerging Platform for General-348 Purpose Computation. Aberdeen Proving Ground, MD, USA, Army Research Lab:349 50.350 Buatois, L.; Caumon, G.; Levy, B. (2009): Concurrent number cruncher: a GPU351 implementation of a general sparse linear solver. International Journal of Parallel,352 Emergent and Distributed Systems, vol. 24, no. 3, pp. 18.353 Buck, I.; Foley, T.; Horn, D.; Sugerman, J.; Fatahalian, K.; Houston, M.;354 Hanrahan, P. (2004): Brook for GPUs: Stream Computing on Graphics Hardware.355 ACM Transactions on Graphics - Proceedings of ACM SIGGRAPH, vol. 23, no. 3,356 pp. 777-786.357 Bustamam, A.; Burrage, K.; Hamilton, N. A. (2012): Fast Parallel Markov Clus-358 tering in Bioinformatics Using Massively Parallel Computing on GPU with CUDA359 and ELLPACK-R Sparse Format. IEEE/ACM Trans Comput Biol Bioinform., vol.360 3, no. 9, pp. 13.361 Corrigan, A.; Camelli, F. F.; Lohner, R.; Wallin, J. (2011): Running unstruc-362 tured grid-based CFD solvers on modern graphics hardware. International Journal363 for Numerical Methods in Fluids, vol. 66, no. 2, pp. 221-229.364 Fatahalian, K.; Houston, M. 
(2008): GPUs: A Closer Look. ACM Queue, vol. 6, no. 2, pp. 10.

Garland, M.; Le Grand, S.; Nickolls, J.; Anderson, J.; Hardwick, J.; Morton, S.; Phillips, E.; Zhang, Y.; Volkov, V. (2008): Parallel Computing Experiences with CUDA. Micro, IEEE, vol. 28, no. 4, pp. 13-27.

Grozea, C.; Bankovic, Z.; Laskov, P. (2010): FPGA vs. Multi-Core CPUs vs. GPUs: Hands-on Experience with a Sorting Application. Facing the multicore-
  • 21. CMES Galley Proof Only Please Return in 48 Hours. Proof Resin Infusion Flow Modeling Application 21 challenge Berlin, Heidelberg, Springer-Verlag Berlin, Heidelberg pp. 105-117.372 Hamada, T.; Narumi, T.; Yasuoka, K.; Nitadori, K.; Taiji, M. (2009): 42 TFlops373 Hierarchical N-body Simulations on GPUs with Applications in both Astrophysics374 and Turbulence. The International Conference for High Performance Computing,375 Networking, Storage, and Analysis, ACM New York, NY, USA.376 Haney, R. H. (2006): Study and Evaluation of Domain Decomposition Approaches377 in two Parallel Software Code Developments for Process Flow Modeling in Liquid378 Composite Molding. Master of Science, North Carolina A & T State University.379 Helfenstein, R.; Koko, J. (2011): Parallel preconditioned conjugate gradient algo-380 rithm on GPU. Journal of Computational and Applied Mathematics, vol. 236, no.381 15, pp. 6.382 Hugues, M. R.; Petiton, S. G. (2010): Sparse Matrix Formats Evaluation and383 Optimization on a GPU, IEEE Computer Society Washington, DC, USA.384 Karunadasa, N. P.; Ranasinghe, D. N. (2009): Accelerating High Performance385 Applications with CUDA and MPI. 2009 International Conference on Industrial386 and Information Systems (ICIIS). Sri Lanka, IEEE, pp. 331-336.387 Kim, J.; Seo, S.; Lee, J.; Nah, J.; Jo, G.; Lee, J. (2012). SnuCL: An OpenCL388 Framework for Heterogeneous CPU/GPU Clusters. International Conference on389 Supercomputing, ACM New York, NY, USA.390 Kuznik, F.; Obrecht, C.; Rusaouen, G.; Roux, J.-J. (2010): LBM based flow391 simulation using GPU computing processor. Computers & Mathematics with Ap-392 plications, vol. 59, no. 7, pp. 12.393 Lee, S.; Vetter, J. S. (2012): Early Evaluation of Directive-Based GPU Program-394 ming Models for Productive Exascale Computing. The International Conference395 for High Performance Computing, Networking, Storage, and Analysis, IEEE Com-396 puter Society Press Los Alamitos, CA, USA.397 Lee, V. 
W.; Kim, C.; Chhugani, J.; Deisher, M.; Kim, D.; Nguyen, A. D.; Satish, N.; Smelyanskiy, M.; Chennupaty, S.; Hammarlund, P.; Singhal, R.; Dubey, P. (2010): Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU.

Luebke, D. (2008): CUDA: Scalable parallel programming for high-performance scientific computing. Biomedical Imaging: From Nano to Macro, 2008, Paris, ACM, New York, NY, USA.

Luebke, D.; Humphreys, G. (2007): How GPUs Work. IEEE Computer, 2007.

Mohan, D. R.; Shires, D.; Mark, A. (2001): Scalable Large Scale Process Modeling and Simulations in Liquid Composite Molding. Computational Science - ICCS 2001, Springer Berlin Heidelberg, pp. 1199-1208.
Mohan, R. V.; Ngo, N. D.; Tamma, K. K.; Fickie, K. D. (1996): On a Pure Finite-Element-Based Methodology for Resin Transfer Mold Filling Simulations. Army Research Lab, pp. 18.

Nvidia (2007): CUDA CUBLAS Library: Version 1.1. Nvidia Corporation, pp. 84.

Nvidia (2012): NVIDIA CUDA C Programming Guide: Version 4.2. Nvidia Corporation, pp. 173.

Papadrakakis, M.; Stavroulakis, G.; Karatarakis, A. (2011): A new era in scientific computing: Domain decomposition methods in hybrid CPU-GPU architectures. Computer Methods in Applied Mechanics and Engineering, vol. 200, no. 13-16, pp. 1490-1508.

Patterson, D. A.; Hennessy, J. L. (1998): Computer Organization & Design: The Hardware/Software Interface. San Francisco, California, USA, Morgan Kaufmann Publishers, Inc.

Rehman, M. S. (2010): Exploring Irregular Memory Access Applications on the GPU. Master of Science, International Institute of Information Technology.

Rumpf, M.; Strzodka, R. (2005): Graphics Processor Units: New Prospects for Parallel Computing. Numerical Solution of Partial Differential Equations on Parallel Computers, A. M. Bruaset and A. Tveito (Eds.), Springer, vol. 51, pp. 89-134.

Shewchuk, J. R. (1994): An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Pittsburgh, PA, USA, Carnegie Mellon University, pp. 64.

Silberschatz, A.; Galvin, P.; Gagne, G. (2003): Applied Operating System Concepts: Windows XP Update. John Wiley & Sons, Inc.

Song, F.; Dongarra, J. (2012): A Scalable Framework for Heterogeneous GPU-Based Clusters. ACM Symposium on Parallel Algorithms and Architectures, ACM, New York, NY, USA.

Wang, H.; Potluri, S.; Luo, M.; Singh, A. K.; Sur, S.; Panda, D. K.
(2011): MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Computer Science - Research and Development, vol. 26, no. 3-4, pp. 257-266.

Wang, L.; Huang, M.; Narayana, V.; El-Ghazawi, T. A. (2011): Scaling Scientific Applications on Clusters of Hybrid Multicore/GPU Nodes. Computing Frontiers Conference, ACM, New York, NY, USA.

Wilkinson, B.; Allen, M. (2005): Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Upper Saddle River, NJ, USA, Pearson Education, Inc.