International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
DOI:10.5121/ijdps.2015.6501
A PROGRESSIVE MESH METHOD FOR PHYSICAL
SIMULATIONS USING LATTICE BOLTZMANN
METHOD ON SINGLE-NODE MULTI-GPU
ARCHITECTURES
Julien Duchateau, François Rousselle, Nicolas Maquignon, Gilles Roussel, Christophe Renaud
Laboratoire d'Informatique, Signal, Image de la Côte d'Opale
Université du Littoral Côte d'Opale, Calais, France
ABSTRACT
In this paper, a new progressive mesh algorithm is introduced in order to perform fast physical simulations with a lattice Boltzmann method (LBM) on a single-node multi-GPU architecture. This algorithm is able to mesh the simulation domain automatically according to the propagation of fluids, and it can be applied to several types of physical simulations. In this paper, we associate this algorithm with a multiphase and multicomponent lattice Boltzmann model (MPMC-LBM), since it is able to perform various types of simulations on complex geometries. The use of this algorithm combined with the massive parallelism of GPUs [5] yields very good performance in comparison with the static mesh method used in the literature. Several simulations are shown in order to evaluate the algorithm.
KEYWORDS
Progressive mesh, lattice Boltzmann method, single-node multi-GPU, parallel computing.
1. INTRODUCTION
The lattice Boltzmann method (LBM) is a computational fluid dynamics (CFD) method. It is a relatively recent technique which approximates the Navier-Stokes equations by a collision-propagation scheme [1]. The lattice Boltzmann method however differs from standard approaches such as the finite element method (FEM) or the finite volume method (FVM) by its mesoscopic approach. It is an interesting alternative which is able to simulate complex phenomena on complex geometries. Its high degree of parallelism also makes this method attractive for performing simulations on parallel hardware. Moreover, the emergence of high-performance computing (HPC) architectures using GPUs [5] is also of great interest for many researchers.

Parallelization is indeed an important asset of the lattice Boltzmann method. However, performing simulations on large complex geometries can be very costly in computational resources. This paper introduces a new progressive mesh algorithm in order to perform physical simulations on complex geometries with a multiphase and multicomponent lattice Boltzmann method. The algorithm is able to automatically mesh the simulation domain according to the propagation of fluids. Moreover, the integration of this algorithm on a single-node multi-GPU architecture is also an important matter which is studied in this paper. This method is an interesting alternative which, to the best of our knowledge, has never been exploited.
Section 2 first describes the multiphase and multicomponent lattice Boltzmann method. It is able to simulate the behavior of fluids with several physical states (phases) and it is also able to model several fluids (components) interacting with each other. Section 3 then presents several recent works involving the lattice Boltzmann method on GPUs. Section 4 concerns the main contribution of this paper: the inclusion of a progressive mesh method in the simulation code. The principles of the method and the definition of an adapted criterion are firstly introduced. The integration on a single-node multi-GPU architecture is then described. An analysis of performance is given in Section 5. The conclusion and future works are finally presented in the last section.
2. THE LATTICE BOLTZMANN METHOD

2.1. The Single Relaxation Time Bhatnagar-Gross-Krook (SRT-BGK) Boltzmann Equation

The lattice Boltzmann method is based on three main discretizations: space, time and velocities. Velocity space is reduced to a finite number of well-defined vectors. Figures 1(a) and 1(b) illustrate this discrete scheme for the D2Q9 and D3Q19 models.

The simulation domain is therefore discretized as a Cartesian grid and calculation steps are performed on this entire grid. The discrete Boltzmann equation [1] with a single relaxation time Bhatnagar-Gross-Krook (SRT-BGK) collision term is defined by the following equation:

f_i(x + e_i Δt, t + Δt) = f_i(x, t) − (1/τ)[f_i(x, t) − f_i^eq(x, t)]   (1)

f_i^eq(x, t) = w_i ρ [1 + (e_i · u)/c_s² + (e_i · u)²/(2c_s⁴) − u²/(2c_s²)]   (2)

c_s² = (1/3)(Δx/Δt)²   (3)

The function f_i(x, t) corresponds to the discrete density distribution function along the velocity vector e_i at a position x and a time t. The parameter τ corresponds to the relaxation time of the simulation. The value ρ is the fluid density and u corresponds to the fluid velocity. Δx and Δt are respectively the spatial and temporal steps of the simulation. The parameters w_i are weighting values defined according to the lattice Boltzmann scheme and can be found in [1]. Macroscopic quantities such as density and velocity are finally computed as follows:

(a) D2Q9 scheme (b) D3Q19 scheme
Figure 1: Example of lattice Boltzmann schemes
ρ(x, t) = Σ_i f_i(x, t)   (4)

ρ(x, t) u(x, t) = Σ_i e_i f_i(x, t)   (5)
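As an illustration of equations (1)-(5), the following sketch advances a small D2Q9 lattice by one collision-propagation step in plain Python. The grid size, relaxation time and initial state are illustrative choices, not values from the paper, and a real implementation would of course run this as a GPU kernel:

```python
# Minimal D2Q9 SRT-BGK sketch of equations (1)-(5): collision against the
# equilibrium distribution (2), periodic streaming (1), and recovery of the
# macroscopic density (4) and velocity (5).

# D2Q9 velocity set e_i and weights w_i (standard values, see [1])
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4/9] + [1/9]*4 + [1/36]*4
NX, NY, TAU = 8, 8, 0.9          # illustrative grid size and relaxation time

def feq(i, rho, ux, uy):
    """Equilibrium distribution, equation (2), with c_s^2 = 1/3."""
    eu = E[i][0]*ux + E[i][1]*uy
    u2 = ux*ux + uy*uy
    return W[i]*rho*(1 + 3*eu + 4.5*eu*eu - 1.5*u2)

def step(f):
    """One collision-propagation step, equation (1), periodic boundaries."""
    post = [[[0.0]*9 for _ in range(NY)] for _ in range(NX)]
    for x in range(NX):
        for y in range(NY):
            rho = sum(f[x][y])                                   # equation (4)
            ux = sum(E[i][0]*f[x][y][i] for i in range(9))/rho   # equation (5)
            uy = sum(E[i][1]*f[x][y][i] for i in range(9))/rho
            for i in range(9):
                out = f[x][y][i] - (f[x][y][i] - feq(i, rho, ux, uy))/TAU
                post[(x + E[i][0]) % NX][(y + E[i][1]) % NY][i] = out
    return post

# Start from rest (f_i = f_i^eq at rho = 1, u = 0); mass must be conserved.
f = [[[feq(i, 1.0, 0.0, 0.0) for i in range(9)] for _ in range(NY)]
     for _ in range(NX)]
f = step(f)
total_mass = sum(sum(sum(cell) for cell in col) for col in f)
```

Starting from a uniform equilibrium, one step leaves the populations unchanged and the total mass equal to the number of cells, which is a quick sanity check for any LBM kernel.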
2.2. Multiphase and Multicomponent Lattice Boltzmann Model

Multiphase and multicomponent (MPMC) models allow performing complex simulations involving several physical components. In this section, a MPMC-LBM model based on the work of Bao & Schaefer [4] is presented. It includes several interaction forces based on a pseudo-potential ψ_σ, which is calculated as follows:

ψ_σ = √( 2(p_σ − c_s² ρ_σ) / (c g_σσ) )   (6)

The term p_σ is the pressure term. It is calculated by the use of an equation of state such as the Peng-Robinson equation:

p_σ = ρ_σ R T / (1 − b_σ ρ_σ) − a_σ α(T) ρ_σ² / (1 + 2b_σ ρ_σ − b_σ² ρ_σ²)   (7)

Internal forces are then computed. The internal fluid interaction force is expressed as follows [2][3]:

F_σσ(x) = −β (g_σσ/2) ψ_σ(x) Σ_i w_i ψ_σ(x + e_i) e_i − ((1 − β)/2)(g_σσ/2) Σ_i w_i ψ_σ²(x + e_i) e_i   (8)

The value β is a weighting term generally fixed to 1.16 according to [2][3]. The inter-component force is also introduced as follows [4]:

F_σσ̄(x) = −(g_σσ̄/2) ψ_σ(x) Σ_i w_i ψ_σ̄(x + e_i) e_i   (9)

Additional forces can be added into the simulation code, such as the gravity force or a fluid-structure interaction force [3]. The incorporation of the force term is then achieved by a modified collision operator expressed as follows:

f_i^σ(x + e_i Δt, t + Δt) = f_i^σ(x, t) − (1/τ_σ)[f_i^σ(x, t) − f_i^σ,eq(x, t)] + Δf_i^σ(x, t)   (10)

Δf_i^σ(x, t) = f_i^σ,eq(ρ_σ, u_σ + Δu_σ) − f_i^σ,eq(ρ_σ, u_σ)   (11)

Δu_σ = F_σ Δt / ρ_σ   (12)

Macroscopic quantities for each component are finally computed by the use of equations (4) and (5).
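Equations (6) and (7) can be sketched as follows. The EOS constants, the temperature and the interaction strength below are illustrative placeholders, not the paper's calibrated values; note that the sign of g must make the argument of the square root in equation (6) non-negative:

```python
# Sketch of equations (6)-(7): Peng-Robinson pressure and the resulting
# pseudo-potential. All numerical constants here are made up for the example.
import math

CS2 = 1.0/3.0   # lattice speed of sound squared, equation (3) with dx = dt = 1

def peng_robinson_pressure(rho, a=2/49, b=2/21, R=1.0, T=0.3, alpha=1.0):
    """Equation (7): p = rho*R*T/(1 - b*rho)
                         - a*alpha(T)*rho^2/(1 + 2*b*rho - b^2*rho^2)."""
    return (rho*R*T/(1.0 - b*rho)
            - a*alpha*rho*rho/(1.0 + 2.0*b*rho - b*b*rho*rho))

def pseudo_potential(rho, g=-1.0, c=6.0):
    """Equation (6): psi = sqrt(2*(p - cs^2*rho)/(c*g)).
    With p < cs^2*rho (as for these placeholder constants), g < 0 keeps
    the argument positive."""
    p = peng_robinson_pressure(rho)
    return math.sqrt(2.0*(p - CS2*rho)/(c*g))

p = peng_robinson_pressure(1.0)
psi = pseudo_potential(1.0)
```

In the actual model one such pseudo-potential field is maintained per component σ, and the sums over neighbors in equations (8) and (9) read these values from adjacent lattice cells.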
3. LATTICE BOLTZMANN METHODS AND GPUS

The massive parallelism of GPUs has quickly been exploited in order to perform fast simulations [7][8] using the lattice Boltzmann method. Recent works have shown that GPUs are also used with multiphase and multicomponent models [16][14]. The main aspects of GPU optimization are decomposed into several categories [10][9], such as thread-level parallelism, GPU memory access, and overlap of memory transfers with computations. Data coalescence is needed in order to optimize global memory bandwidth. This implies several conditions, as described in [9]. Concerning LBM, an adapted data structure such as the Structure of Arrays (SoA) has been well studied and has proven to be efficient on GPU [7].

Several access patterns are also described in the literature. The first one, named the A-B access pattern, consists of using two calculation grids in GPU global memory in order to manage the temporal and spatial dependency of the data (Equation (10)). Simulation steps alternate between reading distribution functions from A and writing them to B, and reading from B and writing to A. This pattern is commonly used and offers very good performance [10][11][9] on a single GPU. Several techniques are however presented in the literature in order to significantly reduce the memory cost without loss of information, such as grid compression [6], the Swap algorithm [6] or the A-A pattern technique [12]. In this paper, the A-A pattern technique is used in order to save the memory otherwise required by the spatial and temporal data dependency.

Recent works involving implementations of the lattice Boltzmann method on a single node composed of several GPUs are also available. A first solution, proposed in [13][17], consists in dividing the entire simulation domain into subdomains according to the number of GPUs and performing the LBM kernels on each subdomain in parallel. CPU threads are used to handle each CUDA context. Communications between subdomains are performed using zero-copy memory transfers. The zero-copy feature allows performing efficient communications through a mapping between CPU and GPU pointers. Data must however be read and written only once in order to obtain good performance.

Some approaches have finally been proposed recently to perform simulations on several nodes composed of multiple GPUs by the use of MPI in combination with CUDA [19][18][21][15]. In our case, we only dispose of one computing node with multiple GPUs, thus we do not focus on these architectures in this paper.

4. A PROGRESSIVE MESH ALGORITHM FOR THE LATTICE BOLTZMANN METHOD ON SINGLE-NODE MULTI-GPU ARCHITECTURES

4.1. Motivation

Works described in the previous section consider that the entire simulation domain is meshed and divided into subdomains according to the number of GPUs, as shown on Figure 2. All subdomains are therefore calculated in parallel.

Figure 2: Division of the simulation domain: the entire domain is decomposed into subdomains according to the number of GPUs.
In this paper, a new approach is considered. For most simulations, the entire domain generally does not require to be fully meshed at the beginning of the simulation. We therefore propose a new progressive mesh method in order to dynamically create the mesh according to the propagation of the simulated fluid. The idea consists in defining a first subdomain at the beginning of the simulation (Figure 3(a)). Several subdomains can then be created following the propagation of the fluid, as can be seen on Figure 3(b). The mesh finally adapts automatically to the simulation geometry (Figure 3(c)). This method is therefore applicable for any geometry and simulation. It is also a real advantage for applications on industrial structures mostly composed of pipes or channels: it can indeed save a lot of memory and calculations depending on the geometry used for the simulation.

Figure 3: Example of a 3D simulation using the progressive mesh algorithm: (a) a first subdomain is created at the beginning of the simulation, (b) several subdomains are created following the propagation of fluid, (c) all subdomains are created and completely adapt to the simulation geometry.

The progressive mesh algorithm firstly needs the introduction of an adapted criterion in order to create a new subdomain in the simulation. This new subdomain then needs to be connected to the existing subdomains. Calculations on the single-node multi-GPU architecture are finally an important optimization factor.

4.2. Definition of a Criterion for the Progressive Mesh

The definition of a criterion is an important aspect in order to efficiently create new subdomains for the simulation. This criterion needs to represent efficiently the propagation of fluid. The fluid velocity seems like a good choice in order to define an efficient criterion. The difference of the fluid velocity between two iterations is considered in order to observe the fluid dispersion efficiently. Our criterion is therefore defined as follows for the component σ:

‖C_σ(x)‖₂ = ‖u_σ(x, t + Δt) − u_σ(x, t)‖₂   (13)

The symbol ‖·‖₂ stands for the Euclidean norm in this paper. This criterion needs to be calculated for all active subdomains on the boundaries. If the criterion exceeds an arbitrary threshold S on a boundary, a new subdomain is created next to this boundary, as shown on Figure 4. The value S is generally fixed to 0 in this paper in order to detect any change of velocity on the boundaries of each subdomain.
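The criterion of equation (13) and the threshold test can be sketched as follows. This is a minimal illustration with made-up field shapes, not the authors' GPU implementation:

```python
# Sketch of the subdomain-creation criterion, equation (13): the Euclidean
# norm of the velocity change between two iterations is evaluated on each
# boundary cell, and a neighboring subdomain is spawned when any cell
# exceeds the threshold S.
import math

S = 0.0   # the paper generally fixes S = 0 to detect any velocity change

def criterion(u_new, u_old):
    """||C(x)||_2 = ||u(x, t + dt) - u(x, t)||_2 for one boundary cell."""
    return math.sqrt(sum((a - b)**2 for a, b in zip(u_new, u_old)))

def boundary_needs_new_subdomain(boundary_new, boundary_old, s=S):
    """True if any boundary cell's velocity changed by more than s."""
    return any(criterion(un, uo) > s
               for un, uo in zip(boundary_new, boundary_old))

# Fluid at rest on the boundary: no new subdomain is created.
rest = [(0.0, 0.0, 0.0)] * 4
# Fluid reaches one boundary cell: the neighboring subdomain is created.
moving = [(0.0, 0.0, 0.0)] * 3 + [(0.01, 0.0, 0.0)]
created = boundary_needs_new_subdomain(moving, rest)
```

In the real simulation this test runs once per iteration on the boundary layers of every active subdomain, so its cost is negligible compared with the LBM kernels themselves.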
Figure 4: The criterion ‖C_σ(x)‖₂ is calculated on the boundary. If the criterion exceeds the threshold S then a new subdomain is created next to the boundary.

4.3. Algorithm

This section describes the algorithm for the multiphase and multicomponent lattice Boltzmann model with the inclusion of our progressive mesh algorithm. It is also useful in order to summarize the previous sections. The calculation of the criterion and the creation of new subdomains are achieved at the last step of the algorithm in order not to disturb the simulation process. Figure 5 describes our resulting algorithm.

Figure 5: Algorithm for the multiphase and multicomponent lattice Boltzmann model with the inclusion of our progressive mesh method. For colors, please refer to the PDF version of this paper.
4.4. Integration on Single-Node Multi-GPU Architecture

Efficiency of inter-GPU communications is surely the most difficult task in order to obtain good performance. Indeed, our simulations are composed of numerous subdomains which are added dynamically. The assignment of GPUs to the different subdomains is an important factor of optimization. An efficient assignment can have an important impact on the performance of the simulation: it can reduce the communication time between subdomains and thus reduce the simulation time.

4.4.1. Overlap Communications with Computations

Several data exchanges are needed for this type of model. The computation of the interaction forces F_σσ and F_σσ̄ implies having access to neighboring values of the pseudo-potential. The propagation step of LBM also implies communicating several distribution functions between GPUs (Figure 6). Aligned buffers may be used for data transactions.

Figure 6: Schematic example for communication of distribution functions in 2D: red arrows correspond to values to communicate between subdomains. For colors, please refer to the PDF version of this paper.

In order to obtain a simulation time as short as possible, it is necessary to overlap data transfers with algorithm calculations. Indeed, overlapping computations and communications allows obtaining a significant performance gain by reducing the time spent waiting for data. The idea is to separate the computation process into two steps: boundary calculations and interior calculations. Computations on the needed boundaries are firstly done. Communications between neighboring subdomains are then done while computing the interior. The different communications are thus performed simultaneously with calculations, which allows good efficiency.

In most cases for the lattice Boltzmann method, memory is transferred via zero-copy transactions to page-locked memory, which allows good overlapping between communications and computations [17][13][15]. A different approach is studied in this paper concerning inter-GPU communications. In most recent HPC architectures, several GPUs can be connected to the same PCIe bus. To improve performance, Nvidia launched GPUDirect with CUDA 4.0. This technology allows performing Peer-to-Peer transfers and memory accesses between two compatible GPUs. The idea is to perform data transfers using Peer-to-Peer accesses for compatible GPUs and zero-copy transactions for the others. This method allows communicating data by bypassing the CPU and therefore accelerating the transfer (Figure 7). It improves performance and the efficiency of the simulation.

Figure 7: GPUDirect technology (source: Nvidia).
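The two-step overlap described above can be sketched schematically. Python threads stand in here for the CUDA streams an actual implementation would use (e.g. asynchronous copies and kernels on separate streams), and the function names are hypothetical:

```python
# Schematic version of the overlap strategy of Section 4.4.1: boundaries are
# computed first, then their exchange runs concurrently with the interior
# computation. A real implementation would issue asynchronous Peer-to-Peer
# or zero-copy transfers on a dedicated CUDA stream instead of a thread.
import threading

log = []

def compute_boundaries():
    log.append("boundaries computed")

def exchange_boundaries():
    log.append("boundaries exchanged")    # P2P or zero-copy transfer

def compute_interior():
    log.append("interior computed")

def lbm_iteration():
    compute_boundaries()                  # step 1: boundary cells only
    t = threading.Thread(target=exchange_boundaries)
    t.start()                             # step 2a: transfer in background...
    compute_interior()                    # step 2b: ...while interior runs
    t.join()                              # synchronize before next iteration

lbm_iteration()
```

The key property is that the transfer and the interior computation run concurrently, so the communication time is hidden as long as the interior work dominates.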
4.4.2. Optimization of Data Transfers between GPUs

The assignment of GPUs is an important factor of optimization for this type of application. The communication cost is generally a bottleneck for the exchanges between subdomains, each subdomain being associated with one GPU. The first case concerns communications between subdomains belonging to the same GPU: the communication cost is extremely low because communications are performed on the same device. The second case concerns communications between subdomains located on different GPUs; a distinction is however made between Peer-to-Peer compatible GPUs and the others. These observations lead to the following functions, whose goal is to optimize dynamically the assignment of GPUs. For a new subdomain D, the function f(D) estimates the communication cost with its neighbors:

g(D, D′) = 0.5 × (size of transfer between D and D′) if the GPUs of D and D′ are Peer-to-Peer compatible, (size of transfer between D and D′) otherwise   (14)

f(D) = Σ_{D′} g(D, D′)   (15)

where D′ denotes the subdomains neighboring D. The function g(D, D′) compares the different communication paths between a subdomain and its neighbors; an arbitrary weighting of 0.5 favors Peer-to-Peer communications. The function f(D) therefore represents the total communication cost. This function is calculated for all available GPUs and the GPU with the minimum value is assigned to this subdomain. In order to keep load balancing, all GPUs have to be assigned dynamically and the same GPU cannot be assigned twice as long as the other GPUs are not all assigned. Figure 8 explains the optimization via a schematic example.

Figure 8: Schematic example in 2D for the optimization of the assignment of GPUs. The function f(D) is calculated for all available GPUs and the GPU which has the minimum value is chosen. For colors, please refer to the PDF version of this paper.
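The assignment heuristic of equations (14)-(15) can be sketched as follows. The Peer-to-Peer matrix, the transfer sizes and the function names are made up for the example:

```python
# Sketch of the GPU-assignment heuristic of Section 4.4.2: each candidate
# GPU is scored by the summed cost of exchanges with the neighbors of the
# new subdomain, Peer-to-Peer-compatible transfers being weighted 0.5
# (equation (14)), and the cheapest not-yet-reused GPU is chosen.
def g_cost(candidate_gpu, neighbor_gpu, transfer_size, p2p):
    """Equation (14): per-neighbor cost, halved for P2P-compatible pairs."""
    if candidate_gpu == neighbor_gpu:
        return 0.0                        # same GPU: no PCIe transfer at all
    weight = 0.5 if p2p[candidate_gpu][neighbor_gpu] else 1.0
    return weight * transfer_size

def f_cost(candidate_gpu, neighbors, p2p):
    """Equation (15): total communication cost for one candidate GPU."""
    return sum(g_cost(candidate_gpu, n_gpu, size, p2p)
               for n_gpu, size in neighbors)

def assign_gpu(neighbors, p2p, available):
    """Pick the available GPU minimizing f; 'available' enforces the
    load balancing described in the text (a GPU is not reused while
    others remain unassigned)."""
    return min(available, key=lambda gpu: f_cost(gpu, neighbors, p2p))

# 3 GPUs; GPUs 0 and 1 are Peer-to-Peer compatible, GPU 2 is not.
p2p = [[True, True, False], [True, True, False], [False, False, True]]
neighbors = [(0, 100.0), (1, 100.0)]   # (GPU of neighbor subdomain, size)
chosen = assign_gpu(neighbors, p2p, available=[1, 2])
```

Here GPU 1 wins: it shares a device with one neighbor (cost 0) and is P2P-compatible with the other (cost 50), while GPU 2 would pay the full cost for both exchanges.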
5. RESULTS AND PERFORMANCE

5.1. Hardware

A machine composed of 8 NVIDIA Tesla C2050 graphics cards (Fermi architecture) is used to perform the simulations. Table 1 describes some Tesla C2050 hardware specifications. Peer-to-Peer communication accessibility for our architecture is also described in Figure 9.

Table 1: Tesla C2050 hardware specifications

CUDA compute capability: 2.0
Total amount of global memory: 2687 MBytes
(14) Multiprocessors, (32) scalar processors/MP: 448 CUDA cores
GPU clock rate: 1147 MHz
L2 cache size: 786432 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768

Figure 9: Peer-to-Peer communication accessibility for our architecture.
5.2. Simulations
Two simulations are considered on large simulation domains in order to evaluate the performance of our contribution. Both simulations include the use of two physical components. The geometry however differs between these simulations. The first simulation is based on a simple geometry composed of 1024*256*256 calculation cells, where a fluid fills the whole simulation domain during the simulation (Figure 10). The second simulation is based on a complex geometry composed of 1024*1024*128 calculation cells, where the fluid moves within channels (Figure 11).
5.3. Performance
This section deals with the performance obtained by our method. A comparison between the progressive mesh algorithm and the static mesh method generally used in the literature is shown. The optimization of the assignment of GPUs to subdomains is also studied. The performance metric generally used for the lattice Boltzmann method is the Million Lattice node Updates Per Second (MLUPS). It is calculated as follows:
MLUPS = (domain size × number of iterations) / (simulation time × 10⁶)   (16)
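Equation (16) is straightforward to compute; the 10⁶ factor expresses the result in millions of lattice-node updates per second. The run parameters below are illustrative, not measured results:

```python
# MLUPS as defined by equation (16). The iteration count and timing below
# are made-up example numbers, not measurements from the paper.
def mlups(domain_size, iterations, simulation_time_s):
    """Million Lattice node Updates Per Second."""
    return domain_size * iterations / (simulation_time_s * 1e6)

# e.g. the 1024*256*256 domain of Figure 10, run for 1000 iterations in 100 s:
perf = mlups(1024 * 256 * 256, 1000, 100.0)
```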
The classical approach generally used in the literature in order to perform simulations consists in equally dividing the simulation domain according to the number of GPUs. It generally offers good performance since communications can be overlapped with calculations. The use of Peer-to-Peer communications also has a beneficial effect on the performance, as shown on Figure 13. Peer-to-Peer communications allow obtaining a performance gain between 8 and 12% according to the number of GPUs used for the simulation described in Figure 10. Zero-copy communications offer a good scaling, but an almost perfect scaling is obtained with the inclusion of Peer-to-Peer communications, as shown on Figure 12.

Figure 10: A two-component leakage simulation on a simple geometry with a domain size of 1024*256*256 cells.

Figure 11: A two-component leakage simulation on a complex geometry composed of channels with a domain size of 1024*1024*128 cells.

The inclusion of the progressive mesh also has an important beneficial effect on the simulation performance. Subdomains of size 128*128*128 are considered for these simulations. Figures 13 and 14 describe performance in terms of calculations and memory consumption for the simulation presented on Figure 10. Note that the progressive mesh algorithm obtains excellent performance at the beginning of the simulation. The addition of subdomains during the simulation results in a decrease of performance until the convergence of the simulation. In this particular case, the whole simulation domain is meshed at the end of the simulation, as shown on Figure 14, which leads to a very slight decrease of performance compared to the static mesh. In terms of memory consumption, the fast apparition of new subdomains leads to having the entire simulation domain in memory after a few iterations.

Figure 12: Comparison of performance between Peer-to-Peer communications and zero-copy communications for the simulation shown on Figure 10.

Figure 13: Comparison of performance between the progressive mesh method and the static mesh method for the simulation shown on Figure 10. The inclusion of the optimization for GPU assignment is also presented.
method for the simulation shown on Figure 10. The inclusion of the optimization for GPU assignment
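The communication costs discussed here can be grounded with a quick estimate of how much halo data two neighboring subdomains must exchange per iteration. The sketch below is a back-of-the-envelope calculation only, assuming a D3Q19 lattice, double-precision distributions, and full 128*128 contact faces; none of these numbers comes from the measurements above, so it merely illustrates the order of magnitude.

```python
def halo_bytes_per_iteration(face_cells, dists_crossing=5, bytes_per_value=8):
    """Bytes exchanged across one subdomain face per LBM iteration.

    For a D3Q19 lattice, 5 of the 19 distribution functions point
    through any given axis-aligned face, and data must be sent in
    both directions (propagation is symmetric).
    """
    one_way = face_cells * dists_crossing * bytes_per_value
    return 2 * one_way  # both directions

# A 128*128*128 subdomain exposes 128*128 cells on each face.
face = 128 * 128
per_face = halo_bytes_per_iteration(face)
print(per_face / 1e6, "MB per face per iteration")  # 1.31072 MB
```

At hundreds of iterations per second across several faces, this traffic quickly reaches hundreds of MB/s per subdomain pair, which is consistent with Peer-to-Peer transfers (bypassing host memory) outperforming zero-copy accesses.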
Figure 13 also compares performance between two different assignments for GPUs. The first one is a simple assignment which assigns each new subdomain to the first available GPU. The second one uses the optimization method presented in section 4.4.2. The comparison of these two methods leads to an important difference of performance: a difference of approximately 30% is noted at the convergence of this simulation between the two approaches. This difference is mostly due to the fact that the communication cost is higher for a simple assignment than for an optimized assignment. Since subdomains are added dynamically and connected to each other, it is therefore important to optimize these communications in order to reduce the simulation time.

The same comparison is also done for the simulation presented on Figure 11, as shown on Figures 15 and 16. The main difference in this situation is the geometry of the simulation, which is more complex and channelized. Physical simulations on channelized geometries are especially present in industrial structures.

In this case, the progressive mesh method shows excellent results. In terms of memory, this method is easily able to simulate on a global simulation domain of size 1024*1024*128 and more, while the static mesh method is unable to perform the simulation: the amount of needed memory is too large. Figure 15 shows the evolution of memory consumption during the simulation. The memory cost at the convergence of the simulation is far lower than with the static mesh method; a gain of approximately 50% of memory is noted for this particular simulation. This is due to the fact that the progressive mesh method automatically adapts to the evolution of the simulation, so only the needed zones of the global simulation domain are meshed.

Figure 14: Comparison of memory consumption between the progressive mesh method and the static mesh method for the simulation shown on Figure 10.
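The optimized assignment of section 4.4.2 is not reproduced here, but its goal when a subdomain appears, keeping halo exchanges on-device, can be illustrated with a simple greedy heuristic. This is a hypothetical sketch, not the paper's exact algorithm: place each new subdomain on the GPU that already hosts the most of its neighbors, breaking ties by choosing the least loaded GPU.

```python
def assign_gpu(neighbors, placement, num_gpus, load):
    """Pick a GPU for a new subdomain.

    neighbors: ids of already-placed neighboring subdomains
    placement: dict subdomain id -> gpu id
    load:      dict gpu id -> number of subdomains hosted

    Greedy rule: prefer the GPU hosting the most neighbors, so the
    corresponding halo exchanges stay on-device; break ties (and the
    no-neighbor case) by choosing the least loaded GPU.
    """
    votes = {g: 0 for g in range(num_gpus)}
    for n in neighbors:
        if n in placement:
            votes[placement[n]] += 1
    # Sort key: most neighbor votes first, then lightest load.
    return min(votes, key=lambda g: (-votes[g], load[g]))

# Toy run: four subdomains appearing one after another on two GPUs.
placement, load = {}, {0: 0, 1: 0}
for sub, nbrs in [(0, []), (1, [0]), (2, [0, 1]), (3, [])]:
    g = assign_gpu(nbrs, placement, 2, load)
    placement[sub] = g
    load[g] += 1
print(placement)  # {0: 0, 1: 0, 2: 0, 3: 1}
```

A production version would also cap the per-GPU memory, as subdomain creation is bounded by GPU memory; the point of the sketch is only that connectivity, not arrival order, should drive the assignment.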
Figure 15: Comparison of memory consumption between the progressive mesh method and the static mesh method for the simulation shown on Figure 11.

The comparison of the repartition of GPUs is also described in Figure 16. An important performance gain (19%) is still noted for this simulation. This proves that a dynamic optimization method is important in order to obtain good performance. Moreover, the fact that the domain does not need to be fully meshed brings an important gain in performance. The geometry has therefore an important impact on the performance of the progressive mesh method.

Figure 16: Comparison of performance between a simple repartition of GPUs and an optimized assignment of GPUs for the simulation shown on Figure 11.

6. CONCLUSION

In this paper, an efficient progressive mesh algorithm for physical simulations using the lattice Boltzmann method is presented. This progressive mesh method can be a useful tool in order to perform several types of physical simulations. Its main advantage is that subdomains are automatically added to the simulation by the use of an adapted criterion. This method is also able to save a lot of memory and calculations in order to perform simulations on large installations.
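The memory savings of the progressive mesh can be sanity-checked with a simple count. The sketch below uses assumed numbers that the text does not specify at this point, two copies of 19 double-precision distributions plus four macroscopic fields per cell, so it is illustrative only: a static mesh pays for every cell of the 1024*1024*128 domain, while the progressive mesh only pays for the 128*128*128 subdomains actually activated.

```python
def domain_bytes(cells, values_per_cell=2 * 19 + 4, bytes_per_value=8):
    """Rough memory footprint per meshed region: two copies of 19
    distributions (separate read/write arrays for the collision-
    propagation scheme) plus 4 macroscopic fields, double precision.
    """
    return cells * values_per_cell * bytes_per_value

static = domain_bytes(1024 * 1024 * 128)   # full static mesh
sub = domain_bytes(128 * 128 * 128)        # one 128^3 subdomain
active = 32                                # e.g. half of the 64 possible subdomains
progressive = active * sub
print(round(static / 2**30, 1), "GiB static")            # 42.0 GiB static
print(round(progressive / 2**30, 1), "GiB progressive")  # 21.0 GiB progressive
```

With only half of the subdomains ever activated, the footprint halves, matching the order of the roughly 50% gain reported for the channelized simulation.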
The integration of the progressive mesh method on a single-node multi-GPU architecture is also treated. A dynamic optimization of the assignment of GPUs to subdomains is an important factor in order to obtain good performance. The combination of all these contributions therefore allows performing fast physical simulations on all types of geometry. The progressive mesh method is an interesting alternative because it achieves similar or better performance than the usual static mesh method.
The progressive mesh algorithm is however limited by the memory of the GPUs, which is generally far smaller than the CPU RAM. The creation of new subdomains is indeed only possible while there is a sufficient amount of memory available on the GPUs. Extensions of this work to cases that require more memory than all GPUs can handle are now under investigation. Data transfer optimizations with the CPU host will therefore be essential to maintain good performance.
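The GPU memory limit described above suggests a simple guard before each subdomain creation. The following is a hypothetical sketch (the paper does not give this logic, and the numbers are made up): track the free bytes per GPU and refuse, or defer, creation when no GPU can hold another subdomain.

```python
def can_create(free_bytes, subdomain_bytes):
    """Return the id of a GPU with room for one more subdomain,
    or None if every GPU is full (creation must then be deferred,
    e.g. by spilling to host memory in future extensions).
    """
    for gpu, free in free_bytes.items():
        if free >= subdomain_bytes:
            return gpu
    return None

free = {0: 3 * 2**30, 1: 1 * 2**30}   # hypothetical free memory per GPU
need = int(0.7 * 2**30)               # assumed cost of one 128^3 subdomain
print(can_create(free, need))         # 0
```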
ACKNOWLEDGEMENTS
This work has been made possible thanks to a collaboration between academic and industrial groups, gathered by the INNOCOLD association.
REFERENCES
[1] B. Chopard, J.L. Falcone J. Latt, The Lattice Boltzmann advection diffusion model revisited, The
European Physical Journal - Special Topics,Vol. 171, pp. 245-249, 2009.
[2] S. Gong, P. Cheng, Numerical investigation of droplet motion and coalescence by an improved lattice
Boltzmann model for phase transitionsand multiphase flows, Computers & Fluids , Vol. 53, pp. 93-
104, 2012.
[3] S. Gong, P. Cheng, A lattice Boltzmann method for liquid vapor phase change heat transfer,
Computers & Fluids, Vol. 54, pp. 93-104, 2012.
[4] J. Bao, L. Schaeffer, Lattice Boltzmann equation model for multicomponent multi-phase flow with
high density ratios, Applied MathematicalModelling, 2012.
[5] Nvidia, C. U. D. A. (2011). Nvidia cuda c programming guide. NVIDIA Corporation, 120, 18.8
[6] M. Wittmann, T. Zeiser, G. Hager, G. Wellein, Comparison of different propagation steps for Lattice
Boltzmann methods, Computers and Mathematicswith Applications, Vol. 65 pp. 924-935, 2013.
[7] J. Tölke, Implementation of a Lattice Boltzmann kernel using the compute unified device architecture
developed by nVIDIA, Computing andVisualization in Science, 1-11, 2008.
[8] J. Tölke, M. Krafczyk, TeraFLOP computing on a desktop PC with GPUs for 3D CFD, International
Journal of Computational Fluid Dynamics 22(7), pp. 443-456, 2008.
[9] F. Kuznik, C.Obrecht, G. Rusaouën, J-J. Roux, LBM based flow simulation using GPU computing
processor, Computers and Mathematics withApplications 27, 2009.
[10] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, A new approach to the lattice Boltzmann method
for graphics processing units, Computersand Mathematics with Applications 61, pp. 3628-3638, 2011.
[11] P.R. Rinaldi, E.A Dari, M.J. Vénere, A. Clausse, A Lattice-Boltzmannsolver for 3D fluid on GPU,
Simulation Modeling Pratice and Theory 25,pp. 163-171, 2012.
[12] P. Bailey, J. Myre, S. Walsh, D. Lilja, M. Saar, Accelerating lattice boltzmann fluid flows using
graphics processors, International Conferenceon Parallel Processing, pp. 550-557, 2009.
[13] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Multi-GPU implementation of the lattice
Boltzmann method, Computers and Mathematicswith Applications, 80, pp. 269-275, 2013.
[14] X. Li, Y. Zhang, X. Wang, W. Ge, GPU-based numerical simulation of multi-phase flow in porous
media using multiple-relaxation-time latticeBoltzmann method, Chemical Engineering Science, Vol.
102, pp. 209-219,2013.
[15] M. Januszewski, M. Kostur, Sailfish: A flexible multi-GPU implementationof the lattice Boltzmann
method, Computer Physics Communications,Vol. 185, pp. 2350-2368, 2014.
[16] F. Jiang, C. Hu, Numerical simulation of a rising CO2 droplet in the initial accelerating stage by a
multiphase lattice Boltzmann method,Applied Ocean Research, Vol. 45, pp. 1-9, 2014.
[17] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Multi-GPU Implementation of a Hybrid Thermal Lattice Boltzmann Solver using the TheLMA Framework, Computers and Fluids, Vol. 80, pp. 269-275, 2013.
[18] C. Rosales, Multiphase LBM Distributed over Multiple GPUs, CLUSTER'11 Proceedings of the 2011 IEEE International Conference on Cluster Computing, pp. 1-7, 2011.
[19] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Scalable lattice Boltzmann solvers for CUDA GPU clusters, Parallel Computing, Vol. 39, pp. 259-270, 2013.
[20] J. Habich, C. Feichtinger, H. Köstler, G. Hager, G. Wellein, Performance engineering for the lattice Boltzmann method on GPGPUs: Architectural requirements and performance results, Computers & Fluids, Vol. 80, pp. 276-282, 2013.
[21] C. Feichtinger, J. Habich, H. Köstler, U. Rüde, T. Aoki, Performance Modeling and Analysis of Heterogeneous Lattice Boltzmann Simulations on CPU-GPU Clusters, Parallel Computing, 2014.
AUTHORS
Julien Duchateau is a PhD student in computer science at the Université du Littoral Côte d'Opale in France. His main research interests are massive parallelism on CPUs and GPUs, physical simulations and computer graphics.
François Rousselle is an associate professor in computer science at the Université du Littoral Côte d’Opale
in France. His main research interests are computer graphics, physical simulations, virtual reality and
massive parallelism.
Nicolas Maquignon is a PhD student in simulation and numerical physics at the Université du Littoral Côte
d’Opale. His main research interests are numerical physics, numerical mathematics and numerical
modeling.
Christophe Renaud is a professor in computer science at the Université du Littoral Côte d’Opale in France.
His main research interests are computer graphics, virtual reality, physical simulations and massive
parallelism.
Gilles Roussel is an associate professor in automatic control at the Université du Littoral Côte d'Opale in France. His main research interests are automatic control, signal processing, physical simulations and industrial computing.
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 

hardware.
Moreover, the emergence of high-performance computing (HPC) architectures using GPUs [5] is also of great interest for many researchers. Parallelization is indeed an important asset of the lattice Boltzmann method. However, performing simulations on large complex geometries can be very costly in computational resources. This paper introduces a new progressive mesh algorithm in order to perform physical simulations on complex geometries by the use of a multiphase and multicomponent lattice Boltzmann method. The algorithm is able to automatically mesh the simulation domain according to the propagation of fluids. Moreover, the integration of this algorithm on a single-node multi-GPU architecture is also an important matter which is studied in this paper. This method is an interesting alternative which, to the best of our knowledge, has never been exploited.
  • 2. Section 2 first describes the multiphase and multicomponent lattice Boltzmann method. It is able to simulate the behavior of fluids with several physical states (phases) and it is also able to model several fluids (components) interacting with each other. Section 3 then presents several recent works involving the lattice Boltzmann method on GPUs. Section 4 concerns the main contribution of this paper: the inclusion of a progressive mesh method in the simulation code. The principles of the method and the definition of an adapted criterion are first introduced. The integration on a single-node multi-GPU architecture is then described. An analysis of performance is also presented in section 5. The conclusion and future works are finally presented in the last section.

2. THE LATTICE BOLTZMANN METHOD

2.1. The Single Relaxation Time Bhatnagar-Gross-Krook (SRT-BGK) Boltzmann Equation

The lattice Boltzmann method is based on three main discretizations: space, time and velocities. Velocity space is reduced to a finite number of well-defined vectors. Figures 1(a) and 1(b) illustrate this discrete scheme for the D2Q9 and D3Q19 models. The simulation domain is therefore discretized as a Cartesian grid and calculation steps are achieved on this entire grid. The discrete Boltzmann equation [1] with a single relaxation time Bhatnagar-Gross-Krook (SRT-BGK) collision term is defined by the following equations:

$f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{1}{\tau}\left[f_i(\mathbf{x}, t) - f_i^{eq}(\mathbf{x}, t)\right]$   (1)

$f_i^{eq}(\mathbf{x}, t) = w_i\, \rho \left[1 + \frac{\mathbf{e}_i \cdot \mathbf{u}}{c_s^2} + \frac{(\mathbf{e}_i \cdot \mathbf{u})^2}{2 c_s^4} - \frac{\mathbf{u}^2}{2 c_s^2}\right]$   (2)

$c_s = \frac{1}{\sqrt{3}} \frac{\Delta x}{\Delta t}$   (3)

The function $f_i(\mathbf{x}, t)$ corresponds to the discrete density distribution function along velocity vector $\mathbf{e}_i$ at a position $\mathbf{x}$ and a time $t$. The parameter $\tau$ corresponds to the relaxation time of the simulation. The value $\rho$ is the fluid density and $\mathbf{u}$ corresponds to the fluid velocity. $\Delta x$ and $\Delta t$ are the spatial and temporal steps of the simulation respectively.
Parameters $w_i$ are weighting values defined according to the lattice Boltzmann scheme and can be found in [1]. Macroscopic quantities such as density and velocity are finally computed as follows:

Figure 1: Example of lattice Boltzmann schemes: (a) D2Q9 scheme, (b) D3Q19 scheme.
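To make the update concrete, here is a minimal single-node D2Q9 sketch in Python (our illustration, not the paper's GPU code): `equilibrium` implements equation (2) with $c_s^2 = 1/3$ (i.e. $\Delta x = \Delta t = 1$), `macroscopic` implements equations (4)-(5), and `bgk_collide` applies the SRT-BGK collision of equation (1) before streaming.

```python
# D2Q9 lattice: velocity vectors e_i and weights w_i (standard values from [1])
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4/9] + [1/9]*4 + [1/36]*4

def equilibrium(rho, ux, uy):
    """Equation (2) with c_s^2 = 1/3, so 1/c_s^2 = 3, 1/(2c_s^4) = 4.5."""
    usq = ux*ux + uy*uy
    feq = []
    for (ex, ey), w in zip(E, W):
        eu = ex*ux + ey*uy
        feq.append(w * rho * (1 + 3*eu + 4.5*eu*eu - 1.5*usq))
    return feq

def macroscopic(f):
    """Equations (4)-(5): rho = sum_i f_i and rho*u = sum_i e_i f_i."""
    rho = sum(f)
    ux = sum(fi*ex for fi, (ex, ey) in zip(f, E)) / rho
    uy = sum(fi*ey for fi, (ex, ey) in zip(f, E)) / rho
    return rho, ux, uy

def bgk_collide(f, tau):
    """SRT-BGK collision of equation (1) at a single lattice node."""
    rho, ux, uy = macroscopic(f)
    feq = equilibrium(rho, ux, uy)
    return [fi - (fi - fe)/tau for fi, fe in zip(f, feq)]
```

A convenient sanity check for any implementation is that the equilibrium distribution is a fixed point of the collision step and that equations (4)-(5) recover exactly the density and velocity used to build it.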
  • 3. $\rho(\mathbf{x}, t) = \sum_i f_i(\mathbf{x}, t)$   (4)

$\rho(\mathbf{x}, t)\,\mathbf{u}(\mathbf{x}, t) = \sum_i \mathbf{e}_i\, f_i(\mathbf{x}, t)$   (5)

2.2. Multiphase and Multicomponent Lattice Boltzmann Model

Multiphase and multicomponent (MPMC) models allow performing complex simulations involving several physical components. In this section, an MPMC-LBM model based on the work achieved by Bao & Schaeffer [4] is presented. It includes several interaction forces based on a pseudo-potential, which is calculated for a component $\alpha$ as follows:

$\psi_\alpha(\mathbf{x}) = \sqrt{\dfrac{2\,(p_\alpha - c_s^2\,\rho_\alpha)}{c\, g_{\alpha\alpha}}}$   (6)

The term $p_\alpha$ is the pressure term. It is calculated by the use of an equation of state such as the Peng-Robinson equation:

$p_\alpha = \dfrac{\rho_\alpha R T}{1 - b_\alpha \rho_\alpha} - \dfrac{a_\alpha(T)\,\rho_\alpha^2}{1 + 2 b_\alpha \rho_\alpha - b_\alpha^2 \rho_\alpha^2}$   (7)

Internal forces are then computed. The internal fluid interaction force is expressed as follows [2][3]:

$\mathbf{F}_{\alpha\alpha}(\mathbf{x}) = -\beta\, g_{\alpha\alpha}\,\psi_\alpha(\mathbf{x}) \sum_i w_i\,\psi_\alpha(\mathbf{x} + \mathbf{e}_i \Delta t)\,\mathbf{e}_i - \frac{1-\beta}{2}\, g_{\alpha\alpha} \sum_i w_i\,\psi_\alpha^2(\mathbf{x} + \mathbf{e}_i \Delta t)\,\mathbf{e}_i$   (8)

The value $\beta$ is a weighting term generally fixed to 1.16 according to [2][3]. The inter-component force is also introduced as follows [4]:

$\mathbf{F}_{\alpha\bar\alpha}(\mathbf{x}) = -g_{\alpha\bar\alpha}\,\psi_\alpha(\mathbf{x}) \sum_i w_i\,\psi_{\bar\alpha}(\mathbf{x} + \mathbf{e}_i \Delta t)\,\mathbf{e}_i$   (9)

Additional forces can be added into the simulation code, such as the gravity force or a fluid-structure interaction force [3]. The incorporation of the force term is then achieved by a modified collision operator expressed as follows:

$f_{\alpha,i}(\mathbf{x} + \mathbf{e}_i \Delta t,\, t + \Delta t) = f_{\alpha,i}(\mathbf{x}, t) - \frac{1}{\tau_\alpha}\left[f_{\alpha,i}(\mathbf{x}, t) - f_{\alpha,i}^{eq}(\mathbf{x}, t)\right] + \Delta f_{\alpha,i}(\mathbf{x}, t)$   (10)

$\Delta f_{\alpha,i}(\mathbf{x}, t) = f_{\alpha,i}^{eq}(\rho_\alpha,\, \mathbf{u}_\alpha + \Delta\mathbf{u}_\alpha) - f_{\alpha,i}^{eq}(\rho_\alpha,\, \mathbf{u}_\alpha)$   (11)

$\Delta\mathbf{u}_\alpha = \dfrac{\mathbf{F}_\alpha\, \Delta t}{\rho_\alpha}$   (12)

Macroscopic quantities for each component are finally computed by the use of equations (4) and (5).

3. LATTICE BOLTZMANN METHODS AND GPUS

The massive parallelism of GPUs has quickly been exploited in order to perform fast simulations [7][8] using the lattice Boltzmann method. Recent works have shown that GPUs are also used with multiphase and multicomponent models [16][14]. The main aspects of GPU optimizations are
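As an illustration of equations (6)-(7), the sketch below evaluates a Peng-Robinson pressure and the resulting pseudo-potential for a single component. The parameter values `a`, `b`, `R` and the coupling `g` are arbitrary lattice-unit choices for demonstration (with the lattice constant $c$ of equation (6) set to 1); they are not values from the paper.

```python
import math

def peng_robinson_pressure(rho, T, a=2/49, b=2/21, R=1.0):
    """Peng-Robinson equation of state (equation (7)) for one component.
    a, b, R are illustrative lattice-unit values; the temperature
    dependence a(T) is folded into the constant a for brevity."""
    return rho*R*T/(1 - b*rho) - a*rho*rho/(1 + 2*b*rho - b*b*rho*rho)

def pseudo_potential(rho, T, g=-1.0, cs2=1/3):
    """Pseudo-potential psi of equation (6) with c = 1:
    psi = sqrt(2 (p - rho c_s^2) / g). A negative coupling g is
    assumed here so the square-root argument stays positive when
    p < rho c_s^2."""
    p = peng_robinson_pressure(rho, T)
    return math.sqrt(2*(p - rho*cs2)/g)
```

By construction, the identity $p_\alpha = c_s^2 \rho_\alpha + \frac{g}{2}\psi_\alpha^2$ must hold, which gives an easy unit test for the pair of functions.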
  • 4. The main aspects of GPU optimizations are decomposed into several categories [10][9]: thread-level parallelism, GPU memory access, overlap of memory transfers with computations, etc. Data coalescence is needed in order to optimize global memory bandwidth; this implies several conditions as described in [9]. Concerning LBM, an adapted data structure such as the Structure of Arrays (SoA) has been well studied and has proven to be efficient on GPU [7].

Several access patterns are also described in the literature. The first one, named the A-B access pattern, consists of using two calculation grids in GPU global memory in order to manage the temporal and spatial dependency of the data (equation (10)). Simulation steps alternate between reading distribution functions from A and writing them to B, and reading from B and writing to A reciprocally. This pattern is commonly used and offers very good performance [10][11][9] on a single GPU. Several techniques are however presented in the literature in order to significantly reduce the memory cost without loss of information, such as grid compression [6], the swap algorithm [6] or the A-A pattern technique [12]. In this paper, the A-A pattern technique is used in order to save memory despite the spatial and temporal data dependency.

Recent works involving implementations of the lattice Boltzmann method on a single node composed of several GPUs are also available. A first solution, proposed in [13][17], consists in dividing the entire simulation domain into subdomains according to the number of GPUs and running the LBM kernels on each subdomain in parallel. CPU threads are used to handle each CUDA context. Communications between subdomains are performed using zero-copy memory transfers. The zero-copy feature allows efficient communications through a mapping between CPU and GPU pointers. Data must however be read and written only once in order to obtain good performance. Some approaches have finally been proposed recently to perform simulations on several nodes constituted of multiple GPUs by the use of MPI in combination with CUDA [19][18][21][15]. In our case, we only dispose of one computing node with multiple GPUs, thus we do not focus on these architectures in this paper.

4. A PROGRESSIVE MESH ALGORITHM FOR LATTICE BOLTZMANN METHODS ON SINGLE-NODE MULTI-GPU ARCHITECTURES

4.1. Motivation

Works described in the previous section consider that the entire simulation domain is meshed and divided into subdomains according to the number of GPUs, as shown in Figure 2. All subdomains are therefore calculated in parallel.

Figure 2: Division of the simulation domain: the entire domain is decomposed into subdomains according to the number of GPUs.
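The benefit of the SoA layout mentioned in section 3 comes down to index arithmetic. In the hypothetical addressing below (our illustration, not the paper's code), consecutive nodes, which map to consecutive GPU threads, touch consecutive addresses under SoA but q-strided addresses under an Array-of-Structures (AoS) layout, which is why SoA yields coalesced global memory accesses:

```python
def aos_index(node, i, q=19):
    """Array-of-Structures: the q distribution functions of one node are
    contiguous, so adjacent threads (adjacent nodes) read with stride q."""
    return node * q + i

def soa_index(node, i, n_nodes):
    """Structure-of-Arrays: the i-th distribution function of all nodes is
    contiguous, so adjacent threads read adjacent addresses (coalesced)."""
    return i * n_nodes + node
```

For a D3Q19 grid, two neighboring threads reading the same distribution function are 19 elements apart in AoS but exactly 1 element apart in SoA.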
  • 5. In this paper, a new approach is considered. For most simulations, the entire domain generally does not require to be fully meshed at the beginning of the simulation. We therefore propose a new progressive mesh method in order to dynamically create the mesh according to the propagation of the simulated fluid. The idea consists in defining a first subdomain at the beginning of the simulation (Figure 3(a)). Several subdomains can then be created following the propagation of the fluid, as can be seen in Figure 3(b). This method finally adapts automatically to the simulation geometry (Figure 3(c)). This method is therefore applicable for any geometry and simulation. It is also a real advantage for an application on industrial structures mostly composed of pipes or channels. It can indeed save a lot of memory and calculations according to the geometry used for the simulation.

Figure 3: Example of a 3D simulation using the progressive mesh algorithm: (a) a first subdomain is created at the beginning of the simulation, (b) several subdomains are created following the propagation of fluid, (c) all subdomains are created and completely adapt to the simulation geometry.

The progressive mesh algorithm firstly needs the introduction of an adapted criterion in order to create a new subdomain in the simulation. This new subdomain then needs to be connected to existing subdomains. Calculations on a single-node multi-GPU architecture are finally an important optimization factor.

4.2. Definition of a Criterion for the Progressive Mesh

The definition of a criterion is an important aspect in order to efficiently create new subdomains for the simulation. This criterion needs to represent efficiently the propagation of fluid. The fluid velocity seems like a good choice in order to define an efficient criterion. The difference of the fluid velocity between two iterations is considered in order to observe efficiently the fluid dispersion. Our criterion is therefore defined as follows for the component $\alpha$:

$\|C_\alpha(\mathbf{x})\|_2 = \|\mathbf{u}_\alpha(\mathbf{x}, t + \Delta t) - \mathbf{u}_\alpha(\mathbf{x}, t)\|_2$   (13)

The symbol $\|\cdot\|_2$ stands for the Euclidean norm in this paper. This criterion needs to be calculated for all active subdomains on the boundaries. If the criterion exceeds an arbitrary threshold S on a boundary, a new subdomain is created next to this boundary, as shown in Figure 4. The value S is generally fixed to 0 in this paper in order to detect any change of velocity on the boundaries of each subdomain.
  • 6. Figure 4: The criterion $\|C_\alpha(\mathbf{x})\|_2$ is calculated on the boundary. If the criterion exceeds the threshold S, then a new subdomain is created next to the boundary.

4.3. Algorithm

This section describes the algorithm for the multiphase and multicomponent lattice Boltzmann model with the inclusion of our progressive mesh algorithm. It is also useful in order to summarize the previous sections. The calculation of the criterion and the creation of new subdomains are achieved at the last step of the algorithm in order not to disturb the simulation process. Figure 5 describes our resulting algorithm.

Figure 5: Algorithm for the multiphase and multicomponent lattice Boltzmann model with the inclusion of our progressive mesh method. For colors, please refer to the PDF version of this paper.
4.4. Integration on Single-Node Multi-GPU Architectures

Efficiency of inter-GPU communications is surely the most difficult task in order to obtain good performance. Indeed, our simulations are composed of numerous subdomains which are added dynamically. The repartition of GPUs to the different subdomains is an important factor of optimization. An efficient assignment can have an important impact on the performance of the simulation: it can reduce the communication time between subdomains and so reduce the simulation time.

4.4.1. Overlap Communications with Computations

Several data exchanges are needed for this type of model. The computation of the inter-component interaction implies having access to neighboring values of the pseudo-potential. The propagation step of the LBM also implies communicating several distribution functions between GPUs (Figure 6). Aligned buffers may be used for data transactions.

In order to obtain a simulation time as short as possible, it is necessary to overlap data transfers with algorithm calculations. Indeed, overlapping computations and communications allows a significant performance gain by reducing the time spent waiting for data. The idea is to separate the computation process into two steps: boundary calculations and interior calculations. Computations on the needed boundaries are done first. Communications between neighboring subdomains are then performed while the interior is being computed. The different communications are thus carried out simultaneously with calculations, which allows good efficiency.

In most multi-GPU implementations of the lattice Boltzmann method, memory is transferred via zero-copy transactions to page-locked memory, which allows good overlapping between communications and computations [17][13][15]. A different approach is studied in this paper concerning inter-GPU communications. In most recent HPC architectures, several GPUs can be connected to the same PCIe bus. To improve performance, Nvidia launched GPUDirect with CUDA 4.0.

Figure 6: Schematic example for the communication of distribution functions in 2D: red arrows correspond to values to communicate between subdomains. For colors, please refer to the PDF version of this paper.
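The boundary/interior decomposition described above can be sketched in host code. In the following Python sketch a thread stands in for an asynchronous CUDA stream, and the function names (`compute_boundaries`, `exchange_halos`, `compute_interior`) are illustrative placeholders, not names from the original implementation.

```python
import threading
import time

log = []  # records the order of operations for illustration

def compute_boundaries(sub):
    log.append("boundaries")

def exchange_halos(sub, neighbors):
    # Stand-in for an asynchronous transfer (cudaMemcpyAsync on a
    # dedicated stream, or a Peer-to-Peer copy between GPUs).
    time.sleep(0.01)
    log.append("halos_exchanged")

def compute_interior(sub):
    log.append("interior")

def lbm_step(subdomain, neighbors):
    """One LBM time step with communication/computation overlap."""
    # 1) Boundary layers first: their post-collision distributions
    #    are exactly what the neighboring subdomains need.
    compute_boundaries(subdomain)
    # 2) Launch the halo exchange asynchronously.
    comm = threading.Thread(target=exchange_halos, args=(subdomain, neighbors))
    comm.start()
    # 3) Interior cells are updated while the transfer is in flight.
    compute_interior(subdomain)
    # 4) Synchronize before the next time step may start.
    comm.join()
    return list(log)
```

On a GPU the same structure is obtained with two kernel launches (boundary, then interior) and an asynchronous copy enqueued between them.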
This technology allows performing Peer-to-Peer transfers and memory accesses between two compatible GPUs. The idea is to perform data transfers using Peer-to-Peer transactions between compatible GPUs and zero-copy transactions for the others. This method allows communicating data while bypassing the CPU, and therefore accelerates the transfer (Figure 7). This improves the performance and the efficiency of the simulation.

Figure 7: GPUDirect technology (source Nvidia).

4.4.2. Optimization of Data Transfer between GPUs

The repartition of GPUs is an important factor of optimization for this type of application. The communication cost is generally a bottleneck for exchanges between subdomains. The first case concerns communications between subdomains belonging to the same GPU: in this case, the communication cost is extremely low because communications are performed on the same device. The other cases concern communications between subdomains assigned to different GPUs, which are made either through Peer-to-Peer transfers or through zero-copy transactions. The goal is to optimize dynamically the repartition of the GPUs.

For a new subdomain S, a cost function f(S) is evaluated, where S' denotes the subdomains neighboring S. It compares the different communication paths between the subdomain and its neighbors, with an arbitrary weighting favoring Peer-to-Peer communications. The function f(S) therefore estimates the communication cost of assigning a given GPU. It is calculated for all available GPUs, and the GPU with the minimum value is assigned to the subdomain. In order to keep load balancing, all GPUs have to be assigned dynamically, and the same GPU cannot be assigned twice as long as other GPUs remain unassigned. Figure 8 explains this procedure via a schematic example.
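A hedged sketch of this assignment heuristic follows. The exact cost function of the paper is not reproduced here, so the weights below (0 for the same GPU, 0.5 for a Peer-to-Peer link, 1 otherwise) are illustrative assumptions; only the overall scheme (evaluate every still-unassigned GPU, keep the minimum) follows the description above.

```python
def assign_gpu(neighbors, sub_to_gpu, free_gpus, p2p):
    """Return the free GPU with the lowest estimated communication
    cost toward the new subdomain's already-assigned neighbors.

    neighbors  : ids of neighboring subdomains
    sub_to_gpu : dict mapping subdomain id -> GPU id
    free_gpus  : GPUs not yet used in the current assignment round
                 (enforces the load-balancing rule: a GPU is not
                 reused while other GPUs remain unassigned)
    p2p        : p2p[a][b] is True if GPUs a and b can use
                 Peer-to-Peer transfers
    """
    def cost(gpu):
        total = 0.0
        for nb in neighbors:
            other = sub_to_gpu[nb]
            if other == gpu:
                total += 0.0      # same GPU: negligible cost
            elif p2p[gpu][other]:
                total += 0.5      # P2P link favored (assumed weight)
            else:
                total += 1.0      # zero-copy through host memory
        return total

    return min(free_gpus, key=cost)
```

For instance, with two P2P-capable pairs of GPUs {0, 1} and {2, 3}, a new subdomain whose neighbor is held by GPU 0 is given GPU 1 when only GPUs 1 and 3 are still free.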
Figure 8: Schematic example in 2D for the optimization of the repartition of GPUs. The function f(S) is calculated for all available GPUs and the GPU which has the minimum value is chosen. For colors, please refer to the PDF version of this paper.

5. RESULTS AND PERFORMANCE

5.1. Hardware

A machine with 8 NVIDIA Tesla C2050 graphics cards, based on the Fermi architecture, is used to perform the simulations. Table 1 describes some Tesla C2050 hardware specifications. The Peer-to-Peer communication capabilities of our architecture are described in Figure 9.

Table 1: Tesla C2050 hardware specifications

CUDA compute capability: 2.0
Total amount of global memory: 2687 MBytes
(14) Multiprocessors, (32) scalar processors/MP: 448 CUDA cores
GPU clock rate: 1147 MHz
L2 cache size: 786432 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768

Figure 9: Peer-to-Peer communications accessibility for our architecture.
5.2. Simulations

Two simulations are considered on large simulation domains in order to evaluate the performance of our contribution. Both simulations include the use of two physical components. The geometry however differs between these simulations. The first simulation is based on a simple geometry composed of 1024*256*256 calculation cells, where a fluid fills the whole simulation domain during the simulation (Figure 10). The second simulation is based on a complex geometry composed of 1024*1024*128 calculation cells, where the fluid moves within channels (Figure 11).

Figure 10: A two-component leakage simulation on a simple geometry with a domain size of 1024*256*256 cells.

Figure 11: A two-component leakage simulation on a complex geometry composed of channels with a domain size of 1024*1024*128 cells.

5.3. Performance

This section deals with the performance obtained by our method. A comparison between the progressive mesh algorithm and the static mesh method generally used in the literature is shown. The optimization of the repartition of GPUs on subdomains is also studied. The performance metric generally used for the lattice Boltzmann method is the Million Lattice node Updates Per Second (MLUPS). It is calculated as follows:

MLUPS = (number of lattice cells * number of iterations) / (simulation duration * 10^6)   (16)

The classical static mesh approach generally used in the literature consists in equally dividing the simulation domain according to the number of GPUs. It generally offers good performance, as communications can be overlapped with calculations. The use of Peer-to-Peer communications also has a beneficial effect on the performance, as shown on Figure 13. Peer-to-Peer communications allow obtaining a performance gain between 8 and 12% according to the number of GPUs used for the simulation described in Figure 10.
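Equation (16) translates directly into a small helper. The timing value used in the example below is a made-up illustration, not a measured result from the paper.

```python
def mlups(nx, ny, nz, iterations, seconds):
    """Million Lattice node Updates Per Second (Eq. 16):
    total cell updates divided by elapsed time, in millions."""
    return (nx * ny * nz * iterations) / (seconds * 1.0e6)

# Hypothetical example: the 1024*256*256 domain of Figure 10,
# 1000 time steps completed in 100 s.
print(mlups(1024, 256, 256, 1000, 100.0))  # about 671 MLUPS
```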
Zero-copy communications offer good scaling, but an almost perfect scaling is obtained with the inclusion of Peer-to-Peer communications, as shown on Figure 12.

The inclusion of the progressive mesh also has an important beneficial effect on the simulation performance. Subdomains of size 128*128*128 are considered for these simulations. Figures 13 and 14 describe performance in terms of calculations and memory consumption for the simulation presented on Figure 10. Note that the progressive mesh algorithm obtains excellent performance at the beginning of the simulation. The addition of subdomains during the simulation has for consequence a decrease of performance until the convergence of the simulation. In this particular case, the whole simulation domain is meshed at the end of the simulation, as shown on Figure 14, which leads to a very slight decrease of performance compared to the static mesh. In terms of memory consumption, new subdomains appear quickly, so that the entire simulation domain is in memory after a few iterations.

Figure 12: Comparison of performance between Peer-to-Peer communications and zero-copy communications for the simulation shown on Figure 10.

Figure 13: Comparison of performance between the progressive mesh method and the static mesh method for the simulation shown on Figure 10. The inclusion of the optimization for GPU assignment is also presented.
Figure 13 also compares the performance between two different assignments of GPUs. The first one is a simple assignment which gives a new subdomain the first available GPU. The second one uses the optimization method presented in section 4.4.2. The comparison of these two methods leads to an important difference of performance: a difference of approximatively 30% is noted at the convergence of this simulation between the two approaches. This difference is mostly due to the fact that the communication cost is higher for a simple assignment than for an optimized one. Since subdomains are added dynamically and connected to each other, it is therefore important to optimize these communications in order to reduce the simulation time.

The same comparison is also done for the simulation presented on Figure 11, as shown on Figures 15 and 16. The main difference in this situation is the geometry of the simulation, which is more complex and channelized. Physical simulations on channelized geometries are especially present on industrial structures. In this case, the progressive mesh method shows excellent results. In terms of memory, this method is easily able to simulate on a global simulation domain of size 1024*1024*128 and more, while the static mesh method is unable to perform the simulation: the amount of memory needed is too important for this simulation. Figure 15 shows the evolution of memory consumption during the simulation. The memory cost at the convergence of the simulation is far lower than with the static mesh method. A gain of approximatively 50% of memory is noted for this particular simulation. This is due to the fact that the progressive mesh method automatically adapts to the evolution of the simulation, so that only the needed zones of the global simulation domain are meshed.

Figure 14: Comparison of memory consumption between the progressive mesh method and the static mesh method for the simulation shown on Figure 10.
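A rough memory estimate illustrates why a fully static mesh cannot hold the channelized domain. The storage layout assumed below (two components, a D3Q19 lattice, single-precision values, double-buffered distributions) is an illustrative assumption; the paper does not detail its exact per-cell footprint.

```python
def domain_memory_gb(nx, ny, nz, components=2, q=19,
                     bytes_per_value=4, buffers=2):
    """Estimated distribution-function storage (in GB) for a fully
    meshed domain; macroscopic fields and ghost layers are ignored."""
    per_cell = components * q * bytes_per_value * buffers
    return nx * ny * nz * per_cell / 2**30

full_domain = domain_memory_gb(1024, 1024, 128)   # 38.0 GB
aggregate_gpu = 8 * 2687 / 1024                   # ~21.0 GB on 8 Tesla C2050
print(full_domain, aggregate_gpu, full_domain > aggregate_gpu)
```

Under these assumptions the fully meshed domain needs roughly 38 GB of distribution storage, well above the roughly 21 GB available across the 8 Tesla C2050 cards, whereas the progressive mesh only has to hold the subdomains actually reached by the fluid.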
Figure 15: Comparison of memory consumption between the progressive mesh method and the static mesh method for the simulation shown on Figure 11.

The comparison of the repartitions of GPUs is also described in Figure 16. An important performance gain (19%) is still noted for this simulation. This proves that a dynamic optimization method is important in order to obtain good performance. Moreover, the fact that the domain does not need to be fully meshed brings an important gain in performance. The geometry has therefore an important impact on the performance of the progressive mesh method.

Figure 16: Comparison of performance between a simple repartition of GPUs and an optimized assignment of GPUs for the simulation shown on Figure 11.

6. CONCLUSION

In this paper, an efficient progressive mesh algorithm for physical simulations using the lattice Boltzmann method is presented. This progressive mesh method can be a useful tool in order to perform several types of physical simulations. Its main advantage is that subdomains are automatically added to the simulation by the use of an adapted criterion. This method is also able to save a lot of memory and calculations in order to perform simulations on large installations.
The integration of the progressive mesh method on single-node multi-GPU architectures is also treated. A dynamic optimization of the repartition of GPUs to subdomains is an important factor in order to obtain good performance. The combination of all these contributions therefore allows performing fast physical simulations on all types of geometry. The progressive mesh method is an interesting alternative because it allows obtaining similar or better performance than the usual static mesh method.

The progressive mesh algorithm is however limited by the memory of the GPUs, which is generally far smaller than the CPU RAM. The creation of new subdomains is indeed possible only while there is a sufficient amount of memory on the GPUs. Extensions of this work to cases that require more memory than all GPUs can handle are now under investigation. Data transfer optimizations with the CPU host will therefore be essential to keep good performance.

ACKNOWLEDGEMENTS

This work has been made possible thanks to a collaboration between academic and industrial groups, gathered by the INNOCOLD association.

REFERENCES

[1] B. Chopard, J.L. Falcone, J. Latt, The lattice Boltzmann advection diffusion model revisited, The European Physical Journal - Special Topics, Vol. 171, pp. 245-249, 2009.
[2] S. Gong, P. Cheng, Numerical investigation of droplet motion and coalescence by an improved lattice Boltzmann model for phase transitions and multiphase flows, Computers & Fluids, Vol. 53, pp. 93-104, 2012.
[3] S. Gong, P. Cheng, A lattice Boltzmann method for liquid vapor phase change heat transfer, Computers & Fluids, Vol. 54, pp. 93-104, 2012.
[4] J. Bao, L. Schaeffer, Lattice Boltzmann equation model for multicomponent multi-phase flow with high density ratios, Applied Mathematical Modelling, 2012.
[5] NVIDIA, NVIDIA CUDA C Programming Guide, NVIDIA Corporation, 2011.
[6] M. Wittmann, T. Zeiser, G. Hager, G. Wellein, Comparison of different propagation steps for lattice Boltzmann methods, Computers and Mathematics with Applications, Vol. 65, pp. 924-935, 2013.
[7] J. Tölke, Implementation of a lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA, Computing and Visualization in Science, pp. 1-11, 2008.
[8] J. Tölke, M. Krafczyk, TeraFLOP computing on a desktop PC with GPUs for 3D CFD, International Journal of Computational Fluid Dynamics, 22(7), pp. 443-456, 2008.
[9] F. Kuznik, C. Obrecht, G. Rusaouën, J-J. Roux, LBM based flow simulation using GPU computing processor, Computers and Mathematics with Applications, 27, 2009.
[10] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, A new approach to the lattice Boltzmann method for graphics processing units, Computers and Mathematics with Applications, 61, pp. 3628-3638, 2011.
[11] P.R. Rinaldi, E.A. Dari, M.J. Vénere, A. Clausse, A Lattice-Boltzmann solver for 3D fluid simulation on GPU, Simulation Modelling Practice and Theory, 25, pp. 163-171, 2012.
[12] P. Bailey, J. Myre, S. Walsh, D. Lilja, M. Saar, Accelerating lattice Boltzmann fluid flows using graphics processors, International Conference on Parallel Processing, pp. 550-557, 2009.
[13] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Multi-GPU implementation of the lattice Boltzmann method, Computers and Mathematics with Applications, 80, pp. 269-275, 2013.
[14] X. Li, Y. Zhang, X. Wang, W. Ge, GPU-based numerical simulation of multi-phase flow in porous media using multiple-relaxation-time lattice Boltzmann method, Chemical Engineering Science, Vol. 102, pp. 209-219, 2013.
[15] M. Januszewski, M. Kostur, Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method, Computer Physics Communications, Vol. 185, pp. 2350-2368, 2014.
[16] F. Jiang, C. Hu, Numerical simulation of a rising CO2 droplet in the initial accelerating stage by a multiphase lattice Boltzmann method, Applied Ocean Research, Vol. 45, pp. 1-9, 2014.
[17] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Multi-GPU implementation of a hybrid thermal lattice Boltzmann solver using the TheLMA framework, Computers and Fluids, Vol. 80, pp. 269-275, 2013.
[18] C. Rosales, Multiphase LBM distributed over multiple GPUs, CLUSTER'11 Proceedings of the 2011 IEEE International Conference on Cluster Computing, pp. 1-7, 2011.
[19] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Scalable lattice Boltzmann solvers for CUDA GPU clusters, Parallel Computing, Vol. 39, pp. 259-270, 2013.
[20] J. Habich, C. Feichtinger, H. Köstler, G. Hager, G. Wellein, Performance engineering for the lattice Boltzmann method on GPGPUs: Architectural requirements and performance results, Computers & Fluids, Vol. 80, pp. 276-282, 2013.
[21] C. Feichtinger, J. Habich, H. Köstler, U. Rüde, T. Aoki, Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU-GPU clusters, Parallel Computing, 2014.

AUTHORS

Julien Duchateau is a PhD student in computer science at the Université du Littoral Côte d'Opale in France. His main research interests are massive parallelism on CPUs and GPUs, physical simulations and computer graphics.

François Rousselle is an associate professor in computer science at the Université du Littoral Côte d'Opale in France. His main research interests are computer graphics, physical simulations, virtual reality and massive parallelism.

Nicolas Maquignon is a PhD student in simulation and numerical physics at the Université du Littoral Côte d'Opale. His main research interests are numerical physics, numerical mathematics and numerical modeling.

Christophe Renaud is a professor in computer science at the Université du Littoral Côte d'Opale in France. His main research interests are computer graphics, virtual reality, physical simulations and massive parallelism.

Gilles Roussel is an associate professor in automatic control at the Université du Littoral Côte d'Opale in France. His main research interests are automatic control, signal processing, physical simulations and industrial computing.