A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN ME... (ijdpsjournal)
In this paper, a new progressive mesh algorithm is introduced to perform fast physical simulations with a lattice Boltzmann method (LBM) on a single-node multi-GPU architecture. The algorithm automatically meshes the simulation domain according to the propagation of fluids, and can serve several types of physical simulation. We combine it with a multiphase and multicomponent lattice Boltzmann model (MPMC-LBM), which can handle various kinds of simulation on complex geometries. Combined with the massive parallelism of GPUs [5], the algorithm achieves very good performance compared with the static mesh method used in the literature. Several simulations are presented to evaluate the algorithm.
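The collision-and-streaming cycle at the heart of any LBM solver can be sketched in a few lines. The toy below is a plain D1Q3 diffusion lattice with BGK collision on a periodic 1-D domain; it only illustrates the generic LBM update loop, not the MPMC-LBM model or the progressive mesh algorithm of the paper, and the lattice size N, relaxation time TAU and step count are arbitrary choices for the example.

```python
# Minimal D1Q3 lattice Boltzmann sketch for pure diffusion on a periodic
# 1-D lattice. Illustrative toy only: not the paper's MPMC-LBM or its
# progressive mesh; N, TAU and the step count are assumptions.

N = 32            # lattice sites
TAU = 1.0         # BGK relaxation time
W = [2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0]   # weights for velocities 0, +1, -1

# Distribution functions f[i][x]; start with a density spike in the middle.
f = [[W[i] * (2.0 if x == N // 2 else 1.0) for x in range(N)] for i in range(3)]

def density(f):
    """Macroscopic density at each site: rho(x) = sum_i f_i(x)."""
    return [f[0][x] + f[1][x] + f[2][x] for x in range(N)]

def step(f):
    """One LBM step: BGK collision towards equilibrium, then streaming."""
    rho = density(f)
    # Collision: relax each population towards its equilibrium w_i * rho.
    for i in range(3):
        for x in range(N):
            feq = W[i] * rho[x]
            f[i][x] += (feq - f[i][x]) / TAU
    # Streaming: velocity +1 shifts right, -1 shifts left (periodic domain).
    f[1] = [f[1][(x - 1) % N] for x in range(N)]
    f[2] = [f[2][(x + 1) % N] for x in range(N)]
    return f

total_before = sum(density(f))
for _ in range(50):
    f = step(f)
total_after = sum(density(f))
```

Because collision conserves density and streaming only permutes populations, total mass is preserved exactly while the initial spike flattens out; a real solver adds forcing, boundaries and (here) the adaptive meshing.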
IMPACT OF PARTIAL DEMAND INCREASE ON THE PERFORMANCE OF IP NETWORKS AND RE-OP... (EM Legacy)
12th GI/ITG CONFERENCE ON MEASURING, MODELLING AND EVALUATION OF COMPUTER AND COMMUNICATION SYSTEMS 3rd POLISH-GERMAN TELETRAFFIC SYMPOSIUM
PGTS 2004
Eueung Mulyana, Ulrich Killat
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEM (IJCSEA Journal)
Task assignment is one of the most challenging problems in a distributed computing environment. An optimal task assignment guarantees minimum turnaround time for a given architecture. Several approaches to optimal task assignment have been proposed, ranging from graph-partitioning-based tools to heuristic graph matching. With heuristic graph matching, it is often impossible to obtain an optimal task assignment for practical test cases within an acceptable time limit. In this paper, we parallelize the basic heuristic graph-matching algorithm for task assignment, which is suitable only for cases where the processors and interprocessor links are homogeneous. The proposal is a derivative of the basic task assignment methodology using heuristic graph matching. The results show that near-optimal assignments are obtained much faster than with the sequential program in all cases, with reasonable speed-up.
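The exhaustive baseline that makes heuristics necessary is easy to state: try every mapping of tasks to processors and charge execution cost plus communication cost for task pairs placed on different processors. The sketch below does exactly that; it is a brute-force illustration of the problem, not the paper's heuristic graph-matching algorithm, and the cost numbers in the test are made up.

```python
# Brute-force optimal task assignment: enumerate every task-to-processor
# mapping and keep the cheapest. Exponential in the number of tasks, which
# is why heuristic graph matching (and its parallelisation) matters.
from itertools import product

def optimal_assignment(n_tasks, n_procs, exec_cost, comm):
    """exec_cost[t][p]: cost of task t on processor p.
    comm: dict {(t1, t2): cost} charged when t1 and t2 sit on
    different processors (homogeneous links assumed)."""
    best, best_map = float('inf'), None
    for assign in product(range(n_procs), repeat=n_tasks):
        cost = sum(exec_cost[t][assign[t]] for t in range(n_tasks))
        cost += sum(c for (a, b), c in comm.items() if assign[a] != assign[b])
        if cost < best:
            best, best_map = cost, assign
    return best, best_map
```

With a heavy communication edge between two tasks, the optimum co-locates them, which is the behaviour the heuristic search tries to reach without enumerating all n_procs**n_tasks mappings.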
RADIAL BASIS FUNCTION PROCESS NEURAL NETWORK TRAINING BASED ON GENERALIZED FR... (cseij)
For the learning problem of the Radial Basis Function Process Neural Network (RBF-PNN), an optimization training method based on a genetic algorithm (GA) combined with simulated annealing (SA) is proposed in this paper. By building a generalized Fréchet distance to measure similarity between time-varying function samples, the learning problem for the radial basis centre functions and connection weights is converted into training on the corresponding discrete sequence coefficients. The network training objective function is constructed according to the least-squares error criterion, and global optimization of the network parameters is carried out in the feasible solution space using the global search capability of the GA and the probabilistic jumping property of SA. The experimental results illustrate that the training algorithm improves network training efficiency and stability.
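The similarity measure underlying this training scheme has a simple discrete core. The paper builds a *generalized* Fréchet distance for time-varying function samples; the sketch below shows only the standard discrete Fréchet distance between two 1-D sample sequences, computed by dynamic programming, as an assumed simplification of that measure.

```python
# Discrete Fréchet distance between two sampled curves via dynamic
# programming: the minimum over monotone couplings of the maximum
# pointwise distance. 1-D samples with |.| as the metric are assumptions.

def discrete_frechet(p, q):
    """Coupling distance between sample sequences p and q."""
    n, m = len(p), len(q)
    d = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = abs(p[i] - q[j])
            if i == 0 and j == 0:
                d[i][j] = cost
            elif i == 0:
                d[i][j] = max(d[i][j - 1], cost)
            elif j == 0:
                d[i][j] = max(d[i - 1][j], cost)
            else:
                d[i][j] = max(min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]),
                              cost)
    return d[n - 1][m - 1]
```

In the GA+SA training loop, a distance like this scores how close a candidate centre function is to the sample functions, so that the search can operate on discrete sequence coefficients.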
DESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORM (sipij)
In this paper, a delay computation method for the common subexpression elimination (CSE) algorithm is implemented on the cyclotomic fast Fourier transform (CFFT). The CSE algorithm combined with the delay computation method is known as the Gate-Level Delay Computation with Common Subexpression Elimination (GLDC-CSE) algorithm. Common subexpression elimination is an effective optimization method for reducing adders in the cyclotomic Fourier transform. The delay computation method is based on a delay matrix and is suitable for implementation on computers. The gate-level delay computation method is used to find the critical-path delay, and it is analyzed on various finite field elements. The presented algorithm is demonstrated through a case study of the CFFT over a finite field. If the CFFT is implemented directly, the system has high additive complexity; by applying the GLDC-CSE algorithm, the additive complexity is reduced, along with the area and the area-delay product.
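The adder-reduction idea behind CSE can be shown on XOR sums of the kind that arise from binary matrix multiplication in a CFFT: if two output expressions share a pair of operands, computing that pair once saves a gate. The greedy pair-extraction sketch below is an assumed illustration of plain CSE only; it does not include the paper's delay matrix or gate-level delay computation.

```python
# Toy common-subexpression elimination over XOR sums. Each expression is a
# set of input terms ('x0', 'x1', ...); the most frequently shared operand
# pair is repeatedly replaced by a temporary ('t0', 't1', ...). Counts the
# 2-input XOR gates needed with and without sharing.
from collections import Counter
from itertools import combinations

def cse_cost(expressions):
    """Return (naive_gate_count, shared_gate_count).
    Input terms are assumed to be 'x*' names, so temporaries 't*' cannot
    collide with them."""
    exprs = [set(e) for e in expressions]
    naive = sum(len(e) - 1 for e in exprs if len(e) > 1)
    gates, next_var = 0, 0
    while True:
        pairs = Counter()
        for e in exprs:
            for a, b in combinations(sorted(e), 2):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:          # nothing shared: no further saving possible
            break
        t = "t%d" % next_var  # compute a XOR b once, reuse everywhere
        next_var += 1
        gates += 1
        for e in exprs:
            if a in e and b in e:
                e -= {a, b}
                e.add(t)
    # Remaining expressions each need a chain of 2-input XORs.
    gates += sum(len(e) - 1 for e in exprs if len(e) > 1)
    return naive, gates
```

For two sums x0+x1+x2 and x0+x1+x3, sharing t0 = x0+x1 drops the count from four XORs to three; the GLDC-CSE algorithm additionally tracks how such substitutions affect the critical-path delay.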
DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ... (ijma)
Thresholding operators have been used successfully for denoising signals, mostly in the wavelet domain.
These operators transform a noisy coefficient into a denoised coefficient with a mapping that depends on
signal statistics and the value of the noisy coefficient itself. This paper demonstrates that a polynomial
threshold mapping can be used for enhanced denoising of Principal Component Analysis (PCA) transform
coefficients. In particular, two polynomial threshold operators are used here to map the coefficients
obtained with the popular local pixel grouping method (LPG-PCA), which eventually improves the
denoising power of LPG-PCA. The method reduces the computational burden of LPG-PCA, by eliminating
the need for a second iteration in most cases. Quality metrics and visual assessment show the improvement.
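A threshold operator of the kind described above is just a scalar mapping applied to each transform coefficient. The sketch below uses one simple cubic polynomial form, chosen as an assumption for illustration; the paper's actual dual-operator design for PCA coefficients is not reproduced here.

```python
# A simple polynomial threshold operator eta(y): coefficients below the
# threshold T are shrunk by a cubic (they are noise-dominated), while
# coefficients at or above T pass through unchanged. The cubic form is an
# illustrative assumption, not the paper's operator.

def poly_threshold(y, T):
    """Cubic shrinkage inside (-T, T), identity outside."""
    if abs(y) >= T:
        return y
    # y**3 / T**2 matches the identity at |y| = T (continuity) and maps
    # small coefficients close to zero.
    return y ** 3 / (T ** 2)

# Apply the mapping to a small batch of transform coefficients.
denoised = [poly_threshold(c, 2.0) for c in [-3.0, -0.5, 0.1, 1.9, 4.0]]
```

Large coefficients survive untouched while small ones are strongly attenuated, which is the behaviour that improves the denoising power of the transform-domain method.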
IJRET : International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Performance comparison of row per slave and rows set per slave method in PVM ... (eSAT Journals)
Abstract: Parallel computing operates on the principle that large problems can often be divided into smaller ones, which are solved concurrently to save time by taking advantage of non-local resources and overcoming memory constraints. Multiplication of large matrices requires a lot of computation time. This paper deals with two methods for parallel matrix multiplication. The first divides the rows of one input matrix into sets of rows, based on the number of slaves, and assigns one row set to each slave for computation. The second assigns one row of one input matrix at a time to each slave, from the first row to the first slave, the second row to the second slave, and so on, looping back to the first slave once the last slave has been assigned a row, until all rows have been assigned. Both methods are implemented using the Parallel Virtual Machine (PVM), and the computation is performed for different matrix sizes over different numbers of nodes. The results show that the row per slave method gives the best computation time in PVM-based parallel matrix multiplication. Keywords: Parallel Execution, Cluster Computing, MPI (Message Passing Interface), PVM (Parallel Virtual Machine), RAM (Random Access Memory).
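Stripped of the PVM message passing, the two strategies compared in that abstract are pure index bookkeeping: contiguous blocks versus round-robin dealing. The sketch below shows both partitioning rules; the function names are assumptions, and no actual parallel execution is performed.

```python
# The two row-distribution strategies, as index bookkeeping only (no PVM):
# "rows set per slave" hands each slave one contiguous block of rows,
# "row per slave" deals rows out round-robin, one at a time.

def rows_set_per_slave(n_rows, n_slaves):
    """Contiguous blocks: slave k gets one chunk of ~n_rows/n_slaves rows."""
    base, extra = divmod(n_rows, n_slaves)
    out, start = [], 0
    for k in range(n_slaves):
        size = base + (1 if k < extra else 0)  # spread the remainder
        out.append(list(range(start, start + size)))
        start += size
    return out

def row_per_slave(n_rows, n_slaves):
    """Round-robin: row i goes to slave i % n_slaves."""
    out = [[] for _ in range(n_slaves)]
    for i in range(n_rows):
        out[i % n_slaves].append(i)
    return out
```

Both rules assign every row exactly once; the performance difference the paper measures comes from how the assignments interleave with message traffic and slave availability, not from the row counts themselves.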
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi... (ijcsa)
The computer industry has widely accepted that future performance increases must largely come from increasing the number of processing cores on a die, which has led to NoC processors. Task scheduling is one of the most challenging problems facing parallel programmers today and is known to be NP-complete. A good principle is space-sharing of cores, scheduling multiple DAGs simultaneously on a NoC processor. Hence the need to find the optimal number of cores for a DAG under a particular scheduling method, and further which region of cores on the NoC should be allotted to the DAG. In this work, a method is proposed to find a near-optimal minimal block of cores for a DAG on a NoC processor. Further, a time-efficient framework and three on-line block allotment policies for submitted DAGs are evaluated experimentally. The objective of the policies is to improve NoC throughput. The policies are tested on a simulator and found to deliver better performance than policies found in the literature.
Scalable and Adaptive Graph Querying with MapReduce (Kyong-Ha Lee)
We address, in this letter, the problem of processing multiple graph queries over a massive set of data graphs. As the number of data graphs grows rapidly, it is often hard to process graph queries with serial algorithms in a timely manner. We propose a distributed graph querying algorithm that employs feature-based comparison and a filter-and-verify scheme working on the MapReduce framework. Moreover, we devise an efficient scheme that adaptively tunes a proper feature size at runtime by sampling data graphs. With various experiments, we show that the proposed method outperforms conventional algorithms in terms of both scalability and efficiency.
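The filter-and-verify idea named in that abstract is independent of MapReduce: cheap features prune data graphs that cannot possibly contain the query before any expensive exact check runs. The single-process sketch below uses edge sets as the features, which is an assumed simplification; the letter's distribution across mappers/reducers and its adaptive feature sizing are omitted.

```python
# Filter-and-verify sketch for graph querying. Graphs are plain lists of
# undirected edges; a graph's "features" are its edge set. A data graph
# missing any query feature is filtered out; survivors are verified.

def features(graph):
    """Feature set of a graph given as a list of undirected edges."""
    return {frozenset(e) for e in graph}

def filter_and_verify(query, data_graphs):
    qf = features(query)
    matches = []
    for g in data_graphs:
        # Filter: cannot contain the query if any query feature is absent.
        if not qf <= features(g):
            continue
        # Verify: with plain edge lists, edge-set containment already IS
        # the exact test; a real system would run subgraph isomorphism here.
        matches.append(g)
    return matches
```

In the MapReduce setting, the filter step is what each worker applies to its partition of data graphs, so the costly verification only runs on the small surviving candidate set.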
Section Based Hex Cell Routing Algorithm (SBHCR) (IJCNCJournal)
A Hex-Cell network topology can be constructed from units of hexagon cells. It has been introduced in the literature as an interconnection network suitable for large parallel computers, able to connect a large number of nodes with three links per node. Although this topology exhibits attractive characteristics such as embeddability, symmetry, regularity, strong resilience, and simple routing, the previously suggested routing algorithms suffer from a high number of logical operations and the need to readdress nodes every time a new level is added to the network. This negatively impacts the performance of the network, as it increases the execution time of these algorithms. In this paper we propose an improved optimal point-to-point routing algorithm for the Hex-Cell network. The algorithm is based on dividing the Hex-Cell topology into six divisions, hence the name Section Based Hex-Cell Routing (SBHCR). The SBHCR algorithm is simple and preserves the advantage of the addressing scheme proposed for the Hex-Cell network. It does not depend on the depth of the network topology, which overcomes the issue of readdressing nodes every time a new level is added. Evaluation against two previously suggested routing algorithms has shown the superiority of SBHCR in terms of fewer logical operations.
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar... (Kyong-Ha Lee)
Subgraph matching is a fundamental operation for querying graph-structured data. Due to potential errors and noise in real-world graph data, exact subgraph matching is sometimes not appropriate in practice. In this paper we consider an approximate subgraph matching model that allows missing edges; under this model, approximate subgraph matching finds all occurrences of a given query graph in a database graph, allowing missing edges. A straightforward approach is to first generate query subgraphs of the query graph by deleting edges, and then perform exact subgraph matching for each query subgraph. We propose a sharing-based approach to approximate subgraph matching, called SASUM. Our method is based on the fact that query subgraphs overlap heavily; because of this, the matches of a query subgraph can be computed from the matches of a smaller query subgraph, which reduces the number of query subgraphs that need costly exact subgraph matching. Our method uses a lattice framework to identify sharing opportunities between query subgraphs. To further reduce the number of graphs that need exact subgraph matching, SASUM generates small base graphs that are shared by query subgraphs and chooses the minimum number of base graphs whose matches are used to derive the matching results of all query subgraphs. A comprehensive set of experiments shows that our approach outperforms the state-of-the-art approach by orders of magnitude in terms of query execution time.
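The "straightforward approach" that SASUM improves upon starts by enumerating every query subgraph obtainable by deleting up to k edges. The sketch below only does that enumeration, as a baseline illustration; SASUM's contribution is precisely to avoid running exact matching on most of these by exploiting their overlap, which is not shown here.

```python
# Enumerate the query subgraphs that approximate matching (with up to
# `max_missing` missing edges) must account for: every way of deleting at
# most k edges from the query's edge list.
from itertools import combinations

def query_subgraphs(edges, max_missing):
    """All edge subsets of `edges` with at most `max_missing` deletions,
    starting with the full query (zero deletions)."""
    out = []
    for k in range(max_missing + 1):
        for removed in combinations(range(len(edges)), k):
            out.append([e for i, e in enumerate(edges) if i not in removed])
    return out
```

A triangle query with one allowed missing edge already yields four subgraphs, and the count grows combinatorially with the edge budget, which is why sharing matches between overlapping subgraphs pays off.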
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU... (IJCNCJournal)
The rapid development of diverse computer architectures and hardware accelerators means that the design of parallel systems faces new problems arising from their heterogeneity. Our implementation of a parallel system called KernelHive allows applications to run efficiently in a heterogeneous environment consisting of multiple collections of nodes with different types of computing devices. The execution engine of the system is open to optimizer implementations focusing on various criteria. In this paper, we propose a new optimizer for KernelHive that utilizes distributed databases and performs data prefetching to optimize the execution time of applications that process large input data. Employing a versatile data management scheme that allows combining various distributed data providers, we propose using NoSQL databases for this purpose. We support our solution with experimental results from real executions of our OpenCL implementation of a regular expression matching application in various hardware configurations. Additionally, we propose a network-aware scheduling scheme for selecting hardware for the proposed optimizer and present simulations that demonstrate its advantages.
Hardware Architecture for Calculating LBP-Based Image Region Descriptors (Marek Kraft)
In this paper, an efficient hardware architecture enabling the computation of LBP-based image region descriptors is presented. The complete region descriptor is formed by combining individual local descriptors and arranging them into a grid, as typically used in object detection and recognition. The proposed solution performs massively parallel, pipelined computations, facilitating the processing of over two hundred VGA frames per second, and can easily be adapted to different window and grid sizes for use with other descriptors.
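The per-pixel operation that the hardware pipelines is small enough to state directly: each of the eight neighbours contributes one bit of the LBP code depending on whether it is at least as bright as the centre pixel. The scalar reference below fixes that definition; the neighbour ordering (clockwise from the top-left) is an assumption, and the paper's architecture computes many such codes in parallel rather than one at a time.

```python
# Basic 8-neighbour LBP code for one pixel of a 2-D list image: each
# neighbour at least as bright as the centre sets one bit of an 8-bit code.
# Neighbour ordering (clockwise from top-left) is an assumed convention.

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, y, x):
    """8-bit LBP code of interior pixel (y, x)."""
    c = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy][x + dx] >= c:
            code |= 1 << bit
    return code
```

Histograms of these codes over grid cells, concatenated, form the region descriptor the paper assembles in hardware.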
The paper presents a nature-inspired algorithm modeled on the big bang theory of evolution. The algorithm is simple with regard to its number of parameters. Embedded systems are powered by batteries, and extending battery operating time by reducing power consumption is vital. Embedded systems consume power while accessing memory during operation. An efficient method for power management is proposed in this work. The proposed method reduces the energy consumption of memories by 76% to 98% compared with other methods reported in the literature.
Cell Charge Approximation for Accelerating Molecular Simulation on CUDA-Enabl... (ijcax)
Methods for molecular dynamics (MD) simulations are investigated. MD simulation is a widely used computer simulation approach for studying the properties of molecular systems. Force calculation in MD is computationally intensive, and parallel programming techniques can be applied to improve those calculations. The major aim of this paper is to speed up MD simulation calculations using the General Purpose Graphics Processing Unit (GPGPU) computing paradigm, an efficient and economical way of parallel computing. To that end, we propose a method called cell charge approximation, which treats the electrostatic interactions in MD simulations and reduces the complexity of the force calculations.
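The complexity reduction can be illustrated without CUDA: instead of summing Coulomb interactions over all particle pairs, distant particles are lumped into per-cell aggregate charges interacting at their charge-weighted centres. The 1-D setting, the cell width, and the all-positive charges below are assumptions for the example, and this sketch keeps only the far-field lumped part; the paper's actual method (and its GPU mapping) also handles near-field pairs.

```python
# Cell-charge style approximation sketch in 1-D. exact_energy is the
# O(n^2) all-pairs Coulomb sum; cell_energy lumps charges into cells and
# interacts cell aggregates instead. Near-field (same-cell) pairs are
# deliberately dropped here; a real method treats those exactly.

def exact_energy(parts):
    """Sum of q_i * q_j / r_ij over all pairs of (position, charge)."""
    e = 0.0
    for i in range(len(parts)):
        for j in range(i + 1, len(parts)):
            (xi, qi), (xj, qj) = parts[i], parts[j]
            e += qi * qj / abs(xi - xj)
    return e

def cell_energy(parts, cell):
    """Lump charges into cells of width `cell`; interact the aggregates.
    Assumes charges of one sign, so cell charges never cancel to zero."""
    agg = {}
    for x, q in parts:
        k = int(x // cell)
        cx, cq = agg.get(k, (0.0, 0.0))
        agg[k] = (cx + q * x, cq + q)   # charge-weighted centre numerator
    cells = [(cx / cq, cq) for cx, cq in agg.values()]
    return exact_energy(cells)          # far fewer "particles" now
```

With n particles in m occupied cells, the pair sum shrinks from O(n^2) to O(m^2) for the far field, which is where the GPU-friendly speed-up comes from.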
A brief explanation of the tau-leaping process, parallel processing, and NVIDIA's CUDA architecture, and the use of cuTau-Leaping for the simulation of biological systems.
Hardback solution to accelerate multimedia computation through MGP in CMP (eSAT Publishing House)
An Adaptive Load Balancing Middleware for Distributed SimulationGabriele D'Angelo
The simulation is useful to support the design and performance evaluation of complex systems, possibly composed by a massive number of interacting entities. For this reason, the simulation of such systems may need aggregate computation and memory resources obtained by clusters of parallel and distributed execution units. Shared computer clusters composed of available Commercial-Off-the-Shelf hardware are preferable to dedicated systems, mainly for cost reasons. The performance of distributed simulations is influenced by the heterogeneity of execution units and by their respective CPU load in background. Adaptive load balancing mechanisms could improve the resources utilization and the simulation process execution, by dynamically tuning the simulation load with an eye to the synchronization and communication overheads reduction. In this work it will be presented the GAIA+ framework: a new load balancing mechanism for distributed simulation. The framework has been evaluated by performing testbed simulations of a wireless ad hoc network model. Results confirm the effectiveness of the proposed solutions.
Optimum capacity allocation of distributed generation units using parallel ps...eSAT Journals
Abstract This paper proposes the application of Parallel Particle Swarm Optimization (PPSO) technique to find the optimal sizing of multiple DG(Distributed Generation) units in the radial distribution network by reduction in real power losses and enhancement in voltage profile. Message passing interface (MPI) is used for the parallelization of PSO. The initial population of PSO algorithm has been divided between the processors at run time. The proposed technique is tested on standard 123-bus test system and the obtained results show that the simulation time is significantly reduced and is concluded that parallelization helps in enhancing the performance of basic PSO. The procedure has been implemented in an environment in which OpenDSS (Open Distribution System Simulator) is driven from MATLAB. An adaptive weight particle swarm optimization algorithm has been developed in MATLAB , parallelization is achieved using MATLABMPI and the unbalanced three-phase distribution load flow (DLF) has been performed using Electric Power Research Institute’s (EPRI) open source tool OpenDSS. Index Terms: Distributed Generation, Message Passing Interface, Optimal Placement, Parallel Particle Swarm Optimisation
Task Scheduling using Hybrid Algorithm in Cloud Computing Environmentsiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
A Review on Image Compression in Parallel using CUDAIJERD Editor
Now a days images are prodigiously and sizably voluminous in size. So, this size is not facilely fits in applications. For that image compression is require. Image Compression algorithms are more resource conserving. It takes more time to consummate the task of compression. Utilizing Parallel implementation of the compression algorithm this quandary can be overcome. CUDA (Compute Unified Device Architecture) Provides parallel execution for algorithm utilizing the multi-threading. CUDA is NVIDIA`s parallel computing platform. CUDA uses GPU (Graphical Processing Unit) for the parallel execution. GPU have the number of the cores for parallel execution support. Image compression can additionally implemented in parallel utilizing CUDA. There are number of algorithms for image compression. Among them DWT (Discrete Wavelet Transform) is best suited for parallel implementation due to its more mathematical calculation and good compression result compare to other methods. In this paper included different parallel techniques for image compression. With the actualizing this image compression algorithm over the GPU utilizing CUDA it will perform the operations in parallel. In this way, vast diminish in processing time is conceivable. Furthermore it is conceivable to enhance the execution of image compression algorithms.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
A PROGRESSIVE MESH METHOD FOR PHYSICAL SIMULATIONS USING LATTICE BOLTZMANN METHOD ON SINGLE-NODE MULTI-GPU ARCHITECTURES

International Journal of Distributed and Parallel Systems (IJDPS) Vol.6, No.5, September 2015
DOI: 10.5121/ijdps.2015.6501
Julien Duchateau¹, François Rousselle¹, Nicolas Maquignon¹, Gilles Roussel¹, Christophe Renaud¹

¹Laboratoire d'Informatique, Signal, Image de la Côte d'Opale
Université du Littoral Côte d'Opale, Calais, France
ABSTRACT
In this paper, a new progressive mesh algorithm is introduced to perform fast physical simulations with a lattice Boltzmann method (LBM) on a single-node multi-GPU architecture. This algorithm automatically meshes the simulation domain according to the propagation of fluids, and can be used for several types of physical simulations. We associate this algorithm with a multiphase and multicomponent lattice Boltzmann model (MPMC-LBM) because this model is able to perform various types of simulations on complex geometries. Combined with the massive parallelism of GPUs [5], the algorithm achieves very good performance in comparison with the static mesh method used in the literature. Several simulations are presented in order to evaluate the algorithm.
KEYWORDS
Progressive mesh, lattice Boltzmann method, single-node multi-GPU, parallel computing.
1. INTRODUCTION
The lattice Boltzmann method (LBM) is a computational fluid dynamics (CFD) method. It is a relatively recent technique which is able to approximate the Navier-Stokes equations by a collision-propagation scheme [1]. The lattice Boltzmann method however differs from standard approaches such as the finite element method (FEM) or the finite volume method (FVM) by its mesoscopic approach. It is an interesting alternative which is able to simulate complex phenomena on complex geometries. Its high degree of parallelism also makes this method attractive for performing simulations on parallel hardware. Moreover, the emergence of high-performance computing (HPC) architectures using GPUs [5] is also of great interest to many researchers.
Parallelization is indeed an important asset of the lattice Boltzmann method. However, performing simulations on large complex geometries can be very costly in computational resources. This paper introduces a new progressive mesh algorithm in order to perform physical simulations on complex geometries by the use of a multiphase and multicomponent lattice Boltzmann method. The algorithm is able to automatically mesh the simulation domain according to the propagation of fluids. Moreover, the integration of this algorithm on a single-node multi-GPU architecture is also an important matter which is studied in this paper. This method is an interesting alternative which, to the best of our knowledge, has never been exploited.
Section 2 first describes the multiphase and multicomponent lattice Boltzmann method. It is able to simulate the behavior of fluids with several physical states (phases) and is also able to model several fluids (components) interacting with each other. Section 3 then presents several recent works involving the lattice Boltzmann method on GPUs. Section 4 concerns the main contribution of this paper: the inclusion of a progressive mesh method in the simulation code. The principles of the method and the definition of an adapted criterion are first introduced. The integration on a single-node multi-GPU architecture is then described. A performance analysis is presented in section 5. The conclusion and future works are finally presented in the last section.
2. THE LATTICE BOLTZMANN METHOD
2.1. The Single Relaxation Time Bhatnagar-Gross-Krook (SRT-BGK) Boltzmann Equation
The lattice Boltzmann method is based on three main discretizations: space, time and velocities.
Velocity space is reduced to a finite number of well-defined vectors. Figures 1(a) and 1(b)
illustrate this discrete scheme for the D2Q9 and D3Q19 models.
The simulation grid is therefore discretized as a Cartesian grid and calculation steps are achieved
on this entire grid. The discrete Boltzmann equation [1] with a single relaxation time Bhatnagar-Gross-Krook (SRT-BGK) collision term is defined by the following equation:

$$f_i(x + e_i\Delta t,\, t + \Delta t) = f_i(x,t) - \frac{1}{\tau}\left(f_i(x,t) - f_i^{eq}(x,t)\right) \qquad (1)$$

$$f_i^{eq}(x,t) = w_i\,\rho\left(1 + \frac{e_i\cdot u}{c_s^2} + \frac{(e_i\cdot u)^2}{2c_s^4} - \frac{u^2}{2c_s^2}\right) \qquad (2)$$

$$c_s^2 = \frac{1}{3}\left(\frac{\Delta x}{\Delta t}\right)^2 \qquad (3)$$

The function f_i(x,t) corresponds to the discrete density distribution function along the velocity vector e_i at a position x and a time t. The parameter τ corresponds to the relaxation time of the simulation. The value ρ is the fluid density and u corresponds to the fluid velocity. Δx and Δt are the spatial and temporal steps of the simulation respectively. The parameters w_i are weighting values defined according to the lattice Boltzmann scheme and can be found in [1]. Macroscopic quantities such as density and velocity are finally computed as follows:
(a) D2Q9 scheme (b) D3Q19 scheme
Figure 1: Example of Lattice Boltzmann schemes
$$\rho(x,t) = \sum_i f_i(x,t) \qquad (4)$$

$$\rho(x,t)\,u(x,t) = \sum_i e_i\, f_i(x,t) \qquad (5)$$
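As a concrete illustration of Equations (1)-(5), the following minimal Python/NumPy sketch performs one D2Q9 collide-and-stream step in lattice units (Δx = Δt = 1, so c_s² = 1/3). The grid size and relaxation time are illustrative assumptions, not the paper's GPU implementation.

```python
import numpy as np

# Minimal D2Q9 SRT-BGK sketch in lattice units (dx = dt = 1, cs^2 = 1/3).
# Grid size and relaxation time tau are illustrative assumptions.
E = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])        # velocity vectors e_i
W = np.array([4/9] + [1/9]*4 + [1/36]*4)                  # weights w_i

def equilibrium(rho, u):
    """Equation (2); the factors 3, 4.5, 1.5 are 1/cs^2, 1/(2cs^4), 1/(2cs^2)."""
    eu = np.einsum('id,xyd->ixy', E.astype(float), u)     # e_i . u at every node
    usq = np.einsum('xyd,xyd->xy', u, u)
    return W[:, None, None] * rho * (1 + 3*eu + 4.5*eu**2 - 1.5*usq)

def lbm_step(f, tau=0.6):
    """One collide-and-stream step: Equations (1), (4) and (5)."""
    rho = f.sum(axis=0)                                    # Equation (4)
    u = np.einsum('id,ixy->xyd', E.astype(float), f) / rho[..., None]  # Eq. (5)
    f = f - (f - equilibrium(rho, u)) / tau                # SRT-BGK collision
    for i, (ex, ey) in enumerate(E):                       # periodic propagation
        f[i] = np.roll(np.roll(f[i], ex, axis=0), ey, axis=1)
    return f

# A uniform fluid at rest is at equilibrium and stays there; mass is conserved.
nx = ny = 8
f0 = equilibrium(np.ones((nx, ny)), np.zeros((nx, ny, 2)))
f1 = lbm_step(f0)
```

A uniform fluid at rest is a convenient sanity check: the collision term vanishes and the propagation leaves the fields unchanged.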
2.2. Multiphase and Multicomponent Lattice Boltzmann Model
Multiphase and multicomponent models (MPMC) allow performing complex simulations involving several physical components. In this section, a MPMC-LBM model based on the work achieved by Bao & Schaefer [4] is presented. It includes several interaction forces based on a pseudo-potential ψ_σ, which is calculated as follows:

$$\psi_\sigma = \sqrt{\frac{2\left(p_\sigma - \rho_\sigma c_s^2\right)}{c_0\, g_{\sigma\sigma}}} \qquad (6)$$
The term p_σ is the pressure term. It is calculated by the use of an equation of state such as the Peng-Robinson equation:

$$p_\sigma = \frac{\rho_\sigma R_\sigma T_\sigma}{1 - b_\sigma \rho_\sigma} - \frac{a_\sigma\,\alpha(T_\sigma)\,\rho_\sigma^2}{1 + 2 b_\sigma \rho_\sigma - b_\sigma^2 \rho_\sigma^2} \qquad (7)$$
Internal forces are then computed. The internal fluid interaction force is expressed as follows [2] [3]:

$$F_{\sigma\sigma} = -\beta\,\frac{g_{\sigma\sigma}}{2}\,\psi_\sigma(x)\sum_i w_i\,\psi_\sigma(x+e_i)\,e_i \;-\; \frac{1-\beta}{2}\,\frac{g_{\sigma\sigma}}{2}\sum_i w_i\,\psi_\sigma^2(x+e_i)\,e_i \qquad (8)$$

The value β is a weighting term generally fixed to 1.16 according to [2] [3]. The inter-component force is also introduced as follows [4]:

$$F_{\sigma\bar{\sigma}} = -\frac{g_{\sigma\bar{\sigma}}}{2}\,\psi_\sigma(x)\sum_i w_i\,\psi_{\bar{\sigma}}(x+e_i)\,e_i \qquad (9)$$
Additional forces can be added into the simulation code, such as the gravity force or a fluid-structure interaction [3]. The incorporation of the force term is then achieved by a modified collision operator expressed as follows:

$$f_{\sigma,i}(x+e_i\Delta t,\, t+\Delta t) = f_{\sigma,i}(x,t) - \frac{1}{\tau_\sigma}\left(f_{\sigma,i}(x,t) - f_{\sigma,i}^{eq}(x,t)\right) + \Delta f_{\sigma,i}(x,t) \qquad (10)$$

$$\Delta f_{\sigma,i}(x,t) = f_i^{eq}\!\left(\rho_\sigma,\, u_\sigma + \Delta u_\sigma\right) - f_i^{eq}\!\left(\rho_\sigma,\, u_\sigma\right) \qquad (11)$$

$$\Delta u_\sigma = \frac{F_\sigma\,\Delta t}{\rho_\sigma} \qquad (12)$$

Macroscopic quantities for each component are finally computed by the use of equations (4) and (5).
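Equations (6)-(7) can be sketched as follows. The parameter values below (a, b, R, the reduced temperature, the acentric factor ω, and the coupling constants g and c₀) are illustrative assumptions drawn from common lattice-unit choices in the pseudo-potential literature, not the paper's settings.

```python
import math

# Sketch of Equations (6)-(7): pseudo-potential psi from the Peng-Robinson
# equation of state. All numeric values below are illustrative assumptions.
def peng_robinson_pressure(rho, a=2.0/49, b=2.0/21, R=1.0, Tr=0.8, omega=0.344):
    """Equation (7), evaluated at the reduced temperature Tr = T/Tc."""
    Tc = 0.0778 * a / (0.45724 * b * R)   # critical temperature implied by a, b
    T = Tr * Tc
    k = 0.37464 + 1.54226 * omega - 0.26992 * omega**2
    alpha = (1.0 + k * (1.0 - math.sqrt(Tr)))**2
    return (rho * R * T / (1.0 - b * rho)
            - a * alpha * rho**2 / (1.0 + 2.0*b*rho - (b*rho)**2))

def pseudo_potential(rho, g=-1.0, c0=6.0, cs2=1.0/3.0):
    """Equation (6): psi = sqrt(2 (p - rho cs^2) / (c0 g))."""
    p = peng_robinson_pressure(rho)
    return math.sqrt(2.0 * (p - rho * cs2) / (c0 * g))

psi = pseudo_potential(2.0)   # a liquid-like density in lattice units
```

With g < 0 the argument of the square root stays positive whenever the EOS pressure is below ρ c_s², which is the usual regime for these attractive interactions.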
3. LATTICE BOLTZMANN METHODS AND GPUS
The massive parallelism of GPUs has quickly been exploited in order to perform fast simulations [7] [8] using the lattice Boltzmann method. Recent works have shown that GPUs are also used with multiphase and multicomponent models [16] [14]. The main aspects of GPU optimizations are
decomposed into several categories [10] [9], such as thread-level parallelism, GPU memory access, and the overlap of memory transfers with computations. Data coalescence is needed in order to optimize global memory bandwidth. This implies several conditions, as described in [9].

Concerning LBM, an adapted data structure such as the Structure of Arrays (SoA) has been well studied and has proven to be efficient on GPU [7]. Several access patterns are also described in the literature. The first one, named the A-B access pattern, consists of using two calculation grids in GPU global memory in order to manage the temporal and spatial dependency of the data (Equation (10)). Simulation steps alternate between reading distribution functions from A and writing them to B, and reading from B and writing to A reciprocally. This pattern is commonly used and offers very good performance [10] [11] [9] on a single GPU. Several techniques are however presented in the literature in order to significantly reduce the memory cost without loss of information, such as grid compression [6], the Swap algorithm [6] or the A-A pattern technique [12]. In this paper, the A-A pattern technique is used in order to save memory due to spatial and temporal data dependency.

Recent works involving the implementation of the lattice Boltzmann method on a single node composed of several GPUs are also available. A first solution, proposed in [13] [17], consists in dividing the entire simulation domain into subdomains according to the number of GPUs and performing LBM kernels on each subdomain in parallel. CPU threads are used to handle each CUDA context. Communications between subdomains are performed using zero-copy memory transfers. The zero-copy feature allows efficient communications by mapping CPU and GPU pointers. Data must however be read and written only once in order to obtain good performance.

Some approaches have finally been proposed recently to perform simulations on several nodes constituted of multiple GPUs by the use of MPI in combination with CUDA [19] [18] [21] [15]. In our case, we only have one computing node with multiple GPUs, thus we do not focus on these architectures in this paper.

4. A PROGRESSIVE MESH ALGORITHM FOR LATTICE BOLTZMANN METHODS ON SINGLE-NODE MULTI-GPU ARCHITECTURES

4.1. Motivation

Works described in the previous section consider that the entire simulation domain is meshed and divided into subdomains according to the number of GPUs, as shown on Figure 2. All subdomains are therefore calculated in parallel.

Figure 2: Division of the simulation domain: the entire domain is decomposed into subdomains according to the number of GPUs.
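The static decomposition of Figure 2 can be sketched as follows. The grid sizes, GPU count, and the NumPy arrays standing in for per-GPU buffers are illustrative assumptions; the paper's implementation uses CUDA contexts and zero-copy transfers instead.

```python
import numpy as np

# Static decomposition of an nx-by-ny grid along x into one subdomain per
# GPU (Figure 2). NumPy arrays stand in for the GPU buffers; each subdomain
# carries a one-cell halo column on each side for neighbor exchanges.
def decompose(nx, n_gpus):
    """Return (start, stop) x-ranges, one per GPU, covering [0, nx)."""
    base, extra = divmod(nx, n_gpus)
    ranges, start = [], 0
    for g in range(n_gpus):
        stop = start + base + (1 if g < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def exchange_halos(subdomains):
    """Copy boundary columns into neighbors' halos (zero-copy analogue)."""
    for left, right in zip(subdomains, subdomains[1:]):
        left[:, -1] = right[:, 1]      # neighbor's first interior column
        right[:, 0] = left[:, -2]      # neighbor's last interior column

nx, ny, n_gpus = 16, 8, 4
ranges = decompose(nx, n_gpus)
subdomains = [np.zeros((ny, stop - start + 2)) for start, stop in ranges]
subdomains[0][:, -2] = 1.0             # fluid reaches GPU 0's last column
exchange_halos(subdomains)             # ...and becomes visible in GPU 1's halo
```

The halo columns play the role of the exchanged distribution functions: after each propagation step, each subdomain only reads its neighbor's boundary values once, matching the read-once/write-once constraint mentioned above.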
In this paper, a new approach is considered. For most simulations, the entire domain generally does not require to be fully meshed at the beginning of the simulation. We therefore propose a new progressive mesh method in order to dynamically create the mesh according to the propagation of the simulated fluid. The idea consists in defining a first subdomain at the beginning of the simulation (Figure 3(a)). Several subdomains can then be created following the propagation of the fluid, as can be seen on Figure 3(b). This method finally adapts automatically to the simulation geometry (Figure 3(c)). This method is therefore applicable to any geometry and simulation. It is also a real advantage for an application on industrial structures mostly composed of pipes or channels. It can indeed save a lot of memory and calculations, according to the geometry used for the simulation.

Figure 3: Example of a 3D simulation using the progressive mesh algorithm: (a) a first subdomain is created at the beginning of the simulation, (b) several subdomains are created following the propagation of fluid, (c) all subdomains are created and completely adapt to the simulation geometry.

The progressive mesh algorithm firstly needs the introduction of an adapted criterion in order to create a new subdomain in the simulation. This new subdomain then needs to be connected to existing subdomains. Calculations on the single-node multi-GPU architecture are finally an important optimization factor.

4.2. Definition of a Criterion for the Progressive Mesh

The definition of a criterion is an important aspect in order to efficiently create new subdomains for the simulation. This criterion needs to represent efficiently the propagation of fluid. The fluid velocity seems like a good choice in order to define an efficient criterion. The difference of the fluid velocity between two iterations is considered in order to observe efficiently the fluid dispersion. Our criterion is therefore defined as follows for the component σ:

$$\|C_\sigma(x)\|_2 = \|u_\sigma(x,\, t + \Delta t) - u_\sigma(x,\, t)\|_2 \qquad (13)$$

The symbol ‖·‖₂ stands for the Euclidean norm in this paper. This criterion needs to be calculated for all active subdomains on the boundaries. If the criterion exceeds an arbitrary threshold S on a boundary, a new subdomain is created next to this boundary, as shown on Figure 4. The value S is generally fixed to 0 in this paper in order to detect any change of velocity on the boundaries of each subdomain.
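The boundary test of Equation (13) can be sketched as follows; the array shapes, single-component handling, and threshold value are illustrative assumptions.

```python
import numpy as np

# Sketch of the criterion of Equation (13) on one subdomain boundary.
# Array shapes and the threshold S are illustrative assumptions.
def boundary_needs_extension(u_prev, u_curr, S=0.0):
    """True if ||u(x, t+dt) - u(x, t)||_2 exceeds S anywhere on the boundary."""
    criterion = np.linalg.norm(u_curr - u_prev, axis=1)   # ||C_sigma(x)||_2
    return bool(np.any(criterion > S))

u0 = np.zeros((8, 3))                  # boundary velocities, previous iteration
u1 = u0.copy()
u1[3] = [1e-6, 0.0, 0.0]               # tiny velocity change at one node
still = boundary_needs_extension(u0, u0)   # no motion: no new subdomain
grow = boundary_needs_extension(u0, u1)    # any change exceeds S = 0: grow mesh
```

With S = 0, as in the paper, even a tiny velocity change at a single boundary node triggers the creation of a new subdomain.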
Figure 4: The criterion ‖C_σ(x)‖₂ is calculated on the boundary. If the criterion exceeds the threshold S, then a new subdomain is created next to the boundary.

4.3. Algorithm

This section describes the algorithm for the multiphase and multicomponent lattice Boltzmann model with the inclusion of our progressive mesh algorithm. It is also useful in order to summarize the previous sections. The calculation of the criterion and the creation of new subdomains are achieved at the last step of the algorithm in order not to disturb the simulation process. Figure 5 describes our resulting algorithm.

Figure 5: Algorithm for the multiphase and multicomponent lattice Boltzmann model with the inclusion of our progressive mesh method. For colors, please refer to the PDF version of this paper.
4.4. Integration on Single-Node Multi-GPU Architecture
Efficiency of inter-GPU communications is surely the most difficult task in order to obtain good performance. Indeed, our simulations are composed of numerous subdomains which are added dynamically. The repartition of GPUs to the different subdomains is an important factor of optimization. An efficient assignment can have an important impact on the performance of the simulation. Indeed, it can reduce the communication time between subdomains and so reduce the simulation time.
4.4.1. Overlap Communications with Computations
Several data exchanges are needed for this type of model. The computation of the inter-component interaction implies having access to neighboring values of the pseudo-potential. The propagation step of the LBM also implies communicating several distribution functions between GPUs (Figure 6). Aligned buffers may be used for data transactions.
In order to obtain a simulation time as short as possible, it is necessary to overlap data transfers with algorithm calculations. Indeed, overlapping computations and communications allows obtaining a significant performance gain by reducing the waiting time of data. The idea is to separate the computation process into two steps: boundary calculations and interior calculations. Computations on the needed boundaries are done first. Communications between neighboring subdomains are then done while computing the interior. The different communications are thus performed simultaneously with calculations, which allows good efficiency.
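The two-step split can be sketched conceptually. In a real CUDA code this would rely on streams and asynchronous copies; the thread-based toy version below only shows the ordering, and all three callables are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def step(compute_boundary, exchange, compute_interior):
    """One LBM time step with the halo exchange overlapped by interior work."""
    compute_boundary()                   # 1. boundary cells first
    with ThreadPoolExecutor(max_workers=1) as pool:
        comm = pool.submit(exchange)     # 2. start the halo exchange
        compute_interior()               # 3. interior work overlaps the transfer
        comm.result()                    # 4. wait before starting the next step

log = []
step(lambda: log.append("boundary"),
     lambda: (time.sleep(0.01), log.append("exchange")),
     lambda: log.append("interior"))
print(log)  # 'interior' is appended while the exchange is still in flight
```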
In most cases for the lattice Boltzmann method, memory is transferred via zero-copy transactions to page-locked memory, which allows good overlapping between communications and computations [17][13][15]. A different approach is studied in this paper concerning inter-GPU communications.
In the most recent HPC architectures, several GPUs can be connected to the same PCIe bus. To improve performance, Nvidia launched GPUDirect with CUDA 4.0.
Figure 6: Schematic example for communication of distribution functions in 2D: red arrows correspond to values to communicate between subdomains. For colors, please refer to the PDF version of this paper.
This technology allows performing Peer-to-Peer transfers and memory accesses between two compatible GPUs. The idea is to perform data transfers using Peer-to-Peer when possible and zero-copy transactions otherwise. This method allows communicating data by bypassing the use of the CPU and therefore accelerating the transfer (Figure 7). It improves the performance and the efficiency of the simulation.
Figure 7: GPUDirect technology (source Nvidia).
4.4.2. Optimization of Data Transfer between GPUs
The repartition of GPUs is an important factor of optimization for this type of application. The communication cost is generally a bottleneck for data exchanges between subdomains. Each subdomain is associated with one GPU. The first case concerns communications between subdomains belonging to the same GPU. In this case, the communication cost is extremely low because communications are performed on the same device. The other cases concern communications between different GPUs; a distinction is however made between Peer-to-Peer and zero-copy communications. The goal is to optimize dynamically the repartition of GPUs.
For a new subdomain S, the function F(S) estimates the communication cost with its neighborhood:

F(S) = Σ_{S'} f(S, S') (14)

where S' denotes the neighboring subdomains of S, and

f(S, S') = 0.5 × d(S, S') if the GPUs of S and S' support Peer-to-Peer transfers, and f(S, S') = d(S, S') otherwise, (15)

with d(S, S') the amount of data exchanged between S and S'. The function f(S, S') compares the different ways of communication between a subdomain and its neighbors. An arbitrary weighting of 0.5 is given to Peer-to-Peer communications. The function F(S) therefore needs to be minimized in order to reduce the communication cost. This function is calculated for all available GPUs and the GPU with the minimum value is assigned to this subdomain. In order to keep load balancing, all GPUs have to be assigned dynamically and the same GPU cannot be assigned twice as long as other GPUs are not assigned. Figure 8 explains this optimization via a schematic example.
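As an illustration only, with an assumed data layout and unit exchange volumes, the greedy choice of the GPU minimizing F(S) might look like:

```python
def comm_cost(gpu, neighbors, p2p):
    """F(S) for a candidate GPU: sum of f(S, S') over the neighboring subdomains.

    neighbors maps each neighboring subdomain to its assigned GPU; p2p is the
    set of GPU pairs supporting Peer-to-Peer (illustrative data layout).
    Exchanged data volumes d(S, S') are taken as 1 per neighbor for simplicity.
    """
    cost = 0.0
    for other_gpu in neighbors.values():
        if other_gpu == gpu:
            cost += 0.0                   # same device: negligible cost
        elif (gpu, other_gpu) in p2p or (other_gpu, gpu) in p2p:
            cost += 0.5                   # Peer-to-Peer weighting
        else:
            cost += 1.0                   # zero-copy through the host
    return cost

def assign_gpu(available, neighbors, p2p):
    """Pick, among the not-yet-used GPUs, the one minimizing F(S)."""
    return min(available, key=lambda g: comm_cost(g, neighbors, p2p))

# GPUs 0 and 1 are P2P-compatible; the new subdomain has two neighbors on GPU 0.
p2p = {(0, 1)}
neighbors = {"north": 0, "west": 0}
print(assign_gpu([1, 2], neighbors, p2p))  # -> 1 (P2P with GPU 0 is cheaper)
```

Load balancing then follows the rule above: once a GPU is picked it is removed from the available set until every other GPU has been assigned.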
5. RESULTS AND PERFORMANCE
5.1. Hardware
A machine with 8 NVIDIA Tesla C2050 graphics cards, based on the Fermi architecture, is used to perform the simulations. Table 1 describes some Tesla C2050 hardware specifications. Peer-to-Peer communication accessibility for our architecture is also described in Figure 9.

Table 1: Tesla C2050 hardware specifications
CUDA compute capability: 2.0
Total amount of global memory: 2687 MBytes
(14) Multiprocessors, (32) scalar processors/MP: 448 CUDA cores
GPU clock rate: 1147 MHz
L2 cache size: 786432 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768

Figure 9: Peer-to-Peer communications accessibility for our architecture.
Figure 8: Schematic example in 2D for the optimization of the repartition of GPUs. The function F(S) is calculated for all available GPUs and the GPU with the minimum value is chosen. For colors, please refer to the PDF version of this paper.
5.2. Simulations
Two simulations are considered on large simulation domains in order to evaluate the performance of our contribution. Both simulations include the use of two physical components. The geometry however differs between these simulations. The first simulation is based on a simple geometry composed of 1024*256*256 calculation cells, where a fluid fills the whole simulation domain during the simulation (Figure 10). The second simulation is based on a complex geometry composed of 1024*1024*128 calculation cells, where the fluid moves within channels (Figure 11).
5.3. Performance
This section deals with the performance obtained by our method. A comparison between the progressive mesh algorithm and the static mesh method generally used in the literature is shown. The optimization of the repartition of GPUs on subdomains is also studied. The performance metric generally used for the lattice Boltzmann method is the Million Lattice node Updates Per Second (MLUPS). It is calculated as follows:

MLUPS = (number of cells × number of iterations) / (simulation time × 10^6) (16)
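Equation (16) translates directly into code; the numbers below are purely illustrative, not measured values:

```python
def mlups(num_cells, num_iterations, simulation_time_s):
    """Million Lattice node Updates Per Second, following equation (16)."""
    return num_cells * num_iterations / (simulation_time_s * 1e6)

# Example: a 1024*256*256 domain advanced 100 iterations in 20 seconds.
print(mlups(1024 * 256 * 256, 100, 20.0))  # -> 335.54432
```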
This classical approach generally used in the literature to perform simulations consists in equally dividing the simulation domain according to the number of GPUs. It generally offers good performance, as communications can be overlapped with calculations. The use of Peer-to-Peer communications also has a beneficial effect on the performance, as shown on Figure 13. Peer-to-Peer communications allow obtaining a performance gain between 8 and 12% according
Figure 10: A two-component leakage simulation on a simple geometry with a domain size of
1024*256*256 cells.
Figure 11: A two-component leakage simulation on a complex geometry composed of channels with a
domain size of 1024*1024*128 cells.
to the number of GPUs used for the simulation described in Figure 10. Zero-copy communications offer a good scaling, but an almost perfect scaling is obtained with the inclusion of Peer-to-Peer communications, as shown on Figure 12.
The inclusion of the progressive mesh also has an important beneficial effect on the simulation performance. Subdomains of size 128*128*128 are considered for these simulations. Figures 13 and 14 describe performance in terms of calculations and memory consumption for the simulation presented on Figure 10. Note that the progressive mesh algorithm obtains excellent performance at the beginning of the simulation. The addition of subdomains during the simulation has for consequence a decrease of performance until the convergence of the simulation. In this particular case, the whole simulation domain is meshed at the end of the simulation, as shown on Figure 14, which leads to a very slight decrease of performance compared to the static mesh. In terms of memory consumption, new subdomains appear quickly, which leads to having the entire simulation domain in memory after a few iterations.
Figure 12: Comparison of performance between Peer-to-Peer communications and zero-copy communications for the simulation shown on Figure 10.
Figure 13: Comparison of performance between the progressive mesh method and the static mesh method for the simulation shown on Figure 10. The inclusion of the optimization for GPU assignment is also presented.
Figure 13 also compares performance between two different assignments for GPUs. The first one is a simple assignment which assigns the first available GPU to a new subdomain. The second one uses the optimization method presented in section 4.4.2. The comparison of these two methods shows an important difference of performance. Indeed, a difference of approximatively 30% is noted at the convergence of this simulation between the two approaches. This difference is mostly due to the fact that the communication cost is higher for a simple assignment than for an optimized assignment. Since subdomains are added dynamically and connected to each other, it is therefore important to optimize these communications in order to reduce the simulation time.
The same comparison is also done for the simulation presented on Figure 11, as shown on Figures 15 and 16. The main difference in this situation is the geometry of the simulation, which is more complex and channelized. Physical simulations on channelized geometries are especially present on industrial structures.
In this case, the progressive mesh method shows excellent results. In terms of memory, this method is easily able to simulate on a global simulation domain of size 1024*1024*128 and more, while the static mesh method is unable to perform the simulation. The amount of needed memory is indeed too large for this simulation. Figure 15 shows the evolution of memory consumption during the simulation. The memory cost at the convergence of the simulation is far lower than with the static mesh method. A gain of approximatively 50% of memory is noted for this particular simulation. This is due to the fact that the progressive mesh method automatically adapts to the evolution of the simulation, and so only needed zones of the global simulation domain are meshed.
Figure 14: Comparison of memory consumption between the progressive mesh method and the static mesh method for the simulation shown on Figure 10.
Figure 15: Comparison of memory consumption between the progressive mesh method and the static mesh method for the simulation shown on Figure 11.
The comparison of the repartition of GPUs is also described in Figure 16. An important performance gain (19%) is still noted for this simulation. This proves that a dynamic optimization method is important in order to obtain good performance. Moreover, the fact that the domain does not need to be fully meshed brings an important gain in performance. The geometry has therefore an important impact on the performance of the progressive mesh method.
Figure 16: Comparison of performance between a simple repartition of GPUs and an optimized assignment of GPUs for the simulation shown on Figure 11.
6. CONCLUSION
In this paper, an efficient progressive mesh algorithm for physical simulations using the lattice Boltzmann method is presented. This progressive mesh method can be a useful tool to perform several types of physical simulations. Its main advantage is that subdomains are automatically added to the simulation by the use of an adapted criterion. This method is also able to save a lot of memory and calculations in order to perform simulations on large installations.
The integration of the progressive mesh method on a single-node multi-GPU architecture is also treated. A dynamic optimization of the repartition of GPUs to subdomains is an important factor in order to obtain good performance. The combination of all these contributions therefore allows performing fast physical simulations on all types of geometries. The progressive mesh method is an interesting alternative because it allows obtaining similar or better performance than the usual static mesh method.
The progressive mesh algorithm is however limited by the memory of the GPUs, which is generally far smaller than the CPU RAM. The creation of new subdomains is indeed possible only while there is a sufficient amount of memory on the GPUs. Extensions of this work to cases that require more memory than all GPUs can handle are now under investigation. Data transfer optimizations with the CPU host will therefore be essential to keep good performance.
ACKNOWLEDGEMENTS
This work has been made possible thanks to collaboration between academic and industrial
groups, gathered by the INNOCOLD association.
REFERENCES
[1] B. Chopard, J.L. Falcone, J. Latt, The lattice Boltzmann advection diffusion model revisited, The European Physical Journal - Special Topics, Vol. 171, pp. 245-249, 2009.
[2] S. Gong, P. Cheng, Numerical investigation of droplet motion and coalescence by an improved lattice Boltzmann model for phase transitions and multiphase flows, Computers & Fluids, Vol. 53, pp. 93-104, 2012.
[3] S. Gong, P. Cheng, A lattice Boltzmann method for liquid vapor phase change heat transfer, Computers & Fluids, Vol. 54, pp. 93-104, 2012.
[4] J. Bao, L. Schaeffer, Lattice Boltzmann equation model for multicomponent multi-phase flow with high density ratios, Applied Mathematical Modelling, 2012.
[5] NVIDIA, NVIDIA CUDA C Programming Guide, NVIDIA Corporation, 2011.
[6] M. Wittmann, T. Zeiser, G. Hager, G. Wellein, Comparison of different propagation steps for lattice Boltzmann methods, Computers and Mathematics with Applications, Vol. 65, pp. 924-935, 2013.
[7] J. Tölke, Implementation of a lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA, Computing and Visualization in Science, pp. 1-11, 2008.
[8] J. Tölke, M. Krafczyk, TeraFLOP computing on a desktop PC with GPUs for 3D CFD, International Journal of Computational Fluid Dynamics, 22(7), pp. 443-456, 2008.
[9] F. Kuznik, C. Obrecht, G. Rusaouën, J-J. Roux, LBM based flow simulation using GPU computing processor, Computers and Mathematics with Applications, 27, 2009.
[10] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, A new approach to the lattice Boltzmann method for graphics processing units, Computers and Mathematics with Applications, 61, pp. 3628-3638, 2011.
[11] P.R. Rinaldi, E.A. Dari, M.J. Vénere, A. Clausse, A Lattice-Boltzmann solver for 3D fluid on GPU, Simulation Modelling Practice and Theory, 25, pp. 163-171, 2012.
[12] P. Bailey, J. Myre, S. Walsh, D. Lilja, M. Saar, Accelerating lattice Boltzmann fluid flows using graphics processors, International Conference on Parallel Processing, pp. 550-557, 2009.
[13] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Multi-GPU implementation of the lattice Boltzmann method, Computers and Mathematics with Applications, 80, pp. 269-275, 2013.
[14] X. Li, Y. Zhang, X. Wang, W. Ge, GPU-based numerical simulation of multi-phase flow in porous media using multiple-relaxation-time lattice Boltzmann method, Chemical Engineering Science, Vol. 102, pp. 209-219, 2013.
[15] M. Januszewski, M. Kostur, Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method, Computer Physics Communications, Vol. 185, pp. 2350-2368, 2014.
[16] F. Jiang, C. Hu, Numerical simulation of a rising CO2 droplet in the initial accelerating stage by a multiphase lattice Boltzmann method, Applied Ocean Research, Vol. 45, pp. 1-9, 2014.
[17] C. Obrecht, F. Kuznik, B. Tourancheau, J.-J. Roux, Multi-GPU implementation of a hybrid thermal lattice Boltzmann solver using the TheLMA framework, Computers and Fluids, Vol. 80, pp. 269-275, 2013.
[18] C. Rosales, Multiphase LBM distributed over multiple GPUs, CLUSTER'11: Proceedings of the 2011 IEEE International Conference on Cluster Computing, pp. 1-7, 2011.
[19] C. Obrecht, F. Kuznik, B. Tourancheau, J-J. Roux, Scalable lattice Boltzmann solvers for CUDA GPU clusters, Parallel Computing, Vol. 39, pp. 259-270, 2013.
[20] J. Habich, C. Feichtinger, H. Köstler, G. Hager, G. Wellein, Performance engineering for the lattice Boltzmann method on GPGPUs: Architectural requirements and performance results, Computers & Fluids, Vol. 80, pp. 276-282, 2013.
[21] C. Feichtinger, J. Habich, H. Köstler, U. Rüde, T. Aoki, Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU-GPU clusters, Parallel Computing, 2014.
AUTHORS
Julien Duchateau is a PhD student in computer science at the Université du Littoral Côte d'Opale in France. His main research interests are massive parallelism on CPUs and GPUs, physical simulations and computer graphics.
François Rousselle is an associate professor in computer science at the Université du Littoral Côte d’Opale
in France. His main research interests are computer graphics, physical simulations, virtual reality and
massive parallelism.
Nicolas Maquignon is a PhD student in simulation and numerical physics at the Université du Littoral Côte
d’Opale. His main research interests are numerical physics, numerical mathematics and numerical
modeling.
Christophe Renaud is a professor in computer science at the Université du Littoral Côte d’Opale in France.
His main research interests are computer graphics, virtual reality, physical simulations and massive
parallelism.
Gilles Roussel is an associate professor in automatic control at the Université du Littoral Côte d'Opale in France. His main research interests are automatic control, signal processing, physical simulations and industrial computing.