Pontifical Catholic University of Rio Grande do Sul – PUCRS
Graduate Program in Computer Science
Faculty of Informatics

Parallelization Strategies for Implementing Nbody Codes on Multicore Architectures


Filipo Novo Mór, Thais Christina Webber dos Santos, César Augusto Missio Marcon

GPU Implementation

[Figure: GPU execution pipeline showing, across a BARRIER, the active tasks and active transfers between global memory and a shared memory bank: (a) data is buffered into shared memory, (b) the buffer is consumed by parallel threads, and (c) at the end, global memory is updated.]

Initially, information about the particles is copied from RAM to GPU memory. The code then runs as a pipeline in which several data transfers between shared and global memory take place while the data in the shared-memory buffer is consumed. At the end of the process, all remaining data, now updated, is copied back to global memory and then back to RAM on the CPU. The computational cost is given by (1), where C is the cost of the force-calculation function, n is the number of particles, p is the number of parallel threads, M is the cost of a data transfer between shared and global memory on the GPU, and T is the cost of a data transfer between RAM and GPU memory.

    C n^2 / p + 2Mn + 2Tn    (1)

Multicore CPU Implementation

For a multicore CPU, a standard serial code was parallelized by adding OpenMP directives directly to it. Since no memory transfers are needed, the computational cost is reduced from n^2 to (2), where p is the number of parallel threads, normally one per CPU core.

    C n^2 / p    (2)
The N-Body Simulation

The Particle-Particle method does not round off the force summation, so its accuracy equals the machine precision. The total energy is monitored during the simulation. Computational cost: O(n^2).

Partial Results and Perspectives

- Cost savings through real speedup achievement.
- Visualization module already implemented.

Next Steps:
- Cluster implementation (CUDA + MPI + GPUDirect).
- Exploration of hierarchical algorithms (such as Barnes-Hut and mesh methods).
- OpenCL version.
- OpenACC version.

Execution cost quoted on the Amazon EC2 service:

Version   Cost
Serial    $0.33
OpenMP    $0.08
CUDA      $0.05

Contact: filipo.mor@acad.pucrs.br
