Hybrid CPU/GPU Computing with Domain Decomposition

A hybrid CPU/GPU implementation of the FETI domain decomposition method.

More information can be found here:
http://www.sciencedirect.com/science/article/pii/S0045782511000235

Published in: Technology, Education

Transcript of "Hybrid CPU/GPU Computing with Domain Decomposition"

  1. Domain Decomposition Methods in Hybrid CPU-GPU Environment
     A new era in scientific computing
     National Technical University of Athens
     M. Papadrakakis, G. Stavroulakis, A. Karatarakis
  2. Hybrid CPU-GPU Implementation
     The dual DDM FETI solver has been implemented on hybrid CPU-GPU workstations in order to exploit all available processing power and CPU memory resources, and thus to handle even larger problems.
  3. Challenges
     Because the CPU and GPU platforms are heterogeneous and follow different programming paradigms, special considerations had to be made in a number of steps of the FETI algorithm to achieve optimum efficiency.
     One of the main issues is the difference in performance between the CPU and the GPU. In particular, this difference is not the same when calculating dot products, executing matrix-vector multiplications, or solving linear systems directly with Cholesky factorization.
     A common bottleneck is encountered in data transfers between CPU and GPU.
     Load balancing!
  4. General processing flow of CUDA programming
  5. CUDA device
       • Very large number of streaming processors.
  6. CUDA Threads
       • A thread is the smallest unit of processing that can be scheduled by an operating system.
       • GPUs generate a large number of threads in order to exploit data parallelism.
       • All threads generated by a kernel define a grid and are organized in blocks.
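
The grid/block organization described on the CUDA Threads slide can be illustrated with a small CPU-side emulation. This is only a sketch in plain Python, not CUDA code and not taken from the slides: it assumes a one-dimensional grid of one-dimensional blocks, and the function names (`blocks_needed`, `global_thread_ids`, `launch_scale`) are hypothetical. It shows how each thread could derive a unique global index from its block index and its index within the block, which is the usual way one data element is assigned to one thread.

```python
def blocks_needed(n, block_dim):
    """Number of blocks required so that grid_dim * block_dim >= n."""
    return (n + block_dim - 1) // block_dim


def global_thread_ids(grid_dim, block_dim):
    """List (block index, thread index, global index) for a 1-D grid of 1-D blocks."""
    ids = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Same formula a CUDA kernel would use:
            # blockIdx.x * blockDim.x + threadIdx.x
            ids.append((block_idx, thread_idx, block_idx * block_dim + thread_idx))
    return ids


def launch_scale(x, alpha, block_dim=4):
    """Emulate a kernel launch in which every thread scales one vector entry."""
    n = len(x)
    out = [0.0] * n
    grid_dim = blocks_needed(n, block_dim)
    for _, _, i in global_thread_ids(grid_dim, block_dim):
        if i < n:  # threads of the last block that fall past the end do nothing
            out[i] = alpha * x[i]
    return out


if __name__ == "__main__":
    print(launch_scale([1.0, 2.0, 3.0, 4.0, 5.0], 2.0))
```

On a real device the two loops in `global_thread_ids` do not exist; all threads run concurrently and each one evaluates the index formula for itself.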
  9. CUDA Memory
     CUDA devices have several different types of memory, which programmers need to use appropriately in order to achieve high performance.
  10. Projection step
     Because this matrix is global, spanning the whole domain rather than being associated with individual subdomains, a direct solver is generally not appropriate for this task. For this reason, a PCG solver with a diagonal preconditioner is applied in parallel at each projection step of the PCPG algorithm.
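
To make the idea of a PCG solver with a diagonal preconditioner concrete, here is a minimal serial sketch in plain Python. It is an illustration under assumptions, not the authors' parallel implementation: the matrix is stored densely as a list of rows, it is assumed to be symmetric positive definite with a nonzero diagonal, and the names `pcg_diagonal`, `matvec` and `dot` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, x) for row in A]


def pcg_diagonal(A, b, tol=1e-10, max_iter=200):
    """Solve A x = b by conjugate gradients with a diagonal (Jacobi) preconditioner.

    Assumes A is symmetric positive definite with nonzero diagonal entries.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # the preconditioner M^-1
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0              # avoid dividing by zero if b = 0
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:      # relative residual test
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


if __name__ == "__main__":
    print(pcg_diagonal([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```

Each iteration consists only of one matrix-vector product, a few dot products and vector updates, which is why the efficiency of those kernels matters for this step.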
  11. Dot products
     Dot products are present in several steps of the solution procedure. They therefore consume a significant amount of processing time and have to be implemented efficiently.
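
One common way to compute a dot product on a GPU is a block-wise tree reduction: every thread multiplies one pair of entries, each block sums its products in a shared buffer by repeatedly halving the number of active threads, and the per-block partial sums are added at the end. The slides do not say which scheme was used, so the following plain Python emulation is only an assumed illustration of that general pattern; `block_partial_sums` and `dot_reduce` are hypothetical names, and the block size is assumed to be a power of two.

```python
def block_partial_sums(x, y, block_dim=4):
    """Emulate one tree reduction per block; returns one partial sum per block."""
    if block_dim <= 0 or block_dim & (block_dim - 1):
        raise ValueError("block_dim must be a power of two")
    partials = []
    for start in range(0, len(x), block_dim):
        # Emulated shared-memory buffer: one product per thread, zero-padded
        shared = [0.0] * block_dim
        for t in range(block_dim):
            i = start + t
            if i < len(x):
                shared[t] = x[i] * y[i]
        # Tree reduction: halve the number of active threads at each step
        stride = block_dim // 2
        while stride > 0:
            for t in range(stride):
                shared[t] += shared[t + stride]
            stride //= 2
        partials.append(shared[0])
    return partials


def dot_reduce(x, y, block_dim=4):
    """Dot product as the sum of the per-block partial sums."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(block_partial_sums(x, y, block_dim))


if __name__ == "__main__":
    print(dot_reduce([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0, 1.0]))
```

The reduction within a block takes a number of steps that grows with the logarithm of the block size rather than with the block size itself, which is the point of the tree layout.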
  12. Sparse Matrix-Vector multiplication
     Sparse matrix-vector multiplications (SpMV) are also encountered in several steps, such as the projection and preconditioning steps. To achieve maximum efficiency in this time-consuming operation, an optimized CUDA kernel that computes the SpMV result has to be implemented.
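
As an illustration of what an SpMV kernel computes, the sketch below multiplies a matrix stored in compressed sparse row (CSR) form by a vector. The storage format is an assumption made for this example, since the slides do not state which format the optimized kernel uses, and the code is a serial Python emulation with hypothetical names (`dense_to_csr`, `csr_spmv`), not a CUDA kernel.

```python
def dense_to_csr(A):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row inside `values`
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A given in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    # The rows are independent, so on a GPU each row could be handled
    # by its own thread (or by a group of threads).
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


if __name__ == "__main__":
    A = [[2.0, 0.0, 1.0],
         [0.0, 3.0, 0.0],
         [4.0, 0.0, 5.0]]
    print(csr_spmv(*dense_to_csr(A), [1.0, 2.0, 3.0]))
```

Only the nonzero entries are stored and visited, so the work is proportional to the number of nonzeros rather than to the full matrix size.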
  13. Dynamic Load-Balancing
     Ideally, both the CPU and the GPU(s) should be at full load at all times.
     The heterogeneity of the computing components has been addressed in this work by implementing a dynamic load-balancing procedure based on task queues.
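
The general principle of load balancing with a task queue can be sketched as follows. This is a hedged illustration, not the procedure from the slides: two Python threads stand in for a CPU worker and a GPU worker, the per-task cost is simulated with `time.sleep`, and the worker speeds, task sizes and the name `run_task_queue` are all hypothetical. Each worker takes the next task from a shared queue as soon as it becomes idle, so the faster worker ends up processing more tasks without any static partitioning.

```python
import queue
import threading
import time


def run_task_queue(tasks, workers):
    """Let heterogeneous workers drain one shared task queue.

    tasks   -- list of (task_id, size) tuples
    workers -- dict mapping worker name to seconds per unit of size
               (hypothetical speeds, used only to simulate work)
    Returns a dict mapping worker name to the list of task ids it processed.
    """
    q = queue.Queue()
    for task in tasks:
        q.put(task)
    done = {name: [] for name in workers}

    def worker(name, seconds_per_unit):
        while True:
            try:
                task_id, size = q.get_nowait()  # take the next task when idle
            except queue.Empty:
                return                          # nothing left to do
            time.sleep(size * seconds_per_unit)  # stand-in for the real work
            done[name].append(task_id)           # each worker owns its own list

    threads = [threading.Thread(target=worker, args=(name, cost))
               for name, cost in workers.items()]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return done


if __name__ == "__main__":
    subdomain_tasks = [(i, 1) for i in range(20)]
    result = run_task_queue(subdomain_tasks, {"cpu": 0.002, "gpu": 0.0005})
    print({name: len(ids) for name, ids in result.items()})
```

Because tasks are claimed one at a time, neither worker sits idle while work remains, which is the property the slide asks for. The exact split between the two workers depends on timing and is not deterministic.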
  14. Dynamic Load-Balancing
  15. Dynamic Load Balancing: Major Subdomain Tasks
  16. Numerical Examples
  17. Example 1
     115,320 dof
     Number of subdomains: 45 to 300
     Intel Core 2 Quad Q6600 2.4GHz, 4 physical cores, 4 logical cores, 8MB L2 cache
     3GB RAM
     NVIDIA GTX285 with 1GB GDDR3 memory
  18. DoF for different number of subdomains
  19. Load Balancing: Q6600 & GTX 285
  20. Load Balancing: Q6600 & GTX 285
  21. Solution Time
     *With direct subdomain solver
  22. Example 2
     1,058,610 dof
     Number of subdomains: 125 to 2744
     Intel Core i7-950 Processor 3.06GHz, 4 physical cores, 8 logical cores, 8MB cache
     6GB RAM
     NVIDIA GTX285 with 1GB GDDR3 memory
     NVIDIA GTX580 with 1.5GB GDDR5 memory
  23. DoF for different number of subdomains
  24. Load Balancing: i7 & GTX 580
  25. Load Balancing: i7 & GTX 580
  26. Solution Time
     *With direct subdomain solver
  27. Speedup
     In Example 1 (~100,000 degrees of freedom, a small problem), the hybrid-parallel implementation is 20 times faster than a conventional implementation.
     In Example 2 (~1,000,000 degrees of freedom), the hybrid-parallel implementation is 40-45 times faster than a conventional implementation.
  28. Further Information
     http://www.sciencedirect.com/science/article/pii/S0045782511000235
