Tridiagonal solver in gpu


A Scalable Tridiagonal Solver for GPU

Published in: Technology

  1. A Scalable Tridiagonal Solver for GPUs. Team: WenMin Xiao & ChaoQun Li, Institute of Information Science and Technology, Hunan University
  2. Outline
     - Introduction
     - Algorithms
       - Classic algorithms
       - Algorithm design for the GPU architecture
     - Performance
     - Summary
  3. What is a tridiagonal system?
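The figure for this slide did not survive extraction. As a reminder of the standard definition, a tridiagonal system couples each unknown only to its two neighbours. The symbols a_i, b_i, c_i, d_i and x_i below are conventional notation assumed here, not taken from the slide:

```latex
a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i, \qquad i = 1, \dots, n, \qquad a_1 = c_n = 0

\begin{pmatrix}
b_1 & c_1 &        &         &         \\
a_2 & b_2 & c_2    &         &         \\
    & a_3 & b_3    & \ddots  &         \\
    &     & \ddots & \ddots  & c_{n-1} \\
    &     &        & a_n     & b_n
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \end{pmatrix}
```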
  4. What is it used for?
     - Scientific and engineering applications
       - Numerical ocean models
       - Semi-coarsening for multi-grid solvers
       - Spectral Poisson solvers
       - Cubic spline approximation
     - Video and computer-animated films
       - Depth of field blurs
       - Fluid simulation
  5. Two Applications on GPU
     - Depth of field blur (Michael Kass et al., 2006): OpenGL and shader language, cyclic reduction
     - Shallow water simulation (2007): CUDA, cyclic reduction
  6. A Classic Serial Algorithm
     - Gaussian elimination in the tridiagonal case (Thomas algorithm)
     - Phase 1: forward reduction
     - Phase 2: backward substitution
     - Elimination steps: 2(n-1)+1 = 2n-1
     - Complexity: O(n)
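The two phases of the Thomas algorithm can be sketched as follows. This is a minimal serial illustration in plain Python, not the presenters' code; the function name `thomas` and the array names `a`, `b`, `c`, `d` (sub-diagonal, diagonal, super-diagonal, right-hand side) are assumptions, and no pivoting is done, so the matrix is assumed to be diagonally dominant.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm.

    a: sub-diagonal (a[0] is ignored), b: diagonal,
    c: super-diagonal (c[-1] is ignored), d: right-hand side.
    Hypothetical helper; assumes no zero pivots occur.
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    # Phase 1: forward reduction eliminates the sub-diagonal.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Phase 2: backward substitution, last unknown first.
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Each loop visits every equation once, which is where the O(n) cost comes from; the loop-carried dependency on `cp[i - 1]` and `x[i + 1]` is what makes a single system hard to parallelize with this method.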
  7. Parallel Algorithms
     - Coarse-grained algorithms (multi-core CPU): a set of equations mapped to one thread
       - Two-way Gaussian elimination
       - Sub-structuring method
     - Fine-grained algorithms (many-core GPU): a single equation mapped to one thread
       - Cyclic Reduction (CR)
       - Parallel Cyclic Reduction (PCR)
       - Recursive Doubling (RD)
       - Hybrid Thomas-PCR algorithm
  8. Cyclic Reduction (2-4 threads working)
     - Forward reduction: 8-unknown system, then 4-unknown system, then 2-unknown system
     - Solve 2 unknowns
     - Backward substitution: solve the next 2 unknowns, then the remaining 4 unknowns
     - 2*log2(8)-1 = 2*3-1 = 5 steps
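The step count on this slide can be checked with a small serial sketch of cyclic reduction. It is an illustration under stated assumptions rather than the presenters' implementation: the name `cyclic_reduction` is hypothetical, the system size must be a power of two, and the inner `for` loops stand in for what would be parallel threads on a GPU.

```python
def cyclic_reduction(a, b, c, d):
    """Cyclic reduction for n = 2^m unknowns (hypothetical sketch).

    Returns (x, steps), where steps counts forward reductions,
    the 2-unknown solve, and backward substitutions.
    """
    n = len(d)
    assert n >= 2 and n & (n - 1) == 0, "n must be a power of two"
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[-1] = 0.0
    x = [0.0] * n
    steps = 0

    # Forward reduction: halve the system until 2 unknowns remain.
    stride = 1
    while n // (2 * stride) >= 2:
        for i in range(2 * stride - 1, n, 2 * stride):  # parallel on a GPU
            lo, hi = i - stride, i + stride
            k1 = a[i] / b[lo]
            b[i] -= c[lo] * k1
            d[i] -= d[lo] * k1
            a[i] = -a[lo] * k1
            if hi < n:
                k2 = c[i] / b[hi]
                b[i] -= a[hi] * k2
                d[i] -= d[hi] * k2
                c[i] = -c[hi] * k2
            else:
                c[i] = 0.0
        stride *= 2
        steps += 1

    # Solve the remaining 2-unknown system directly (Cramer's rule).
    i1, i2 = stride - 1, 2 * stride - 1
    det = b[i1] * b[i2] - c[i1] * a[i2]
    x[i1] = (d[i1] * b[i2] - c[i1] * d[i2]) / det
    x[i2] = (b[i1] * d[i2] - a[i2] * d[i1]) / det
    steps += 1

    # Backward substitution: each level fills in the skipped unknowns.
    stride //= 2
    while stride >= 1:
        for i in range(stride - 1, n, 2 * stride):  # parallel on a GPU
            left = x[i - stride] if i - stride >= 0 else 0.0
            right = x[i + stride] if i + stride < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
        stride //= 2
        steps += 1
    return x, steps
```

For 8 unknowns the sketch performs 2 forward steps, 1 direct solve, and 2 backward steps, matching the 2*log2(8)-1 = 5 count above. The number of active equations halves at each forward step, which is why only a few threads stay busy.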
  9. Parallel Cyclic Reduction (PCR) (4 threads working)
     - Forward reduction, no backward substitution
     - One 8-unknown system, then two 4-unknown systems, then four 2-unknown systems
     - Solve all unknowns
     - log2(8) = 3 steps
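PCR differs from CR in that every equation is reduced at every step, so the system splits into independent interleaved subsystems and no backward substitution is needed. The sketch below is a serial illustration with hypothetical names (`pcr_step`, `pcr`); it assumes a power-of-two size and a diagonally dominant matrix, and the loop over `i` stands in for one thread per equation.

```python
def pcr_step(a, b, c, d, stride):
    """One PCR step: every equation i eliminates x[i-stride] and x[i+stride]."""
    n = len(d)
    na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
    for i in range(n):  # one thread per equation on a GPU
        lo, hi = i - stride, i + stride
        if lo >= 0:
            k = a[i] / b[lo]
            nb[i] -= c[lo] * k
            nd[i] -= d[lo] * k
            na[i] = -a[lo] * k
        if hi < n:
            k = c[i] / b[hi]
            nb[i] -= a[hi] * k
            nd[i] -= d[hi] * k
            nc[i] = -c[hi] * k
    return na, nb, nc, nd


def pcr(a, b, c, d):
    """Parallel cyclic reduction for n = 2^m unknowns; returns (x, steps)."""
    n = len(d)
    assert n >= 2 and n & (n - 1) == 0, "n must be a power of two"
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[-1] = 0.0
    steps = 0

    # Reduce until the system has split into n/2 two-unknown systems.
    stride = 1
    while stride < n // 2:
        a, b, c, d = pcr_step(a, b, c, d, stride)
        stride *= 2
        steps += 1

    # Solve every 2-unknown system (unknowns i and i + n/2) directly.
    x = [0.0] * n
    for i in range(stride):
        j = i + stride
        det = b[i] * b[j] - c[i] * a[j]
        x[i] = (d[i] * b[j] - c[i] * d[j]) / det
        x[j] = (b[i] * d[j] - a[j] * d[i]) / det
    steps += 1
    return x, steps
```

For 8 unknowns this takes 2 reduction steps plus 1 solve step, matching log2(8) = 3. The price is that all n equations are updated at every step, so PCR does more total work than CR in exchange for fewer steps.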
  10. Advantages of Previous Algorithms
     - Thomas: independent systems are easy to solve in parallel
     - CR: every step reduces the system by half
     - PCR: fewer steps required
  11. Hybrid Algorithm
     - Improved PCR (Tiled PCR)
     - P-Thomas
     - Example: one 8-unknown system, one PCR step, then parallel Thomas
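The hybrid idea (a few PCR steps to split one system into independent subsystems, then one Thomas solve per subsystem) can be sketched as below. This is an untiled, serial illustration: the names `hybrid_solve` and the parameter `k` (number of PCR steps) are assumptions, and the tiling and caching described on the following slides are not modelled.

```python
def pcr_step(a, b, c, d, stride):
    """One PCR step: every equation i eliminates x[i-stride] and x[i+stride]."""
    n = len(d)
    na, nb, nc, nd = [0.0] * n, list(b), [0.0] * n, list(d)
    for i in range(n):
        lo, hi = i - stride, i + stride
        if lo >= 0:
            k = a[i] / b[lo]
            nb[i] -= c[lo] * k
            nd[i] -= d[lo] * k
            na[i] = -a[lo] * k
        if hi < n:
            k = c[i] / b[hi]
            nb[i] -= a[hi] * k
            nd[i] -= d[hi] * k
            nc[i] = -c[hi] * k
    return na, nb, nc, nd


def thomas(a, b, c, d):
    """Serial Thomas solve; a[0] and c[-1] are ignored."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def hybrid_solve(a, b, c, d, k):
    """k PCR steps, then Thomas on each of the 2^k interleaved subsystems."""
    n = len(d)
    assert 2 ** k <= n, "too many PCR steps for this system size"
    a, b, c, d = list(a), list(b), list(c), list(d)
    a[0] = 0.0
    c[-1] = 0.0
    for step in range(k):
        a, b, c, d = pcr_step(a, b, c, d, 2 ** step)

    x = [0.0] * n
    groups = 2 ** k
    for r in range(groups):  # one thread per subsystem in P-Thomas
        idx = range(r, n, groups)
        xs = thomas([a[i] for i in idx], [b[i] for i in idx],
                    [c[i] for i in idx], [d[i] for i in idx])
        for i, xi in zip(idx, xs):
            x[i] = xi
    return x
```

With `k = 1` on an 8-unknown system, the single PCR step yields two 4-unknown systems (even and odd indices), each of which is then solved by Thomas, as in the slide's example.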
  12. GPU Implementation
     - Linear systems mapped to multiprocessors (blocks)
     - Equations mapped to processors (threads)
  13. Tiled PCR
     - A variant of incomplete PCR, in the sense that it stops breaking down the system before the algorithm reaches the smallest possible matrices
     - Redundancy of naive tiling of PCR
       - Redundancy
       - Another type of redundancy
     - K-step PCR: the number of redundant memory accesses per tile boundary
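The slide's own expression for the redundancy did not survive extraction, so the following is not a reconstruction of it. It is a small, independent check of how far the data dependency of one output element reaches after k PCR steps, which is what forces a naively tiled PCR to reload neighbouring elements at each tile boundary. The function name `pcr_dependency_reach` is hypothetical.

```python
def pcr_dependency_reach(k):
    """Return how many input equations to the left and right of an interior
    equation its value depends on after k PCR steps (strides 1, 2, 4, ...).
    """
    n = 4 * 2 ** k  # large enough that the middle element is never clipped
    deps = [{i} for i in range(n)]  # deps[i]: inputs equation i depends on
    for step in range(k):
        stride = 2 ** step
        new_deps = []
        for i in range(n):
            s = set(deps[i])
            if i - stride >= 0:
                s |= deps[i - stride]
            if i + stride < n:
                s |= deps[i + stride]
            new_deps.append(s)
        deps = new_deps
    mid = n // 2
    return mid - min(deps[mid]), max(deps[mid]) - mid
```

The reach on each side is 1 + 2 + ... + 2^(k-1) = 2^k - 1 elements, so a tile processed independently needs that many extra elements beyond each of its edges. How this translates into the redundant access count quoted on the slide depends on details of the tiling that are not recoverable from this transcript.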
  14. Dependency & Parallelism: How to Reduce Redundancy?
     - A set of elements being an object of a PCR operation
     - Cached results
     - Solution 1: redundancy still exists
  15. Dependency & Parallelism (cont.): Fine-grained tiling
     - A set of elements being an object of a PCR operation
     - Cached results
     - Solution 2: no redundancy, sequential computation
  16. Cache Design: Buffered Sliding Window (illustration of the buffered sliding window)
     1. Intermediate results are cached
     2. Tiles are processed in parallel
     3. Each tile has multiple sub-tiles
     4. Sub-tiles are processed sequentially using the cache
  17. Components of Buffered Sliding Window
     - Bottom buffer: holds the input elements that have just been loaded from global memory and are ready to be processed
     - Middle buffer: mostly interacts with the elements in the bottom buffer, providing dependencies to them while also referring to them
     - Top buffer: caches elements from the middle buffer for the last step of PCR
  18. Example
  19. Advantages of TPCR
     - Fewer steps
     - Better memory latency hiding (less idle time in GPU execution)
     - Minimal redundant global memory access
     - Arbitrary tile size
  20. Thread-level Parallel Thomas Algorithm
     - Solves multiple independent systems, with each thread solving a different system
     - Global memory accesses should be coalesced (64B aligned segment, 128B aligned segment)
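Why the memory layout matters for P-Thomas can be seen by listing the addresses that neighbouring threads touch at the same Thomas iteration. The sketch below only computes array indices; the two layouts and all function names (`contiguous_index`, `interleaved_index`, `warp_addresses`) are illustrative assumptions, not the layout used in the presentation.

```python
def contiguous_index(system, row, n_rows, n_systems):
    """Each system stored as one contiguous block: system-major layout."""
    return system * n_rows + row


def interleaved_index(system, row, n_rows, n_systems):
    """Row r of every system stored side by side: row-major (interleaved) layout."""
    return row * n_systems + system


def warp_addresses(index_fn, row, n_rows, n_systems, warp_size=32):
    """Element indices read by threads 0..warp_size-1 when thread t works on
    system t and all threads are at the same row of their Thomas loop."""
    return [index_fn(t, row, n_rows, n_systems) for t in range(warp_size)]


if __name__ == "__main__":
    n_rows, n_systems = 512, 1024
    print(warp_addresses(contiguous_index, 3, n_rows, n_systems)[:4])
    print(warp_addresses(interleaved_index, 3, n_rows, n_systems)[:4])
```

With the contiguous layout, adjacent threads are `n_rows` elements apart, so each access lands in a different memory segment. With the interleaved layout, adjacent threads read adjacent elements, which the hardware can combine into a small number of aligned segment transactions.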
  21. Performance Evaluation: Test Platform
     - Nvidia GTX480 GPU with 1.5 GB of memory
     - 3.33 GHz Intel quad-core i7 975 CPU
     - Fedora 12 Linux
     - CUDA 3.2
  22. Performance Results
     - Parameters M and N: number of systems and system size
     - 8.3x and 49x speedups
     - 5x and 30x speedups
  23. Performance Analysis
     - Factors that determine performance
       - Size of the intermediate results cache
       - Global/shared memory access
       - Overhead for synchronization
       - Bank conflicts
  24. Summary
     - We studied three kinds of algorithms for solving tridiagonal systems
     - We learned about the carefully designed tiling and the buffered sliding window in the TPCR algorithm
  25. References
     - http://en.wikipedia.org/wiki/Tridiagonal_matrix_algorithm
     - Fast Tridiagonal Solvers on GPU
     - CUDA Programming Guide
     - High-Performance Computing with CUDA (高性能运算之 CUDA)
  26. Questions? Thanks.
