GPU Programming

       Roberto Bonvallet
      Departamento de Informática
Universidad ...

  1. GPU Programming
       Roberto Bonvallet, Departamento de Informática, Universidad Técnica Federico Santa María
       June 2010
  2. CPU vs GPU peak performance
  3-5. CPU and GPU architectures (diagram labels: Control, ALU, Cache, DRAM)
  6. Nvidia Tesla architecture
  7-9. Task and data parallelism
       Task parallelism:
         distributed processing
         distributed memory
         message passing
       Data parallelism:
         same instruction on different data
         shared memory
  10-13. Thread and memory hierarchies
       Thread hierarchy:
         grid of blocks
         blocks of threads
       Memory hierarchy:
         global memory (large, slow)
         shared memory (per-block, small, fast)
         registers (per-thread, small, fast)
  14-17. Matrix-matrix multiplication
       $c_{ij} = \sum_k a_{ik} b_{kj}$
       $C_{ij} = \sum_k A_{ik} B_{kj}$
       Multiplication kernel:
         initialize element of $C_{ij}$ to 0
         for each $k$:
           fetch element of $A_{ik}$, $B_{kj}$ into shared memory
           synchronize
           compute element of $C_{ij} = C_{ij} + A_{ik} B_{kj}$
           synchronize
  18. Nvidia C1060
       Core clock                          602 MHz
       Multiprocessors                     30
       Thread processors                   240 = 30 × 8
       Memory size                         4 GB
       Memory bandwidth                    102.4 GB/s
       Single-precision peak performance   933.12 Gflop/s
       Double-precision peak performance   77.76 Gflop/s
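  19. CUDA programming
       Array allocation and copying
         cudaMalloc((void **) &dev_p, mem_size);                        // allocate device memory
         cudaMemcpy(dev_p, host_p, mem_size, cudaMemcpyHostToDevice);   // arguments are (dst, src, ...)
         [...]
         cudaMemcpy(host_p, dev_p, mem_size, cudaMemcpyDeviceToHost);   // copy results back to the host
         cudaFree(dev_p);                                               // release device memory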
  20-21. CUDA programming
       Kernel definition
         // Each thread adds one pair of elements. There is no bounds check, so the
         // launch must create exactly one thread per array element.
         __global__ void
         vector_sum(float *a, float *b, float *c) {
             int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
             c[i] = a[i] + b[i];
         }
       Kernel launch
         vector_sum<<<grid_size, block_size, sh_mem_size>>>(a, b, c);
  22-24. Vortex Methods
       Fluid discretized as vortices $(x, y, \alpha)$
       Vortex interaction: $K(x, y) = -\frac{1}{2\pi \|\mathbf{x}\|^{2}} (-y, x)$
       Biot-Savart law: $\mathbf{u}(\mathbf{x}) = \sum_p \alpha_p K(\mathbf{x} - \mathbf{x}_p)$
  25. GPU velocity evaluation
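
The multiplication kernel outlined on the matrix-matrix multiplication slides could be sketched in CUDA roughly as follows. This is a minimal illustration, not the author's code: it assumes square row-major matrices and reads the capital-letter formula as a product of tiles (blocks). The tile width `TILE`, the kernel name `matmul_tiled`, and the host helper `matmul_on_gpu` are hypothetical names introduced here.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical tile width for this sketch; not a value taken from the slides.
#define TILE 16

// C = A * B for square n x n row-major matrices.
// Each thread block computes one TILE x TILE block of C, each thread one element.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // "initialize element of C_ij = 0", held in a register

    for (int k = 0; k < (n + TILE - 1) / TILE; ++k) {
        // Fetch one element of A_ik and B_kj into shared memory,
        // padding with zeros where the tile runs past the matrix edge.
        int a_col = k * TILE + threadIdx.x;
        int b_row = k * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // both tiles are fully loaded

        for (int t = 0; t < TILE; ++t)
            acc += As[threadIdx.y][t] * Bs[t][threadIdx.x];
        __syncthreads();  // everyone is done reading before the next load
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: copy inputs to the device, launch, copy the result back.
// Error checking of the CUDA calls is omitted for brevity.
std::vector<float> matmul_on_gpu(const std::vector<float> &A,
                                 const std::vector<float> &B, int n) {
    assert(A.size() == (size_t)n * n && B.size() == (size_t)n * n);
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> C((size_t)n * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Calling `matmul_on_gpu(A, B, n)` with two flattened `n * n` vectors returns the flattened product. The two `__syncthreads()` calls correspond to the two "synchronize" steps on the slide.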
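
The peak figures on the Nvidia C1060 slide can be cross-checked with a little arithmetic. The sketch below relies on values that do not appear on the slides and are assumptions here: a processor (shader) clock of 1.296 GHz rather than the 602 MHz core clock, 3 single-precision flops per thread processor per cycle, one double-precision multiply-add (2 flops) per multiprocessor per cycle, and a 512-bit memory interface at 1.6 GT/s.

```latex
\begin{align*}
P_{\text{single}} &= 240 \times 1.296\ \text{GHz} \times 3\ \text{flop/cycle} = 933.12\ \text{Gflop/s} \\
P_{\text{double}} &= 30 \times 1.296\ \text{GHz} \times 2\ \text{flop/cycle} = 77.76\ \text{Gflop/s} \\
B_{\text{mem}}    &= \frac{512\ \text{bit}}{8\ \text{bit/byte}} \times 1.6\ \text{GT/s} = 102.4\ \text{GB/s}
\end{align*}
```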
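
Putting the CUDA programming slides together (allocation, copy, kernel, launch), a complete vector sum might look like the sketch below. It is an illustration under stated assumptions: the extra length parameter `n`, the bounds check, the block size of 256, and the names `vector_sum_n` and `vector_sum_on_gpu` are additions made here and are not part of the slides.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Variant of the slide's kernel with a length parameter (an addition in this sketch),
// so that threads past the end of the arrays do nothing.
__global__ void vector_sum_n(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host helper. Error checking of the CUDA calls is omitted for brevity.
std::vector<float> vector_sum_on_gpu(const std::vector<float> &a,
                                     const std::vector<float> &b) {
    assert(a.size() == b.size());
    int n = (int)a.size();
    std::vector<float> c(n);
    if (n == 0) return c;  // nothing to launch

    size_t mem_size = n * sizeof(float);
    float *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, mem_size);
    cudaMalloc((void **)&dev_b, mem_size);
    cudaMalloc((void **)&dev_c, mem_size);

    // cudaMemcpy takes (destination, source, size, direction).
    cudaMemcpy(dev_a, a.data(), mem_size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b.data(), mem_size, cudaMemcpyHostToDevice);

    int block_size = 256;                               // threads per block (assumed)
    int grid_size = (n + block_size - 1) / block_size;  // enough blocks to cover n
    vector_sum_n<<<grid_size, block_size>>>(dev_a, dev_b, dev_c, n);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(c.data(), dev_c, mem_size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return c;
}
```

The grid size is rounded up so that every element gets a thread, which is why the kernel needs the `i < n` guard that the slide version omits.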
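
The slides do not show how the GPU velocity evaluation was implemented. As one possible illustration, the Biot-Savart sum can be evaluated by direct summation with one thread per target point. The sketch assumes the standard 2D kernel, whose denominator is 2π times the squared distance, evaluates the velocity at the vortex positions themselves, and skips coincident pairs. The names `velocity_direct`, `Velocity`, and `velocity_on_gpu` are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// One thread per target i: u(x_i) = sum_p alpha_p K(x_i - x_p), with
// K(x, y) = -(1 / (2 pi r^2)) (-y, x) and r^2 = x^2 + y^2 (squared distance assumed).
__global__ void velocity_direct(const float *x, const float *y, const float *alpha,
                                float *u, float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float two_pi = 6.28318530718f;
    float ui = 0.0f, vi = 0.0f;
    for (int p = 0; p < n; ++p) {
        float dx = x[i] - x[p];
        float dy = y[i] - y[p];
        float r2 = dx * dx + dy * dy;
        if (r2 > 0.0f) {  // skip self-interaction and coincident vortices
            float s = -alpha[p] / (two_pi * r2);
            ui += s * (-dy);
            vi += s * dx;
        }
    }
    u[i] = ui;
    v[i] = vi;
}

struct Velocity {
    std::vector<float> u, v;  // velocity components at each vortex position
};

// Hypothetical host helper. Error checking of the CUDA calls is omitted for brevity.
Velocity velocity_on_gpu(const std::vector<float> &x, const std::vector<float> &y,
                         const std::vector<float> &alpha) {
    assert(x.size() == y.size() && x.size() == alpha.size());
    int n = (int)x.size();
    Velocity vel;
    vel.u.resize(n);
    vel.v.resize(n);
    if (n == 0) return vel;

    size_t bytes = n * sizeof(float);
    float *d_x, *d_y, *d_a, *d_u, *d_v;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_u, bytes);
    cudaMalloc((void **)&d_v, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_a, alpha.data(), bytes, cudaMemcpyHostToDevice);

    int block_size = 128;  // threads per block (assumed)
    int grid_size = (n + block_size - 1) / block_size;
    velocity_direct<<<grid_size, block_size>>>(d_x, d_y, d_a, d_u, d_v, n);

    cudaMemcpy(vel.u.data(), d_u, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(vel.v.data(), d_v, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_a);
    cudaFree(d_u);
    cudaFree(d_v);
    return vel;
}
```

Direct summation costs O(N²) interactions for N vortices. It is used here only because it maps simply onto one thread per target point, not because the slides state that this is the method used.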
