HPP Week 1 Summary

This is my summary of Coursera 2014's Heterogeneous Parallel Programming, Week 1.
The first week introduces the need for heterogeneous parallel programming, the organization of CUDA programs, and a basic CUDA program.

  1. Heterogeneous Parallel Programming, Class of 2014
     Week 1 Summary (Update 1): CUDA
     Pipat Methavanitpong
  2. Heterogeneous Computing
     ● Diversity of Computing Units
       ○ CPU, GPU, DSP, Configurable Cores, Cloud Computing
     ● Right Man, Right Job
       ○ Each application requires a different orientation to perform best
     ● Application Examples
       ○ Financial Analysis, Scientific Simulation, Digital Audio Processing, Computer Vision, Numerical Methods, Interactive Physics
  3. Latency and Throughput Orientation
     Latency-oriented:
     ● Min Time
     ● Smart / Weak
     ● Best Path
     Throughput-oriented:
     ● Max Throughput
     ● Stupid / Strong
     ● Brute Force
  4. Latency and Throughput Orientation
     CPU (Best for Sequential)
     ● Powerful ALU
       ○ Few, Low Latency, Lightly Pipelined
     ● Large Cache
       ○ Lower Latency than RAM
     ● Sophisticated Control
       ○ Smart Branch Prediction (which INSN to take)
       ○ Smart Hazard Handling
     GPU (Best for Parallel)
     ● Weak ALU
       ○ Many, High Latency, Heavily Pipelined
     ● Small Cache
       ○ But boosts memory throughput
     ● Simple Control
       ○ No Prediction
       ○ No Data Forwarding
  5. Latency and Throughput Orientation
     (figure: a CPU die dominated by Control and Cache beside a few large ALUs and DRAM, versus a GPU die packed with many small ALUs and DRAM)
  6. System Cost
     ● Hardware + Software Cost
     ● Software dominates after 2010
     ● Reduce Software Cost = One on Many (one codebase serving many platforms)
       ○ Scalability
         ■ Same Arch / New Hardware Offer: # of cores, pipeline depth, vector length
       ○ Portability
         ■ Different Arch: x86, ARM
         ■ Different Org and Interfaces: Latency/Throughput, Shared/Distributed Mem
  7. Data Parallelism
     Manipulation of Data in Parallel, e.g. Vector Addition:

       A[0]   A[1]   A[2]   A[3]
        +      +      +      +
       B[0]   B[1]   B[2]   B[3]
        =      =      =      =
       C[0]   C[1]   C[2]   C[3]

     Each C[i] is computed independently of the others, so all four additions can happen at once.
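     To make the pattern concrete, here is a minimal plain-C sketch (the loop bound of 4 matches the diagram above; the values are illustrative, not from the slides):

       #include <stdio.h>

       int main(void) {
           int A[4] = {1, 2, 3, 4}, B[4] = {10, 20, 30, 40}, C[4];

           // Each iteration is independent: no iteration reads another's result,
           // so the loop body could be handed to one thread per element.
           for (int i = 0; i < 4; i++)
               C[i] = A[i] + B[i];

           for (int i = 0; i < 4; i++)
               printf("%d ", C[i]);   // prints: 11 22 33 44
           return 0;
       }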
  8. Introduction to CUDA
     ➔ CUDA = Compute Unified Device Architecture
     ➔ Introduced by NVIDIA
     ➔ Distributes workload from a Host to CUDA-capable Devices
     ➔ NVIDIA = GPU = Throughput Oriented = Best for Parallel
     ➔ Use of a GPU to compute as a CPU = GPGPU
     ➔ GPGPU = General-Purpose GPU
     ➔ Extends C / C++ / Fortran
  9. CUDA Thread Organization
     (figure: a Grid containing Blocks; each Block containing Threads)
     ● Grid = [Vector ~ 3D Matrix] of Blocks
       ○ Block = [Vector ~ 3D Matrix] of Threads
         ■ Thread = the unit that computes
  10. CUDA Thread Organization
      Grid Dimension declaration:   dim3 DimGrid(x,y,z);    (*variable name can be anything)
      Block Dimension declaration:  dim3 DimBlock(x,y,z);   (*variable name can be anything)

      Example:
        dim3 DimGrid(2,1,1);
        dim3 DimBlock(256,1,1);
      gives two blocks of 256 threads each:
        Block 0: t0 t1 t2 ... t255
        Block 1: t0 t1 t2 ... t255
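      The same dim3 syntax extends to 2D and 3D launches. A hedged sketch (kernel name, problem size, and tile shape are illustrative, not from the slides):

        __global__ void someKernel(float *d_img) { /* ... */ }

        int main() {
            float *d_img;
            cudaMalloc((void **) &d_img, 1024 * 768 * sizeof(float));

            dim3 DimGrid(1024 / 16, 768 / 16, 1);   // 64 x 48 blocks cover 1024x768 elements
            dim3 DimBlock(16, 16, 1);               // 16 x 16 = 256 threads per block
            someKernel<<<DimGrid, DimBlock>>>(d_img);
            cudaDeviceSynchronize();

            cudaFree(d_img);
            return 0;
        }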
  11. CUDA Memory Organization
      ● A Thread has its own private Registers
      ● Threads in a Block share a common Shared Memory
      ● Blocks in the same Grid share common Global and Constant Memory
      ● But the Host can access only Global and Constant Memory
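      A minimal sketch of where each memory space appears in code (kernel and variable names are illustrative, not from the slides):

        __constant__ float c_scale;                 // constant memory: visible to all blocks,
                                                    // set by the host (e.g. via cudaMemcpyToSymbol)

        __global__ void scaleKernel(float *d_g) {   // d_g points into global memory
            __shared__ float s_tile[256];           // shared memory: one copy per block
            int pos = blockIdx.x * blockDim.x + threadIdx.x;
            float r = d_g[pos];                     // r lives in a per-thread register
            s_tile[threadIdx.x] = r * c_scale;
            __syncthreads();                        // make the block's writes visible to its threads
            d_g[pos] = s_tile[threadIdx.x];
        }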
  12. Memory Management Commands
      Prototypes (size is in bytes):

        typedef enum cudaError cudaError_t;

        // Allocate Memory on Device
        cudaError_t cudaMalloc(void** devPtr, size_t size);
        // Copy Data
        cudaError_t cudaMemcpy(void* dst, const void* src, size_t size, enum cudaMemcpyKind kind);
        // Free Memory on Device
        cudaError_t cudaFree(void* devPtr);

      enum cudaError (first values):
        0. cudaSuccess
        1. cudaErrorMissingConfiguration
        2. cudaErrorMemoryAllocation
        3. cudaErrorInitializationError
        4. cudaErrorLaunchFailure
        5. cudaErrorPriorLaunchFailure
        6. cudaErrorLaunchTimeout
        7. cudaErrorLaunchOutOfResources
        8. cudaErrorInvalidDeviceFunction
        9. cudaErrorInvalidConfiguration
        10. cudaErrorInvalidDevice
        ...

      enum cudaMemcpyKind:
        0. cudaMemcpyHostToHost
        1. cudaMemcpyHostToDevice
        2. cudaMemcpyDeviceToHost
        3. cudaMemcpyDeviceToDevice
        4. cudaMemcpyDefault

      For more information:
      http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY.html
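      A minimal sketch of the full allocate / copy / free lifecycle these three calls form (buffer name and size are illustrative, not from the slides):

        #include <cuda_runtime.h>

        int main() {
            const size_t size = 256 * sizeof(float);   // illustrative buffer size
            float h_data[256] = {0};                   // host buffer
            float *d_data = NULL;                      // device pointer

            cudaMalloc((void **) &d_data, size);                       // allocate on device
            cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // host -> device
            /* ... launch kernels that read/write d_data ... */
            cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // device -> host
            cudaFree(d_data);                                          // release device memory
            return 0;
        }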
  13. Kernel
      A Kernel is a function that runs on the Device and is called from the Host.
      It is declared by adding an attribute to the function:

        Attribute    Return Type   Function       Executed on   Only Callable from
        __device__   any           DeviceFunc()   device        device
        __global__   void          KernelFunc()   device        host
        __host__     any           HostFunc()     host          host
        (*the __host__ attribute is optional)

      Start a Kernel function by giving it a Grid & Block structure and parameters:
        KernelFunc<<<dimGrid,dimBlock>>>(param1, param2, …);
      Wait for all thrown tasks to complete before moving on:
        cudaDeviceSynchronize();
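      A minimal sketch of the attribute distinction (function names are illustrative, not from the slides): a __device__ helper is callable only from device code, while a __global__ kernel is launched from the host.

        __device__ int square(int x) { return x * x; }   // callable from device code only

        __global__ void squareAll(int *d_v, int n) {     // launched from the host
            int pos = blockIdx.x * blockDim.x + threadIdx.x;
            if (pos < n) d_v[pos] = square(d_v[pos]);    // __global__ may call __device__
        }

        // Host side:
        //   squareAll<<<(n - 1) / 256 + 1, 256>>>(d_v, n);
        //   cudaDeviceSynchronize();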
  14. Row-Major Layout
      A way of addressing an element in an array: a multi-dimensional array can be addressed through a 1D array.
      C / C++ use Row-Major Layout; Fortran uses Column-Major.

      A 3x4 matrix laid out row by row:

        A[0,0] A[0,1] A[0,2] A[0,3]   ->   A0  A1  A2  A3
        A[1,0] A[1,1] A[1,2] A[1,3]   ->   A4  A5  A6  A7
        A[2,0] A[2,1] A[2,2] A[2,3]   ->   A8  A9  A10 A11

      so the 1D index of element (row, col) is row * width + col.
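      A minimal plain-C sketch of that indexing rule (names and fill values are illustrative, not from the slides):

        #include <stdio.h>

        #define ROWS 3
        #define COLS 4

        int main(void) {
            int flat[ROWS * COLS];                     // 1D storage for a 3x4 matrix
            for (int r = 0; r < ROWS; r++)
                for (int c = 0; c < COLS; c++)
                    flat[r * COLS + c] = r * 10 + c;   // index = row * width + col

            printf("%d\n", flat[1 * COLS + 2]);        // element (1,2): prints 12
            return 0;
        }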
  15. Sample Code: Vector Addition

      __global__ void vecAdd(int *d_vIn1, int *d_vIn2, int *d_vOut, int n) {
          int pos = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
          if (pos < n)                                       // guard the partial last block
              d_vOut[pos] = d_vIn1[pos] + d_vIn2[pos];
      }
      …
      int main() {
          int vecLength = …;
          int *h_input1 = {…};
          int *h_input2 = {…};
          int *h_output = (int *) malloc(vecLength * sizeof(int));
          int *d_input1, *d_input2, *d_output;   // each declarator needs its own *

          // Allocate device buffers
          cudaMalloc((void **) &d_input1, vecLength * sizeof(int));
          cudaMalloc((void **) &d_input2, vecLength * sizeof(int));
          cudaMalloc((void **) &d_output, vecLength * sizeof(int));

          // Copy inputs host -> device
          cudaMemcpy(d_input1, h_input1, vecLength * sizeof(int), cudaMemcpyHostToDevice);
          cudaMemcpy(d_input2, h_input2, vecLength * sizeof(int), cudaMemcpyHostToDevice);

          // Enough 256-thread blocks to cover vecLength elements
          dim3 dimGrid((vecLength - 1) / 256 + 1, 1, 1);
          dim3 dimBlock(256, 1, 1);
          vecAdd<<<dimGrid,dimBlock>>>(d_input1, d_input2, d_output, vecLength);
          cudaDeviceSynchronize();

          // Copy result device -> host, then free device memory
          cudaMemcpy(h_output, d_output, vecLength * sizeof(int), cudaMemcpyDeviceToHost);
          cudaFree(d_input1); cudaFree(d_input2); cudaFree(d_output);
          return 0;
      }
  16. Error Checking Pattern

      cudaError_t err = cudaMalloc((void **) &d_input1, size);
      if (err != cudaSuccess) {
          printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
          exit(EXIT_FAILURE);
      }