GPU Computing

Parallel Computing on GPUs
Christian Kehl, 01.01.2011

Overview
- Basics of Parallel Computing
- Brief History of SIMD vs. MIMD Architectures
- OpenCL
- Common Application Domain
- Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP

Basics of Parallel Computing
Ref.: René Fink, „Untersuchungen zur Parallelverarbeitung mit wissenschaftlich-technischen Berechnungsumgebungen“, dissertation, University of Rostock, 2007

Brief History of SIMD vs. MIMD Architectures
- 2004: programmable GPU core via shader technology
- 2007: CUDA (Compute Unified Device Architecture) release 1.0
- December 2008: first Open Computing Language (OpenCL) specification
- March 2009: uniform shaders, first beta releases of OpenCL
- August 2009: release and implementation of OpenCL 1.0

Brief History of SIMD vs. MIMD Architectures: SIMD technologies in GPUs
- Vector processing (ILLIAC IV)
- Mathematical operation units (ILLIAC IV)
- Pipelining (CRAY-1)
- Local memory caching (CRAY-1)
- Atomic instructions (CRAY-1)
- Synchronized instruction execution and memory access (MASPAR)

Platform Model (OpenCL)
- One host plus one or more compute devices
- Each compute device is composed of one or more compute units
- Each compute unit is further divided into one or more processing elements

Kernel Execution (OpenCL)
- Total number of work-items = Gx * Gy
- Size of each work-group = Sx * Sy
- The global ID can be computed from the work-group ID and the local ID
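The relation between global, work-group and local IDs can be sketched in plain Python. This is only an emulation under stated assumptions: the sizes Gx, Gy, Sx, Sy are hypothetical values, the global offset is taken to be zero, and the helper name global_id is made up for illustration. In OpenCL C the corresponding built-ins are get_global_id, get_group_id, get_local_id and get_local_size.

```python
def global_id(group_id, local_id, local_size):
    """Global ID along one dimension, assuming a global offset of zero."""
    return group_id * local_size + local_id

# Hypothetical 2-D index space: Gx x Gy work-items in work-groups of Sx x Sy.
Gx, Gy = 8, 4
Sx, Sy = 4, 2

total_work_items = Gx * Gy          # total number of work-items
work_group_size = Sx * Sy           # work-items per work-group
num_groups = (Gx // Sx, Gy // Sy)   # work-groups per dimension

# Enumerate every (work-group, local) pair and collect the resulting global IDs.
ids = sorted(
    (global_id(wx, lx, Sx), global_id(wy, ly, Sy))
    for wx in range(num_groups[0])
    for wy in range(num_groups[1])
    for lx in range(Sx)
    for ly in range(Sy)
)
```

Every work-item ends up with a distinct global ID, which is what lets a kernel use it directly as an array index.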

Memory Model (OpenCL): address spaces
- Private: private to a work-item
- Local: local to a work-group
- Global: accessible by all work-items in all work-groups
- Constant: read-only global space

Programming Language (OpenCL)
- Host code of every GPU computing technology is natively written in C/C++
- Host-code bindings exist for several other languages (Fortran, Java, C#, Ruby)
- Device code is written exclusively in standard C plus extensions

Language Restrictions (OpenCL)
- Pointers to functions are not allowed
- Pointers to pointers are allowed within a kernel, but not as an argument
- Bit-fields are not supported
- Variable-length arrays and structures are not supported
- Recursion is not supported
- Writes to a pointer of types smaller than 32 bits are not supported
- Double types are not supported, but reserved
- 3D image writes are not supported
- Some restrictions are addressed through extensions

Common Application Domain
- Multimedia data and tasks are best suited for SIMD processing
- Multimedia data: sequential byte streams, each byte independent
- Image processing is particularly suited for GPUs
- Original GPU task: "Compute <several FLOP> for every pixel of the screen" (computer graphics)
- Same task for images, only the FLOPs are different

Common Application Domain: Image Processing
Possible features realizable on the GPU:
- Contrast and luminance configuration
- Gamma scaling
- (Pixel-by-pixel) histogram scaling
- Convolution filtering
- Edge highlighting
- Negative image / image inversion
- …

Image Processing: Inversion
- Simple example: inversion
- Implementation and use of a framework for switching between different GPGPU technologies
- Creation of a command queue for each GPU
- Reading the GPU kernel from a kernel file on the fly
- Creation of buffers for the input and output image
- Memory copy of the input image data to global GPU memory
- Setting of the kernel arguments and kernel execution
- Memory copy of the GPU output buffer data to a new image
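The data flow of the inversion example can be emulated on the CPU in a few lines of Python. This is a sketch, not the framework from the slides: the names invert_kernel and run_inversion are hypothetical, 8-bit pixel values are assumed, and the loop over gid stands in for the work-items a GPU would run in parallel.

```python
def invert_kernel(gid, src, dst):
    """Body one work-item would execute: invert the byte at index gid."""
    dst[gid] = 255 - src[gid]

def run_inversion(image_bytes):
    """Emulate the host-side steps: copy in, run the kernel, copy out."""
    src = bytearray(image_bytes)   # stands in for the copy to the input buffer
    dst = bytearray(len(src))      # stands in for the output buffer
    for gid in range(len(src)):    # a GPU would run these iterations in parallel
        invert_kernel(gid, src, dst)
    return bytes(dst)              # stands in for the copy back to a new image

inverted = run_inversion(bytes([0, 100, 255]))
```

Because each output byte depends only on the input byte at the same index, no synchronization between work-items is needed, which is why this task maps well to SIMD hardware.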

Image Processing: Inversion
- Evaluated and confirmed minimum speed-up, G80 GPU (OpenCL) vs. 8-core CPU (OpenMP): 4 : 1

GPU Computing Case Study: Monte Carlo-Study of a Spring-Mass-System on GPUs

MC Study of a SMS using OpenCL and OpenMP
- Task
- Modelling
- Euler as simple ODE solver
- Existing MIMD Solutions
- An SIMD Approach
- OpenMP
- Result Plots
- Speed-Up Study
- Parallelization Conclusions
- Summary

Task
- Spring-mass system defined by a differential equation
- The behavior of the system must be simulated over varying damping values
- Therefore: numerical solution in t, t ∈ [0, 2] s, with step size h = 1/1000
- Analysis of computation time and speed-up for different compute architectures

Task
- Based on Simulation News Europe (SNE) CP2: 1000 simulation iterations over the simulation horizon with generated damping values (Monte Carlo study)
- Consecutive averaging for s(t), t ∈ [0, 2] s; h = 0.01 -> 200 steps

Task
- Too lightweight on present architectures -> modification: 5000 Monte Carlo iterations, h = 0.001 -> 2000 steps
- Aim of the analysis: knowledge about the spring behavior for different damping values (trajectory array)

Task
- Simple spring-mass system
- d … damping constant
- c … spring constant
- Equation of motion derived from Newton's second law
- Modelling needed -> free-body diagram of the mass ("Massenfreischnitt")
- The mass is displaced
- Force balance yields the equation of motion
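The equation itself did not survive in this transcript. As a hedged reconstruction, assuming a linear spring, a viscous damper and a mass denoted m (a symbol not named on the slide), the force balance from Newton's second law would read:

```latex
m\,\ddot{s}(t) = -c\,s(t) - d\,\dot{s}(t)
\quad\Longleftrightarrow\quad
m\,\ddot{s}(t) + d\,\dot{s}(t) + c\,s(t) = 0
```

Here s(t) is the displacement of the mass, c the spring constant and d the damping constant.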

Modelling
- Numerical integration based on a 2nd-order differential equation
- A DE of order n -> n DEs of 1st order

Modelling
- Transformation by substitution
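The substituted system is missing from the transcript. A sketch of the usual reduction, assuming the damped oscillator form m s'' + d s' + c s = 0 with a mass m (an assumed symbol) and the hypothetical state names x_1 and x_2:

```latex
x_1 = s, \qquad x_2 = \dot{s}
\qquad\Longrightarrow\qquad
\begin{aligned}
\dot{x}_1 &= x_2, \\
\dot{x}_2 &= -\frac{c}{m}\,x_1 - \frac{d}{m}\,x_2 .
\end{aligned}
```

One 2nd-order equation thus becomes two coupled 1st-order equations, which is the form an ODE solver such as explicit Euler operates on.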

5000 iterations

Euler as simple ODE solver
- Numerical integration by the explicit Euler method
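A minimal Python sketch of explicit Euler for one damping value follows. It assumes the damped oscillator form m*s'' + d*s' + c*s = 0. The function name and all numeric parameters (mass, spring constant, damping, initial conditions) are illustrative placeholders, not values taken from the slides. Only h = 0.001 and 2000 steps follow the task description.

```python
def euler_spring_mass(d, c, m, s0, v0, h, steps):
    """Integrate m*s'' + d*s' + c*s = 0 with explicit Euler.

    Returns the displacement s at every step, including the initial value.
    """
    s, v = s0, v0
    trajectory = [s]
    for _ in range(steps):
        a = -(c * s + d * v) / m        # acceleration from the force balance
        s, v = s + h * v, v + h * a     # both updates use the old state
        trajectory.append(s)
    return trajectory

# Placeholder parameters (assumptions for illustration only).
traj = euler_spring_mass(d=1000.0, c=9000.0, m=450.0,
                         s0=0.0, v0=0.1, h=0.001, steps=2000)
```

The tuple assignment matters: updating s first and then using the new s to compute v would turn this into a different (semi-implicit) scheme.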

Existing MIMD Solutions
- The approach cannot be applied to GPU architectures
- MIMD requirements: each PE has its own instruction flow; each PE can access RAM individually
- GPU architecture -> SIMD: each PE computes the same instruction at the same time; each PE has to be at the same instruction for accessing RAM
- Therefore: development of an SIMD approach

An SIMD Approach
- S.P./R.F.: simultaneous execution of sequential simulations with varying d parameter on spatially distributed PEs; averaging dependent on trajectories
- C.K.: simultaneous computation with all d parameters for time t_n, iteratively repeated until t_end; averaging dependent on steps
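The difference between the two approaches is essentially the loop order. The Python sketch below emulates both sequentially and is not the code from the study: the function names, the uniform damping range and all physical parameters are assumptions for illustration, and the damped oscillator form m*s'' + d*s' + c*s = 0 is assumed.

```python
import random

def step(s, v, d, c, m, h):
    """One explicit Euler step for a single damping value d."""
    a = -(c * s + d * v) / m
    return s + h * v, v + h * a

def average_by_trajectory(dampings, c, m, s0, v0, h, steps):
    """Outer loop over damping values: each run is a full sequential simulation."""
    mean = [0.0] * (steps + 1)
    for d in dampings:
        s, v = s0, v0
        mean[0] += s / len(dampings)
        for n in range(1, steps + 1):
            s, v = step(s, v, d, c, m, h)
            mean[n] += s / len(dampings)
    return mean

def average_by_step(dampings, c, m, s0, v0, h, steps):
    """Outer loop over time: all damping values advance together, step by step."""
    states = [(s0, v0) for _ in dampings]
    mean = [sum(s for s, _ in states) / len(dampings)]
    for _ in range(steps):
        # The same instruction is applied to every damping value (SIMD-style).
        states = [step(s, v, d, c, m, h) for (s, v), d in zip(states, dampings)]
        mean.append(sum(s for s, _ in states) / len(dampings))
    return mean

# Hypothetical Monte Carlo sample of damping values with a fixed seed.
rng = random.Random(0)
dampings = [rng.uniform(800.0, 1200.0) for _ in range(50)]
mean_a = average_by_trajectory(dampings, 9000.0, 450.0, 0.0, 0.1, 0.001, 200)
mean_b = average_by_step(dampings, 9000.0, 450.0, 0.0, 0.1, 0.001, 200)
```

Both orderings give the same averaged trajectory up to floating-point rounding. In the step-wise variant every element executes the same instruction at the same time, which is the property a SIMD device requires.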

OpenMP
- Parallelization technology based on the shared-memory principle
- Synchronization hidden from the developer
- Thread management controllable
- For System V-based OS: parallelization by process forking
- For Windows-based OS: parallelization by WinThread creation (AMD Study/Intel Tech Paper)

OpenMP
- In C/C++: pragma-based preprocessor directives
- In C#: represented by ParallelLoops
- More than just parallelizing loops (AMD Tech Report)
- Literature: AMD/Intel Tech Papers; Thomas Rauber, „Parallele Programmierung“; Barbara Chapman, „Using OpenMP: Portable Shared Memory Parallel Programming“

Result Plot
- Resulting trajectory for all technologies

Speed-Up Study: OpenMP, own study, comparison CPU/GPU
- SIMD Single: presented SIMD approach on CPU
- SIMD OpenMP: presented SIMD approach parallelized on CPU
- SIMD OpenCL: control of the number of executing units not possible, therefore only 1 value

Speed-Up Study
[Figure: speed-up plot with the series SIMD OpenCL, SIMD single, MIMD single, SIMD OpenMP, MIMD OpenMP]

Parallelization Conclusions
- Problem unsuited for SIMD parallelization
- On-GPU reduction too time-expensive; therefore: Euler computation on GPU, average computation on CPU
- Most time-intensive operation: memory copy between GPU and main memory
- For more complex problems or different ODE solver procedures, the speed-up behavior can change

Parallelization Conclusions
- MIMD approach of S.P./R.F. efficient for SNE CP2
- OpenMP realization for MIMD and SIMD approach possible (and done)
- OpenMP MIMD realization: almost linear speed-up
- Setting more threads than PEs physically available leads to significant thread overhead
- OpenMP automatically matches the number of threads to the physically available PEs for dynamic assignment

Summary
- The task can be solved on CPUs and GPUs
- GPU computing requires new approaches and algorithm porting
- Although GPUs have a massive number of cores operating in parallel, a speed-up is not possible for every application domain

Summary
- Advantages of GPU computing: very fast and scalable for suited problems (e.g. multimedia); cheap HPC technology in comparison to scientific supercomputers; energy-efficient; massive computing power in a small size
- Disadvantages of GPU computing: limited instruction set; strictly SIMD; SIMD algorithm development is hard; no execution supervision (e.g. segmentation/page fault)
