
GPU Computing


  1. Parallel Computing on GPUs
     Christian Kehl
     01.01.2011
  2. Overview
     - Basics of Parallel Computing
     - Brief History of SIMD vs. MIMD Architectures
     - OpenCL
     - Common Application Domain
     - Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
  3. Basics of Parallel Computing
     Ref.: René Fink, „Untersuchungen zur Parallelverarbeitung mit wissenschaftlich-technischen Berechnungsumgebungen“, Diss. Uni Rostock 2007
  4. Basics of Parallel Computing
  5. Overview
     - Basics of Parallel Computing
     - Brief History of SIMD vs. MIMD Architectures
     - OpenCL
     - Common Application Domain
     - Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
  6. Brief History of SIMD vs. MIMD Architectures
  7. Brief History of SIMD vs. MIMD Architectures
  8. Brief History of SIMD vs. MIMD Architectures
  9. Brief History of SIMD vs. MIMD Architectures
     - 2004 – programmable GPU core via shader technology
     - 2007 – CUDA (Compute Unified Device Architecture) Release 1.0
     - December 2008 – first Open Compute Language specification
     - March 2009 – uniform shaders, first beta releases of OpenCL
     - August 2009 – release and implementation of OpenCL 1.0
  10. Brief History of SIMD vs. MIMD Architectures
      SIMD technologies in GPUs:
      - vector processing (ILLIAC IV)
      - mathematical operation units (ILLIAC IV)
      - pipelining (CRAY-1)
      - local memory caching (CRAY-1)
      - atomic instructions (CRAY-1)
      - synchronized instruction execution and memory access (MASPAR)
  11. Overview
      - Basics of Parallel Computing
      - Brief History of SIMD vs. MIMD Architectures
      - OpenCL
      - Common Application Domain
      - Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
  12. OpenCL – Platform Model
      - One host plus one or more compute devices
      - Each compute device is composed of one or more compute units
      - Each compute unit is further divided into one or more processing elements
  13. OpenCL – Kernel Execution
      - Total number of work-items = Gx * Gy
      - Size of each work-group = Sx * Sy
      - The global ID can be computed from the work-group ID and the local ID
  14. OpenCL – Memory Management
  15. OpenCL – Memory Management
  16. OpenCL – Memory Model
      Address spaces:
      - Private – private to a work-item
      - Local – local to a work-group
      - Global – accessible by all work-items in all work-groups
      - Constant – read-only global space
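A minimal OpenCL C device-code sketch showing all four address-space qualifiers; the kernel name and arguments are illustrative (this fragment is compiled by the OpenCL runtime, not a host compiler):

```c
/* Illustrative kernel: each qualifier marks one of the four
 * OpenCL address spaces listed above. */
__kernel void scale_copy(__global const float *in,   /* global: all work-items */
                         __global float *out,
                         __constant float *coeff,    /* constant: read-only    */
                         __local float *scratch)     /* local: one work-group  */
{
    float tmp;                        /* private: this work-item only */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tmp = in[gid] * coeff[0];
    scratch[lid] = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);     /* synchronize the work-group   */
    out[gid] = scratch[lid];
}
```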
  17. OpenCL – Programming Language
      - Every GPU computing technology is natively written in C/C++ (host)
      - Host-code bindings to several other languages exist (Fortran, Java, C#, Ruby)
      - Device code is written exclusively in standard C plus extensions
  18. OpenCL – Language Restrictions
      - Pointers to functions are not allowed
      - Pointers to pointers are allowed within a kernel, but not as an argument
      - Bit-fields are not supported
      - Variable-length arrays and structures are not supported
      - Recursion is not supported
      - Writes to pointers of types smaller than 32 bit are not supported
      - Double types are not supported, but reserved
      - 3D image writes are not supported
      - Some restrictions are addressed through extensions
  19. Overview
      - Basics of Parallel Computing
      - Brief History of SIMD vs. MIMD Architectures
      - OpenCL
      - Common Application Domain
      - Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
  20. Common Application Domain
      - Multimedia data and tasks are best suited for SIMD processing
      - Multimedia data are sequential byte streams; each byte is independent
      - Image processing is particularly suited for GPUs
      - Original GPU task: "compute <several FLOP> for every pixel of the screen" (computer graphics)
      - Same task for images; only the FLOPs are different
  21. Common Application Domain – Image Processing
      Possible features realizable on the GPU:
      - contrast and luminance configuration
      - gamma scaling
      - (pixel-by-pixel) histogram scaling
      - convolution filtering
      - edge highlighting
      - negative image / image inversion
      - …
  22. Image Processing – Inversion
      Simple example: inversion
      - Implementation and use of a framework for switching between different GPGPU technologies
      - Creation of a command queue for each GPU
      - Reading the GPU kernel on-the-fly via a kernel file
      - Creation of buffers for the input and output image
      - Memory copy of the input image data to global GPU memory
      - Setting of kernel arguments and kernel execution
      - Memory copy of the GPU output buffer data to a new image
  23. Image Processing – Inversion
      Evaluated and confirmed minimum speedup – G80 GPU (OpenCL) vs. 8-core CPU (OpenMP): 4 : 1
  24. GPU Computing
      Case Study: Monte Carlo Study of a Spring-Mass System on GPUs
  25. Overview
      - Basics of Parallel Computing
      - Brief History of SIMD vs. MIMD Architectures
      - OpenCL
      - Common Application Domain
      - Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
  26. MC Study of a SMS using OpenCL and OpenMP
      - Task
      - Modelling
      - Euler as a simple ODE solver
      - Existing MIMD solutions
      - An SIMD approach
      - OpenMP
      - Result plots
      - Speed-up study
      - Parallelization conclusions
      - Summary
  27. Task
      - The spring-mass system is defined by a differential equation
      - The behavior of the system must be simulated over varying damping values
      - Therefore: numerical solution in t; t ∈ [0.0 … 2] sec for a step size h = 1/1000
      - Analysis of computation time and speed-up for different compute architectures
  28. Task
      Based on Simulation News Europe (SNE) comparison CP2:
      - 1000 simulation iterations over the simulation horizon with generated damping values (Monte Carlo study)
      - consecutive averaging for s(t)
      - t ∈ [0 … 2] sec; h = 0.01 → 200 steps
  29. Task
      Too lightweight for present architectures → modification:
      - 5000 Monte Carlo iterations
      - h = 0.001 → 2000 steps
      Aim of the analysis: knowledge about the spring behavior for different damping values (trajectory array)
  30. Task
      Simple spring-mass system:
      - d … damping constant
      - c … spring constant
      The movement equation is derived from Newton's 2nd axiom; modelling requires a free-body diagram of the mass („Massenfreischnitt“):
      - the mass is moved
      - force-balance equation
  31. MC Study of a SMS using OpenCL and OpenMP
      - Task
      - Modelling
      - Euler as a simple ODE solver
      - Existing MIMD solutions
      - An SIMD approach
      - OpenMP
      - Result plots
      - Speed-up study
      - Parallelization conclusions
      - Summary
  32. Modelling
      - Numerical integration is based on a 2nd-order differential equation
      - A DE of order n → n DEs of 1st order
  33. Modelling
      Transformation by substitution
      - random damping parameter d for the interval limits [800; 1200]
      - 5000 iterations
  34. MC Study of a SMS using OpenCL and OpenMP
      - Task
      - Modelling
      - Euler as a simple ODE solver
      - Existing MIMD solutions
      - An SIMD approach
      - OpenMP
      - Result plots
      - Speed-up study
      - Parallelization conclusions
      - Summary
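The substitution from the modelling slides can be written out explicitly (m denotes the mass; this is the standard reduction of the 2nd-order equation to a first-order system):

```latex
% Spring-mass system and its reduction by the substitution v := s'
\begin{aligned}
  m\,\ddot{s} + d\,\dot{s} + c\,s &= 0 \\
  v &:= \dot{s} \\
  \dot{s} &= v \\
  \dot{v} &= -\frac{d\,v + c\,s}{m}
\end{aligned}
```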
  35. Euler as a simple ODE solver
      Numerical integration by the explicit Euler method
  36. MC Study of a SMS using OpenCL and OpenMP
      - Task
      - Modelling
      - Euler as a simple ODE solver
      - Existing MIMD solutions
      - An SIMD approach
      - OpenMP
      - Result plots
      - Speed-up study
      - Parallelization conclusions
      - Summary
  37. Existing MIMD Solutions
  38. Existing MIMD Solutions
      The approach cannot be applied to GPU architectures.
      MIMD requirements:
      - each PE has its own instruction flow
      - each PE can access RAM individually
      GPU architecture → SIMD:
      - each PE computes the same instruction at the same time
      - each PE has to be at the same instruction for accessing RAM
      Therefore: development of an SIMD approach
  39. MC Study of a SMS using OpenCL and OpenMP
      - Task
      - Modelling
      - Euler as a simple ODE solver
      - Existing MIMD solutions
      - An SIMD approach
      - OpenMP
      - Result plots
      - Speed-up study
      - Parallelization conclusions
      - Summary
  40. An SIMD Approach
      S.P./R.F.:
      - simultaneous execution of the sequential simulation with varying d-parameter on spatially distributed PEs
      - averaging depends on trajectories
      C.K.:
      - simultaneous computation with all d-parameters for time tn, repeated iteratively until tend
      - averaging depends on steps
  41. An SIMD Approach
  42. MC Study of a SMS using OpenCL and OpenMP
      - Task
      - Modelling
      - Euler as a simple ODE solver
      - Existing MIMD solutions
      - An SIMD approach
      - OpenMP
      - Result plots
      - Speed-up study
      - Parallelization conclusions
      - Summary
  43. OpenMP
      - Parallelization technology based on the shared-memory principle
      - Synchronization is hidden from the developer
      - Thread management is controllable
      - On System-V-based OSes: parallelization by process forking
      - On Windows-based OSes: parallelization by WinThread creation (AMD study / Intel tech paper)
  44. OpenMP
      - In C/C++: pragma-based preprocessor directives
      - In C#: represented by parallel loops
      - More than just parallelizing loops (AMD tech report)
      Literature:
      - AMD/Intel tech papers
      - Thomas Rauber, „Parallele Programmierung“
      - Barbara Chapman, "Using OpenMP: Portable Shared Memory Parallel Programming"
  45. MC Study of a SMS using OpenCL and OpenMP
      - Task
      - Modelling
      - Euler as a simple ODE solver
      - Existing MIMD solutions
      - An SIMD approach
      - OpenMP
      - Result plot
      - Speed-up study
      - Parallelization conclusions
      - Summary
  46. Result Plot
      Resulting trajectory for all technologies
  47. MC Study of a SMS using OpenCL and OpenMP
      - Task
      - Modelling
      - Euler as a simple ODE solver
      - Existing MIMD solutions
      - An SIMD approach
      - OpenMP
      - Result plots
      - Speed-up study
      - Parallelization conclusions
      - Summary
  48. Speed-Up Study
      OpenMP – own study – comparison CPU/GPU
      - SIMD single: presented SIMD approach on the CPU
      - SIMD OpenMP: presented SIMD approach parallelized on the CPU
      - SIMD OpenCL: control of the number of executing units is not possible, therefore only one value
  49. Speed-Up Study
      (plot legend: SIMD OpenCL, SIMD single, MIMD single, SIMD OpenMP, MIMD OpenMP)
  50. MC Study of a SMS using OpenCL and OpenMP
      - Task
      - Modelling
      - Euler as a simple ODE solver
      - Existing MIMD solutions
      - An SIMD approach
      - OpenMP
      - Result plots
      - Speed-up study
      - Parallelization conclusions
      - Summary
  51. Parallelization Conclusions
      - The problem is unsuited for SIMD parallelization
      - On-GPU reduction is too time-expensive; therefore:
        - Euler computation on the GPU
        - average computation on the CPU
      - Most time-intensive operation: MemCopy between the GPU and main memory
      - For more complex problems or different ODE solver procedures the speed-up behavior can change
  52. Parallelization Conclusions
      - The MIMD approach of S.P./R.F. is efficient for SNE CP2
      - An OpenMP realization is possible (and was done) for both the MIMD and the SIMD approach
      - The OpenMP MIMD realization achieves almost linear speedup
      - Setting more threads than physically available PEs leads to significant thread overhead
      - OpenMP automatically matches the number of threads to the physically available PEs for dynamic assignment
  53. MC Study of a SMS using OpenCL and OpenMP
      - Task
      - Modelling
      - Euler as a simple ODE solver
      - Existing MIMD solutions
      - An SIMD approach
      - OpenMP
      - Result plots
      - Speed-up study
      - Parallelization conclusions
      - Summary
  54. Summary
      - The task can be solved on CPUs and GPUs
      - GPU computing requires new approaches and algorithm porting
      - Although GPUs have a massive number of parallel operating cores, a speed-up is not possible for every application domain
  55. Summary
      Advantages of GPU computing:
      - very fast and scalable for suited problems (e.g. multimedia)
      - cheap HPC technology compared to scientific supercomputers
      - energy-efficient
      - massive computing power in a small size
      Disadvantages of GPU computing:
      - limited instruction set
      - strictly SIMD
      - SIMD algorithm development is hard
      - no execution supervision (e.g. segmentation/page fault)
  56. Overview
      - Basics of Parallel Computing
      - Brief History of SIMD vs. MIMD Architectures
      - OpenCL
      - Common Application Domain
      - Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
