GPU Computing

  • Speaker note (translated from German): GPU GDRAM is further subdivided according to the physical architecture of the processing unit.

    1. Parallel Computing on GPUs
       Christian Kehl
       01.01.2011
    2. Overview
       • Basics of Parallel Computing
       • Brief History of SIMD vs. MIMD Architectures
       • OpenCL
       • Common Application Domain
       • Monte Carlo Study of a Spring-Mass System using OpenCL and OpenMP
    3. Basics of Parallel Computing
       Ref.: René Fink, „Untersuchungen zur Parallelverarbeitung mit wissenschaftlich-technischen Berechnungsumgebungen", Diss., Uni Rostock, 2007
    4. Basics of Parallel Computing (figure)
    5. Overview (agenda repeat)
    6.-8. Brief History of SIMD vs. MIMD Architectures (figures)
    9. Brief History of SIMD vs. MIMD Architectures
       • 2004 – programmable GPU core via shader technology
       • 2007 – CUDA (Compute Unified Device Architecture) Release 1.0
       • December 2008 – first Open Compute Language (OpenCL) specification
       • March 2009 – uniform shaders, first beta releases of OpenCL
       • August 2009 – release and implementation of OpenCL 1.0
    10. Brief History of SIMD vs. MIMD Architectures
        SIMD technologies in GPUs:
        • vector processing (ILLIAC IV)
        • mathematical operation units (ILLIAC IV)
        • pipelining (CRAY-1)
        • local memory caching (CRAY-1)
        • atomic instructions (CRAY-1)
        • synchronized instruction execution and memory access (MASPAR)
    11. Overview (agenda repeat)
    12. OpenCL: Platform Model
        • one host plus one or more Compute Devices
        • each Compute Device is composed of one or more Compute Units
        • each Compute Unit is further divided into one or more Processing Elements
        (see the host-code sketch below)
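    A minimal host-side sketch (not part of the original slides) of how this platform model maps onto the OpenCL API: the host enumerates its compute devices and queries how many compute units each one offers. Error handling is omitted.

        #include <stdio.h>
        #include <CL/cl.h>

        int main(void) {
            cl_platform_id platform;
            cl_device_id devices[8];
            cl_uint num_devices;

            /* one host: pick the first platform, then list its compute devices */
            clGetPlatformIDs(1, &platform, NULL);
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

            for (cl_uint i = 0; i < num_devices; ++i) {
                cl_uint units;  /* compute units per compute device */
                clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                                sizeof(units), &units, NULL);
                printf("device %u: %u compute units\n", i, units);
            }
            return 0;
        }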
    13. OpenCL: Kernel Execution
        • total number of work-items = Gx * Gy
        • size of each work-group = Sx * Sy
        • the global ID can be computed from the work-group ID and the local ID (see the sketch below)
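    To make the indexing concrete, a small OpenCL C sketch (assumptions: a 2D NDRange as on the slide; the kernel name is hypothetical). Both computations yield the same global ID, illustrating global = group * local_size + local:

        __kernel void index_demo(__global int *ok, const int width) {
            /* direct query of the global ID */
            size_t gx = get_global_id(0);
            size_t gy = get_global_id(1);

            /* reconstruction from work-group ID and local ID */
            size_t gx2 = get_group_id(0) * get_local_size(0) + get_local_id(0);
            size_t gy2 = get_group_id(1) * get_local_size(1) + get_local_id(1);

            /* always 1 (when no global offset is used) */
            ok[gy * width + gx] = (gx == gx2) && (gy == gy2);
        }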
    14.-15. OpenCL: Memory Management (figures)
    16. OpenCL: Memory Model
        Address spaces:
        • private – private to a work-item
        • local – local to a work-group
        • global – accessible by all work-items in all work-groups
        • constant – read-only global space
        (see the kernel sketch below)
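    A hedged kernel sketch (names hypothetical) touching all four address spaces:

        __constant float scale = 0.5f;                    /* constant: read-only global space */

        __kernel void spaces_demo(__global float *data,   /* global: visible to all work-items */
                                  __local  float *tmp)    /* local: shared within one work-group */
        {
            float x = data[get_global_id(0)];             /* x is private to this work-item */
            tmp[get_local_id(0)] = x * scale;
            barrier(CLK_LOCAL_MEM_FENCE);                 /* work-group synchronization */
            data[get_global_id(0)] = tmp[get_local_id(0)];
        }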
    17. OpenCL: Programming Language
        • every GPU computing technology is natively programmed in C/C++ (host)
        • host-code bindings to several other languages exist (Fortran, Java, C#, Ruby)
        • device code is written exclusively in standard C plus extensions
    18. OpenCL: Language Restrictions
        • pointers to functions are not allowed
        • pointers to pointers are allowed within a kernel, but not as an argument
        • bit-fields are not supported
        • variable-length arrays and structures are not supported
        • recursion is not supported
        • writes to a pointer of a type smaller than 32 bit are not supported
        • double types are not supported, but reserved
        • 3D image writes are not supported
        • some restrictions are addressed through extensions
    19. Overview (agenda repeat)
    20. Common Application Domain
        • multimedia data and tasks are best suited for SIMD processing
        • multimedia data consists of sequential byte streams; each byte is independent
        • image processing is particularly well suited for GPUs
        • the original GPU task was "compute <several FLOP> for every pixel of the screen" (computer graphics)
        • the task is the same for images; only the FLOPs differ
    21. Common Application Domain – Image Processing
        Possible features realizable on the GPU:
        • contrast and luminance configuration
        • gamma scaling
        • (pixel-by-pixel) histogram scaling
        • convolution filtering
        • edge highlighting
        • negative image / image inversion
        • …
    22. Image Processing: Inversion
        Simple example: inversion
        • implementation and use of a framework for switching between different GPGPU technologies
        • creation of a command queue for each GPU
        • reading the GPU kernel from a kernel file on the fly
        • creation of buffers for the input and output image
        • memory copy of the input image data to global GPU memory
        • setting of kernel arguments and kernel execution
        • memory copy of the GPU output buffer data to a new image
        (see the host-code sketch below)
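    A condensed host-side sketch of these steps (assumptions: context, queue, and program already exist; variable names are hypothetical; error handling omitted):

        size_t bytes = width * height * 4;        /* e.g. RGBA, one byte per channel */

        /* buffers for the input and output image */
        cl_mem in  = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
        cl_mem out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

        /* memory copy of the input image data to global GPU memory */
        clEnqueueWriteBuffer(queue, in, CL_TRUE, 0, bytes, src_pixels, 0, NULL, NULL);

        /* set kernel arguments and execute, one work-item per byte */
        cl_kernel kernel = clCreateKernel(program, "invert", NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &in);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &bytes, NULL, 0, NULL, NULL);

        /* memory copy of the GPU output buffer data to the new image */
        clEnqueueReadBuffer(queue, out, CL_TRUE, 0, bytes, dst_pixels, 0, NULL, NULL);

    The device kernel itself is a one-liner; inversion maps each byte b to 255 - b:

        __kernel void invert(__global const uchar *in, __global uchar *out) {
            size_t i = get_global_id(0);
            out[i] = (uchar)(255 - in[i]);
        }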
    23. Image Processing: Inversion
        Evaluated and confirmed minimum speed-up, G80 GPU (OpenCL) vs. 8-core CPU (OpenMP): 4 : 1
    24. GPU Computing
        Case Study: Monte Carlo Study of a Spring-Mass System on GPUs
    25. Overview (agenda repeat)
    26. MC Study of a SMS using OpenCL and OpenMP
        • Task
        • Modelling
        • Euler as a simple ODE solver
        • Existing MIMD solutions
        • An SIMD approach
        • OpenMP
        • Result plots
        • Speed-up study
        • Parallelization conclusions
        • Résumé
    27. Task
        • the spring-mass system is defined by a differential equation
        • the behavior of the system must be simulated over varying damping values
        • therefore: numerical solution in t, t ∈ [0.0, 2.0] s, for a step size h = 1/1000
        • analysis of computation time and speed-up for different compute architectures
    28. Task
        Based on Simulation News Europe (SNE) CP2:
        • 1000 simulation iterations over the simulation horizon with generated damping values (Monte Carlo study)
        • consecutive averaging for s(t)
        • t ∈ [0, 2] s; h = 0.01 → 200 steps
    29. Task
        Too lightweight on present architectures → modification:
        • 5000 iterations with Monte Carlo
        • h = 0.001 → 2000 steps
        Aim of the analysis: knowledge about the spring behavior for different damping values (trajectory array)
    30. Task
        Simple spring-mass system:
        • d … damping constant
        • c … spring constant
        • movement equation derived from Newton's 2nd axiom
        • modelling needed → free-body diagram of the mass („Massenfreischnitt")
        • the mass is moved
        • force-balance equation (reconstructed below)
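    The force-balance equation the slide refers to can be reconstructed in the usual form (assuming a mass m, which the slide text does not spell out):

        % Newton's 2nd axiom applied to the free-body diagram:
        m\,\ddot{s}(t) = -d\,\dot{s}(t) - c\,s(t)
        \quad\Longleftrightarrow\quad
        m\,\ddot{s} + d\,\dot{s} + c\,s = 0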
    31. MC Study of a SMS using OpenCL and OpenMP (agenda repeat)
    32. Modelling
        • numerical integration is based on a 2nd-order differential equation
        • a DE of order n → n DEs of 1st order (see the substitution below)
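    For the spring-mass equation above, this order reduction reads (a reconstruction; v is the introduced velocity variable):

        v := \dot{s}
        \quad\Rightarrow\quad
        \begin{cases}
        \dot{s} = v \\[2pt]
        \dot{v} = -\tfrac{d}{m}\,v - \tfrac{c}{m}\,s
        \end{cases}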
    33. Modelling
        Transformation by substitution
        • random damping parameter d for the interval limits [800; 1200]
        • 5000 iterations
    34. MC Study of a SMS using OpenCL and OpenMP (agenda repeat)
    35. Euler as a simple ODE solver
        • numerical integration by the explicit Euler method (see the sketch below)
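    A minimal C sketch of the solver for one damping value (the values of m, c, and the initial state are assumptions for illustration; the slides only fix h and the interval of d):

        #include <stdio.h>

        /* explicit Euler for m*s'' + d*s' + c*s = 0,
         * rewritten as s' = v, v' = -(d*v + c*s)/m (see the substitution above) */
        int main(void) {
            const double m = 450.0, c = 9000.0, d = 1000.0; /* assumed parameters */
            const double h = 0.001;                         /* step size from the task */
            double s = 0.1, v = 0.0;                        /* assumed initial state */

            for (int n = 0; n < 2000; ++n) {                /* t in [0, 2] s */
                double a = -(d * v + c * s) / m;            /* acceleration at t_n */
                s += h * v;                                 /* Euler update of position */
                v += h * a;                                 /* Euler update of velocity */
            }
            printf("s(2.0) = %f\n", s);
            return 0;
        }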
    36. MC Study of a SMS using OpenCL and OpenMP (agenda repeat)
    37. Existing MIMD solutions (figure)
    38. Existing MIMD solutions
        The approach cannot be applied to GPU architectures.
        MIMD requirements:
        • each PE has its own instruction flow
        • each PE can access RAM individually
        GPU architecture → SIMD:
        • each PE computes the same instruction at the same time
        • each PE has to be at the same instruction when accessing RAM
        Therefore: development of an SIMD approach
    39. MC Study of a SMS using OpenCL and OpenMP (agenda repeat)
    40. An SIMD approach
        S.P./R.F.:
        • simultaneous execution of the sequential simulation with a varying d-parameter on spatially distributed PEs
        • averaging depends on trajectories
        C.K.:
        • simultaneous computation with all d-parameters for time tn, iteratively repeated until tend
        • averaging depends on steps
        (see the kernel sketch below)
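    A hedged OpenCL sketch of C.K.'s scheme (kernel and parameter names hypothetical): one work-item per damping value, so that on SIMD hardware all 5000 instances execute the same time step tn simultaneously; the per-step averaging is left to the CPU, as the conclusions later note:

        __kernel void euler_mc(__global const float *d,   /* 5000 damping values */
                               __global float *traj,      /* steps x items trajectory array */
                               const float m, const float c,
                               const float h, const int steps)
        {
            size_t i = get_global_id(0);
            size_t items = get_global_size(0);
            float s = 0.1f, v = 0.0f;                      /* assumed initial state */

            for (int n = 0; n < steps; ++n) {              /* lockstep time loop */
                float a = -(d[i] * v + c * s) / m;
                s += h * v;
                v += h * a;
                traj[n * items + i] = s;                   /* store step n of trajectory i */
            }
        }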
    41. An SIMD approach (figure)
    42. MC Study of a SMS using OpenCL and OpenMP (agenda repeat)
    43. OpenMP
        • parallelization technology based on the shared-memory principle
        • synchronization is hidden from the developer
        • thread management is controllable
        • on System-V-based OSes: parallelization by process forking
        • on Windows-based OSes: parallelization by WinThread creation (AMD study / Intel tech paper)
    44. OpenMP
        • in C/C++: pragma-based preprocessor directives (see the sketch below)
        • in C#: represented by parallel loops
        • more than just parallelizing loops (AMD tech report)
        Literature:
        • AMD/Intel tech papers
        • Thomas Rauber, „Parallele Programmierung"
        • Barbara Chapman, „Using OpenMP: Portable Shared Memory Parallel Programming"
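    A small sketch of such a pragma in C (hypothetical names; this is the MIMD-style variant: one full trajectory per iteration, iterations distributed across CPU threads):

        #include <omp.h>

        void monte_carlo(const double *d, double *s_end, int iterations,
                         double m, double c, double h, int steps)
        {
            #pragma omp parallel for                       /* fork worker threads */
            for (int i = 0; i < iterations; ++i) {
                double s = 0.1, v = 0.0;                   /* assumed initial state */
                for (int n = 0; n < steps; ++n) {          /* sequential Euler loop */
                    double a = -(d[i] * v + c * s) / m;
                    s += h * v;
                    v += h * a;
                }
                s_end[i] = s;                              /* final position per damping value */
            }
        }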
    45. MC Study of a SMS using OpenCL and OpenMP (agenda repeat)
    46. Result plot
        Resulting trajectory for all technologies (figure)
    47. MC Study of a SMS using OpenCL and OpenMP (agenda repeat)
    48. Speed-up study
        OpenMP – own study – comparison CPU/GPU:
        • SIMD single: presented SIMD approach on the CPU
        • SIMD OpenMP: presented SIMD approach parallelized on the CPU
        • SIMD OpenCL: control of the number of executing units is not possible, therefore only one value
    49. Speed-up study (plot; series: SIMD OpenCL, SIMD single, MIMD single, SIMD OpenMP, MIMD OpenMP)
    50. MC Study of a SMS using OpenCL and OpenMP (agenda repeat)
    51. Parallelization conclusions
        • the problem is unsuited for SIMD parallelization
        • on-GPU reduction is too time-expensive, therefore:
          - Euler computation on the GPU
          - average computation on the CPU
        • most time-intensive operation: memory copy between GPU and main memory
        • for more complex problems or different ODE solver procedures, the speed-up behavior can change
    52. Parallelization conclusions
        • the MIMD approach of S.P./R.F. is efficient for SNE CP2
        • an OpenMP realization of both the MIMD and the SIMD approach is possible (and was done)
        • the OpenMP MIMD realization achieves an almost linear speed-up
        • setting more threads than there are physically available PEs leads to significant thread overhead
        • OpenMP automatically matches the number of threads to the physically available PEs for dynamic assignment
    53. MC Study of a SMS using OpenCL and OpenMP (agenda repeat)
    54. Résumé
        • the task can be solved on CPUs and on GPUs
        • GPU computing requires new approaches and algorithm porting
        • although GPUs have a massive number of parallel operating cores, a speed-up is not possible for every application domain
    55. Résumé
        Advantages of GPU computing:
        • very fast and scalable for suited problems (e.g. multimedia)
        • cheap HPC technology in comparison to scientific supercomputers
        • energy-efficient
        • massive computing power in a small size
        Disadvantages of GPU computing:
        • limited instruction set
        • strictly SIMD
        • SIMD algorithm development is hard
        • no execution supervision (e.g. segmentation/page faults)
    56. Overview (agenda repeat)
