GPU Architecture
Perhaad Mistry & Dana Schaa, Northeastern University Computer Architecture Research Lab, with Benedict R. Gaster, AMD © 2011
Instructor Notes
• We describe the motivation for talking about underlying device architecture, because device architecture is often avoided in conventional programming courses
• Contrast conventional multicore CPU architecture with a high-level view of AMD and Nvidia GPU architectures
• This lecture starts with a high-level architectural view of all GPUs, discusses each vendor's architecture, and then converges back to the OpenCL spec
• Stress the difference between the AMD VLIW architecture and the Nvidia scalar architecture
• Also discuss the different memory architectures
• A brief discussion of the ICD and the OpenCL compilation flow provides a lead-in to Lecture 5, where the first complete OpenCL program is written
Topics
• Mapping the OpenCL spec to many-core hardware
  • AMD GPU Architecture
  • Nvidia GPU Architecture
  • Cell Broadband Engine
• OpenCL-specific topics
  • OpenCL Compilation System
  • Installable Client Driver (ICD)
Motivation
• Why are we discussing vendor-specific hardware if OpenCL is platform independent?
  • To gain intuition for how a program's loops and data need to map to OpenCL kernels in order to obtain performance
  • To observe similarities and differences between Nvidia and AMD hardware
  • Understanding the hardware will allow platform-specific tuning of code in later lectures
Conventional CPU Architecture
• Space is devoted to control logic instead of ALUs
• CPUs are optimized to minimize the latency of a single thread
  • Can efficiently handle control-flow-intensive workloads
• Multi-level caches are used to hide latency
• Limited number of registers, due to the smaller number of active threads
• Control logic reorders execution, provides ILP, and minimizes pipeline stalls
• A present-day multicore CPU can have more than one ALU (typically < 32), and part of the cache hierarchy is usually shared across cores
[Figure: conventional CPU block diagram – control logic, ALU, L1/L2/L3 caches, ~25 GB/s bus to system memory]
Modern GPGPU Architecture
• Generic many-core GPU
• Less space devoted to control logic and caches
• Large register files to support multiple thread contexts
• Low-latency, hardware-managed thread switching
• Large number of ALUs per "core", with a small user-managed cache per core
• Memory bus optimized for bandwidth
  • ~150 GB/s of bandwidth allows a large number of ALUs to be serviced simultaneously
[Figure: generic GPU block diagram – simple ALUs, small caches, high-bandwidth bus to on-board system memory]
AMD GPU Hardware Architecture
• AMD Radeon HD 5870 – "Cypress"
  • 20 SIMD engines
  • 16 stream cores per SIMD engine
  • 5 multiply-adds per stream core (VLIW processing)
  • 2.72 teraflops single precision
  • 544 gigaflops double precision
Source: Introductory OpenCL, SAAHPC 2010, Benedict R. Gaster
SIMD Engine
• A SIMD engine consists of a set of "stream cores"
• Stream cores are arranged as a five-way Very Long Instruction Word (VLIW) processor
  • Up to five scalar operations can be issued in one VLIW instruction
  • Scalar operations execute on the individual processing elements
• Stream cores within a compute unit execute the same VLIW instruction
• The block of work-items that execute together is called a wavefront
  • 64 work-items per wavefront on the 5870
[Figure: one SIMD engine and one stream core – processing elements, T-processing element, branch execution unit, general-purpose registers, instruction and control flow]
Source: AMD Accelerated Parallel Processing OpenCL Programming Guide
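Host code can discover this scheduling granularity at run time. A minimal sketch, not from the slides, assuming a kernel and device have already been created elsewhere:

/* Query the work-group size multiple the runtime prefers. On AMD Evergreen
 * GPUs this typically reports 64 (the wavefront size); on Nvidia Fermi it
 * typically reports 32 (the warp size). */
#include <stdio.h>
#include <CL/cl.h>

void print_preferred_multiple(cl_kernel kernel, cl_device_id device)
{
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    printf("Preferred work-group size multiple: %zu\n", multiple);
}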
AMD Platform as seen in OpenCL
• Individual work-items execute on a single processing element
  • A processing element refers to a single VLIW core
• Multiple work-groups execute on a compute unit
  • A compute unit refers to a SIMD engine
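As a concrete illustration of this mapping, a minimal vector-add kernel (illustrative names, not from the slides) shows how each loop iteration of "for (i = 0; i < n; ++i) c[i] = a[i] + b[i];" becomes one work-item, and how work-items are grouped into work-groups that the runtime schedules onto compute units:

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      const unsigned int n)
{
    size_t gid = get_global_id(0);   // which work-item am I, globally?
    size_t lid = get_local_id(0);    // position within my work-group
    size_t wg  = get_group_id(0);    // which work-group (runs on one compute unit)

    (void)lid; (void)wg;             // shown for illustration only
    if (gid < n)
        c[gid] = a[gid] + b[gid];
}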
AMD GPU Memory Architecture
• Memory per compute unit
  • Local data store (on-chip)
  • Registers
  • L1 cache (8 KB on the 5870) per compute unit
• L2 cache shared between compute units (512 KB on the 5870)
• Fast path for 32-bit operations only
• Complete path for atomics and sub-32-bit operations
[Figure: memory hierarchy – LDS and registers per SIMD engine, L1 cache, compute-unit-to-memory crossbar, L2 cache, write cache, atomic path]
AMD Memory Model in OpenCL
• A subset of the hardware memory is exposed in OpenCL
• The Local Data Share (LDS) is exposed as local memory
  • Designed to share data between work-items of a work-group in order to increase performance
  • High-bandwidth access per SIMD engine
• Private memory utilizes registers per work-item
• Constant memory
  • __constant tags utilize the L1 cache
[Figure: OpenCL memory model – private memory per work-item, local memory per compute unit, global/constant memory data cache, global memory on the compute device]
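A minimal sketch of the local-memory usage described above (illustrative kernel, assuming the work-group size is a power of two): a work-group stages data in the LDS and cooperates on a partial sum instead of each work-item re-reading global memory.

__kernel void partial_sums(__global const float *in,
                           __global float *group_sums,
                           __local float *scratch)      // one float per work-item
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);            // all loads into LDS are visible

    // Tree reduction within the work-group, entirely in local memory
    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        group_sums[get_group_id(0)] = scratch[0];
}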
AMD Constant Memory Usage
• Constant memory declarations on AMD GPUs are only beneficial for the following access patterns:
  • Direct-addressing patterns: non-array constant values whose address is known up front
  • Same-index patterns: all work-items reference the same constant address
  • Globally scoped constant arrays: arrays that are initialized and globally scoped can use the cache if they are smaller than 16 KB
• Cases where each work-item accesses different indices are not cached, and deliver the same performance as a global memory read
Source: AMD Accelerated Parallel Processing OpenCL Programming Guide
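A hedged sketch of the "same index" and "globally scoped constant array" patterns; the filter taps and sizes are illustrative, not taken from the AMD guide:

// Globally scoped, initialized, well under 16 KB: eligible for the constant cache
__constant float gauss5[5] = { 0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f };

__kernel void blur_row(__global const float *in,
                       __global float *out,
                       const int width)
{
    int x = get_global_id(0);
    if (x < 2 || x >= width - 2) return;

    float acc = 0.0f;
    // Every work-item reads gauss5[k] with the same k at the same time
    // (same-index pattern), so the access can be served from the cache.
    for (int k = -2; k <= 2; ++k)
        acc += gauss5[k + 2] * in[x + k];
    out[x] = acc;
}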
Nvidia GPUs – Fermi Architecture
• GTX 480 – compute capability 2.0
• 15 cores, or Streaming Multiprocessors (SMs)
• Each SM features 32 CUDA processors
• 480 CUDA processors in total
• Global memory with ECC
[Figure: Fermi SM block diagram – instruction cache, two warp schedulers and dispatch units, 32768 x 32-bit register file, 32 cores, load/store units, special function units (SFUs), interconnect, 64 KB L1 cache / shared memory, L2 cache; each CUDA core contains a dispatch port, operand collector, FP unit, integer unit, and result queue]
Source: NVIDIA's Next Generation CUDA Compute Architecture Whitepaper
Nvidia GPUs – Fermi Architecture
• An SM executes threads in groups of 32 called warps
  • Two warp issue units per SM
• Concurrent kernel execution
  • Multiple kernels can execute simultaneously to improve efficiency
• A CUDA core consists of a single integer ALU and a floating-point unit (FPU)
[Figure: Fermi SM block diagram, as on the previous slide]
Source: NVIDIA's Next Generation CUDA Compute Architecture Whitepaper
SIMT and SIMD
• SIMT denotes scalar instructions with multiple threads sharing an instruction stream
  • Hardware determines how the instruction stream is shared across ALUs
  • E.g. NVIDIA GeForce ("SIMT" warps) and AMD Radeon ("wavefronts"), where all the threads in a warp/wavefront proceed in lockstep
  • Divergence between threads is handled using predication
• SIMT instructions specify the execution and branching behavior of a single thread
• SIMD instructions expose the vector width explicitly
  • E.g. of SIMD: explicit vector instructions like x86 SSE
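A small illustrative kernel (not from the slides) showing the divergence case: even and odd work-items in the same warp/wavefront take different paths, which the hardware serializes using predication, so each side idles while the other executes.

__kernel void divergent(__global float *data)
{
    size_t gid = get_global_id(0);

    if (gid % 2 == 0)          // neighbouring work-items in one wavefront
        data[gid] *= 2.0f;     // take different branches, so the two paths
    else                       // run one after the other under predication
        data[gid] += 1.0f;
}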
SIMT Execution Model
• SIMD execution can be combined with pipelining
  • ALUs all execute the same instruction
  • Pipelining is used to break each instruction into phases
  • When the first instruction completes (4 cycles here), the next instruction is ready to execute
[Figure: a wavefront of Add/Mul operations advancing across the SIMD width over successive cycles]
Nvidia Memory Hierarchy
• The L1 cache per SM is configurable to support shared memory and caching of global memory
  • 48 KB shared memory / 16 KB L1 cache, or
  • 16 KB shared memory / 48 KB L1 cache
• Data is shared between work-items of a group using shared memory
• Each SM has a bank of 32 K 32-bit registers
• The L2 cache (768 KB) services all operations (load, store, and texture)
  • Unified path to global memory for loads and stores
[Figure: registers and shared memory / L1 cache per thread block, L2 cache, global memory]
Nvidia Memory Model in OpenCL
• As with AMD, a subset of the hardware memory is exposed in OpenCL
• The configurable shared memory is usable as local memory
  • Local memory is used to share data between work-items of a work-group at lower latency than global memory
• Private memory utilizes registers per work-item
[Figure: OpenCL memory model – private memory per work-item, local memory per compute unit, global/constant memory data cache, global memory on the compute device]
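Because both vendors expose their on-chip scratchpad through the same OpenCL abstraction, the host code that allocates local memory is identical on either device. A hedged sketch, assuming the kernel, queue, and buffers were created elsewhere and that the argument order matches the partial-sum kernel sketched earlier:

#include <CL/cl.h>

void launch_partial_sums(cl_command_queue queue, cl_kernel kernel,
                         cl_mem in_buf, cl_mem out_buf, size_t n)
{
    size_t local_size  = 256;     // work-items per work-group
    size_t global_size = n;       // assumed to be a multiple of local_size

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
    /* Passing NULL with a size asks the runtime to carve out per-work-group
     * local memory: LDS on AMD, shared memory on Nvidia. */
    clSetKernelArg(kernel, 2, local_size * sizeof(cl_float), NULL);

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
}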
Cell Broadband Engine
• Developed by Sony, Toshiba, and IBM
• Transitioned from embedded platforms into HPC via the PlayStation 3
• OpenCL drivers are available for Cell BladeCenter servers
• Consists of a Power Processing Element (PPE) and multiple Synergistic Processing Elements (SPEs)
• Uses the IBM XL C for OpenCL compiler
[Figure: Cell block diagram – PowerPC PPE with L1 and L2 caches, SPEs each containing an SPU and a 256 KB local store (LS), ~200 GB/s element interconnect, 25 GB/s links, memory and interrupt controller]
Source: http://www.alphaworks.ibm.com/tech/opencl
Cell BE and OpenCL
• The Cell Power/VMX CPU is used as a CL_DEVICE_TYPE_CPU device
• The Cell SPUs form a CL_DEVICE_TYPE_ACCELERATOR device
  • The number of compute units on the SPU accelerator device is <= 16
  • Local memory size is <= 256 KB
  • The 256 KB of local storage is divided among the OpenCL kernel, an 8 KB global data cache, and local, constant, and private variables
• The OpenCL accelerator device and the OpenCL CPU device share a common memory bus
• Provides extensions like "Device Fission" and "Migrate Objects" to specify where an object resides (discussed in Lecture 10)
• No support for OpenCL images, sampler objects, atomics, or byte-addressable memory
Source: http://www.alphaworks.ibm.com/tech/opencl
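A hedged host-side sketch of how both Cell devices could be enumerated with standard OpenCL queries; the mapping of the PPE and SPUs to device types follows the slide above, and error handling is omitted for brevity:

#include <stdio.h>
#include <CL/cl.h>

void list_cell_devices(cl_platform_id platform)
{
    cl_device_id devices[4];
    cl_uint count = 0;

    // Ask for both the CPU device (PPE) and the accelerator device (SPUs)
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU | CL_DEVICE_TYPE_ACCELERATOR,
                   4, devices, &count);

    for (cl_uint i = 0; i < count; ++i) {
        char name[128];
        cl_uint cu = 0;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cu), &cu, NULL);
        printf("%s: %u compute units\n", name, cu);
    }
}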
An Optimal GPGPU Kernel
• From the discussion on hardware we see that an ideal kernel for a GPU:
  • Has thousands of independent pieces of work
    • Uses all available compute units
    • Allows interleaving for latency hiding
  • Is amenable to instruction stream sharing
    • Maps to SIMD execution by preventing divergence between work-items
  • Has high arithmetic intensity
    • The ratio of math operations to memory accesses is high
    • Not limited by memory bandwidth
• Note that these criteria apply to all GPUs
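To make "high arithmetic intensity" concrete, a hedged sketch contrasting a memory-bound kernel with a compute-heavy one; ITERS is an illustrative knob, not a value from the lecture. saxpy performs 2 flops per 12 bytes of traffic (~0.17 flops/byte) and is limited by memory bandwidth, while the iterated polynomial below does roughly 2*ITERS flops for the same traffic and is far more likely to keep the ALUs busy.

#define ITERS 64

__kernel void saxpy(__global const float *x, __global float *y, float a)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];                 // 2 flops, 12 bytes of traffic
}

__kernel void high_intensity(__global const float *x, __global float *y, float a)
{
    size_t i = get_global_id(0);
    float v = x[i];
    for (int k = 0; k < ITERS; ++k)         // many flops per byte loaded
        v = a * v + 0.5f;
    y[i] = v;
}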


Editor's Notes

  • #5 This point is important because it addresses a common question when teaching an open, platform-agnostic programming framework
  • #6 Basic computer architecture points for multicore CPUs
  • #7 High level 10,000 feet view of what a GPU looks like, irrespective of whether its AMD’s or Nvidia’s
  • #8 Very AMD specific discussion of low-level GPU architecture
  • #9 Very AMD-specific discussion of low-level GPU architecture. Discusses a single SIMD engine and the stream cores
  • #10 We converge back to the OpenCL terminology to understand how the AMD GPU maps onto the OpenCL processing elements
  • #11 A brief overview of the AMD GPU memory architecture (As per Evergreen Series)
  • #12 The mapping of the AMD GPU memory components to the OpenCL terminology. Architecturally this is similar for AMD and Nvidia, except that each one has its own vendor-specific names. Similar types of memory are mapped to local memory for both AMD and Nvidia
  • #13 Summary of the usage of constant memory. Important because there is a restricted set of cases where constant memory gives a hardware-provided performance boost. This will have greater context once a complete example is shown later. This slide is included in case someone is reading this while optimizing an application and needs device-specific details
  • #14 Nvidia “Fermi” Architecture, High level overview.
  • #15 Architectural highlights of an SM in a Fermi GPU. Mention the scalar nature of a CUDA core, unlike AMD's VLIW architecture
  • #16 The SIMT execution model of GPU threads. SIMD specifies a vector width, as in SSE. However, the SIMT execution model does not necessarily need to know the number of threads in a warp for an OpenCL program. The concept of a warp/wavefront is not part of OpenCL.
  • #17 The SIMT execution model, which shows how different threads execute the same instruction
  • #18 Nvidia-specific GPU memory architecture. The main highlight is the configurable L1 : shared memory size ratio. The L2 cache is not exposed in the OpenCL specification
  • #19 Similar to AMD in the sense that low latency memory which is the shared memory becomes OpenCL local memory
  • #20 Brief introduction to the Cell
  • #21 Brief overview of how the Cell's memory architecture maps to OpenCL. For usage of the Cell in specific applications, a high-level view is given, and Lecture 10 discusses its special extensions. The optimizations in Lectures 6-8 do not apply to the Cell because of its very different architecture
  • #22 Discusses an optimal kernel to show how, irrespective of the different underlying architectures, an optimal program for both AMD and Nvidia would have similar characteristics
  • #23 Explains how platform agnostic OpenCL code is mapped to a device specific Instruction Set Architecture.
  • #24 The ICD is added in order to explain how we can interface different OpenCL implementations with a similar compilation tool-chain