Introduction To GPUs 2012

A basic introduction to GPU architecture. Based on Kayvon's "From Shader Code to a Teraflop: How GPU Shader Cores Work".

Updated to include the latest GPUs: AMD Tahiti (HD7970) and NVIDIA Kepler (GTX690)


  1. 1. Introduction to GPU Compute Architecture. Ofer Rosenberg. Based on "From Shader Code to a Teraflop: How GPU Shader Cores Work" by Kayvon Fatahalian, Stanford University, and Mike Houston, Fellow, AMD.
  2. 2. Intro: some numbers…

                        Intel Ivy Bridge 4C (E3-1290V2)   AMD Radeon HD 7970   Ratio
     Cores              4                                 32
     Frequency          3700 MHz                          1000 MHz
     Process            22 nm                             28 nm
     Transistor count   1400 M                            4310 M               ~3x
     Power              87 W                              250 W                ~3x
     Compute power      237 GFLOPS                        4096 GFLOPS          ~17x

     Sources:
     http://www.anandtech.com/show/6025/radeon-hd-7970-ghz-edition-review-catching-up-to-gtx-680
     http://ark.intel.com/products/65722
     http://www.techpowerup.com/cpudb/1028/Intel_Xeon_E3-1290V2.html
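     A quick sanity check on the "Compute power" row (my own arithmetic, not from the slides), counting a multiply-add as two FLOPs and assuming full AVX issue on the CPU:

     CPU:  4 cores × 16 FLOPs/clock (8-wide AVX mul + 8-wide AVX add) × 3.7 GHz ≈ 237 GFLOPS
     GPU:  2048 ALUs × 2 FLOPs/clock (multiply-add) × 1.0 GHz = 4096 GFLOPS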
  3. 3. Content
     1. Three major ideas that make GPU processing cores run fast
     2. Closer look at real GPU designs
        – NVIDIA GTX 680
        – AMD Radeon 7970
     3. The GPU memory hierarchy: moving data to processors
     4. Heterogeneous cores
  4. 4. Part 1: throughput processing
     • Three key concepts behind how modern GPU processing cores run code
     • Knowing these concepts will help you:
       1. Understand the space of GPU core (and throughput CPU core) designs
       2. Optimize shaders/compute kernels
       3. Establish intuition: what workloads might benefit from the design of these architectures?
  5. 5. What's in a GPU? A GPU is a heterogeneous chip multi-processor (highly tuned for graphics). [Block diagram: several Shader Cores and Tex units alongside fixed-function blocks: Input Assembly, Rasterizer, Output Blend, Video Decode, and a Work Distributor (HW or SW?).]
  6. 6. A diffuse reflectance shader

     sampler mySamp;
     Texture2D<float3> myTex;
     float3 lightDir;

     float4 diffuseShader(float3 norm, float2 uv)
     {
       float3 kd;
       kd = myTex.Sample(mySamp, uv);
       kd *= clamp( dot(lightDir, norm), 0.0, 1.0);
       return float4(kd, 1.0);
     }

     Shader programming model: fragments are processed independently, but there is no explicit parallel programming.
  7. 7. Compile shader: 1 unshaded fragment input record in, 1 shaded fragment output record out. The HLSL source from the previous slide compiles to:

     <diffuseShader>:
     sample r0, v4, t0, s0
     mul  r3, v0, cb0[0]
     madd r3, v1, cb0[1], r3
     madd r3, v2, cb0[2], r3
     clmp r3, r3, l(0.0), l(1.0)
     mul  o0, r0, r3
     mul  o1, r1, r3
     mul  o2, r2, r3
     mov  o3, l(1.0)
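     For readers coming from GPU compute rather than graphics, a rough CUDA analogue of this per-fragment model (my own sketch, not from the deck): each thread runs the same scalar program on its own element, with no explicit parallelism written inside the kernel body. The kernel and names below (shadeKernel, nFragments) are illustrative only, and texturing is omitted to keep the sketch small.

     #include <cuda_runtime.h>

     // Each thread "shades" one element independently, the compute analogue of
     // one fragment running diffuseShader.
     __global__ void shadeKernel(const float3* normals, float3 lightDir,
                                 float* out, int nFragments)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i >= nFragments) return;                      // guard the tail

         float3 n = normals[i];
         float ndotl = n.x * lightDir.x + n.y * lightDir.y + n.z * lightDir.z;
         out[i] = fminf(fmaxf(ndotl, 0.0f), 1.0f);         // clamp(dot(L, N), 0, 1)
     }

     Launched as shadeKernel<<<(nFragments + 255) / 256, 256>>>(...); the parallelism comes entirely from how many threads the launch creates, not from anything inside the kernel.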
  8. 8. Execute shader (slides 8–14 step through <diffuseShader> one instruction at a time on a single simple core: a Fetch/Decode unit feeding one ALU (Execute), with one Execution Context holding the fragment's registers):

     <diffuseShader>:
     sample r0, v4, t0, s0
     mul  r3, v0, cb0[0]
     madd r3, v1, cb0[1], r3
     madd r3, v2, cb0[2], r3
     clmp r3, r3, l(0.0), l(1.0)
     mul  o0, r0, r3
     mul  o1, r1, r3
     mul  o2, r2, r3
     mov  o3, l(1.0)
  15. 15. "CPU-style" cores: Fetch/Decode, ALU (Execute), Execution Context, plus a big data cache, out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher.
  16. 16. Slimming down. Idea #1: Remove the components that help a single instruction stream run fast, keeping just Fetch/Decode, ALU (Execute), and the Execution Context.
  17. 17. Two cores (two fragments in parallel): fragment 1 and fragment 2 each run their own copy of <diffuseShader> on their own slimmed-down core (Fetch/Decode, ALU (Execute), Execution Context).
  18. 18. Four cores (four fragments in parallel): four copies of the simple core.
  19. 19. Sixteen cores (sixteen fragments in parallel) 16 cores = 16 simultaneous instruction streams
  20. 20. Instruction stream sharing. But... many fragments should be able to share an instruction stream! (They all run the same <diffuseShader> program shown on slide 7.)
  21. 21. Recall: simple processing core: Fetch/Decode, ALU (Execute), Execution Context.
  22. 22. Add ALUs. Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs: SIMD processing. [One Fetch/Decode unit now feeds ALU 1–8, each with its own context (Ctx), plus Shared Ctx Data.]
  23. 23. Modifying the shader. Original compiled shader: processes one fragment using scalar ops on scalar registers (the <diffuseShader> listing from slide 7, running on the 8-ALU core above).
  24. 24. Modifying the shader. New compiled shader: processes eight fragments using vector ops on vector registers.

     <VEC8_diffuseShader>:
     VEC8_sample vec_r0, vec_v4, t0, vec_s0
     VEC8_mul  vec_r3, vec_v0, cb0[0]
     VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
     VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
     VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
     VEC8_mul  vec_o0, vec_r0, vec_r3
     VEC8_mul  vec_o1, vec_r1, vec_r3
     VEC8_mul  vec_o2, vec_r2, vec_r3
     VEC8_mov  o3, l(1.0)
  25. 25. Modifying the shader: the same <VEC8_diffuseShader>, now shown driving fragments 1–8 in lockstep across ALU 1–8.
  26. 26. 128 fragments in parallel: 16 cores = 128 ALUs, 16 simultaneous instruction streams.
  27. 27. 128 [vertices / fragments / primitives / OpenCL work items] in parallel: the same machinery processes vertices, primitives, and fragments.
  28. 28. But what about branches? (Slides 28–31 walk the eight ALUs, in lockstep over time, through a shader containing a branch.)

     <unconditional shader code>
     if (x > 0) {
       y = pow(x, exp);
       y *= Ks;
       refl = y + Ka;
     } else {
       x = 0;
       refl = Ka;
     }
     <resume unconditional shader code>

     The eight lanes evaluate the condition independently (e.g. T T F T F F F F), so the group runs both sides of the branch and the ALUs whose lanes took the other path are masked off. Not all ALUs do useful work! Worst case: 1/8 peak performance.
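     A compute-flavoured illustration of the same effect (my sketch, not from the deck; names like divergent and the constants are made up): in CUDA, threads in a warp that take different sides of a branch are serialized, so a data-dependent branch can cost both paths.

     // Threads whose data[i] is positive take the "expensive" path, the rest take
     // the cheap one. If both kinds of values land in the same 32-thread warp,
     // the warp executes both paths with lanes masked off, just like the 8-wide
     // example above (worst case here: roughly 1/warp-width of peak).
     __global__ void divergent(const float* data, float* out, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i >= n) return;

         float x = data[i];
         float refl;
         if (x > 0.0f) {
             float y = powf(x, 8.0f) * 0.5f;   // stand-ins for exp and Ks
             refl = y + 0.1f;                  // stand-in for Ka
         } else {
             refl = 0.1f;
         }
         out[i] = refl;
     }

     Grouping work so that threads in the same warp tend to take the same path is the usual mitigation.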
  32. 32. Clarification: SIMD processing does not imply SIMD instructions.
     • Option 1: explicit vector instructions – x86 SSE, AVX, Intel Larrabee
     • Option 2: scalar instructions, implicit HW vectorization
       – HW determines instruction stream sharing across ALUs (amount of sharing hidden from software)
       – NVIDIA GeForce ("SIMT" warps), ATI Radeon architectures ("wavefronts")
     In practice: 16 to 64 fragments share an instruction stream.
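     To make Option 2 concrete (my own sketch): a CUDA program is written purely in scalar form, and the width over which the hardware shares one instruction stream is a device property you can query but never spell out in the kernel.

     #include <cstdio>
     #include <cuda_runtime.h>

     int main()
     {
         cudaDeviceProp prop;
         cudaGetDeviceProperties(&prop, 0);
         // warpSize is 32 on NVIDIA GPUs; AMD GCN wavefronts are 64 wide.
         // Kernels stay scalar either way; the grouping is the hardware's business.
         printf("threads sharing one instruction stream: %d\n", prop.warpSize);
         return 0;
     }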
  33. 33. Stalls! Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation. Texture access latency = 100s to 1000s of cycles. We've removed the fancy caches and logic that help avoid stalls.
  34. 34. But we have LOTS of independent fragments. Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations.
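     In CUDA terms (again my own gloss, not the deck's), Idea #3 is why you launch far more threads than there are ALUs: the hardware swaps in another resident group whenever the current one stalls. A minimal host-side sketch, with the kernel and all sizes illustrative:

     #include <cuda_runtime.h>

     // Trivial memory-bound kernel: each thread loads one element and stores it scaled.
     __global__ void scale(const float* in, float* out, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n) out[i] = 2.0f * in[i];
     }

     void launchWithPlentyOfWork(const float* d_in, float* d_out, int n)
     {
         // One thread per element. For large n this puts many 32-thread groups on
         // every core; when one group stalls on its load, the hardware switches to
         // another runnable group, which is Idea #3 in practice.
         int threads = 256;
         int blocks  = (n + threads - 1) / threads;
         scale<<<blocks, threads>>>(d_in, d_out, n);
     }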
  35. 35. Hiding shader stalls (slides 35–38): time runs down the page while the core juggles four groups of fragments (Frag 1…8, 9…16, 17…24, 25…32), each with its own set of contexts. When the running group stalls on a long-latency operation, the core switches to another runnable group instead of idling.
  39. 39. Throughput! Each group takes longer from start to done (it waits while the others run), but the core always has a runnable group: increase the run time of one group to increase the throughput of many groups.
  40. 40. Storing contexts: Fetch/Decode, ALU 1–8, and a pool of context storage (256 KB).
  41. 41. Eighteen small contexts (maximal latency hiding). [Same core: Fetch/Decode, ALU 1–8.]
  42. 42. Twelve medium contexts. [Same core.]
  43. 43. Four large contexts (low latency hiding ability). [Same core, contexts 1–4.]
  44. 44. Clarification: interleaving between contexts can be managed by hardware or software (or both!)
     • NVIDIA / AMD Radeon GPUs
       – HW schedules / manages all contexts (lots of them)
       – Special on-chip storage holds fragment state
     • Intel Larrabee
       – HW manages four x86 (big) contexts at fine granularity
       – SW scheduling interleaves many groups of fragments on each HW context
       – L1-L2 cache holds fragment state (as determined by SW)
  45. 45. Example chip:
     • 16 cores
     • 8 mul-add ALUs per core (128 total)
     • 16 simultaneous instruction streams
     • 64 concurrent (but interleaved) instruction streams
     • 512 concurrent fragments
     = 256 GFLOPS (@ 1 GHz)
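     The peak number follows directly (my arithmetic, counting a mul-add as two FLOPs):

     128 ALUs × 2 FLOPs/clock (mul-add) × 1 GHz = 256 GFLOPS
     64 interleaved instruction streams × 8 fragments each = 512 concurrent fragments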
  46. 46. Summary: three key ideas
     1. Use many "slimmed down cores" to run in parallel
     2. Pack cores full of ALUs (by sharing the instruction stream across groups of fragments)
        – Option 1: explicit SIMD vector instructions
        – Option 2: implicit sharing managed by hardware
     3. Avoid latency stalls by interleaving execution of many groups of fragments
        – When one group stalls, work on another group
  47. 47. Part 2: Putting the three ideas into practice: a closer look at real GPUs. NVIDIA GeForce GTX 680 and AMD Radeon HD 7970.
  48. 48. Disclaimer
     • The following slides describe "a reasonable way to think" about the architecture of commercial GPUs
     • Many factors play a role in actual chip performance
  49. 49. NVIDIA GeForce GTX 680 (Kepler)
     • NVIDIA-speak:
       – 1536 stream processors ("CUDA cores")
       – "SIMT execution"
     • Generic speak:
       – 8 cores
       – 6 groups of 32-wide SIMD functional units per core
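     Sanity check on the two descriptions (my arithmetic): the generic view multiplies out to NVIDIA's marketing count.

     8 cores × 6 SIMD groups/core × 32 lanes/group = 1536 "CUDA cores"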
  50. 50. NVIDIA GeForce GTX 680 "core" [diagram: Fetch/Decode; SIMD function units, control shared across 32 units (1 MUL-ADD per clock); execution contexts (256 KB); "shared" memory (16+48 KB)]
     • Groups of 32 [fragments / vertices / CUDA threads] share an instruction stream
     • Up to 64 groups are simultaneously interleaved
     • Up to 2048 individual contexts can be stored
     Sources: http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
              http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
  51. 51. NVIDIA GeForce GTX 680 "core" [diagram: six Fetch/Decode units; SIMD function units, control shared across 32 units (1 MUL-ADD per clock); execution contexts (128 KB); "shared" memory (16+48 KB)]
     • The core contains 192 functional units
     • Six groups are selected each clock (decode, fetch, and execute six instruction streams in parallel)
     Sources: same as slide 50.
  52. 52. NVIDIA GeForce GTX 680 "SMX" [diagram: six Fetch/Decode units; CUDA cores (1 MUL-ADD per clock); execution contexts (128 KB); "shared" memory (16+48 KB)]
     • The SMX contains 192 CUDA cores
     • Six warps are selected each clock (decode, fetch, and execute six warps in parallel)
     • Up to 64 warps are interleaved, totaling 2048 CUDA threads
     Sources: same as slide 50.
  53. 53. NVIDIA GeForce GTX 680 There are 8 of these things on the GTX 680: That’s 16,384 fragments! Or 16,384 CUDA threads!
  54. 54. AMD Radeon HD 7970 (Tahiti)
     • AMD-speak:
       – 2048 stream processors
       – "GCN": Graphics Core Next
     • Generic speak:
       – 32 cores
       – 4 groups of 64-wide SIMD functional units per core
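     Sanity check (my arithmetic, using the functional-unit count from slide 56, where each 64-wide group takes four clocks on the core's hardware):

     32 cores × 64 functional units/core = 2048 stream processors
     2048 × 2 FLOPs/clock (mul-add) × ~1 GHz ≈ 4.1 TFLOPS (the Compute Power row from slide 2)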
  55. 55. AMD Radeon HD 7970 "core" [diagram: Fetch/Decode; SIMD function units, control shared across 64 units (1 MUL-ADD per clock); four execution contexts of 64 KB each; "shared" memory (64 KB)]
     • Groups of 64 [fragments / vertices / OpenCL work items] share an instruction stream
     • Up to 40 groups are simultaneously interleaved
     • Up to 2560 individual contexts can be stored
     Sources: http://developer.amd.com/afds/assets/presentations/2620_final.pdf
              http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute
  56. 56. AMD Radeon HD 7970 "core" [diagram: four Fetch/Decode units; SIMD function units, control shared across 64 units (1 MUL-ADD per clock); four execution contexts of 64 KB each; "shared" memory (64 KB)]
     • The core contains 64 functional units
     • Four clocks to execute an instruction for all fragments in a group
     • Four groups are executed in parallel each clock (decode, fetch, and execute four instruction streams in parallel)
     Sources: same as slide 55.
  57. 57. AMD Radeon HD 7970 "Compute Unit" [same diagram as slide 56]
     • Groups of 64 [fragments / vertices / etc.] are a wavefront
     • Four wavefronts are executed each clock
     • Up to 40 wavefronts are interleaved, totaling 2560 threads
     Sources: same as slide 55.
  58. 58. AMD Radeon HD 7970There are 32 of these things on the HD 7970: That’s 81,920 fragments!
  59. 59. The talk thus far: processing data. Part 3: moving data to processors.
  60. 60. Recall: "CPU-style" core: OOO exec logic, branch predictor, Fetch/Decode, ALU, execution context, and a data cache (a big one).
  61. 61. "CPU-style" memory hierarchy: each core has OOO exec logic, a branch predictor, Fetch/Decode, ALU, execution contexts, an L1 cache (32 KB), and an L2 cache (256 KB); an L3 cache (8 MB) is shared across cores, with roughly 25 GB/sec to memory. CPU cores run efficiently when data is resident in cache (caches reduce latency and provide high bandwidth).
  62. 62. Throughput core (GPU-style): Fetch/Decode, ALU 1–8, execution contexts (256 KB), and a 280 GB/sec path to memory. More ALUs, no large traditional cache hierarchy: need a high-bandwidth connection to memory.
  63. 63. Bandwidth is a critical resource
     – A high-end GPU (e.g. Radeon HD 7970) has...
       • Over twenty times (4.1 TFLOPS) the compute performance of a quad-core CPU
       • No large cache hierarchy to absorb memory requests
     – The GPU memory system is designed for throughput
       • Wide bus (280 GB/sec)
       • Repack/reorder/interleave memory requests to maximize use of the memory bus
       • Still, this is only ten times the bandwidth available to the CPU
  64. 64. Bandwidth thought experiment. Task: element-wise multiply-add over long vectors, D = A × B + C.
     1. Load input A[i]
     2. Load input B[i]
     3. Load input C[i]
     4. Compute A[i] × B[i] + C[i]
     5. Store result into D[i]
     Four memory operations (16 bytes) for every MUL-ADD. The Radeon HD 7970 can do 2048 MUL-ADDs per clock, so it needs ~32 TB/sec of bandwidth to keep its functional units busy. Less than 1% efficiency… but still 6x faster than a CPU!
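     As a compute kernel this thought experiment is a one-liner (my sketch, hypothetical names): the arithmetic intensity, one mul-add per 16 bytes moved, is what makes it bandwidth bound.

     // D[i] = A[i] * B[i] + C[i]: three 4-byte loads and one 4-byte store per
     // fused multiply-add, i.e. 16 bytes of traffic per 2 FLOPs. At 280 GB/sec
     // that caps the kernel at ~35 GFLOPS, far below the ~4 TFLOPS ALU peak.
     __global__ void muladd(const float* A, const float* B, const float* C,
                            float* D, int n)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i < n)
             D[i] = fmaf(A[i], B[i], C[i]);
     }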
  65. 65. Bandwidth limited! If processors request data at too high a rate, the memory system cannot keep up. No amount of latency hiding helps this. Overcoming bandwidth limits is a common challenge for GPU-compute application developers.
  66. 66. Reducing bandwidth requirements
     • Request data less often (instead, do more math): "arithmetic intensity"
     • Fetch data from memory less often (share/reuse data across fragments): on-chip communication or storage
  67. 67. Reducing bandwidth requirements
     • Two examples of on-chip storage
       – Texture caches
       – OpenCL "local memory" (CUDA shared memory)
     Texture caches capture reuse across fragments, not temporal reuse within a single shader program.
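     A small sketch of the second kind of on-chip storage (mine, not from the deck; kernel name and block size are illustrative): neighbouring work items stage a tile of the input in CUDA shared memory so each element is read from DRAM once per block rather than once per use.

     #define TILE 256

     // 3-point weighted average: each output needs in[i-1], in[i], in[i+1].
     // Without the shared tile, every input element would be fetched ~3 times.
     __global__ void smooth3(const float* in, float* out, int n)
     {
         __shared__ float tile[TILE + 2];              // tile plus 1-element halo each side
         int gid = blockIdx.x * blockDim.x + threadIdx.x;
         int lid = threadIdx.x + 1;

         tile[lid] = (gid < n) ? in[gid] : 0.0f;
         if (threadIdx.x == 0)
             tile[0] = (gid > 0 && gid - 1 < n) ? in[gid - 1] : 0.0f;
         if (threadIdx.x == blockDim.x - 1)
             tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
         __syncthreads();                              // whole tile is now on chip

         if (gid < n)
             out[gid] = 0.25f * tile[lid - 1] + 0.5f * tile[lid] + 0.25f * tile[lid + 1];
     }

     Launched with blockDim.x == TILE, e.g. smooth3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n).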
  68. 68. Modern GPU memory hierarchy [per core: Fetch/Decode, ALU 1–8, execution contexts (256 KB), a read-only texture cache, and 64 KB of shared "local" storage or L1 cache; an L2 cache (~1 MB) sits in front of memory]. On-chip storage takes load off the memory system. Many developers are calling for more cache-like storage (particularly for GPU-compute applications).
  69. 69. Don't forget about offload cost…
     • PCIe bandwidth/latency
       – 8 GB/s each direction in practice
       – Attempt to pipeline/multi-buffer uploads and downloads
     • Dispatch latency
       – O(10) usec to dispatch from CPU to GPU
       – This means offload cost is O(10M) instructions
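     One way to act on the "pipeline/multi-buffer" advice (a hedged sketch, reusing the muladd kernel above; all names and sizes are illustrative): split the work into chunks and use CUDA streams so the PCIe copy of one chunk overlaps the kernel for the previous one.

     #include <cuda_runtime.h>

     extern __global__ void muladd(const float*, const float*, const float*, float*, int);

     // Double-buffer chunks of work across two streams so PCIe transfers of one
     // chunk overlap the kernel of the other. Host buffers are assumed pinned
     // (allocated with cudaMallocHost) so the async copies are truly asynchronous.
     void run(const float* h_A, const float* h_B, const float* h_C, float* h_D,
              float* d_A, float* d_B, float* d_C, float* d_D, int n)
     {
         const int chunk = 1 << 20;                    // 1M elements per piece
         cudaStream_t s[2];
         cudaStreamCreate(&s[0]);
         cudaStreamCreate(&s[1]);

         for (int off = 0, k = 0; off < n; off += chunk, ++k) {
             int m = (n - off < chunk) ? (n - off) : chunk;
             size_t bytes = m * sizeof(float);
             cudaStream_t st = s[k & 1];

             cudaMemcpyAsync(d_A + off, h_A + off, bytes, cudaMemcpyHostToDevice, st);
             cudaMemcpyAsync(d_B + off, h_B + off, bytes, cudaMemcpyHostToDevice, st);
             cudaMemcpyAsync(d_C + off, h_C + off, bytes, cudaMemcpyHostToDevice, st);
             muladd<<<(m + 255) / 256, 256, 0, st>>>(d_A + off, d_B + off, d_C + off, d_D + off, m);
             cudaMemcpyAsync(h_D + off, d_D + off, bytes, cudaMemcpyDeviceToHost, st);
         }
         cudaDeviceSynchronize();
         cudaStreamDestroy(s[0]);
         cudaStreamDestroy(s[1]);
     }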
  70. 70. Heterogeneous devices to the rescue?
     • Tighter integration of CPU- and GPU-style cores
       – Reduce offload cost
       – Reduce memory copies/transfers
       – Power management
     • Industry shifting rapidly in this direction
       – AMD Fusion APUs
       – NVIDIA Tegra 3
       – Intel SNB/IVY
       – Apple A4 and A5
       – QUALCOMM Snapdragon
       – TI OMAP
       – …
  71. 71. AMD & Intel Heterogeneous Devices
  72. 72. Others – compute is on its way? … NVIDIA Tegra 3, Qualcomm Snapdragon S4, Apple A5.
