Introduction to Multicore Architecture
Tao Zhang, Oct. 21, 2010

  1. Introduction to Multicore Architecture. Tao Zhang, Oct. 21, 2010
  2. Overview. Part 1: General multicore architecture. Part 2: GPU architecture.
  3. Part 1: General Multicore Architecture
  4. Uniprocessor Performance (SPECint)
     [Figure: SPECint performance (relative to the VAX-11/780) vs. year, 1978 to 2006. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. Since 2002, actual performance sits roughly 3X below the 52%/year trend line.]
     • VAX: 25%/year, 1978 to 1986
     • RISC + x86: 52%/year, 1986 to 2002
     • RISC + x86: ??%/year, 2002 to present
     ⇒ Sea change in chip design: multiple "cores" or processors per chip
  5. Conventional Wisdom (CW) in Computer Architecture
     • Old CW: Chips reliable internally, errors at pins. New CW: ≤65 nm ⇒ high soft and hard error rates.
     • Old CW: Demonstrate new ideas by building chips. New CW: Mask costs, ECAD costs, and GHz clock rates ⇒ researchers can't build believable prototypes.
     • Old CW: Innovate via compiler optimizations + architecture. New CW: It takes more than 10 years before a new optimization at a leading conference gets into production compilers.
     • Old CW: Hardware is hard to change, software is flexible. New CW: Hardware is flexible, software is hard to change.
  6. Conventional Wisdom (CW) in Computer Architecture
     • Old CW: Power is free, transistors expensive. New CW: "Power wall": power expensive, transistors free (can put more on a chip than we can afford to turn on).
     • Old CW: Multiplies are slow, memory access is fast. New CW: "Memory wall": memory slow, multiplies fast (200 clocks to DRAM, 4 clocks for an FP multiply).
     • Old CW: Increase instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...). New CW: "ILP wall": diminishing returns on more ILP.
     • New CW: Power wall + memory wall + ILP wall = brick wall.
     • Old CW: Uniprocessor performance 2X / 1.5 years. New CW: Uniprocessor performance only 2X / 5 years?
  7. The Memory Wall
     • On-die caches are both area intensive and power intensive: the StrongARM dissipates more than 43% of its power in caches, and caches incur huge area costs.
     The Power Wall
     \( P = C V_{dd}^{2} f + V_{dd} I_{st} + V_{dd} I_{leak} \)
     • Power per transistor scales with frequency but also scales with Vdd.
       – Lower Vdd can be compensated for with increased pipelining to keep throughput constant.
       – Power per transistor is not the same as power per area: power density is the problem!
       – Multiple units can be run at lower frequencies to keep throughput constant, while saving power.
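A quick worked example of the dynamic term of that equation (illustrative voltages and frequencies, not figures from the slides): replacing one core with two slower, lower-voltage cores can keep aggregate throughput while cutting power.

```latex
% Dynamic power: P_dyn = C * Vdd^2 * f
% One core at 1.0 V and 3 GHz versus two cores at 0.8 V and 1.5 GHz each
% (all numbers assumed for illustration):
\[
  \frac{P_{\text{2 cores}}}{P_{\text{1 core}}}
  = \frac{2 \cdot C \cdot (0.8\,\mathrm{V})^{2} \cdot 1.5\,\mathrm{GHz}}
         {C \cdot (1.0\,\mathrm{V})^{2} \cdot 3.0\,\mathrm{GHz}}
  \approx 0.64
\]
% Same total clock throughput (2 x 1.5 GHz = 3 GHz) at roughly 64% of the
% dynamic power: this is the "more cores at lower V and f" argument.
```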
  8. The Current Power Trend
     [Figure: power density (W/cm²) vs. year, 1970 to 2010, for Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Intel Corp.]
     Improving Power/Performance
     \( P = C V_{dd}^{2} f + V_{dd} I_{st} + V_{dd} I_{leak} \)
     • Consider a constant die size and a decreasing core area each generation = more cores per chip.
       – Lowering voltage and frequency ⇒ power reduction.
       – Increasing cores per chip ⇒ performance increase.
       – Better power/performance!
  9. The Memory Wall
     [Figure: processor vs. DRAM performance over time. µProc performance ("Moore's Law") grows 60%/year while DRAM performance grows 7%/year, so the processor-memory performance gap grows about 50% per year and average memory access time keeps rising.]
     • Increasing the number of cores increases the demanded memory bandwidth.
     • What architectural techniques can meet this demand?
  10. The ILP Wall
     • Limiting phenomena for ILP extraction:
       – Clock rate: at the wall, each increase in clock rate has a corresponding CPI increase (branches, other hazards).
       – Instruction fetch and decode: at the wall, more instructions cannot be fetched and decoded per clock cycle.
       – Cache hit rate: poor locality can limit ILP, and it adversely affects memory bandwidth.
       – ILP in applications: the serial fraction of applications.
     • Reality:
       – Limit studies cap IPC at 100-400 (using an ideal processor).
       – Current processors achieve an IPC of only 2-8 per thread?
     The ILP Wall: Options
     • Increase the granularity of parallelism:
       – Simultaneous multithreading to exploit TLP (the TLP has to exist, otherwise poor utilization results).
       – Coarse-grain multithreading.
       – Throughput computing.
     • New languages/applications:
       – Data-intensive computing in the enterprise.
       – Media-rich applications.
  11. Part 2: GPU Architecture
  12. GPU Evolution: Hardware
     • 1995: NV1, 1 million transistors
     • 1999: GeForce 256, 22 million transistors
     • 2002: GeForce4, 63 million transistors
     • 2003: GeForce FX, 130 million transistors
     • 2004: GeForce 6, 222 million transistors
     • 2005: GeForce 7, 302 million transistors
     • 2006-2007: GeForce 8, 754 million transistors
     • 2008: GeForce GTX 200, 1.4 billion transistors
  13. GPU Architectures: Past/Present/Future
     • 1995: Z-buffered triangles
     • Riva 128, 1998: textured triangles
     • NV10, 1999: fixed-function transformed and shaded triangles
     • NV20, 2001: FFX triangles with combiners at pixels
     • NV30, 2002: programmable vertex and pixel shaders (!)
     • NV50, 2006: unified shaders, CUDA
     • Future???: global illumination, physics, ray tracing, AI. Extrapolate the trajectory: trajectory == extension + unification.
  14. [Figure: the same scene rendered three ways: no lighting, per-vertex lighting, and per-pixel lighting. Unreal © Epic]
  15. The Classic Graphics Hardware
     • Pipeline: Vertex Shader (programmable) → Triangle Setup (fixed) → Fragment Shader (programmable) → Fragment Blender (configurable) → Framebuffer(s).
     • The vertex shader transforms, projects, and lights vertices; triangle setup combines vertices into triangles and converts them to fragments (with Z-cull); the fragment shader shades fragments using texture maps; the fragment blender alpha-blends results into the framebuffer.
  16. Modern Graphics Hardware
     • Pipelining: number of stages.
     • Parallelism: number of parallel processes.
     • Parallelism + pipelining: number of parallel pipelines.
  17. Modern GPUs: Unified Design
     • Vertex shaders, pixel shaders, etc. become threads running different programs on a flexible core.
  18. Why unify?
     • With separate vertex and pixel shader hardware, one side sits idle whenever the workload is unbalanced:
       – Heavy geometry workload: the vertex shader is saturated while pixel hardware idles. Perf = 4.
       – Heavy pixel workload: the pixel shader is saturated while vertex hardware idles. Perf = 8.
  19. Why unify?
     • With a unified shader, the same hardware covers both cases:
       – Heavy geometry workload: vertex work fills most of the unified shader, pixel work takes the rest. Perf = 11.
       – Heavy pixel workload: pixel work fills most of the unified shader, vertex work takes the rest. Perf = 11.
  20. GeForce 8: Modern GPU Architecture
     [Block diagram: the Host feeds an Input Assembler and a Setup & Rasterize stage; vertex, geometry, and pixel thread issue units feed a thread processor that schedules groups of streaming processors (SP) with texture fetch units (TF) and L1 caches, backed by L2 caches and multiple framebuffer partitions.]
  21. Hardware Implementation: A Set of SIMD Multiprocessors
     • The device is a set of multiprocessors.
     • Each multiprocessor is a set of 32-bit processors with a Single Instruction, Multiple Data architecture: at each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp.
     • The number of threads in a warp is the warp size.
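To make the warp vocabulary concrete, here is a minimal CUDA sketch (an illustrative example of mine, not code from the deck) that reports how threads group into warps; `warpSize` is the CUDA built-in corresponding to the warp size named on the slide.

```cuda
#include <cstdio>

// Each thread computes its global index, its warp index within the block,
// and its lane (position inside the warp).
__global__ void whoAmI() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // warpSize is 32 on NVIDIA GPUs
    int lane = threadIdx.x % warpSize;
    if (lane == 0)                       // one report per warp
        printf("block %d, warp %d begins at global thread %d\n",
               blockIdx.x, warp, tid);
}

int main() {
    whoAmI<<<2, 64>>>();         // 2 blocks x 64 threads = 2 warps per block
    cudaDeviceSynchronize();     // wait for (and flush) device-side printf
    return 0;
}
```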
  22. Goal: Performance per Millimeter
     • For GPUs, performance == throughput.
     • Strategy: hide latency with computation, not caches ⇒ heavy multithreading!
     • Implication: need many threads to hide latency (see the launch sketch below).
       – Occupancy: typically prefer 128 or more threads per TPA.
       – Multiple thread blocks per TPA help minimize the effect of barriers.
     • Strategy: Single Instruction, Multiple Thread (SIMT).
       – Supports the SPMD programming model.
       – Balances performance with ease of programming.
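The launch sketch referenced above: a hedged illustration (a generic SAXPY kernel, not from the presentation) of launching enough threads and blocks that memory latency gets hidden by other resident warps.

```cuda
#include <cuda_runtime.h>

// Memory-bound kernel: while one warp waits on DRAM, the multiprocessor
// runs other resident warps, which is how the latency gets hidden.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

void launchSaxpy(int n, float a, const float *x, float *y) {
    // 256 threads per block follows the slide's "128 or more" guidance;
    // the grid provides enough blocks for several to be resident per
    // multiprocessor on any reasonable problem size.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, x, y);
}
```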
  23. SIMT Thread Execution
     • High-level description of SIMT:
       – Launch zillions of threads.
       – When they do the same thing, the hardware makes them go fast.
       – When they do different things, the hardware handles it gracefully.
  24. SIMT Thread Execution
     • Groups of 32 threads are formed into warps:
       – always executing the same instruction;
       – some become inactive when code paths diverge;
       – the hardware automatically handles divergence (see the divergence sketch below).
     • Warps are the primitive unit of scheduling:
       – pick 1 of 32 warps for each instruction slot;
       – note that warps may be running different programs/shaders!
     • SIMT execution is an implementation choice:
       – sharing control logic leaves more space for ALUs;
       – largely invisible to the programmer;
       – must be understood for performance, not correctness.
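The divergence sketch referenced above, using assumed kernels of my own: when lanes of one warp take different branch directions, the warp executes both paths with inactive lanes masked off, which costs performance but never correctness.

```cuda
// Divergent: even and odd lanes of the same warp take different branches,
// so the warp serializes the two paths.
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) out[i] = in[i] * 2.0f;
    else            out[i] = in[i] + 1.0f;
}

// Equivalent without control-flow divergence: every lane runs the same
// instructions and the final choice typically compiles to a predicated select.
__global__ void uniform(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float doubled = in[i] * 2.0f;
    float bumped  = in[i] + 1.0f;
    out[i] = (i % 2 == 0) ? doubled : bumped;
}
```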
  25. GPU Architecture: Trends
     • Long history of ever-increasing programmability, culminating today in CUDA: program the GPU directly in C (a minimal example follows this slide).
     • The graphics pipeline and its APIs are abstractions; CUDA + graphics enable "replumbing" the pipeline.
     • Future: continue adding expressiveness and flexibility (CUDA, OpenCL, DX11 Compute Shader, ...), further lowering the barrier between compute and graphics.
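The minimal example promised in the first bullet: a self-contained, textbook-style CUDA C program (a generic vector add, not code from this deck) showing what "program the GPU directly in C" looks like.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers.
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device buffers and host-to-device copies.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);  // implicit sync
    printf("c[0] = %f\n", hc[0]);   // expect 3.000000

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```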
  26. CPU/GPU Parallelism
     • Moore's Law gives you more and more transistors. What do you want to do with them?
     • CPU strategy: make the workload (one compute thread) run as fast as possible.
       – Tactics: caches (area limiting), instruction/data prefetch, speculative execution. Limited by "perimeter": communication bandwidth.
       – ...then add task parallelism: multi-core.
     • GPU strategy: make the workload (as many threads as possible) run as fast as possible.
       – Tactics: parallelism (1000s of threads), pipelining. Limited by "area": compute capability.
  27. GPU Architecture
     • Massively parallel: 1000s of processors (today).
     • Power efficient: fixed-function hardware is area and power efficient; no speculation, so more processing and less leaky cache.
     • Latency tolerant from day 1.
     • Memory bandwidth: saturates 512 bits of exotic DRAMs all day long (140 GB/sec today), with no end in sight for effective memory bandwidth.
     • Commercially viable parallelism: the largest installed base of massively parallel (N > 4) processors, using CUDA, and not just for graphics.
     • Not dependent on large caches for performance.
     • Computing power = frequency × transistors: Moore's law squared.
  28. GPU Architecture: Summary
     • From fixed function to configurable to programmable: the architecture now centers on a flexible processor core.
     • Goal: performance per mm² (perf == throughput); the architecture uses heavy multithreading.
     • Goal: balance performance with ease of use; SIMT is hardware-managed parallel thread execution.
