GPU and The Brick Wall

GPU Programming

Speaker Notes

  • NVIDIA planned to put 512 PEs into a single GPU, but the GTX480 turned out to have 480 PEs.
  • A GPU can achieve 10x the performance of a CPU.
  • Notice that third place is the PowerXCell. Rmax is the measured Linpack benchmark performance; Rpeak is the theoretical peak performance of the machine.
  • This gap is narrowed by multi-core CPUs.
  • Comparing raw performance is less interesting.
  • The area breakdown is an approximation, but it is good enough to see the trend.
  • The size of the L3 cache in high-end and low-end CPUs is quite different.
  • This breakdown is also an approximation.
  • Numbers are based on Intel Nehalem at 45nm and the presentation of Bill Dally.
  • More registers are required to store the contexts of threads.
  • Memory latency is hidden by multi-threading. The Cell uses a relatively static approach: the overlap of computation and DMA transfer is explicitly specified by the programmer.
  • Fine-grained multi-threading can keep the PEs busy even when the program has little ILP.
  • The cache can still help.
  • The address assignment and translation are done dynamically by hardware.
  • The vector core should be larger than the scalar core.
  • From scalar to vector.
  • From vector to threads.
  • Warps can be grouped at run time by hardware, in which case the grouping is transparent to the programmer.
  • The NVIDIA Fermi PE can do int and fp.
  • We have ignored some architectural features of Fermi.  Noticeably the interconnection network is not discussed here. 
  • These features are summarized by the paper of Michael Garland and David Kirk.
  • The vector program uses SSE as an example. However, "incps" is not a real SSE instruction; it is used here to represent incrementing the vector.
  • Each thread uses its ID to locate its working data set.
  • The scheduler tries to maintain load balancing among SMs.
  • Numbers are taken from an old paper on the G80 architecture, but they should be similar for the GF100 architecture.
  • The old architecture has 16 banks.
  • It is a trend to use threads to hide the vector width. OpenCL applies the same programming model.
  • It is arguable whether working on threads is more productive.
  • This example assumes the two warp schedulers are decoupled. It is possible that they are coupled together, at the cost of hardware complexity.
  • Assume the register file has one read port. The register file may need two read ports to support instructions with three source operands, e.g. Fused Multiply-Add (FMA).
  • 5-issue VLIW.
  • The atomic unit is helpful for voting operations, e.g. histograms (a small sketch follows these notes).
  • The figure is taken from the 8800 GPU. See the paper of Samuel Williams for more detail.
  • The number was obtained on the 8800 GPU.
  • The latency hiding is addressed in the PhD thesis of Samuel Williams.
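
On the atomic-unit note above: a minimal histogram sketch using CUDA's atomicAdd. The kernel name and the 256-bin layout are assumptions for illustration, not from the slides.

    // Each thread reads one byte and votes for its bin; atomicAdd serializes
    // only the threads that hit the same bin.
    __global__ void histogram256(const unsigned char *input, unsigned int *bins, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            atomicAdd(&bins[input[tid]], 1u);
    }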

GPU and The Brick Wall: Presentation Transcript

  • Graphics Processing Unit (GPU) Architecture and Programming TU/e 5kk73 Zhenyu Ye Bart Mesman Henk Corporaal 2010-11-08
  • Today's Topics
      • GPU architecture
      • GPU programming
      • GPU micro-architecture
      • Performance optimization and model
      • Trends
  • System Architecture
  • GPU Architecture NVIDIA Fermi, 512 Processing Elements (PEs)
  • What Can It Do? Render triangles. NVIDIA GTX480 can render 1.6 billion triangles per second!
  • General Purpose Computing ref:  http://www.nvidia.com/object/tesla_computing_solutions.html
  • The Vision of NVIDIA
    • "Within the next few years, there will be single-chip graphics devices more powerful and versatile than any graphics system that has ever been built, at any price." 
    • -- David Kirk, NVIDIA,  1998
  • Single-Chip GPU vs. Fastest Super Computers ref:  http://www.llnl.gov/str/JanFeb05/Seager.html
  • Top500 Super Computer in June 2010
  • GPU Will Top the List in Nov 2010
  • The Gap Between CPU and GPU ref: Tesla GPU Computing Brochure
  • GPU Has 10x Comp Density Given the same chip area, the achievable performance of a GPU is 10x higher than that of a CPU.
  • Evolution of Intel Pentium Pentium I Pentium II Pentium III Pentium IV Chip area breakdown Q: What can you observe? Why?
  • Extrapolation of Single Core CPU If we extrapolate the trend, in a few generations, Pentium will look like: Of course, we know it did not happen.  Q: What happened instead? Why?
  • Evolution of Multi-core CPUs Penryn Bloomfield Gulftown Beckton Chip area breakdown Q: What can you observe? Why?
  • Let's Take a Closer Look Less than 10% of total chip area is used for the real execution. Q: Why?
  • The Memory Hierarchy Notes on Energy at 45nm:  64-bit Int ADD takes about 1 pJ. 64-bit FP FMA takes about 200 pJ. It seems we can not further increase the computational density.
  • The Brick Wall -- UC Berkeley's View Power Wall : power expensive, transistors free Memory Wall : Memory slow, multiplies fast ILP Wall : diminishing returns on more ILP HW David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link
  • The Brick Wall -- UC Berkeley's View Power Wall : power expensive, transistors free Memory Wall : Memory slow, multiplies fast ILP Wall : diminishing returns on more ILP HW Power Wall + Memory Wall + ILP Wall = Brick Wall David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link
  • How to Break the Brick Wall? Hint: how to exploit the parallelism inside the application?
  • Step 1: Trade Latency with Throughput Hide the memory latency through fine-grained interleaved threading.
  • Interleaved Multi-threading
    • The granularity of interleaved multi-threading:
      • 100 cycles : hide off-chip memory latency
      • 10 cycles : + hide cache latency
      • 1 cycle : + hide branch latency, instruction dependency
  • Interleaved Multi-threading
    • The granularity of interleaved multi-threading:
      • 100 cycles: hide off-chip memory latency
      • 10 cycles: + hide cache latency
      • 1 cycle: + hide branch latency, instruction dependency
    • Fine-grained interleaved multi-threading:
    • Pros : ?
    • Cons : ?
  • Interleaved Multi-threading
    • The granularity of interleaved multi-threading:
      • 100 cycles: hide off-chip memory latency
      • 10 cycles: + hide cache latency
      • 1 cycle: + hide branch latency, instruction dependency
    • Fine-grained interleaved multi-threading:
    • Pros : remove branch predictor, OOO scheduler, large cache
    • Cons : register pressure, etc.
  • Fine-Grained Interleaved Threading Pros: reduce cache size, no branch predictor, no OOO scheduler Cons: register pressure, thread scheduler, requires huge parallelism Without and with fine-grained interleaved threading
  • HW Support Register file supports zero overhead context switch between interleaved threads.
  • Can We Make Further Improvement?
    • Reducing large cache gives 2x computational density.
    • Q: Can we make further improvements?
    Hint: We have only utilized thread level parallelism (TLP) so far.
  • Step 2: Single Instruction Multiple Data CPU uses short SIMD: SSE has 4 data lanes (vector width of 4). GPU uses wide SIMD: 8/16/24/... data lanes, i.e. 8/16/24/... processing elements (PEs).
  • Hardware Support Supporting interleaved threading + SIMD execution
  • Single Instruction Multiple Thread (SIMT) Hide vector width using scalar threads.
  • Example of SIMT Execution Assume 32 threads are grouped into one warp.
  • Step 3: Simple Core The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core. Lightweight PE: Fused Multiply-Add (FMA) SFU: Special Function Unit
  • NVIDIA's Motivation of Simple Core "This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train." --Bill Dally, NVIDIA
  • Review: How Do We Reach Here? NVIDIA Fermi, 512 Processing Elements (PEs)
  • Throughput Oriented Architectures
      • Fine-grained interleaved threading (~2x comp density)
      • SIMD/SIMT (>10x comp density)
      • Simple core (~2x comp density)
    • Key architectural features of throughput oriented processor.
    ref: Michael Garland and David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010. ( link )
  • Today's Topics
      • GPU architecture
      • GPU programming
      • GPU micro-architecture
      • Performance optimization and model
      • Trends
  • CUDA Programming Massive number (>10000) of light-weight threads.
  • Express Data Parallelism in Threads 
    • Compare thread program with vector program.
  • Vector Program
    • Scalar program
    •  
    • float A[4][8];
    • do-all(i=0;i<4;i++){
    •     do-all(j=0;j<8;j++){
    •         A[i][j]++;
    •      }
    • }
    • Vector program (vector width of 8)
    • float A[4][8];
    • do-all(i=0;i<4;i++){
    •      movups xmm0, [ &A[i][0] ]
    •      incps xmm0
    •      movups [ &A[i][0] ], xmm0
    • }
    •  
    Vector width is exposed to programmers.
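
As the speaker notes point out, "incps" is not a real SSE instruction. A version of the same vector loop written with actual SSE intrinsics could look like the sketch below; real SSE registers hold 4 floats, so each 8-wide row takes two vector operations.

    #include <xmmintrin.h>

    // Increment every element of a 4x8 float array using SSE intrinsics.
    void increment_rows(float A[4][8])
    {
        __m128 ones = _mm_set1_ps(1.0f);           // broadcast 1.0 into all 4 lanes
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 8; j += 4) {
                __m128 v = _mm_loadu_ps(&A[i][j]); // load 4 floats
                v = _mm_add_ps(v, ones);           // add 1.0 to each lane
                _mm_storeu_ps(&A[i][j], v);        // store 4 floats back
            }
        }
    }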
  • CUDA Program
    • Scalar program
    •  
    • float A[4][8];
    • do-all(i=0;i<4;i++) {
    •     do-all(j=0;j<8;j++) {
    •         A[i][j]++;
    •      }
    • }
    • CUDA program
    • float A[4][8];
    •  
    • kernelF<<<(4,1),(8,1)>>>(A);
    •  
    • __device__    kernelF(A){
    •      i = blockIdx.x;
    •      j = threadIdx.x;
    •     A[i][j]++;
    • }
    •  
      • CUDA program expresses data level parallelism (DLP) in terms of thread level parallelism (TLP).
      • Hardware converts TLP into DLP at run time.
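
The CUDA program on this slide is pseudocode: it passes a host array straight to the kernel and omits the launch types, and a real kernel is declared __global__ rather than __device__. A complete, runnable sketch of the same 4x8 example (the flattened indexing, memory management, and main function are assumptions added only to make it self-contained) is:

    #include <cstdio>
    #include <cuda_runtime.h>

    // One thread per element: block index selects the row, thread index the column.
    __global__ void kernelF(float *A, int ncols)
    {
        int i = blockIdx.x;            // row    (grid  = 4 x 1)
        int j = threadIdx.x;           // column (block = 8 x 1)
        A[i * ncols + j] += 1.0f;
    }

    int main()
    {
        const int rows = 4, cols = 8;
        float hA[rows][cols] = {};                 // host copy, all zeros
        float *dA;
        cudaMalloc(&dA, rows * cols * sizeof(float));
        cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

        kernelF<<<dim3(rows, 1), dim3(cols, 1)>>>(dA, cols);  // 4 blocks of 8 threads

        cudaMemcpy(hA, dA, sizeof(hA), cudaMemcpyDeviceToHost);
        printf("A[3][7] = %f\n", hA[3][7]);        // expect 1.000000
        cudaFree(dA);
        return 0;
    }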
  • Two Levels of Thread Hierarchy
    • kernelF<<<(4,1),(8,1)>>>(A);
    •  
    • __device__    kernelF(A){
    •      i = blockIdx.x;
    •      j = threadIdx.x;
    •     A[i][j]++;
    • }
    •  
  • Multi-dimension Thread and Block ID
    • kernelF<<<(2,2),(4,2)>>>(A);
    •  
    • __device__    kernelF(A){
    •      i = gridDim.x * blockIdx.y
    •          + blockIdx.x;
    •      j = blockDim.x * threadIdx.y
    •          + threadIdx.x;
    •     A[i][j]++;
    • }
    •  
    Both grid and thread block can have two dimensional index.
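
A concrete sketch of the index computation for the <<<(2,2),(4,2)>>> launch above, written with the built-in gridDim and blockDim variables (the flattened array argument is an assumption):

    // 2x2 grid of 4x2-thread blocks (8 threads each) covering a 4x8 array.
    // i linearizes the 2D block index into a row, j linearizes the 2D thread
    // index within the block into a column.
    __global__ void kernelF(float *A, int ncols)
    {
        int i = gridDim.x  * blockIdx.y  + blockIdx.x;    // 0..3  (rows)
        int j = blockDim.x * threadIdx.y + threadIdx.x;   // 0..7  (columns)
        A[i * ncols + j] += 1.0f;
    }

    // launch: kernelF<<<dim3(2, 2), dim3(4, 2)>>>(dA, 8);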
  • Scheduling Thread Blocks on SM Example: Scheduling 4 thread blocks on 3 SMs.
  • Executing Thread Block on SM
    • kernelF<<<(2,2), (4,2) >>>(A);
    •  
    • __device__    kernelF(A){
    •      i = gridDim.x * blockIdx.y
    •          + blockIdx.x;
    •      j = blockDim.x * threadIdx.y
    •          + threadIdx.x;
    •     A[i][j]++;
    • }
    •  
    Executed on machine with width of 4: Executed on machine with width of 8: Notes: the number of Processing Elements (PEs) is transparent to programmer.
  • Multiple Levels of Memory Hierarchy
      • Global: cached in L1/L2, 200~400 cycles (on cache miss), R/W
      • Shared: not cached, 1~3 cycles, R/W
      • Constant: cached, 1~3 cycles, read-only
      • Texture: cached, ~100 cycles, read-only
      • Local: cached in L1/L2, 200~400 cycles (on cache miss), R/W
  • Explicit Management of Shared Mem Shared memory is frequently used to exploit locality.
  • Shared Memory and Synchronization kernelF<<<(1,1),(16,16)>>>(A);   __device__ kernelF(A){   __shared__ smem[16][16]; //allocate smem   i = threadIdx.y;   j = threadIdx.x;   smem[i][j] = A[i][j];   __sync();   A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9; }   Example: average filter with 3x3 window 3x3 window on image Image data in DRAM
  • Shared Memory and Synchronization kernelF<<<(1,1),(16,16)>>>(A);   __device__ kernelF(A){   __shared__ smem[16][16];   i = threadIdx.y;   j = threadIdx.x;   smem[i][j] = A[i][j]; // load to smem   __sync(); // threads wait at barrier   A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9; }   Example: average filter over 3x3 window 3x3 window on image Stage data in shared mem
  • Shared Memory and Synchronization kernelF<<<(1,1),(16,16)>>>(A);   __device__ kernelF(A){   __shared__ smem[16][16];   i = threadIdx.y;   j = threadIdx.x;   smem[i][j] = A[i][j];   __sync(); // every thread is ready   A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9; }   Example: average filter over 3x3 window 3x3 window on image All threads finish the load
  • Shared Memory and Synchronization kernelF<<<(1,1),(16,16)>>>(A);   __device__ kernelF(A){   __shared__ smem[16][16];   i = threadIdx.y;   j = threadIdx.x;   smem[i][j] = A[i][j];   __sync();   A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] ... + smem[i+1][j+1] ) / 9; }   Example: average filter over 3x3 window 3x3 window on image Start computation
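
The kernel above is slide pseudocode: __sync() stands for the real barrier, and the i-1 / j+1 neighbors would step outside the 16x16 tile at the border. A runnable sketch of the same average filter, using __syncthreads() and a clamped border (the clamping policy and kernel name are assumptions added only to keep the example self-contained):

    // 3x3 average filter on a 16x16 tile staged in shared memory.
    __global__ void avgFilter16x16(const float *in, float *out)
    {
        __shared__ float smem[16][16];
        int i = threadIdx.y;
        int j = threadIdx.x;

        smem[i][j] = in[i * 16 + j];   // stage the tile in shared memory
        __syncthreads();               // wait until every thread has loaded

        float sum = 0.0f;
        for (int di = -1; di <= 1; di++) {
            for (int dj = -1; dj <= 1; dj++) {
                int y = min(max(i + di, 0), 15);   // clamp to the tile border
                int x = min(max(j + dj, 0), 15);
                sum += smem[y][x];
            }
        }
        out[i * 16 + j] = sum / 9.0f;
    }

    // launch: avgFilter16x16<<<dim3(1, 1), dim3(16, 16)>>>(d_in, d_out);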
  • Programmers Think in Threads Q: Why make this hassle?
  • Why Use Thread instead of Vector?
    • Thread Pros:
      • Portability. Machine width is transparent in the ISA.
      • Productivity. Programmers do not need to take care of the vector width of the machine.
    • Thread Cons:
      • Manual sync. Gives up lock-step execution within a vector.
      • Scheduling of threads can be inefficient.
      • Debug. "Threads considered harmful." Thread programs are notoriously hard to debug.
  • Features of CUDA
      • Programmers explicitly express DLP in terms of TLP.
      • Programmers explicitly manage memory hierarchy.
      • etc.
  • Today's Topics
      • GPU architecture
      • GPU programming
      • GPU micro-architecture
      • Performance optimization and model
      • Trends
  • Micro-architecture GF100 micro-architecture
  • HW Groups Threads Into Warps Example: 32 threads per warp
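
To make the 32-thread grouping concrete, a warp ID and lane ID can be recovered directly from the thread index; the kernel below is only an illustrative sketch (the name and output arrays are assumptions).

    // Record which warp and which lane (0..31) each thread belongs to.
    __global__ void whoAmI(int *warpIds, int *laneIds)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        warpIds[tid] = threadIdx.x / 32;   // warp index within the block
        laneIds[tid] = threadIdx.x % 32;   // lane within the warp
    }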
  • Example of Implementation Note: NVIDIA may use a more complicated implementation.
  • Example
    • Program Address : Inst
    • 0x0004 : add r0, r1, r2
    • 0x0008 : sub r3, r4, r5
    Assume warp 0 and warp 1 are scheduled for execution.
  • Read Src Op
    • Program Address: Inst
    • 0x0004: add r0, r1 , r2
    • 0x0008: sub r3, r4 , r5
    Read source operands: r1 for warp 0 r4 for warp 1
  • Buffer Src Op
    • Program Address: Inst
    • 0x0004: add r0, r1 , r2
    • 0x0008: sub r3, r4 , r5
    Push ops to op collector: r1 for warp 0 r4 for warp 1
  • Read Src Op
    • Program Address: Inst
    • 0x0004: add r0, r1, r2
    • 0x0008: sub r3, r4, r5
    Read source operands: r2 for warp 0 r5 for warp 1
  • Buffer Src Op
    • Program Address: Inst
    • 0x0004: add r0, r1, r2
    • 0x0008: sub r3, r4, r5
    Push ops to op collector: r2 for warp 0 r5 for warp 1
  • Execute
    • Program Address: Inst
    • 0x0004: add r0, r1, r2
    • 0x0008: sub r3, r4, r5
    Compute the first 16 threads in the warp.
  • Execute
    • Program Address: Inst
    • 0x0004: add r0, r1, r2
    • 0x0008: sub r3, r4, r5
    Compute the last 16 threads in the warp.
  • Write back
    • Program Address: Inst
    • 0x0004: add r0 , r1, r2
    • 0x0008: sub r3 , r4, r5
    Write back: r0 for warp 0 r3 for warp 1
  • Other High Performance GPU
      • ATI Radeon 5000 series.
  • ATI Radeon 5000 Series Architecture
  • Radeon SIMD Engine
      • 16 Stream Cores (SC)
      • Local Data Share
  • VLIW Stream Core (SC)
  • Local Data Share (LDS)
  • Today's Topics
      • GPU architecture
      • GPU programming
      • GPU micro-architecture
      • Performance optimization and model
      • Trends
  • Performance Optimization
    • Optimizations on memory latency tolerance
      • Reduce register pressure
      • Reduce shared memory pressure
    •   
    • Optimizations on memory bandwidth
      • Global memory coalesce
      • Avoid shared memory bank conflicts
      • Grouping byte access
      • Avoid Partition camping
    •  
    • Optimizations on computation efficiency
      • Mul/Add balancing
      • Increase floating point proportion 
    •  
    • Optimizations on operational intensity
      • Use tiled algorithm
      • Tuning thread granularity
  • Shared Mem Contains Multiple Banks
  • Compute Capability Need arch info to perform optimization. ref: NVIDIA, "CUDA C Programming Guide", ( link )
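
The architecture information these optimizations depend on can also be queried at run time with cudaGetDeviceProperties; a minimal sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print the compute capability and a few limits that matter for tuning.
    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // device 0
        printf("compute capability  : %d.%d\n", prop.major, prop.minor);
        printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("registers per block : %d\n", prop.regsPerBlock);
        printf("warp size           : %d\n", prop.warpSize);
        return 0;
    }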
  • Shared Memory (compute capability 2.x) without bank conflict: with bank conflict:
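
On compute capability 2.x, shared memory has 32 banks of 4-byte words, so 32 threads reading down a column of a 32x32 float tile all hit the same bank. Padding each row by one element shifts consecutive rows into different banks. The kernel below is a sketch of that padding trick, not code from the slides.

    // Column-wise reads from a 32x32 shared tile.
    // Without the +1 padding, smem[t][c] for t = 0..31 maps to one bank
    // (a 32-way conflict); with it, each row lands in a different bank.
    __global__ void columnSum(const float *in, float *out)
    {
        __shared__ float smem[32][32 + 1];   // padded row to avoid bank conflicts

        int r = threadIdx.y, c = threadIdx.x;
        smem[r][c] = in[r * 32 + c];
        __syncthreads();

        if (r == 0) {                        // one row of threads sums each column
            float sum = 0.0f;
            for (int t = 0; t < 32; t++)
                sum += smem[t][c];           // conflict-free thanks to the padding
            out[c] = sum;
        }
    }

    // launch: columnSum<<<dim3(1, 1), dim3(32, 32)>>>(d_in, d_out);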
  • Performance Optimization
    • Optimizations on memory latency tolerance
      • Reduce register pressure
      • Reduce shared memory pressure
    •   
    • Optimizations on memory bandwidth
      • Global memory alignment and coalescing
      • Avoid shared memory bank conflicts
      • Grouping byte access
      • Avoid Partition camping
    •  
    • Optimizations on computation efficiency
      • Mul/Add balancing
      • Increase floating point proportion 
    •  
    • Optimizations on operational intensity
      • Use tiled algorithm
      • Tuning thread granularity
  • Global Memory In Off-Chip DRAM
    • Address space is interleaved among multiple channels.
  • Global Memory
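
For the global-memory bandwidth optimizations listed earlier (alignment and coalescing), the rule of thumb is that consecutive threads of a warp should access consecutive addresses. The two kernels below sketch the difference; the names and the stride parameter are illustrative assumptions.

    // Coalesced: neighboring threads read neighboring words -> few memory transactions.
    __global__ void copyCoalesced(const float *in, float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] = in[tid];
    }

    // Strided: each warp touches many memory segments -> wasted bandwidth.
    __global__ void copyStrided(const float *in, float *out, int n, int stride)
    {
        int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (idx < n)
            out[idx] = in[idx];
    }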
  • Roofline Model Identify performance bottleneck: computation bound vs. bandwidth bound
  • Optimization Is Key for Attainable Gflops/s
  • Computation, Bandwidth, Latency
    • Illustrating three bottlenecks in the Roofline model.
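
The roofline bound itself is a one-line formula: attainable GFLOP/s = min(peak GFLOP/s, operational intensity x peak GB/s). A tiny sketch (the peak numbers are placeholders, not measurements from the slides):

    #include <stdio.h>

    // Attainable performance under the roofline model.
    double roofline(double peak_gflops, double peak_gbs, double op_intensity)
    {
        double bw_bound = op_intensity * peak_gbs;   // bandwidth-limited ceiling
        return bw_bound < peak_gflops ? bw_bound : peak_gflops;
    }

    int main()
    {
        // 0.5 flop/byte is bandwidth bound here; 16 flop/byte is compute bound.
        printf("%.1f GFLOP/s\n", roofline(1000.0, 150.0, 0.5));   // 75.0
        printf("%.1f GFLOP/s\n", roofline(1000.0, 150.0, 16.0));  // 1000.0
        return 0;
    }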
  • Today's Topics
      • GPU architecture
      • GPU programming
      • GPU micro-architecture
      • Performance optimization and model
      • Trends
  • Trends
    • Coming architectures:
      • Intel's Larrabee successor: Many Integrated Core (MIC)
      • CPU/GPU fusion: Intel Sandy Bridge, AMD Llano.
  • Intel Many Integrated Core (MIC) 32-core version of MIC:
  • Intel Sandy Bridge
    • Highlight:
      • Reconfigurable shared L3 for CPU and GPU
      • Ring bus
  • Sandy Bridge's New CPU-GPU interface ref: "Intel's Sandy Bridge Architecture Exposed", from AnandTech, ( link )
  • AMD Llano Fusion APU (expected Q3 2011)
    • Notes:
      • CPU and GPU are not sharing cache?
      • Unknown interface between CPU/GPU
  • GPU Research in ES Group
    • GPU research in the Electronic Systems group.
    • http://www.es.ele.tue.nl/~gpuattue/