• In general, as Jared mentions, using too many registers per thread is
undesirable because it reduces occupancy, and therefore reduces the
kernel's ability to hide latency. GPUs thrive on parallelism and do
so by covering memory latency with work from other threads.
• Therefore, you should probably not optimize arrays into registers.
Instead, ensure that your memory accesses to those arrays across
threads are as close to sequential as possible so you maximize
coalescing (i.e. minimize memory transactions).
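A sketch of the access pattern being recommended (kernel names and the stride parameter are illustrative, not from the discussion):

```cuda
// Coalesced: consecutive threads read consecutive elements, so a warp's
// 32 loads merge into a small number of memory transactions.
__global__ void coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // thread k touches element k
}

// Strided: consecutive threads are 'stride' elements apart, so each warp
// touches many cache lines and the transaction count multiplies.
__global__ void strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i] * 2.0f;   // thread k touches element k*stride
}
```

The two kernels do the same arithmetic; only the mapping of threads to addresses differs, which is what determines coalescing.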
37. Interpreting Output of --ptxas-options=-v
• Each CUDA thread is using 46 registers?
• There is no register spilling to local memory? (Note: local memory resides
in off-chip device memory, not in shared memory.)
• Is 72 bytes the sum total of the memory for the stack frames of the __global__
(the parallelized kernel) and __device__ (subroutines called from a __global__
function) functions?
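The numbers in these questions typically come from compiling with verbose ptxas output. A representative invocation and output follow; the file name, mangled symbol, and constant-memory figure are illustrative, while the register, stack-frame, and spill lines mirror the values discussed above:

```
$ nvcc -arch=sm_20 --ptxas-options=-v -c kernel.cu
ptxas info    : Compiling entry function '_Z6kernelPf' for 'sm_20'
ptxas info    : Function properties for _Z6kernelPf
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 46 registers, 44 bytes cmem[0]
```

"0 bytes spill stores, 0 bytes spill loads" is what indicates that no registers were spilled to local memory.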
• At the PTX level there are many more virtual registers than the hardware
provides; these are mapped to hardware registers at load time. The register
limit you specify sets an upper bound on the hardware resources used by the
generated binary, and it serves as a hint telling the compiler when to spill
registers (see below) while compiling to PTX, so that a desired level of
concurrency can be met.
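Two standard ways to impose such a register limit (the values 256, 2, and 32 are illustrative): a per-kernel cap with `__launch_bounds__`, or a per-compilation-unit cap with `nvcc -maxrregcount=32`.

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// the compiler limits register use so that blocks of up to 256 threads,
// with at least 2 blocks resident per multiprocessor, can be scheduled,
// spilling to local memory if the kernel would otherwise need more registers.
__global__ void __launch_bounds__(256, 2) my_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}
```

The `__launch_bounds__` form is usually preferable to `-maxrregcount` because it applies per kernel rather than to every kernel in the file.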
• For Fermi GPUs there are at most 64 hardware registers per thread. The 64th
is reserved by the ABI as the stack pointer and is thus needed for "register
spilling" (freeing up registers by temporarily storing their values on the
stack, which happens when more registers are needed than are available), so
it is untouchable by user code.
• Dynamically indexed arrays cannot be stored in registers, because the GPU
register file is not dynamically addressable.
• Scalar variables are automatically stored in registers by the compiler.
• Small, statically indexed arrays (i.e. where every index can be determined
at compile time; say, fewer than 16 floats) may be stored in registers by the
compiler.
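A sketch of the static-versus-dynamic indexing distinction (the kernel name, array size, and contents are illustrative):

```cuda
__global__ void indexing_demo(float *out, int k)
{
    float a[4] = {1.f, 2.f, 3.f, 4.f};

    // Static indexing: after full unrolling, every index into 'a' is a
    // compile-time constant, so the compiler may keep 'a' in registers.
    float s = 0.f;
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        s += a[i];

    // Dynamic indexing: 'k' is only known at run time, so 'a' must be
    // byte-addressable; the compiler places it in local memory instead
    // of registers (the register file is not dynamically addressable).
    out[threadIdx.x] = s + a[k & 3];
}
```

In practice a single dynamically indexed access can force the whole array out of registers, so mixing the two patterns on one array usually costs the register placement.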
65. Q&A-5: trace code discussion
• Note: For the disassembly instructions to work
properly, cuobjdump must be installed and present in your $PATH.