Tensor Core
"SIMD" for GPU
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
Tensor Cores
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
Tensor Cores
https://www.nvidia.com/en-us/data-center/tensorcore/
Up to 12X higher peak TFLOPS for deep learning training (Tesla V100 with Tensor Cores vs. Pascal P100)
https://www.nvidia.com/en-us/data-center/tensorcore/
Supported Types
• Input : FP16, u8, s8, u4, s4, b1

• Accumulator : FP16, FP32, int

• Also in experimental:
namespace experimental {
    namespace precision {
        struct u4; // 4-bit unsigned
        struct s4; // 4-bit signed
        struct b1; // 1-bit
    }
    enum bmmaBitOp { bmmaBitOpXOR = 1 };
    enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
}
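A minimal sketch of how these element types map onto WMMA fragment declarations (assuming CUDA 10+ and <mma.h>; 16x16x16 and 8x8x32 are tile shapes supported for these types):

#include <mma.h>
using namespace nvcuda;

__global__ void fragment_types_demo() {
    // FP16 inputs with an FP32 accumulator (16x16x16 tile)
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_fp16;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_fp16;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_fp32;

    // s8 inputs with an int accumulator (16x16x16 tile, Turing / CUDA 10)
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_s8;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> acc_s32;

    // Experimental 4-bit signed input (8x8x32 tile)
    wmma::fragment<wmma::matrix_a, 8, 8, 32, wmma::experimental::precision::s4, wmma::row_major> a_s4;
}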
D (m×n) = A (m×k) × B (k×n) + C (m×n)
Mixed Precision (FP16 multiply inputs, FP32 accumulate)
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
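Host data is usually FP32, so the A and B inputs are converted to FP16 on the device while C stays FP32; a sketch in the spirit of the cited sample's conversion kernel (pointer and size names are placeholders):

#include <cuda_fp16.h>

// Convert an FP32 buffer to FP16 so it can feed the Tensor Core inputs.
__global__ void convertFp32ToFp16(half *out, const float *in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = __float2half(in[idx]);
    }
}

// Example launch (a_fp16 / a_fp32 / n are placeholders):
// convertFp32ToFp16<<<(n + 255) / 256, 256>>>(a_fp16, a_fp32, n);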
Programming
CUDA Library
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
cuBLAS, cuDNN (Tensor Cores are also used in TensorRT 3)
CUDA WMMA API
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
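On the library path, a sketch of a Tensor Core GEMM through cuBLAS (CUDA 9-era API names, following the cited blog post; error checking omitted, pointers and sizes are placeholders):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Mixed-precision GEMM on Tensor Cores: FP16 A/B, FP32 C, FP32 compute type.
void tensorCoreGemm(cublasHandle_t handle, int M, int N, int K,
                    const half *a_fp16, const half *b_fp16, float *c_fp32) {
    float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Core kernels
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 M, N, K,
                 &alpha,
                 a_fp16, CUDA_R_16F, M,    // A : half, lda = M
                 b_fp16, CUDA_R_16F, K,    // B : half, ldb = K
                 &beta,
                 c_fp32, CUDA_R_32F, M,    // C : float, ldc = M
                 CUDA_R_32F,               // accumulate in FP32
                 CUBLAS_GEMM_DFALT_TENSOR_OP);
}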
CPU Level
simpleTensorCoreGEMM.cu
https://github.com/parallel-forall/code-samples/blob/master/posts/tensor-cores/simpleTensorCoreGEMM.cu
call the kernel function (the work inside is organized per warp)
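Host-side launch sketch following simpleTensorCoreGEMM.cu: each block holds 16 warps and each warp produces one 16x16 output tile (MATRIX_M/N/K, WMMA_M/N, alpha, beta and the device pointers are placeholders named after the sample):

// 128x4 threads = 16 warps per block, so a block covers a 64x64 output tile.
dim3 blockDim(128, 4);
dim3 gridDim((MATRIX_M + (WMMA_M * blockDim.x / 32) - 1) / (WMMA_M * blockDim.x / 32),
             (MATRIX_N +  WMMA_N * blockDim.y      - 1) / (WMMA_N * blockDim.y));

wmma_example<<<gridDim, blockDim>>>(a_fp16, b_fp16, c_fp32,
                                    MATRIX_M, MATRIX_N, MATRIX_K, alpha, beta);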
Warp-Level
http://on-demand.gputechconf.com/gtc/2017/presentation/s7132-mark-harris-new-cuda-features-and-beyond.pdf
(In short)
Warp-Level :

Initialization
Values
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
simpleTensorCoreGEMM.cu
Kernel function (work is done per warp)
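In simpleTensorCoreGEMM.cu the kernel first works out which 16x16 output tile the current warp owns (blockDim.x is a multiple of warpSize):

// Map this warp to one output tile of the result matrix.
int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;   // tile row
int warpN = (blockIdx.y * blockDim.y + threadIdx.y);              // tile column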
Warp-Level :

Fragments live in registers
Fragment type : matrix_a, matrix_b, or accumulator
Clear the accumulator
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
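The corresponding declarations from the cited sample: fragments live in the warp's registers, and the accumulator is cleared before the k-loop (16x16x16 tiles, FP16 inputs, FP32 accumulator):

// Fragments are distributed across the registers of the whole warp.
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

// Clear the accumulator before the k-loop.
wmma::fill_fragment(acc_frag, 0.0f);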
Warp-Level :
Tile Calculation (compute one tile of the output matrix per warp)
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
D = A × B + C
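The k-loop for one tile, following the cited sample (lda/ldb are the leading dimensions of A and B; bounds checks omitted):

// March along the shared dimension in steps of 16, accumulating into acc_frag.
for (int i = 0; i < K; i += 16) {
    int aRow = warpM * 16, aCol = i;           // sub-tile of A
    int bRow = i,          bCol = warpN * 16;  // sub-tile of B

    wmma::load_matrix_sync(a_frag, a + aRow + aCol * lda, lda);
    wmma::load_matrix_sync(b_frag, b + bRow + bCol * ldb, ldb);

    // One Tensor Core multiply-accumulate: acc = a * b + acc
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
}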
Warp-Level :
Finishing
Optional Scaling
C = alpha * Acc + beta * C
Store to Memory
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
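Finishing sketch from the cited sample: load the existing C tile, apply the optional alpha/beta scaling element-wise, and store the result (ldc is the leading dimension of C):

int cRow = warpM * 16;
int cCol = warpN * 16;

// Load the current C tile, scale it together with the accumulator, write back.
wmma::load_matrix_sync(c_frag, c + cRow + cCol * ldc, ldc, wmma::mem_col_major);

for (int i = 0; i < c_frag.num_elements; i++) {
    c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
}

wmma::store_matrix_sync(c + cRow + cCol * ldc, c_frag, ldc, wmma::mem_col_major);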
Availability
• Volta : V100, Titan V

• Turing : RTX 2070, RTX 2080, RTX 2080 Ti, etc.