1. The CUDA programming model organizes parallel threads into cooperative thread arrays (CTAs); every thread executes the same program (the kernel), typically operating on different data.
2. CTAs are grouped into grids, and threads within a CTA can cooperate through shared memory. In CUDA source code, each thread block corresponds to a CTA on the hardware.
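The grid/CTA/thread hierarchy and per-CTA shared memory can be sketched as a small CUDA kernel (a hypothetical block-sum example; the kernel name, block size of 256, and launch parameters are illustrative, not from the source):

```cuda
// Each thread block (CTA) cooperates through __shared__ memory to sum
// its portion of the input; every thread executes this same kernel.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float partial[256];            // shared memory, visible to this CTA only
    int tid = threadIdx.x;                    // thread index within the CTA
    int gid = blockIdx.x * blockDim.x + tid;  // global thread index across the grid

    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                          // barrier: all threads in the CTA wait here

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = partial[0];  // one result per CTA
}
```

A launch such as `blockSum<<<numBlocks, 256>>>(d_in, d_out, n)` creates a grid of `numBlocks` CTAs of 256 threads each.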
3. The GPU architecture consists of streaming multiprocessors (SMs) that perform the computation, plus global memory, the GPU's DRAM, analogous to CPU RAM, which is accessible to all GPU threads and to the CPU through the CUDA runtime.
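On the host side, global memory is managed explicitly. A minimal sketch (the kernel, array size, and launch configuration are illustrative assumptions) of allocating global memory and moving data between CPU RAM and the GPU:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel: each thread scales one element in global memory.
// It executes on the GPU's streaming multiprocessors.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev;                                       // pointer into GPU global memory
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);    // grid of 4 CTAs, 256 threads each
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("host[0] = %.1f\n", host[0]);
    cudaFree(dev);
    return 0;
}
```

The explicit `cudaMemcpy` calls are what make global memory visible to both sides: the CPU stages data into GPU DRAM before the kernel runs and reads the results back afterward.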