This document discusses GPU architectures for deep learning. GPUs achieve high efficiency through massive parallelism and specialized hardware, occupying a middle ground between general-purpose CPUs and fixed-function ASICs that balances performance against programmability. Key architectural techniques include single-instruction multiple-thread (SIMT) execution, in which groups of threads run in lockstep on shared control logic, and specialized function units for common operations. The memory hierarchy spans registers, software-managed shared memory, hardware caches, and off-chip global memory, with each level trading capacity for bandwidth and latency. Programming models such as CUDA and OpenCL let developers write kernels, functions executed in parallel by many threads, though achieving good performance requires tuning for the target architecture. Fully-connected neural network layers serve as a running example of a computation that maps efficiently onto GPU parallelism.
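As a concrete illustration of the kernel model, below is a minimal CUDA sketch of a fully-connected forward pass, y = Wx + b, assigning one thread per output neuron. The names (fc_forward, launch_fc, in_dim, out_dim) and the row-major weight layout are illustrative assumptions, not details taken from the document.

```cuda
#include <cuda_runtime.h>

// Hypothetical fully-connected forward pass: y = W*x + b.
// One thread computes one output neuron. All identifiers here are
// illustrative; the document does not specify this kernel.
__global__ void fc_forward(const float* W,   // out_dim x in_dim, row-major
                           const float* x,   // in_dim input activations
                           const float* b,   // out_dim bias vector
                           float* y,         // out_dim outputs
                           int in_dim, int out_dim) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < out_dim) {
        // The accumulator lives in a register; W and x are streamed
        // from global memory in this untuned version.
        float acc = b[row];
        for (int k = 0; k < in_dim; ++k)
            acc += W[row * in_dim + k] * x[k];
        y[row] = acc;  // an activation function could be applied here
    }
}

// Host-side launch: round the grid up so every output row gets a thread.
void launch_fc(const float* dW, const float* dx, const float* db,
               float* dy, int in_dim, int out_dim) {
    int threads = 256;
    int blocks = (out_dim + threads - 1) / threads;
    fc_forward<<<blocks, threads>>>(dW, dx, db, dy, in_dim, out_dim);
}
```

This naive version reads every element of W and x from global memory; a tuned version would stage tiles of x in shared memory and batch multiple inputs to reuse weights, which is the kind of architecture-dependent tuning on which kernel performance depends.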