This tutorial covers parallelizing programs on GPUs with CUDA. It introduces GPU threads, thread blocks, and grids, then presents two exercises that parallelize vector addition: first across multiple threads within a single block, then across multiple blocks. The first exercise spreads the computation across 256 threads in one block, achieving a 62x speedup over the single-threaded version. The second uses multiple blocks of 256 threads each to parallelize the computation further, achieving a 1261x speedup over the single-threaded version. Profiling results compare the performance of the different implementations.
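To make the two launch configurations concrete, here is a minimal sketch of a CUDA vector-addition kernel and the launches the exercises describe. The kernel name `add`, the array names, the problem size `N`, the use of unified memory, and the grid-stride loop are assumptions for illustration; only the block size of 256 and the one-block vs. many-blocks configurations come from the tutorial summary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element per loop iteration; the global index
// combines the block index and the thread index within the block.
__global__ void add(int n, const float *x, const float *y, float *out) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // Grid-stride loop: works for any n and any launch configuration.
    for (int i = index; i < n; i += stride)
        out[i] = x[i] + y[i];
}

int main() {
    const int N = 1 << 20;  // 1M elements (illustrative size, an assumption)
    float *x, *y, *out;
    // Unified memory keeps the sketch short; host and device share the pointers.
    cudaMallocManaged(&x,   N * sizeof(float));
    cudaMallocManaged(&y,   N * sizeof(float));
    cudaMallocManaged(&out, N * sizeof(float));
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Exercise 1 style: a single block of 256 threads.
    // add<<<1, 256>>>(N, x, y, out);

    // Exercise 2 style: enough 256-thread blocks to cover all N elements.
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y, out);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);  // expect 3.0
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```

The single-block launch corresponds to the 62x result and the multi-block launch to the 1261x result in the summary; actual speedups depend on the GPU and on how the baseline is measured.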