This document summarizes CUDA programming using CUBLAS and direct parallelization. It first introduces CUBLAS, NVIDIA's implementation of the BLAS routines on GPUs using CUDA, and walks through the typical workflow: initializing a CUBLAS handle, transferring data between host and device memory, executing CUBLAS functions, and releasing resources. It then turns to direct parallelization, in which a custom kernel assigns each thread a specific portion of the work, and explains how to choose grid and block sizes, allocate device memory, copy data to the device, launch kernels, and copy results back to host memory. Both approaches are illustrated with a matrix-vector multiplication example: one using CUBLAS and one using a hand-written kernel.
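
As a minimal sketch of the CUBLAS workflow just described (initialize, transfer, execute, clean up), the program below computes y = A*x with cublasSgemv. The dimensions and data are illustrative, A is stored column-major as CUBLAS expects, and error checking is omitted for brevity.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int m = 4, n = 3;               /* illustrative dimensions  */
    float A[m * n], x[n], y[m];           /* host data (column-major) */
    for (int i = 0; i < m * n; ++i) A[i] = 1.0f;
    for (int i = 0; i < n; ++i)     x[i] = (float)(i + 1);

    /* 1. Initialize CUBLAS. */
    cublasHandle_t handle;
    cublasCreate(&handle);

    /* 2. Allocate device memory and copy data host -> device. */
    float *dA, *dx, *dy;
    cudaMalloc((void **)&dA, m * n * sizeof(float));
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, m * sizeof(float));
    cublasSetMatrix(m, n, sizeof(float), A, m, dA, m);
    cublasSetVector(n, sizeof(float), x, 1, dx, 1);

    /* 3. Execute the CUBLAS function: y = alpha*A*x + beta*y. */
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, dA, m, dx, 1, &beta, dy, 1);

    /* 4. Copy the result device -> host, then clean up. */
    cublasGetVector(m, sizeof(float), dy, 1, y, 1);
    for (int i = 0; i < m; ++i) printf("y[%d] = %f\n", i, y[i]);

    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    cublasDestroy(handle);
    return 0;
}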
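
The direct-parallelization version of the same operation might look like the sketch below, assuming one thread per output row and a row-major matrix; the kernel name matvec and the block size of 256 are illustrative choices, not taken from the original document. The grid size is derived from the block size so that every row is covered.

#include <stdio.h>
#include <cuda_runtime.h>

/* Each thread computes one element of y: the dot product of its row of A with x. */
__global__ void matvec(const float *A, const float *x, float *y, int m, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[row * n + j] * x[j];
        y[row] = sum;
    }
}

int main(void) {
    const int m = 4, n = 3;
    float A[m * n], x[n], y[m];           /* host data (row-major) */
    for (int i = 0; i < m * n; ++i) A[i] = 1.0f;
    for (int i = 0; i < n; ++i)     x[i] = (float)(i + 1);

    /* Allocate device memory and copy inputs host -> device. */
    float *dA, *dx, *dy;
    cudaMalloc((void **)&dA, m * n * sizeof(float));
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, m * sizeof(float));
    cudaMemcpy(dA, A, m * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);

    /* Pick a block size, then round the grid size up to cover all m rows. */
    int block = 256;
    int grid = (m + block - 1) / block;
    matvec<<<grid, block>>>(dA, dx, dy, m, n);

    /* Copy the result device -> host and release device memory. */
    cudaMemcpy(y, dy, m * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < m; ++i) printf("y[%d] = %f\n", i, y[i]);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}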