This document discusses techniques for identifying performance limiters in CUDA GPU code. It recommends timing different parts of the code, profiling to collect metrics and events, prototyping kernel parts separately, and benchmarking hardware characteristics. It provides examples of measuring wall time and GPU time, lists common profiling events and metrics, and discusses a case study of profiling a matrix transpose. The document emphasizes that profiling helps verify assumptions and identify bottlenecks, but does not replace the optimization work itself.
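The document's own timing examples are not reproduced here; the sketch below illustrates the general pattern it refers to: measuring host wall time (here with `std::chrono`) alongside GPU time bracketed by CUDA events. The `scale` kernel, problem size, and launch configuration are hypothetical placeholders, not code from the document.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the code being timed.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Wall time: measured on the host, so it includes launch overhead
    // and any other CPU-side work between the two timestamps.
    auto t0 = std::chrono::high_resolution_clock::now();

    // GPU time: cudaEvent timestamps are recorded in the device stream,
    // so they bracket only the work enqueued between the two events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel and stop event complete

    auto t1 = std::chrono::high_resolution_clock::now();

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    double wall_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    printf("GPU time:  %.3f ms\n", gpu_ms);
    printf("Wall time: %.3f ms\n", wall_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Comparing the two numbers is itself a diagnostic: if wall time greatly exceeds GPU time, the limiter is likely on the host side (launch overhead, synchronization, transfers) rather than in the kernel.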