This document discusses optimizations for implementing an N-body simulation algorithm on GPUs using OpenCL. It begins with an overview of the basic N-body algorithm and its parallel implementation. Two key optimizations are explored: using local memory to enable data reuse across work items, and unrolling the computation loop. Performance results on AMD and Nvidia GPUs show that data reuse provides significant speedup, and loop unrolling further improves performance on the AMD GPU. An example N-body application is provided to experiment with these optimization techniques.