CuMF is a matrix factorization algorithm for recommendation systems that accelerates training using GPUs. It addresses two main challenges: irregular memory access when aggregating user vectors and the inability of a single GPU to handle large matrices. CuMF optimizes memory usage through register tiling and achieves scale-out by distributing computation across multiple GPUs in parallel. Experimental results show CuMF trains up to 10x faster and is 100x more cost-efficient than competing algorithms on CPUs or multiple machines.