Matrix factorization (MF) is used by many popular algorithms, such as collaborative filtering. GPUs, with their massive number of cores and high intra-chip memory bandwidth, offer an opportunity to accelerate MF much further when their architectural characteristics are exploited appropriately.
In this talk I will introduce cuMF, a CUDA-based matrix factorization library that optimizes the alternating least squares (ALS) method to solve very large-scale MF problems. cuMF uses a set of techniques to maximize performance on single and multiple GPUs. These techniques include smart access to sparse data that leverages the GPU memory hierarchy, using data parallelism in conjunction with model parallelism, minimizing the communication overhead among GPUs, and a novel topology-aware parallel reduction scheme.
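For readers unfamiliar with ALS, the iteration that cuMF accelerates can be sketched on a CPU in a few lines. This is only an illustration under stated assumptions, not cuMF code: the function names (`solve`, `update_factors`, `als`, `rmse`), the plain L2 regularization term `lam`, the dense Gaussian-elimination solver, and all default parameter values are hypothetical choices made for this sketch.

```python
import math
import random


def solve(A, b):
    """Solve the small dense system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def update_factors(observed_by_row, fixed, rank, lam):
    """With one factor matrix held fixed, solve a regularized least-squares problem per row."""
    updated = []
    for observed in observed_by_row:
        # Normal equations: (sum_k f_k f_k^T + lam * I) x = sum_k r_k f_k
        A = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for idx, r in observed:  # only the observed (sparse) entries contribute
            vec = fixed[idx]
            for i in range(rank):
                b[i] += r * vec[i]
                for j in range(rank):
                    A[i][j] += vec[i] * vec[j]
        updated.append(solve(A, b))
    return updated


def als(ratings, m, n, rank=2, lam=0.1, iters=10, seed=0):
    """Factorize an m x n sparse matrix given as (row, col, value) triples."""
    rng = random.Random(seed)
    by_row = [[] for _ in range(m)]
    by_col = [[] for _ in range(n)]
    for u, v, r in ratings:
        by_row[u].append((v, r))
        by_col[v].append((u, r))
    Theta = [[rng.random() for _ in range(rank)] for _ in range(n)]
    X = []
    for _ in range(iters):
        X = update_factors(by_row, Theta, rank, lam)  # fix Theta, update X
        Theta = update_factors(by_col, X, rank, lam)  # fix X, update Theta
    return X, Theta


def rmse(ratings, X, Theta):
    """Root-mean-square error of the factorization on the observed entries."""
    se = sum((r - sum(x * t for x, t in zip(X[u], Theta[v]))) ** 2
             for u, v, r in ratings)
    return math.sqrt(se / len(ratings))
```

Within one half-iteration, each row's small linear system is independent of every other row's, which is the property that makes ALS a natural fit for parallel hardware.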
With only a single machine with four Nvidia GPU cards, cuMF can be 6-10 times as fast, and 33-100 times as cost-efficient, as state-of-the-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem reported in the current literature. We also use cuMF to accelerate the ALS implementation in Spark MLlib.
A paper on cuMF is to be published at HPDC 2016, with a pre-print available at http://arxiv.org/abs/1603.03820.