Performance Analysis of Lattice QCD with APGAS Programming ModelKoichi Shirahata
The document analyzes the performance of a lattice quantum chromodynamics (QCD) application implemented using the Asynchronous Partitioned Global Address Space (APGAS) programming model in X10. It finds that the X10 implementation achieves a 102.8x speedup in strong scaling up to 256 places. However, the MPI implementation outperforms X10, being 2.26-2.58x faster due to more optimized communication overlapping in MPI. Analysis shows the X10 implementation suffers overhead from thread activation and synchronization. Communication using one-sided operations also outperforms two-sided in X10. The work contributes an X10 implementation of lattice QCD and evaluates its scalability and performance compared to MPI.
Hybrid Map Task Scheduling for GPU-based Heterogeneous ClustersKoichi Shirahata
This document proposes a hybrid scheduling technique for MapReduce jobs on GPU-based heterogeneous clusters. It aims to accelerate MapReduce by efficiently scheduling Map tasks to both CPUs and GPUs to minimize total job execution time. The technique was implemented in Hadoop using its Pipes feature to invoke CUDA programs from Java. Evaluation on a K-means application showed the hybrid scheduling approach was 1.93 times faster than CPU-only execution at 64 nodes by better utilizing multiple GPUs.
A GPU Implementation of Generalized Graph Processing Algorithm GIM-VKoichi Shirahata
This document summarizes a GPU implementation of the generalized iterative matrix-vector multiplication (GIM-V) graph processing algorithm on the Mars GPU-based MapReduce framework. The key findings are that the GPU implementation achieved speedups of 8.8-39x compared to a CPU-based Hadoop implementation, and 2.72x faster mapping than a non-MapReduce CPU implementation. However, the GPU implementation introduced significant performance overhead in the reduce stage. The researchers aim to further optimize the implementation by reducing data transfer costs through graph partitioning and utilizing local storage.
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
This document analyzes the performance of lattice quantum chromodynamics (QCD) simulations using the asynchronous partitioned global address space (APGAS) programming model on GPUs. It implements lattice QCD in X10 CUDA and compares performance to other implementations. Results show a 19.4x speedup from using X10 CUDA on 32 nodes of the TSUBAME 2.5 supercomputer compared to the original X10 implementation. Optimizations like data layout transformation and communication overlapping contributed to this acceleration.
Out-of-core GPU Memory Management for MapReduce-based Large-scale Graph Proce...Koichi Shirahata
This document summarizes an approach for out-of-core GPU memory management for large-scale graph processing on heterogeneous supercomputers. The approach introduces techniques for out-of-core GPU data management in a GPU-MapReduce framework. It implements out-of-core GPU sorting and investigates balancing scale-up and scale-out approaches. Performance tests on two supercomputers show up to 2.1x speedup over CPUs and 1.71x power efficiency from scaling up GPU usage.
A Scalable Implementation of a MapReduce-based Graph Processing Algorithm for...Koichi Shirahata
The document proposes implementing a scalable multi-GPU version of the graph processing algorithm GIM-V using MapReduce on large-scale heterogeneous supercomputers. It aims to optimize load balancing between GPU devices for large graph processing. The key contributions are: 1) Implementing GIM-V on a multi-GPU MapReduce framework called Mars by extending it with MPI for inter-GPU communication; 2) Optimizing GIM-V and load balancing for graph partitioning across GPUs. Evaluation on a 768-GPU supercomputer showed up to 1.52x speedup over CPU-based processing.
Performance Analysis of MapReduce Implementations on High Performance Homolog...Koichi Shirahata
This document describes performance analyses of MapReduce implementations for large-scale homology searches. It introduces homology searches and their use in metagenome analysis using sequence databases that are growing enormously in size. Two MapReduce designs for homology searches are proposed: one replicates the database on all nodes, while the other distributes the database. Preliminary experiments show MapReduce exhibits good scaling and comparable performance to MPI implementations. The goal is high-performance MapReduce homology searches for extremely large databases.