Graph 500 Benchmark and Reference Implementations
David Bader, Jason Riedy
Georgia Institute of Technology (booth 1561)
Benchmark Problem

Initial benchmark problem: Graph Search (BFS)
● Convert an input edge list to some internal format once (timed).
● Randomly select multiple search roots.
● Separately compute breadth-first search trees starting from each search root (timed).
● Return the array of parent nodes; parent[i] = j means j is the parent of i in the tree.
● Validate the output.

Other problems under consideration for the future (e.g. independent set, ...)
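
The parent-array output can be illustrated with a minimal sequential sketch. It is not one of the reference implementations; the function name `bfs_parents`, the use of -1 for unreached vertices, and the convention that the root is its own parent are assumptions made for this example.

```python
from collections import deque

def bfs_parents(num_vertices, edges, root):
    """Return parent[] for a BFS tree rooted at `root` (-1 marks unreached vertices)."""
    # Build an undirected adjacency list from the edge list, skipping self-loops.
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        if u != v:
            adj[u].append(v)
            adj[v].append(u)

    parent = [-1] * num_vertices
    parent[root] = root  # assumed convention: the root is its own parent
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if parent[v] == -1:  # first visit fixes the parent of v
                parent[v] = u
                queue.append(v)
    return parent

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (5, 5)]
print(bfs_parents(6, edges, root=0))  # prints [0, 0, 0, 1, 3, -1]
```

Here parent[3] = 1 means vertex 1 is the parent of vertex 3 in the tree; vertex 5 has only a self-loop and is never reached.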
Benchmark & Reference Impl. Structure

1. Generate the edge list.
2. Construct a graph from the edge list (timed kernel).
3. Randomly sample 64 unique search keys with degree at least one, not counting self-loops.
4. For each search key:
   1. Compute the BFS parent array (timed kernel).
   2. Validate that the parent array is a correct BFS search tree for the given search key.
5. Compute and output performance information.
   ● (Take care to report correct quartiles, means, and deviations, e.g. harmonic means for rates.)
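
The note about harmonic means for rates can be made concrete with a small sketch. The helper `teps_summary` and the sample numbers are hypothetical, chosen only to show why an arithmetic mean of per-search rates overstates throughput.

```python
import statistics

def teps_summary(edge_counts, times):
    """Summarize per-search rates (traversed edges per second, TEPS)."""
    rates = [m / t for m, t in zip(edge_counts, times)]
    return {
        "harmonic_mean_teps": statistics.harmonic_mean(rates),
        "arithmetic_mean_teps": statistics.mean(rates),
        "mean_time_s": statistics.mean(times),
    }

# Three made-up searches that each traverse 1000 edges.
summary = teps_summary([1000, 1000, 1000], [1.0, 2.0, 4.0])
print(summary)
```

With equal edge counts, the harmonic mean of the rates equals total edges divided by total time (3000 / 7, about 429 TEPS here), while the arithmetic mean (about 583 TEPS) is pulled up by the fastest search.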
Problem Classes

● Sizes chosen to range from currently accessible to optimistically ahead.
● Chosen as powers of two close to powers of 10.
  ● Toy: 10^10 → 2^26 = 17 GiB
  ● Huge: 10^15 → 2^42 = 1.1 PiB!
● Submissions ranged up to the Medium class.
  ● Next year, will someone tackle Large? Huge?

Problem Class   Size
Toy (10)        17 GiB
Mini (11)       140 GiB
Small (12)      1.1 TiB
Medium (13)     18 TiB
Large (14)      140 TiB
Huge (15)       1.1 PiB
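
The size arithmetic can be sketched as follows. The sketch reads 2^26 and 2^42 as vertex counts and assumes 16 edges per vertex and 16 bytes per edge (two 8-byte vertex ids); neither assumption is stated on the slide, so treat the constants as illustrative.

```python
EDGE_FACTOR = 16     # assumed average number of edges per vertex
BYTES_PER_EDGE = 16  # assumed: two 8-byte vertex ids per edge

def edge_list_bytes(scale):
    """Bytes needed to store the edge list of a graph with 2**scale vertices."""
    num_vertices = 2 ** scale
    num_edges = EDGE_FACTOR * num_vertices
    return num_edges * BYTES_PER_EDGE

for name, scale in [("Toy", 26), ("Huge", 42)]:
    size = edge_list_bytes(scale)
    print(f"{name}: 2^{scale} vertices -> {size} bytes ({size:.2e})")
```

Under these assumptions the Toy edge list takes 2^34 bytes (about 1.7 x 10^10) and the Huge edge list 2^50 bytes (about 1.1 x 10^15), which is how a power of two ends up close to each power of 10.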
Reference Implementations

Multiple reference implementations:
● High-level but non-definitive code in GNU Octave.
● Single shared-memory driver for:
  ● two sequential examples,
  ● one OpenMP code, and
  ● two Cray XMT codes.
● Separate, fully distributed MPI code from Jeremiah Willcock of Indiana (who also wrote the reproducible, parallel generator).

(This space intentionally left unoptimized.)
Reference Implementations

Multiple reference implementations:
● High-level sketch in GNU Octave. (24 lines in the timed kernels as counted by cloc)
  ● Not intended to be definitive.
  ● Used for executable examples in specification.
● Two sequential codes to demonstrate that the driver handles different kernels.
  ● The first forms a linked list on the unaltered, uncopied input. (103 lines)
  ● The second copies into a CSR graph representation. (171 lines)
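
For readers unfamiliar with CSR (compressed sparse row), the copy step of the second sequential code can be sketched roughly as below. This is an illustration, not the reference code: the names `build_csr`, `row_start`, and `col` are made up here, and dropping self-loops while storing each edge in both directions is an assumption.

```python
def build_csr(num_vertices, edges):
    """Copy an undirected edge list into CSR arrays (row_start, col)."""
    # Pass 1: count the degree of every vertex, ignoring self-loops.
    degree = [0] * num_vertices
    for u, v in edges:
        if u != v:
            degree[u] += 1
            degree[v] += 1

    # A prefix sum over the degrees gives the start of each adjacency row.
    row_start = [0] * (num_vertices + 1)
    for i in range(num_vertices):
        row_start[i + 1] = row_start[i] + degree[i]

    # Pass 2: scatter each edge into the next free slot of both endpoints.
    col = [0] * row_start[num_vertices]
    next_slot = row_start[:-1]  # slicing copies, so row_start stays intact
    for u, v in edges:
        if u != v:
            col[next_slot[u]] = v
            next_slot[u] += 1
            col[next_slot[v]] = u
            next_slot[v] += 1
    return row_start, col

row_start, col = build_csr(3, [(0, 1), (0, 2), (1, 2), (2, 2)])
print(row_start, col)  # prints [0, 2, 4, 6] [1, 2, 0, 2, 0, 1]
```

The neighbors of vertex i are col[row_start[i]:row_start[i + 1]], so a BFS scans one contiguous slice per vertex instead of chasing linked-list pointers through the original edge list.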
Reference Implementations

Multiple reference implementations:
● One OpenMP code for wide portability. (342 lines)
  ● Uses mmap for pseudo-out-of-core operation; can tackle anything that fits on a disk if you have the time...
● A Cray XMT code and a slight variation. (186 lines, 210 lines)
  ● Slight variation reduces hot-spotting in the BFS queue.
● An MPI code by Jeremiah Willcock from Indiana. (1107 lines)
  ● Fully distributed; runtime on SMP roughly comparable to OpenMP.

(This space intentionally left unoptimized.)
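
The idea behind mmap-based pseudo-out-of-core operation is that a file on disk is addressed like an in-memory array and the operating system pages data in on demand. The sketch below uses Python's standard `mmap` module rather than the C call used by the OpenMP code; the binary layout in `RECORD` and the helper `degrees_from_file` are assumptions for illustration only.

```python
import mmap
import os
import struct
import tempfile

# Assumed on-disk layout: each edge is two little-endian signed 64-bit vertex ids.
RECORD = struct.Struct("<qq")

def degrees_from_file(path, num_vertices):
    """Count vertex degrees by scanning a memory-mapped edge-list file."""
    degree = [0] * num_vertices
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Only the pages actually touched are read from disk.
        for offset in range(0, len(mm), RECORD.size):
            u, v = RECORD.unpack_from(mm, offset)
            if u != v:  # ignore self-loops
                degree[u] += 1
                degree[v] += 1
    return degree

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edges.bin")
    with open(path, "wb") as f:
        for u, v in [(0, 1), (0, 2), (1, 2), (2, 2)]:
            f.write(RECORD.pack(u, v))
    print(degrees_from_file(path, 3))  # prints [2, 2, 2]
```

The same access pattern works when the file is far larger than physical memory, at the cost of disk-bound run time.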
Untuned Performance for Comparison

Untuned OpenMP on scale-24 (smaller than Toy) using dual quad-core Intel Xeon X5570 processors (2.93 GHz, 8 MiB cache) with 48 GiB physical memory. The 16-thread results use HyperThreading. The toy class ran too long...

Threads   Mean time (s)   Mean rate (TEPS)
4         9.2             1.0 x 10^7
8         6.9             1.1 x 10^7
16        4.9             0.91 x 10^7

Untuned Cray XMT implementation performance against the toy class on PNNL's 128-processor Cray XMT.

Processors   Mean time (s)   Mean rate (TEPS)
32           23.7            4.5 x 10^7
64           24.3            4.4 x 10^7
128          28.2            3.8 x 10^7
[ EXPLORATION OF SHARED MEMORY GRAPH BENCHMARKS: THE GRAPH500 ]
David A. Bader (PI), Jason Riedy

[ OBJECTIVE ]
Explore benchmarks for high-performance data-intensive computations on parallel, shared-memory platforms.

[ DESCRIPTION ]
Current high-performance architectures are built to run linear algebra operations effectively. These architectures seem a poor fit for the massive growth of irregular data coming from biological, social, regulatory, and other sources. There are no widely supported benchmarks to guide architectural decisions for these source applications.

Georgia Tech worked within the Graph500 steering committee to draft a new breadth-first search benchmark acceptable for wide participation. Georgia Tech also provided and supports the OpenMP and Cray XMT shared-memory reference codes.

Problem Class   Size
Toy (10)        17 GiB
Mini (11)       140 GiB
Small (12)      1.1 TiB
Medium (13)     18 TiB
Large (14)      140 TiB
Huge (15)       1.1 PiB

For more: Visit the Graph500 BoF!

[ FUNDING ]
Sandia National Labs

Image Source: Nexus (Facebook application)
Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003