Graph500



Engineering the Graph500 benchmark.

Published in: Technology

Graph500

  1. Graph 500 Benchmark and Reference Implementations
     David Bader, Jason Riedy
     Georgia Institute of Technology (booth 1561)
  2. Benchmark Problem
     Initial benchmark problem: Graph Search (BFS)
     ● Convert an input edge list to some internal format once (timed).
     ● Randomly select multiple search roots.
     ● Separately compute breadth-first search trees starting from each search root (timed).
       ● Return the array of parent nodes; parent[i] = j means j is the parent of i in the tree.
       ● Validate the output.
     Other problems under consideration for the future (e.g. independent set, ...)
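
The BFS kernel on the slide above can be sketched in a few lines. This is a minimal illustration, not one of the reference implementations. It assumes an undirected graph, marks unreached vertices with -1, and makes the root its own parent; none of these conventions is stated on the slide. The function name `bfs_parents` is hypothetical.

```python
from collections import deque

def bfs_parents(edges, num_vertices, root):
    """Return parent[] for a BFS tree rooted at `root` (-1 = unreached)."""
    # Convert the edge list to an internal adjacency-list format once.
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        if u != v:  # assumption: self-loops are ignored
            adj[u].append(v)
            adj[v].append(u)

    parent = [-1] * num_vertices
    parent[root] = root  # assumption: the root is recorded as its own parent
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if parent[v] == -1:  # first visit fixes v's parent
                parent[v] = u
                queue.append(v)
    return parent

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 4)]
print(bfs_parents(edges, 5, 0))  # prints [0, 0, 0, 1, -1]
```

Here parent[3] = 1 means vertex 1 is the parent of vertex 3 in the tree, matching the slide's definition; vertex 4 only has a self-loop and is never reached.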
  3. Benchmark & Reference Impl. Structure
     1. Generate the edge list.
     2. Construct a graph from the edge list.
     3. Randomly sample 64 unique search keys with degree at least one, not counting self-loops.
     4. For each search key: (slide callout: Timed kernels)
        1. Compute the BFS parent array.
        2. Validate that the parent array is a correct BFS search tree for the given search key.
     5. Compute and output performance information.
        ● (Take care to report correct quartiles, means, and deviations, e.g. harmonic means for rates.)
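
Two of the steps above lend themselves to a short sketch: sampling the search keys and averaging the rates. The code below is an illustration under stated assumptions, not the reference driver. The function names, the fixed seed, and the per-search edge counts are all hypothetical.

```python
import random
from statistics import harmonic_mean

def sample_search_keys(edges, num_keys=64, seed=0):
    """Pick up to `num_keys` distinct vertices with degree >= 1, ignoring self-loops."""
    candidates = set()
    for u, v in edges:
        if u != v:  # a self-loop does not count toward the degree
            candidates.add(u)
            candidates.add(v)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(sorted(candidates), min(num_keys, len(candidates)))

def mean_teps(edge_counts, times):
    """Harmonic mean of per-search rates (traversed edges per second)."""
    rates = [m / t for m, t in zip(edge_counts, times)]
    return harmonic_mean(rates)

print(sorted(sample_search_keys([(0, 0), (1, 2), (2, 3)])))  # prints [1, 2, 3]
print(mean_teps([100, 100], [1.0, 4.0]))
```

With equal edge counts, the harmonic mean of the rates equals total edges over total time (200 edges in 5 s, i.e. 40 per second in the example), which is why it is the appropriate mean for rates; the arithmetic mean of 100 and 25 would overstate it.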
  4. Problem Classes
     ● Sizes chosen to range from currently accessible to optimistically ahead.
     ● Chosen as powers of two close to powers of 10.
       ● Toy: 10^10 → 2^26 = 17 GiB
       ● Huge: 10^15 → 2^42 = 1.1 PiB!
     ● Submissions ranged up to the Medium class.
       ● Next year, will someone tackle Large? Huge?

     Problem Class   Size
     Toy (10)        17 GiB
     Mini (11)       140 GiB
     Small (12)      1.1 TiB
     Medium (13)     18 TiB
     Large (14)      140 TiB
     Huge (15)       1.1 PiB
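
The arithmetic behind the Toy and Huge entries can be checked with a short calculation. The sketch below assumes an edge factor of 16 edges per vertex and 16 bytes per edge (two 64-bit vertex ids); both values are assumptions taken from the Graph 500 specification as commonly stated, not from the slide. The helper name `edge_list_bytes` is hypothetical.

```python
def edge_list_bytes(scale, edge_factor=16, bytes_per_edge=16):
    """Approximate edge-list size for 2**scale vertices (assumed parameters)."""
    return (2 ** scale) * edge_factor * bytes_per_edge

# Only the two scales that appear on the slide are used here.
for name, scale in [("Toy", 26), ("Huge", 42)]:
    size = edge_list_bytes(scale)
    print(f"{name}: 2^{scale} vertices -> {size:.3g} bytes")
# prints:
# Toy: 2^26 vertices -> 1.72e+10 bytes
# Huge: 2^42 vertices -> 1.13e+15 bytes
```

Under these assumptions the totals are 2^34 and 2^50 bytes, which land near 10^10 and 10^15 as the class numbers suggest. The 17 and 1.1 figures match these totals when read with decimal prefixes.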
  5. Reference Implementations
     Multiple reference implementations:
     ● High-level but non-definitive code in GNU Octave.
     ● Single shared-memory driver for:
       ● two sequential examples,
       ● one OpenMP code, and
       ● two Cray XMT codes.
     ● Separate, fully distributed MPI code from Jeremiah Willcock of Indiana (who also wrote the reproducible, parallel generator).
     (This space intentionally left unoptimized.)
  6. Reference Implementations
     Multiple reference implementations:
     ● High-level sketch in GNU Octave. (24 lines in the timed kernels as counted by cloc)
       ● Not intended to be definitive.
       ● Used for executable examples in specification.
     ● Two sequential codes to demonstrate that the driver handles different kernels.
       ● The first forms a linked list on the unaltered, uncopied input. (103 lines)
       ● The second copies into a CSR graph representation. (171 lines)
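
For readers unfamiliar with the CSR (compressed sparse row) representation mentioned above, here is a minimal sketch of copying an edge list into CSR arrays. It is not the 171-line reference code. It assumes an undirected graph and drops self-loops, and the names `edges_to_csr`, `row_ptr`, and `col_idx` are hypothetical.

```python
def edges_to_csr(edges, num_vertices):
    """Copy an undirected edge list into CSR arrays (row_ptr, col_idx)."""
    # Pass 1: count each vertex's degree.
    degree = [0] * num_vertices
    for u, v in edges:
        if u != v:  # assumption: self-loops are dropped
            degree[u] += 1
            degree[v] += 1

    # Prefix sums give the start of each vertex's neighbor range.
    row_ptr = [0] * (num_vertices + 1)
    for i in range(num_vertices):
        row_ptr[i + 1] = row_ptr[i] + degree[i]

    # Pass 2: scatter neighbors into their slots.
    col_idx = [0] * row_ptr[-1]
    next_slot = row_ptr[:-1]  # the slice is a copy, so row_ptr is not modified
    for u, v in edges:
        if u != v:
            col_idx[next_slot[u]] = v
            next_slot[u] += 1
            col_idx[next_slot[v]] = u
            next_slot[v] += 1
    return row_ptr, col_idx

row_ptr, col_idx = edges_to_csr([(0, 1), (0, 2), (1, 2), (3, 3)], 4)
print(row_ptr)  # prints [0, 2, 4, 6, 6]
print(col_idx)  # prints [1, 2, 0, 2, 0, 1]
```

The neighbors of vertex i are col_idx[row_ptr[i]:row_ptr[i + 1]], so a BFS can scan them contiguously instead of chasing list pointers.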
  7. Reference Implementations
     Multiple reference implementations:
     ● One OpenMP code for wide portability. (342 lines)
       ● Uses mmap for pseudo-out-of-core operation, can tackle anything that fits on a disk if you have the time...
     ● A Cray XMT code and a slight variation. (186 lines, 210 lines)
       ● Slight variation reduces hot-spotting in the BFS queue.
     ● An MPI code by Jeremiah Willcock from Indiana. (1107 lines)
       ● Fully distributed, runtime on SMP roughly comparable to OpenMP.
     (This space intentionally left unoptimized.)
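
The mmap idea mentioned above, letting the operating system page a file-backed edge list in and out instead of loading it all into memory, can be illustrated with Python's standard `mmap` module. This is a stand-in sketch, not the OpenMP reference code. The on-disk layout (pairs of little-endian 64-bit integers) and the function names are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

EDGE = struct.Struct("<qq")  # assumed layout: two little-endian int64 per edge

def write_edges(path, edges):
    """Write the edge list to disk in the assumed binary layout."""
    with open(path, "wb") as f:
        for u, v in edges:
            f.write(EDGE.pack(u, v))

def count_degrees_mmap(path, num_vertices):
    """Scan a file-backed edge list through mmap without reading it all at once."""
    degree = [0] * num_vertices
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), EDGE.size):
                u, v = EDGE.unpack_from(mm, offset)
                degree[u] += 1
                degree[v] += 1
    return degree

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edges.bin")
    write_edges(path, [(0, 1), (1, 2)])
    print(count_degrees_mmap(path, 3))  # prints [1, 2, 1]
```

Only the pages touched by the scan need to be resident, so the same loop works on files larger than physical memory, at the cost of disk-bound run time.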
  8. Untuned Performance for Comparison

     Threads   Mean time (s)   Mean rate (TEPS)
     4         9.2             1.0 x 10^7
     8         6.9             1.1 x 10^7
     16        4.9             0.91 x 10^7

     Untuned OpenMP on scale-24 (smaller than Toy) using dual quad-core Intel Xeon X5570 processors (2.93GHz, 8MiB cache) with 48 GiB physical memory. The 16-thread results use HyperThreading. The toy class ran too long...

     Processors   Mean time (s)   Mean rate (TEPS)
     32           23.7            4.5 x 10^7
     64           24.3            4.4 x 10^7
     128          28.2            3.8 x 10^7

     Untuned Cray XMT implementation performance against the toy class on PNNL's 128-processor Cray XMT.
  9. [ EXPLORATION OF SHARED MEMORY GRAPH BENCHMARKS: THE GRAPH500 ]
     David A. Bader (PI), Jason Riedy

     [ OBJECTIVE ]
     Explore benchmarks for high-performance data-intensive computations on parallel, shared-memory platforms.

     [ DESCRIPTION ]
     Current high-performance architectures are built to run linear algebra operations effectively. These architectures seem a poor fit for the massive growth of irregular data coming from biological, social, regulatory, and other sources. There are no widely supported benchmarks to guide architectural decisions for these source applications.

     Georgia Tech worked within the Graph500 steering committee to draft a new breadth-first search benchmark acceptable for wide participation. Georgia Tech also provided and supports the OpenMP and Cray XMT shared-memory reference codes.

     For more: Visit the Graph500 BoF!

     [ FUNDING ]
     Sandia National Labs

     [Figures: example graph with numbered vertices; two network images.]
     Image Source: Nexus (Facebook application)
     Image Source: Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003

     Problem Class   Size
     Toy (10)        17 GiB
     Mini (11)       140 GiB
     Small (12)      1.1 TiB
     Medium (13)     18 TiB
     Large (14)      140 TiB
     Huge (15)       1.1 PiB
