# Graph500

Engineering the Graph500 benchmark.

Published in: Technology
### Graph500

1. Graph 500 Benchmark and Reference Implementations. David Bader, Jason Riedy, Georgia Institute of Technology (booth 1561)
2. Benchmark Problem
   - Initial benchmark problem: Graph Search (BFS)
     - Convert an input edge list to some internal format once (timed).
     - Randomly select multiple search roots.
     - Separately compute breadth-first search trees starting from each search root (timed).
       - Return the array of parent nodes; parent[i] = j means j is the parent of i in the tree.
       - Validate the output.
   - Other problems are under consideration for the future (e.g. independent set, ...).
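The parent-array output described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not one of the reference implementations. The function names `build_adjacency` and `bfs_parents`, the use of `-1` for unreached vertices, the convention that the root is its own parent, and the treatment of the graph as undirected are all assumptions made for the example.

```python
from collections import deque


def build_adjacency(num_vertices, edges):
    """Convert an edge list to adjacency lists (the 'internal format')."""
    adjacency = [[] for _ in range(num_vertices)]
    for u, v in edges:
        if u != v:  # self-loops contribute nothing to a search tree
            adjacency[u].append(v)
            adjacency[v].append(u)  # assumption: edges are undirected
    return adjacency


def bfs_parents(adjacency, root):
    """Return parent[], where parent[i] = j means j is the parent of i."""
    parent = [-1] * len(adjacency)  # assumption: -1 marks unreached vertices
    parent[root] = root             # assumption: the root is its own parent
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if parent[v] == -1:
                parent[v] = u
                queue.append(v)
    return parent


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 4)]
adjacency = build_adjacency(5, edges)
print(bfs_parents(adjacency, 0))  # prints [0, 0, 0, 1, -1]
```

Vertex 4 has only a self-loop, so it stays unreached from root 0.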
3. Benchmark & Reference Impl. Structure
   1. Generate the edge list.
   2. Construct a graph from the edge list (timed kernel).
   3. Randomly sample 64 unique search keys with degree at least one, not counting self-loops.
   4. For each search key:
      1. Compute the BFS parent array (timed kernel).
      2. Validate that the parent array is a correct BFS search tree for the given search key.
   5. Compute and output performance information.
      - (Take care to report correct quartiles, means, and deviations, e.g. harmonic means for rates.)
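The validation step can be illustrated with a simplified checker. This is a sketch under stated assumptions, not the benchmark's official validation: the name `validate_bfs_tree`, the `-1` marker for unreached vertices, and the root-is-its-own-parent convention are all hypothetical choices for this example. It checks that parent links lead back to the root without cycles, that every tree edge exists in the input, and that every input edge joins vertices whose BFS levels differ by at most one.

```python
def validate_bfs_tree(num_vertices, edges, parent, root):
    """Return True if parent[] looks like a valid BFS tree rooted at root."""
    if parent[root] != root:
        return False

    # Derive each vertex's level by walking up the parent links.
    level = [-1] * num_vertices
    level[root] = 0
    for v in range(num_vertices):
        if parent[v] == -1 or level[v] != -1:
            continue
        path, u = [], v
        while level[u] == -1:
            # A chain that hits an unreached vertex or loops is invalid.
            if parent[u] == -1 or len(path) > num_vertices:
                return False
            path.append(u)
            u = parent[u]
        for w in reversed(path):
            level[w] = level[parent[w]] + 1

    edge_set = set()
    for u, v in edges:
        if u == v:
            continue  # self-loops are ignored
        reached_u, reached_v = level[u] != -1, level[v] != -1
        if reached_u != reached_v:
            return False  # an edge cannot leave the searched component
        if reached_u and abs(level[u] - level[v]) > 1:
            return False  # BFS levels across an edge differ by at most one
        edge_set.add((u, v))
        edge_set.add((v, u))

    # Every tree edge must be an edge of the input graph.
    return all(
        v == root or parent[v] == -1 or (parent[v], v) in edge_set
        for v in range(num_vertices)
    )


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 4)]
print(validate_bfs_tree(5, edges, [0, 0, 0, 1, -1], 0))  # prints True
print(validate_bfs_tree(5, edges, [0, 0, 0, 0, -1], 0))  # prints False
```

The second call fails because vertex 3 claims vertex 0 as its parent, and the input has no edge between 0 and 3.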
4. Problem Classes
   - Sizes chosen to range from currently accessible to optimistically ahead.
   - Chosen as powers of two close to powers of 10.
     - Toy: 10^10 → 2^26 = 17 GiB
     - Huge: 10^15 → 2^42 = 1.1 PiB!
   - Submissions ranged up to the Medium class.
     - Next year, will someone tackle Large? Huge?

   | Problem Class | Size    |
   |---------------|---------|
   | Toy (10)      | 17 GiB  |
   | Mini (11)     | 140 GiB |
   | Small (12)    | 1.1 TiB |
   | Medium (13)   | 18 TiB  |
   | Large (14)    | 140 TiB |
   | Huge (15)     | 1.1 PiB |
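The step from a vertex count of 2^26 (Toy) or 2^42 (Huge) to a storage size can be worked through as follows. The slide does not say how the sizes were derived, so this sketch assumes an average of 16 edges per vertex and 16 bytes per stored edge. Both constants and the helper name `edge_list_bytes` are assumptions for illustration only.

```python
EDGE_FACTOR = 16     # assumed edges per vertex; not stated on the slide
BYTES_PER_EDGE = 16  # assumed two 8-byte vertex ids per edge; not stated on the slide


def edge_list_bytes(scale):
    """Bytes needed for the edge list of a graph with 2**scale vertices."""
    return (2 ** scale) * EDGE_FACTOR * BYTES_PER_EDGE


for name, scale in [("Toy", 26), ("Huge", 42)]:
    size = edge_list_bytes(scale)
    print(f"{name}: 2^{scale} vertices -> {size:.2e} bytes")
```

Under these assumptions Toy comes to 2^34 bytes (about 1.7 x 10^10) and Huge to 2^50 bytes (about 1.1 x 10^15), which lines up with the slide's figures if its units are read as decimal gigabytes and petabytes.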
5. Reference Implementations
   - Multiple reference implementations:
     - High-level but non-definitive code in GNU Octave.
     - Single shared-memory driver for:
       - two sequential examples,
       - one OpenMP code, and
       - two Cray XMT codes.
     - Separate, fully distributed MPI code from Jeremiah Willcock of Indiana (who also wrote the reproducible, parallel generator).
   - (This space intentionally left unoptimized.)
6. Reference Implementations
   - Multiple reference implementations:
     - High-level sketch in GNU Octave. (24 lines in the timed kernels as counted by cloc)
       - Not intended to be definitive.
       - Used for executable examples in specification.
     - Two sequential codes to demonstrate that the driver handles different kernels.
       - The first forms a linked list on the unaltered, uncopied input. (103 lines)
       - The second copies into a CSR graph representation. (171 lines)
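For readers unfamiliar with the CSR (compressed sparse row) representation mentioned on this slide, a minimal sketch of copying an edge list into CSR form follows. It is not the reference code. The names `edge_list_to_csr`, `row_start`, `columns`, and `neighbors` are hypothetical, and the sketch assumes undirected edges with self-loops dropped.

```python
def edge_list_to_csr(num_vertices, edges):
    """Copy an edge list into CSR arrays (row_start, columns)."""
    # First pass: count each vertex's degree.
    degree = [0] * num_vertices
    for u, v in edges:
        if u != v:  # assumption: self-loops are dropped
            degree[u] += 1
            degree[v] += 1

    # Prefix sums give the offset where each vertex's neighbors begin.
    row_start = [0] * (num_vertices + 1)
    for i in range(num_vertices):
        row_start[i + 1] = row_start[i] + degree[i]

    # Second pass: scatter both directions of every edge into place.
    columns = [0] * row_start[-1]
    next_slot = row_start[:-1]  # slicing makes a working copy
    for u, v in edges:
        if u != v:
            columns[next_slot[u]] = v
            next_slot[u] += 1
            columns[next_slot[v]] = u
            next_slot[v] += 1
    return row_start, columns


def neighbors(row_start, columns, u):
    """Neighbors of u are one contiguous slice of the columns array."""
    return columns[row_start[u]:row_start[u + 1]]


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 4)]
row_start, columns = edge_list_to_csr(5, edges)
print(row_start)                         # prints [0, 2, 4, 6, 8, 8]
print(neighbors(row_start, columns, 3))  # prints [1, 2]
```

Storing each vertex's neighbors contiguously is what makes the inner loop of a BFS a simple array scan, at the cost of copying the input once.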
7. Reference Implementations
   - Multiple reference implementations:
     - One OpenMP code for wide portability. (342 lines)
       - Uses mmap for pseudo-out-of-core operation; can tackle anything that fits on a disk if you have the time...
     - A Cray XMT code and a slight variation. (186 lines, 210 lines)
       - Slight variation reduces hot-spotting in the BFS queue.
     - An MPI code by Jeremiah Willcock from Indiana. (1107 lines)
       - Fully distributed, runtime on SMP roughly comparable to OpenMP.
   - (This space intentionally left unoptimized.)
8. Untuned Performance for Comparison
   - Untuned OpenMP on scale-24 (smaller than Toy) using dual quad-core Intel Xeon X5570 processors (2.93 GHz, 8 MiB cache) with 48 GiB physical memory. The 16-thread results use HyperThreading. The toy class ran too long...

     | Threads | Mean time (s) | Mean rate (TEPS) |
     |---------|---------------|------------------|
     | 4       | 9.2           | 1.0 x 10^7       |
     | 8       | 6.9           | 1.1 x 10^7       |
     | 16      | 4.9           | 0.91 x 10^7      |

   - Untuned Cray XMT implementation performance against the toy class on PNNL's 128-processor Cray XMT.

     | Processors | Mean time (s) | Mean rate (TEPS) |
     |------------|---------------|------------------|
     | 32         | 23.7          | 4.5 x 10^7       |
     | 64         | 24.3          | 4.4 x 10^7       |
     | 128        | 28.2          | 3.8 x 10^7       |
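The mean rates in these results are rates, so the earlier advice to use harmonic means applies. The sketch below uses made-up timings (not data from the slides) to show why: when every search traverses the same number of edges, the harmonic mean of the per-search rates equals total edges over total time, while the arithmetic mean overstates it.

```python
from statistics import harmonic_mean, mean

# Hypothetical numbers for illustration only: two searches, equal work.
edges_traversed = [100, 100]
times_s = [1.0, 4.0]

# Per-search rate in traversed edges per second (TEPS).
rates = [e / t for e, t in zip(edges_traversed, times_s)]

arithmetic = mean(rates)          # 62.5, overstates the throughput
harmonic = harmonic_mean(rates)   # about 40
overall = sum(edges_traversed) / sum(times_s)  # 200 edges / 5 s = 40

print(arithmetic, harmonic, overall)
```

If the searches traverse different numbers of edges, the harmonic mean and the overall throughput no longer coincide exactly, which is one reason the slides ask for care when reporting means.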
9. [ EXPLORATION OF SHARED MEMORY GRAPH BENCHMARKS: THE GRAPH500 ] David A. Bader (PI), Jason Riedy
   - [ OBJECTIVE ] Explore benchmarks for high-performance data-intensive computations on parallel, shared-memory platforms.
   - [ DESCRIPTION ] Current high-performance architectures are built to run linear algebra operations effectively. These architectures seem a poor fit for the massive growth of irregular data coming from biological, social, regulatory, and other sources. There are no widely supported benchmarks to guide architectural decisions for these source applications. Georgia Tech worked within the Graph500 steering committee to draft a new breadth-first search benchmark acceptable for wide participation. Georgia Tech also provided and supports the OpenMP and Cray XMT shared-memory reference codes. For more: Visit the Graph500 BoF!
   - [ FUNDING ] Sandia National Labs
   - Problem classes: Toy (10) 17 GiB; Mini (11) 140 GiB; Small (12) 1.1 TiB; Medium (13) 18 TiB; Large (14) 140 TiB; Huge (15) 1.1 PiB.
   - [Figure: example graph with vertices numbered 0 to 9]
   - Image sources: Nexus (Facebook application); Giot et al., “A Protein Interaction Map of Drosophila melanogaster”, Science 302, 1722-1736, 2003.