GraphBLAS and Emus
Jason Riedy (all opinions my own, no planning guarantees)
GraphBLAS BoF at IEEE HPEC, 22 September 2020
Lucata / Emu Technology
Lucata’s PGAS Architecture
[Figure: architecture block diagram. Labels: 1 nodelet with Gossamer Core 1 ... Gossamer Core 4 and a Memory-Side Processor; Migration Engine; Stationary Core; RapidIO; Disk I/O; 8 nodelets per node; 64 nodelets per Chick.]
• Cacheless multithreaded multicore
• Memory-side “processor” at narrow-channel DRAM
• Stationary core for OS
• Physically distributed memory
• Threads migrate in hardware on reads!
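A toy model of the last bullet, only to make the idea concrete. The ownership rule below (contiguous chunks of words dealt round-robin to nodelets) and the names `count_migrations` and `words_per_nodelet` are assumptions for illustration, not the hardware's actual address mapping.

```python
def count_migrations(addresses, num_nodelets, words_per_nodelet):
    """Count thread moves under a toy ownership model.

    Assumption: address a lives on nodelet (a // words_per_nodelet) % num_nodelets,
    and the thread starts on the nodelet that owns its first read.
    """
    if not addresses:
        return 0

    def owner(addr):
        return (addr // words_per_nodelet) % num_nodelets

    migrations = 0
    here = owner(addresses[0])
    for addr in addresses[1:]:
        there = owner(addr)
        if there != here:
            # A read of remote data moves the thread to the data, not the reverse.
            migrations += 1
            here = there
    return migrations


if __name__ == "__main__":
    sequential = list(range(16))
    ping_pong = [0, 2, 0, 2]
    print(count_migrations(sequential, num_nodelets=8, words_per_nodelet=2))
    print(count_migrations(ping_pong, num_nodelets=8, words_per_nodelet=2))
```

In this model a sequential sweep migrates once per ownership boundary, while alternating between two nodelets migrates on every read.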
GraphBLAS and Emus — 22 Sep 2020 2/8
Pointer-Chasing Benchmark
Data-dependent loads, fine-grained access [1]
[Figure: access orders over elements 0 to 15. Panels: Ordered; Intra-block shuffle (weak locality); Full block shuffle (weak locality).]
[1] Eric Hein, Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Vuduc, Riedy. “An Initial Characterization of the Emu Chick,” Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2018.
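A minimal sketch of how the access orders could be generated, assuming the pattern names mean the following: block_shuffle permutes whole blocks, intra_block_shuffle permutes elements inside each block, and full_block_shuffle does both. The function names and the Python list standing in for 16B elements are hypothetical; this is not the benchmark's source.

```python
import random


def build_chain(n, block_size, mode, seed=0):
    """Return (next_index, start) describing one cycle through n elements."""
    rng = random.Random(seed)
    blocks = [list(range(lo, min(lo + block_size, n)))
              for lo in range(0, n, block_size)]
    if mode in ("intra_block_shuffle", "full_block_shuffle"):
        for block in blocks:
            rng.shuffle(block)      # permute elements inside each block
    if mode in ("block_shuffle", "full_block_shuffle"):
        rng.shuffle(blocks)         # permute the order of whole blocks
    order = [i for block in blocks for i in block]
    next_index = [0] * n
    for pos, i in enumerate(order):
        next_index[i] = order[(pos + 1) % n]
    return next_index, order[0]


def chase(next_index, start):
    """Walk the cycle once; every load depends on the previous load."""
    visited = 0
    i = start
    while True:
        visited += 1
        i = next_index[i]
        if i == start:
            return visited


if __name__ == "__main__":
    for mode in ("ordered", "block_shuffle",
                 "intra_block_shuffle", "full_block_shuffle"):
        nxt, start = build_chain(16, 4, mode)
        print(mode, nxt, chase(nxt, start))
```

Every mode visits each element exactly once, so only the order of the dependent loads changes between patterns.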
Selected Results: x86 Pointer-Chasing Benchmark
[Figure: two panels, 56 threads and 112 threads. x-axis: block size (number of 16B elements), 1 to 4M. y-axis: memory bandwidth (GB/s), 0 to 100, with peak STREAM bandwidth marked. Series: block_shuffle, intra_block_shuffle, full_block_shuffle.]
Haswell results: every pattern is different. [2]
[2] Eric Hein, Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Riedy, Vuduc, Conte. “A Microbenchmark Characterization of the Emu Chick.” Parallel Computing, 10.1016/j.parco.2019.04.012
Selected Results: Emu Pointer-Chasing Benchmark
[Figure: two panels, 2048 threads and 4096 threads. x-axis: block size (number of 16B elements), 1 to 4M. y-axis: memory bandwidth (GB/s), 0 to 12, with peak STREAM bandwidth marked. Series: block_shuffle, intra_block_shuffle, full_block_shuffle.]
Mostly flat performance, high utilization. [2]
Selected Results: BFS on a Dynamic Data Structure
[Figure: BFS results versus scale (15 to 21). Axes: MTEPS (0 to 100) and edge bandwidth (MB/s, 0 to 1500). Series: Emu single node - Cilk; Emu multi-node - Cilk; x86 Haswell - STINGER; x86 Haswell - Cilk.]
Note: Streaming data structure, not statically optimized. [3]
[3] Hein, Eswar, Abdurrahman Yasar, Prasanth Chatarasi, Li, Young, Conte, Ümit Çatalyürek, Vuduc, Riedy, Bora Uçar. “Programming Strategies for Irregular Algorithms on the Emu Chick.” ACM ToPC, to appear. https://arxiv.org/abs/1901.02775
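For readers unfamiliar with the metric, a small sketch of a queue-based BFS that counts examined edges and converts the count to MTEPS (millions of traversed edges per second). Counting every examined edge is one assumed convention; benchmarks differ on exactly which edges count, and the names here are illustrative only.

```python
from collections import deque


def bfs_count_edges(adjacency, source):
    """Return (level, edges_traversed) for a BFS from source.

    adjacency maps a vertex to an iterable of its out-neighbors.
    """
    level = {source: 0}
    edges_traversed = 0
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adjacency.get(u, ()):
            edges_traversed += 1        # assumed convention: every examined edge counts
            if v not in level:
                level[v] = level[u] + 1
                frontier.append(v)
    return level, edges_traversed


def mteps(edges_traversed, seconds):
    """Millions of traversed edges per second."""
    return edges_traversed / seconds / 1e6


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
    levels, edges = bfs_count_edges(graph, 0)
    print(levels, edges)
```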
Implications for a GraphBLAS Implementation
• We can be more flexible in data organization.
• Not tied to CSR / CSC / COO.
• NCDIMM: No cache line issues
• Stride between vertices, values can be arbitrary.
• Can incorporate more semantic information.
• Targeting “streaming” use.
• High rate of change in a massive graph.
• Linked list of blocks... (STINGER, HORNET)
• But must remember graphs live in a separate memory space.
• Gossamer side calls stay there.
• Stationary core calls must transfer input and output.
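A rough sketch of the "linked list of blocks" idea, loosely in the spirit of STINGER-style structures. The class names, the fixed block capacity, and the insert policy are assumptions for illustration and do not reproduce either library's layout.

```python
class EdgeBlock:
    """Fixed-capacity chunk of one vertex's neighbor list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.neighbors = []
        self.next = None


class BlockedGraph:
    """Per-vertex linked list of edge blocks; grows without rebuilding a CSR."""

    def __init__(self, num_vertices, block_capacity=4):
        self.heads = [None] * num_vertices
        self.block_capacity = block_capacity

    def insert_edge(self, u, v):
        # Reuse the first block with free space, else link a new block at the head.
        block = self.heads[u]
        while block is not None:
            if len(block.neighbors) < block.capacity:
                block.neighbors.append(v)
                return
            block = block.next
        new_block = EdgeBlock(self.block_capacity)
        new_block.neighbors.append(v)
        new_block.next = self.heads[u]
        self.heads[u] = new_block

    def neighbors(self, u):
        # Walking the chain is itself a pointer chase from block to block.
        block = self.heads[u]
        while block is not None:
            yield from block.neighbors
            block = block.next


if __name__ == "__main__":
    g = BlockedGraph(3, block_capacity=2)
    for v in (1, 2, 0, 1, 2):
        g.insert_edge(0, v)
    print(list(g.neighbors(0)))
```

An insert touches only one vertex's chain, which is the property that matters when the graph changes at a high rate.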
Experiences “Porting” Existing Apps & Bindings
Capabilities nice to have:
• Allocating memory to hold k entries w/o knowing the type
• Converting the support to a bool GxB_Matrix (T → bool)
• Eases operating on masks of different types
• Execution context: SC-GC, GC-GC, SC-SC (SC = stationary core, GC = Gossamer core)
• A sized blob type that is not a UDT
• Sometimes used to hold keys with no GB meaning
• Selects and ops with bools still useful
• Users want “iterators”
• Some uses are horrible to replace without a relational join-type operation
• Still coming up with more...
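To illustrate the "support to bool" item, a sketch that uses a plain dict of stored entries as a stand-in for a sparse matrix. This is not GraphBLAS API; `support_as_bool` and `select_with_mask` are hypothetical helpers. The point is that the support depends only on which entries are stored, not on their type or value.

```python
def support_as_bool(entries):
    """Map every stored entry to True, whatever its value type (T -> bool)."""
    return {index: True for index in entries}


def select_with_mask(entries, mask):
    """Keep the entries whose index is present and True in a boolean mask."""
    return {index: value for index, value in entries.items()
            if mask.get(index, False)}


if __name__ == "__main__":
    weights = {(0, 1): 2.5, (1, 2): 0.0, (2, 0): 7.0}
    labels = {(0, 1): "a", (2, 2): "b"}
    mask = support_as_bool(labels)          # mask built from a different type
    print(select_with_mask(weights, mask))
```

Note that the explicitly stored 0.0 at (1, 2) is still part of the support of `weights`; casting values to bool would lose it, which is why a structural conversion is the useful one.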
