isca22-feng-menda_for sparse transposition and dataflow.pptx

MeNDA: A Near-Memory Multi-way Merge
Solution for Sparse Transposition and Dataflows
Siying Feng*, Xin He*, Kuan-Yu Chen*, Liu Ke+,
Xuan Zhang+, David Blaauw*, Trevor Mudge*, Ronald Dreslinski*
*University of Michigan +Washington University in St. Louis

Sparse Linear Algebra is Everywhere
• Application domains: fluid dynamics, machine learning, circuit simulation, electromagnetics, structural engineering, robotics & kinetics, graph analytics, recommendation systems
• Kernels shown: sparse matrix-matrix multiplication (SpMM), sparse matrix-vector multiplication (SpMV), sparse gathering

Sparse Matrix Transposition: Definition
• Compressed storage formats (CSR/CSC)
• index array, value array, pointer array
• saves storage and avoids computation on zeros
• Sparse matrix transposition
• swap the row and column indices of elements
• equivalent to conversion between CSR and CSC
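Example: a 5x5 matrix with nonzeros a to j ("." marks a zero)
A =
  a . b . .
  . c . . d
  e . . . f
  . . g h .
  i . j . .
A in CSR / AT in CSC:
  Pointer: 0 2 4 6 8 10
  Index:   0 2 1 4 0 4 2 3 0 2
  Value:   a b c d e f g h i j
AT =
  a . e . i
  . c . . .
  b . . g j
  . . . h .
  . d f . .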
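The CSR/CSC equivalence above can be checked with a short Python sketch. The function name `csr_transpose` and its counting-sort approach are illustrative assumptions; this is a plain software baseline, not MeNDA's merge-based method.

```python
def csr_transpose(n_rows, n_cols, ptr, idx, val):
    """Transpose a CSR matrix with a counting sort over column indices."""
    # Count the nonzeros in each column of A (each row of A^T).
    t_ptr = [0] * (n_cols + 1)
    for c in idx:
        t_ptr[c + 1] += 1
    # A prefix sum turns the counts into the row pointers of A^T.
    for c in range(n_cols):
        t_ptr[c + 1] += t_ptr[c]
    # Scatter each element to its slot. Rows are visited in order, so the
    # indices inside each row of A^T come out sorted.
    nxt = list(t_ptr[:-1])
    t_idx = [0] * len(idx)
    t_val = [None] * len(val)
    for r in range(n_rows):
        for k in range(ptr[r], ptr[r + 1]):
            dst = nxt[idx[k]]
            t_idx[dst] = r
            t_val[dst] = val[k]
            nxt[idx[k]] += 1
    return t_ptr, t_idx, t_val


# The 5x5 example from the slide.
ptr = [0, 2, 4, 6, 8, 10]
idx = [0, 2, 1, 4, 0, 4, 2, 3, 0, 2]
val = list("abcdefghij")
t_ptr, t_idx, t_val = csr_transpose(5, 5, ptr, idx, val)
```

The returned arrays are the CSR form of AT, which is the same data as the CSC form of A.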

Sparse Matrix Transposition: A Growing Bottleneck
• Sparse matrix transposition is an essential building block of sparse linear algebra
• Misconception: transposition overhead is minor and easily amortized, as it used to be
• Reality: exploding dataset sizes have made the transposition overhead non-negligible
• the performance of graph processing has improved greatly
• little effort has been spent on sparse matrix transposition
Figure. Execution time breakdown of SSSP on a recent graph framework
Runtime transposition using a state-of-the-art implementation
can introduce a 126% performance overhead

Sparse Matrix Transposition: Memory-bound Nature
• Sparse matrix transposition is memory bandwidth bound
• By lifting the roofline by 8×, the throughput can improve by 4.1-5.2×
• Sparse matrix transposition has low computational intensity
Figure. Roofline model of MergeTrans (series: mergeTrans, mergeTrans x8; annotations: 8x roofline lift, 5.2x throughput gain)
The high memory requirement and low arithmetic intensity make
sparse matrix transposition a promising candidate for NMP
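A minimal sketch of the roofline reasoning, assuming the standard model in which attainable throughput is the smaller of the compute roof and bandwidth times arithmetic intensity. The machine and kernel numbers are hypothetical and chosen only to place the kernel in the memory-bound region; they are not measurements from the talk.

```python
def roofline(peak, bandwidth, intensity):
    """Attainable throughput under the roofline model."""
    return min(peak, bandwidth * intensity)


# Hypothetical values, for illustration only.
peak = 100.0      # compute roof (Gop/s)
bw = 20.0         # memory bandwidth (GB/s)
intensity = 0.1   # operations per byte, low as for sparse transposition

base = roofline(peak, bw, intensity)
lifted = roofline(peak, 8 * bw, intensity)
ideal_gain = lifted / base   # upper bound on the gain from 8x more bandwidth
```

With these numbers the kernel stays memory bound after the lift, so the model's upper bound on the gain equals the bandwidth increase. The 4.1-5.2× reported on the slide sits below that bound.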

MeNDA: A Scalable Near-DRAM Architecture
• Processing units (PUs) are deployed in DIMM buffer chips
• limit modification to DIMM hardware to buffer chips
• PUs are deployed beside each rank
• exploit rank-level and DIMM-level parallelism
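Figure. The host memory controller (MC) connects to the DIMMs; each DIMM's buffer device holds one PU per rank, next to that rank's DRAM chips.
• Host access: effective BW = channel BW = 20 GB/s
• With PUs beside each rank: effective BW = rank BW × number of ranks = 80 GB/s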

MeNDA: Merge-sort Algorithm
• Merge sort is chosen for
• wide application in sparse linear algebra
• spatial locality
• An L-leaf merge tree merges L streams
• ⌈log_L N⌉ iterations to merge N streams
• Input and output are in CSR/CSC
• Intermediate data are in COO
• takes up less storage
• easy to decode
Figure panels: merge sort (Iteration 0) uses Leaf 0 to Leaf 3; merge sort (Iteration 1) uses Leaf 0 and Leaf 1 to merge the remaining streams.
Figure. Transpose a 5x5 matrix with a 4-leaf merge tree
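The iteration count in the figure can be reproduced with a software sketch of the multi-way merge. It assumes each CSR row is turned into a COO stream of (column, row, value) tuples and that `heapq.merge` stands in for the hardware merge tree; `merge_transpose` and the `leaves` parameter are hypothetical names.

```python
import heapq


def merge_transpose(ptr, idx, val, leaves=4):
    """Transpose CSR by repeatedly merging up to `leaves` sorted streams."""
    # One COO stream per row: (col, row, value), already sorted by col.
    streams = [
        [(idx[k], r, val[k]) for k in range(ptr[r], ptr[r + 1])]
        for r in range(len(ptr) - 1)
    ]
    iterations = 0
    while len(streams) > 1:
        # Each pass models one round of the L-leaf merge tree: every group
        # of up to `leaves` streams is merged into a single sorted stream.
        streams = [
            list(heapq.merge(*streams[i:i + leaves]))
            for i in range(0, len(streams), leaves)
        ]
        iterations += 1
    return (streams[0] if streams else []), iterations


# The 5x5 example: five row streams, 4-leaf tree.
ptr = [0, 2, 4, 6, 8, 10]
idx = [0, 2, 1, 4, 0, 4, 2, 3, 0, 2]
val = list("abcdefghij")
merged, iterations = merge_transpose(ptr, idx, val, leaves=4)
```

The merged stream is ordered by column and then by row, which is the row-major order of AT. Five streams with a 4-leaf tree need two iterations, matching the two panels of the figure.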

MeNDA: Processing Unit Microarchitecture
Figure. PU block diagram: merge tree of PEs, prefetch buffer, controller, output buffer, request queue (ReadQ, WriteQ), and memory interface unit (request scheduler, address decoder, CMD generator) driving DDR C/A and DDR DQ. One PU sits beside each rank in the DIMM buffer device.

hardware merge tree
• FIFOs between PEs

output buffer
• connects to root PE
• sends store requests in memory blocks

prefetch buffers
• connect to leaf PEs
• send requests for matrix rows

Controller
• FSM
• assigns pointers to prefetch buffers

request queue
• holds outstanding requests
• separate queues for reads and writes

memory interface unit
• mimics memory controller
• schedules requests
• translates addresses
• generates DRAM commands

MeNDA: Dataflow
Figure. Dataflow through the PU block diagram, with steps numbered 1 to 8. Step labels, in the order listed: load rowPtr, receive rowPtr, send ptr, load rows, receive rows, send rows, send merged rows.

MeNDA: Input Co-location and Workload Balancing
• Each rank transposes a horizontal matrix partition
• eliminate expensive communication across ranks/DIMMs
• use original CSR format without preprocessing
• allow easy matrix traversing after transposition
• use techniques proposed in prior work [1]
• Workload balancing: NNZ-based partitioning
• partition index/value with page alignment
• assign row pointers to corresponding ranks
• duplicate row pointers used by 2 ranks
• Host writes start addresses to MMRs
• Partitioning is automated and hidden from programmers
Figure. RowPtr, Indices, and Values arrays partitioned across Rank 0 to Rank 3, with a row-pointer page duplicated across ranks.
[1] Cho et al., “Near Data Acceleration with Concurrent Host Access”, ISCA 2020.
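A simplified sketch of NNZ-based partitioning, assuming cuts are placed at row boundaries so that each rank receives a similar number of nonzeros. It ignores the page alignment and row-pointer duplication listed above; `nnz_partition` and the sample row pointer array are hypothetical.

```python
from bisect import bisect_left


def nnz_partition(ptr, n_ranks):
    """Split rows into contiguous blocks with roughly equal nonzero counts."""
    nnz = ptr[-1]
    n_rows = len(ptr) - 1
    bounds = [0]
    for r in range(1, n_ranks):
        target = nnz * r / n_ranks
        # First row boundary whose prefix nonzero count reaches the target.
        cut = bisect_left(ptr, target)
        # Keep the boundaries monotonic and inside the matrix.
        bounds.append(min(max(cut, bounds[-1]), n_rows))
    bounds.append(n_rows)
    return [(bounds[i], bounds[i + 1]) for i in range(n_ranks)]


# Hypothetical skewed matrix: row 0 holds half of the 16 nonzeros.
ptr = [0, 8, 9, 10, 11, 12, 16]
parts = nnz_partition(ptr, 2)
nnz_per_rank = [ptr[end] - ptr[start] for start, end in parts]
```

Splitting this matrix by row count (three rows each) would give 10 and 6 nonzeros; splitting by NNZ gives one row to the first rank and five to the second, with 8 nonzeros each.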

MeNDA: Adaptation to Support SpMV
Additional units
• FP adder
• reduces elements with the same rowID
• 16-way FP multiplier
• Delay buffer
• cacheline size
• holds columns waiting for vector elements
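How the added units fit the merge dataflow can be sketched in software, assuming A is stored in CSC: each column is scaled by its vector element (multiplier), the scaled columns are merged by rowID (merge tree), and neighbours with the same rowID are summed (adder). The function `spmv_merge` is a hypothetical illustration and does not model the delay buffer or any timing.

```python
import heapq
from itertools import groupby


def spmv_merge(col_ptr, row_idx, val, x, n_rows):
    """y = A @ x for A in CSC, via multiply, multi-way merge, and reduce."""
    # Multiplier stage: scale each column by its vector element.
    streams = [
        [(row_idx[k], val[k] * x[j]) for k in range(col_ptr[j], col_ptr[j + 1])]
        for j in range(len(col_ptr) - 1)
    ]
    # Merge stage: order all partial products by rowID.
    merged = heapq.merge(*streams, key=lambda t: t[0])
    # Adder stage: reduce consecutive elements that share a rowID.
    y = [0.0] * n_rows
    for row, group in groupby(merged, key=lambda t: t[0]):
        y[row] = sum(prod for _, prod in group)
    return y


# A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSC, x = [1, 2, 3].
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
val = [1.0, 4.0, 3.0, 2.0, 5.0]
y = spmv_merge(col_ptr, row_idx, val, [1.0, 2.0, 3.0], 3)
```

Because row indices within each CSC column are sorted, every stream entering the merge is already ordered by rowID, which is what lets the same merge tree be reused for SpMV.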

Methodology: Architecture Modeling
• Performance modeling
• custom cycle-accurate simulator
• memory interface connected to Ramulator
• 4 channels x 2 ranks/channel DDR4_2400R
• Area and energy modeling
• RTL synthesis of PU in 40 nm
• Baselines
• transposition on CPU: mergeTrans/scanTrans[1]
• transposition on GPU: cuSPARSE
• SpMV accelerator: Sadi et al. [2]
• Dataset
• synthetic uniform and power-law matrices
• real-world matrices from SuiteSparse Matrix Collection
Figure. In-memory accelerator simulation flow. Labeled blocks: input matrix file, matrix partitioning engine, per-rank data generator, cycle-accurate simulator and single-rank Ramulator (Rank 0 to Rank N), MAX (transposition time), power model (power), stats.
[1] Wang et al., “Parallel Transposition of Sparse Data Structures”, ICS 2016.
[2] Sadi et al., “Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable Multi-Way Merge Parallelization”, MICRO 2019.

Evaluation: Comparison vs. CPU/GPU
• Performance benefits come from both reduction in memory traffic and
improvement in memory bandwidth utilization.
Figure. Speedup of MeNDA over scanTrans, mergeTrans, and cuSPARSE (y-axis: speedup, 0 to 60) on ASIC_320k, amazon, bcsstk32, language, mac_econ, parabolic, rajat21, sme3Dc, Slashdot, stomach, transient, twotone, venkat01, webbase, and wiki-Talk, plus the geomean.
• 11.2x less memory traffic
• 2.7x higher BW utilization
MeNDA achieves an average speedup of 19x, 12x and 8x over
scanTrans, mergeTrans, and cuSPARSE, respectively.

Evaluation: Integration with CoSPARSE
• Single Source Shortest Path on amazon (N = 262k, NNZ = 1.23M).
• MeNDA decreases transposition overhead from 126% to 5%.
• allowing CoSPARSE to store only one copy of graph and support larger graphs
• having negligible impact on graph execution time with changed data mapping
• MeNDA consumes 76.8 mW at 800 MHz and occupies 7.1 mm².
[1] Feng et al., “CoSPARSE: A Software and Hardware Reconfigurable SpMV Framework for Graph Analytics”, DAC 2021.
[2] Pal et al., “Outerspace: An outer product based sparse matrix multiplication accelerator”, HPCA 2018.
Figure. Execution time breakdown of SSSP on CoSPARSE [1, 2] (components: sparse iterations, dense iterations, transposition)

Evaluation: Comparison vs. SpMV accelerator
• Sadi et al. [1] cannot perform transposition without introducing frequent
synchronization or large on-chip buffers.
• MeNDA achieves a similar iso-bandwidth throughput at 0.043 GTEPS/(GB/s)
compared to Sadi et al. at 0.049 GTEPS/(GB/s).
• MeNDA shows an average efficiency gain of 3.8×.
[1] Sadi et al., “Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable Multi-Way Merge Parallelization”, MICRO 2019.
MeNDA focuses on lightweight PUs targeting commodity DIMMs,
which have better capacity scalability than HBM devices.

Conclusion
• Sparse matrix transposition is a promising candidate for NMP
• MeNDA is a near-DRAM solution to multi-way merge for sparse dataflows
• MeNDA presents significant gains over existing solutions