ISCA 2022 slide deck with speaker notes
1. MeNDA: A Near-Memory Multi-way Merge
Solution for Sparse Transposition and Dataflows
Siying Feng*, Xin He*, Kuan-Yu Chen*, Liu Ke+,
Xuan Zhang+, David Blaauw*, Trevor Mudge*, Ronald Dreslinski*
*University of Michigan +Washington University in St. Louis
2. Sparse Linear Algebra is Everywhere
Fluid Dynamics, Machine Learning, Circuit Simulation, Electromagnetics, Structural Engineering, Robotics & Kinetics, Graph Analytics, Recommendation Systems
Common kernels: Sparse Matrix-Matrix Multiplication (SpMM), Sparse Matrix-Vector Multiplication (SpMV), Sparse Gathering
3. Sparse Matrix Transposition: Definition
• Compressed storage formats (CSR/CSC)
• index array, value array, pointer array
• saves storage and avoids computation on zeros
• Sparse matrix transposition
• swap the row and column indices of elements
• equivalent to conversion between CSR and CSC
Figure. A 5×5 sparse matrix A (values a–j) and its transpose AT. A in CSR: pointer = [0 2 4 6 8 10], index = [0 2 1 4 0 4 2 3 0 2], value = [a b c d e f g h i j]; the same arrays read as AT in CSC.
4. Sparse Matrix Transposition: A Growing Bottleneck
• Sparse matrix transposition is an essential building block of sparse linear algebra
• Misconception: transposition overhead used to be minor and easily amortized
• Reality: exploding dataset sizes have made the transposition overhead non-negligible
• the performance of graph processing has been greatly improved
• few efforts have been devoted to sparse matrix transposition
Figure. Execution time breakdown of SSSP on a recent graph framework
Runtime transposition using a state-of-the-art implementation
can introduce a 126% performance overhead
5. Sparse Matrix Transposition: Memory-bound Nature
• Sparse matrix transposition is memory bandwidth bound
• By lifting the roofline by 8×, the throughput can improve by 4.1–5.2×
• Sparse matrix transposition has low computational intensity
Figure. Roofline Model of MergeTrans
The high memory requirement and low arithmetic intensity make
sparse matrix transposition a promising candidate for NMP
6. MeNDA: A Scalable Near-DRAM Architecture
• Processing units (PUs) are deployed in DIMM buffer chips
• limit modification to DIMM hardware to buffer chips
• PUs are deployed beside each rank
• exploit rank-level and DIMM-level parallelism
Figure. Baseline system, where the host memory controller reaches the DIMMs over the channel (effective BW = channel BW = 20 GB/s), vs. MeNDA, with a PU beside each rank in the DIMM buffer device (effective BW = rank BW × #ranks = 80 GB/s).
7. MeNDA: Merge-sort Algorithm
• Merge sort is chosen for
• wide application in sparse linear algebra
• spatial locality
• An L-leaf merge tree merges L streams
• ⌈log_L N⌉ iterations to merge N streams
• Input and output are in CSR/CSC
• Intermediate data are in COO
• takes up less storage
• easy to decode
Figure. Transposing a 5×5 matrix with a 4-leaf merge tree: iteration 0 merges rows 0–3 (leaves 0–3) and converts the remaining row; iteration 1 merges the two sorted streams into the final result.
8. MeNDA: Processing Unit Microarchitecture
Figure. PU microarchitecture: a merge tree of PEs with a controller, prefetch buffers, and an output buffer; a request queue (ReadQ/WriteQ); and a memory interface unit (request scheduler, address decoder, CMD generator) driving the DDR C/A and DQ buses. PUs sit in the DIMM buffer device beside each rank.
9. MeNDA: Processing Unit Microarchitecture
hardware merge tree
• FIFOs between PEs
10. MeNDA: Processing Unit Microarchitecture
output buffer
• connects to the root PE
• sends store requests in cacheline-sized memory blocks
11. MeNDA: Processing Unit Microarchitecture
prefetch buffers
• connect to the leaf PEs
• send load requests for matrix rows
12. MeNDA: Processing Unit Microarchitecture
controller
• FSM
• assigns row pointers to the prefetch buffers
13. MeNDA: Processing Unit Microarchitecture
request queue
• holds outstanding requests
• separate queues for reads and writes
14. MeNDA: Processing Unit Microarchitecture
memory interface unit
• mimics a memory controller
• schedules requests
• translates addresses
• generates DRAM commands
16. MeNDA: Input Co-location and Workload Balancing
• Each rank transposes a horizontal matrix partition
• eliminate expensive communication across ranks/DIMMs
• use original CSR format without preprocessing
• allow easy matrix traversal after transposition
• use techniques proposed in prior work [1]
• Workload balancing: NNZ-based partitioning
• partition index/value with page alignment
• assign row pointers to corresponding ranks
• duplicate row-pointer pages shared by two ranks
• Host writes start addresses to memory-mapped registers (MMRs)
• Partitioning is automated and hidden from programmers
Figure. NNZ-based partitioning of the RowPtr, Indices, and Values arrays across ranks 0–3; a page needed by two ranks is duplicated across them.
[1] Cho et al., “Near Data Acceleration with Concurrent Host Access ”, ISCA 2020.
17. MeNDA: Adaptation to Support SpMV
additional units
• FP adder: reduces elements with the same row ID
• 16-way FP multiplier
• delay buffer (cacheline-sized): holds columns waiting for vector elements
18. Methodology: Architecture Modeling
• Performance modeling
• custom cycle-accurate simulator
• memory interface connected to Ramulator
• 4 channels x 2 ranks/channel DDR4_2400R
• Area and energy modeling
• RTL synthesis of PU in 40 nm
• Baselines
• transposition on CPU: mergeTrans/scanTrans[1]
• transposition on GPU: cuSPARSE
• SpMV accelerator: Sadi et al. [2]
• Dataset
• synthetic uniform and power-law matrices
• real-world matrices from SuiteSparse Matrix Collection
Figure. Simulation framework: per-rank cycle-accurate simulators with data generators, each connected to a single-rank Ramulator instance, fed by a matrix partitioning engine; a power model combines the per-rank stats to report transposition time and power.
[1] Wang et al., “Parallel Transposition of Sparse Data Structures”, ICS 2016.
[2] Sadi et al., “Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable Multi-Way Merge Parallelization”, MICRO 2019.
19. Evaluation: Comparison vs. CPU/GPU
• Performance benefits come from both reduction in memory traffic and
improvement in memory bandwidth utilization.
Figure. Speedup of MeNDA over scanTrans, mergeTrans (CPU), and cuSPARSE (GPU) on 15 real-world matrices (geomean 19×, 12×, and 8×, respectively); on wiki-Talk, MeNDA generates 11.2× less memory traffic than mergeTrans with 2.7× higher bandwidth utilization.
MeNDA achieves an average speedup of 19x, 12x and 8x over
scanTrans, mergeTrans, and cuSPARSE, respectively.
20. Evaluation: Integration with CoSPARSE
• Single Source Shortest Path on amazon (N = 262k, NNZ = 1.23M).
• MeNDA decreases transposition overhead from 126% to 5%.
• allowing CoSPARSE to store only one copy of the graph and support larger graphs
• having negligible impact on graph execution time with the changed data mapping
• A MeNDA PU occupies 7.1 mm² in 40 nm and consumes 78.6 mW at 800 MHz
[1] Feng et al., “CoSPARSE: A Software and Hardware Reconfigurable SpMV Framework for Graph Analytics”, DAC 2021.
[2] Pal et al., “Outerspace: An outer product based sparse matrix multiplication accelerator”, HPCA 2018.
Figure. Execution time breakdown (sparse iterations, dense iterations, transposition) of SSSP on CoSPARSE [1, 2]
21. Evaluation: Comparison vs. SpMV accelerator
• Sadi et al. [1] cannot perform transposition without introducing frequent
synchronization or large on-chip buffers.
• MeNDA achieves an iso-bandwidth throughput of 0.043 GTEPS/(GB/s), similar to the 0.049 GTEPS/(GB/s) of Sadi et al.
• MeNDA shows an average efficiency gain of 3.8×.
[1] Sadi et al., “Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable Multi-Way Merge Parallelization”, MICRO 2019.
MeNDA focuses on lightweight PUs targeting commodity DIMMs,
which have better capacity scalability than HBM devices.
22. Conclusion
• Sparse matrix transposition is a promising candidate for NMP
• MeNDA is a near-DRAM solution to multi-way merge for sparse dataflows
• MeNDA presents significant gains over existing solutions
Sparse linear algebra is prevalent in a wide variety of domains, such as machine learning, graph analytics, and scientific computing. It is notorious for its irregular memory access patterns, so many near-memory hardware accelerators have been proposed recently for common sparse kernels, such as sparse matrix-vector multiplication and sparse gathering. However, not all important sparse kernels have received enough attention, and sparse matrix transposition is one of them.
Due to their large sizes and high sparsity, sparse matrices are often stored in compressed formats to save storage and avoid computation on zero elements. Commonly used formats are compressed sparse row (CSR) and compressed sparse column (CSC), which store a sparse matrix in three arrays: the index array and value array store the row/column index and value of each NZ, respectively, and the pointer array holds the start pointer of each row/column. Sparse matrix transposition swaps the row index and column index of each NZ. Therefore, transposing a sparse matrix is in essence equivalent to converting it from CSR to CSC, or vice versa. For simplicity, we will use converting a matrix from CSR to CSC to denote general sparse matrix transposition from this point on.
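The conversion just described can be sketched in a few lines. This is the textbook count-and-scatter approach (in the spirit of scanTrans, not MeNDA's merge dataflow); the function and variable names are ours, for illustration only.

```python
def csr_to_csc(n_rows, n_cols, ptr, idx, val):
    """Transpose a sparse matrix by converting CSR to CSC.
    Textbook count-and-scatter sketch (scanTrans-style, not MeNDA's
    merge dataflow); all names here are ours."""
    nnz = len(idx)
    # Pass 1: count NZs per column to build the CSC pointer array.
    col_ptr = [0] * (n_cols + 1)
    for c in idx:
        col_ptr[c + 1] += 1
    for c in range(n_cols):
        col_ptr[c + 1] += col_ptr[c]
    # Pass 2: scatter each NZ into its column, recording its row index.
    row_idx = [0] * nnz
    out_val = [None] * nnz
    fill = col_ptr[:-1]  # next free slot per column (slice = copy)
    for r in range(n_rows):
        for k in range(ptr[r], ptr[r + 1]):
            c = idx[k]
            row_idx[fill[c]] = r
            out_val[fill[c]] = val[k]
            fill[c] += 1
    return col_ptr, row_idx, out_val
```

On the 5×5 example from the slides (with values 1–10 standing in for a–j), this reproduces the transposed pointer array [0, 3, 4, 7, 8, 10].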
Sparse matrix transposition is an essential building block in both the processing and pre-processing stages of sparse linear algebra applications. For example, many recent graph frameworks require input graphs in both CSR and CSC to support dynamic dataflow reconfiguration.
A common misconception regarding the runtime transposition overhead is that it is minor compared to the end-to-end execution time and can be easily amortized.
The figure here shows the breakdown of the execution time of running SSSP on a recent graph framework. The top bar shows the misconception we just mentioned.
However, the middle bar is the reality. Recent breakthroughs have significantly improved the performance of graph processing. Consequently, runtime transposition using a state-of-the-art implementation can introduce a 126% performance overhead.
To understand the bottleneck of sparse matrix transposition, we performed roofline analysis on mergeTrans, a recently proposed sparse matrix transposition implementation. The roofline model shows that sparse matrix transposition is memory bandwidth bound because the data points sit close to the "roof", the red and blue lines marking the maximum throughput achievable when the system memory bandwidth is fully utilized. If we increase the system memory bandwidth by 8×, the throughput improves by up to 5.2×. This shows the potential benefit of applying NMP to sparse matrix transposition, because NMP exposes the high internal memory bandwidth of DRAM devices. Meanwhile, transposition has low computational intensity because there are no floating-point operations. The high memory requirement and low arithmetic intensity both make sparse matrix transposition a promising candidate for NMP.
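As a quick sanity check on the roofline argument: attainable throughput is min(peak compute, arithmetic intensity × memory bandwidth), so at the low intensities of transposition the bandwidth term dominates and lifting the bandwidth lifts throughput almost linearly. A tiny illustrative helper (the numbers below are made up, not the paper's):

```python
def roofline_throughput(intensity, peak_compute, mem_bw):
    """Attainable throughput (ops/s) under the roofline model:
    min(peak compute, arithmetic intensity * memory bandwidth).
    Inputs here are illustrative, not the paper's measurements."""
    return min(peak_compute, intensity * mem_bw)
```

With an assumed intensity of 0.25 ops/byte and 20 GB/s, attainable throughput is 5 Gops/s; an 8× bandwidth lift gives 40 Gops/s, the full 8×. Real implementations fall short of the ideal lift (4.1–5.2× here) for reasons the roofline does not capture.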
In this work, we propose MeNDA, a scalable near-memory architecture that takes advantage of the high internal memory bandwidth of DRAM devices.
Inspired by prior works, we deploy custom processing units (PUs) in the buffer chips of DIMMs to minimize the hardware modifications. The PUs are deployed beside each rank to exploit rank-level and DIMM-level parallelism. Each PU can access its corresponding memory rank in parallel without going through the off-chip memory interface, so the effective memory bandwidth becomes the sum of all rank bandwidths. MeNDA is highly scalable because a higher throughput can be achieved by populating a memory channel with multiple MeNDA-enabled DIMMs.
In this work, we adopted the merge sort algorithm not only because merge sort presents better spatial locality but also because it is widely used in sparse linear algebra. Here we show the dataflow of transposing a 5×5 matrix using a 4-leaf merge tree, which merges 4 sorted streams into a single stream.
In iteration 0, the first 4 rows are merged and the last row is converted. Then in iteration 1, the two sorted streams are merged into the final result. The input and output are stored in CSR or CSC, while the intermediate data are stored in COO. The COO format stores the row index, column index, and value of each NZ in three separate arrays. Because the intermediate data may contain numerous empty rows/columns, COO not only is easy to decode but also tends to take up less storage than CSR/CSC.
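The two-iteration dataflow above can be mimicked in software. In this hedged sketch (names ours), each CSR row is already sorted by column index, so repeatedly merging groups of `leaves` streams by (column, row) key yields the transpose in COO order after ⌈log_leaves N⌉ iterations; `heapq.merge` stands in for the hardware merge tree.

```python
import heapq

def merge_transpose(ptr, idx, val, leaves=4):
    """Merge-sort-based transposition sketch (function name ours).
    Each CSR row becomes a stream of (col, row, val) triples sorted
    by column; merging groups of `leaves` streams by (col, row) key
    produces the transpose in COO order."""
    streams = [[(idx[k], r, val[k]) for k in range(ptr[r], ptr[r + 1])]
               for r in range(len(ptr) - 1)]
    iterations = 0
    while len(streams) > 1:
        # One pass of the L-leaf merge tree over all current streams.
        streams = [list(heapq.merge(*streams[i:i + leaves]))
                   for i in range(0, len(streams), leaves)]
        iterations += 1
    return (streams[0] if streams else []), iterations
```

On the slide's 5×5 matrix with a 4-leaf tree this takes exactly 2 iterations, matching the figure.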
Now we will introduce the microarchitecture of the PU. The PU sits within the buffer chips and accesses its local memory ranks by sending commands through the command/address bus and receiving data through the data bus. During execution, a PU transposes the matrix stored in DRAM and writes results directly back to DRAM with no need for host communication or assistance.
The key component of the PU is a hardware merge tree. The leaf PEs are connected to 2 prefetch buffers each, while the other PEs are connected to 2 child PEs through FIFOs. Ideally, we want the merge tree to be as wide as possible, because the number of leaves directly determines the number of transposition iterations, which is proportional to the total memory traffic. However, the area and power of the merge tree grow quadratically with the number of leaves, and we have a limited area and power budget within the DIMM buffer chips. So in this work, we use 1024-leaf merge trees, which can transpose a wide range of real-world sparse matrices within 2 iterations while having a moderate power and area overhead. The root PE is connected to an output buffer, which sends store requests at cacheline granularity.
Prefetch buffers are in charge of sending memory load requests to feed the leaves with correct data. To reduce power consumption, the prefetch buffers are implemented as multi-bank SRAM.
The controller is an FSM that assigns matrix rows/columns to each prefetch buffer.
All memory requests are sent to a request queue with separate queues for loads and stores. To avoid duplicate load requests to the same cacheline, which happen when multiple rows reside in a single cacheline, the read queue is implemented as content-addressable memory (CAM) to allow parallel request merging. The memory response is broadcast to all prefetch buffers, so we do not need to keep track of the requesters.
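The CAM-based request merging can be modeled with a toy read queue. Everything below is our illustration (the 64-byte line size is an assumption, and this is not MeNDA's RTL): a load whose cacheline is already outstanding is absorbed rather than reissued, which is safe precisely because responses are broadcast to all prefetch buffers.

```python
class ReadQueue:
    """Toy model of the CAM-based read queue (our sketch, not RTL).
    A load to a cacheline that is already in flight is merged
    instead of reissued; responses are broadcast, so individual
    requesters need not be tracked."""
    LINE = 64  # bytes per cacheline (assumed)

    def __init__(self):
        self.outstanding = set()  # cachelines with an in-flight load
        self.issued = 0           # loads actually sent to memory

    def load(self, addr):
        """Return True if a new memory request was issued."""
        line = addr // self.LINE
        if line in self.outstanding:  # CAM hit: merge the request
            return False
        self.outstanding.add(line)
        self.issued += 1
        return True

    def respond(self, addr):
        """Memory response arrives; the line is no longer in flight."""
        self.outstanding.discard(addr // self.LINE)
```

Two rows falling in the same 64-byte line thus cost a single DRAM read instead of two.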
The requests stored in the request queue are processed by the memory interface unit, which mimics a simplified memory controller. The request scheduler selects requests based on a first-come-first-serve policy that prioritizes requests ready to launch and DRAM row hits. The address decoder then translates the incoming physical address to a DRAM address, and finally the command generator generates DRAM commands.
The overall dataflow of a PU goes like this. The controller keeps sending requests for the row pointers to the request queue. After the data come back, as long as the prefetch buffers have availability, the controller sends the start and end addresses of the rows sequentially to the prefetch buffers. The prefetch buffers send out load requests for matrix rows or intermediate data as long as they have enough vacancies, and keep feeding the merge tree as long as they have valid data. The merged results are sent from the root PE to the output buffer, which generates store requests as soon as there are enough results to form a cacheline.
To achieve maximum throughput, we partition data across different ranks. To avoid expensive communication between rank PUs, we let each PU transpose a horizontal partition of the matrix and keep all the input operands needed by the PU in its local memory rank. In this way, preprocessing is not needed, and it is also easy to locate an NZ after transposition. The index and value arrays are partitioned by NNZ with page alignment for workload balancing. The row pointers required by a partition are assigned to the corresponding rank, and pages needed by two ranks are duplicated. After partitioning, the host writes the physical addresses to the memory-mapped registers so that the PUs can access them during execution. The partitioning is automated and handled by the host during data allocation. Because all the changes are made in memory mapping, they are completely hidden from the programmer.
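One possible way to express the NNZ-based, page-aligned partitioning in code (our sketch; `page_elems`, the rounding rule, and the return format are assumptions, not the paper's exact scheme):

```python
import bisect

def partition_by_nnz(ptr, ranks=4, page_elems=1024):
    """NNZ-balanced, page-aligned partitioning sketch (names ours).
    `page_elems` = index/value elements per DRAM page (assumed).
    Returns (nz_lo, nz_hi, row_lo, row_hi) per rank: the slice of
    the index/value arrays it owns and the half-open range of rows
    overlapping that slice."""
    nnz = ptr[-1]
    cuts = [0]
    for r in range(1, ranks):
        target = r * nnz // ranks
        # Round each cut to a page boundary in the index/value arrays.
        cuts.append((target + page_elems // 2) // page_elems * page_elems)
    cuts.append(nnz)
    parts = []
    for r in range(ranks):
        row_lo = bisect.bisect_right(ptr, cuts[r]) - 1
        row_hi = bisect.bisect_left(ptr, cuts[r + 1])
        parts.append((cuts[r], cuts[r + 1], row_lo, row_hi))
    return parts
```

Note how a row whose non-zeros straddle a cut appears in both neighboring row ranges, which is exactly why a row-pointer page shared by two ranks must be duplicated.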
Because of the wide application of merge sort, MeNDA can be adapted to support other sparse linear algebra kernels, such as outer-product SpMV. The merge phase of outer-product SpMV has the same dataflow as sparse matrix transposition, and thus can be implemented directly on MeNDA. However, transposition does not involve floating-point computations, so floating-point adders and multipliers are added. To perform the outer product, we store the input matrix in a partitioned CSC format, which matches the format of the transposed matrix generated by our work.
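A minimal software model of that adaptation (ours, not the paper's code): scale each CSC column by its vector element, then merge the scaled columns by row index and accumulate equal row IDs, the roles played in hardware by the added multipliers and the FP adder.

```python
import heapq

def spmv_outer(col_ptr, row_idx, val, x):
    """Outer-product SpMV sketch reusing the merge dataflow (ours).
    Each CSC column scaled by its vector element is a stream sorted
    by row index; merging the streams and accumulating equal row IDs
    gives y = A @ x as a {row: value} dict."""
    n_cols = len(col_ptr) - 1
    streams = [[(row_idx[k], val[k] * x[j])              # multiplier
                for k in range(col_ptr[j], col_ptr[j + 1])]
               for j in range(n_cols)]
    y = {}
    for r, v in heapq.merge(*streams):                   # merge tree
        y[r] = y.get(r, 0.0) + v                         # FP adder
    return y
```

For the dense 2×2 matrix [[1, 2], [3, 4]] stored in CSC and x = [2, 3], this yields y = [8, 18].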
To model the performance of MeNDA, we built a cycle-accurate simulator and connected the memory interface to Ramulator. The area and power are estimated by synthesizing an RTL model of the PU. For sparse matrix transposition, we compare to two prior CPU implementations and the cuSPARSE library on GPU. For SpMV, we compare to an accelerator equipped with HBM. Finally, we used both synthetic and real-world matrices in our evaluation.
The figure here shows the speedup of MeNDA over scanTrans and mergeTrans on CPU and cuSPARSE on GPU. The red line represents a speedup of 1. The speedup of MeNDA comes from both the reduction in memory traffic and the improvement in memory bandwidth utilization. Taking wiki-Talk as an example, compared to mergeTrans, MeNDA reduces the memory traffic by 11.2x while exhibiting 2.7x higher bandwidth utilization. Overall, MeNDA achieves an average speedup of 19.1x, 12.0x and 7.7x over scanTrans, mergeTrans, and cuSPARSE, respectively.
To understand the benefit of MeNDA on end-to-end workloads, we integrated MeNDA with CoSPARSE, a recent graph analytics framework.
The figure here shows the execution time of SSSP on CoSPARSE for the graph amazon with 2 copies of the input graph, with runtime transposition using mergeTrans, and with runtime transposition using MeNDA. Although integrating MeNDA requires changes in memory mapping, the changes have negligible impact on the original graph execution. Sparse matrix transposition happens each time CoSPARSE switches from the dense dataflow to the sparse dataflow or the opposite. As shown by the red bars, using MeNDA for runtime graph transposition decreases the transposition overhead from 126% to 5%. As dataset sizes keep growing, MeNDA can prevent designs like CoSPARSE from expensive disk accesses when the DRAM devices can only fit a single copy of the graph, with minor overhead. Synthesis shows that a MeNDA PU takes up 7.1 mm² in 40 nm and consumes 78.6 mW at 800 MHz.
For a fair comparison in SpMV, we use iso-bandwidth throughput as the performance metric. Compared to the SpMV accelerator proposed by Sadi et al., MeNDA achieves a similar average iso-bandwidth throughput of 0.043 GTEPS per unit bandwidth and an average improvement of 3.8× in GTEPS/W, as shown in this figure. While the SpMV accelerator is a monolithic design with 4 HBM stacks, MeNDA features lightweight PUs that can be integrated into commodity DRAM devices, which have better capacity scalability than HBM.
We also did a thread scaling analysis. The figure here shows the memory bandwidth utilized by mergeTrans when we increase the number of threads. While the peak bandwidth is at 76.8 GB/s, the utilized memory bandwidth starts to saturate at 16 threads and reaches the maximum at 64 threads at 59.6 GB/s. In practice, there is little performance benefit beyond 16 threads and further saturating the bandwidth is undesirable because of the significantly increased memory latency. What sparse matrix transposition will most benefit from is an approach that reduces memory latency and relieves the contention at the off-chip memory interface.
There are some changes in the dataflow for SpMV. The controller sends load requests for the vector elements along with the requests for the column pointers. Because all columns are eventually merged into a single vector, we no longer need to track the column indices of NZs. Instead, the prefetch buffer storage intended for the column indices now holds the vector elements.
When the memory interface returns the matrix values, the prefetch buffers that need the matrix values send their vector elements to the multiplier. If they don’t have the vector elements, the memory response will wait in the delay buffer until the vector elements are ready. Meanwhile, if the delay buffer is occupied, the memory interface unit will prioritize requests for the vector elements until the delay buffer is empty.
The result elements go through the adders before entering the output buffer so that elements with the same index are accumulated.
The rest of the dataflow remains the same.
To understand the benefit of MeNDA on an end-to-end workload, we integrated MeNDA with CoSPARSE. CoSPARSE is a graph analytics framework based on a reconfigurable architecture called Transmuter. The figure here shows a Transmuter architecture with 2 tiles and 4 PEs per tile. For evaluation, we use a system with 8 tiles and 16 PEs per tile. Similar to many recent graph frameworks, CoSPARSE reconfigures the dataflow based on the density of the active vertex set. The dense iterations use a row-major COO format, and no program modifications are needed because all the memory mapping changes are hidden from the programmers. The sparse iterations use a partitioned CSC format. Because there are 8 tiles and 8 memory ranks in total, we let each tile compute the data in a rank.