MeNDA: A Near-Memory Multi-way Merge
Solution for Sparse Transposition and Dataflows
Siying Feng*, Xin He*, Kuan-Yu Chen*, Liu Ke+,
Xuan Zhang+, David Blaauw*, Trevor Mudge*, Ronald Dreslinski*
*University of Michigan +Washington University in St. Louis
Sparse Linear Algebra is Everywhere
• Sparse linear algebra underpins many domains: fluid dynamics, machine learning, circuit simulation, electromagnetics, structural engineering, robotics & kinetics, graph analytics, and recommendation systems
• Common kernels: Sparse Matrix-Matrix Multiplication (SpMM), Sparse Matrix-Vector Multiplication (SpMV), Sparse Gathering
Sparse Matrix Transposition: Definition
• Compressed storage formats (CSR/CSC)
• index array, value array, pointer array
• saves storage and avoids computation on zeros
• Sparse matrix transposition
• swap the row and column indices of elements
• equivalent to conversion between CSR and CSC
Figure. A 5×5 sparse matrix A with nonzeros a-j and its transpose AT. A in CSR / AT in CSC: Pointer = [0 2 4 6 8 10], Index = [0 2 1 4 0 4 2 3 0 2], Value = [a b c d e f g h i j].
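To make the format conversion concrete, here is a minimal software sketch of CSR-to-CSC conversion. It uses a histogram-and-scatter approach (in the spirit of the scanTrans baseline rather than MeNDA's merge-based hardware flow); the function name and array layout are illustrative.

```python
import numpy as np

def csr_to_csc(n_rows, n_cols, row_ptr, col_idx, values):
    """Transpose a sparse matrix by converting CSR to CSC (sketch)."""
    values = np.asarray(values)
    nnz = len(values)
    # 1. Histogram: count the nonzeros in each column.
    counts = np.zeros(n_cols, dtype=np.int64)
    for c in col_idx:
        counts[c] += 1
    # 2. Exclusive prefix sum of the counts yields the CSC column pointers.
    col_ptr = np.zeros(n_cols + 1, dtype=np.int64)
    col_ptr[1:] = np.cumsum(counts)
    # 3. Scatter each nonzero into its column segment, recording its row
    #    index -- exactly the row/column index swap described above.
    row_idx = np.empty(nnz, dtype=np.int64)
    out_vals = np.empty(nnz, dtype=values.dtype)
    next_slot = col_ptr[:-1].copy()
    for r in range(n_rows):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            c = col_idx[k]
            row_idx[next_slot[c]] = r
            out_vals[next_slot[c]] = values[k]
            next_slot[c] += 1
    return col_ptr, row_idx, out_vals

# The 5x5 example above: AT in CSC comes out as
# col_ptr=[0 3 4 7 8 10], row_idx=[0 2 4 1 0 3 4 3 1 2],
# values=[a e i c b g j h d f], matching the AT figure.
col_ptr, row_idx, vals = csr_to_csc(
    5, 5, [0, 2, 4, 6, 8, 10], [0, 2, 1, 4, 0, 4, 2, 3, 0, 2],
    list("abcdefghij"))
```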
Sparse Matrix Transposition: A Growing Bottleneck
• Sparse matrix transposition is an essential building block of sparse linear algebra
• Misconception: transposition overhead used to be minor and easily amortized
• Reality: exploding dataset sizes have made the transposition overhead non-negligible
• the performance of graph processing has improved greatly
• little effort has been spent on sparse matrix transposition
Figure. Execution time breakdown of SSSP on a recent graph framework
Runtime transposition using a state-of-the-art implementation
can introduce a 126% performance overhead
Sparse Matrix Transposition: Memory-bound Nature
• Sparse matrix transposition is memory bandwidth bound
• By lifting the roofline by 8×, the throughput can improve by 4.1-5.2×
• Sparse matrix transposition has low computational intensity
Figure. Roofline model of mergeTrans: raising the memory-bandwidth roof by 8× (mergeTrans ×8) lifts the attainable throughput by up to 5.2×.
The high memory requirement and low arithmetic intensity make
sparse matrix transposition a promising candidate for NMP
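For reference, the underlying bound is the standard roofline formulation (not specific to this work):

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\ I \times B\right)$$

where $I$ is the arithmetic intensity in ops/byte and $B$ is the memory bandwidth. A kernel sitting exactly on the bandwidth slope would speed up by the full 8× when $B$ is scaled by 8×; the measured 4.1-5.2× indicates transposition sits near, but not exactly on, that slope.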
MeNDA: A Scalable Near-DRAM Architecture
• Processing units (PUs) are deployed in DIMM buffer chips
• limit modifications of DIMM hardware to the buffer chips
• PUs are deployed beside each rank
• exploit rank-level and DIMM-level parallelism
Figure. Baseline host access vs. MeNDA. Left: the host memory controller reaches the DIMMs over one channel, so effective BW = channel BW = 20 GB/s. Right: a PU in the DIMM buffer device sits beside each rank, so effective BW = rank BW × #ranks = 80 GB/s.
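The bandwidth arithmetic behind the figure, assuming (as the numbers suggest) four ranks that each sustain roughly the 20 GB/s channel rate:

$$BW_{\text{eff}} = BW_{\text{rank}} \times N_{\text{ranks}} \approx 20\ \text{GB/s} \times 4 = 80\ \text{GB/s}$$

Populating a channel with more MeNDA-enabled DIMMs raises the effective bandwidth further, which is the scalability argument above.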
MeNDA: Merge-sort Algorithm
• Merge sort is chosen for
• wide application in sparse linear algebra
• spatial locality
• An L-leaf merge tree merges L streams
• ⌈log_L N⌉ iterations to merge N streams
• Input and output are in CSR/CSC
• Intermediate data are in COO
• takes up less storage
• easy to decode
Figure. Transposing a 5×5 matrix with a 4-leaf merge tree. Iteration 0: leaves 0-3 merge the first four rows into one sorted stream, and the remaining row is converted into a second stream. Iteration 1: the two sorted streams are merged into the final result.
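A minimal software analogue of this iterative L-way merge is sketched below; heapq.merge stands in for the hardware merge tree, and the COO stream layout is illustrative.

```python
import heapq

def multiway_merge(streams, leaves=4):
    """Iteratively merge sorted COO streams, `leaves` at a time (sketch).

    Each stream is a sorted list of (row, col, value) triples whose
    indices have already been swapped. One pass merges up to `leaves`
    streams, so N streams take ceil(log_leaves(N)) passes, matching the
    ceil(log_L N) bound on the slide.
    """
    streams = [list(s) for s in streams]
    while len(streams) > 1:
        merged = []
        for i in range(0, len(streams), leaves):
            group = streams[i:i + leaves]
            # heapq.merge plays the role of the 4-leaf merge tree: it
            # combines up to `leaves` sorted streams in a single pass.
            merged.append(list(heapq.merge(*group)))
        streams = merged
    return streams[0] if streams else []
```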
MeNDA: Processing Unit Microarchitecture
Figure. PU microarchitecture. A hardware merge tree of PEs is fed by prefetch buffers under a controller; the root PE drains into an output buffer. A request queue (ReadQ/WriteQ) feeds a memory interface unit (request scheduler, address decoder, CMD generator) that drives the DDR C/A and DQ pins of the local rank. The PU resides in the DIMM buffer device beside each rank.
MeNDA: Processing Unit Microarchitecture
hardware merge tree
• FIFOs between PEs
MeNDA: Processing Unit Microarchitecture
output buffer
• connects to the root PE
• issues store requests at memory-block (cacheline) granularity
MeNDA: Processing Unit Microarchitecture
prefetch buffers
• connect to leaf PEs
• send requests for matrix rows
MeNDA: Processing Unit Microarchitecture
Controller
• FSM
• assigns row pointers to prefetch buffers
MeNDA: Processing Unit Microarchitecture
request queue
• holds outstanding requests
• separate queues for reads and writes
MeNDA: Processing Unit Microarchitecture
memory interface unit
• mimics a simplified memory controller
• schedules requests
• translates addresses
• generates DRAM commands
MeNDA: Dataflow
Figure. PU dataflow. (1) The controller issues loads for row pointers and (2) receives them; (3) it sends row start/end pointers to the prefetch buffers; (4) the prefetch buffers issue loads for matrix rows, (5) receive them, and (6) feed the merge tree; (7) the root PE sends merged rows to the output buffer, which (8) stores results back to memory.
MeNDA: Input Co-location and Workload Balancing
• Each rank transposes a horizontal matrix partition
• eliminate expensive communication across ranks/DIMMs
• use original CSR format without preprocessing
• allow easy matrix traversal after transposition
• use techniques proposed in prior work [1]
• Workload balancing: NNZ-based partitioning (see the sketch after this slide)
• partition index/value with page alignment
• assign row pointers to corresponding ranks
• duplicate row pointers used by 2 ranks
• Host writes start addresses to memory-mapped registers (MMRs)
• Partitioning is automated and hidden from programmers
Figure. NNZ-based partitioning of the RowPtr, Indices, and Values arrays across four ranks with page alignment; a row-pointer page needed by two ranks is duplicated across them.
[1] Cho et al., “Near Data Acceleration with Concurrent Host Access”, ISCA 2020.
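A sketch of the NNZ-based, page-aligned split described above (the function name and the nonzeros-per-page parameter are assumptions for illustration):

```python
import numpy as np

def partition_by_nnz(row_ptr, n_ranks, page_nnz=512):
    """Split a CSR matrix across ranks by nonzero count (sketch).

    Cuts the index/value arrays into n_ranks slices of roughly equal
    nonzero count, rounding each cut to a page boundary (modeled here
    as page_nnz nonzeros per page). A row straddling a cut makes its
    row-pointer page needed by two ranks, hence the duplication.
    """
    nnz = int(row_ptr[-1])
    cuts = [0]
    for r in range(1, n_ranks):
        cut = round(r * nnz / n_ranks / page_nnz) * page_nnz
        cuts.append(min(cut, nnz))
    cuts.append(nnz)
    parts = []
    for r in range(n_ranks):
        lo, hi = cuts[r], cuts[r + 1]
        # Rows whose nonzeros overlap [lo, hi): their row pointers are
        # assigned to rank r (and duplicated at the boundaries).
        row_lo = int(np.searchsorted(row_ptr, lo, side="right")) - 1
        row_hi = int(np.searchsorted(row_ptr, hi, side="left"))
        parts.append({"nnz_range": (lo, hi), "row_range": (row_lo, row_hi)})
    return parts
```

The host would then write each rank's start addresses into its memory-mapped registers, as described above.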
MeNDA: Adaptation to Support SpMV
additional units
• FP adder
• reduces elements with the same rowID
• 16-way FP multiplier
• Delay buffer
• cacheline-sized
• holds columns waiting for vector elements
Figure. PU adapted for SpMV: a 16-way FP multiplier and a cacheline-sized delay buffer are added in front of the merge tree, and FP adders before the output buffer accumulate merged elements with the same index.
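To make the adapted dataflow concrete, below is a minimal software analogue of merge-based outer-product SpMV. Per the speaker notes, the input matrix is stored in (partitioned) CSC, matching the output format of the transposition flow; heapq.merge again stands in for the merge tree, and all names are illustrative.

```python
import heapq

def spmv_outer_product(col_ptr, row_idx, values, x):
    """Merge-based outer-product SpMV sketch: y = A @ x with A in CSC.

    Multiply phase: scale each column j by x[j] (the 16-way FP
    multiplier). Merge phase: merge the per-column streams by row index
    and accumulate entries with the same row (the merge tree plus the
    FP adders in front of the output buffer).
    """
    n_cols = len(col_ptr) - 1
    streams = [
        [(row_idx[k], values[k] * x[j])
         for k in range(col_ptr[j], col_ptr[j + 1])]
        for j in range(n_cols)
    ]
    y = {}
    for row, partial in heapq.merge(*streams):
        y[row] = y.get(row, 0.0) + partial  # reduce elements with same rowID
    return y  # sparse result: {row: value}
```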
Methodology: Architecture Modeling
• Performance modeling
• custom cycle-accurate simulator
• memory interface connected to Ramulator
• 4 channels x 2 ranks/channel DDR4_2400R
• Area and energy modeling
• RTL synthesis of PU in 40 nm
• Baselines
• transposition on CPU: mergeTrans / scanTrans [1]
• transposition on GPU: cuSPARSE
• SpMV accelerator: Sadi et al. [2]
• Dataset
• synthetic uniform and power-law matrices
• real-world matrices from SuiteSparse Matrix Collection
Figure. In-memory accelerator simulation. A matrix partitioning engine splits the input matrix file across ranks 0..N; each rank is modeled by a data generator plus a cycle-accurate simulator backed by a single-rank Ramulator instance. Transposition time is the MAX across ranks, and a power model produces the power stats.
[1] Wang et al., “Parallel Transposition of Sparse Data Structures”, ICS 2016.
[2] Sadi et al., “Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable Multi-Way Merge Parallelization”, MICRO 2019.
Evaluation: Comparison vs. CPU/GPU
• Performance benefits come from both reduction in memory traffic and
improvement in memory bandwidth utilization.
Figure. Speedup of MeNDA over scanTrans, mergeTrans (CPU), and cuSPARSE (GPU) on 15 real-world matrices (ASIC_320k, amazon, bcsstk32, language, mac_econ, parabolic, rajat21, sme3Dc, Slashdot, stomach, transient, twotone, venkat01, webbase, wiki-Talk) plus the geomean (19×, 12×, and 8×). On wiki-Talk, MeNDA generates 11.2× less memory traffic and achieves 2.7× higher bandwidth utilization than mergeTrans.
MeNDA achieves an average speedup of 19x, 12x and 8x over
scanTrans, mergeTrans, and cuSPARSE, respectively.
Evaluation: Integration with CoSPARSE
• Single Source Shortest Path on amazon (N = 262k, NNZ = 1.23M).
• MeNDA decreases transposition overhead from 126% to 5%.
• allowing CoSPARSE to store only one copy of the graph and support larger graphs
• having negligible impact on graph execution time despite the changed data mapping
• A MeNDA PU occupies 7.1 mm² and consumes 76.8 mW at 800 MHz.
[1] Feng et al., “CoSPARSE: A Software and Hardware Reconfigurable SpMV Framework for Graph Analytics”, DAC 2021.
[2] Pal et al., “Outerspace: An outer product based sparse matrix multiplication accelerator”, HPCA 2018.
Figure. Execution time breakdown of SSSP on CoSPARSE [1, 2] into sparse iterations, dense iterations, and transposition.
Evaluation: Comparison vs. SpMV accelerator
• Sadi et al. [1] cannot perform transposition without introducing frequent
synchronization or large on-chip buffers.
• MeNDA achieves a similar iso-bandwidth throughput at 0.043 GTEPS/(GB/s)
compared to Sadi et al. at 0.049 GTEPS/(GB/s).
• MeNDA shows an average energy-efficiency (GTEPS/W) gain of 3.8×.
[1] Sadi et al., “Efficient SpMV Operation for Large and Highly Sparse Matrices Using Scalable Multi-Way Merge Parallelization”, MICRO 2019.
MeNDA focuses on lightweight PUs targeting commodity DIMMs,
which have better capacity scalability than HBM devices.
Conclusion
• Sparse matrix transposition is a promising candidate for NMP
• MeNDA is a near-DRAM solution to multi-way merge for sparse dataflows
• MeNDA presents significant gains over existing solutions
Backup slides
Editor's Notes
1. Sparse linear algebra is prevailing in a wide variety of domains, such as machine learning, graph analytics, and scientific computing. Sparse linear algebra is notorious for its irregular memory access pattern, so many near-memory hardware accelerators have been proposed recently for common sparse kernels, such as sparse matrix-vector multiplication and sparse gathering. However, not all important sparse kernels have received enough attention, and sparse matrix transposition is one of them.
  2. Due to their large sizes and high sparsity, sparse matrices are often stored in compressed formats to save storage and avoid computations on zero elements. Commonly used formats are compressed sparse row (CSR) and compressed sparse column (CSC). CSR and CSC store sparse matrices in three arrays. The index array and value array store the row/column index and value of each NZ, respectively, and the pointer array holds the start pointer of each row/column. Sparse matrix transposition swaps the row index and column index of each NZ. Therefore, transposing a sparse matrix is in essence equivalent to converting a sparse matrix from CSR to CSC, or the opposite. For simplicity, we will use converting a matrix from CSR to CSC to denote general sparse matrix transposition from this point.
3. Sparse matrix transposition is an essential building block in both the processing and pre-processing stages of sparse linear algebra applications. For example, many recent graph frameworks require input graphs in both CSR and CSC to support dynamic dataflow reconfiguration. A common misconception regarding the runtime transposition overhead is that it is minor compared to the end-to-end execution time and can be easily amortized. The figure here shows the breakdown of the execution time of running SSSP on a recent graph framework. The top bar shows the misconception we just mentioned. However, the middle bar is the reality. Recent breakthroughs have significantly improved the performance of graph processing. Consequently, runtime transposition using a state-of-the-art implementation can introduce a 126% performance overhead.
  4. To understand the bottleneck of sparse matrix transposition, we performed roofline analysis on mergeTrans, a recently proposed sparse matrix transposition implementation. The roofline model shows that sparse matrix transposition is memory bandwidth bound because the data points are close to the “roof”, which are the red and blue lines labeling the maximum throughputs achieved when the system memory bandwidth is fully utilized. If we increase the system memory bandwidth by 8x, the throughput can be improved by up to 5.2x. This shows the potential benefit of applying NMP on sparse matrix transposition because NMP exposes the high internal memory bandwidth of DRAM devices. Meanwhile, transposition has low computational intensity because there are no floating point operations. The high memory requirement and low arithmetic intensity both make sparse matrix transposition a promising candidate for NMP.
  5. In this work, we propose MeNDA, a scalable near-memory architecture that takes advantage of the high internal memory bandwidth of DRAM devices. Inspired by prior works, we deploy custom processing units (PUs) in the buffer chips of DIMMs to minimize the hardware modifications. The PUs are deployed beside each rank to exploit rank-level and DIMM-level parallelism. Each PU can access its corresponding memory rank in parallel without going through the off-chip memory interface and therefore the effective memory BW becomes the sum of all rank BW. MeNDA is highly scalable because a higher throughput can be achieved by populating a memory channel with multiple MeNDA enabled DIMMs.
6. In this work, we adopted the merge sort algorithm not only because merge sort presents better spatial locality but also because merge sort is widely used in sparse linear algebra. Here we show the dataflow of transposing a 5x5 matrix assuming a 4-leaf merge tree. A 4-leaf merge tree merges 4 sorted streams into a single stream. In iteration 0, the first 4 rows are merged and then the last row is converted. Then in iteration 1, the two sorted streams are merged into the final result. The input and output are stored in CSR or CSC while the intermediate data are stored in COO. The COO format stores the row index, column index, and value of each NZ in three separate arrays. Because the intermediate data may contain numerous empty rows/columns, COO not only is easy to decode but also tends to take up less storage than CSR/CSC.
7. Now we will introduce the microarchitecture of the PU. The PU sits within the buffer chips and accesses its local memory rank by sending commands through the command/address bus and receiving data through the data bus. During execution, a PU transposes the matrix stored in DRAM and writes results directly back to DRAM with no need for host communication or assistance.
  8. The key component of the PU is a hardware merge tree. The *leaf PEs* are connected to *2 prefetch buffers* while other PEs are connected to 2 child PEs through FIFOs. Ideally, we want the merge tree to be as *wide as possible* because *the number of leaves * directly determines the number of transposition iterations, which is proportional to the total memory traffic. However, the area and power of the merge tree grow quadratically with the number of leaves and we have a limited area and power budget within the DIMM buffer chips. So in this work, we use 1024-leaf merge trees, which can transpose a wide range of real-world sparse matrices within 2 iterations while having a moderate power and area overhead.
  9. The *root PE* is connected to an output buffer, which aims to send store requests at cacheline granularity.
  10. Prefetch buffers are in charge of sending memory load requests to feed the *leaves* with correct data. To reduce power consumption, the prefetch buffers are implemented as multi-bank SRAM.
  11. The controller is an FSM that *assigns matrix rows/columns* to each prefetch buffer.
  12. All the memory requests are sent to a request queue with separate queues for loads and stores. To avoid duplicate load requests to the same cacheline, which happens when *multiple rows reside in a single cacheline*, the *read queue* is implemented as content addressable memory (CAM) to allow parallel request merging. The memory response will be broadcasted to all prefetch buffers so we don’t need to keep track of the requesters.
  13. The requests stored in the request queue are processed by the memory interface unit, which mimics a simplified memory controller. The request scheduler selects requests based on a first come first serve policy that prioritizes requests ready to launch and DRAM row hits. Then the address decoder translates the incoming physical address to a DRAM address. And finally the command generator generates DRAM commands.
  14. The overall dataflow of a PU goes like this. The *controller* keeps 1.sending requests for the row *pointers*(to Req Q). 2.After the data come back, as long as the prefetch buffers have availability, 3.the controller will send the start and end *addresses of 'the' rows* sequentially to the prefetch buffers. 4. The *prefetch buffers* send out load requests(to Req Q) for matrix rows or intermediate data as long as they have enough vacancies and 5,6. keep feeding the merge tree as long as they have valid data. The merged results are sent from the root PE to the output buffer. The output buffer generate store requests as soon as there are enough results to form a cacheline.
  15. To achieve maximum throughput, we partition data across different ranks. To avoid expensive communications between rank PUs, we let each PU transpose a *horizontal partition of the matrix* and keep all the input operands needed by the PU in its local memory rank. In this way, preprocessing is not needed, and it is also easy to locate a NZ after transposition. The index and value arrays are partitioned by NNZ with page alignment for workload balancing. The *row pointers* required by a partition are assigned to the *corresponding rank* and the page needed by two ranks are duplicated. After partitioning, the host will write the physical addresses to the memory mapped registers so that the PUs can access them during execution. The partitioning is automated and handled by the host during data allocation. Because all the changes are made in memory mapping, they are completely hidden from the programmer.
  16. Because of the wide application of merge sort, MeNDA can be adapted to support other sparse linear algebra kernels, such as outerproduct SpMV. The *merge phase of (outer)SpMV *has the same dataflow as sparse matrix transposition, and thus can be implemented directly on MeNDA. However, transposition does not involve floating point computations, so floating point adders and multipliers are added. To perform outerproduct, we store the *input matrix is a partitioned CSC* format, which matches the format of the transposed matrix generated by our work.
  17. To model the performance of MeNDA, we built a cycle-accurate simulator and connected the memory interface to Ramulator. The area and power are estimated by synthesizing an RTL model of the PU. For sparse matrix transposition, we compare to two prior CPU implementations and cusparse library on GPU. For SpMV, we compare to an accelerator equipped with HBM. Finally, we used both synthetic matrices and real-world matrices in our evaluation.
  18. The figure here shows the speedup of MeNDA over scanTrans and mergeTrans on CPU and cuSPARSE on GPU. The red line represents a speedup of 1. The speedup of MeNDA comes from both the reduction in memory traffic and the improvement in memory bandwidth utilization. Taking wiki-Talk as an example, compared to mergeTrans, MeNDA reduces the memory traffic by 11.2x while exhibiting 2.7x higher bandwidth utilization. Overall, MeNDA achieves an average speedup of 19.1x, 12.0x and 7.7x over scanTrans, mergeTrans, and cuSPARSE, respectively.
  19. To understand the benefit of MeNDA on end-to-end workloads, we integrated MeNDA with CoSPARSE, a recent graph analytics framework. The figure here shows the execution time of SSSP on CoSPARSE for the graph amazon with 2 copies of the input graph, with runtime transposition using mergeTrans, and using MeNDA. Although integrating MeNDA requires changes in memory mapping, such change has negligible impact on the original graph execution. Sparse matrix transposition happens each time CoSPARSE switches from the dense dataflow to the sparse dataflow or the opposite. As shown by the red bars, using MeNDA for runtime graph transposition decreases the transposition overhead from 126% to 5% . As dataset sizes keep growing, MeNDA can prevent designs like CoSPARSE from expensive disk accesses when the DRAM devices can only fit a single copy of the graph with minor overhead. Synthesis shows that a MeNDA PU takes up 7.1 mm2 in 40 nm and consumes 78.6 mW at 800MHz.
  20. For a fair comparison in SpMV, we use *iso-bandwidth throughput* as the performance metric. Compared to the SpMV accelerator proposed by Sadi et al., MeNDA achieves a similar average iso-bandwidth throughout at 0.043 GTEPS per bandwidth and an average improvement of 3.8x in GTEPS/W, as shown in this figure. While the SpMV accelerator is a monolithic design with 4 HBM stacks, our work MeNDA features lightweight PUs that can be integrated into commodity DRAM devices, which has better capacity scalability than HBM.
  21. We also did a thread scaling analysis. The figure here shows the memory bandwidth utilized by mergeTrans when we increase the number of threads. While the peak bandwidth is at 76.8 GB/s, the utilized memory bandwidth starts to saturate at 16 threads and reaches the maximum at 64 threads at 59.6 GB/s. In practice, there is little performance benefit beyond 16 threads and further saturating the bandwidth is undesirable because of the significantly increased memory latency. What sparse matrix transposition will most benefit from is an approach that reduces memory latency and relieves the contention at the off-chip memory interface.
  22. There has been some some changes in the dataflow for SpMV. the controller send load requests for the vector elements along with the requests for the column pointers. Because all columns are eventually merged into a single vector, we don’t need to track the column indices of NZs anymore. Instead, the prefetch buffer storage aimed for the column indices now holds the vector elements. When the memory interface returns the matrix values, the prefetch buffers that need the matrix values send their vector elements to the multiplier. If they don’t have the vector elements, the memory response will wait in the delay buffer until the vector elements are ready. Meanwhile, if the delay buffer is occupied, the memory interface unit will prioritize requests for the vector elements until the delay buffer is empty. The result elements go through the adders before entering the output buffer so that elements with the same index are accumulated. The rest of the dataflow remain the same.
  23. To understand the benefit of MeNDA on an end-to-end workload, we integrated MeNDA with CoSPARSE. CoSPARSE is a graph analytics framework based on a reconfigurable architecture called Transmuter. The figure here shows a Transmuter architecture with 2 tiles and 4 PEs per tile. For evaluation, we apply a system with 8 tiles and 16 PEs per tile. Similar to many recent graph frameworks, CoSPARSE reconfigures the dataflow based on the density of the active vertex set. The dense iterations use a row-major COO format and no program modifications are needed because all the memory mapping changes are hidden from the programmers. The sparse iterations use a partitioned CSC format. Because there are 8 tiles and 8 memory ranks in total, we let each tile to compute the data in a rank.