Student: Chun-Feng Chen
Advisor: Bo-Cheng Charles Lai
Mar. 8th, 2018
NCTU Institute of Electronics – Parallel Computing System Lab (NCTU PCS) Hsinchu, R.O.C
Towards Algorithmic Multi-ported Memory:
Techniques and Design Trade-offs
•Introduction
•Background
•Non-Table-Based Approaches
•Table-Based Approaches
•Performance and Impact of Design Factors
•Conclusions
•References
Chun-Feng Chen NCTU_IEE - PCS Lab
Outline
Introduction
• Algorithmic Multi-ported Memory (AMM)
• Multi-ported memories are important functional modules in modern digital systems
• E.g. shared cache in multi-core processors, routing tables of switches, etc.
• AMM composes simple SRAMs and logic to support multiple reads and writes
• Potential to attain better performance than circuit-based approaches (CMM)
• Most previous work targets FPGAs
• Laforest, Charles Eric, et al. "Efficient Multi-ported Memories for FPGAs" [ACM 2010]
• Charles Eric Laforest, et al. "Multi-ported Memories for FPGAs via XOR" [ACM 2012]
• Charles Eric Laforest, et al. "Composing Multi-Ported Memories on FPGAs" [ACM 2014]
• Jiun-Liang Lin, et al. "BRAM Efficient Multi-ported Memory on FPGAs" [VLSI-DAT 2015]
• Jiun-Liang Lin, et al. "Efficient Designs of Multi-ported Memory on FPGAs" [TVLSI 2016]
• Kun-Hua Huang, et al. "An Efficient Hierarchical Banking Structure for Algorithmic
Multi-ported Memory on FPGAs" [TVLSI 2017]
• Sundar Iyer, et al. "Algorithmic Memory Brings an Order of Magnitude Performance Increase
to Next Generation SoC Memories" [DesignCon 2012]
Motivation
– FPGA Limits AMM Exploration
• Limited FPGA resources constrain AMM exploration
• Limited number of BRAMs, F/F, slice LUTs, etc
• Unable to explore important design factors of AMM
• Number of ports, memory depth, banking structures
• AMM has more significant benefit for greater depth
• E.g. from 512K to 16M depth, AMM 4R1W attains 1.25% to 36.47% shorter latency than
circuit-based approaches
• BRAM size is fixed
• 1K depth of 32-bit data width
• Unable to explore impact of different bank sizes
• E.g. choosing a proper banking structure for 2R1W can improve latency/area/power by up to
9.53%/56.86%/33.39%
• BRAM port configuration is fixed
• dual-port 2RW mode
• Unable to explore impact of different bank port configurations (4R1W, 2R2W, etc)
• E.g. choosing a proper building memory cell for 8R4W can improve area/power by up to
70.0%/6.37x
Our Contributions
• Implement all the AMM designs on 45nm technology
• Use SRAM as building memory cell
• Explore important design factors of AMM
• Different AMM designs, memory depth, port configurations, banking
structures, building memory cells, etc.
• Extensive experiments and comprehensive analysis
• Summarize observations into design guidelines
Background
• Non-table-based schemes
• Duplicate memory module
• E.g. NTRep-Rd [ACM 2010]
• Table-based schemes
• Adopt lookup tables to track the
stored up-to-date data address
• E.g. TBLVT [ACM 2010]
Algorithmic Multi-ported Memory
(AMM) Technique Categories
• Non-table-based approaches
• Use multiple banks to support multiple accesses
• Store parity data to support multiple reads and enable multiple
writes [VLSI-DAT’15, TVLSI’16, TVLSI’17]
• HB-NTX-RdWr can scale the number of ports with a systematic flow
[TVLSI’17]
• Table-based approaches
• Use multiple memory modules to support multiple accesses
• Use lookup tables to avoid module conflicts and track the most
up-to-date values [VLSI-DAT’15, TVLSI’16, TVLSI’17]
AMM - Previously Proposed Designs
Non-Table-Based
Approaches
Multiple Reads:
(1) Non-Table-Based Replication (NTRep-Rd)
(2) Non-Table-Based XOR (NTX-Rd)
(3) Hierarchical Banking Non-Table XOR-Based (HB-NTX-Rd)
Multiple Writes:
(1) Non-Table-Based XOR (NTX-Wr)
(2) Hierarchical Banking Non-Table XOR-Based (HB-NTX-Wr)
Multiple Reads and Writes:
(1) Hierarchical Banking Non-Table XOR-Based (HB-NTX-RdWr)
• An mR1W memory module using the
NTRep-Rd technique
• Duplicates memory modules to
support multiple read ports
• A single write port connects to every
memory module
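The replication idea above can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the class name `ReplicatedMemory` and its methods are hypothetical, and port timing and conflicts are not modeled.

```python
class ReplicatedMemory:
    """Behavioral sketch of an mR1W memory built by replication (hypothetical name)."""

    def __init__(self, num_read_ports: int, depth: int) -> None:
        # One full copy of the data per read port.
        self.copies = [[0] * depth for _ in range(num_read_ports)]

    def write(self, addr: int, value: int) -> None:
        # The single write port updates every copy so all replicas stay identical.
        for copy in self.copies:
            copy[addr] = value

    def read(self, port: int, addr: int) -> int:
        # Each read port reads only its own private copy.
        return self.copies[port][addr]


mem = ReplicatedMemory(num_read_ports=4, depth=16)
mem.write(3, 0xAB)
values = [mem.read(port, 3) for port in range(4)]
```

The cost is visible in the constructor: storage grows linearly with the number of read ports.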
[7] LaForest, Charles Eric, and J. Gregory Steffan. "Efficient Multi-ported Memories for FPGAs." Proceedings of the 18th Annual
ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGAs). ACM, 2010.
Non-Table-Based Replication Multiple
Reads (NTRep-Rd) - [ACM’10, TRETs’14]
• A 2R1W memory module using the NTX-Rd
technique
• Write request
• W0 stores directly to BANK0 and
reads D0 from BANK1
• Updates the XOR-BANK
• Read request
• R1 reads directly
• R0 reads the other banks to
recover the correct data
• NTX-Rd supports two modes
• Case 1: 3R (no write request)
• Case 2: 2R1W (one write request)
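The write and read steps above can be sketched as a data-level model. It assumes two data banks plus one XOR bank, and it assumes the address splits into a bank index and an offset by simple division. The class and method names are hypothetical, and port arbitration is not modeled.

```python
class XorRead2R1W:
    """Data-level sketch of NTX-Rd: BANK0, BANK1 and an XOR bank (hypothetical names)."""

    def __init__(self, bank_depth: int) -> None:
        self.bank_depth = bank_depth
        self.bank = [[0] * bank_depth, [0] * bank_depth]
        # Invariant: xor_bank[i] == bank[0][i] ^ bank[1][i]
        self.xor_bank = [0] * bank_depth

    def _locate(self, addr: int):
        # Assumed mapping: (bank index, offset within the bank).
        return divmod(addr, self.bank_depth)

    def write(self, addr: int, value: int) -> None:
        b, off = self._locate(addr)
        other = self.bank[1 - b][off]      # read D0 from the other data bank
        self.bank[b][off] = value          # store W0 directly
        self.xor_bank[off] = value ^ other  # update the XOR bank

    def read_direct(self, addr: int) -> int:
        b, off = self._locate(addr)
        return self.bank[b][off]

    def read_recovered(self, addr: int) -> int:
        # Used when the target bank is busy: XOR bank ^ other data bank.
        b, off = self._locate(addr)
        return self.xor_bank[off] ^ self.bank[1 - b][off]


m = XorRead2R1W(bank_depth=8)
m.write(2, 0x5A)    # lands in BANK0
m.write(10, 0x3C)   # lands in BANK1, same offset
```

Both read paths return the same value, which is what lets a second read proceed when it conflicts with another access to the same bank.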
[13] Sundar Iyer and Shang-Tse Chuang (Memoir Systems). "Algorithmic Memory Brings an Order of Magnitude
Performance Increase to Next Generation SoC Memories." DesignCon (http://www.designcon.com), 2012.
Non-Table-Based XOR Multiple Reads
(NTX-Rd) - [DesignCon.’12]
$W_0' = W_0 \oplus D_0$
$R_0 = (W_0 \oplus D_0) \oplus D_0 = W_0$
Hierarchical Banking Non-Table XOR-Based
Multiple Reads (HB-NTX-Rd) - [TVLSI’17]
[12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on
FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.10 (2017): 2776-2788.
• 4-read, 1-write (4R1W) memory
• Scale the B-NTX-Rd to more reads in a
hierarchical structure
• Use the 2R1W/3R as building modules
• Case 1: 5R (no write request)
• Three reads access BANK0
• Other two reads access the other banks
• Case 2: 4R1W (one write request)
• W0 and two reads access BANK0
• Other two reads access the other banks
• W0 stores directly, and reads BANK1 for
updating XOR-BANK
• A 1R2W memory module using the NTX-Wr technique
• Duplicates memory modules and stores XOR-encoded values to support multiple
reads and writes
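A minimal sketch of XOR-encoded writes, assuming the general scheme of the cited XOR-based work: one bank per write port, each write stores its value XORed with the other banks' contents at that address, and a read XORs all banks. It is not claimed to match the slide's figure term by term. The class name is hypothetical, and the bank replication needed for simultaneous port access is omitted.

```python
from functools import reduce
from operator import xor


class XorWriteMemory:
    """Data-level sketch of an XOR-based multi-write memory (hypothetical name)."""

    def __init__(self, num_write_ports: int, depth: int) -> None:
        # One bank per write port; each port writes only to its own bank.
        self.banks = [[0] * depth for _ in range(num_write_ports)]

    def write(self, port: int, addr: int, value: int) -> None:
        # XOR of what every *other* bank currently holds at this address.
        others = reduce(
            xor,
            (bank[addr] for i, bank in enumerate(self.banks) if i != port),
            0,
        )
        # Store the encoded value so that XOR over all banks yields `value`.
        self.banks[port][addr] = value ^ others

    def read(self, addr: int) -> int:
        # Decode by XORing the same address across all banks.
        return reduce(xor, (bank[addr] for bank in self.banks), 0)


mem_x = XorWriteMemory(num_write_ports=2, depth=8)
mem_x.write(0, 1, 0x11)
mem_x.write(1, 1, 0x22)   # a later write through the other port wins
after_second = mem_x.read(1)
mem_x.write(0, 1, 0x33)
after_third = mem_x.read(1)
```

No table is needed to find the latest value: the encoding itself makes the most recent write the one that decodes.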
[8] LaForest, Charles Eric, et al. "Multi-ported Memories for FPGAs via XOR." Proceedings of the ACM/SIGDA International Symposium on
Field Programmable Gate Arrays (FPGAs). ACM, 2012.
Non-Table-Based XOR Multiple Writes
(NTX-Wr) - [ACM’12, TRETs’14]
$W_0' = (W_0 \oplus D_0) \oplus D_0$
$W_1' = (W_1 \oplus D_1) \oplus D_1$
[12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on
FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.10 (2017): 2776-2788.
Hierarchical Banking Non-Table XOR-Based
Multiple Writes (HB-NTX-Wr) - [TVLSI’17]
(a) Non-Conflict-Write case (b) Conflict-Write case
$W_0' = W_0 \oplus \mathrm{Ref}_0$
$W_1' = W_1 \oplus \mathrm{Ref}_1$
$\mathrm{Ref}_1^{new} = W_1 \oplus (D_0 \oplus \mathrm{Ref}_1^{cur})$
$W_1'' = D_1 \oplus \mathrm{Ref}_1^{new}$
• 2-read, 2-write (2R2W) memory
• Integrates HB-NTX-Rd and HB-NTX-Wr to enable multiple reads and writes
• Use HB-NTX-Rd 4R1W/5R as building memory modules
(a) Non-Conflict-Write case (b) Conflict-Write case
Hierarchical Banking Non-Table XOR-Based Multiple
Reads and Writes (HB-NTX-RdWr) - [TVLSI’17]
[12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on
FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.10 (2017): 2776-2788.
• The top-down flow increases read ports with HB-NTX-Rd, while the
left-right flow increases write ports with HB-NTX-Wr
HB-NTX-RdWr Systematic Flow
- [TVLSI’17]
Table-Based
Approaches
Multiple Reads and Writes:
(1) Table-Based Live Value Table (TBLVT)
(2) Table-Based Remap Table (TBRemap)
(3) Enhancing Table-Based Designs with Reduced Lookup Tables
(TBLVT_HB-NTX-RdWr)
[7] LaForest, Charles Eric, and J. Gregory Steffan. "Efficient Multi-ported Memories for FPGAs." Proceedings of the 18th Annual
ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGAs). ACM, 2010.
Table-Based Live Value Table (TBLVT)
- [ACM’10, TRETs’14]
• Write request
• Each write is dedicated to a certain
memory module
• The lookup table (LVT) tracks the latest
location
• A read request queries the LVT first
and then accesses the data from the correct
memory location
• Design of the LVT size:
• log2(#NumModules) x MemoryDepth
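The write and read behavior above can be sketched as follows. It assumes one memory module per write port and one LVT entry per address; the class name is hypothetical and read-port replication inside each module is left out.

```python
class LiveValueTableMemory:
    """Data-level sketch of a table-based (LVT) multi-write memory (hypothetical name)."""

    def __init__(self, num_modules: int, depth: int) -> None:
        # Assumption: one memory module per write port.
        self.modules = [[0] * depth for _ in range(num_modules)]
        # LVT: for each address, the index of the module holding the latest value.
        self.lvt = [0] * depth

    def write(self, port: int, addr: int, value: int) -> None:
        self.modules[port][addr] = value  # write goes to the port's own module
        self.lvt[addr] = port             # record where the live value now is

    def read(self, addr: int) -> int:
        module = self.lvt[addr]           # query the LVT first
        return self.modules[module][addr]  # then access the correct module


lvt_mem = LiveValueTableMemory(num_modules=4, depth=16)
lvt_mem.write(0, 5, 100)
lvt_mem.write(3, 5, 200)  # newer value for the same address, different module
latest = lvt_mem.read(5)
```

The stale copy in module 0 is never erased; the LVT entry alone decides which module a read uses, which is why each entry needs log2(#NumModules) bits.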
[11] Lai, Bo-Cheng Charles, and Jiun-Liang Lin. "Efficient Designs of Multi-ported Memory on FPGAs." IEEE Transactions on Very Large
Scale Integration (VLSI) Systems 25.1 (2017): 139-150.
Table-Based Remap (TBRemap)
– [VLSI-DAT’15, TVLSI’16]
• Remap functions:
• Apply banking structure designs
• All the reads and writes need to check
remap table to determine which
memory bank to access
• Use an HWC to distribute the multiple
writes across banks, and a remap table to
track the latest location
• Design of the Remap size:
• ([log2(#DataBanks + 1)] – 1) x
MemoryDepth
Enhancing Table-Based Design with
Reduced Lookup Table
• NTRep-Rd mR1W modules are replaced
by HB-NTX-RdWr mRnW modules
• Reduces the lookup table size while using
fewer modules, to alleviate the routing
complexity on the latency-critical path
• Example: a 2R4W 8K-depth memory
• Original TBLVT needs four NTRep-Rd
2R1W building modules
• LVTSize = 2-bit x 8K-depth
• Enhanced TBLVT needs two HB-NTX-RdWr
2R2W building modules
• LVTSize = 1-bit x 8K-depth
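The two LVT sizes in this example follow from the log2(#NumModules) x MemoryDepth formula given for TBLVT. A short check, where the function name `lvt_size_bits` is hypothetical and rounding up for module counts that are not powers of two is an added assumption:

```python
import math


def lvt_size_bits(num_modules: int, depth: int) -> int:
    """LVT size in bits: log2(#NumModules) x MemoryDepth (rounded up per entry)."""
    bits_per_entry = math.ceil(math.log2(num_modules))
    return bits_per_entry * depth


DEPTH_8K = 8 * 1024

# Original TBLVT: four 2R1W modules -> 2 bits per entry.
original_bits = lvt_size_bits(4, DEPTH_8K)
# Enhanced TBLVT: two 2R2W modules -> 1 bit per entry.
enhanced_bits = lvt_size_bits(2, DEPTH_8K)
```

Halving the module count removes one bit from every LVT entry, so the table shrinks by half for this 2R4W case.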
Performance and Impact
of Design Factors
A. Experiments Setup
B. Circuit-level vs. Algorithmic Scheme
C. Overall Performance and Cost (Non-Table-Based vs. Table-Based)
D. Impact of Banking Structure
E. Scalability with Memory Depths and Number of Ports
F. Proper Tradeoff Between Circuit-level and Algorithmic Memory
F.(1) Different Basic SRAM Modules
F.(2) AMM Designs with Higher Port Counts Circuit-level Modules
A. Experiments Setup
– Read/Write Path Logic
• The block diagram of an AMM architecture, including write-path logic, read-path
logic, and building memory cells
• The write path performs data manipulation, e.g. replication, lookup table updates, etc.
• The read path retrieves the data from the memory cells and decodes the correct data
Read/write-path logic: RTL synthesized by Design Compiler with TSMC 45nm
A. Experiments Setup
- Memory Cells (SRAM)
• The block diagram of an AMM architecture, including write-path logic, read-path
logic, and building memory cells
• Memory cells are composed of SRAMs, e.g. in single-port or dual-port mode
Memory cells: the CACTI integrated memory model estimates the
performance of different SRAM modules with TSMC 45nm
A. Experiments Setup - Algorithmic
Multi-ported Memory Architecture
• By combining the synthesis results of read-path and write-path logic, and
estimation from CACTI, we can evaluate the overall performance and cost
of an AMM design
Read-path logic + write-path logic + memory cells = overall performance of an AMM design
B. Circuit-level vs. Algorithmic Schemes
– 2RnW (Latency)
• CMM has shorter latency for shallow memory depths (< 64K);
AMM has shorter latency for greater memory depths (> 128K)
• AMM is more scalable as memory depth increases
• For 2R8W, from 16K to 16M, the latency increases by:
HB-NTX-RdWr (1.52x), TBLVT_B-NTX-Rd (2.85x), CMM (23.46x)
• Non-table-based designs have shorter latency than table-based designs
• For 2R8W, HB-NTX-Rd attains 4.23% to 95.09% shorter latency
than TBLVT_B-NTX-Rd from 16K to 16M
B. Circuit-level vs. Algorithmic Schemes
– mR1W (Area)
• Non-table-based AMM attains smaller area when compared with CMM:
• 2R1W: 6.38% to 36.2%
• 4R1W: 15.27% to 67%
• 8R1W: 59.33% to 3.01x
B. Circuit-level vs. Algorithmic Schemes
– 2RnW (Area)
• Non-table-based … CMM
• Table-based designs still attain smaller area than CMM
• For 2R8W, TBLVT_B-NTX-Rd attains 2.01% to
22.79% smaller area from 64K to 16M
• Table-based … memory cell …; table-based designs attain smaller area
than non-table-based designs
B. Circuit-level vs. Algorithmic Schemes
– 2RnW (Power)
• AMM has lower power than CMM (even for non-table-based designs)
• For AMM, access data … random bank … access data AMM
• For 2R8W, HB-NTX-RdWr attains 45.59% to 2.1x lower power
from 512K to 16M
• For 2R8W, TBLVT_B-NTX-Rd attains 3.29% to
4.71x lower power from 1K to 16M
C. Overall Performance and Cost
- Non-Table-Based vs. Table-Based (Latency)
• The non-table-based schemes have shorter access
latencies thanks to simple logic operations; table-based
schemes are impacted by the routing path to the lookup table
• The latency of AMM is mainly determined by the SRAM
modules for greater memory depths
• For example, for 1R2W, from 1K to 1M, memory cells account for:
• NTX-Wr: 28.82% to 71.55% of overall latency
• TBLVT: 26.94% to 69.61% of overall latency
C. Overall Performance and Cost
- Non-Table-Based vs. Table-Based (Area)
• The area of AMM is mainly determined by the SRAM
modules
• For example, for 1R2W, from 1K to 1M, memory cells account for:
• NTX-Wr: 92.52% to 99.99% of overall area
• TBLVT: 93.11% to 99.99% of overall area
• Table-based …; table-based designs attain smaller area
than non-table-based designs
C. Overall Performance and Cost
- Non-Table-Based vs. Table-Based (Power)
• The power of AMM is mainly determined by the SRAM
modules
• For example, for 1R2W, from 1K to 1M, memory cells account for:
• NTX-Wr: 89.27% to 99.97% of overall power
• TBLVT: 88.9% to 99.97% of overall power
• Table-based …; table-based designs attain lower power
than non-table-based designs
D. Impact of Banking Structure
– Non-Table-Based (Latency)
• Banking structure is a tradeoff between memory cells and logic
• The best banking structure differs from design to
design
• For example, for 2R1W, the 32-bank structure has the shortest latency
among all the banking structures
D. Impact of Banking Structure
– Non-Table-Based (Area)
• Area efficiency: (Area)/(data size), data … memory cell area
• 8-bank has the smallest area among all the designs
• Area of the memory cells is the dominant factor of overall
area:
• Logic occupies only 0.0782% (1 bank) to 8.87% (256 banks)
of overall area
[Figure: area efficiency versus banking structure; bar labels in extraction order: 2, 1.502, 1.253, 1.13, 1.17, 1.246, 1.346, 1.566, 1.80]
D. Impact of Banking Structure
– Non-Table-Based (Power)
• Power efficiency: (Power)/(data size), data … power access
• 8-bank has the lowest power among all the designs
• Power of the memory cells is the dominant factor of overall
power:
• Logic occupies only 0.201% (1 bank) to 10.653%
(256 banks) of overall power
[Figure: power efficiency versus banking structure; bar labels in extraction order: 1.68, 1.29, 1.17, 1.13, 1.27, 1.46, 1.74, 1.866, 2.01]
E. Scalability with Memory Depths and
Number of Ports - Latency
• The AMM performance is mainly determined by
the SRAM modules
• For 2R4W HB-NTX-RdWr, (64K to 128K)
& (256K to 512K) … latency
• For 2R4W HB-NTX-RdWr, (128K to 256K)
… latency
E. Scalability with Memory Depths and
Number of Ports - Area
• The AMM performance is mainly determined
by the SRAM modules
• HB-NTX-RdWr attains smaller area than
NTX-Wr_B-NTX-Rd
• For 2R8W, HB-NTX-RdWr attains 11.21% to
2.33x smaller area than NTX-Wr_B-NTX-Rd
from 16K to 1M
E. Scalability with Memory Depths and
Number of Ports - Power
• The AMM performance is mainly determined by
the SRAM modules
• HB-NTX-RdWr attains lower power than
NTX-Wr_B-NTX-Rd
• For 2R8W, HB-NTX-RdWr attains 9.51% to
55.39% lower power than NTX-Wr_B-NTX-Rd
from 16K to 1M
F.(1) Different Basic SRAM Modules
- Latency
• For a wide range of sizes, all these basic SRAM
modules show very similar performance and cost
F.(1) Different Basic SRAM Modules
- Area
• For a wide range of sizes, all these basic SRAM
modules show very similar performance and cost
F.(1) Different Basic SRAM Modules
- Power
• For a wide range of sizes, power
consumption: 2RW > 1R1W/2R > 1R1W
• AMM … 2RW …
2RW + 1R1W (power)
F.(2) AMM Designs with Higher Port
Counts Circuit-level Modules - Latency
• CMM could outperform AMM in certain configurations
• Can we attain better performance by properly choosing the
SRAM modules?
• We apply three different SRAM modules (2RW, 2R2W, 2R4W),
using 4R2W as an example
• For 4R2W, AMM with 2RW SRAMs attains 6.08% to 2.032x
shorter latency from 1K to 16M than AMM with 2R2W SRAMs
• This is because the latency of HB-NTX-RdWr is mainly
determined by the SRAM module, and 2RW is faster (than
2R2W and 2R4W)
F.(2) AMM Designs with Higher Port
Counts Circuit-level Modules - Area
• But using more complex SRAMs does provide a
benefit in area and power
• AMM …
(e.g. 4R2W: 54 … 2 …)
• For 4R2W, AMM with 2R2W SRAMs attains
30.28% to 71.43% smaller area from 1K to 16M
than AMM with 2RW SRAMs
F.(2) AMM Designs with Higher Port
Counts Circuit-level Modules – Power
• But using more complex SRAMs does provide a
benefit in area and power
• AMM …
(e.g. 4R2W: 54 … 2 …)
• For 4R2W, AMM with 2R2W SRAMs attains
2.52x to 3.42x lower power from 1K to 16M
than AMM with 2RW SRAMs
Conclusions
Summary of Experiments on AMM
Studies
• AMM attains better performance (latency/area/power) than CMM, and the
benefits become more significant for designs with more ports and greater depth
• The performance of AMM is mainly determined by the algorithmic logic when
the memory is shallow. The building SRAM modules become the main
performance factor for deep memories.
• Non-table-based AMMs have shorter latencies than table-based
designs. Table-based AMMs have smaller area and lower power consumption than
non-table-based AMMs
• A proper banking structure enhances performance, while excessively
aggressive banking can induce significant overhead and a performance hit
• Choosing proper SRAMs with higher port counts as building modules can improve
the area and power of AMM designs
Conclusions
• Most previous work on AMM was conducted on FPGA-based
platforms, where AMMs are implemented by composing multiple BRAMs and logic-slice
LUTs (lookup tables)
• This thesis comprehensively analyzes and explores algorithmic
multi-ported memory on ASICs
• Different basic SRAM modules
• Scalability with memory depths and number of ports for AMM designs
• Applying banking structures for AMM designs
• Circuit-level schemes vs. algorithmic schemes for different port configurations
• Choosing proper SRAM modules with higher port counts can enhance the
performance of AMM designs
Thanks for listening
References
•[1] Abdel-Hafeez, Saleh M., and Anas S. Matalkah. "CMOS eight-transistor memory cell for low-dynamic-power high-speed embedded
SRAM." Journal of Circuits, Systems, and Computers 17.05 (2008): 845-863.
•[2] Bhagyalakshmi, I. V., Ravi Teja, and Madhan Mohan. "Design and VLSI Simulation of SRAM Memory Cells for Multi-ported SRAM’s." (2014).
•[3] Rivest, Ronald L., and Lance A. Glasser. A Fast-Multiport Memory Based on Single-Port Memory Cells. No. MIT/LCS/TM-455. MASSACHUSETTS
INST OF TECH CAMBRIDGE LAB FOR COMPUTER SCIENCE, 1991.
•[4] Park, Seon-yeong, et al. "CFLRU: a replacement algorithm for flash memory." Proceedings of the 2006 international conference on Compilers,
architecture and synthesis for embedded systems. ACM, 2006.
•[5] Synopsys Design Compiler User Guide Version X-2005.09. [Online] Available:
http://beethoven.ee.ncku.edu.tw/testlab/course/VLSIdesign_course/course_96/Tool/Design_Compiler%20_User_Guide.pdf
•[6] Synopsys Design Compiler Optimization Reference Manual Version D-2010.03. [Online] Available: http://cleroux.vvv.enseirb-
matmeca.fr/EN219/doc/dcrmo.pdf
•[7] LaForest, Charles Eric, and J. Gregory Steffan. "Efficient Multi-ported Memories for FPGAs." Proceedings of the 18th annual ACM/SIGDA
international symposium on Field programmable gate arrays (FPGA), pp. 41-50, ACM, 2010.
•[8] Charles Eric LaForest, Ming Gang Liu, Emma Rae Rapati, and J. Gregory Steffan. "Multi-ported Memories for FPGAs via XOR," In Proceedings of the
20th annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), pp. 209–218, ACM, 2012.
•[9] Charles Eric Laforest, Zimo Li, Tristan O'rourke, Ming G. Liu, and J. Gregory Steffan. "Composing Multi-Ported Memories on FPGAs," in
Proceedings of the ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol.7, issue 3, article no. 16, 2014.
•[10] Lin, Jiun-Liang, and Bo-Cheng Charles Lai. "BRAM Efficient Multi-ported Memory on FPGA." VLSI Design, Automation and Test (VLSI-DAT), 2015
International Symposium on. IEEE, 2015.
•[11] Lai, Bo-Cheng Charles, and Jiun-Liang Lin. "Efficient Designs of Multiported Memory on FPGA." IEEE Transactions on Very Large Scale Integration
(VLSI) Systems (2016).
•[12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on FPGA." IEEE
Transactions on Very Large Scale Integration (VLSI) Systems (2017).
•[13] S. Iyer and D. Chuang. (Jan. 2012) "Algorithmic Memory Brings an Order of Magnitude Performance Increase to Next Generation SoC Memories."
DesignCon, accessed on Jun. 22, 2017. [Online] Available: http://www.yuba.stanford.edu/sundaes/Papers/DesignCon-AlgMem.pdf
•[14] Tse, David N. C., Pramod Viswanath, and Lizhong Zheng. "Diversity-multiplexing tradeoff in multiple-access channels." IEEE Transactions on
Information Theory 50.9 (2004): 1859-1874.
•[15] Ping, Li, et al. "Interleave division multiple-access." IEEE Transactions on Wireless Communications 5.4 (2006): 938-947.
•[16] Suhendra, Vivy, Chandrashekar Raghavan, and Tulika Mitra. "Integrated scratchpad memory optimization and task scheduling for MPSoC
architectures." Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems. ACM, 2006.
•[17] Iyer, Sundar, and Shang-Tse Chuang. "High speed memory systems and methods for designing hierarchical memory systems." U.S. Patent
Application No. 12/806,631.
•[18] Wilton, Steven JE, and Norman P. Jouppi. "CACTI: An enhanced cache access and cycle time model." IEEE Journal of Solid-State Circuits 31.5
(1996): 677-688.
•[19] Muralimanohar, Naveen, Rajeev Balasubramonian, and Norman P. Jouppi. "CACTI 6.0: A tool to model large caches." HP Laboratories (2009):
22-31.
•[20] Muralimanohar, Naveen, Rajeev Balasubramonian, and Norm Jouppi. "Optimizing NUCA organizations and wiring alternatives for large caches
with CACTI 6.0." Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2007.
•[21] Muralimanohar, Naveen, Rajeev Balasubramonian, and Norman P. Jouppi. "Architecting efficient interconnects for large caches with CACTI
6.0." IEEE micro 28.1 (2008).
•[22] Thoziyoor, Shyamkumar, et al. CACTI 5.1. Technical Report HPL-2008-20, HP Labs, 2008.
•[23] Synopsys Design Compiler Standard Cell Library, including TSMC, UMC and SMIC. [Online] Available:
https://www.synopsys.com/dw/ipdir.php?ds=dwc_standard_cell
•[24] TSMC Standard Cell Library (including 45nm, 90nm advanced technology) Description Name. [Online] Available: http://www.europractice-
ic.com/libraries_TSMC.php
•[25] Bo-Cheng Charles Lai, Jiun-Liang Lin, Kun-Hua Huang, and Kuo-Cheng Lu. "Method for accessing multi-port memory module, method for
increasing write ports of memory module and associated memory controller." U.S. Patent Application No. 15/098,330.
•[26] Bo-Cheng Charles Lai, Jiun-Liang Lin, and Kuo-Cheng Lu. "Method for accessing multi-port memory module and associated memory controller."
U.S. Patent Application No. 15/098,336.
•[27] Tseng, Jessica H., and Krste Asanović. "Banked multiported register files for high-frequency superscalar microprocessors." ACM SIGARCH
Computer Architecture News. Vol. 31. No. 2. ACM, 2003.
•[28] Kim, John. "Low-cost router microarchitecture for on-chip networks." Proceedings of the 42nd Annual IEEE/ACM International Symposium on
Microarchitecture. ACM, 2009.
•[29] Gupta, Pankaj, Steven Lin, and Nick McKeown. "Routing lookups in hardware at memory access speeds." INFOCOM'98. Seventeenth Annual Joint
Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE. Vol. 3. IEEE, 1998.
•[30] Hughes, John H. "Routing table lookup implemented using M-trie having nodes duplicated in multiple memory Banks." U.S. Patent No.
6,308,219. 23 Oct. 2001.
•[31] McAuley, Anthony J., Paul F. Tsuchiya, and Daniel V. Wilson. "Fast multilevel hierarchical routing table lookup using content addressable
memory." U.S. Patent No. 5,386,413. 31 Jan. 1995.
•[32] Teitenberg, Tim, and Bikram Singh Bakshi. "Efficient memory management for channel drivers in next generation I/O system." U.S. Patent No.
6,421,769. 16 Jul. 2002.
•[33] Treleaven, Philip C., David R. Brownbridge, and Richard P. Hopkins. "Data-driven and demand-driven computer architecture." ACM Computing
Surveys (CSUR) 14.1 (1982): 93-143.
•[34] Peng, Zebo, and Krzysztof Kuchcinski. "Automated transformation of algorithms into register-transfer level implementations." IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems 13.2 (1994): 150-166.
•[35] Keshav, Srinivasan, and Rosen Sharma. "Issues and trends in router design." IEEE Communications magazine 36.5 (1998): 144-151.
•[36] Tullsen, Dean M., et al. "Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor." ACM
SIGARCH Computer Architecture News. Vol. 24. No. 2. ACM, 1996.
•[37] Xilinx 7 Series FPGAs Configurable Logic Block User Guide. [Online] Available:
http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf
•[38] Fetzer, E. S., Gibson, M., Klein, A., Calick, N., Zhu, C., Busta, E., & Mohammad, B. (2002). "A fully bypassed six-issue integer datapath and register
file on the Itanium-2 microprocessor." IEEE Journal of Solid-State Circuits Conference, vol. 1, Feb. 2002, pp. 420-478.
•[39] Bajwa, H., and X. Chen. "Low-Power High-Performance and Dynamically Configured Multi-port Cache Memory Architecture." Electrical
Engineering, 2007. ICEE'07. International Conference on. IEEE, April, 2007.
•[40] S. Ben-David, A. Borodin, R. Karp, G. Tardos, and A. Wigderson, “On the Power of Randomization in On-line Algorithms”, New York: Springer,
1994.


Algorithmic Multi-ported Memory(MEM) - Comprehensive Techniques Guideline

  • 1. Towards Algorithmic Multi-ported Memory: Techniques and Design Trade-offs
      • Student: Chun-Feng Chen
      • Advisor: Bo-Cheng Charles Lai
      • Mar. 8th, 2018
      • NCTU Institute of Electronics, Parallel Computing System Lab (NCTU PCS), Hsinchu, R.O.C
  • 2. Outline
      • Introduction
      • Background
      • Non-Table-Based Approaches
      • Table-Based Approaches
      • Performance and Impact of Design Factors
      • Conclusions
      • References
  • 4. Introduction
      • Algorithmic Multi-ported Memory (AMM)
          • Multi-ported memories are important functional modules in modern digital systems
          • E.g. shared caches in multi-core processors, routing tables of switches, etc.
          • AMM composes simple SRAMs and logic to support multiple reads and writes
          • Potential to attain better performance than circuit-based approaches (CMM)
      • Most previous work targets FPGAs
          • Laforest, Charles Eric, et al. "Efficient Multi-ported Memories for FPGAs" [ACM 2010]
          • Charles Eric Laforest, et al. "Multi-ported Memories for FPGAs via XOR" [ACM 2012]
          • Charles Eric Laforest, et al. "Composing Multi-Ported Memories on FPGAs" [ACM 2014]
          • Jiun-Liang Lin, et al. "BRAM Efficient Multi-ported Memory on FPGAs" [VLSI-DAT 2015]
          • Jiun-Liang Lin, et al. "Efficient Designs of Multi-ported Memory on FPGAs" [TVLSI 2016]
          • Kun-Hua Huang, et al. "An Efficient Hierarchical Banking Structure for Algorithmic Multi-ported Memory on FPGAs" [TVLSI 2017]
          • Sundar Iyer, et al. "Algorithmic Memory Brings an Order of Magnitude Performance Increase to Next Generation SoC Memories" [DesignCon 2012]
  • 5. Motivation: FPGA Limits AMM Exploration
      • The limited resources on an FPGA constrain AMM exploration
          • Limited number of BRAMs, flip-flops, slice LUTs, etc.
          • Unable to explore important design factors of AMM: number of ports, memory depth, banking structures
          • AMM shows a more significant benefit at greater depth
          • E.g. from 512K to 16M depth, AMM 4R1W attains 1.25% to 36.47% shorter latency than circuit-based approaches
      • BRAM size is fixed
          • 1K depth of 32-bit data width
          • Unable to explore the impact of different bank sizes
          • E.g. choosing a proper banking structure for 2R1W can improve latency/area/power by up to 9.53%/56.86%/33.39%
      • BRAM port configuration is fixed
          • Dual-port 2RW mode
          • Unable to explore the impact of different bank port configurations (4R1W, 2R2W, etc.)
          • E.g. choosing a proper building memory cell for 8R4W can improve area/power by up to 70.0%/6.37x
  • 6. Our Contributions
      • Implement all the AMM designs on 45nm technology
          • Use SRAM as the building memory cell
      • Explore important design factors of AMM
          • Different AMM designs, memory depths, port configurations, banking structures, building memory cells, etc.
      • Extensive experiments and comprehensive analysis
          • Summarize observations into design guidelines
  • 8. Algorithmic Multi-ported Memory (AMM) Technique Categories
      • Non-table-based schemes
          • Duplicate memory modules
          • E.g. NTRep-Rd [ACM 2010]
      • Table-based schemes
          • Adopt lookup tables to track the address where the up-to-date data is stored
          • E.g. TBLVT [ACM 2010]
  • 9. AMM: Previously Proposed Designs
      • Non-table-based approaches
          • Use multiple banks to support multiple accesses
          • Store parity data to support multiple reads and enable multiple writes [VLSI-DAT'15, TVLSI'16, TVLSI'17]
          • HB-NTX-RdWr can scale the number of ports with a systematic flow [TVLSI'17]
      • Table-based approaches
          • Use multiple memory modules to support multiple accesses
          • Use lookup tables to avoid module conflicts and track the most up-to-date values [VLSI-DAT'15, TVLSI'16, TVLSI'17]
  • 10. Non-Table-Based Approaches Multiple Reads: (1) Non-Table-Based Replication (NTRep-Rd) (2) Non-Table-Based XOR (NTX-Rd) (3) Hierarchical Banking Non-Table XOR-Based (HB-NTX-Rd) Multiple Write: (1) Non-Table-Based XOR (NTX-Wr) (2) Hierarchical Banking Non-Table XOR-Based (HB-NTX-Wr) Multiple Reads and Writes: (1) Hierarchical Banking Non-Table XOR-Based (HB-NTX-RdWr)
  • 12. Non-Table-Based Replication Multiple Reads (NTRep-Rd) [ACM'10, TRETs'14]
      • An mR1W memory module built with the NTRep-Rd technique
      • Duplicate memory modules to support multiple read ports
      • The single write port connects to every memory module
      • [7] LaForest, Charles Eric, and J. Gregory Steffan. "Efficient Multi-ported Memories for FPGAs." Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGAs). ACM, 2010.
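The replication idea on this slide can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the class name `ReplicatedMemory` and its methods are hypothetical, each replica is modeled as a plain Python list, and cycle timing is ignored.

```python
class ReplicatedMemory:
    """Behavioral sketch of an mR1W memory: one replica per read port.

    Hypothetical model for illustration; not taken from the slides.
    """

    def __init__(self, depth, num_read_ports):
        # One full-depth copy of the data for every read port.
        self.replicas = [[0] * depth for _ in range(num_read_ports)]

    def write(self, addr, value):
        # The single write port is wired to every replica,
        # so all copies stay identical.
        for replica in self.replicas:
            replica[addr] = value

    def read(self, port, addr):
        # Each read port owns its replica, so reads never conflict.
        return self.replicas[port][addr]


mem = ReplicatedMemory(depth=8, num_read_ports=3)
mem.write(5, 42)
values = [mem.read(port, 5) for port in range(3)]
```

The cost of this scheme is storage: the model keeps `num_read_ports` full copies of the data, which is what the XOR-based schemes on the following slides try to reduce.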
  • 14. Non-Table-Based XOR Multiple Reads (NTX-Rd) [DesignCon'12]
      • A 2R1W memory module built with the NTX-Rd technique
      • Write request
          • W0 stores directly to BANK0 and reads D0 from BANK1
          • Update the XOR-BANK with W0' = W0 ⊕ D0
      • Read request
          • R1 reads directly
          • R0 reads the other banks to recover the correct data: R0 = (W0 ⊕ D0) ⊕ D0
      • NTX-Rd supports two modes
          • Case 1: 3R (no write request)
          • Case 2: 2R1W (one write request)
      • [13] Sundar Iyer, Shang-Tse Chuang, and Co-Founder & CTO Memoir Systems. "Algorithmic Memory: An Order of Magnitude Performance Increase for Next Generation SoCs." DesignCon. (http://www.designcon.com), 2012.
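A minimal behavioral sketch of the XOR read recovery described above, assuming two data banks of equal depth plus one XOR bank that always holds `BANK0[i] ^ BANK1[i]`. The class and method names are hypothetical, and per-cycle port arbitration is reduced to a single `read_pair` call.

```python
class XorRead2R1W:
    """Sketch of a 2R1W memory built from two data banks and an XOR bank."""

    def __init__(self, bank_depth):
        self.bank_depth = bank_depth
        self.banks = [[0] * bank_depth, [0] * bank_depth]
        # Invariant: xor_bank[i] == banks[0][i] ^ banks[1][i]
        self.xor_bank = [0] * bank_depth

    def _split(self, addr):
        # Map a flat address to (bank index, offset inside the bank).
        return divmod(addr, self.bank_depth)

    def write(self, addr, value):
        bank, off = self._split(addr)
        other = self.banks[1 - bank][off]   # D0 read from the other bank
        self.banks[bank][off] = value       # W0 stored directly
        self.xor_bank[off] = value ^ other  # W0' = W0 xor D0

    def read_direct(self, addr):
        bank, off = self._split(addr)
        return self.banks[bank][off]

    def read_recovered(self, addr):
        # Rebuild the value without touching its own bank:
        # (W0 xor D0) xor D0 == W0
        bank, off = self._split(addr)
        return self.xor_bank[off] ^ self.banks[1 - bank][off]

    def read_pair(self, addr0, addr1):
        # R1 always reads directly; R0 falls back to recovery on a bank conflict.
        r1 = self.read_direct(addr1)
        if self._split(addr0)[0] == self._split(addr1)[0]:
            r0 = self.read_recovered(addr0)
        else:
            r0 = self.read_direct(addr0)
        return r0, r1


mem = XorRead2R1W(bank_depth=4)
mem.write(1, 0xA)   # BANK0, offset 1
mem.write(5, 0x3)   # BANK1, offset 1
mem.write(2, 0x7)   # BANK0, offset 2
conflict_reads = mem.read_pair(1, 2)  # both addresses live in BANK0
```

Both reads in `conflict_reads` target BANK0, yet only one of them touches that bank; the other is served from BANK1 and the XOR bank.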
  • 16. Hierarchical Banking Non-Table XOR-Based Multiple Reads (HB-NTX-Rd) [TVLSI'17]
      • 4 read and 1 write memory
      • Scale the B-NTX-Rd to more reads in a hierarchical structure
      • Use the 2R1W/3R as building modules
      • Case 1: 5R (no write request)
          • Three reads access BANK0
          • Other two reads access the other banks
      • Case 2: 4R1W (one write request)
          • W0 and two reads access BANK0
          • Other two reads access the other banks
          • W0 stores directly, and reads BANK1 for updating XOR-BANK
      • [12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.10 (2017): 2776-2788.
  • 18. Non-Table-Based XOR Multiple Writes (NTX-Wr) [ACM'12, TRETs'14]
      • A 1R2W memory module built with the NTX-Wr technique
      • Duplicate memory modules and store XOR-encoded values to support multiple reads and writes
      • Stored encodings: W0' = W0 ⊕ D0, W1' = W1 ⊕ D1
      • [9] LaForest, Charles Eric, et al. "Multi-ported Memories for FPGAs via XOR." Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGAs). ACM, 2012.
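The XOR encoding for multiple writes can be sketched as follows. This is a simplified behavioral model under assumptions not stated on the slide: one bank per write port, a hypothetical class `XorWriteMemory`, and writes applied one after another instead of in the same clock cycle. A write stores its value XORed with what the other banks hold at that address, and a read XORs all banks together to decode.

```python
from functools import reduce
from operator import xor


class XorWriteMemory:
    """Sketch of a 1RnW memory that keeps one XOR-encoded bank per write port."""

    def __init__(self, depth, num_write_ports):
        self.banks = [[0] * depth for _ in range(num_write_ports)]

    def write(self, port, addr, value):
        # Encode: W' = W xor (contents of all other banks at this address).
        others = reduce(
            xor,
            (bank[addr] for i, bank in enumerate(self.banks) if i != port),
            0,
        )
        self.banks[port][addr] = value ^ others

    def read(self, addr):
        # Decode: XOR of every bank cancels the stale values and leaves
        # the most recently written one.
        return reduce(xor, (bank[addr] for bank in self.banks), 0)


mem = XorWriteMemory(depth=8, num_write_ports=2)
mem.write(0, 3, 0x55)      # port 0 writes address 3
mem.write(1, 4, 0x0F)      # port 1 writes address 4
first = mem.read(3)
mem.write(1, 3, 0x99)      # port 1 later overwrites address 3
second = mem.read(3)
```

No lookup table is needed to find the latest value, at the price of reading every bank on each access.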
  • 20. Hierarchical Banking Non-Table XOR-Based Multiple Writes (HB-NTX-Wr) [TVLSI'17]
      • Two cases: (a) Non-Conflict-Write case, (b) Conflict-Write case
      • W0' = W0 ⊕ Ref0
      • W1' = W1 ⊕ Ref1
      • Ref1new = W1 ⊕ (D0 ⊕ Ref1cur)
      • W1'' = D1 ⊕ Ref1new
      • [12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.10 (2017): 2776-2788.
  • 22. Hierarchical Banking Non-Table XOR-Based Multiple Reads and Writes (HB-NTX-RdWr) [TVLSI'17]
      • 2 read and 2 write memory
      • Integrates HB-NTX-Rd and HB-NTX-Wr to enable multiple reads and writes
      • Use HB-NTX-Rd 4R1W/5R as building memory modules
      • Two cases: (a) Non-Conflict-Write case, (b) Conflict-Write case
      • [12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.10 (2017): 2776-2788.
  • 23. HB-NTX-RdWr Systematic Flow [TVLSI'17]
      • The top-down flow increases read ports with HB-NTX-Rd, while the left-right flow increases write ports with HB-NTX-Wr
  • 24. Table-Based Approaches Multiple Reads and Writes: (1) Table-Based Live Value Table (TBLVT) (2) Table-Based Remap Table (TBRemap) (3) Enhancing Table-Based with Reduce Lookup Tables (TBLVT_HB-NTX-RdWr)
  • 26. Table-Based Live Value Table (TBLVT) [ACM'10, TRETs'14]
      • Write request
          • Each write is dedicated to a certain memory module
          • The lookup table (LVT) tracks the latest location
      • Read request
          • Queries the LVT first, then accesses the data from the correct memory location
      • LVT size: log2(#NumModules) x MemoryDepth
      • [7] LaForest, Charles Eric, and J. Gregory Steffan. "Efficient Multi-ported Memories for FPGAs." Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGAs). ACM, 2010.
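The live value table mechanism can be sketched as a behavioral model. The names below (`LiveValueTableMemory`, `lvt`) are hypothetical, one memory module is assumed per write port, and the read-port replication inside each module is left out for brevity.

```python
class LiveValueTableMemory:
    """Sketch of a table-based multi-write memory using a live value table."""

    def __init__(self, depth, num_write_ports):
        # One module per write port, so writes never collide on a module.
        self.modules = [[0] * depth for _ in range(num_write_ports)]
        # lvt[addr] holds the index of the module written most recently.
        self.lvt = [0] * depth

    def write(self, port, addr, value):
        self.modules[port][addr] = value
        self.lvt[addr] = port  # record where the live value now resides

    def read(self, addr):
        # Query the LVT first, then fetch from the module it points to.
        return self.modules[self.lvt[addr]][addr]


mem = LiveValueTableMemory(depth=16, num_write_ports=4)
mem.write(0, 9, 111)
mem.write(3, 9, 222)   # a later write through another port
mem.write(2, 1, 333)
latest = mem.read(9)
```

Each LVT entry only needs to distinguish the modules, which is why its width grows with log2 of the module count, as stated in the size formula on the slide.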
  • 28. Table-Based Remap (TBRemap) - [VLSI-DAT’15, TVLSI’16]
      • Remap functions:
          • Apply a banking structure
          • All reads and writes check the remap table to determine which memory bank to access
          • Use a HWC to distribute the multiple writes, and a remap table to track the latest location
      • Remap table size: ([log2(#DataBanks + 1)] - 1) x MemoryDepth
      • [11] Lai, Bo-Cheng Charles, and Jiun-Liang Lin. "Efficient Designs of Multi-ported Memory on FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.1 (2017): 139-150.
  • 30. Enhancing Table-Based Design with a Reduced Lookup Table
      • NTRep-Rd mR1W modules are replaced by HB-NTX-RdWr mRnW modules
      • This reduces the lookup table size and uses fewer modules, which alleviates the routing complexity on the latency-critical path
      • Example: a 2R4W 8K-depth memory
          • The original TBLVT needs four NTRep-Rd 2R1W building modules: LVT size = 2-bit x 8K-depth
          • The enhanced TBLVT needs two HB-NTX-RdWr 2R2W building modules: LVT size = 1-bit x 8K-depth
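The 2R4W example can be checked against the TBLVT size rule, log2(#NumModules) x MemoryDepth. The helper below is a minimal sketch; the name `lvt_size_bits` is hypothetical, and rounding the entry width up to a whole bit for non-power-of-two module counts is an assumption the slides do not state.

```python
def lvt_size_bits(num_modules, depth):
    """Return (bits per LVT entry, total LVT bits) for a table-based design."""
    # Integer form of ceil(log2(num_modules)); the ceiling is an assumption
    # for module counts that are not powers of two.
    bits_per_entry = (num_modules - 1).bit_length()
    return bits_per_entry, bits_per_entry * depth


DEPTH_8K = 8 * 1024

# Original TBLVT: four NTRep-Rd 2R1W modules.
original = lvt_size_bits(4, DEPTH_8K)
# Enhanced TBLVT: two HB-NTX-RdWr 2R2W modules.
enhanced = lvt_size_bits(2, DEPTH_8K)

print(original, enhanced)
```

Halving the module count removes one bit from every LVT entry, which matches the 2-bit versus 1-bit figures on the slide.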
  • 31. Performance and Impact of Design Factors
      • A. Experiments Setup
      • B. Circuit-level vs. Algorithmic Scheme
      • C. Overall Performance and Cost (Non-Table-Based vs. Table-Based)
      • D. Impact of Banking Structure
      • E. Scalability with Memory Depths and Number of Ports
      • F. Proper Tradeoff Between Circuit-level and Algorithmic Memory
          • F.(1) Different Basic SRAM Modules
          • F.(2) AMM Designs with Higher Port Counts Circuit-level Modules
  • 33. A. Experiments Setup - Read/Write Path Logic
      • The block diagram of an AMM architecture includes write-path logic, read-path logic, and building memory cells
      • The write path performs data manipulation, e.g. replication and lookup table handling
      • The read path retrieves the data from the memory cells and decodes the correct data
      • Read-path and write-path logic: RTL synthesized with Design Compiler using TSMC 45nm
  • 34. A. Experiments Setup - Memory Cells (SRAM)
      • The block diagram of an AMM architecture includes write-path logic, read-path logic, and building memory cells
      • Memory cells are composed of SRAMs, e.g. in single-port or dual-port mode
      • Memory cells: the CACTI integrated memory model estimates the performance of different SRAM modules with TSMC 45nm
  • 35. A. Experiments Setup - Algorithmic Multi-ported Memory Architecture
      • By combining the synthesis results of the read-path and write-path logic with the estimates from CACTI, we can evaluate the overall performance and cost of an AMM design
      • Write-path logic + read-path logic + memory cells = overall performance of an AMM design
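The evaluation flow can be sketched as adding up the three components and asking what share of each total comes from the memory cells. A plain sum is an assumption here; the slides do not spell out how the latency, area and power figures are merged. The names `Component`, `combine` and `memory_share` are hypothetical, and the numbers are placeholders rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class Component:
    latency: float
    area: float
    power: float


def combine(write_path, read_path, memory_cells):
    """Assumed model: overall cost is the plain sum of the three parts."""
    parts = (write_path, read_path, memory_cells)
    return Component(
        latency=sum(p.latency for p in parts),
        area=sum(p.area for p in parts),
        power=sum(p.power for p in parts),
    )


def memory_share(overall, memory_cells):
    """Fraction of each overall metric contributed by the memory cells."""
    return {
        "latency": memory_cells.latency / overall.latency,
        "area": memory_cells.area / overall.area,
        "power": memory_cells.power / overall.power,
    }


# Placeholder values only: logic as it might come from synthesis, cells from CACTI.
write_logic = Component(latency=0.25, area=1.0, power=0.5)
read_logic = Component(latency=0.25, area=1.0, power=0.5)
cells = Component(latency=0.5, area=6.0, power=3.0)

overall = combine(write_logic, read_logic, cells)
share = memory_share(overall, cells)
```

A share of this kind is what the later slides report when they state how much of the overall latency, area or power the memory cells account for.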
  • 37. B. Circuit-level vs. Algorithmic Schemes - 2RnW (Latency)
      • CMM has shorter latency for shallow memory depths (< 64K); AMM has shorter latency for greater memory depths (> 128K)
      • AMM is more scalable as memory depth increases. For 2R8W, from 16K to 16M, the latency increases by: HB-NTX-RdWr (1.52x), TBLVT_B-NTX-Rd (2.85x), CMM (23.46x)
      • Non-table designs have shorter latency than table-based designs. For 2R8W, HB-NTX-Rd attains 4.23% to 95.09% shorter latency than TBLVT_B-NTX-Rd from 16K to 16M
  • 38. B. Circuit-level vs. Algorithmic Schemes - mR1W (Area)
      • Non-table-based AMM attains smaller area when compared with CMM:
          • 2R1W: 6.38% to 36.2% smaller
          • 4R1W: 15.27% to 67% smaller
          • 8R1W: 59.33% to 3.01x smaller
  • 39. B. Circuit-level vs. Algorithmic Schemes - 2RnW (Area)
      • Non-table-based vs. CMM
      • Table-based designs still attain smaller area than CMM. For 2R8W, TBLVT_B-NTX-Rd attains 2.01% to 22.79% smaller area from 64K to 16M
      • Memory cells: table-based designs attain smaller area than non-table-based designs
  • 40. B. Circuit-level vs. Algorithmic Schemes - 2RnW (Power)
      • AMM has lower power (even for non-table-based designs). For AMM, access data random bank access data AMM
      • For 2R8W, HB-NTX-RdWr attains 45.59% to 2.1x lower power from 512K to 16M
      • For 2R8W, TBLVT_B-NTX-Rd attains 3.29% to 4.71x lower power from 1K to 16M
  • 42. C. Overall Performance and Cost - Non-Table-Based vs. Table-Based (Latency)
      • Non-table-based schemes have shorter access latencies because they use simple logic operations; table-based schemes are slowed by the routing path to the lookup table
      • For greater memory depths, the latency of AMM is mainly determined by the SRAM modules. For example, for 1R2W from 1K to 1M, memory cells account for:
          • NTX-Wr: 28.82% to 71.55% of overall latency
          • TBLVT: 26.94% to 69.61% of overall latency
  • 43. C. Overall Performance and Cost - Non-Table-Based vs. Table-Based (Area)
      • The area of AMM is mainly determined by the SRAM modules. For example, for 1R2W from 1K to 1M, memory cells account for:
          • NTX-Wr: 92.52% to 99.99% of overall area
          • TBLVT: 93.11% to 99.99% of overall area
      • Table-based designs attain smaller area than non-table-based designs
  • 44. C. Overall Performance and Cost - Non-Table-Based vs. Table-Based (Power)
      • The power of AMM is mainly determined by the SRAM modules. For example, for 1R2W from 1K to 1M, memory cells account for:
          • NTX-Wr: 89.27% to 99.97% of overall power
          • TBLVT: 88.9% to 99.97% of overall power
      • Table-based designs consume less power than non-table-based designs
  • 46. D. Impact of Banking Structure - Non-Table-Based (Latency)
      • The banking structure is a tradeoff between memory cells and logic
      • The best banking structure differs from design to design
      • For example, for 2R1W, the 32-bank structure has the shortest latency of all the banking structures
  • 47. D. Impact of Banking Structure - Non-Table-Based (Area)
      • Area efficiency: (Area)/(data size), where data size is the area of the data memory cells
      • The 8-bank structure has the smallest area of all the designs
      • The memory-cell area dominates the overall area: logic occupies only 0.0782% (1-bank) to 8.87% (256-bank) of the overall area
      • Chart data labels (area efficiency, in slide order): 2, 1.502, 1.253, 1.13, 1.17, 1.246, 1.346, 1.566, 1.80
  • 48. D. Impact of Banking Structure - Non-Table-Based (Power)
      • Power efficiency: (Power)/(data size), where data size is the power of accessing the data
      • The 8-bank structure has the lowest power of all the designs
      • The memory-cell power dominates the overall power: logic occupies only 0.201% (1-bank) to 10.653% (256-bank) of the overall power
      • Chart data labels (power efficiency, in slide order): 1.68, 1.29, 1.17, 1.13, 1.27, 1.46, 1.74, 1.866, 2.01
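Selecting a banking structure from the efficiency figures can be sketched as a simple minimum search. The values below are the nine data labels from the area and power charts; pairing them with bank counts 1, 2, 4, ..., 256 in order is an assumption, suggested by the 1-bank and 256-bank endpoints the slides mention. The helper name `best_bank` is hypothetical.

```python
# Assumed mapping: the nine chart labels correspond to 1, 2, 4, ..., 256 banks.
BANKS = [1, 2, 4, 8, 16, 32, 64, 128, 256]
AREA_EFF = [2, 1.502, 1.253, 1.13, 1.17, 1.246, 1.346, 1.566, 1.80]
POWER_EFF = [1.68, 1.29, 1.17, 1.13, 1.27, 1.46, 1.74, 1.866, 2.01]


def best_bank(banks, efficiency):
    """Return the bank count with the lowest (best) efficiency ratio."""
    # Pair each ratio with its bank count and take the smallest ratio.
    return min(zip(efficiency, banks))[1]


best_area = best_bank(BANKS, AREA_EFF)
best_power = best_bank(BANKS, POWER_EFF)
print(best_area, best_power)
```

Under that assumed mapping the minimum falls at 8 banks for both metrics, which agrees with the slides' statement that the 8-bank structure has the smallest area and lowest power.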
  • 50. E. Scalability with Memory Depths and Number of Ports - Latency
      • The AMM performance is mainly determined by the SRAM modules
      • For 2R4W HB-NTX-RdWr, (64K to 128K) & (256K to 512K) latency
      • For 2R4W HB-NTX-RdWr, (128K to 256K) latency
      • Figure labels: 2K 2RW, 4K 2RW
  • 51. E. Scalability with Memory Depths and Number of Ports - Area
      • The AMM performance is mainly determined by the SRAM modules
      • HB-NTX-RdWr attains smaller area than NTX-Wr_B-NTX-Rd
      • For 2R8W, HB-NTX-RdWr attains 11.21% to 2.33x smaller area than NTX-Wr_B-NTX-Rd from 16K to 1M
  • 52. E. Scalability with Memory Depths and Number of Ports - Power
      • The AMM performance is mainly determined by the SRAM modules
      • HB-NTX-RdWr attains lower power than NTX-Wr_B-NTX-Rd
      • For 2R8W, HB-NTX-RdWr attains 9.51% to 55.39% lower power than NTX-Wr_B-NTX-Rd from 16K to 1M
  • 54. F.(1) Different Basic SRAM Modules - Latency
      • For a wide range of sizes, all these basic SRAM modules have very similar performance and cost
  • 55. F.(1) Different Basic SRAM Modules - Area
      • For a wide range of sizes, all these basic SRAM modules have very similar performance and cost
  • 56. F.(1) Different Basic SRAM Modules - Power
      • For a wide range of sizes, power consumption: 2RW > 1R1W/2R > 1R1W
      • AMM 2RW, 2RW + 1R1W (power)
  • 57. F.(2) AMM Designs with Higher Port Counts Circuit-level Modules - Latency
      • CMM can outperform AMM in certain configurations
      • Can we attain better performance by properly choosing SRAM modules?
      • We apply three different SRAM modules (2RW, 2R2W, 2R4W), using 4R2W as an example
      • For 4R2W, AMM with 2RW SRAMs attains 6.08% to 2.032x shorter latency than AMM with 2R2W from 1K to 16M
      • This is because the latency of HB-NTX-RdWr is mainly determined by the SRAM module, and 2RW is faster than 2R2W and 2R4W
  • 58. F.(2) AMM Designs with Higher Port Counts Circuit-level Modules - Area
      • But using more complex SRAMs does provide benefits in area and power
      • AMM (e.g. 4R2W: 54 2)
      • For 4R2W, AMM with 2R2W SRAMs attains 30.28% to 71.43% smaller area than AMM with 2RW from 1K to 16M
  • 59. F.(2) AMM Designs with Higher Port Counts Circuit-level Modules - Power
      • But using more complex SRAMs does provide benefits in area and power
      • AMM (e.g. 4R2W: 54 2)
      • For 4R2W, AMM with 2R2W SRAMs attains 2.52x to 3.42x lower power than AMM with 2RW from 1K to 16M
  • 61. Summary of Experiments on AMM Studies
      • AMM does attain better performance (latency/area/power) than CMM, and the benefits become more significant for designs with more ports and greater depth
      • When the memory depth is shallow, AMM performance is mainly determined by the algorithmic logic; for memories with great depth, the building SRAM modules become the main performance factor
      • Non-table-based AMMs have shorter latencies than table-based designs, while table-based AMMs have smaller area and lower power consumption than non-table-based AMMs
      • A proper banking structure enhances performance, while excessively aggressive banking can induce significant overhead and a performance hit
      • Choosing proper SRAMs with higher port counts as building modules can improve the area and power of AMM designs
  • 62. Conclusions
      • Most previous work on AMM was conducted on FPGA-based platforms, where the memory is implemented by composing multiple BRAMs and logic slice LUTs (lookup tables)
      • This thesis comprehensively analyzes and explores algorithmic multi-ported memory on ASICs:
          • Different basic SRAM modules
          • Scalability with memory depths and number of ports for AMM designs
          • Applying banking structures for AMM designs
          • Circuit-level schemes vs. algorithmic schemes for different port configurations
      • Choosing proper SRAM modules with higher port counts can enhance the performance of AMM designs
  • 65. References
      • [1] Abdel-Hafeez, Saleh M., and Anas S. Matalkah. "CMOS eight-transistor memory cell for low-dynamic-power high-speed embedded SRAM." Journal of Circuits, Systems, and Computers 17.05 (2008): 845-863.
      • [2] Bhagyalakshmi, I. V., Ravi Teja, and Madhan Mohan. "Design and VLSI Simulation of SRAM Memory Cells for Multi-ported SRAM’s." (2014).
      • [3] Rivest, Ronald L., and Lance A. Glasser. A Fast-Multiport Memory Based on Single-Port Memory Cells. No. MIT/LCS/TM-455. Massachusetts Inst of Tech Cambridge Lab for Computer Science, 1991.
      • [4] Park, Seon-yeong, et al. "CFLRU: a replacement algorithm for flash memory." Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems. ACM, 2006.
      • [5] Synopsys Design Compiler User Guide Version X-2005.09. [Online] Available: http://beethoven.ee.ncku.edu.tw/testlab/course/VLSIdesign_course/course_96/Tool/Design_Compiler%20_User_Guide.pdf
      • [6] Synopsys Design Compiler Optimization Reference Manual Version D-2010.03. [Online] Available: http://cleroux.vvv.enseirb-matmeca.fr/EN219/doc/dcrmo.pdf
      • [7] LaForest, Charles Eric, and J. Gregory Steffan. "Efficient Multi-ported Memories for FPGAs." Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays (FPGA), pp. 41-50, ACM, 2010.
      • [8] LaForest, Charles Eric, Ming Gang Liu, Emma Rae Rapati, and J. Gregory Steffan. "Multi-ported Memories for FPGAs via XOR." Proceedings of the 20th annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), pp. 209-218, ACM, 2012.
      • [9] LaForest, Charles Eric, Zimo Li, Tristan O'Rourke, Ming G. Liu, and J. Gregory Steffan. "Composing Multi-Ported Memories on FPGAs." ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 7, issue 3, article no. 16, 2014.
      • [10] Lin, Jiun-Liang, and Bo-Cheng Charles Lai. "BRAM Efficient Multi-ported Memory on FPGA." VLSI Design, Automation and Test (VLSI-DAT), 2015 International Symposium on. IEEE, 2015.
      • [11] Lai, Bo-Cheng Charles, and Jiun-Liang Lin. "Efficient Designs of Multiported Memory on FPGA." IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2016).
  • 66. References
      • [12] Lai, Bo-Cheng Charles, and Kun-Hua Huang. "An Efficient Hierarchical Banking Structure for Algorithmic Multiported Memory on FPGA." IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2017).
      • [13] S. Iyer and D. Chuang. (Jan. 2012) "Algorithmic Memory Brings an Order of Magnitude Performance Increase to Next Generation SoC Memories." DesignCon, accessed on Jun. 22, 2017. [Online] Available: http://www.yuba.stanford.edu/sundaes/Papers/DesignCon-AlgMem.pdf
      • [14] Tse, David N. C., Pramod Viswanath, and Lizhong Zheng. "Diversity-multiplexing tradeoff in multiple-access channels." IEEE Transactions on Information Theory 50.9 (2004): 1859-1874.
      • [15] Ping, Li, et al. "Interleave division multiple-access." IEEE Transactions on Wireless Communications 5.4 (2006): 938-947.
      • [16] Suhendra, Vivy, Chandrashekar Raghavan, and Tulika Mitra. "Integrated scratchpad memory optimization and task scheduling for MPSoC architectures." Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems. ACM, 2006.
      • [17] Iyer, Sundar, and Shang-Tse Chuang. "High speed memory systems and methods for designing hierarchical memory systems." U.S. Patent Application No. 12/806,631.
      • [18] Wilton, Steven JE, and Norman P. Jouppi. "CACTI: An enhanced cache access and cycle time model." IEEE Journal of Solid-State Circuits 31.5 (1996): 677-688.
      • [19] Muralimanohar, Naveen, Rajeev Balasubramonian, and Norman P. Jouppi. "CACTI 6.0: A tool to model large caches." HP Laboratories (2009): 22-31.
      • [20] Muralimanohar, Naveen, Rajeev Balasubramonian, and Norm Jouppi. "Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0." Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2007.
      • [21] Muralimanohar, Naveen, Rajeev Balasubramonian, and Norman P. Jouppi. "Architecting efficient interconnects for large caches with CACTI 6.0." IEEE Micro 28.1 (2008).
      • [22] Thoziyoor, Shyamkumar, et al. CACTI 5.1. Technical Report HPL-2008-20, HP Labs, 2008.
      • [23] Synopsys Design Compiler Standard Cell Library, including TSMC, UMC and SMIC. [Online] Available: https://www.synopsys.com/dw/ipdir.php?ds=dwc_standard_cell
  • 67. References
      • [24] TSMC Standard Cell Library (including 45nm, 90nm advanced technology) Description Name. [Online] Available: http://www.europractice-ic.com/libraries_TSMC.php
      • [25] Bo-Cheng Charles Lai, Jiun-Liang Lin, Kun-Hua Huang, and Kuo-Cheng Lu. "Method for accessing multi-port memory module, method for increasing write ports of memory module and associated memory controller." U.S. Patent Application No. 15/098,330.
      • [26] Bo-Cheng Charles Lai, Jiun-Liang Lin, and Kuo-Cheng Lu. "Method for accessing multi-port memory module and associated memory controller." U.S. Patent Application No. 15/098,336.
      • [27] Tseng, Jessica H., and Krste Asanović. "Banked multiported register files for high-frequency superscalar microprocessors." ACM SIGARCH Computer Architecture News. Vol. 31. No. 2. ACM, 2003.
      • [28] Kim, John. "Low-cost router microarchitecture for on-chip networks." Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2009.
      • [29] Gupta, Pankaj, Steven Lin, and Nick McKeown. "Routing lookups in hardware at memory access speeds." INFOCOM'98. Seventeenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE. Vol. 3. IEEE, 1998.
      • [30] Hughes, John H. "Routing table lookup implemented using M-trie having nodes duplicated in multiple memory banks." U.S. Patent No. 6,308,219. 23 Oct. 2001.
      • [31] McAuley, Anthony J., Paul F. Tsuchiya, and Daniel V. Wilson. "Fast multilevel hierarchical routing table lookup using content addressable memory." U.S. Patent No. 5,386,413. 31 Jan. 1995.
      • [32] Teitenberg, Tim, and Bikram Singh Bakshi. "Efficient memory management for channel drivers in next generation I/O system." U.S. Patent No. 6,421,769. 16 Jul. 2002.
      • [33] Treleaven, Philip C., David R. Brownbridge, and Richard P. Hopkins. "Data-driven and demand-driven computer architecture." ACM Computing Surveys (CSUR) 14.1 (1982): 93-143.
      • [34] Peng, Zebo, and Krzysztof Kuchcinski. "Automated transformation of algorithms into register-transfer level implementations." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 13.2 (1994): 150-166.
  • 68. References
      • [35] Keshav, Srinivasan, and Rosen Sharma. "Issues and trends in router design." IEEE Communications Magazine 36.5 (1998): 144-151.
      • [36] Tullsen, Dean M., et al. "Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor." ACM SIGARCH Computer Architecture News. Vol. 24. No. 2. ACM, 1996.
      • [37] Xilinx 7 Series FPGAs Configurable Logic Block User Guide. [Online] Available: http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf
      • [38] Fetzer, E. S., Gibson, M., Klein, A., Calick, N., Zhu, C., Busta, E., and Mohammad, B. "A fully bypassed six-issue integer datapath and register file on the Itanium-2 microprocessor." IEEE Journal of Solid-State Circuits Conference, vol. 1, Feb. 2002, pp. 420-478.
      • [39] Bajwa, H., and X. Chen. "Low-Power High-Performance and Dynamically Configured Multi-port Cache Memory Architecture." Electrical Engineering, 2007. ICEE'07. International Conference on. IEEE, April 2007.
      • [40] S. Ben-David, A. Borodin, R. Karp, G. Tardos, and A. Wigderson. "On the Power of Randomization in On-line Algorithms." New York: Springer, 1994.