Introduction and Background
Multiplier Architectures
Results
Conclusion
Implementation and Comparison of Softcore
Multiplier Architectures for FPGAs
Shahid Abbas
Projektarbeit (Master of Science)
Fachgebiet Digitaltechnik
Universität Kassel
August 22, 2014
Outline
1 Introduction and Background
Motivation
Fundamentals of Binary Multiplication
Xilinx Virtex-6 Slice
FloPoCo Library and Bit Heaps
2 Multiplier Architectures
Target Specific Implementation
LUT-Based Multipliers
3 Results
Simulation
Synthesis Results
4 Conclusion
Motivations
Fast Multiplication for Signal Processing
Limited number of DSP Blocks in FPGA [1]
Fixed word size
A large multiplier is used even for small word sizes
Fixed allocation
Placement and routing issues
FPGA logic blocks can implement multipliers of any word size
Softcore multipliers can work in conjunction with DSP multipliers
Fundamentals of Binary Multiplication
1 Partial Products Calculation
2 Addition of Partial Products by proper shifting
Figure: Binary 4×4-bit Multiplication (Step 1: partial products 2^0·B0·A, 2^1·B1·A, 2^2·B2·A and 2^3·B3·A of A = A3…A0 and B = B3…B0; Step 2: sum of the partial products)
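The two steps above can be sketched in C++ as a shift-and-add loop. This is only an illustration of the arithmetic, not code from the project; the function name shiftAddMultiply4x4 is hypothetical.

```cpp
#include <cstdint>

// Multiply two unsigned 4-bit operands in the two steps of the figure.
// Step 1 forms one partial product per bit of B.
// Step 2 adds the partial products, each shifted by its bit position.
// The function name is illustrative and does not come from the slides.
std::uint32_t shiftAddMultiply4x4(std::uint32_t a, std::uint32_t b) {
    a &= 0xF;  // keep A3..A0
    b &= 0xF;  // keep B3..B0
    std::uint32_t product = 0;
    for (int i = 0; i < 4; ++i) {
        const std::uint32_t bi = (b >> i) & 1u;     // bit B_i
        const std::uint32_t partial = bi ? a : 0u;  // Step 1: B_i * A
        product += partial << i;                    // Step 2: add 2^i * B_i * A
    }
    return product;
}
```

For example, shiftAddMultiply4x4(13, 11) adds 13, 26 and 104 (bits 0, 1 and 3 of B are set) and returns 143.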
Xilinx Virtex-6 Slice [2]
Each Configurable Logic Block (CLB) contains two slices
Each slice contains four Look-Up Tables (LUTs), eight flip-flops, multiplexers and carry-propagation logic
Each LUT provides one or two outputs
Figure: Xilinx Virtex-6 Slice (four LUTs, each followed by a carry multiplexer, with the carry chain running from c_in to c_out)
FloPoCo Library and Bit Heaps
FloPoCo (Floating-Point Cores) is a C++ framework that generates synthesizable VHDL code [3] [4].
A bit heap is a data structure that holds an unevaluated sum of any number of bits, each weighted by a power of two [5].
Equally weighted bits are aligned in one column, since their order is irrelevant for the sum
Figure: Bit-Heap Structure for 16×16-Bit Multiplier (stages shown: before first compression, before 3-bit height additions, before final addition; timing annotations 0.530 ns and 1.061 ns)
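The bit-heap idea can be sketched in a few lines of C++. This is a minimal model under stated assumptions, not FloPoCo's BitHeap class: the type BitHeap, its members and the helper multiplierHeap are hypothetical names, and the single compression pass uses plain full adders (3:2 compressors), whereas FloPoCo selects its compressors per target.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical minimal bit heap: columns[w] holds the bits of weight 2^w.
struct BitHeap {
    std::vector<std::vector<int>> columns;

    // Add one bit of the given weight; the order inside a column is irrelevant.
    void addBit(std::size_t weight, int bit) {
        if (columns.size() <= weight) columns.resize(weight + 1);
        columns[weight].push_back(bit & 1);
    }

    // Value of the (so far unevaluated) sum.
    std::uint64_t value() const {
        std::uint64_t sum = 0;
        for (std::size_t w = 0; w < columns.size(); ++w)
            for (int bit : columns[w])
                sum += static_cast<std::uint64_t>(bit) << w;
        return sum;
    }

    // Height of the tallest column.
    std::size_t maxHeight() const {
        std::size_t height = 0;
        for (const auto& column : columns) height = std::max(height, column.size());
        return height;
    }

    // One compression pass: every group of three bits in a column is replaced
    // by a sum bit in the same column and a carry bit in the next column.
    void compressOnce() {
        std::vector<std::vector<int>> next(columns.size() + 1);
        for (std::size_t w = 0; w < columns.size(); ++w) {
            const auto& column = columns[w];
            std::size_t i = 0;
            for (; i + 3 <= column.size(); i += 3) {
                const int total = column[i] + column[i + 1] + column[i + 2];
                next[w].push_back(total & 1);       // sum bit, weight 2^w
                next[w + 1].push_back(total >> 1);  // carry bit, weight 2^(w+1)
            }
            for (; i < column.size(); ++i) next[w].push_back(column[i]);  // leftovers
        }
        columns = std::move(next);
    }
};

// Fill a heap with the AND terms a_i * b_j of an n x n-bit multiplication.
BitHeap multiplierHeap(std::uint32_t a, std::uint32_t b, int n) {
    BitHeap heap;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            heap.addBit(static_cast<std::size_t>(i + j),
                        static_cast<int>(((a >> i) & 1u) & ((b >> j) & 1u)));
    return heap;
}
```

A compression pass lowers the column heights without changing the value of the heap; repeating it until no column holds more than two bits leaves a single final addition.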
Target Specific Implementation [6]
A design that fits the logic blocks best gives better performance
Figure: Full Adder Implementation with Multiplexer and XOR-Gates (inputs a, b and c_in; outputs sum and c_out)
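The full adder of the figure can be written out as a small C++ model. It is a behavioural sketch of the multiplexer-and-XOR structure only; the names FullAdderResult and fullAdderMuxXor are hypothetical.

```cpp
// Result of one full-adder cell.
struct FullAdderResult {
    bool sum;
    bool carryOut;
};

// Full adder built from two XOR gates and one multiplexer:
//   propagate = a XOR b          (computed in the LUT)
//   sum       = propagate XOR c_in
//   c_out     = propagate ? c_in : a   (carry multiplexer)
FullAdderResult fullAdderMuxXor(bool a, bool b, bool carryIn) {
    const bool propagate = (a != b);
    const bool sum = (propagate != carryIn);
    const bool carryOut = propagate ? carryIn : a;
    return {sum, carryOut};
}
```

When a and b differ, the incoming carry is passed on; when they are equal, the carry out equals a (and b), so no AND or OR gates are needed in the carry path.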
Figure: Slice configuration of 4-LUTs for Partial Product and Addition (the bit pairs a0b0 … a7b7 feed four LUTs; the carry chain runs from c_in to c_out; sum outputs S0 to S3)
Target Specific Implementation (Automated)
Figure: 8×8-bit Multiplier Implementation in FloPoCo (operands A and B, n and m = 8 bits, before multiplication; partial products 2^0·B0·A to 2^7·B7·A are calculated, re-arranged in a vector<vector<pair<int, int>>> and mapped onto 3-LUT and 4-LUT slices with c_in = 0; each c_out goes to the bit heap)
Target Specific Implementation (For 8×8-bit Multiplier)
Automated Implementation
vector<vector<pair<int, int>>>
Manual interconnection of Slices
Addition using Bit Heaps
Addition using Arithmetic Expressions
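The slides name the container vector<vector<pair<int, int>>> but do not spell out its contents. The following sketch assumes that each inner vector is one partial-product row and that each pair (i, j) identifies the AND term a_i·b_j of weight 2^(i+j); the alias PartialProducts and both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed layout: rows[j] lists the index pairs (i, j) of the terms a_i * b_j.
using PartialProducts = std::vector<std::vector<std::pair<int, int>>>;

// Build the index pairs for a widthA x widthB-bit multiplication.
PartialProducts buildPartialProducts(int widthA, int widthB) {
    PartialProducts rows(static_cast<std::size_t>(widthB));
    for (int j = 0; j < widthB; ++j)
        for (int i = 0; i < widthA; ++i)
            rows[static_cast<std::size_t>(j)].emplace_back(i, j);
    return rows;
}

// Evaluate the stored terms for concrete operands: sum of a_i * b_j * 2^(i+j).
std::uint64_t evaluate(const PartialProducts& rows, std::uint64_t a, std::uint64_t b) {
    std::uint64_t sum = 0;
    for (const auto& row : rows)
        for (const auto& term : row) {
            const std::uint64_t bit = ((a >> term.first) & 1u) & ((b >> term.second) & 1u);
            sum += bit << (term.first + term.second);
        }
    return sum;
}
```

Keeping only index pairs makes re-arrangement cheap: rows can be reordered or regrouped for the slices without touching any signal values, and the sum stays the same.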
Target Specific Implementation (Manual)
Figure: 8×8-bit Multiplier with Manual Interconnection of Slices (re-arranged partial products, an AND-gate, and five outputs passed to the bit heap)
LUT-Based Multipliers [7] [5]
The product of two numbers can be obtained by adding the bit-shifted results of smaller multipliers
$A = 2^{n} A_1 + A_0$   (1)
$B = 2^{n} B_1 + B_0$   (2)
$A \times B = 2^{2n} A_1 B_1 + 2^{n} (A_1 B_0 + A_0 B_1) + A_0 B_0$   (3)
A basic n × m-bit multiplier can be instantiated multiple times
The results of all instances are added with proper shifting
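Equation (3) can be checked with a short C++ sketch that splits both operands at bit n and combines four smaller products. The function name multiplySplit is hypothetical, and the native `*` operator stands in for the small multiplier instances.

```cpp
#include <cstdint>

// Equation (3): A x B = 2^(2n) A1 B1 + 2^n (A1 B0 + A0 B1) + A0 B0,
// with A = 2^n A1 + A0 and B = 2^n B1 + B0.
// Assumes 0 < n < 32; the four products stand for four small multipliers.
std::uint64_t multiplySplit(std::uint32_t a, std::uint32_t b, unsigned n) {
    const std::uint64_t mask = (std::uint64_t{1} << n) - 1;
    const std::uint64_t a0 = a & mask;  // low n bits of A
    const std::uint64_t a1 = a >> n;    // remaining high bits of A
    const std::uint64_t b0 = b & mask;  // low n bits of B
    const std::uint64_t b1 = b >> n;    // remaining high bits of B
    return ((a1 * b1) << (2 * n)) + ((a1 * b0 + a0 * b1) << n) + a0 * b0;
}
```

With n = 8, a 16×16-bit product is built from four 8×8-bit products; the shifts are only wiring in hardware, so the cost lies in the small multipliers and the final additions.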
LUT-Based Multipliers [7] [5]
3×3-LUT-based multipliers (need six LUTs for the 6 output bits)
3×2-LUT-based multipliers (need three LUTs for the 5 output bits)
1×4-LUT-based multipliers
Figure: 3×3-LUT Multiplier (3-bit inputs A and B, 6-bit output Y)
Figure: 3×2-LUT Multiplier (3-bit input A, 2-bit input B, 5-bit output Y)
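A 3×3-LUT multiplier is a set of truth tables, one per output bit. The sketch below generates the six 64-entry tables and reads a product back from them; the address layout (B in the upper three bits, A in the lower three) and the function names are assumptions made for illustration.

```cpp
#include <array>
#include <cstdint>

// Build one 64-bit truth table per output bit of a 3x3-bit multiplier.
// Assumed address layout: address = (b << 3) | a, with a and b in 0..7.
std::array<std::uint64_t, 6> build3x3LutTables() {
    std::array<std::uint64_t, 6> tables{};
    for (unsigned b = 0; b < 8; ++b)
        for (unsigned a = 0; a < 8; ++a) {
            const unsigned address = (b << 3) | a;
            const unsigned product = a * b;  // at most 49, so 6 bits suffice
            for (unsigned k = 0; k < 6; ++k)
                if ((product >> k) & 1u)
                    tables[k] |= std::uint64_t{1} << address;
        }
    return tables;
}

// Look up the product: each table contributes one output bit Y_k.
unsigned lookup3x3(const std::array<std::uint64_t, 6>& tables, unsigned a, unsigned b) {
    const unsigned address = ((b & 7u) << 3) | (a & 7u);
    unsigned y = 0;
    for (unsigned k = 0; k < 6; ++k)
        y |= static_cast<unsigned>((tables[k] >> address) & 1u) << k;
    return y;
}
```

Each table has six address bits, which matches one 6-input LUT per output bit and explains the six LUTs needed for the 3×3 case.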
LUT-Based Multipliers (3×3-LUT based 8×7 Multiplier)
Figure: 8×7-bit LUT-Multiplier Implementation in FloPoCo (bits 0 to 7 of A and 0 to 6 of B are covered by nine tiles i to ix: four 3×3, two 2×3, two 3×1 and one 2×1 sub-multipliers)
$AB = A_{0..2} B_{0..2} + 2^{3} (A_{3..5} B_{0..2} + A_{0..2} B_{3..5}) + 2^{6} (A_{6..7} B_{0..2} + A_{3..5} B_{3..5} + A_{0..2} B_{6}) + 2^{9} (A_{6..7} B_{3..5} + A_{3..5} B_{6}) + 2^{12} A_{6..7} B_{6}$   (4)
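Equation (4) can be verified by computing the nine tile products and adding them with their shifts. This is a sketch of the arithmetic only; the helper bits and the function multiply8x7Tiled are hypothetical names, and the native `*` operator stands in for the LUT-based tiles.

```cpp
#include <cstdint>

// Extract bits lo..hi (inclusive) of x.
std::uint32_t bits(std::uint32_t x, int lo, int hi) {
    return (x >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

// 8x7-bit product assembled from the nine tiles of equation (4).
std::uint32_t multiply8x7Tiled(std::uint32_t a, std::uint32_t b) {
    a &= 0xFFu;  // A has bits 0..7
    b &= 0x7Fu;  // B has bits 0..6
    const std::uint32_t a02 = bits(a, 0, 2), a35 = bits(a, 3, 5), a67 = bits(a, 6, 7);
    const std::uint32_t b02 = bits(b, 0, 2), b35 = bits(b, 3, 5), b6 = bits(b, 6, 6);
    return a02 * b02                                   // weight 2^0
         + ((a35 * b02 + a02 * b35) << 3)              // weight 2^3
         + ((a67 * b02 + a35 * b35 + a02 * b6) << 6)   // weight 2^6
         + ((a67 * b35 + a35 * b6) << 9)               // weight 2^9
         + ((a67 * b6) << 12);                         // weight 2^12
}
```

The shift of each tile is the sum of the lowest bit indices of its two operand slices, for example 6 + 3 = 9 for the tile that multiplies bits 6 to 7 of A by bits 3 to 5 of B.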
Simulation
1 Word sizes are even and equal
2 Word sizes are even and unequal
3 Width of the larger word is even and the other is odd
4 Width of the larger word is odd and the other is even
5 Word sizes are odd and unequal
6 Word sizes are odd and equal
Eight designs for each of the above specifications
48 designs for each architecture
Self-checking testbenches were generated using the FloPoCo function emulate(TestCase *tc)
The TestBench 10000 option was used to generate 10000 random test cases during core generation
Simulation in ModelSim
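The idea of a self-checking test with random inputs can be illustrated without FloPoCo. The sketch below is a stand-in, not FloPoCo's emulate() or TestBench mechanism: the function countMismatches and its parameters are hypothetical, and the reference result is simply the native product.

```cpp
#include <cstdint>
#include <functional>
#include <random>

// Apply `testCases` random operand pairs to a multiplier model and count how
// often its result differs from the reference product a * b.
// Assumes 0 < widthA, widthB <= 32 so that the product fits in 64 bits.
int countMismatches(
    const std::function<std::uint64_t(std::uint64_t, std::uint64_t)>& multiplierUnderTest,
    int widthA, int widthB, int testCases, std::uint32_t seed) {
    std::mt19937_64 rng(seed);  // fixed seed keeps the run reproducible
    const std::uint64_t maskA = (std::uint64_t{1} << widthA) - 1;
    const std::uint64_t maskB = (std::uint64_t{1} << widthB) - 1;
    int mismatches = 0;
    for (int t = 0; t < testCases; ++t) {
        const std::uint64_t a = rng() & maskA;
        const std::uint64_t b = rng() & maskB;
        if (multiplierUnderTest(a, b) != a * b) ++mismatches;
    }
    return mismatches;
}
```

A call such as countMismatches(model, 8, 7, 10000, 1) mirrors the 10000 random test cases mentioned above for one 8×7-bit design; a correct model yields zero mismatches.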
Synthesis Results
Figure: Comparison of Architectures on the basis of Speed (for N×M-bit). Plot of Frequency (MHz) versus Complexity (N×M) for the Target Specific, 3×3-LUT, 1×4-LUT and 3×2-LUT multipliers; the maximum, f_max = 906.62 MHz, is reached by the Target Specific multiplier.
Figure: Comparison of Architectures on the basis of Speed (for N×N-bit). Plot of Frequency (MHz) versus Complexity (N) for the Target Specific, 3×3-LUT, 1×4-LUT and 3×2-LUT multipliers.
Figure: Comparison of Architectures on the basis of Slice usage (for N×M-bit). Plot of Number of Slices versus Complexity (N×M) for the Target Specific, 3×3-LUT, 1×4-LUT and 3×2-LUT multipliers.
Figure: Comparison of Architectures on the basis of Slice usage (for N×N-bit). Plot of Number of Slices versus Complexity (N) for the Target Specific, 3×3-LUT, 1×4-LUT and 3×2-LUT multipliers.
Average Performance
Table: Average values of parameters for different architectures
Architecture      No. of Flip-Flops   No. of LUTs   No. of Slices   Frequency (MHz)
Target Specific   1144                1615          419             346.36
3×3-LUT           1422                1893          491             301.03
3×2-LUT           1730                1962          513             264.95
1×4-LUT           2019                2340          610             259.98
Automatic Vs Manual Interconnection of Slices (8×8-bit)
Table: Automatic vs. manual routing between slices
Interconnection             No. of FFs   No. of LUTs   No. of Slices   Frequency (MHz)
Automatic                   56           74            21              686.81
Manual (Bit Heap)           22           74            20              256.61
Manual (Without Bit Heap)   40           59            16              414.08
Figure: Bit Heap Structure for Automatic Interconnection of Slices (stages shown: before first compression, before 3-bit height additions, before final addition)
Figure: Bit Heap Structure for Manual Interconnection of Slices (same stages; timing annotation 0.530 ns)
Conclusion
Fast multipliers with minimal resource usage can be implemented
by choosing an appropriate architecture.
The Target Specific Implementation showed the best results: on average
it reached the highest frequency and consumed the fewest resources.
The automated generation of this approach can be modified by
introducing an AND-gate for the corner elements.
Slice usage can be reduced by manual interconnection of slices,
at the cost of speed.
References
[1] Ian Kuon and J. Rose.
Measuring the Gap Between FPGAs and ASICs.
Computer-Aided Design of Integrated Circuits and Systems, 26:203–215, February 2007.
[2] Xilinx.
Virtex-6 FPGA, Configurable Logic Block User Guide, UG364 (v1.2).
http://www.xilinx.com/support/documentation/user_guides/ug364.pdf, 2012.
[3] F. de Dinechin and B. Pasca.
Designing Custom Arithmetic Data Paths with FloPoCo.
Design and Test of Computers, 28:18–27, 2011.
[4] Florent de Dinechin.
Tutorial held at HiPEAC’2013 “Building Custom Arithmetic Operators with the FloPoCo Generator”.
http://perso.citi-lab.fr/fdedinec/recherche/2013-HiPEAC-Tutorial-FloPoCo/flopoco-tutorial.pdf,
2013.
[5] N. Brunie, F. de Dinechin, M. Istoan, G. Sergent, K. Illyes, and B. Popa.
Arithmetic core generation using bit heaps.
In Proc. IEEE FPL ’2013, pages 1–8, Porto, Portugal, 2–4, 2013.
[6] H. ParandehAfshar and P. Ienne.
Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs.
In Proc. IEEE FPL ’2011, pages 225–231, Chania, Greece, 5–7, 2011.
[7] F. de Dinechin and B. Pasca.
Large multipliers with fewer DSP blocks.
In Proc. IEEE FPL ’2009, pages 225–231, Chania, Greece, Aug 31-Sept 2 2011.
Thanks for your attention!
25 / 25

Implementation and Comparison of Softcore Multiplier Architectures for FPGAs

  • 1.
    Introduction and Background MultiplierArchitectures Results Conclusion Implementation and Comparison of Softcore Multiplier Architectures for FPGAs Shahid Abbas Projektarbeit (Master of Science) Fachgebiet Digitaltechnik Universt¨at Kassel August 22, 2014 1 / 25
  • 2.
    Introduction and Background MultiplierArchitectures Results Conclusion Outline 1 Introduction and Background Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps 2 / 25
  • 3.
    Introduction and Background MultiplierArchitectures Results Conclusion Outline 1 Introduction and Background Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps 2 Multiplier Architectures Target Specific Implementation LUT-Based Multipliers 2 / 25
  • 4.
    Introduction and Background MultiplierArchitectures Results Conclusion Outline 1 Introduction and Background Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps 2 Multiplier Architectures Target Specific Implementation LUT-Based Multipliers 3 Results Simulation Synthesis Results 2 / 25
  • 5.
    Introduction and Background MultiplierArchitectures Results Conclusion Outline 1 Introduction and Background Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps 2 Multiplier Architectures Target Specific Implementation LUT-Based Multipliers 3 Results Simulation Synthesis Results 4 Conclusion 2 / 25
  • 6.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Motivations Fast Multiplication for Signal Processing 3 / 25
  • 7.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Motivations Fast Multiplication for Signal Processing Limited number of DSP Blocks in FPGA [1] 3 / 25
  • 8.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Motivations Fast Multiplication for Signal Processing Limited number of DSP Blocks in FPGA [1] Fixed word size 3 / 25
  • 9.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Motivations Fast Multiplication for Signal Processing Limited number of DSP Blocks in FPGA [1] Fixed word size Use big multiplier for small word size 3 / 25
  • 10.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Motivations Fast Multiplication for Signal Processing Limited number of DSP Blocks in FPGA [1] Fixed word size Use big multiplier for small word size Fixed allocation 3 / 25
  • 11.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Motivations Fast Multiplication for Signal Processing Limited number of DSP Blocks in FPGA [1] Fixed word size Use big multiplier for small word size Fixed allocation Place and routing issues 3 / 25
  • 12.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Motivations Fast Multiplication for Signal Processing Limited number of DSP Blocks in FPGA [1] Fixed word size Use big multiplier for small word size Fixed allocation Place and routing issues Use of FPGA logic blocks for multiplier of any word size 3 / 25
  • 13.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Motivations Fast Multiplication for Signal Processing Limited number of DSP Blocks in FPGA [1] Fixed word size Use big multiplier for small word size Fixed allocation Place and routing issues Use of FPGA logic blocks for multiplier of any word size Softcore multiplier that work in conjunction with DSP multipliers 3 / 25
  • 14.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Fundamentals of Binary Multiplication 1 Partial Products Calculation A=A3 B=B3 20 ·B0·A x + Step 1 Step 2 A0… … B0 21 ·B1·A 22 ·B2·A 23 ·B3·A + + = Figure: Binary 4×4-bit Multiplication 4 / 25
  • 15.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Fundamentals of Binary Multiplication 1 Partial Products Calculation 2 Addition of Partial Products by proper shifting A=A3 B=B3 20 ·B0·A x + Step 1 Step 2 A0… … B0 21 ·B1·A 22 ·B2·A 23 ·B3·A + + = Figure: Binary 4×4-bit Multiplication 4 / 25
  • 16.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Xilinx Virtex-6 Slice [2] Configurable Logic Blocks (CLB) contains two slices 0 1 0 1 0 1 0 1 c_in c_out LUTLUTLUTLUT Figure: Xilinx Virtex-6 Slice 5 / 25
  • 17.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Xilinx Virtex-6 Slice [2] Configurable Logic Blocks (CLB) contains two slices Each slice contains four Look-Up Tables (LUT), eight Flip-Flops, multiplexers and a carry-propagation logic. 0 1 0 1 0 1 0 1 c_in c_out LUTLUTLUTLUT Figure: Xilinx Virtex-6 Slice 5 / 25
  • 18.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps Xilinx Virtex-6 Slice [2] Configurable Logic Blocks (CLB) contains two slices Each slice contains four Look-Up Tables (LUT), eight Flip-Flops, multiplexers and a carry-propagation logic. Single or two outputs per LUT 0 1 0 1 0 1 0 1 c_in c_out LUTLUTLUTLUT Figure: Xilinx Virtex-6 Slice 5 / 25
  • 19.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps FloPoCo Library and Bit Heaps Floating-Point Cores (FloPoCo), C++ framework for synthesizable VHDL code [3] [4]. before first compression 1 0.530 ns 1 1.061 ns before 3-bit height additions before final addition Figure: Bit-Heap Structure for 16×16-Bit Multiplier 6 / 25
  • 20.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps FloPoCo Library and Bit Heaps Floating-Point Cores (FloPoCo), C++ framework for synthesizable VHDL code [3] [4]. Bit heap is a data-structure holds unevaluated sum of any number of bits weighted by power of two [5]. before first compression 1 0.530 ns 1 1.061 ns before 3-bit height additions before final addition Figure: Bit-Heap Structure for 16×16-Bit Multiplier 6 / 25
  • 21.
    Introduction and Background MultiplierArchitectures Results Conclusion Motivation Fundamentals of Binary Multiplication Xilinx Virtex-6 Slice FloPoCo Library and Bit Heaps FloPoCo Library and Bit Heaps Floating-Point Cores (FloPoCo), C++ framework for synthesizable VHDL code [3] [4]. Bit heap is a data-structure holds unevaluated sum of any number of bits weighted by power of two [5]. Equally weighted bits aligned in column as order is irrelevant for sum before first compression 1 0.530 ns 1 1.061 ns before 3-bit height additions before final addition Figure: Bit-Heap Structure for 16×16-Bit Multiplier 6 / 25
  • 22.
    Introduction and Background MultiplierArchitectures Results Conclusion Target Specific Implementation LUT-Based Multipliers Target Specific Implementation [6] Best Fit design in Logic Blocks = Better Performance a b c_out c_in sum 0 1 Figure: Full Adder Implementation with Multiplexer and XOR-Gates 7 / 25
  • 23.
    Target Specific Implementation [6]
    Figure: Slice configuration of 4-LUTs for Partial Product and Addition (four LUTs with operand inputs a0 b0 through a7 b7, sum outputs S0 to S3, carry chain from c_in to c_out)
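One way to read the slice configuration is that each LUT forms two partial-product bits and their XOR, while the carry chain completes the addition of two partial-product rows. The sketch below models that reading in software. It is an assumption about the mapping rather than a transcription of the design in [6]; the function names are hypothetical, and the operand width is assumed to be even.

```python
def add_two_partial_product_rows(a: int, b_lo: int, b_hi: int, width: int) -> int:
    """Return a*b_lo + 2*a*b_hi for single bits b_lo and b_hi.

    Each loop iteration models one LUT plus one carry-chain stage.
    """
    carry = 0  # c_in of the chain is tied to 0
    result = 0
    for k in range(width + 1):
        a_k = (a >> k) & 1 if k < width else 0
        a_km1 = (a >> (k - 1)) & 1 if k >= 1 else 0
        x = a_k & b_lo     # bit k of the row a * b_lo
        y = a_km1 & b_hi   # bit k of the row (a * b_hi) << 1
        p = x ^ y          # propagate signal computed in the LUT
        result |= (p ^ carry) << k
        carry = carry if p else x  # carry multiplexer
    result |= carry << (width + 1)  # c_out of the chain
    return result


def multiply(a: int, b: int, width: int = 8) -> int:
    """Sum the row pairs; in the slides this final addition is done by the bit heap."""
    total = 0
    for j in range(0, width, 2):
        b_lo = (b >> j) & 1
        b_hi = (b >> (j + 1)) & 1
        total += add_two_partial_product_rows(a, b_lo, b_hi, width) << j
    return total
```

Pairing rows this way halves the number of rows that reach the final addition, which is the motivation for computing the partial products and the first addition level in the same LUTs.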
  • 24.
    Target Specific Implementation (Automated)
    - Data structure: vector<vector<pair<int, int>>>
    - Stages: partial-product calculation, re-arrangement, mapping to 3-LUT and 4-LUT slices (c_in = 0; c_out goes to the bit heap)
    Figure: 8×8-bit Multiplier Implementation in FloPoCo (operands A, n = 8 bits, and B, m = 8 bits; partial products 2^0·B0·A, 2^1·B1·A, ..., 2^7·B7·A)
  • 25.
    Target Specific Implementation (For 8×8-bit Multiplier)
    - Automated implementation
      - vector<vector<pair<int, int>>>
    - Manual interconnection of slices
      - Addition using bit heaps
      - Addition using arithmetic expressions
  • 30.
    Target Specific Implementation (Manual)
    Figure: 8×8-bit Multiplier with Manual Interconnection of Slices (re-arranged partial products, an AND-gate, and five outputs to the bit heap)
  • 31.
    LUT-Based Multipliers [7] [5]
    - The product of two numbers can be obtained by adding the bit-shifted results of smaller multipliers:
        $A = 2^{n} A_1 + A_0$  (1)
        $B = 2^{n} B_1 + B_0$  (2)
        $A \times B = 2^{2n} A_1 B_1 + 2^{n} (A_1 B_0 + A_0 B_1) + A_0 B_0$  (3)
    - A basic n × m-bit multiplier can be instantiated multiple times.
    - The results of the instances are added after shifting each by the appropriate amount.
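Equations (1) to (3) can be checked numerically. The sketch below splits each operand at bit n and recombines the four small products with shifts; the function names are hypothetical.

```python
def split(x: int, n: int) -> tuple:
    """Return (x1, x0) such that x == 2**n * x1 + x0, as in equations (1) and (2)."""
    return x >> n, x & ((1 << n) - 1)


def product_from_halves(a: int, b: int, n: int) -> int:
    """Recombine the four small products according to equation (3)."""
    a1, a0 = split(a, n)
    b1, b0 = split(b, n)
    return ((a1 * b1) << (2 * n)) + ((a1 * b0 + a0 * b1) << n) + a0 * b0
```

For example, `product_from_halves(0xAB, 0xCD, 4)` builds an 8×8-bit product from four 4×4-bit products. In hardware each small product would be one instance of the basic multiplier.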
  • 34.
    LUT-Based Multipliers [7] [5]
    - 3×3-LUT based multipliers (need 6 LUTs for 6 output bits)
    - 3×2-LUT based multipliers (need 3 LUTs for 5 output bits)
    - 1×4-LUT based multipliers
    Figure: 3×3-LUT Multiplier (3-bit inputs A and B, 6-bit output Y)
    Figure: 3×2-LUT Multiplier (3-bit input A, 2-bit input B, 5-bit output Y)
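A small multiplier stored in LUTs is simply a set of truth tables, one per output bit. The sketch below generates those tables as integer bit masks and evaluates a product by table lookup. The function names and the input ordering (A in the low address bits, B above it) are assumptions for illustration, not the encoding used by FloPoCo or by any vendor primitive.

```python
def lut_truth_tables(bits_a: int, bits_b: int) -> list:
    """Return one truth table (as an integer bit mask) per output bit of a bits_a × bits_b multiplier."""
    n_in = bits_a + bits_b
    n_out = bits_a + bits_b
    tables = [0] * n_out
    for addr in range(1 << n_in):
        a = addr & ((1 << bits_a) - 1)  # low address bits hold A
        b = addr >> bits_a              # high address bits hold B
        y = a * b
        for k in range(n_out):
            if (y >> k) & 1:
                tables[k] |= 1 << addr
    return tables


def lut_multiply(tables: list, a: int, b: int, bits_a: int) -> int:
    """Look up every output bit at the address formed by (b, a)."""
    addr = a | (b << bits_a)
    return sum(((t >> addr) & 1) << k for k, t in enumerate(tables))
```

A 3×3 multiplier has six inputs and six output bits, so it yields six 64-entry tables. A 3×2 multiplier has only five inputs and five output bits; since a LUT can provide two outputs, this is consistent with the three LUTs stated above.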
  • 37.
    LUT-Based Multipliers (3×3-LUT based 8×7 Multiplier)
    Figure: 8×7-bit LUT-Multiplier Implementation in FloPoCo (A bits 0 to 7 and B bits 0 to 6 covered by nine tiles i to ix: four 3×3, two 2×3, two 3×1, and one 2×1)
    $AB = A_{0..2} B_{0..2} + 2^{3} (A_{3..5} B_{0..2} + A_{0..2} B_{3..5}) + 2^{6} (A_{6..7} B_{0..2} + A_{3..5} B_{3..5} + A_{0..2} B_{6}) + 2^{9} (A_{6..7} B_{3..5} + A_{3..5} B_{6}) + 2^{12} A_{6..7} B_{6}$  (4)
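Equation (4) can be verified exhaustively, since an 8×7-bit multiplier has only 32768 input combinations. The sketch below mirrors the nine tiles term by term; the helper names are hypothetical.

```python
def field(x: int, lo: int, hi: int) -> int:
    """Extract bits lo..hi (inclusive) of x."""
    return (x >> lo) & ((1 << (hi - lo + 1)) - 1)


def multiply_8x7_tiled(a: int, b: int) -> int:
    """Evaluate equation (4): nine small products, shifted and added."""
    a0, a1, a2 = field(a, 0, 2), field(a, 3, 5), field(a, 6, 7)
    b0, b1, b2 = field(b, 0, 2), field(b, 3, 5), field(b, 6, 6)
    return (
        a0 * b0
        + ((a1 * b0 + a0 * b1) << 3)
        + ((a2 * b0 + a1 * b1 + a0 * b2) << 6)
        + ((a2 * b1 + a1 * b2) << 9)
        + ((a2 * b2) << 12)
    )
```

Each product in the sum corresponds to one tile of the figure, and the shift amount is the sum of the starting bit positions of the two operand fields.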
  • 39.
    Simulation
    Word-size cases:
    1. Word sizes are even and equal.
    2. Word sizes are even and unequal.
    3. Width of the larger word is even and the other is odd.
    4. Width of the larger word is odd and the other is even.
    5. Word sizes are odd and unequal.
    6. Word sizes are odd and equal.
    - Eight designs were generated for each of the above cases, giving 48 designs per architecture.
    - Self-checking testbenches were generated using the FloPoCo function emulate(TestCase *tc).
    - The TestBench 10000 option was used to generate 10000 random test cases during core generation.
    - Simulation was run in ModelSim.
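The six word-size cases can be written as a small classifier, which also makes the design count explicit. This is an illustrative sketch; the function and constant names are hypothetical and not part of FloPoCo.

```python
DESIGNS_PER_CASE = 8  # from the slide: eight designs per word-size case


def word_size_case(n: int, m: int) -> int:
    """Map operand widths (n, m) to the case number 1..6 listed above."""
    large, small = max(n, m), min(n, m)
    large_even = large % 2 == 0
    small_even = small % 2 == 0
    if large_even and small_even:
        return 1 if n == m else 2
    if large_even:          # larger word even, smaller word odd
        return 3
    if small_even:          # larger word odd, smaller word even
        return 4
    return 6 if n == m else 5


def designs_per_architecture(num_cases: int = 6) -> int:
    """Number of designs simulated for one architecture."""
    return num_cases * DESIGNS_PER_CASE
```

The cases are mutually exclusive and cover every pair of widths, so each generated design falls into exactly one of them.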
  • 50.
    Synthesis Results
    Figure: Comparison of Architectures on the basis of Speed (for N×M-bit). Plot "Speed vs. Complexity (N×M)": x-axis Complexity (N×M), y-axis Frequency (MHz); series: Target Specific Multiplier, 3×3 LUT Multiplier, 1×4 LUT Multiplier, 3×2 LUT Multiplier. Annotation: f_max = 906.62 MHz for the Target Specific Multiplier.
  • 51.
    Synthesis Results
    Figure: Comparison of Architectures on the basis of Speed (for N×N-bit). Plot "Speed vs. Complexity (N×N)": x-axis Complexity (N), y-axis Frequency (MHz); series: Target Specific Multiplier, 3×3 LUT Multiplier, 1×4 LUT Multiplier, 3×2 LUT Multiplier.
  • 52.
    Synthesis Results
    Figure: Comparison of Architectures on the basis of Slice usage (for N×M-bit). Plot "Slice Usage vs. Complexity (N×M)": x-axis Complexity (N×M), y-axis Number of Slices; series: Target Specific Multiplier, 3×3 LUT Multiplier, 1×4 LUT Multiplier, 3×2 LUT Multiplier.
  • 53.
    Synthesis Results
    Figure: Comparison of Architectures on the basis of Slice usage (for N×N-bit). Plot "Slice Usage vs. Complexity (N×N)": x-axis Complexity (N), y-axis Number of Slices; series: Target Specific Multiplier, 3×3 LUT Multiplier, 1×4 LUT Multiplier, 3×2 LUT Multiplier.
  • 54.
    Synthesis Results: Average Performance
    Table: Average values of parameters for different architectures
    Architecture      No. of Flip-Flops   No. of LUTs   No. of Slices   Frequency (MHz)
    Target Specific   1144                1615          419             346.36
    3×3-LUT           1422                1893          491             301.03
    3×2-LUT           1730                1962          513             264.95
    1×4-LUT           2019                2340          610             259.98
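The averages in the table can be put in relation to each other with a short calculation. The sketch below uses only the values from the table; the dictionary layout and the function name are hypothetical.

```python
# Average values copied from the table: (flip-flops, LUTs, slices, frequency in MHz)
AVERAGES = {
    "Target Specific": (1144, 1615, 419, 346.36),
    "3x3-LUT": (1422, 1893, 491, 301.03),
    "3x2-LUT": (1730, 1962, 513, 264.95),
    "1x4-LUT": (2019, 2340, 610, 259.98),
}


def relative_to_target_specific(architecture: str) -> tuple:
    """Return (slice ratio, frequency ratio) of an architecture versus Target Specific."""
    _, _, ref_slices, ref_freq = AVERAGES["Target Specific"]
    _, _, slices, freq = AVERAGES[architecture]
    return slices / ref_slices, freq / ref_freq
```

For the 1×4-LUT architecture this gives a slice ratio of about 1.46 and a frequency ratio of about 0.75, meaning it uses roughly 46% more slices on average while running at about three quarters of the clock frequency.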
  • 55.
    Synthesis Results: Automatic Vs Manual Interconnection of Slices (8×8-bit)
    Table: Automatic Vs Manual routing between Slices
    Interconnection             No. of FFs   No. of LUTs   No. of Slices   Frequency (MHz)
    Automatic                   56           74            21              686.81
    Manual (Bit Heap)           22           74            20              256.61
    Manual (Without Bit Heap)   40           59            16              414.08
  • 56.
    Synthesis Results: Automatic Vs Manual Interconnection of Slices (8×8-bit)
    Figure: Bit Heap Structure for Automatic Interconnection of Slices (stages: before first compression, before 3-bit height additions, before final addition)
    Figure: Bit Heap Structure for Manual Interconnection of Slices (same stages; timing annotation 0.530 ns)
  • 57.
    Conclusion
    - Fast multipliers with minimal resource usage can be implemented by choosing an appropriate architecture.
    - The Target Specific Implementation showed the best results: the highest average frequency and the lowest average resource consumption.
    - The automated generation of this approach can be modified by introducing an AND-gate for corner elements.
    - Slice usage can be reduced by manual interconnection of slices, at the cost of speed.
  • 61.
    References
    [1] I. Kuon and J. Rose. Measuring the Gap Between FPGAs and ASICs. Computer-Aided Design of Integrated Circuits and Systems, 26:203–215, February 2007.
    [2] Xilinx. Virtex-6 FPGA, Configurable Logic Block User Guide, UG364 (v1.2). http://www.xilinx.com/support/documentation/user_guides/ug364.pdf, 2012.
    [3] F. de Dinechin and B. Pasca. Designing Custom Arithmetic Data Paths with FloPoCo. Design and Test of Computers, 28:18–27, 2011.
    [4] F. de Dinechin. Tutorial held at HiPEAC’2013: “Building Custom Arithmetic Operators with the FloPoCo Generator”. http://perso.citi-lab.fr/fdedinec/recherche/2013-HiPEAC-Tutorial-FloPoCo/flopoco-tutorial.pdf, 2013.
    [5] N. Brunie, F. de Dinechin, M. Istoan, G. Sergent, K. Illyes, and B. Popa. Arithmetic core generation using bit heaps. In Proc. IEEE FPL ’2013, pages 1–8, Porto, Portugal, 2–4, 2013.
    [6] H. ParandehAfshar and P. Ienne. Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs. In Proc. IEEE FPL ’2011, pages 225–231, Chania, Greece, 5–7, 2011.
    [7] F. de Dinechin and B. Pasca. Large multipliers with fewer DSP blocks. In Proc. IEEE FPL ’2009, pages 225–231, Chania, Greece, Aug 31-Sept 2 2011.
  • 62.
    Thanks for your attention!