Implementation and Comparison of Softcore Multiplier Architectures for FPGAs
1. Introduction and Background
Multiplier Architectures
Results
Conclusion
Shahid Abbas
Projektarbeit (Master of Science)
Fachgebiet Digitaltechnik
Universität Kassel
August 22, 2014
Outline
1 Introduction and Background
Motivation
Fundamentals of Binary Multiplication
Xilinx Virtex-6 Slice
FloPoCo Library and Bit Heaps
2 Multiplier Architectures
Target Specific Implementation
LUT-Based Multipliers
3 Results
Simulation
Synthesis Results
4 Conclusion
Motivations
Fast Multiplication for Signal Processing
Limited number of DSP blocks in an FPGA [1]
Fixed word size
A large multiplier must be used even for small word sizes
Fixed allocation
Placement and routing issues
FPGA logic blocks can implement multipliers of any word size
Softcore multipliers that work in conjunction with DSP multipliers
Fundamentals of Binary Multiplication
1 Partial Products Calculation
2 Addition of Partial Products after appropriate shifting
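$A = A_3 \dots A_0$, $B = B_3 \dots B_0$
Step 1: partial products $2^0 \cdot B_0 \cdot A$, $2^1 \cdot B_1 \cdot A$, $2^2 \cdot B_2 \cdot A$, $2^3 \cdot B_3 \cdot A$
Step 2: $A \times B = 2^0 \cdot B_0 \cdot A + 2^1 \cdot B_1 \cdot A + 2^2 \cdot B_2 \cdot A + 2^3 \cdot B_3 \cdot A$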
Figure: Binary 4×4-bit Multiplication
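The two steps of the figure can be sketched in a few lines of Python. This is only an illustration of shift-and-add multiplication; the function names `partial_products` and `multiply` are made up for this sketch and do not come from the slides.

```python
def partial_products(a, b, width=4):
    """Step 1: one partial product 2^i * B_i * A for every bit B_i of b."""
    return [(((b >> i) & 1) * a) << i for i in range(width)]


def multiply(a, b, width=4):
    """Step 2: add the (already shifted) partial products."""
    return sum(partial_products(a, b, width))


if __name__ == "__main__":
    # 13 * 11 with 4-bit operands; 11 = 1011b, so B2 contributes nothing.
    print(partial_products(13, 11))
    print(multiply(13, 11))
```

Each list entry corresponds to one row of the 4×4-bit figure; a zero bit in B yields an all-zero row.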
Xilinx Virtex-6 Slice [2]
Each Configurable Logic Block (CLB) contains two slices
Each slice contains four look-up tables (LUTs), eight flip-flops, multiplexers, and carry-propagation logic.
Each LUT provides either one or two outputs
[Diagram: four LUTs, each driving a 2:1 multiplexer of the carry chain that runs from c_in to c_out]
Figure: Xilinx Virtex-6 Slice
FloPoCo Library and Bit Heaps
Floating-Point Cores (FloPoCo): a C++ framework that generates synthesizable VHDL code [3] [4].
A bit heap is a data structure that holds the unevaluated sum of an arbitrary number of bits, each weighted by a power of two [5].
Bits of equal weight are aligned in a column, since their order is irrelevant to the sum
[Panels: before first compression; before 3-bit height additions; before final addition. Timing annotations: 0.530 ns, 1.061 ns]
Figure: Bit-Heap Structure for 16×16-Bit Multiplier
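To make the bit-heap idea concrete, here is a minimal Python sketch. It is not FloPoCo's implementation: the class `BitHeap`, its methods, and the simple "full adders only" compression strategy are assumptions chosen for illustration. It shows that bits are stored per column (weight) and that replacing three bits of a column by a sum bit and a carry bit leaves the represented value unchanged.

```python
from collections import defaultdict


class BitHeap:
    """Unevaluated sum of bits; column w holds bits of weight 2^w."""

    def __init__(self):
        self.columns = defaultdict(list)

    def add_bit(self, weight, bit):
        self.columns[weight].append(bit)

    def value(self):
        return sum(bit << w for w, bits in self.columns.items() for bit in bits)

    def height(self):
        return max((len(bits) for bits in self.columns.values()), default=0)

    def compress_once(self):
        """Apply full adders (3 bits -> sum + carry) to every column once."""
        new = defaultdict(list)
        for w in sorted(self.columns):
            bits = self.columns[w]
            while len(bits) >= 3:
                x, y, z = bits.pop(), bits.pop(), bits.pop()
                new[w].append(x ^ y ^ z)                            # sum bit, same weight
                new[w + 1].append((x & y) | (x & z) | (y & z))      # carry, next weight
            new[w].extend(bits)                                     # leftover bits stay
        self.columns = new


def multiplier_heap(a, b, n):
    """Throw all n*n one-bit products A_i AND B_j onto the heap at weight i + j."""
    heap = BitHeap()
    for i in range(n):
        for j in range(n):
            heap.add_bit(i + j, ((a >> i) & 1) & ((b >> j) & 1))
    return heap


if __name__ == "__main__":
    heap = multiplier_heap(201, 173, 8)
    while heap.height() > 2:        # stop when a final two-operand addition suffices
        heap.compress_once()
    print(heap.value() == 201 * 173)
```

Because the order of bits inside a column does not matter, the sketch is free to pop any three bits; a real generator would pick compressors with timing and area in mind.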
Target Specific Implementation [6]
Best fit of the design to the logic blocks = better performance
[Diagram: inputs a, b, c_in; outputs sum, c_out; the carry is selected by a 2:1 multiplexer (inputs 0/1)]
Figure: Full Adder Implementation with Multiplexer and XOR-Gates
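The full adder of the figure can be modelled as follows. The wiring used here (propagate signal from an XOR, sum from a second XOR, carry chosen by a multiplexer) is the usual carry-chain formulation and is assumed rather than read off the slide; the names `full_adder` and `ripple_add` are illustrative.

```python
def full_adder(a, b, c_in):
    """Full adder built from two XOR gates and one 2:1 multiplexer."""
    p = a ^ b                  # propagate signal (first XOR)
    s = p ^ c_in               # sum bit (second XOR)
    c_out = c_in if p else a   # multiplexer: pass the carry on, or generate it from a
    return s, c_out


def ripple_add(x, y, width):
    """Chain `width` full adders from c_in = 0, as a carry chain would."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << width)


if __name__ == "__main__":
    print(ripple_add(0b1011, 0b0110, 4))
```

When p is 0 both inputs are equal, so either of them is the correct carry; when p is 1 the incoming carry is passed through.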
LUT-Based Multipliers [7] [5]
The product of two numbers can be obtained by adding the bit-shifted results of smaller multipliers
$A = 2^{n} A_1 + A_0$ (1)
$B = 2^{n} B_1 + B_0$ (2)
$A \times B = 2^{2n} A_1 B_1 + 2^{n} (A_1 B_0 + A_0 B_1) + A_0 B_0$ (3)
A basic n × m-bit multiplier can be instantiated multiple times
The results of the instances are added after appropriate shifting
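Equations (1) to (3) can be checked numerically with a short Python sketch. The helper names `split` and `multiply_split` are hypothetical; the four small products stand in for four instances of a basic multiplier.

```python
def split(x, n):
    """Return (x1, x0) with x = 2^n * x1 + x0 and 0 <= x0 < 2^n."""
    return x >> n, x & ((1 << n) - 1)


def multiply_split(a, b, n):
    """Equation (3): combine four small products by shifting and adding."""
    a1, a0 = split(a, n)
    b1, b0 = split(b, n)
    return ((a1 * b1) << (2 * n)) + ((a1 * b0 + a0 * b1) << n) + (a0 * b0)


if __name__ == "__main__":
    a, b = 0xB7, 0x5C          # two 8-bit operands, split into 4-bit halves
    print(multiply_split(a, b, 4) == a * b)
```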
LUT-Based Multipliers [7] [5]
3×3-LUT based multipliers (need six LUTs for 6 output bits)
3×2-LUT based multipliers (need three LUTs for 5 output bits)
1×4-LUT based multipliers
Figure: 3×3-LUT Multiplier (inputs A: 3 bits, B: 3 bits; output Y: 6 bits)
Figure: 3×2-LUT Multiplier (inputs A: 3 bits, B: 2 bits; output Y: 5 bits)
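A small multiplier stored in look-up tables can be sketched as one truth table per output bit. The following Python code is an illustration only; `lut_tables` and `lut_multiply` are made-up names, and the address layout (B in the upper bits, A in the lower bits) is an arbitrary choice.

```python
def lut_tables(wa, wb):
    """Build one truth table per product bit of a wa x wb multiplier."""
    n_out = wa + wb
    tables = [[0] * (1 << (wa + wb)) for _ in range(n_out)]
    for a in range(1 << wa):
        for b in range(1 << wb):
            addr = (b << wa) | a               # concatenated inputs form the LUT address
            for k in range(n_out):
                tables[k][addr] = ((a * b) >> k) & 1
    return tables


def lut_multiply(tables, a, b, wa):
    """Read every output bit from its table and reassemble the product."""
    addr = (b << wa) | a
    return sum(table[addr] << k for k, table in enumerate(tables))


if __name__ == "__main__":
    t33 = lut_tables(3, 3)   # 6 tables with 64 entries each (6 address bits)
    t32 = lut_tables(3, 2)   # 5 tables with 32 entries each (5 address bits)
    print(len(t33), len(t33[0]), len(t32), len(t32[0]))
    print(lut_multiply(t33, 7, 5, 3))
```

The 3×3 case needs six address bits per output, which matches the six LUTs stated above. The 3×2 case needs only five address bits per output, so, assuming two such functions share one LUT with two outputs, five output bits fit in three LUTs.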
LUT-Based Multipliers (3×3-LUT based 8×7 Multiplier)
[Tiling diagram: A (bits 0 to 7) against B (bits 0 to 6), covered by tiles i to iv (3×3), v and vi (2×3), vii and viii (3×1), and ix (2×1)]
Figure: 8×7-bit LUT-Multiplier Implementation in FloPoCo
$AB = A_{0..2}B_{0..2} + 2^{3}(A_{3..5}B_{0..2} + A_{0..2}B_{3..5}) + 2^{6}(A_{6..7}B_{0..2} + A_{3..5}B_{3..5} + A_{0..2}B_{6}) + 2^{9}(A_{6..7}B_{3..5} + A_{3..5}B_{6}) + 2^{12}A_{6..7}B_{6}$ (4)
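Equation (4) can be verified exhaustively with a short Python sketch. The tile list below is transcribed from the nine terms of the equation; the names `TILES`, `field`, and `tiled_multiply` are illustrative only.

```python
# One entry per term of equation (4): (A low bit, A width, B low bit, B width)
TILES = [
    (0, 3, 0, 3),                               # weight 2^0
    (3, 3, 0, 3), (0, 3, 3, 3),                 # weight 2^3
    (6, 2, 0, 3), (3, 3, 3, 3), (0, 3, 6, 1),   # weight 2^6
    (6, 2, 3, 3), (3, 3, 6, 1),                 # weight 2^9
    (6, 2, 6, 1),                               # weight 2^12
]


def field(x, lo, width):
    """Extract bits lo .. lo+width-1 of x."""
    return (x >> lo) & ((1 << width) - 1)


def tiled_multiply(a, b):
    """8x7-bit product as the shifted sum of the nine small products."""
    return sum(
        (field(a, a_lo, a_w) * field(b, b_lo, b_w)) << (a_lo + b_lo)
        for a_lo, a_w, b_lo, b_w in TILES
    )


if __name__ == "__main__":
    ok = all(tiled_multiply(a, b) == a * b for a in range(256) for b in range(128))
    print(ok)
```

The shift of each tile is the sum of the low bit positions of its two operand fields, which is where the weights 3, 6, 9, and 12 in equation (4) come from.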
Simulation
1 Word sizes are even and equal
2 Word sizes are even and unequal
3 Width of the larger word is even and the other is odd
4 Width of the larger word is odd and the other is even
5 Word sizes are odd and unequal
6 Word sizes are odd and equal
Eight designs for each of the above specifications
48 designs for each architecture
Self-checking testbenches were generated using the FloPoCo function emulate(TestCase *tc)
The TestBench 10000 option was used to generate 10000 random test cases during core generation.
Simulation on ModelSim
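The idea of a self-checking random test can be sketched in Python. This does not use FloPoCo or ModelSim: `reference` merely plays the role of an expected-output model, `dut` is a stand-in shift-and-add multiplier, and the word sizes in `SIZES` are illustrative examples of the six categories, not the sizes used in this work.

```python
import random


def reference(a, b):
    """Expected-output model: the exact product."""
    return a * b


def dut(a, b, wb):
    """Stand-in design under test: shift-and-add over the wb bits of b."""
    return sum((((b >> i) & 1) * a) << i for i in range(wb))


def run_testbench(wa, wb, n_cases=10000, seed=0):
    """Apply random operands and count mismatches against the reference."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_cases):
        a = rng.randrange(1 << wa)
        b = rng.randrange(1 << wb)
        if dut(a, b, wb) != reference(a, b):
            failures += 1
    return failures


# Hypothetical (wa, wb) pairs, one per word-size category 1 to 6
SIZES = [(8, 8), (8, 6), (8, 7), (9, 8), (9, 7), (9, 9)]

if __name__ == "__main__":
    for wa, wb in SIZES:
        print(wa, wb, run_testbench(wa, wb))
```

Fixing the seed keeps a failing case reproducible between runs.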
Synthesis Results
[Plot: Speed vs. Complexity (N × M); x-axis: Complexity (N × M), 0 to 4500; y-axis: Frequency (MHz), 0 to 1000; series: Target Specific Multiplier, 3×3 LUT Multiplier, 1×4 LUT Multiplier, 3×2 LUT Multiplier]
f_max = 906.62 MHz for the Target Specific Multiplier
Figure: Comparison of Architectures on the basis of Speed (for N×M-bit)
[Plot: Speed vs. Complexity (N × N); x-axis: Complexity (N), 0 to 70; y-axis: Frequency (MHz), 0 to 700; series: Target Specific Multiplier, 3×3 LUT Multiplier, 1×4 LUT Multiplier, 3×2 LUT Multiplier]
Figure: Comparison of Architectures on the basis of Speed (for N×N-bit)
[Plot: Slice Usage vs. Complexity (N × M); x-axis: Complexity (N × M), 0 to 4500; y-axis: Number of Slices, 0 to 1800; series: Target Specific Multiplier, 3×3 LUT Multiplier, 1×4 LUT Multiplier, 3×2 LUT Multiplier]
Figure: Comparison of Architectures on the basis of Slice usage (for N×M-bit)
[Plot: Slice Usage vs. Complexity (N × N); x-axis: Complexity (N), 0 to 70; y-axis: Number of Slices, 0 to 1800; series: Target Specific Multiplier, 3×3 LUT Multiplier, 1×4 LUT Multiplier, 3×2 LUT Multiplier]
Figure: Comparison of Architectures on the basis of Slice usage (for N×N-bit)
Average Performance
Table: Average values of parameters for different architectures
Architecture      No. of Flip-Flops   No. of LUTs   No. of Slices   Frequency (MHz)
Target Specific   1144                1615          419             346.36
3×3-LUT           1422                1893          491             301.03
3×2-LUT           1730                1962          513             264.95
1×4-LUT           2019                2340          610             259.98
Automatic Vs Manual Interconnection of Slices (8×8-bit)
Table: Automatic Vs Manual routing between Slices
Interconnection             No. of FFs   No. of LUTs   No. of Slices   Frequency (MHz)
Automatic                   56           74            21              686.81
Manual (Bit Heap)           22           74            20              256.61
Manual (Without Bit Heap)   40           59            16              414.08
[Panels: before first compression; before 3-bit height additions; before final addition]
Figure: Bit Heap Structure for Automatic Interconnection of Slices
[Panels: before first compression (timing annotation: 0.530 ns); before 3-bit height additions; before final addition]
Figure: Bit Heap Structure for Manual Interconnection of Slices
Conclusion
Fast multipliers with minimal resource usage can be implemented by choosing an appropriate architecture.
The Target Specific Implementation showed the best results: the highest average speed and the lowest resource consumption.
Automated generation of this approach can be modified by introducing an AND gate for corner elements.
Slice usage can be reduced by interconnecting the slices manually, at the cost of speed.
References
[1] Ian Kuon and J. Rose.
Measuring the Gap Between FPGAs and ASICs.
Computer-Aided Design of Integrated Circuits and Systems, 26:203–215, February 2007.
[2] Xilinx.
Virtex-6 FPGA, Configurable Logic Block User Guide, UG364 (v1.2).
http://www.xilinx.com/support/documentation/user_guides/ug364.pdf, 2012.
[3] F. de Dinechin and B. Pasca.
Designing Custom Arithmetic Data Paths with FloPoCo.
Design and Test of Computers, 28:18–27, 2011.
[4] Florent de Dinechin.
Tutorial held at HiPEAC’2013 “Building Custom Arithmetic Operators with the FloPoCo Generator”.
http://perso.citi-lab.fr/fdedinec/recherche/2013-HiPEAC-Tutorial-FloPoCo/flopoco-tutorial.pdf,
2013.
[5] Brunie N., de Dinechin F., Istoan M., Sergent G., Illyes K., and Popa B.
Arithmetic core generation using bit heaps.
In Proc. IEEE FPL ’2013, pages 1–8, Porto, Portugal, 2–4, 2013.
[6] H. ParandehAfshar and P. Ienne.
Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs.
In Proc. IEEE FPL ’2011, pages 225–231, Chania, Greece, 5–7, 2011.
[7] F. de Dinechin and B. Pasca.
Large multipliers with fewer DSP blocks.
In Proc. IEEE FPL ’2009, pages 225–231, Chania, Greece, Aug 31-Sept 2 2011.