F31 book-arith-pres-pt3

1,774 views

Published on

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,774
On SlideShare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
129
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

F31 book-arith-pres-pt3

  1. 1. Part III Multiplication Parts Chapters 1. 2. 3. 4. Numbers and Arithmetic Representing Signed Numbers Redundant Number Systems Residue Number Systems 5. 6. 7. 8. Basic Addition and Counting Carry-Look ahead Adders Variations in Fast Adders Multioperand Addition III. Multiplication 9. 10. 11. 12. Basic Multiplication Schemes High-Radix Multipliers Tree and Array Multipliers Variations in Multipliers IV. Division 13. 14. 15. 16. Basic Division Schemes High-Radix Dividers Variations in Dividers Division by Convergence 17. 18. 19. 20. Floating-Point Reperesentations Floating-Point Operations Errors and Error Control Precise and Certifiable Arithmetic 21. 22. 23. 24. Square-Rooting Methods The CORDIC Algorithms Variations in Function Evaluation Arithmetic by Table Lookup 25. 26. 27. 28. 28. High-Throughput Arithmetic Low-Power Arithmetic Fault-Tolerant Arithmetic Past, Present, and Future Reconfigurable Arithmetic Elementary Operations I. Number Representation II. Addition / Subtraction V. Real Arithmetic VI. Function Evaluation VII. Implementation Topics Appendix: Past, Present, and Future Apr. 2012 Computer Arithmetic, Multiplicationlide 1 S
  2. 2. About This Presentation This presentation is intended to support the use of the textbook Computer Arithmetic: Algorithms and Hardware Designs (Oxford U. Press, 2nd ed., 2010, ISBN 978-0-19-532848-6). It is updated regularly by the author as part of his teaching of the graduate course ECE 252B, Computer Arithmetic, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Unauthorized uses are strictly prohibited. © Behrooz Parhami Edition Released Revised Revised Revised Revised First Jan. 2000 Sep. 2001 Sep. 2003 Oct. 2005 May 2007 Apr. 2008 Apr. 2009 Apr. 2011 Apr. 2012 Second Apr. 2010 Apr. 2012 Computer Arithmetic, Multiplicationlide 2 S
  3. 3. III Multiplication Review multiplication schemes and various speedup methods • Multiplication is heavily used (in arith & array indexing) • Division = reciprocation + multiplication • Multiplication speedup: high-radix, tree, recursive • Bit-serial, modular, and array multipliers Topics in This Part Chapter 9 Basic Multiplication Schemes Chapter 10 High-Radix Multipliers Chapter 11 Tree and Array Multipliers Chapter 12 Variations in Multipliers Apr. 2012 Computer Arithmetic, Multiplicationlide 3 S
  4. 4. “Well, well, for a rabbit, you’re not very good at multiplying, are you?” Apr. 2012 Computer Arithmetic, Multiplicationlide 4 S
  5. 5. 9 Basic Multiplication Schemes Chapter Goals Study shift/add or bit-at-a-time multipliers and set the stage for faster methods and variations to be covered in Chapters 10-12 Chapter Highlights Multiplication = multioperand addition Hardware, firmware, software algorithms Multiplying 2’s-complement numbers The special case of one constant operand Apr. 2012 Computer Arithmetic, Multiplicationlide 5 S
  6. 6. Basic Multiplication Schemes: Topics Topics in This Chapter 9.1 Shift/Add Multiplication Algorithms 9.2 Programmed Multiplication 9.3 Basic Hardware Multipliers 9.4 Multiplication of Signed Numbers 9.5 Multiplication by Constants 9.6 Preview of Fast Multipliers Apr. 2012 Computer Arithmetic, Multiplicationlide 6 S
  7. 7. 9.1 Shift/Add Multiplication Algorithms Notation for our discussion of multiplication algorithms: a x p Multiplicand Multiplier Product (a × x) p2k–1p2k–2 ak–1ak–2 . . . a1a0 xk–1xk–2 . . . x1x0 . . . p3p2p1p0 Initially, we assume unsigned operands Multiplicand Multiplier x 0 a 20 x 1 a 21 x 2 a 22 x 3 a 23 Partial products bit-matrix p × a x Product Fig. 9.1 Multiplication of two 4-bit unsigned binary numbers in dot notation. Apr. 2012 Computer Arithmetic, Multiplicationlide 7 S
  8. 8. Multiplication Recurrence Partial products bit-matrix p × Multiplicand Multiplier x 0 a 20 x 1 a 21 x 2 a 22 x 3 a 23 Fig. 9.1 a x Product Preferred Multiplication with right shifts: top-to-bottom accumulation p(j+1) = (p(j) + xj a 2k) 2–1 |–––add–––| |––shift right––| with p(0) = 0 and p(k) = p = ax + p(0)2–k Multiplication with left shifts: bottom-to-top accumulation p(j+1) = 2 p(j) + xk–j–1a with |shift| |––––add––––| p(0) = 0 and p(k) = p = ax + p(0)2k Apr. 2012 Computer Arithmetic, Multiplicationlide 8 S
  9. 9. Examples of Basic Multiplication Right-shift algorithm ======================== a 1010 1 0 1 0 x 1 0 1 1 ======================== p(0) 0 0 0 0 +x0a 1 0 1 0 ––––––––––––––––––––––––– 2p(1) 0 1 0 1 0 (1) p 0 1 0 1 0 +x1a 1 0 1 0 ––––––––––––––––––––––––– 2p(2) 0 1 1 1 1 0 (2) p 0 1 1 1 1 0 +x2a 0 0 0 0 ––––––––––––––––––––––––– 2p(3) 0 0 1 1 1 1 0 (3) p 0 0 1 1 1 1 0 +x3a 1 0 1 0 ––––––––––––––––––––––––– 2p(4) 0 1 1 0 1 1 1 0 (4) p 0 1 1 0 1 1 1 0 ======================== Left-shift algorithm ======================= a 1 0 1 0 x 1 0 1 1 ======================= p(0) 0 0 0 0 (0) 2p 0 0 0 0 0 +x3a 1 0 1 0 –––––––––––––––––––––––– p(1) 0 1 0 1 0 2p(1) 0 1 0 1 0 0 +x2a 0 0 0 0 –––––––––––––––––––––––– p(2) 0 1 0 1 0 0 (j+1) (2) = (p(j) + x a 2k) 2–1 p 2p 0 1j 0 1 0 0 0 +x1a 1 |–––add–––| 0 1 0 –––––––––––––––––––––––– |––shift right––| (3) p 0 1 1 0 0 1 0 (3) 2p 0 1 1 0 0 1 0 0 +x0a 1 0 1 0 –––––––––––––––––––––––– p(4) 0 1 1 0 1 1 1 0 ======================= Apr. 2012 Computer Arithmetic, Multiplicationlide 9 S Fig. 9.2 Examples of sequential multiplication with right and left shifts. Check: 10 × 11 = 110 = 64 + 32 + 8+4+2
  10. 10. Examples of Basic Multiplication (Continued) Right-shift algorithm ======================== a 1 0 1 0 x 1 0 1 1 ======================== p(0) 0 0 0 0 +x0a 1 0 1 0 ––––––––––––––––––––––––– 2p(1) 0 1 0 1 0 (1) p 0 1 0 1 0 +x1a 1 0 1 0 ––––––––––––––––––––––––– 2p(2) 0 1 1 1 1 0 (2) p 0 1 1 1 1 0 (j) p(j+1) =0 2 p0 0 xk–j–1a + +x2a 0 |shift| ––––––––––––––––––––––––– 2p(3) 0 0|––––add––––| 1 1 1 1 0 (3) p 0 0 1 1 1 1 0 +x3a 1 0 1 0 ––––––––––––––––––––––––– 2p(4) 0 1 1 0 1 1 1 0 (4) p 0 1 1 0 1 1 1 0 ======================== Left-shift algorithm ======================= a 1 0 1 0 x 1 0 1 1 ======================= p(0) 0 0 0 0 (0) 2p 0 0 0 0 0 +x3a 1 0 1 0 –––––––––––––––––––––––– p(1) 0 1 0 1 0 2p(1) 0 1 0 1 0 0 +x2a 0 0 0 0 –––––––––––––––––––––––– p(2) 0 1 0 1 0 0 (2) 2p 0 1 0 1 0 0 0 +x1a 1 0 1 0 –––––––––––––––––––––––– p(3) 0 1 1 0 0 1 0 (3) 2p 0 1 1 0 0 1 0 0 +x0a 1 0 1 0 –––––––––––––––––––––––– p(4) 0 1 1 0 1 1 1 0 ======================= Fig. 9.2 Examples of sequential multiplication with right and left shifts. Check: 10 × 11 = 110 = 64 + 32 + 8+4+2 Apr. 2012 Computer Arithmetic, Multiplicationlide 10 S
  11. 11. 9.2 Programmed Multiplication {Using right shifts, multiply unsigned m_cand and m_ier, storing the resultant 2k-bit product in p_high and p_low. Registers: R0 holds 0 Rc for counter R0 0 Rc Counter Ra for m_cand Rx for m_ier Rp for p_high Rq for p_low} Ra Multiplicand Rx Multiplier {Load operands into registers Ra and Rx} Rp Product, high Rq Product, low mult: load Ra with m_cand load Rx with m_ier {Initialize partial product and counter} copy R0 into Rp copy R0 into Rq load k into Rc {Begin multiplication loop} m_loop: shift Rx right 1 {LSB moves to carry flag} branch no_add if carry = 0 add Ra to Rp {carry flag is set to cout} no_add: rotate Rp right 1 {carry to MSB, LSB to carry} rotate Rq right 1 {carry to MSB, LSB to carry} decr Rc {decrement counter by 1} branch m_loop if Rc ≠ 0 {Store the product} Fig. 9.3 Programmed store Rp into p_high multiplication (right-shift store Rq into p_low algorithm). m_done: ... Apr. 2012 Computer Arithmetic, Multiplicationlide 11 S
  12. 12. Time Complexity of Programmed Multiplication Assume k-bit words k iterations of the main loop 6-7 instructions per iteration, depending on the multiplier bit Thus, 6k + 3 to 7k + 3 machine instructions, ignoring operand loads and result store k = 32 implies 200+ instructions on average This is too slow for many modern applications! Microprogrammed multiply would be somewhat better Apr. 2012 Computer Arithmetic, Multiplicationlide 12 S
  13. 13. 9.3 Basic Hardware Multipliers Shift Multiplier x Doublewidth partial product p (j) Shift Multiplicand a 0 0 Mux xj a cout k 1 k Adder xj p(j+1) = (p(j) + xj a 2k) 2–1 |–––add–––| |––shift right––| k Fig. 9.4 Hardware realization of the sequential multiplication algorithm with additions and right shifts. Apr. 2012 Computer Arithmetic, Multiplicationlide 13 S
  14. 14. Example of Hardware Multiplication Shift 1 Multiplier x 1 1 1 0 0 0 Doublewidth 1 1 1 1 1 1 1 0 0 0 0 0 0 0partial product p 0 (j) (11)ten (110)ten Shift (10)ten 1 Multiplicand a 0 1 0 0 0 Mux xj a cout k 1 k Adder xj p(j+1) = (p(j) + xj a 2k) 2–1 |–––add–––| |––shift right––| k Fig. 9.4a Hardware realization of the sequential multiplication algorithm with additions and right shifts. Apr. 2012 Computer Arithmetic, Multiplicationlide 14 S
  15. 15. Performing Add and Shift in One Clock Cycle Adder’s carry-out Adder’s sum k k–1 Unused part of the multiplier x Partial product p(j) k To adder k–1 To mux control Fig. 9.5 Combining the loading and shifting of the double-width register holding the partial product and the partially used multiplier. Apr. 2012 Computer Arithmetic, Multiplicationlide 15 S
  16. 16. Sequential Multiplication with Left Shifts Shift Multiplier x Doublewidth partial product p (j) Shift Multiplicand a 2k xk-j-1 0 0 Mux xk-j-1a cout 1 k 2k-bit adder 2k Fig. 9.4b Hardware realization of the sequential multiplication algorithm with left shifts and additions. Apr. 2012 Computer Arithmetic, Multiplicationlide 16 S
  17. 17. 9.4 Multiplication of Signed Numbers Fig. 9.6 Sequential multiplication of 2’s-complement numbers with right shifts (positive multiplier). Negative multiplicand, positive multiplier: No change, other than looking out for proper sign extension ============================ a 1 0 1 1 0 x 0 1 0 1 1 ============================ p(0) 0 0 0 0 0 Check: +x0a 1 0 1 1 0 – 10 × 11 ––––––––––––––––––––––––––––– = –110 (1) 2p 1 1 0 1 1 0 (1) = –512 + p 1 1 0 1 1 0 +x1a 1 0 1 1 0 256 + ––––––––––––––––––––––––––––– 128 + 2p(2) 1 1 0 0 0 1 0 16 + 2 p(2) 1 1 0 0 0 1 0 +x2a 0 0 0 0 0 ––––––––––––––––––––––––––––– 2p(3) 1 1 1 0 0 0 1 0 (3) p 1 1 1 0 0 0 1 0 +x3a 1 0 1 1 0 ––––––––––––––––––––––––––––– 2p(4) 1 1 0 0 1 0 0 1 0 (4) p 1 1 0 0 1 0 0 1 0 +x4a 0 0 0 0 0 ––––––––––––––––––––––––––––– 2p(5) 1 1 1 0 0 1 0 0 1 0 (5) p 1 1 1 0 0 1 0 0 1 0 ============================ Apr. 2012 Computer Arithmetic, Multiplicationlide 17 S
  18. 18. The Case of a Negative Multiplier Fig. 9.7 Sequential multiplication of 2’s-complement numbers with right shifts (negative multiplier). Negative multiplicand, negative multiplier: In last step (the sign bit), subtract rather than add ============================ a 1 0 1 1 0 x 1 0 1 0 1 ============================ p(0) 0 0 0 0 0 Check: +x0a 1 0 1 1 0 – 10 × –11 ––––––––––––––––––––––––––––– = 110 2p(1) 1 1 0 1 1 0 = 64 + 32 + p(1) 1 1 0 1 1 0 +x1a 0 0 0 0 0 8+4+2 ––––––––––––––––––––––––––––– 2p(2) 1 1 1 0 1 1 0 (2) p 1 1 1 0 1 1 0 +x2a 1 0 1 1 0 ––––––––––––––––––––––––––––– 2p(3) 1 1 0 0 1 1 1 0 (3) p 1 1 0 0 1 1 1 0 +x3a 0 0 0 0 0 ––––––––––––––––––––––––––––– 2p(4) 1 1 1 0 0 1 1 1 0 (4) p 1 1 1 0 0 1 1 1 0 +(−x4a) 0 1 0 1 0 ––––––––––––––––––––––––––––– 2p(5) 0 0 0 1 1 0 1 1 1 0 (5) p 0 0 0 1 1 0 1 1 1 0 ============================ Apr. 2012 Computer Arithmetic, Multiplicationlide 18 S
  19. 19. Signed 2’s-Complement Hardware Multiplier Multiplier Partial product k+1 Multiplicand 1 Enable 0 Mux k+1 cout Select Adder k+1 Fig. 9.8 cin 0, except in last cycle The 2’s-complement sequential hardware multiplier. Apr. 2012 Computer Arithmetic, Multiplicationlide 19 S
  20. 20. Booth’s Recoding Table 9.1 Radix-2 Booth’s recoding ––––––––––––––––––––––––––––––––––––– xi xi–1 yi Explanation ––––––––––––––––––––––––––––––––––––– 0 0 0 No string of 1s in sight 0 1 1 End of string of 1s in x − 1 0 1 Beginning of string of 1s in x 1 1 0 Continuation of string of 1s in x ––––––––––––––––––––––––––––––––––––– Example 1 0 0 1 1 1 0 1 (1) −1 0 1 0 0 −1 1 0 1 0 1 0 − 1 1 −1 1 1 1 1 0 0 0 −1 0 Operand x Recoded version y Justification 2j + 2j–1 + . . . + 2i+1 + 2i = 2j+1 – 2i Apr. 2012 Computer Arithmetic, Multiplicationlide 20 S
  21. 21. Example Multiplication with Booth’s Recoding Fig. 9.9 Sequential multiplication of 2’s-complement numbers with right shifts by means of Booth’s recoding. –––––––––– xi xi–1 yi –––––––––– 0 0 0 0 1 1 − 1 0 1 1 1 0 –––––––––– ============================ a 1 0 1 1 0 x 1 0 1 0 1 Multiplier − y 1 1 −1 1 −1 Booth-recoded ============================ p(0) 0 0 0 0 0 Check: +y0a 0 1 0 1 0 – 10 × –11 ––––––––––––––––––––––––––––– = 110 2p(1) 0 0 1 0 1 0 p(1) 0 0 1 0 1 0 = 64 + 32 + +y1a 1 0 1 1 0 8+4+2 ––––––––––––––––––––––––––––– 2p(2) 1 1 1 0 1 1 0 p(2) 1 1 1 0 1 1 0 +y2a 0 1 0 1 0 ––––––––––––––––––––––––––––– 2p(3) 0 0 0 1 1 1 1 0 p(3) 0 0 0 1 1 1 1 0 +y3a 1 0 1 1 0 ––––––––––––––––––––––––––––– 2p(4) 1 1 1 0 0 1 1 1 0 p(4) 1 1 1 0 0 1 1 1 0 y4a 0 1 0 1 0 ––––––––––––––––––––––––––––– 2p(5) 0 0 0 1 1 0 1 1 1 0 p(5) 0 0 0 1 1 0 1 1 1 0 ============================ Apr. 2012 Computer Arithmetic, Multiplicationlide 21 S
  22. 22. 9.5 Multiplication by Constants Explicit, e.g. y := 12 ∗ x + 1 Implicit, e.g. A[i, j] := A[i, j] + B[i, j] Address of A[i, j] = base + n ∗ i + j Software aspects: 0 1 2 . . . n–1 0 1 2 . . Row i . m–1 Column j Optimizing compilers replace multiplications by shifts/adds/subs Produce efficient code using as few registers as possible Find the best code by a time/space-efficient algorithm Hardware aspects: Synthesize special-purpose units such as filters y[t] = a0x[t] + a1x[t – 1] + a2x[t – 2] + b1y[t – 1] + b2y[t – 2] Apr. 2012 Computer Arithmetic, Multiplicationlide 22 S
  23. 23. Multiplication Using Binary Expansion Example: Multiply R1 by the constant 113 = (1 1 1 0 0 0 1)two R2 R3 R6 R7 R112 R113 ← ← ← ← ← ← R1 shift-left 1 R2 + R1 R3 shift-left 1 R6 + R1 R7 shift-left 4 R112 + R1 Shift, add Shift Ri: Register that contains i times (R1) This notation is for clarity; only one register other than R1 is needed Shorter sequence using shift-and-add instructions R3 R7 R113 ← ← ← R1 shift-left 1 + R1 R3 shift-left 1 + R1 R7 shift-left 4 + R1 Apr. 2012 Computer Arithmetic, Multiplicationlide 23 S
  24. 24. Multiplication via Recoding Example: Multiply R1 by 113 = (1 1 1 0 0 0 1)two = (1 0 0−1 0 0 0 1)two R8 R7 R112 R113 ← ← ← ← R1 shift-left 3 R8 – R1 R7 shift-left 4 R112 + R1 Shift Shift, subtract Shift, add Shorter sequence using shift-and-add/subtract instructions R7 R113 ← ← R1 shift-left 3 – R1 R7 shift-left 4 + R1 6 shift or add (3 shift-and-add) instructions needed without recoding The canonic signed-digit representation of a number contains no consecutive nonzero digits: average number of shift-adds is O(k/3) Apr. 2012 Computer Arithmetic, Multiplicationlide 24 S
  25. 25. Multiplication via Factorization Example: Multiply R1 by 119 = 7 × 17 = (8 – 1) × (16 + 1) R8 R7 R112 R119 ← ← ← ← R1 shift-left 3 R8 – R1 R7 shift-left 4 R112 + R7 Shorter sequence using shift-and-add/subtract instructions R7 R119 ← ← R1 shift-left 3 – R1 R7 shift-left 4 + R7 Requires a scratch register for holding the 7 multiple 119 = (1 1 1 0 1 1 1)two = (1 0 0 0−1 0 0−1)two More instructions may be needed without factorization Apr. 2012 Computer Arithmetic, Multiplicationlide 25 S
  26. 26. Multiplication by Multiple Constants Example: Multiplying a number by 45, 49, and 65 R9 R45 ← ← R1 shift-left 3 + R1 R9 shift-left 2 + R9 R7 R49 ← ← R1 shift-left 3 – R1 R7 shift-left 3 – R7 R65 ← Separate solutions: 5 shift-add/subtract operations R1 shift-left 6 + R1 A combined solution for all three constants R65 R49 R45 ← ← ← R1 shift-left 6 + R1 R65 – R1 left-shift 4 R49 – R1 left-shift 2 A programmable block can perform any of the three multiplications Apr. 2012 Computer Arithmetic, Multiplicationlide 26 S
  27. 27. 9.6 Preview of Fast Multipliers Viewing multiplication as a multioperand addition problem, there are but two ways to speed it up a. Reducing the number of operands to be added: Handling more than one multiplier bit at a time (high-radix multipliers, Chapter 10) b. Adding the operands faster: Parallel/pipelined multioperand addition (tree and array multipliers, Chapter 11) In Chapter 12, we cover all remaining multiplication topics: Bit-serial multipliers Modular multipliers Multiply-add units Squaring as a special case Apr. 2012 Computer Arithmetic, Multiplicationlide 27 S
  28. 28. 10 High-Radix Multipliers Chapter Goals Study techniques that allow us to handle more than one multiplier bit in each cycle (two bits in radix 4, three in radix 8, . . .) Chapter Highlights High radix gives rise to “difficult” multiples Recoding (change of digit-set) as remedy Carry-save addition reduces cycle time Implementation and optimization methods Apr. 2012 Computer Arithmetic, Multiplicationlide 28 S
  29. 29. High-Radix Multipliers: Topics Topics in This Chapter 10.1 Radix-4 Multiplication 10.2 Modified Booth’s Recoding 10.3 Using Carry-Save Adders 10.4 Radix-8 and Radix-16 Multipliers 10.5 Multibeat Multipliers 10.6 VLSI Complexity Issues Apr. 2012 Computer Arithmetic, Multiplicationlide 29 S
  30. 30. 10.1 Radix-4 Multiplication Fig. 9.1 (modified) × a x Multiplicand Multiplier x0 r 0 x 0 a 20 x 1 a 21 x1 r 1 x 2 a 22 x2 a r 2 3 x 3 a 23 x3 a r p Partial products bit-matrix Product Preferred Multiplication with right shifts in radix r : top-to-bottom accumulation p(j+1) = (p(j) + xj a r k) r –1 |–––add–––| |––shift right––| with p(0) = 0 and p(k) = p = ax + p(0)r –k Multiplication with left shifts in radix r : bottom-to-top accumulation p(j+1) = r p(j) + xk–j–1a with |shift| |––––add––––| p(0) = 0 and p(k) = p = ax + p(0)r k Apr. 2012 Computer Arithmetic, Multiplicationlide 30 S
  31. 31. Radix-4 Multiplication in Dot Notation a x x 0 a 20 x 1 a 21 x 2 a 22 x 3 a 23 Partial products bit-matrix p × Multiplicand Multiplier Product Fig. 10.1 Radix-4, or two-bit-at-a-time, multiplication in dot notation Fig. 9.1 Number of cycles is halved, but now the “difficult” multiple 3a must be dealt with × a x (x x ) a 4 0 1 0 two (x 3x 2) twoa 4 1 Multiplicand Multiplier p Product Apr. 2012 Computer Arithmetic, Multiplicationlide 31 S
  32. 32. A Possible Design for a Radix-4 Multiplier Precomputed via shift-and-add (3a = 2a + a) Multiplier 3a 0 a 2a 00 01 10 11 k/2 + 1 cycles, rather than k One extra cycle over k/2 not too bad, but we would like to avoid it if possible Solving this problem for radix 4 may also help when dealing with even higher radices 2-bit shifts x i+1 Mux To the adder Fig. 10.2 The multiple generation part of a radix-4 multiplier with precomputation of 3a. Apr. 2012 Computer Arithmetic, Multiplicationlide 32 S xi
  33. 33. Example Radix-4 Multiplication Using 3a ================================ a 0 1 1 0 3a 0 1 0 0 1 0 x 1 1 1 0 ================================ p(0) 0 0 0 0 +(x1x0)twoa 0 0 1 1 0 0 ––––––––––––––––––––––––––––––––– 4p(1) 0 0 1 1 0 0 p(1) 0 0 1 1 0 0 +(x3x2)twoa 0 1 0 0 1 0 ––––––––––––––––––––––––––––––––– 4p(2) 0 1 0 1 0 1 0 0 p(2) 0 1 0 1 0 1 0 0 ================================ × a x (x x ) 1 0 (x 3x 2) p Fig. 10.3 Example of radix-4 multiplication using the 3a multiple. Apr. 2012 Computer Arithmetic, Multiplicationlide 33 S
  34. 34. A Second Design for a Radix-4 Multiplier Fig. 10.4 The multiple generation part of a radix-4 multiplier based on replacing 3a with 4a (carry into next higher radix-4 multiplier digit) and –a. Multiplier 2-bit shifts 0 a 2a –a 00 01 10 11 Mux To the adder x i+1 xi c Carry +c FF mod 4 c xi+1 ⊕ xi c xi ⊕ c xi+1 ---0 0 0 0 1 1 1 1 xi --0 0 1 1 0 0 1 1 c --0 1 0 1 0 1 0 1 Set if x c) xi+1(xi ∨ i+1 = x i = 1 or if x i+1 = c = 1 Mux control ---------------0 0 0 1 0 1 1 0 1 0 1 1 1 1 0 0 Set carry -----------0 0 0 0 0 1 1 1 Apr. 2012 Computer Arithmetic, Multiplicationlide 34 S
  35. 35. 10.2 Modified Booth’s Recoding Table 10.1 Radix-4 Booth’s recoding yielding (zk/2 . . . z1z0)four ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– xi+1 xi xi–1 yi+1 yi zi/2 Explanation ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– 0 0 0 0 0 0 No string of 1s in sight 0 0 1 0 1 1 End of string of 1s 0 1 0 0 1 1 Isolated 1 0 1 1 1 0 2 End of string of 1s − − 1 0 0 1 0 2 Beginning of string of 1s − − 1 0 1 1 1 1 End a string, begin new one − − 1 1 0 0 1 1 Beginning of string of 1s 1 1 1 0 0 0 Continuation of string of 1s ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– Recoded Context Radix-4 digit radix-2 digits Example 1 0 0 1 1 1 0 1 (1) −1 0 1 0 0 −1 1 0 (1) − 2 2 − 1 2 1 0 1 0 − 1 1 −1 1 1 − − 1 1 1 1 0 0 0 −1 0 0 − 2 Operand x Recoded version y Radix-4 version z Apr. 2012 Computer Arithmetic, Multiplicationlide 35 S
  36. 36. Example Multiplication via Modified Booth’s Recoding ================================ a 0 1 1 0 x 1 0 1 0 − z 1 −2 Radix-4 ================================ p(0) 0 0 0 0 0 0 +z0a 1 1 0 1 0 0 ––––––––––––––––––––––––––––––––– 4p(1) 1 1 0 1 0 0 p(1) 1 1 1 1 0 1 0 0 +z1a 1 1 1 0 1 0 ––––––––––––––––––––––––––––––––– 4p(2) 1 1 0 1 1 1 0 0 p(2) 1 1 0 1 1 1 0 0 ================================ × a x (x x ) 1 0 (x 3x 2) p Fig. 10.5 Example of radix-4 multiplication with modified Booth’s recoding of the 2’scomplement multiplier. Apr. 2012 Computer Arithmetic, Multiplicationlide 36 S
  37. 37. Multiple Generation with Radix-4 Booth’s Recoding M ultip lier 2-bit shift x i+1 M ultip licand Init. 0 xi x i–1 k Sign of a Recoding Logic neg two Could have named this signal one/two non0 0 Enable 0 Select 0 a M ux 2a 1 ---- Encoding ---Digit neg two non0 –2 1 1 1 –1 1 0 1 0 0 0 0 1 0 0 1 2 0 1 1 0, a, or 2a k+1 Add/subtract control z i/2 a To adder inp ut Fig. 10.6 The multiple generation part of a radix-4 multiplier based on Booth’s recoding. Apr. 2012 Computer Arithmetic, Multiplicationlide 37 S
  38. 38. 10.3 Using Carry-Save Adders Multiplier 2a 0 Mux a 0 x i+1 xi Mux Old Cumulative Partial Product CSA Adder New Cumulative Partial Product Fig. 10.7 Radix-4 multiplication with a carry-save adder used to combine the cumulative partial product, xia, and 2xi+1a into two numbers. Apr. 2012 Computer Arithmetic, Multiplicationlide 38 S
  39. 39. Keeping the Partial Product in Carry-Save Form Sum Mux Multiplier Carry Partial Product k k Sum Right shift Carry Multiplicand 0 Mux k-Bit CSA Carry k Upper half of PP (a) Multiplier k block diagram Sum k-Bit Adder Lower half of PP (b) Operation in a typical cycle Fig. 10.8 Radix-2 multiplication with the upper half of the cumulative partial product kept in stored-carry form. Apr. 2012 Computer Arithmetic, Multiplicationlide 39 S
  40. 40. Carry-Save Multiplier with Radix-4 Booth’s Recoding a Multiplier x i+1 x i x i-1 Booth recoder and selector Old cumulative partial product z i/2 a CSA New cumulative partial product 2-bit Adder Adder FF Extra “dot” To the lower half of partial product Fig. 10.9 Radix-4 multiplication with a CSA used to combine the stored-carry cumulative partial product and zi/2a into two numbers. Apr. 2012 Computer Arithmetic, Multiplicationlide 40 S
  41. 41. Radix-4 Booth’s Recoding for Parallel Multiplication x i+2 x i+1 x x i i–1 x i–2 Recoding Logic neg two non0 a 2a 1 Enable 0 Mux Select Fig. 10.10 Booth recoding and multiple selection logic for high-radix or parallel multiplication. 0, a, or 2a k+1 Selective Complement k+2 Extra "Dot" for Column i 0, a, –a, 2a, or –2a z i/2 a Apr. 2012 Computer Arithmetic, Multiplicationlide 41 S
  42. 42. Yet Another Design for Radix-4 Multiplication Multiplier 2a 0 Mux Old Cumulative Partial Product (4; 2)-counter CSA New Cumulative Partial Product CSA Adder x i+1 a 0 xi Mux Fig. 10.11 Radix-4 multiplication, with the cumulative partial product, xia, and 2xi+1a combined into two numbers by two CSAs. FF 2-Bit Adder T o the Lower Half of Partial Product Apr. 2012 Computer Arithmetic, Multiplicationlide 42 S
  43. 43. 10.4 Radix-8 and Radix-16 Multipliers M ultip lier 0 8a Mux 0 x i+3 4a Mux 4-bit right shift 0 4-Bit Shift Mux 0 x i+1 a Mux CSA Fig. 10.12 Radix-16 multiplication with the upper half of the cumulative partial product in carry-save form. x i+2 2a xi CSA CSA CSA 4 Sum 3 Carry Partial Product (Up p er Half) FF 4-Bit Adder 4 T o the Lower Half of Partial Product Apr. 2012 Computer Arithmetic, Multiplicationlide 43 S
  44. 44. Other High-Radix Multipliers Remove this mux & CSA and replace the 4-bit shift (adder) with a 3-bit shift (adder) to get a radix-8 multiplier (cycle time will remain the same, though) 0 8a Mux 0 x i+3 4a Mux 0 4-Bit Shift Mux x i+2 2a 0 x i+1 a Mux CSA A radix-16 multiplier design becomes a radix-256 multiplier if radix-4 Booth’s recoding is applied first (the muxes are replaced by Booth recoding and multiple selection logic) M ultip lier xi CSA Fig. 10.12 CSA CSA 4 Sum 3 Carry Partial Product (Up p er Half) FF 4-Bit Adder 4 T o the Lower Half of Partial Product Apr. 2012 Computer Arithmetic, Multiplicationlide 44 S
  45. 45. A Spectrum of Multiplier Design Choices Next multip le Several multip les ... Small CSA tree Adder Partial product Sp eed up ... Full CSA tree Partial product Adder Basic binary All multip les High-radix or p artial tree Adder Full Economize tree Fig. 10.13 High-radix multipliers as intermediate between sequential radix-2 and full-tree multipliers. Apr. 2012 Computer Arithmetic, Multiplicationlide 45 S
  46. 46. 10.5 Multibeat Multipliers Inputs Present state Next-state logic State flip-flops Next-state excitation PH1 Next-state logic State latches CLK (a) Sequential machine with FFs Fig. 10.15 Inputs State latches Next-state logic PH2 Inputs (b) Sequential machine with latches and 2-phase clock Two-phase clocking for sequential logic. Begin changing FF contents Change becomes visible at FF output Once cycle Observation: Half of the clock cycle goes to waste Apr. 2012 Computer Arithmetic, Multiplicationlide 46 S
  47. 47. Twin-Beat and Three-Beat Multipliers Twin M ultip lier Registers 3a a 4 a 3a 4 Pipelined Radix-8 Booth Recoder & Selector This radix-64 multiplier runs at the clock rate of a radix-8 design (2X speed) Pipelined Radix-8 Booth Recoder & Selector CSA CSA Sum Sum Carry Node 3 Beat-1 Input Node 1 Carry Beat-2 Input 5 FF Adder 6 6-Bit Adder 6 T o the Lower Half of Partial Product Fig. 10.14 Twin-beat multiplier with radix-8 Booth’s recoding. Node 2 Beat-3 Input Fig. 10.16 Conceptual view of a three-beat multiplier. Apr. 2012 Computer Arithmetic, Multiplicationlide 47 S
  48. 48. 10.6 VLSI Complexity Issues A radix-2b multiplier requires: bk two-input AND gates to form the partial products bit-matrix O(bk) area for the CSA tree At least Θ(k) area for the final carry-propagate adder Total area: Latency: A = O(bk) T = O((k/b) log b + log k) Any VLSI circuit computing the product of two k-bit integers must satisfy the following constraints: AT grows at least as fast as k3/2 AT2 is at least proportional to k2 The preceding radix-2b implementations are suboptimal, because: AT = AT2 = O(k2 log b + bk log k) O((k3/b) log2b) Apr. 2012 Computer Arithmetic, Multiplicationlide 48 S
  49. 49. Comparing High- and Low-Radix Multipliers AT = O(k2 log b + bk log k) AT 2 = O((k3/b) log2b) Low-Cost b = O(1) High Speed b = O(k) AT- or AT 2Optimal AT O(k 2) O(k 2 log k) O(k 3/2) AT 2 O(k 3) O(k 2 log2 k) O(k 2) Intermediate designs do not yield better AT or AT 2 values; The multipliers remain asymptotically suboptimal for any b By the AT measure (indicator of cost-effectiveness), slower radix-2 multipliers are better than high-radix or tree multipliers Thus, when an application requires many independent multiplications, it is more cost-effective to use a large number of slower multipliers High-radix multiplier latency can be reduced from O((k/b) log b + log k) to O(k/b + log k) through more effective pipelining (Chapter 11) Apr. 2012 Computer Arithmetic, Multiplicationlide 49 S
  50. 50. 11 Tree and Array Multipliers Chapter Goals Study the design of multipliers for highest possible performance (speed, throughput) Chapter Highlights Tree multiplier = reduction tree + redundant-to-binary converter Avoiding full sign extension in multiplying signed numbers Array multiplier = one-sided reduction tree + ripple-carry adder Apr. 2012 Computer Arithmetic, Multiplicationlide 50 S
  51. 51. Tree and Array Multipliers: Topics Topics in This Chapter 11.1. Full-Tree Multipliers 11.2. Alternative Reduction Trees 11.3. Tree Multipliers for Signed Numbers 11.4. Partial-Tree and Truncated Multipliers 11.5. Array Multipliers 11.6. Pipelined Tree and Array Multipliers Apr. 2012 Computer Arithmetic, Multiplicationlide 51 S
  52. 52. 11.1 Full-Tree Multipliers a x (x x ) a 4 0 1 0 two (x 3x 2) twoa 4 1 p × Next multip le Multiplicand Multiplier Product Several multip les ... ... Small CSA tree Adder Partial product Sp eed up Full CSA tree Partial product Adder Basic binary All multiples High-radix or p artial tree Adder Full Economize tree Fig. 10.13 High-radix multipliers as intermediate between sequential radix-2 and full-tree multipliers. Fig. 11.1 M ultiplier ... a M ultipleForming Circuits a a . . . a Partial-Products Reduction Tree (M ulti-Operand Addition Tree) Redundant result Redundant-to-Binary Converter Higher-order product bits Some lower-order product bits are generated directly General structure of a full-tree multiplier. Apr. 2012 Computer Arithmetic, Multiplicationlide 52 S
  53. 53. Full-Tree versus Partial-Tree Multiplier All partial products Several partial products ... ... Large tree of carry-save adders Logdepth Adder Logdepth Product Small tree of carry-save adders Adder Product Schematic diagrams for full-tree and partial-tree multipliers. Apr. 2012 Computer Arithmetic, Multiplicationlide 53 S
  54. 54. Variations in Full-Tree Multiplier Design Designs are distinguished by variations in three elements: 1. Multiple-forming circuits M ultiplier ... a M ultipleForming Circuits a a . . . a Partial-Products Reduction Tree 2. Partial products reduction tree (M ulti-Operand Addition Tree) Fig. 11.1 Redundant result 3. Redundant-to-binary converter Redundant-to-Binary Converter Higher-order product bits Some lower-order product bits are generated directly Apr. 2012 Computer Arithmetic, Multiplicationlide 54 S
  55. 55. Example of Variations in CSA Tree Design × a x Wallace Tree Multiplicand Multiplier (5 FAs + 3 HAs + 4-Bit Adder) x 0 a 20 x 1 a 21 x 2 a 22 x 3 a 23 1 Partial products bit-matrix 2 3 4 3 2 1 FA FA FA HA p Product -------------------1 3 2 3 2 1 1 FA HA FA HA ---------------------2 2 2 2 1 1 1 4-Bit Adder ---------------------1 1 1 1 1 1 1 1 Fig. 11.2 Corrections shown in red Dadda Tree (4 FAs + 2 HAs + 6-Bit Adder) 1 2 3 4 3 2 1 FA FA HA HA -------------------1 3 3 3 3 2 1 2 2 FA FA FA HA HA HA FA ---------------------2 2 2 2 2 2 1 1 6-Bit Adder ---------------------1 1 1 1 1 1 1 1 Two different binary 4 × 4 tree multipliers. Apr. 2012 Computer Arithmetic, Multiplicationlide 55 S
  56. 56. Details of a CSA Tree [0, 6] [2, 8] [1, 7] [3, 9] [5, 11] [4, 10] [6, 12] [1, 6] Fig. 11.3 Possible CSA tree for a 7 × 7 tree multiplier. 7-bit CSA [2, 8] 7-bit CSA [1,8] [3, 11] [5, 11] 7-bit CSA [6, 12] [3, 12] [2, 8] CSA trees are quite irregular, causing some difficulties in VLSI realization Thus, our motivation to examine alternate methods for partial products reduction 7-bit CSA [3,9] [2,12] [3,12] T he index pair [i, j] means that bit positions from i up to j are involved. 10-bit CSA [3,12] [4,13] [4,12] 10-bit CPA Ignore [4, 13] Apr. 2012 Computer Arithmetic, Multiplicationlide 56 S 3 2 1 0
  57. 57. 11.2 Alternative Reduction Trees FA Inputs Level-1 carries FA 11 + ψ1 = 2ψ1 + 3 Therefore, ψ1 = 8 carries are needed FA FA FA FA FA FA FA Level-2 carries Level 2 FA FA FA Level-3 carries Fig. 11.4 A slice of a balanced-delay tree for 11 inputs. Level 1 FA Level-4 carry FA FA Level 3 FA FA Outputs Apr. 2012 Computer Arithmetic, Multiplicationlide 57 S Level 4 FA Level 5
  58. 58. Binary Tree of 4-to-2 Reduction Modules 4-to-2 4-to-2 4-to-2 4-to-2 compressor 4-to-2 FA 4-to-2 0 4-to-2 1 0 FA 4-to-2 c (a) Binary tree of (4; 2)-counters 1 s (b) Realization with FAs c s (c) A faster realization Fig. 11.5 Tree multiplier with a more regular structure based on 4-to-2 reduction modules. Due to its recursive structure, a binary tree is more regular than a 3-to-2 reduction tree when laid out in VLSI Apr. 2012 Computer Arithmetic, Multiplicationlide 58 S
  59. 59. Example Multiplier with 4-to-2 Reduction Tree Similarly, using Booth’s recoding may not yield any advantage, because it introduces irregularity Multiple generation circuits ... M u l t i p l i c a n d Redundant-to-binary converter Fig. 11.6 Layout of a partial-products reduction tree composed of 4-to-2 reduction modules. Each solid arrow represents two numbers. Apr. 2012 Computer Arithmetic, Multiplicationlide 59 S M ul t ipl e s elect i on si gnals Even if 4-to-2 reduction is implemented using two CSA levels, design regularity potentially makes up for the larger number of logic levels
  60. 60. 11.3 Tree Multipliers for Signed Numbers ---------- Extended positions ---------- Sign Magnitude positions --------- xk–1 yk–1 zk–1 xk–1 yk–1 zk–1 xk–2 yk–2 zk–2 xk–1 yk–1 zk–1 xk–1 yk–1 zk–1 xk–1 yk–1 zk–1 From Fig. 8.19a xk–1 yk–1 zk–1 xk–3 yk–3 zk–3 xk–4 yk–4 zk–4 ... ... ... Sign extension in multioperand addition. Signs Sign extensions α α α α α α α α x x x x x x x β β β β β β β x x x x x x x x γ γ γ γ γ γ x x x x x x x x x Five redundant copies removed FA FA FA FA FA The difference in multiplication is the shifting sign positions αβγ FA Fig. 11.7 Sharing of full adders to reduce the CSA width in a signed tree multiplier. Apr. 2012 Computer Arithmetic, Multiplicationlide 60 S
  61. 61. Using the Negative-Weight Property of the Sign Bit Sign extension is a way of converting negatively weighted bits (negabits) to positively weighted bits (posibits) to facilitate reduction, but there are other methods of accomplishing the same without introducing a lot of extra bits a4 a3 a2 a1 a0 x4 x3 x2 x1 x0 ---------------------------a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 a4 x3 a3 x3 a2 x3 a1 x3 a0 x3 a4 x4 a3 x4 a2 x4 a1 x4 a0 x4 -------------------------------------------------------p9 p8 p7 p6 p5 p4 p3 p2 p1 p0 a. Unsigned × a4 a3 a2 a1 a0 x x x x x 4 3 2 1 0 ----------------------------a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 -a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 -a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 -a4 x3 a3 x3 a2 x3 a1 x3 a0 x3 a4 x4 -a3 x4 -a2 x4 -a1 x4 -a0 x4 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0 b. 2's-complement × a4 a3 a2 a1 a0 x4 x3 x2 x1 x0 ---------------------------_ _ a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 _ a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 _ a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 a4 x3 a3 x3 a2 x3 a1 x3 a0 x3 _ _ _ _ a x a x a x a x a x 4 4 3 4 2 4 1 4 0 4 a4 a4 1 x4 x4 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0 Fig. 11.8 Baugh-Wooley 2’s-complement multiplication. × d. Modified B-W Baugh and Wooley have contributed two such methods c. Baugh-Wooley × __ __ a4 a3 a2 a1 a0 x x x x x 4 3 2 1 0 --------------------------__ a x a x a x a x a x 4 0 3 0 2 0 1 0 0 0 a x a x a x a x 3 x1 a2 x1 a1 x1 0 1 a 2 2 1 2 0 2 a x __3 a0 x3 1 a x 0 4 a x a x a4 x1 __ 4 2 3 2 a x __3 a3 x3 a2 x3 __ __ 4 a x a x a x a x 4 4 3 4 2 4 1 4 1 1 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0 Apr. 2012 Computer Arithmetic, Multiplicationlide 61 S
  62. 62. a4 x4 -a3 x4 -a2 x4 -a1 x4 -a0 x4 -------------------------------------------------------p9 p8 p7 p6 p5 p4 p3 p2 p1 p0 The Baugh-Wooley Method and Its Modified Form Fig. 11.8 –a4x0 = a4(1 – x0) – = a4x0′ – a4 –a4 a4x0′ a4 In next column a4 a3 a2 a1 a0 x4 x3 x2 x1 x0 × ---------------------------_ a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 _ a4 _ a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 _ a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 a4 x3 a3 x3 a2 x3 a1 x3 a0 x3 _ _ _ _ a4 x4 a3 x4 a2 x4 a1 x4 a0 x4 a4 a4 1 x4 x4 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0 c. Baugh-Wooley d. Modified B-W –a4x0 = (1 – a4x0) – 1 = (a4x0)′ – 1 –1 (a4x0)′ 1 In next column × __ __ a4 a3 a2 a1 a0 x x x x x 4 3 2 1 0 --------------------------__ a x a x a x a x a x 4 0 3 0 2 0 1 0 0 0 a x a x a x a x 0 1 a3 x1 a2 x1 a1 x1 2 2 1 2 0 2 a x a x __3 0 3 1 a x 0 4 a x a x a4 x1 __ 4 2 3 2 a x a x a x __3 __3 __3 4 3 2 a x a x a x a x 4 4 3 4 2 4 1 4 1 1 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0 Apr. 2012 Computer Arithmetic, Multiplicationlide 62 S
  63. 63. Alternate Views of the Baugh-Wooley Methods + 0 0 –a4x3 –a4x2 –a4x1 –a4x0 + 0 0 –a3x4 –a2x4 –a1x4 –a0x4 -------------------------------------------– 0 0 a4x3 a4x2 a4x1 a4x0 – 0 0 a3x4 a2x4 a1x4 a0x4 -------------------------------------------+ 1 1 a4x3 a4x2 a4x1 a4x0 + 1 1 a3x4 a2x4 a1x4 a0x4 1 1 -------------------------------------------+ a4 a4 a4x3 a4x2 a4x1 a4x0 + x4 x4 a4 a3x4 a2x4 a1x4 a0x4 a4 1 x4 x4 -------------------------------------------- a4 a3 a2 a1 a0 x4 x3 x2 x1 x0 ---------------------------a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 a4 x3 a3 x3 a2 x3 a1 x3 a0 x3 a4 x4 a3 x4 a2 x4 a1 x4 a0 x4 -------------------------------------------------------p9 p8 p7 p6 p5 p4 p3 p2 p1 p0 a. Unsigned × a4 a3 a2 a1 a0 x x x x x 4 3 2 1 0 ----------------------------a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 -a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 -a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 -a4 x3 a3 x3 a2 x3 a1 x3 a0 x3 a4 x4 -a3 x4 -a2 x4 -a1 x4 -a0 x4 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0 b. 2's-complement × a4 a3 a2 a1 a0 x4 x3 x2 x1 x0 ---------------------------_ _ a4 x0 a3 x0 a2 x0 a1 x0 a0 x0 _ a4 x1 a3 x1 a2 x1 a1 x1 a0 x1 _ a4 x2 a3 x2 a2 x2 a1 x2 a0 x2 a4 x3 a3 x3 a2 x3 a1 x3 a0 x3 _ _ _ _ a x a x a x a x a x 4 4 3 4 2 4 1 4 0 4 a4 a4 1 x4 x4 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0 c. Baugh-Wooley × d. Modified B-W × __ __ a4 a3 a2 a1 a0 x x x x x 4 3 2 1 0 --------------------------__ a x a x a x a x a x 4 0 3 0 2 0 1 0 0 0 a x a x a x a x 3 x1 a2 x1 a1 x1 0 1 a 2 2 1 2 0 2 a x __3 a0 x3 1 a x 0 4 a x a x a4 x1 __ 4 2 3 2 a x __3 a3 x3 a2 x3 __ __ 4 a x a x a x a x 4 4 3 4 2 4 1 4 1 1 -------------------------------------------------------p p p p p p p p p p 9 8 7 6 5 4 3 2 1 0 Apr. 2012 Computer Arithmetic, Multiplicationlide 63 S
  64. 64. 11.4 Partial-Tree and Truncated Multipliers h inputs High-radix versus partialtree multipliers: The difference is quantitative, not qualitative For small h, say ≤ 8 bits, we view the multiplier of Fig. 11.9 as high-radix When h is a significant fraction of k, say k/2 or k/4, then we tend to view it as a partial-tree multiplier Upper part of the cumulative partial product (stored-carry) ... CSA Tree Sum Carry Better design through pipelining to be covered in Section 11.6 FF Adder h-Bit Adder Lower part of the cumulative partial product Fig. 11.9 General structure of a partial-tree multiplier. Apr. 2012 Computer Arithmetic, Multiplicationlide 64 S
  65. 65. Why Truncated Multipliers? Nearly half of the hardware in array/tree multipliers is there to get the last bit right (1 dot = one FPGA cell) ulp .         k-by-k fractional × .         multiplication --------------------------------.       | Max error = 8/2 + 7/4 .      |  + 6/8 + 5/16 + 4/32 .     |   + 3/64 + 2/128 .    |    + 1/256 = 7.004 ulp .   |     .  |      Mean error = . |       1.751 ulp . |        --------------------------------.        |        Fig. 11.10 The idea of a truncated multiplier with 8-bit fractional operands. Apr. 2012 Computer Arithmetic, Multiplicationlide 65 S
  66. 66. Truncated Multipliers with Error Compensation We can introduce additional “dots” on the left-hand side to compensate for the removal of dots from the right-hand side Constant compensation Variable compensation . . . . . . . . . . . . . . . . o o o o o o o o o o o o o o o 1 o o o o o o o| o| o| o| o| o| o| | o o o o o o o o o o o o o o o o o| o o| o o| o o| o o| o o| x-1o| y-1 | Constant and variable error compensation for truncated multipliers. Max error = +4 ulp Max error ≅ −3 ulp Max error = +? ulp Max error ≅ −? ulp Mean error = ? ulp Mean error = ? ulp Apr. 2012 Computer Arithmetic, Multiplicationlide 66 S
  67. 67. 11.5 Array Multipliers x2 a x1 a x0 a CSA x3 a CSA x4 a CSA CSA a4 x0 0 a3 x0 0 a2 x0 0 a1 x0 0 a0 x0 0 p0 a3 x1 a4 x1 a2 x1 a0 x1 p1 a3 x2 a4 x2 a2 x2 a1 x2 a0 x 2 p2 a3 x3 a4 x3 a2 x3 a1 x3 a0 x3 p3 a3 x4 Rip p le-Carry Adder a1 x1 a1 x4 a2 x4 a4 x4 a0 x4 0 ax Fig. 11.11 A basic array multiplier uses a one-sided CSA tree and a ripple-carry adder. p9 p8 p7 p6 Fig. 11.12 Details of a 5 × 5 array multiplier using FA blocks. Apr. 2012 Computer Arithmetic, Multiplicationlide 67 S p4 p5
  68. 68. Signed (2’s-complement) Array Multiplier Fig. 11.13 Modifications in a 5 × 5 array multiplier to deal with 2’s-complement inputs using the Baugh-Wooley method or to shorten the critical path. a4 x4 _ a4 _ x4 _ a4 x0 0 a3 x0 0 a2 x0 0 a1 x0 0 a0 x0 p0 a3 x1 _ a4 x1 a2 x1 a1 x1 a0 x1 p1 a3 x2 _ a4 x2 a2 x2 a1 x2 a0 x 2 p2 a3 x3 _ a4 x3 a2 x3 1 a0 x3 _ a2 x4 _ a3 x4 a1 x3 _ a1 x4 _ a0 x4 p3 a4 x4 p9 p8 p7 p6 Apr. 2012 Computer Arithmetic, Multiplicationlide 68 S p5 p4
  69. 69. Array Multiplier Built of Modified Full-Adder Cells Fig. 11.14 Design of a 5 × 5 array multiplier with two additive inputs and full-adder blocks that include AND gates. a4 a3 a2 a1 a0 x0 p0 p1 p2 p3 FA p4 p9 p8 p7 p6 Apr. 2012 Computer Arithmetic, Multiplicationlide 69 S p5 x1 x2 x3 x4
  70. 70. Array Multiplier without a Final Carry-Propagate Adder Fig. 11.15 Conceptual view of a modified array multiplier that does not need a final carry-propagate adder. .. . M ux i Fig. 11.16 Carry-save addition, performed in level i, extends the conditionally computed bits of the final product. i Conditional bits Dots in row i Dots in row i + 1 i + 1 Conditional bits of the final product Bi+1 Level i M ux i i i+1 Bi+1 i+1 M ux k Bi Bi i .. . k M ux k [k, 2k–1] k–1 i+1 i i–1 All remaining bits of the final product produced only 2 gate levels after pk–1 Apr. 2012 Computer Arithmetic, Multiplicationlide 70 S 1 0
  71. 71. 11.6 Pipelined Tree and Array Multipliers h inputs ... h inputs Upper part of the cumulative partial product (stored-carry) ... Pipelined CSA Tree Latches (h + 2)-input CSA tree Latches Latches CSA Tree CSA Sum Latch CSA Carry FF Adder h-Bit Adder Lower part of the cumulative partial product Sum Carry FF Adder Fig. 11.9 General structure of a partial-tree multiplier. h-Bit Adder Lower part of the cumulative partial product Fig. 11.17 Efficiently pipelined partial-tree multiplier. Apr. 2012 Computer Arithmetic, Multiplicationlide 71 S
  72. 72. Pipelined Array Multipliers a4 a3 a2 a1 a0 x0 x1 x2 x3 x4 p4 p3 p2 p1 p0 With latches after every FA level, the maximum throughput is achieved Latches may be inserted after every h FA levels for an intermediate design Example: 3-stage pipeline Fig. 11.18 Pipelined 5 × 5 array multiplier using latched FA blocks. The small shaded boxes are latches. Latched FA with AND gate FA FA FA Latch FA p9 p8 p7 p6 p5 Apr. 2012 Computer Arithmetic, Multiplicationlide 72 S
  73. 73. 12 Variations in Multipliers Chapter Goals Learn additional methods for synthesizing fast multipliers as well as other types of multipliers (bit-serial, modular, etc.) Chapter Highlights Building a multiplier from smaller units Performing multiply-add as one operation Bit-serial and (semi)systolic multipliers Using a multiplier for squaring is wasteful Apr. 2012 Computer Arithmetic, Multiplicationlide 73 S
  74. 74. Variations in Multipliers: Topics Topics in This Chapter 12.1 Divide-and-Conquer Designs 12.2 Additive Multiply Modules 12.3 Bit-Serial Multipliers 12.4 Modular Multipliers 12.5 The Special Case of Squaring 12.6 Combined Multiply-Add Units Apr. 2012 Computer Arithmetic, Multiplicationlide 74 S
  75. 75. 12.1 Divide-and-Conquer Designs Building wide multiplier from narrower ones aH xH × aL xL Rearranged partial products in 2b-by-2b multiplication 2b bits b bits a L xL a L xH a H xL a H xH p a H xL a H xH a L xL a L xH 3b bits Fig. 12.1 Divide-and-conquer (recursive) strategy for synthesizing a 2b × 2b multiplier from b × b multipliers. Apr. 2012 Computer Arithmetic, Multiplicationlide 75 S
  76. 76. General Structure of a Recursive Multiplier 2b × 2b 3b × 3b 4b × 4b use (3; 2)-counters use (5; 2)-counters use (7; 2)-counters 4b × 4b 3b × 3b 2b × 2b b ×b Fig. 12.2 Using b × b multipliers to synthesize 2b × 2b, 3b × 3b, and 4b × 4b multipliers. Apr. 2012 Computer Arithmetic, Multiplicationlide 76 S
  77. 77. Using b × c, rather than b × b Building Blocks 4b × 4b 3b × 3b 2b × 2b b ×b 2b × 2c 2b × 4c gb × hc use b × c multipliers and (3; 2)-counters use b × c multipliers and (5?; 2)-counters use b × c multipliers and (?; 2)-counters Apr. 2012 Computer Arithmetic, Multiplicationlide 77 S
  78. 78. Wide Multiplier Built of Narrow Multipliers and Adders Fig. 12.3 Using 4 × 4 multipliers and 4-bit adders to synthesize an 8 × 8 multiplier. aH xH [4, 7] aL [4, 7] Multiply [12,15] xH [0, 3] aH [4, 7] Multiply [8,11] [8,11] x [4, 7] L aL [0, 3] [0, 3] Multiply [4, 7] [8,11] 8 xL [0, 3] Multiply [4, 7] [4, 7] [0, 3] Add [4, 7] 12 8 Add [8,11] 12 Add [4, 7] Add [8,11] 000 Add [12,15] p [12,15] p[8,11] p [4, 7] Apr. 2012 Computer Arithmetic, Multiplicationlide 78 S p [0, 3]
  79. 79. Karatsuba Multiplication 2b × 2b multiplication requires four b × b multiplications: (2baH + aL) × (2bxH + xL) = 22baHxH + 2b (aHxL + aLxH) + aLxL Karatsuba noted that one of the four multiplications can be removed at the expense of introducing a few additions: (2baH + aL) × (2bxH + xL) = b bits Mult 1 Mult 3 aH aL xH 22baHxH + 2b [(aH + aL) × (xH + xL) – aHxH – aLxL] + aLxL xL Mult 2 Benefit is quite significant for extremely wide operands (4/3)5 = 4.2 (4/3)10 = 17.8 (4/3)20 = 315.3 (4/3)50 = 1,765,781 Apr. 2012 Computer Arithmetic, Multiplicationlide 79 S
  80. 80. 12.2 Additive Multiply Modules y z y ax z x a 4-bit adder p = ax + y + z cin p (a) Block diagram (b) Dot notation v v Fig. 12.4 Additive multiply module with 2 × 4 multiplier (ax) plus 4-bit and 2-bit additive inputs (y and z). b × c AMM b-bit and c-bit multiplicative inputs b-bit and c-bit additive inputs (b + c)-bit output (2b – 1) × (2c – 1) + (2b – 1) + (2c – 1) = 2b+c – 1 Apr. 2012 Computer Arithmetic, Multiplicationlide 80 S
  81. 81. Multiplier Built of AMMs * Legend: 2 bits 4 bits * x[0, 1] a [0, 3] * * a [4, 7] [2, 5] x[0, 1] [2, 3] [4, 5] * a [4, 7] x [2, 3] Understanding an 8 × 8 multiplier built of 4 × 2 AMMs using dot notation [4,7] x [4, 5] a [0, 3] [8, 11] * x [2, 3] a [0, 3] [6, 9] [0, 1] [4,5] [6, 7] [6, 9] x [6, 7] a [0, 3] [8, 11] [8, 9] x [4, 5] a [4, 7] [8, 9] [10,13] a [4, 7] [6, 7] [10,11] x [6, 7] p[12,15] p [10,11] p [8, 9] p[0, 1] p[2, 3] p[4,5] p [6, 7] Fig. 12.5 An 8 × 8 multiplier built of 4 × 2 AMMs. Inputs marked with an asterisk carry 0s. Apr. 2012 Computer Arithmetic, Multiplicationlide 81 S
  82. 82. a [4, 7] Multiplier Built of AMMs: Alternate Design * a [0,3] * * x[0, 1] * x [2, 3] This design is more regular than that in Fig. 12.5 and is easily expandable to larger configurations; its latency, however, is greater * x [4, 5] * x [6, 7] Legend: 2 bits 4 bits p[12,15] p [10,11] p[0, 1] p[2, 3] p[4,5] p [6, 7] p [8, 9] Fig. 12.6 Alternate 8 × 8 multiplier design based on 4 × 2 AMMs. Inputs marked with an asterisk carry 0s. Apr. 2012 Computer Arithmetic, Multiplicationlide 82 S
  83. 83. 12.3 Bit-Serial Multipliers Bit-serial adder (LSB first) …x x x 2 1 0 FF FA …y y y 2 1 0 Bit-serial multiplier (Must follow the k-bit inputs with k 0s; alternatively, view the product as being only k bits wide) …a a a 2 1 0 …x x x 2 1 0 ? …s s s 2 1 0 …p p p 2 1 0 What goes inside the box to make a bit-serial multiplier? Can the circuit be designed to support a high clock rate? Apr. 2012 Computer Arithmetic, Multiplicationlide 83 S
  84. 84. Semisystolic Serial-Parallel Multiplier a3 M ultiplicand (parallel in) a1 a2 a0 x0 x1 x2 x3 M ultiplier (serial in) LSB-first Sum FA Carry FA FA FA Product (serial out) Fig. 12.7 Semi-systolic circuit for 4 × 4 multiplication in 8 clock cycles. This is called “semisystolic” because it has a large signal fan-out of k (k-way broadcasting) and a long wire spanning all k positions Apr. 2012 Computer Arithmetic, Multiplicationlide 84 S
  85. 85. Systolic Retiming as a Design Tool A semisystolic circuit can be converted to a systolic circuit via retiming, which involves advancing and retarding signals by means of delay removal and delay insertion in such a way that the relative timings of various parts are unaffected Cut e f CL g h –d +d e+d f+d CR CL g–d h–d CR –d +d Original delays Adjusted delays Fig. 12.8 Example of retiming by delaying the inputs to C L and advancing the outputs from CL by d units Apr. 2012 Computer Arithmetic, Multiplicationlide 85 S
  86. 86. Alternate Explanation of Systolic Retiming t t+a t d1 d2 t+a+d1+d2 t+d1 t+a+d1 d1 t d2 t+a+d1+d2 Transferring delay from the outputs of a subsystem to its inputs does not change the behavior of the overall system Apr. 2012 Computer Arithmetic, Multiplicationlide 86 S
  87. 87. a3 A First Attempt at Retiming M ultip licand (parallel in) a1 a2 x0 x1 x2 x3 a0 M ultiplier (serial in) LSB-first Sum FA FA FA FA Product (serial out) Carry Fig. 12.7 a3 M ultiplicand (parallel in) a1 a2 a0 x0 x1 x2 x3 M ultiplier (serial in) LSB-first FA Sum FA FA FA Carry Cut 3 Cut 2 Cut 1 Product (serial out) Fig. 12.9 A retimed version of our semi-systolic multiplier. Apr. 2012 Computer Arithmetic, Multiplicationlide 87 S
  88. 88. Deriving a Fully Systolic Multiplier M ultip licand (parallel in) a1 a2 a3 a0 x0 x1 x2 x3 M ultiplier (serial in) LSB-first Sum FA FA FA FA Carry Product (serial out) Fig. 12.7 M ultiplicand (parallel in) a1 a2 a3 a0 x0 x1 x2 x3 M ultiplier (serial in) LSB-first Sum FA Carry FA FA FA Product (serial out) Fig. 12.10 Systolic circuit for 4 × 4 multiplication in 15 cycles. Apr. 2012 Computer Arithmetic, Multiplicationlide 88 S
  89. 89. A Direct Design for a Bit-Serial Multiplier p (i–1) ai ai t out t in a(i - 1) xi xi x(i - 1) ai x(i - 1) cout ai xi 2 (5; 3)-counter 1 c in Already accumulated into three numbers ai xi 0 xi a(i - 1) 1 sin 0 a x s out Mux Already output Fig. 12.11 Building block for a latency-free bit-serial multiplier. (a) Structure of the bit-matrix ... ... ... t out t in LSB ... cout c in 0 ... sin s out pi ai xi Fig. 12.12 The cellular structure of the bit-serial multiplier based on the cell in Fig. 12.11. p ai xi p(i - 1) ai x(i - 1) xi a(i - 1) 2p (i ) Shift right )to obtain p(i (b) Reduction after each input bit Fig. 12.13 Bit-serial multiplier design in dot notation. Apr. 2012 Computer Arithmetic, Multiplicationlide 89 S
  90. 90. 12.4 Modular Multipliers FA FA ... FA FA FA Fig. 12.14 Modulo-(2b – 1) carry-save adder. 4 4 4 4 Mod-15 CSA Divide by 16 Fig. 12.15 Design of a 4 × 4 modulo-15 multiplier. Mod-15 CSA Mod-15 CPA Apr. 2012 Computer Arithmetic, Multiplicationlide 90 S 4
  91. 91. Other Examples of Modular Multiplication Address Fig. 12.16 One way to design of a 4 × 4 modulo-13 multiplier. n inp uts . . . CSA Tree Table ... Data Fig. 12.17 A method for modular multioperand addition. 3-inp ut M odulo-m Adder sum mod m Apr. 2012 Computer Arithmetic, Multiplicationlide 91 S
  92. 92. 12.5 The Special Case of Squaring Multiply x by x x3 x3 x2 x2 x1 x1 x0 x0 x1 x0 x0 x1 x4 x3 x3 x4 x3 x0 x2 x1 x1 x2 x0 x3 x0 x0 x4 x2 x3 x3 x2 x4 x4 x1 x3 x2 x2 x3 x1 x4 x4 x0 x3 x1 x2 x2 x1 x3 x0 x4 x2 x0 x1 x1 x0 x2 x4 x4 p9 x4 x4 p8 p7 p6 p5 p4 p3 p2 p1 p0 Simplify x4 x3 x4 p9 x4 x2 p8 p7 x4 x1 x3 x2 x3 x4 x0 x3 x1 x3 x0 x2 x1 x2 x2 x0 p6 p5 p4 p3 Fig. 12.18 x1 x0 x1 _ x0 x1x0 –x1x0 p2 0 x0 Design of a 5-bit squarer. Apr. 2012 Computer Arithmetic, Multiplicationlide 92 S
  93. 93. Divide-and-Conquer Squarers Building wide squarers from narrower ones aH xH xH × aH x H xH aL xL xL xL a L xH a H xL xH p Rearranged partial products in 2b-by-2b multiplication 2b bits b bits a L xL xL xH x H aH xH a H xL a L xH xL xL x L aL 3b bits Divide-and-conquer (recursive) strategy for synthesizing a 2b × 2b squarer from b × b squarers and multiplier. Apr. 2012 Computer Arithmetic, Multiplicationlide 93 S
  94. 94. 12.6 Combined Multiply-Add Units Additive input (a) CSA tree output Carry-save additive input (b) CSA tree output Additive input Dot matrix for the 4 × 4 multiplication (c) Carry-save additive input (d) Dot matrix for the 4 × 4 multiplication Multiply-add versus multiply-accumulate Multiply-accumulate units often have wider additive inputs Fig. 12.19 Dot-notation representations of various methods for performing a multiply-add operation in hardware. Apr. 2012 Computer Arithmetic, Multiplicationlide 94 S

×