Lec Jan15 2009


  1. CSL718: Pipelined Processors. Pipeline Timings. 15th Jan, 2009. Anshul Kumar, CSE IITD
  2. Pipelined Processors
     Parallel architectures: function-parallel, data-parallel
     Function-parallel: instruction level (ILP), thread level, process level
     ILP: pipelined processors, VLIWs, superscalar processors
     Intel's terminology: intra ILP, inter ILP
  3. Ideal Pipelining
     [Figure: an instruction of total time Tinst divided into S stages]
  4. Determining Clock Period
     [Figure: combinational logic (Comb) between two registers (Reg) on a common clock of period Δt]
     Δt ≥ P, so Δt = Pmax
     P = propagation delay, Pmax = max propagation delay
  5. Ideal Pipelining (Tinst divided into S stages)
     Pmax = Tinst / S
     Δt = Tinst / S
     Effective CPI = 1
     Effective time per instruction: Teff = CPI * Δt = 1 * Tinst / S
  6. Pipelining with hazards (Tinst divided into S stages)
     Frequency of interruptions = b
     Δt = Tinst / S
     CPI = 1 + (S - 1) * b
     Teff = (1 + (S - 1) * b) * Tinst / S
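
The hazard model on slide 6 can be tabulated with a few lines of Python. This is a minimal sketch, not part of the original slides; the function name `effective_time` and the chosen range of S are illustrative assumptions, while Tinst = 10 and the three values of b are taken from the plot on slide 7.

```python
def effective_time(t_inst: float, stages: int, b: float) -> float:
    """Teff = (1 + (S - 1) * b) * Tinst / S, the hazard model of slide 6."""
    cpi = 1 + (stages - 1) * b   # in this model each interruption adds S - 1 cycles
    delta_t = t_inst / stages    # ideal clock period, no clocking overhead yet
    return cpi * delta_t


if __name__ == "__main__":
    # Tinst = 10 and b in {0.2, 0.1, 0.05} as on slide 7.
    for b in (0.2, 0.1, 0.05):
        row = [round(effective_time(10, s, b), 2) for s in range(1, 11)]
        print(f"b = {b}: {row}")
```

With b = 0 the function reduces to the ideal case Teff = Tinst / S of slide 5.
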
  7. Teff vs. S (Tinst = 10)
     [Plot: Teff (0 to 12) against S (1 to 10), one curve each for b = .2, b = .1 and b = .05]
  8. A more realistic view
     [Figure: Reg, Comb, Reg on a common clock; the timing now includes register output delay, register setup time and clock skew in addition to P]
  9. Clocking Overhead
     • Fixed overhead c
       – Setup time
       – Output delay
     • Variable overhead (stretching factor) k
       – Clock skew
     Δt = Pmax + k * Pmax + c = (1 + k) * Tinst / S + c
     Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]
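
A small Python sketch of the two formulas on slide 9, assuming perfectly uniform stages (Pmax = Tinst / S). The function names are illustrative and not from the slides; the sample values Tinst = 10, c = 1, k = 0.1 are those of the plot on slide 10.

```python
def clock_period(t_inst: float, stages: int, k: float, c: float) -> float:
    """Δt = (1 + k) * Tinst / S + c, assuming uniform stages (Pmax = Tinst / S)."""
    return (1 + k) * t_inst / stages + c


def effective_time(t_inst: float, stages: int, b: float, k: float, c: float) -> float:
    """Teff = [1 + (S - 1) * b] * Δt, combining hazards and clocking overhead."""
    cpi = 1 + (stages - 1) * b
    return cpi * clock_period(t_inst, stages, k, c)


if __name__ == "__main__":
    # Values from slide 10: Tinst = 10, c = 1, k = 0.1.
    for stages in (1, 5, 10, 15):
        print(stages, round(effective_time(10, stages, 0.1, 0.1, 1), 3))
```

Because c is paid once per stage, Teff no longer keeps falling as S grows; beyond some S the added overhead and hazard cost outweigh the shorter clock period.
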
  10. Teff vs. S (Tinst = 10, c = 1, k = .1)
      [Plot: Teff (0 to 14) against S (1 to 15), one curve each for b = .2, b = .1 and b = .05]
  11. Pipelining with Clocking Overhead
      Teff = [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]
      Sopt = √[(1 - b) * (1 + k) * Tinst / (b * c)]
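
The slide states Sopt without the intermediate steps. One way to obtain it, treating S as a continuous variable (an assumption, since S is really an integer), is to expand Teff and set its derivative to zero:

```latex
\begin{align*}
T_{\mathrm{eff}}(S)
  &= \bigl[1 + (S-1)\,b\bigr]\left[\frac{(1+k)\,T_{\mathrm{inst}}}{S} + c\right] \\
  &= \frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{S}
     + b\,(1+k)\,T_{\mathrm{inst}} + (1-b)\,c + b\,c\,S \\[4pt]
\frac{dT_{\mathrm{eff}}}{dS}
  &= -\frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{S^{2}} + b\,c = 0 \\[4pt]
S_{\mathrm{opt}}
  &= \sqrt{\frac{(1-b)(1+k)\,T_{\mathrm{inst}}}{b\,c}}
\end{align*}
```

The second derivative, 2(1 - b)(1 + k)Tinst / S³, is positive for 0 < b < 1, so this stationary point is a minimum.
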
  12. Partitioning an instruction into cycles with non-uniform stage times
      One action per pipeline stage => large quantization overhead
      Multiple actions per stage? Multiple stages per action?
  13. Example (actions as listed on the slide, last action first)
      Put Away: 2 ns
      Execute: 7+7+8 ns
      Data - ALU: 3 ns
      Cache Data: 10 ns
      Cache Dir: 6 ns
      Addr - MAR: 3 ns
      Gen Addr: 9 ns
      Decode: 6+6 ns
      Data - IR: 3 ns
      Cache Data: 10 ns
      Cache Dir: 6 ns
      PC - MAR: 4 ns
  14. Optimal Pipelining
      Tinst = 4+6+10+3+12+9+3+6+10+3+22+2 = 90 ns
      b = 0.2, c = 4 ns, k = 5%
      Sopt = √[(1 - b) * (1 + k) * Tinst / (b * c)] = 9.7 ⇒ 9
      Pmax = 10 ns
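
The arithmetic on slide 14 can be checked with a short script. This is a sketch under the idealized assumption of perfectly uniform stages; the names `optimal_stages` and `effective_time` are illustrative, and the delays and parameters are the slide's.

```python
import math

# Action delays in ns, in the order of the Tinst sum on slide 14.
ACTION_DELAYS_NS = [4, 6, 10, 3, 12, 9, 3, 6, 10, 3, 22, 2]
B, K, C_NS = 0.2, 0.05, 4.0  # b, k and c from slide 14


def optimal_stages(t_inst: float, b: float, k: float, c: float) -> float:
    """Continuous optimum Sopt = sqrt((1 - b) * (1 + k) * Tinst / (b * c))."""
    return math.sqrt((1 - b) * (1 + k) * t_inst / (b * c))


def effective_time(t_inst: float, stages: int, b: float, k: float, c: float) -> float:
    """Teff for uniform stages: [1 + (S - 1) * b] * [(1 + k) * Tinst / S + c]."""
    return (1 + (stages - 1) * b) * ((1 + k) * t_inst / stages + c)


t_inst = sum(ACTION_DELAYS_NS)
s_opt = optimal_stages(t_inst, B, K, C_NS)

if __name__ == "__main__":
    print(f"Tinst = {t_inst} ns, Sopt = {s_opt:.2f}")
    for stages in (math.floor(s_opt), math.ceil(s_opt)):
        print(stages, round(effective_time(t_inst, stages, B, K, C_NS), 2))
```

Under the uniform-stage assumption the two integers on either side of Sopt give nearly identical Teff, so the choice between them is decided by how well the real actions can be packed into stages.
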
  15. Example (same actions as slide 13)
      Pmax = 10 ns, S = 10, Δt = 14.5 ns, S * Δt = 145 ns
  16. Example (same actions as slide 13)
      Pmax = 13 ns, S = 9, Δt = 17.65 ns, S * Δt = 159 ns
  17. Example (same actions as slide 13)
      Pmax = 20 ns, S = 5, Δt = 25 ns, S * Δt = 125 ns
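
The slides do not say how the actions were grouped into stages. The sketch below uses a simple greedy packing, which is an assumption and not necessarily the lecturer's method: it fills each stage in execution order until the next action would exceed a chosen Pmax. It also assumes that Decode (6+6) and Execute (7+7+8) may be split at the "+" boundaries. With those assumptions it gives the same stage counts as slides 15 to 17.

```python
# Atomic delays in ns, in execution order (the order of the Tinst sum on slide 14).
# Assumption: Decode (6+6) and Execute (7+7+8) can be split at the "+" boundaries.
DELAYS_NS = [4, 6, 10, 3, 6, 6, 9, 3, 6, 10, 3, 7, 7, 8, 2]


def greedy_partition(delays, p_limit):
    """Pack consecutive actions into stages whose total delay is at most p_limit."""
    stages, current = [], []
    for d in delays:
        if d > p_limit:
            raise ValueError(f"action of {d} ns does not fit in a {p_limit} ns stage")
        if current and sum(current) + d > p_limit:
            stages.append(current)  # close the stage before it overflows
            current = []
        current.append(d)
    if current:
        stages.append(current)
    return stages


def clock_period(p_max, k=0.05, c=4.0):
    """Δt = (1 + k) * Pmax + c with k and c from slide 14."""
    return (1 + k) * p_max + c


if __name__ == "__main__":
    for p_limit in (10, 13, 20):
        stages = greedy_partition(DELAYS_NS, p_limit)
        p_max = max(sum(stage) for stage in stages)
        print(p_limit, len(stages), p_max, round(clock_period(p_max), 2), stages)
```
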
  18. Comparison (times in ns)
      S    Pmax   Δt      S * Δt   Teff
      9    13     17.65   159      45.89
      10   10     14.50   145      40.60
      5    20     25.00   125      45.00
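
The comparison on slide 18 follows from the earlier formulas with b = 0.2, k = 5% and c = 4 ns. A minimal sketch that recomputes each row from its (S, Pmax) pair; the function name `comparison_row` is illustrative:

```python
B, K, C_NS = 0.2, 0.05, 4.0  # b, k and c from slide 14


def comparison_row(stages: int, p_max: float):
    """Return (Δt, S * Δt, Teff) for one design point of slide 18."""
    delta_t = (1 + K) * p_max + C_NS          # clock period with overhead
    teff = (1 + (stages - 1) * B) * delta_t   # effective time per instruction
    return delta_t, stages * delta_t, teff


if __name__ == "__main__":
    print(f"{'S':>3} {'Pmax':>5} {'Δt':>7} {'S*Δt':>8} {'Teff':>7}")
    for stages, p_max in ((9, 13), (10, 10), (5, 20)):
        delta_t, total, teff = comparison_row(stages, p_max)
        print(f"{stages:>3} {p_max:>5} {delta_t:>7.2f} {total:>8.2f} {teff:>7.2f}")
```

The slide rounds S * Δt for S = 9 from 158.85 to 159.
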
  19. Cycle Quantization
      Delays are not integral multiples of the clock period
      Total overhead = clocking overhead + quantization overhead
      Δt ≥ Tinst / S + c (ignoring k)
      ∴ S * Δt ≥ Tinst + S * c
      Quantization overhead = S * (Δt - c) - Tinst
      This reduces as the clock period becomes small
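
As a worked example of the quantization formula, the sketch below applies it to the three design points of slides 15 to 17. Following slide 19 it ignores k, so Δt = Pmax + c here, which differs from the Δt values quoted on those slides; the function name is illustrative.

```python
T_INST_NS = 90.0  # Tinst from slide 14
C_NS = 4.0        # fixed clocking overhead c from slide 14


def quantization_overhead(stages: int, delta_t: float, c: float, t_inst: float) -> float:
    """Quantization overhead = S * (Δt - c) - Tinst (slide 19, k ignored)."""
    return stages * (delta_t - c) - t_inst


if __name__ == "__main__":
    for stages, p_max in ((9, 13), (10, 10), (5, 20)):
        delta_t = p_max + C_NS  # k ignored, as on slide 19
        overhead = quantization_overhead(stages, delta_t, C_NS, T_INST_NS)
        print(f"S = {stages}, Pmax = {p_max} ns: quantization overhead = {overhead} ns")
```
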
  20. Other Timing Approaches
      • Self Timed Circuits
        – No centralized free running clock
        – An operation begins as soon as its inputs are available, that is, all its predecessors have completed
        – Higher speed, lower power consumption
      • Wave Pipelining
        – Omit inter-stage registers
        – Reduced clocking overhead
  21. Conventional vs Wave Pipelining
      Conventional Pipeline
      • Registers separate adjoining stages
      • Clock period > max prop delay
      • Inter-stage data stored in registers
      Wave Pipeline
      • No registers between adjoining stages
      • Clock period less than max prop delay
      • Waves of data propagate through combinational network (effectively, data is stored in the combinational circuit delay!)
  22. No pipelining
      [Figure: circuit with Reg, X, X', Reg, Y and Clock, plus a timing diagram of X, X', Y]
  23. Conventional pipelining
      [Figure: circuit with Reg, X, X', Y, Y', Z, Z', Reg, W and Clock, plus a timing diagram of X, X', Y, Y', Z, Z', W]
  24. Wave pipelining
      [Figure: circuit with Reg, X, Z', Reg, W and Clock, plus a timing diagram of X, Z', W]
  25. Timing
      [Figure: Reg, Comb ckt, Reg on a common clock, with signals X and Y]
      T ≥ p + s
      T = clock period, p = propagation delay, s = set-up time
  26. Timing with clock skew
      [Figure: same circuit, with each clock edge uncertain by δ]
      Clock skew = ±δ
      T ≥ p + s + 2δ
  27. Variation in propagation delay
      • Different delays in different paths
      • Delay variation due to process / temperature / power variations
      • Data-dependent delay variations
  28. Timing for wave pipelining
      [Figure: Reg, Comb ckt, Reg on a common clock with skew ±δ; the propagation delay lies between pmin and pmax, with Δp = pmax - pmin]
      T ≥ Δp + s + 4δ
  29. Timing for wave pipelining (expanded view)
      pmin ≥ (n-1) T + 2δ
      nT ≥ pmax + s + 2δ
      ⇒ T ≥ Δp + s + 4δ
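
The two inequalities on slide 29 bound T from both sides for a given n, so only a window of clock periods works. The sketch below computes that window. The function name and the sample delays (pmax = 20, pmin = 18, s = 1, δ = 0.5, in arbitrary time units) are hypothetical and not from the slides; n is read as the number of clock periods a wave spends in the combinational circuit.

```python
def wave_period_window(n: int, p_max: float, p_min: float, s: float, delta: float):
    """Return (T_low, T_high) satisfying both slide-29 inequalities, or None.

    nT >= pmax + s + 2δ          gives  T >= (pmax + s + 2δ) / n
    pmin >= (n - 1) T + 2δ       gives  T <= (pmin - 2δ) / (n - 1)   for n > 1
    """
    t_low = (p_max + s + 2 * delta) / n
    if n == 1:
        # No upper bound on T, only the condition pmin >= 2δ.
        return (t_low, float("inf")) if p_min >= 2 * delta else None
    t_high = (p_min - 2 * delta) / (n - 1)
    return (t_low, t_high) if t_low <= t_high else None


if __name__ == "__main__":
    # Hypothetical delays, not taken from the slides.
    for n in range(1, 7):
        print(n, wave_period_window(n, p_max=20, p_min=18, s=1, delta=0.5))
```

For these sample numbers the windows shrink as n grows and vanish at n = 5, which is one way to see why only a narrow range of clock frequencies is usable.
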
  30. Comparison
      Conventional Pipeline
        T ≥ pmax/n + s + 2δ (plus cycle quantization overhead)
        nT ≥ pmax + ns + 2nδ
      Wave Pipeline
        T ≥ Δp + s + 4δ
        nT ≥ pmax + s + 2δ
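
To make the comparison concrete, the sketch below evaluates both lower bounds on T for one set of delays. The numbers (pmax = 20, pmin = 18, s = 1, δ = 0.5, n = 4, arbitrary time units) are hypothetical, the function names are illustrative, and cycle quantization overhead is left out of the conventional bound.

```python
def conventional_period(p_max: float, n: int, s: float, delta: float) -> float:
    """Lower bound T >= pmax / n + s + 2δ for an n-stage conventional pipeline."""
    return p_max / n + s + 2 * delta


def wave_period_bound(p_max: float, p_min: float, s: float, delta: float) -> float:
    """Lower bound T >= Δp + s + 4δ for wave pipelining, with Δp = pmax - pmin."""
    return (p_max - p_min) + s + 4 * delta


if __name__ == "__main__":
    # Hypothetical delays, not taken from the slides.
    p_max, p_min, s, delta, n = 20, 18, 1, 0.5, 4
    t_conv = conventional_period(p_max, n, s, delta)
    print("conventional: T >=", t_conv, " nT >=", n * t_conv)
    print("wave:         T >=", wave_period_bound(p_max, p_min, s, delta))
```

The wave bound depends on the delay spread Δp rather than on pmax itself, so it is only attractive when the path delays are well balanced. It is a necessary condition: a usable T must also satisfy both slide-29 inequalities for some integer n.
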
  31. Problems with wave pipelining
      • Need to balance delays
      • Narrow range of clock frequencies
      • Control difficult
      • Not very suitable for non-linear pipelines
  32. References
      1. M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Narosa Publishing House / Jones and Bartlett, 1996.
      2. Wayne P. Burleson, Maciej Ciesielski, Fabian Klass, and Wentai Liu, "Wave-Pipelining: A Tutorial and Research Survey", IEEE Trans. on VLSI Systems, vol. 6, no. 3, September 1998, pp. 464-474.
