Understanding the Tomasulo Algorithm
Yichao Cheng
Jul 23, 2013
How the Tomasulo algorithm works, and why it works.


Understanding Tomasulo Algorithm

  1. Understanding the Tomasulo Algorithm. Yichao Cheng, Jul 23, 2013
  2. Background • IBM System/360 Model 91 • The FPU’s add/mul/div take 2/3/13 cycles • Can performance be improved by utilizing multiple execution units? [Diagram: Adder, Mul, Div]
  3. Major Contributions. Proposed three innovative mechanisms: • Common data busing (CDB) • Register tagging scheme • Reservation stations, which permit out-of-order execution of independent instructions while preserving the essential precedences in the instruction stream
  4. Doubt • When people talk about the Tomasulo algorithm, they talk about register renaming • However, this term can’t be found in the original paper. How could anyone invent a thing without noticing it?
  5. Architecture Overview [Diagram: blocks labeled Storage, Instruction Unit, and FPU, with FLOS (floating-point operation stack), Decoder, FLB (floating-point buffers), SDB (store data buffers), FLR (floating-point registers), Adder, Mul, and Div]
  6. From an FPU’s Perspective. All instructions are ‘register-to-register’: • Register-to-register arithmetic • Storage-to-register arithmetic • Load • Store. The Instruction Unit (outside the FPU) is in charge of address generation and memory access.
  7. ‘Sink’ and ‘Source’ • Equivalent to destination and source • For example, in AD R1, R2, R1 is both a sink and a source [Diagram: source, sink, value]
  8. 1. Reg-to-reg arithmetic: AD R1, R2 [Diagram: datapath over FLOS, Decoder, FLR, FLB, SDB, Storage, Adder, Mul, Div]
  9. 2. Storage-to-reg arithmetic: AD R1, FLB [Diagram: datapath over FLOS, Decoder, Storage, FLB, FLR, SDB, Adder, Mul, Div]
  10. 3. Load: LD R1, FLB1 [Diagram: datapath over FLOS, Decoder, Storage, FLB, FLR, SDB, Adder, Mul, Div]
  11. 4. Store: STD R1, SDB1 [Diagram: datapath over FLOS, Decoder, Storage, FLB, FLR, SDB, Adder, Mul, Div]
  12. Timing Sequence: 1. Reg-to-reg arithmetic [Timing diagram with IU and EU rows; stage labels: Decode, Decode, 2 operands to ALU, Execute, Write back to FLR]
  13. 2. Storage-to-reg arithmetic [Timing diagram with IU and EU rows; stage labels: Decode, Addr Gen, Mem Read, Decode, FLR to ALU, FLB to ALU, Execute, Write back to FLR]
  14. 3. Load [Timing diagram with IU and EU rows; stage labels: Decode, Addr Gen, Mem Read, Decode, FLR to ALU, FLB to ALU, Execute, Write back to FLR]
  15. 4. Store [Timing diagram with IU and EU rows; stage labels: Decode, Addr Gen, Mem Write, Decode, FLR to ALU, Execute, Write to SDB]
  16. A Day in the Life of ‘LD R1, addr’ [Diagram: Instruction Unit, Storage, FLOS, Decoder, FLB, SDB, FLR, Adder, Mul, Div]
  17. A Day in the Life of ‘LD R1, addr’ [Diagram: the Instruction Unit performs decode and address generation; labels: addr, FLB1]
  18. A Day in the Life of ‘LD R1, addr’ [Diagram: Instruction Unit; labels: addr, FLB1, LD R1, FLB1]
  19. A Day in the Life of ‘LD R1, addr’ [Diagram: labels: addr, FLB1, LD R1, FLB1]
  20. A Day in the Life of ‘LD R1, addr’ [Diagram: labels: addr, FLB1, LD R1, FLB1, OP]
  21. A Day in the Life of ‘LD R1, addr’ [Diagram: labels: addr, FLB1, LD R1, FLB1, OP, Decoder, FLR]
  22. A Day in the Life of ‘LD R1, addr’ [Diagram: labels: addr, FLB1, R1, LD R1, FLB1, FLR]
  23. An Example of Dependence: LD F0, FLB1; MD F0, FLB2. What if we send them to different execution units at the same time to exploit parallelism? [Diagram: Adder, Mul, Div]
  24. An Example of Dependence: LD F0, FLB1; MD F0, FLB2. The result (F0) cannot reflect the effect of LD, because MD uses the old value of F0. [Diagram: Adder, Mul, Div]
  25. An Example of Dependence: LD F0, FLB1; MD F0, FLB2. This is called a true dependence, a.k.a. RAW (read after write). [Diagram: Adder, Mul, Div]
  26. A Simple Solution • ‘Busy’ bit scheme [Diagram: registers R0 to R3 with a busy bit (B); instructions shown: LD R1 B, MD R1 A; callouts: “I’m already the sink of some instruction” and “I need your content”]
  27. Performance Degrades... • When the code keeps using one register • E.g., MD F0, E; AD F2, F0; AD F4, A; AD F2, F4: overlap fails because the first AD depends on MD, though the others don’t. The second AD is qualified to issue.
  28. Cause of the Problem • If one instruction gets stuck (due to a dependence), the following instructions can’t be decoded (even if they are qualified to issue). Solution: • Decouple dependence maintenance from decoding • Look ahead at more instructions for concurrency
  29. Dispatch and Issue Decoupling: MD F0, E; AD F2, F0; AD F4, A; AD F2, F4 [Diagram: Decode, Adder; callouts: “Is that reg busy?” and “Can issue?”]
  30. Dispatch and Issue Decoupling: MD F0, E; AD F2, F0; AD F4, A; AD F2, F4 [Diagram: Decode, Adder holding MD F0, E; callouts: “Dispatch anyway”, “Are my operands ready?”, and “Can issue?”]
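The difference between the busy-bit scheme and decoupled dispatch on the four-instruction example above can be sketched in a few lines of Python. This is a simplified model, not the Model 91's actual timing: it assumes one instruction is decoded per cycle, that memory operands (E, A) are always ready, that there are enough stations and execution units, and it takes the add/mul latencies of 2 and 3 cycles from the Background slide. The function names are hypothetical.

```python
# Latencies in cycles, taken from the Background slide (add = 2, mul = 3).
LATENCY = {"MD": 3, "AD": 2}
REGS = {"F0", "F2", "F4", "F6"}  # anything else is treated as a memory operand

PROGRAM = [
    ("MD", "F0", "E"),
    ("AD", "F2", "F0"),
    ("AD", "F4", "A"),
    ("AD", "F2", "F4"),
]

def busy_bit_issue_cycles(program):
    """In-order issue: a stalled instruction blocks everything behind it."""
    busy_until = {}  # register -> first cycle at which it is free again
    cycle = 0
    issue = []
    for op, sink, source in program:
        regs = [r for r in (sink, source) if r in REGS]
        # Wait until every register this instruction touches is no longer busy.
        ready = max([busy_until.get(r, 0) for r in regs] + [cycle])
        issue.append(ready)
        busy_until[sink] = ready + LATENCY[op]
        cycle = ready + 1  # the next instruction cannot decode any earlier
    return issue

def decoupled_start_cycles(program):
    """Dispatch one instruction per cycle regardless; each starts when its operands are ready."""
    value_ready = {}  # register -> cycle at which its latest producer finishes
    starts = []
    for dispatch_cycle, (op, sink, source) in enumerate(program):
        regs = [r for r in (sink, source) if r in REGS]
        start = max([value_ready.get(r, 0) for r in regs] + [dispatch_cycle])
        starts.append(start)
        value_ready[sink] = start + LATENCY[op]
    return starts

print(busy_bit_issue_cycles(PROGRAM))   # [0, 3, 4, 6]
print(decoupled_start_cycles(PROGRAM))  # [0, 3, 2, 5]
```

Under these assumptions the independent AD F4, A starts at cycle 2 instead of waiting behind the stalled AD F2, F0 until cycle 4.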
  31. An Example of True Dependence: LD F0, FLB1 (F0 as sink); AD F2, F0 (F0 as source). Assume the CDB has not been introduced yet. [Diagram: Adder, Mul, Div, FLB with FLB1, FLR with F0]
  32. An Example of True Dependence: LD F0, FLB1 dispatches to A1. F0 is reserved for some instruction. [Diagram: F0 marked busy (B) with tag A1]
  33. An Example of True Dependence: LD F0, FLB1 dispatches to A1. F0’s content is calculated by A1. [Diagram: F0 marked busy (B) with tag A1]
  34. An Example of True Dependence: AD F2, F0 is next. “I need the value of F0, but he seems to be busy.” [Diagram: F0 marked busy (B) with tag A1]
  35. An Example of True Dependence: AD F2, F0 dispatches to A2. “Since A1 is the producer, just let him tell me.” [Diagram: station holding AD F2, F0; F0 marked busy (B) with tag A1]
  36. An Example of True Dependence: AD F2, F0 dispatches to A2. “Since A1 is the producer, just ask him for it.” [Diagram: station now holding AD F2, A1; F0 marked busy (B) with tag A1]
  37. An Example of True Dependence: LD F0, FLB1 executing. “Operands are ready. Execute!” [Diagram: station holding AD F2, A1; F0 marked busy (B) with tag A1]
  38. An Example of True Dependence: LD F0, FLB1 broadcasts its result to the air. “I’m A1. Who needs my result? Over..” [Diagram: station holding AD F2, A1; F0 marked busy (B) with tag A1]
  39. An Example of True Dependence: LD F0, FLB1 broadcasts its result to the air. “I depend on A1!” “Me too!” [Diagram: station holding AD F2, A1; F0 marked busy (B) with tag A1]
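The true-dependence walkthrough above can be sketched as a small Python model. It follows the slides' example, in which LD F0, FLB1 is tracked under the tag A1 and AD F2, F0 under A2. The class and function names, and the concrete values (5.0 loaded, F2 initially 10.0), are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Slot:
    """A register or a reservation-station operand: either a value or a pending tag."""
    value: Optional[float] = None
    tag: Optional[str] = None  # name of the producer being waited on; None means ready

def read_operand(reg: Slot) -> Slot:
    """Copy the value if the register is ready, otherwise copy the producer's tag."""
    if reg.tag is not None:
        return Slot(tag=reg.tag)
    return Slot(value=reg.value)

def broadcast(tag: str, value: float, consumers) -> None:
    """The producer announces <tag, value>; every consumer holding that tag ingates it."""
    for c in consumers:
        if c.tag == tag:
            c.value = value
            c.tag = None

F0 = Slot(value=1.0)
F2 = Slot(value=10.0)

# LD F0, FLB1 dispatches to A1: F0 becomes busy and points at its producer.
F0.tag = "A1"

# AD F2, F0 dispatches to A2: it copies F2's value and F0's tag instead of stalling.
a2_sink = read_operand(F2)
a2_source = read_operand(F0)
F2.tag = "A2"

# A1 finishes and broadcasts; both the waiting station and F0 pick the value up.
broadcast("A1", 5.0, [a2_sink, a2_source, F0, F2])

# A2 now has both operands, executes, and broadcasts its own result.
broadcast("A2", a2_sink.value + a2_source.value, [F0, F2])

print(F0.value, F2.value)  # 5.0 15.0
```

The station and the register are both plain consumers here, which mirrors the two replies on the last slide of the example.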
  40. The Role of CDB • The Common Data Bus is in charge of value forwarding • In the reg-to-reg model, a value is passed through a register (write and read) [Diagram: F0, written as sink (producer)]
  41. The Role of CDB • The Common Data Bus is in charge of value forwarding • In the reg-to-reg model, a value is passed through a register (write and read) [Diagram: F0, read as source (consumer)]
  42. The Role of CDB • Loads and stores don’t need to go through the ALU • Dependence management is decoupled from execution, as expected [Diagram: reservation stations for Add, reservation stations for Mul, FLB, SDB, FLR]
  43. The Role of CDB. Producers: all units which can alter a register. Consumers: all units which may take a register as an operand. [Diagram: CDB connecting the reservation stations for Add and Mul, FLB, SDB, and FLR; producer counts labeled P:3, P:2, P:6]
  44. The Role of CDB. Producers: all units which can alter a register. Consumers: all units which may take a register as an operand. [Diagram: CDB connecting the reservation stations for Add and Mul, FLB, SDB, and FLR; consumer counts labeled C:4, C:3, C:2*2, C:3*2]
  45. The Implementation of CDB • A consumer recognizes its producer by tagging • Producers put <tag, value> on the bus in turns (making a request first) • If the tag matches, the consumer ingates the value [Diagram: six producers (P) and six consumers (C); three consumers hold tags X, Y, Y; label: Request (2 cycles)]
  46. The Implementation of CDB • A consumer recognizes its producer by tagging • Producers put <tag, value> on the bus in turns (making a request first) • If the tag matches, the consumer ingates the value [Diagram: a producer puts <Y, value> on the bus; consumers hold tags X, Y, Y]
  47. The Implementation of CDB • A consumer recognizes its producer by tagging • Producers put <tag, value> on the bus in turns (making a request first) • If the tag matches, the consumer ingates the value [Diagram: another producer makes a request; consumers hold tags X, Y, Y]
  48. The Implementation of CDB • A consumer recognizes its producer by tagging • Producers put <tag, value> on the bus in turns (making a request first) • If the tag matches, the consumer ingates the value [Diagram: a producer puts <X, value> on the bus; consumers hold tags X, Y, Y]
  49. The Principle behind the Scene • A tag is a pointer to the producer of the value required by the current instruction • The pointers constitute the dependency information, which is hidden by the reg-reg model (discussed later) • With this information, the order of execution can be resolved • The CDB enables ‘producer-consumer’ style data flow
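The bus behavior described in the CDB slides (producers take turns, and every consumer whose tag matches ingates the value) can be sketched as follows. This is only an illustration: the first-come-first-served ordering is an assumption, the two-cycle request lead time is not modeled, and `run_cdb` and the consumer names C1 to C3 are hypothetical.

```python
from collections import deque

def run_cdb(requests, waiting):
    """Serve one <tag, value> broadcast per cycle.

    requests: list of (tag, value) pairs in the order the producers asked for the bus.
    waiting:  dict mapping consumer name -> tag it is waiting for.
    Returns the values captured by each consumer and a per-cycle log.
    """
    queue = deque(requests)
    captured = {}
    log = []
    cycle = 0
    while queue:
        tag, value = queue.popleft()  # only one producer owns the bus in a cycle
        receivers = sorted(c for c, t in waiting.items()
                           if t == tag and c not in captured)
        for c in receivers:
            captured[c] = value       # tag match: the consumer ingates the value
        log.append((cycle, tag, receivers))
        cycle += 1
    return captured, log

# Three consumers hold the tags X, Y, Y, as in the slide diagrams.
waiting = {"C1": "X", "C2": "Y", "C3": "Y"}
captured, log = run_cdb([("Y", 2.5), ("X", 7.0)], waiting)

print(captured)  # {'C2': 2.5, 'C3': 2.5, 'C1': 7.0}
print(log)       # [(0, 'Y', ['C2', 'C3']), (1, 'X', ['C1'])]
```

A single broadcast can satisfy several consumers at once, while two producers that finish together are served on successive cycles.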
  50. An Example for False Dependence: LD F0, FLB1; AD F2, F0; LD F0, FLB2; AD F3, F0. The sequence contains both WAW and WAR dependences on F0. [Diagram: Adder, Mul, Div, FLB with FLB1 and FLB2, FLR with F0]
  51. An Example for False Dependence: LD F0, FLB1 dispatches. [Diagram: F0 marked busy (B) with tag FLB1]
  52. An Example for False Dependence: AD F2, F0 dispatches to A1. [Diagram: station holding AD F2, F0; F0 marked busy (B) with tag FLB1]
  53. An Example for False Dependence [Diagram: station holding AD F2, F0; F0 marked busy (B) with tag FLB1]
  54. An Example for False Dependence: LD F0, FLB2 dispatches. [Diagram: station holding AD F2, F0; F0 marked busy (B), tag now FLB2]
  55. An Example for False Dependence: AD F3, F0 dispatches to A2. [Diagram: stations holding AD F2, F0 and AD F3, F0; F0 marked busy (B) with tag FLB2]
  56. An Example for False Dependence: keep tracing the source of the value instead of the register holding it. [Diagram: stations holding AD F2, F0 and AD F3, F0; F0 marked busy (B) with tag FLB2]
  57. An Example for False Dependence: there’s no need to rename a register (naming is just a way of referring to values). [Diagram: stations holding AD F2, F0 and AD F3, F0; F0 marked busy (B) with tag FLB2]
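The false-dependence example above can be replayed with a small tag-tracking model. It is a sketch under stated assumptions: the helper names, the initial register values, and the loaded values 1.0 and 2.0 are made up, and the two loads are deliberately completed in reverse program order to stress the WAW and WAR cases.

```python
reg_val = {"F0": 0.0, "F2": 10.0, "F3": 20.0}
reg_tag = {}   # register -> tag of the producer that will write it (register is busy)
stations = {}  # station name -> {"a": operand, "b": operand}

def read(reg):
    """Return ('tag', t) if the register is busy, else ('val', v)."""
    if reg in reg_tag:
        return ("tag", reg_tag[reg])
    return ("val", reg_val[reg])

def dispatch_load(sink, buffer):
    reg_tag[sink] = buffer  # a later writer simply overwrites the tag

def dispatch_add(station, sink, source):
    # The sink is also a source, so both operands are captured at dispatch time.
    stations[station] = {"a": read(sink), "b": read(source)}
    reg_tag[sink] = station

def broadcast(tag, value):
    for st in stations.values():
        for key in ("a", "b"):
            if st[key] == ("tag", tag):
                st[key] = ("val", value)
    for reg in [r for r, t in reg_tag.items() if t == tag]:
        reg_val[reg] = value
        del reg_tag[reg]

def execute_add(station):
    st = stations.pop(station)
    assert st["a"][0] == "val" and st["b"][0] == "val", "operands not ready"
    broadcast(station, st["a"][1] + st["b"][1])

dispatch_load("F0", "FLB1")
dispatch_add("A1", "F2", "F0")   # A1 waits on FLB1
dispatch_load("F0", "FLB2")      # F0's tag becomes FLB2
dispatch_add("A2", "F3", "F0")   # A2 waits on FLB2

broadcast("FLB2", 2.0)           # the second load happens to finish first
broadcast("FLB1", 1.0)           # F0 ignores this: its tag was FLB2
execute_add("A2")
execute_add("A1")

print(reg_val)  # {'F0': 2.0, 'F2': 11.0, 'F3': 22.0}
```

Each AD receives the value from the load that precedes it in program order, and F0 ends up with the last writer's value, even though no register name was changed.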
  58. Timing Sequence with Busy Bit [Timing diagram for LD F0, FLB1; AD F2, F0; LD F0, FLB2; AD F3, F0; stage labels: D, AG, FLB, T, EX, WB]
  59. Timing Sequence with Reservation Station [Timing diagram for LD F0, FLB1; AD F2, F0; LD F0, FLB2; AD F3, F0; stage labels: D, AG, FLB, T, EX, WB]
  60. The Side Effect of Register Machine • What are the differences between a circuit and a register machine?
  61. The Side Effect of Register Machine • What are the differences between a circuit and a register machine? Register machine: • General purpose • Control-driven • Implicit dependence via registers. Circuit: • Special purpose • Data-driven • Exposed dependence. ...But registers are rare
  62. Conclusion • The Tomasulo algorithm has nothing to do with register renaming • It resolves WAR and WAW by eliminating the side effect of using registers to pass values • With the Tomasulo algorithm, the execution of a program is driven by data flow, thus exploiting maximum concurrency
