During a data-dependent stall (e.g. an L1 cache miss), enter run-ahead mode, continuing execution in program order
Helps to warm the caches until the dependency is resolved and normal execution can resume
Throws away many instructions that could have been executed between the stall and the resolution of the data dependency
Evolution to SST
During a data-dependent stall (e.g. an L1 cache miss), enter execute-ahead mode, doing speculative execution
Speculative Execution Depends on =>
Checkpointing
Transactional Memory
Exploits => hardware threading
Ahead thread executing instructions speculatively
Behind thread executing instructions with resolved data dependencies
Advantages and disadvantages =>
[+] Single-threaded code is executed simultaneously from 2 different locations using hardware threads
[+] Achieves both MLP and ILP
[-] Little benefit when program locality already keeps cache misses to a minimum, or when the prefetcher can supply the data with very low latency
Hazards
Common to OO and SST
Data
RAW, WAR, WAW
Control
Branching, Exceptions
Memory Consistency Protocols
Scheme must not break the effect of Total Store Ordering (the von Neumann/Turing ordering of the code). In other words, the results of the machine's dynamic scheduling of the code must not differ from those of the static program schedule.
OO & SST Differences
Traditional OO
Stalls instructions with any data dependency; that is, there is no progression to the retirement unit.
Uses register renaming to continue OO 'execute'
SST
RAW => Defers instructions and any resolved operands in a deferred queue (DQ)
WAR, WAW => Speculatively retired
Data hazards
RAW a=5; a=10; b=a+1;
b should be 11 not 6
WAR a=5; b=a+1; a=6;
b should be 6 not 7
WAW b=5; b=50;
b should be 50 not 5
Executing instructions out of order is problematic, as potentially N versions of each operand are held in a finite set of registers
When does the register have the correct value for the right instruction?
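The three data hazards can be illustrated with a minimal Python sketch. Registers are modelled as a dictionary and each instruction as a (destination, function) pair; the `run` helper and the instruction names are illustrative assumptions, not something taken from the slides. Running the same instructions in program order and in a reordered schedule shows how each hazard changes the result.

```python
def run(schedule):
    """Execute (dest, compute) instructions in the given order on a fresh register file."""
    regs = {}
    for dest, compute in schedule:
        regs[dest] = compute(regs)
    return regs

# Hypothetical instructions used only for this illustration.
set_a_5 = ("a", lambda r: 5)
set_a_6 = ("a", lambda r: 6)
set_a_10 = ("a", lambda r: 10)
b_from_a = ("b", lambda r: r["a"] + 1)
set_b_5 = ("b", lambda r: 5)
set_b_50 = ("b", lambda r: 50)

# RAW: the read of a must see the latest earlier write (a = 10).
raw_ok = run([set_a_5, set_a_10, b_from_a])["b"]
raw_bad = run([set_a_5, b_from_a, set_a_10])["b"]   # read hoisted above the write

# WAR: the read of a must happen before the later write (a = 6).
war_ok = run([set_a_5, b_from_a, set_a_6])["b"]
war_bad = run([set_a_5, set_a_6, b_from_a])["b"]    # write hoisted above the read

# WAW: the last write in program order must be the one that survives.
waw_ok = run([set_b_5, set_b_50])["b"]
waw_bad = run([set_b_50, set_b_5])["b"]             # writes swapped

print(raw_ok, raw_bad, war_ok, war_bad, waw_ok, waw_bad)
```

In each pair the first value is the program-order result and the second is what a naive reordering would produce.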
SST handling of Data Hazards
Ahead thread
Avoids RAW by using NT bit and deferring the instruction
Behind thread
Avoids WAR by saving resolved operands alongside relevant instruction in the DQ
Avoids WAW: the NT bit determines whether it can update the ARF (architectural register file); if not, the WAW bit is set, preventing the update, and the SRF register update may be used only for data forwarding
Discovering and propagating data dependencies
NT[dest] = NT[operand_1] || ... || NT[operand_n]
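A minimal sketch of this propagation rule, assuming the NT bits are kept in a dictionary keyed by register name (the `propagate_nt` function and the register names are hypothetical). The destination's NT bit becomes the OR of the source operands' NT bits, so a dependency on an unresolved load spreads to every consumer, while an instruction with no unavailable sources clears the bit.

```python
def propagate_nt(nt, dest, operands):
    """Set NT[dest] to the OR of the NT bits of the source operands.

    Returns True when the instruction depends on a not-there value
    and therefore has to be deferred.
    """
    deferred = any(nt.get(reg, False) for reg in operands)
    nt[dest] = deferred
    return deferred

nt = {"r1": True}                    # r1 is the target of a load that missed
d1 = propagate_nt(nt, "r2", ["r1"])  # r2 = r1 + 4: inherits NT from r1
d2 = propagate_nt(nt, "r3", ["r2"])  # r3 depends on r2: inherits NT as well
d3 = propagate_nt(nt, "r2", [])      # r2 = constant: no NT sources, NT cleared
print(d1, d2, d3, nt)
```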
SST handling of Control Hazards
Speculation fails if any of the following occur
Branch Mis-Prediction
Transactional Memory Failure
Memory order violation detected by 'S' bit in cache
Exception
Failed speculation causes
speculative checkpoint to be discarded and,
architectural checkpoint restored
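The recovery path can be sketched as follows. This is a simplified model under stated assumptions: the `Core` class, the `speculate` helper and the event names are hypothetical, and the whole register state is copied as the checkpoint, which is not how the hardware stores it.

```python
import copy

# The four failure causes listed above, as hypothetical event names.
FAILURE_CAUSES = {"branch_mispredict", "tx_failure", "mem_order_violation", "exception"}

class Core:
    def __init__(self, regs):
        self.regs = dict(regs)
        self.arch_checkpoint = None

    def begin_episode(self):
        # Save the architectural state before speculating.
        self.arch_checkpoint = copy.deepcopy(self.regs)

    def fail_speculation(self):
        # Discard speculative state and restore the architectural checkpoint.
        self.regs = self.arch_checkpoint
        self.arch_checkpoint = None

def speculate(core, updates, event=None):
    """Apply speculative register updates; roll back if a failure event occurs."""
    core.begin_episode()
    core.regs.update(updates)
    if event in FAILURE_CAUSES:
        core.fail_speculation()
        return False
    core.arch_checkpoint = None  # success: speculative state becomes architectural
    return True

core = Core({"r1": 1})
failed = speculate(core, {"r1": 99}, event="exception")
after_failure = dict(core.regs)
succeeded = speculate(core, {"r1": 7})
```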
SST Memory Consistency Protocol
Load Order protocol
Speculative loads set the cache line's 'S' (speculatively read) bit (transactional memory support)
If the cache logic evicts or invalidates a line with the 'S' bit set, then ahead-thread speculation has failed for this episode
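The load-order check can be modelled with a small sketch. The `L1Cache` and `CacheLine` classes below are hypothetical and ignore sets, ways and coherence states; they only show that losing a line whose 'S' bit is set marks the episode as failed.

```python
class CacheLine:
    def __init__(self, addr):
        self.addr = addr
        self.s_bit = False  # speculatively read

class L1Cache:
    def __init__(self):
        self.lines = {}
        self.speculation_failed = False

    def speculative_load(self, addr):
        # A speculative load marks the line it reads.
        line = self.lines.setdefault(addr, CacheLine(addr))
        line.s_bit = True

    def evict_or_invalidate(self, addr):
        # Losing a marked line means another agent may have changed the data.
        line = self.lines.pop(addr, None)
        if line is not None and line.s_bit:
            self.speculation_failed = True

    def end_episode(self):
        # Clear all marks once the episode commits or is rolled back.
        for line in self.lines.values():
            line.s_bit = False
        self.speculation_failed = False

cache = L1Cache()
cache.speculative_load(0x100)
cache.evict_or_invalidate(0x200)   # unrelated address: no effect
before = cache.speculation_failed
cache.evict_or_invalidate(0x100)   # marked line lost: speculation fails
after = cache.speculation_failed
```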
Checkpoints
For N=2
At start of an SST episode 2 checkpoints are created
Architectural Checkpoint
Initially active
Once active ahead-thread progresses with speculative execution
Speculative Checkpoint (inactive)
Behind thread wakes, then makes it active; clears W bit vector
NT bit vector copied to SNT bit vector to detect WAW hazards
When the deferred queue is empty for a speculative episode, a "merge" operation is performed
Merge is ahead-thread results + behind-thread results => Architectural Checkpoint
NT = SNT && W; SNT and W bit vectors cleared; Architectural Checkpoint is discarded; Speculative Checkpoint is made active, i.e. it becomes the new Architectural Checkpoint
When the deferred queue is empty for all speculative episodes, a "join" operation is performed
Join is similar to Merge, except that nothing remains in the deferred queue and the speculative episode ends, returning the ahead thread to normal mode
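The bit-vector bookkeeping described above can be written out as a short sketch. It transcribes the stated rules literally (copy NT to SNT and clear W when the behind thread wakes; NT = SNT && W at merge); the `StatusBits` class, the register count and the `waw_hazards` helper, which flags registers whose NT and SNT bits differ, are assumptions made for illustration.

```python
from dataclasses import dataclass, field

NUM_REGS = 8  # hypothetical register count

def _zeros():
    return [False] * NUM_REGS

@dataclass
class StatusBits:
    nt: list = field(default_factory=_zeros)   # Not There
    snt: list = field(default_factory=_zeros)  # Speculatively Not There
    w: list = field(default_factory=_zeros)    # Written

    def behind_thread_wakes(self):
        self.w = _zeros()
        self.snt = list(self.nt)  # capture the ahead thread's NT state

    def waw_hazards(self):
        # Registers whose NT bit changed since the SNT copy was taken.
        return [i for i, (n, s) in enumerate(zip(self.nt, self.snt)) if n != s]

    def merge(self):
        self.nt = [s and w for s, w in zip(self.snt, self.w)]
        self.snt = _zeros()
        self.w = _zeros()

bits = StatusBits()
bits.nt[1] = bits.nt[2] = True     # two registers awaiting deferred results
bits.behind_thread_wakes()
bits.nt[2] = False                 # ahead thread overwrites register 2
hazards = bits.waw_hazards()
bits.w[1] = True
bits.merge()
```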
SST new circuit structures
To Handle N Checkpoints (assume N=2)
2 Defer Queues
Hold instructions & resolved operands used by behind thread
When the speculative checkpoint updates SRF2, the behind thread wakes and uses SRF1
Status bits NT, SNT, W, WAW
Not There, Speculatively Not There, Written, WAW
Behind thread uses W bit like Ahead thread uses NT bit
SNT bit is used to capture register state of Ahead thread when Behind thread initiates
NT =/= SNT => WAW when checked during SST episode
Any Register with WAW set value gets dropped at end of SST episode
S bit in Cache line
Cache slot is waiting for a 'S'peculative load
SST logic
[Flowchart of an SST episode; recoverable labels follow]
Start: high-level SW initiates a memory transaction, or an L1 miss occurs; begin SST episode
Checkpoints: Architectural (active), Speculative (inactive); set 'S' bit in cache
Start executing main thread speculatively ahead; start behind thread in wait mode to handle defers
Instr has no data dependencies? Execute instr and retire OO
Instr has data dependencies? Enqueue DQ with instr and all resolved operands
DQ full? Ahead thread: scout mode; behind thread: pause
L1 resolved: wake up behind thread; behind thread runs through DQ for active checkpoint
WAIT: more data expected
Tx fail, 'S' bit detects memory order violation, branch mispredict, exception: restore checkpoint
DQ empty for current and speculative checkpoint? Done: speculation successful; ahead thread: normal mode; behind thread: pause
Program execution resumes where speculation finished
SST scheduling
Program order
LDX addr1, %r1
ADD %r1, 0x04, %r2
STX %r2, addr2
SETHI 0x01, %r2
STX %r2, addr3
etc..
Ahead thread
1 LDX addr1, %r1 ; load miss on addr1: defer and set R1[NT], to defer queue; checkpoint; start ahead thread, behind thread waits for data read
2 ADD %r1, 0x04, %r2 ; source operand has NT bit set: defer and set R2[NT], to defer queue
3 STX %r2, addr2 ; source operand has NT bit set: defer, to defer queue
4 SETHI 0x01, %r2 ; ahead thread executes independently
5 STX %r2, addr3 ; ahead thread executes independently and continues speculative execution of more program instructions
Behind thread (load miss resolves, start behind thread)
6 ADD %r1, 0x04, %r2 [NT=0, SNT=1] ; NT was reset at 4, set WAW bit
7 STX %r2, addr2
SST order
LDX addr1, %r1
ADD %r1, 0x04, %r2
STX %r2, addr2
SETHI 0x01, %r2
STX %r2, addr3
etc..
Deferred queue
LDX addr1, %r1[NT]
ADD %r1[NT], 0x04, %r2[NT]
STX %r2[NT], addr2
RAW: deferring data-dependent instructions prevents RAW; here %r2 was read at 3 but written before at 2
WAR: saving operands in the DQ prevents WAR, as any valid data in the register at that time is captured and saved for the behind thread to use later, regardless of future writes by the ahead thread
WAW: registers with the WAW bit set are not committed to architectural state; here %r2 was written at 4 and 6
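The schedule above can be replayed with a small Python model to check that deferral preserves the program-order result. This is a sketch under several assumptions: the tuple encoding of the instructions, the helpers `run_in_order` and `run_sst`, and the choice to model the SRF as a plain dictionary are all hypothetical; SETHI is modelled as in SPARC, placing the immediate in the upper bits (shift left by 10); and the WAW check is simplified to "the destination is no longer marked NT when the behind thread produces it".

```python
PROGRAM = [
    ("LDX", "addr1", "r1"),    # r1 = mem[addr1]
    ("ADD", "r1", 4, "r2"),    # r2 = r1 + 4
    ("STX", "r2", "addr2"),    # mem[addr2] = r2
    ("SETHI", 1, "r2"),        # r2 = 1 << 10
    ("STX", "r2", "addr3"),    # mem[addr3] = r2
]

def sources(instr):
    return [instr[1]] if instr[0] in ("ADD", "STX") else []

def dest(instr):
    return {"LDX": instr[2], "ADD": instr[-1], "SETHI": instr[2]}.get(instr[0])

def execute(instr, regs, mem):
    """Run one instruction; return its destination register (None for a store)."""
    op = instr[0]
    if op == "LDX":
        regs[instr[2]] = mem[instr[1]]
    elif op == "ADD":
        regs[instr[3]] = regs[instr[1]] + instr[2]
    elif op == "SETHI":
        regs[instr[2]] = instr[1] << 10
    else:  # STX
        mem[instr[2]] = regs[instr[1]]
    return dest(instr)

def run_in_order(memory):
    regs, mem = {}, dict(memory)
    for instr in PROGRAM:
        execute(instr, regs, mem)
    return regs, mem

def run_sst(memory, missing):
    regs, mem = {}, dict(memory)
    nt, dq, log = set(), [], []
    # Ahead thread: defer anything that misses or reads an NT register.
    for instr in PROGRAM:
        srcs = sources(instr)
        miss = instr[0] == "LDX" and instr[1] in missing
        if miss or any(s in nt for s in srcs):
            resolved = {s: regs[s] for s in srcs if s not in nt}  # saved operands (WAR)
            dq.append((instr, resolved))
            if dest(instr) is not None:
                nt.add(dest(instr))
            log.append(("defer", instr[0]))
        else:
            nt.discard(execute(instr, regs, mem))  # an independent write clears NT
            log.append(("ahead", instr[0]))
    # Behind thread: replay the deferred queue once the miss has resolved.
    srf, waw = {}, set()
    for instr, resolved in dq:
        view = {**srf, **resolved}
        d = execute(instr, view, mem)
        log.append(("behind", instr[0]))
        if d is None:
            continue
        srf[d] = view[d]          # kept for forwarding to later deferred instructions
        if d in nt:
            regs[d] = view[d]     # still awaited: commit to architectural state
            nt.discard(d)
        else:
            waw.add(d)            # a younger write already owns the register
    return regs, mem, waw, log

sst_regs, sst_mem, sst_waw, sst_log = run_sst({"addr1": 10}, missing={"addr1"})
ref_regs, ref_mem = run_in_order({"addr1": 10})
```

In this model the final registers and memory of the SST run match the in-order run, and `%r2` is the only register flagged as a WAW case, which agrees with step 6 of the schedule.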