Dynamic Dataflow on CGRA https://github.com/mnnuahg/StamicCGRA

These slides explain my experimental implementation of dynamic dataflow execution on a CGRA.

Published in: Engineering

  1. Dynamic Dataflow on CGRA https://github.com/mnnuahg/StamicCGRA
  2. Dataflow Computing
     • Programs are represented by data flow graphs (DFGs)
     • Data tokens flow through the DFG
     • The example shows the DFG of the function f(x) = (2x+1)/(2x-1)
     • A DFG is a way to express parallelism
     • Independent operations can be done in parallel
     [Figure: DFG of f(x) with input x, nodes *2, +1, -1 and /, and output y]
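To make the firing rule concrete, here is a minimal Python sketch of the example DFG. The dictionary encoding, the node names (`mul2`, `add1`, `sub1`, `div`) and the `run_dfg` helper are assumptions made for illustration; they are not taken from the slides or the repository.

```python
# Hypothetical encoding of the DFG of f(x) = (2x+1)/(2x-1):
# node name -> (operation, names of the nodes or inputs it reads from).
DFG = {
    "mul2": (lambda x: 2 * x, ["x"]),
    "add1": (lambda v: v + 1, ["mul2"]),
    "sub1": (lambda v: v - 1, ["mul2"]),
    "div": (lambda a, b: a / b, ["add1", "sub1"]),
}


def run_dfg(dfg, inputs, output):
    """Fire every node whose input tokens are available until `output` exists."""
    tokens = dict(inputs)  # values produced so far, keyed by producer name
    while output not in tokens:
        ready = [
            name
            for name, (_, sources) in dfg.items()
            if name not in tokens and all(s in tokens for s in sources)
        ]
        if not ready:
            raise RuntimeError("no node can fire")
        # Nodes in `ready` do not depend on each other, so hardware could
        # fire them in parallel (here: add1 and sub1 in the same step).
        for name in ready:
            op, sources = dfg[name]
            tokens[name] = op(*(tokens[s] for s in sources))
    return tokens[output]


y = run_dfg(DFG, {"x": 3}, "div")  # (2*3 + 1) / (2*3 - 1) = 7 / 5
```

In this sketch `add1` and `sub1` become ready in the same step, which is the parallelism the DFG exposes.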
  3. Dataflow Computing on CGRAs
     • A Coarse-Grained Reconfigurable Array (CGRA) is a type of hardware architecture for parallel computing
     • A CGRA contains multiple processing elements (PEs)
     • Each PE connects to its neighboring PEs
     • Different PEs can be configured to perform different operations
     • CGRAs are naturally suited to dataflow computing
     [Figure: the DFG of f(x) (nodes *2, +1, -1 and /, input x, output y) placed on a CGRA]
  4. Predication
     • There is no control flow in dataflow computing
     • Control flow must be transformed into predication
     • Select: forwards one of its two inputs to the output depending on the predicate
     [Figure: DFG with inputs x and y, nodes +, -, < and a Select node with T and F inputs]
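A small Python sketch of predication follows. It assumes one plausible reading of the figure, namely "if x < y then x + y else x - y"; which arithmetic result feeds the T and F inputs of Select is an assumption, and the function names are hypothetical.

```python
def select(pred, if_true, if_false):
    """Select node: forward one of the two data inputs depending on the predicate."""
    return if_true if pred else if_false


def predicated(x, y):
    # Both arms are computed unconditionally; there is no branch in the graph.
    total = x + y        # "+" node
    diff = x - y         # "-" node
    pred = x < y         # "<" node produces the predicate token
    return select(pred, total, diff)
```

The point of the design is that every node always fires; only the Select node decides which result survives.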
  5. Loops
     • 1: outputs a token with value 1 whenever it receives any token
     • ⊗: forwards a token from either of its two inputs to the output
     • Switch: forwards the input token to one of two output arcs depending on the predicate
     • This DFG represents the function L(x) = the smallest power of 2 that is greater than or equal to x
     [Figure: loop DFG with input x and nodes 1, ⊗, Switch, <, *2 and Discard; Switch outputs are labeled T and F]
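The loop can be traced in software with the sketch below. It assumes that one Switch carries x around the loop, that the other carries the running value, and that x is sent to Discard on exit; this wiring is inferred from the node list and is not confirmed by the slides.

```python
def switch(pred, token):
    """Switch node: return (T output, F output); exactly one carries the token."""
    return (token, None) if pred else (None, token)


def L(x):
    """Smallest power of 2 that is >= x, traced through the loop DFG."""
    value, bound = 1, x  # the "1" node emits the start value when x arrives
    while True:
        pred = value < bound                         # "<" node
        value_loop, value_exit = switch(pred, value)  # Switch on the running value
        bound_loop, _discarded = switch(pred, bound)  # Switch on x; F arc goes to Discard
        if not pred:
            return value_exit
        # T arcs loop back through the merge nodes; "*2" doubles the value.
        value, bound = value_loop * 2, bound_loop
```

For example, `L(5)` doubles 1 to 2, 4 and 8, then leaves the loop because 8 < 5 is false.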
  6. Re-Entrance Problem
     • What if another token y enters the loops before the token x leaves them?
     • The token y may enter the left loop first
     • The token x may loop back in the right loop first
     • Then x and y arrive together at the < node => computation error!
     [Figure: the loop DFG of L(x) with two tokens, x and y, in flight]
  7. Solution 1: Forbid Re-Entrance
     • Simply forbid the loop body from being re-entered by multiple tokens
     • More precisely, a new token can enter the loop only after the old token has left it
     • Add a predicate input to ⊗
     • Once a new token enters ⊗, it does not allow other new tokens to enter until it receives a false predicate
     • Tokens looped back to ⊗ are still allowed to enter
     • Problem: the loop body can’t be executed in parallel
     [Figure: the loop DFG of L(x) with predicate inputs added to the ⊗ nodes; tokens x and y wait at the input]
  8. Solution 2: Tagged Token
     • A solution proposed by MIT in the 1990s
     • Attach a tag to each token
     • An operation can be performed only on tokens with the same tag, and it produces tokens with that same tag
     • Problem: not suitable for CGRAs
       • Dataflow architectures in the 1990s were fairly centralized
       • All tokens go to a centralized matching unit, and the unit decides which operations are ready to fire
       • CGRAs don’t have centralized units
       • Each PE would need its own matching unit to find tokens with the same tag => expensive!
     [Figure: the loop DFG of L(x) with tagged tokens x and y]
  9. Solution 3: Tag Matching Using Special CGRA Instructions
     • Just ensure that the tokens entering the loop body have matching tags
     • Then the remaining operations in the loop body don’t need to match tags!
     • The Tag Matcher re-orders tokens so that tokens with the same tag are output together
     • The Tag Matcher can be built from a group of PEs using special CGRA instructions
     [Figure: the loop DFG of L(x) with a Tag Matcher placed where tokens enter the loop body]
  10. Tag Matcher Implementation
     • A token can be output only when all other inputs have already received tokens with the same tag
     • This needs some signaling mechanism
     • We offer two tag matcher implementations with different signaling mechanisms
       • One sends signal tokens
       • The other sends signals via a shared bus
     • No slides yet, so please read the code
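The matching rule itself (not the PE-level signaling, which the slides leave to the code) can be sketched in a few lines of Python. The `TagMatcher` class and its `push` method below are hypothetical names; the sketch is a centralized software model and says nothing about how the signal-token or shared-bus implementations in the repository work.

```python
class TagMatcher:
    """Software model: release a group only when every input holds the same tag."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.pending = {}  # tag -> per-input slots (None = not yet received)

    def push(self, port, tag, value):
        """Accept one tagged token on `port`; return (tag, values) once matched."""
        slots = self.pending.setdefault(tag, [None] * self.num_inputs)
        slots[port] = (value,)  # wrap so that a None value still counts as received
        if all(slot is not None for slot in slots):
            del self.pending[tag]
            return tag, [slot[0] for slot in slots]
        return None  # keep waiting for the other inputs


matcher = TagMatcher(num_inputs=2)
first = matcher.push(0, "y", 7)    # y arrives on input 0: nothing to pair with yet
second = matcher.push(1, "x", 1)   # x arrives on input 1: tags differ, still waiting
third = matcher.push(0, "x", 5)    # x now present on both inputs: released together
```

Tokens therefore leave in matched groups even if they arrived interleaved with tokens of other tags.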
  11. DFG Hierarchy
     • The loop can be used as a building block for a larger DFG
     • This DFG stores L(x) to address x
     [Figure: the loop DFG of L(x), including its Tag Matcher, feeding a Store node that takes x as the address and L(x) as the data]
  12. Handle Out-of-Order Operations
     • The loop is an out-of-order operation
     • Even if token x enters the loop before token y, L(x) may be output after L(y)
     • We need to assign new tags to tokens entering the loop
     • So we can recognize them at the output!
     • We also need a Tag Matcher before Store
     • A token is sent to signal that some tags can be re-used
     [Figure: the loop DFG of L(x) with a Change Tag node at its input and a second Tag Matcher in front of Store; inputs x and L(x)]
  13. Combine Results with Different Tags
     • The larger DFG may combine L(x) and L(y) (and maybe L(z), L(w), …) to form its output
     • L(x) and L(y) can be computed in parallel, and thus carry different tags
     • Two schemes
       1. Parallel calls to a function containing the loop of L(x)
          • Not supported by the CGRA, since a function call requires a sub-DFG to output (return) its result to different positions
       2. Reduction
          • The original tags need to be restored before combining
  14. DFG of Reduction Operation
     • This DFG represents the function $S(x) = \sum_{i=x}^{10} L(i)$
     • The outputs L(i) may be out of order
     • Once an L(i) is output
       • A signal token is sent to inform ChangeTag that the tag of L(i) can be re-used
       • L(i) goes to RestoreTag to restore its original tag
     • Signals should also be sent to RestoreTag
       • So that it knows the mapping between old and new tags
       • Not shown here
     • The DFG is also an out-of-order operation
     [Figure: reduction DFG from Input to Output with nodes ⊗, <10, Switch (T/F), +1, Change Tag, L(x), Restore Tag, 0, +, Tag Matcher and Discard]
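The interplay of ChangeTag, RestoreTag and tag re-use can be modeled with the Python sketch below. The `TagPool` class, the pool size, and the way results are drained in reverse order to imitate out-of-order completion are all assumptions made for illustration; the slides do not show how the mapping between old and new tags is communicated.

```python
def L(x):
    """Smallest power of 2 that is >= x (the loop from the earlier slides)."""
    value = 1
    while value < x:
        value *= 2
    return value


class TagPool:
    """Hypothetical ChangeTag/RestoreTag pair sharing a small pool of loop tags."""

    def __init__(self, num_tags):
        self.free = list(range(num_tags))
        self.original = {}  # new tag -> original tag

    def change_tag(self, orig_tag, value):
        if not self.free:
            return None  # no free tag: the token has to wait
        new_tag = self.free.pop(0)
        self.original[new_tag] = orig_tag
        return new_tag, value

    def restore_tag(self, new_tag, value):
        orig_tag = self.original.pop(new_tag)
        self.free.append(new_tag)  # models the "tag can be re-used" signal
        return orig_tag, value


def reduce_S(x, pool, orig_tag="t0"):
    """Sum L(i) for i = x..10 while results come back out of order."""
    total, in_flight = 0, []

    def drain_one():
        nonlocal total
        new_tag, arg = in_flight.pop()  # newest first: deliberately out of order
        tag, value = pool.restore_tag(new_tag, L(arg))
        assert tag == orig_tag          # the original tag is recovered
        total += value

    for i in range(x, 11):
        token = pool.change_tag(orig_tag, i)
        while token is None:            # pool exhausted: finish one result first
            drain_one()
            token = pool.change_tag(orig_tag, i)
        in_flight.append(token)
    while in_flight:
        drain_one()
    return orig_tag, total


result = reduce_S(7, TagPool(num_tags=2))  # L(7)+L(8)+L(9)+L(10) = 8+8+16+16
```

Because addition is commutative, the order in which the L(i) results are restored and accumulated does not change the sum.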
  15. Deadlock Problem
     • The outputs L(i) are out of order
     • Deadlock may occur if
       • Tokens of tag1 occupy the FIFO of one input of TagMatcher
       • Tokens of tag2 occupy the FIFO of another input
     • The tokens can’t be consumed because of the tag mismatch
     • Tokens with matching tags can’t reach TagMatcher because the FIFOs are occupied
     [Figure: a Tag Matcher whose two input FIFOs hold L(x) and L(y) tokens in opposite orders]
  16. Solution of Deadlock Problem
     • If TagMatcher is about to be jammed, stall the loop
       • By blocking the loop predicate
     • TagMatcher has an internal buffer to store the tokens to be matched
     • Just prevent more tokens than the buffer can hold from entering TagMatcher
     • ⋈ is used to control the number of tokens entering TagMatcher
     [Figure: the reduction DFG with ⋈ nodes added on the loop predicate and next to Tag Matcher]
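One way to picture the throttling is a credit counter sized to TagMatcher's internal buffer. The `Throttle` class below is a hypothetical software stand-in for the ⋈ node; how the real node is built from PEs is not described in the slides.

```python
class Throttle:
    """Credit counter: never admit more tokens than TagMatcher can buffer."""

    def __init__(self, buffer_size):
        self.credits = buffer_size

    def try_enter(self):
        """Ask to send one more token toward TagMatcher; False means stall the loop."""
        if self.credits == 0:
            return False  # blocking the loop predicate would happen here
        self.credits -= 1
        return True

    def leave(self):
        """A token left TagMatcher's buffer, so one slot is free again."""
        self.credits += 1


gate = Throttle(buffer_size=2)
admitted = [gate.try_enter() for _ in range(3)]  # third request must stall
gate.leave()                                     # one buffered token was matched
admitted_after_leave = gate.try_enter()
```

Since the number of tokens in flight never exceeds the buffer size, every token can be stored, so a token with a matching tag can always reach the matcher.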

