ocelot

Published in: Technology

PTXOptimizer @ ocelot, by Sean-Chen, 2011/04/15
PTXOptimizer
  • SubkernelFormationPass
  • RemoveBarrierPass
  • LinearScanRegisterAllocationPass
  • MIMDThreadSchedulingPass
SubkernelFormationPass 2
  • Step 1: assign the top kernel for all CFGs
  • [Figure labels: Module, Kernel, DFG0, CFG0, CFG1, CFG2, entry, exit; expression nodes a, b, c, +]
SubkernelFormationPass 3
  • [Figure labels: Module, Kernel 0, Kernel 1, kernel_0, kernel_1, Schedule, assign subkernels, CFG0, CFG1, CFG2]
SubkernelFormationPass 1
  • Algorithm
      1) Start at a kernel entry point that dominates all remaining blocks.
      2) Create a strongly connected subgraph with N instructions and no barriers.
          a) This is a new kernel.
      3) For all edges leaving the graph:
          a) Save all live registers.
          b) Save the target block's id.
          c) Create a new scheduler block that includes an indirect branch to each of the targets.
          d) Redirect each edge to the kernel exit point.
          e) Create a new kernel rooted in the new scheduler block, then go to 1.
  • Ref: SubkernelFormationPass.cpp
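The partitioning part of the algorithm above can be sketched on a toy graph. This is a minimal illustration, not Ocelot's implementation: `Block`, `Partition`, and `formSubkernels` are hypothetical names, the region is grown greedily in breadth-first order, and the sketch only records which edges leave each region. It does not save live registers or build scheduler blocks (steps 3a to 3d).

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical toy CFG block; not Ocelot's control-flow graph class.
struct Block {
    int instructions;             // number of instructions in the block
    bool hasBarrier;              // true if the block contains a barrier
    std::vector<int> successors;  // indices of successor blocks
};

struct Partition {
    std::vector<int> regionOf;                    // region (subkernel) id per block, -1 if unreachable
    std::vector<std::pair<int, int>> crossEdges;  // edges (from, to) that leave a region
    int regionCount = 0;
};

// Greedily groups blocks into regions of at most maxInstructions instructions.
// A block with a barrier is never merged into another block's region.
Partition formSubkernels(const std::vector<Block>& cfg, int entry, int maxInstructions) {
    assert(entry >= 0 && static_cast<std::size_t>(entry) < cfg.size());
    Partition result;
    result.regionOf.assign(cfg.size(), -1);

    std::queue<int> roots;  // blocks that start a new region
    roots.push(entry);

    while (!roots.empty()) {
        int root = roots.front();
        roots.pop();
        if (result.regionOf[root] != -1) continue;  // already claimed by an earlier region

        int region = result.regionCount++;
        result.regionOf[root] = region;
        int budget = maxInstructions - cfg[root].instructions;
        bool canGrow = !cfg[root].hasBarrier;  // a barrier block stays alone

        std::queue<int> frontier;
        frontier.push(root);
        while (!frontier.empty()) {
            int current = frontier.front();
            frontier.pop();
            for (int next : cfg[current].successors) {
                if (result.regionOf[next] != -1) continue;
                bool fits = canGrow && !cfg[next].hasBarrier &&
                            cfg[next].instructions <= budget;
                if (fits) {
                    result.regionOf[next] = region;
                    budget -= cfg[next].instructions;
                    frontier.push(next);
                } else {
                    roots.push(next);  // becomes the root of a later region
                }
            }
        }
    }

    // Edges whose endpoints lie in different regions are the ones a real
    // pass would redirect through a scheduler block.
    for (std::size_t from = 0; from < cfg.size(); ++from) {
        if (result.regionOf[from] == -1) continue;
        for (int to : cfg[from].successors) {
            if (result.regionOf[from] != result.regionOf[to]) {
                result.crossEdges.push_back({static_cast<int>(from), to});
            }
        }
    }
    return result;
}
```

For a straight-line graph of four blocks where the third holds a barrier, calling `formSubkernels(cfg, 0, 4)` groups the first two blocks together and places the barrier block and the final block in regions of their own.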
SubkernelFormationPass 4
  • Sample steps (pseudocode)
      • Create a new kernel
          • Kernel = new kernel();
      • Assign a new CFG to the kernel
          • New_kernel->cfg() = new CFG();
          • Org_Kernel->cfg()->update();
      • Update the PTX graph
          • PTX->cfg()->update()
      • Update the module
          • module->update()
      • Re-schedule()
          • SSA graph
          • Dominator tree
          • Control tree
SubkernelFormationPass 5
  • Why do it?
      • It reduces the kernel load and increases parallelism.
      • Note: this is a trade-off against the "fork" and "join" cost of communication between kernels.
RemoveBarrierPass 1
  • Ref: ocelot-pact.pdf
RemoveBarrierPass 2
  • How to do it?
      • Replace each barrier instruction with a function call.
  • Definition
      • The call instruction stores the address of the next instruction, so execution can resume at that point after executing a ret instruction. A call is assumed to be divergent unless the .uni suffix is present, indicating that the call is guaranteed to be non-divergent, meaning that all threads in a warp have identical values for the guard predicate and call target.
  • Ref: PTX ISA 2.1
RemoveBarrierPass 3
  • Example
      • Different threads in a CTA execute a = a + 1 and b = b + 1.
      • a = a + 1; sync(); // @ sync mem reg .... b = b + 1;
          • b + 1 = thread1
          • a + 1 = thread2
      • thread1 waits for thread2 to finish ...
          • bar.sync()
RemoveBarrierPass 3
  • Sample steps (pseudocode), with loads and stores going to a new memory address
      • Find the branch location and replace it with a function call
          • Brn = kernel->cfg()->terminator()->Branch();
          • Instruction *IT = new Instruction(IR::FunctionCall);
          • kernel->cfg()->insert(IT);
          • kernel->cfg()->remove(Brn);
      • Assign the function call type
          • IT->d() = IR::addressType // dest register
          • IT->a() = IR::addressType // source register
          • IT->type() = IR::FunctionCall
      • Update the links
          • IT->Predecessor()->update()
          • IT->Successor()->update()
      • Call back to the original pointer
          • new end pointer = original end pointer
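One way to picture what replacing a barrier buys is to split a thread's code at the barrier into two regions and let a scheduler loop run them. The sketch below is an illustration under assumptions, not Ocelot's code: `ThreadContext`, `beforeBarrier`, `afterBarrier`, and `runCta` are hypothetical names, and the "call" is modelled as an ordinary C++ function returning a resume point. Each thread runs up to the barrier, its live value `a` is kept in a per-thread context, and only after every thread has arrived does the second region run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread context holding values that are live across the barrier.
struct ThreadContext {
    int resumePoint = 0;  // 0 = before the barrier, 1 = after the barrier
    int a = 0;            // live value saved at the barrier
    int result = 0;       // value computed after the barrier
};

// Region before the barrier: compute a and publish it to shared memory.
// Returns the resume point at which the thread continues later.
int beforeBarrier(std::size_t tid, ThreadContext& ctx, std::vector<int>& shared) {
    ctx.a = static_cast<int>(tid) + 1;
    shared[tid] = ctx.a;
    return 1;
}

// Region after the barrier: safe to read a neighbour's published value.
void afterBarrier(std::size_t tid, ThreadContext& ctx, const std::vector<int>& shared) {
    ctx.result = ctx.a + shared[(tid + 1) % shared.size()];
}

// Runs all threads of one CTA serially without a hardware barrier.
std::vector<int> runCta(std::size_t threadCount) {
    assert(threadCount > 0);
    std::vector<ThreadContext> contexts(threadCount);
    std::vector<int> shared(threadCount, 0);

    // Phase 1: every thread runs until it reaches the barrier and yields.
    for (std::size_t tid = 0; tid < threadCount; ++tid) {
        contexts[tid].resumePoint = beforeBarrier(tid, contexts[tid], shared);
    }
    // Phase 2: all threads have arrived, so each one resumes after the barrier.
    for (std::size_t tid = 0; tid < threadCount; ++tid) {
        if (contexts[tid].resumePoint == 1) {
            afterBarrier(tid, contexts[tid], shared);
        }
    }

    std::vector<int> results;
    for (const ThreadContext& ctx : contexts) results.push_back(ctx.result);
    return results;
}
```

With four threads, `runCta(4)` yields 3, 5, 7, 5: every thread sees its neighbour's value because no thread enters the second region until all of them have finished the first.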
RemoveBarrierPass 4
  • Why do it?
      • It reduces the time threads spend waiting at each barrier synchronization check.
LinearScanRegisterAllocationPass 1
  • @ %r1
  • Ref: ocelot-pact.pdf
LinearScanRegisterAllocationPass 2
  • Sample steps (pseudocode)
      • Based on the SSA graph
          • Find the PHI nodes
              • kernel->dfg()->hasPHINode()?
          • Replace all live-in values of each PHI node
              • Foreach (kernel->dfg->PHINode()->aliveIn())...
          • Update the graph
              • kernel->cfg()->update()
              • Predecessor
              • Successor
LinearScanRegisterAllocationPass 3
  • Why do it?
      • It replaces registers with local shared memory.
          • More parallelism in thread access
          • More data sharing
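The slides do not show the allocation step itself. As a point of reference, the textbook linear-scan algorithm over live intervals can be sketched as follows. This is the generic algorithm, not Ocelot's pass: `Interval`, `kSpilled`, and `linearScan` are hypothetical names, interval ends are treated as inclusive, and a spilled value is simply marked with -1 where a real pass would assign it a memory slot.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical live interval of one virtual register.
struct Interval {
    int vreg;   // virtual register id
    int start;  // first instruction index where the value is live
    int end;    // last instruction index where the value is live (inclusive)
};

const int kSpilled = -1;  // marker for a value kept in memory instead of a register

// Maps each virtual register to a physical register index or to kSpilled.
std::map<int, int> linearScan(std::vector<Interval> intervals, int registerCount) {
    assert(registerCount > 0);
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& x, const Interval& y) { return x.start < y.start; });

    std::map<int, int> assignment;
    std::vector<Interval> active;  // intervals holding a register, sorted by increasing end
    std::vector<int> freeRegisters;
    for (int r = registerCount - 1; r >= 0; --r) freeRegisters.push_back(r);

    for (const Interval& current : intervals) {
        // Expire intervals that ended before the current one starts.
        while (!active.empty() && active.front().end < current.start) {
            freeRegisters.push_back(assignment[active.front().vreg]);
            active.erase(active.begin());
        }

        if (freeRegisters.empty()) {
            // No register left: spill whichever interval ends last.
            const Interval last = active.back();
            if (last.end > current.end) {
                assignment[current.vreg] = assignment[last.vreg];
                assignment[last.vreg] = kSpilled;
                active.pop_back();
            } else {
                assignment[current.vreg] = kSpilled;
                continue;  // current never enters the active list
            }
        } else {
            assignment[current.vreg] = freeRegisters.back();
            freeRegisters.pop_back();
        }

        // Insert the current interval, keeping the active list sorted by end.
        auto pos = std::upper_bound(
            active.begin(), active.end(), current,
            [](const Interval& x, const Interval& y) { return x.end < y.end; });
        active.insert(pos, current);
    }
    return assignment;
}
```

With two registers and the intervals [0,10], [1,3], [2,8], [4,6] for virtual registers 0 to 3, the longest-lived value (register 0) is the one that gets spilled, and the other three share the two physical registers.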
MIMDThreadSchedulingPass 1
  • Definition
      • Predicated execution
          • .reg .pred p, q, r;
      • Example
          • if (i < n)
                j = j + 1;
          • setp.lt.s32 p, i, n; // compare i to n
            @!p bra L1; // if false, branch over
            add.s32 j, j, 1;
            L1: ...
  • [Figure labels: j=j+1, j=j+1]
MIMDThreadSchedulingPass 2
  • Sample steps (pseudocode)
      • Find the branch instruction and the dominators
          • Dom = kernel->dominator_tree();
          • Post = kernel->post_dominator_tree();
          • kernel->terminator()->hasBranch()?
      • Replace the branch with a predicated instruction
          • Instruction *IT = new Instruction(IR::Instruction::Pred);
          • kernel->Instruction->Insert(IT);
          • kernel->Instruction->erase(Bn);
      • Update the graph
          • kernel->cfg()->update();
          • kernel->PTX()->update();
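The branch-to-predicate replacement can be illustrated on a toy instruction list. This is a sketch under assumptions rather than Ocelot's pass: `Instr`, `negate`, and `ifConvert` are hypothetical names, instructions are plain strings, and the rewrite only handles a guarded forward branch that skips a run of unguarded, non-branch instructions ending at its own target label. It also assumes the skipped instructions do not overwrite the predicate.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy instruction; not Ocelot's PTX instruction class.
struct Instr {
    std::string guard;     // "" = unguarded, "p" = @p, "!p" = @!p
    std::string opcode;    // e.g. "add.s32", "bra", or "label"
    std::string operands;  // operand text, or the label name for "bra" and "label"
};

// Flips a guard: "p" becomes "!p" and "!p" becomes "p".
std::string negate(const std::string& guard) {
    assert(!guard.empty());
    return guard[0] == '!' ? guard.substr(1) : "!" + guard;
}

// Rewrites "@g bra L; <run>; L:" into "<run guarded by not g>; L:".
// Assumes no instruction in the run writes the predicate used by the guard.
std::vector<Instr> ifConvert(const std::vector<Instr>& code) {
    std::vector<Instr> out;
    std::size_t i = 0;
    while (i < code.size()) {
        const Instr& branch = code[i];
        if (branch.opcode == "bra" && !branch.guard.empty()) {
            // Scan the run of plain instructions that the branch would skip.
            std::size_t j = i + 1;
            while (j < code.size() && code[j].guard.empty() &&
                   code[j].opcode != "bra" && code[j].opcode != "label") {
                ++j;
            }
            bool reachesTarget = j > i + 1 && j < code.size() &&
                                 code[j].opcode == "label" &&
                                 code[j].operands == branch.operands;
            if (reachesTarget) {
                for (std::size_t k = i + 1; k < j; ++k) {
                    Instr guarded = code[k];
                    guarded.guard = negate(branch.guard);  // run only when the branch is not taken
                    out.push_back(guarded);
                }
                i = j;  // keep the label, since other branches may still target it
                continue;
            }
        }
        out.push_back(code[i]);
        ++i;
    }
    return out;
}
```

Applied to the four-instruction example from the previous slide (`setp`, `@!p bra L1`, `add.s32`, `L1:`), the branch disappears and the add becomes `@p add.s32 j, j, 1`, so all threads follow the same instruction stream.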
MIMDThreadSchedulingPass 3
  • Why do it?
      • More parallelism in thread access
Reference
  • gpuocelot
      • http://code.google.com/p/gpuocelot/wiki/References
