<ul>PTXOptimizer @ ocelot by Sean-Chen  2011/04/15 </ul>
<ul><li>PTXOptimizer </li><ul><li>SubkernelFormationPass </li>
<li>RemoveBarrierPass </li>
<li>LinearScanRegisterAllocationPass </li>
<li>MIMDThreadSchedulingPass </li></ul></ul>
SubkernelFormationPass 2 [Diagram: a Module holds a Kernel with DFG0 and CFG0. Step 1: assign the top kernel for all of CFG0, CFG1, and CFG2, from entry to exit.]
SubkernelFormationPass 3 [Diagram: Kernel 0 of the Module is split into Kernel 0 and Kernel 1 (kernel_0, kernel_1); a schedule assigns the subkernels over CFG0, CFG1, and CFG2.]
SubkernelFormationPass 1 <ul><li>Algorithm </li><ul><li>1) Start at a kernel entry point that dominates all remaining blocks. </li>
<li>2) Create a strongly connected subgraph with N instructions and no barriers. </li><ul><li>a) This is a new kernel. </li></ul>
<li>3) For all edges leaving the subgraph: </li><ul><li>a) Save all live registers. </li>
<li>b) Save the target block's id. </li>
<li>c) Create a new scheduler block that includes an indirect branch to each of the targets. </li>
<li>d) Redirect each edge to the kernel exit point. </li>
<li>e) Create a new kernel rooted in the new scheduler block, then go to 1. </li></ul></ul></ul>Ref: SubkernelFormationPass.cpp
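The region-growing part of the algorithm above can be sketched on a toy control-flow graph. This is a minimal illustration under stated assumptions, not the code in SubkernelFormationPass.cpp: `Block`, `Subkernel`, and `formSubkernels` are hypothetical names, a barrier is assumed to sit at the end of its block, and saving live registers and building the scheduler block are left out. The sketch also assumes that no later region branches back into the middle of an earlier one; a real pass would have to split the earlier region at that point.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <set>
#include <vector>

// Hypothetical toy CFG block; not Ocelot's ir::ControlFlowGraph.
struct Block {
    int instructions;             // number of instructions in the block
    bool endsWithBarrier;         // true if the block's last instruction is a barrier
    std::vector<int> successors;  // ids of successor blocks
};

// One formed region: an entry block, the blocks it absorbed, and the
// targets of all edges that leave the region.
struct Subkernel {
    int entry;
    std::vector<int> blocks;
    std::set<int> exitTargets;
};

// Greedily grows regions of at most maxInstructions instructions. A region
// never continues past a barrier, and every edge that leaves a region
// becomes the entry of a later subkernel.
std::vector<Subkernel> formSubkernels(const std::vector<Block>& cfg,
                                      int entry, int maxInstructions) {
    assert(entry >= 0 && static_cast<std::size_t>(entry) < cfg.size());
    std::vector<int> owner(cfg.size(), -1);        // region id per block
    std::vector<bool> reserved(cfg.size(), false); // blocks that must stay entries
    std::vector<Subkernel> kernels;
    std::queue<int> roots;
    roots.push(entry);
    reserved[entry] = true;

    while (!roots.empty()) {
        const int root = roots.front();
        roots.pop();
        if (owner[root] != -1) continue;  // this entry was already processed

        const int id = static_cast<int>(kernels.size());
        Subkernel sk;
        sk.entry = root;
        owner[root] = id;
        int budget = maxInstructions - cfg[root].instructions;

        std::queue<int> frontier;
        frontier.push(root);
        while (!frontier.empty()) {
            const int b = frontier.front();
            frontier.pop();
            sk.blocks.push_back(b);
            for (int s : cfg[b].successors) {
                if (owner[s] == id) continue;  // edge stays inside the region
                const bool fits = owner[s] == -1 && !reserved[s] &&
                                  !cfg[b].endsWithBarrier &&
                                  cfg[s].instructions <= budget;
                if (fits) {
                    owner[s] = id;
                    budget -= cfg[s].instructions;
                    frontier.push(s);
                } else {
                    // Edge leaves the region: record the target and make it
                    // the root of a later subkernel.
                    sk.exitTargets.insert(s);
                    reserved[s] = true;
                    roots.push(s);
                }
            }
        }
        kernels.push_back(sk);
    }
    return kernels;
}
```

For a straight-line graph of four blocks where the third block ends with a barrier and the budget is four instructions, the sketch yields three regions: the first two blocks, the barrier block alone, and the final block.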
SubkernelFormationPass 4 <ul><li>Sample steps (pseudocode) </li><ul><li>Create a new kernel </li><ul><li>Kernel = new kernel(); </li></ul><li>Assign a new CFG to the kernel </li><ul><li>New_kernel->cfg() = new CFG(); </li>
<li>Org_Kernel->cfg()->update(); </li></ul><li>Update the PTX graph </li><ul><li>PTX->cfg()->update() </li></ul><li>Update the module </li><ul><li>module->update() </li></ul><li>Re-schedule() </li><ul><li>SSA graph </li>
<li>Dominator tree </li>
<li>Control tree </li></ul></ul></ul>
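The dominator information that has to be rebuilt after re-scheduling can be computed with the standard iterative data-flow formulation. The sketch below is a generic textbook version, not Ocelot's dominator tree class: `computeDominators` is a hypothetical helper, the CFG is given as plain successor lists, and every block is assumed to be reachable from the entry.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Returns, for every block b, the set of blocks that dominate b.
// successors[b] lists the successor ids of block b.
std::vector<std::set<int>> computeDominators(
        const std::vector<std::vector<int>>& successors, int entry) {
    const std::size_t n = successors.size();

    // Build predecessor lists from the successor lists.
    std::vector<std::vector<int>> preds(n);
    for (std::size_t b = 0; b < n; ++b) {
        for (int s : successors[b]) {
            preds[static_cast<std::size_t>(s)].push_back(static_cast<int>(b));
        }
    }

    // Start from "every block dominates every block" and shrink the sets.
    std::set<int> all;
    for (std::size_t b = 0; b < n; ++b) all.insert(static_cast<int>(b));
    std::vector<std::set<int>> dom(n, all);
    dom[static_cast<std::size_t>(entry)] = {entry};

    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t b = 0; b < n; ++b) {
            if (static_cast<int>(b) == entry) continue;
            // dom(b) = {b} united with the intersection of dom(p)
            // over all predecessors p of b.
            std::set<int> next = all;
            for (int p : preds[b]) {
                const std::set<int>& dp = dom[static_cast<std::size_t>(p)];
                std::set<int> tmp;
                std::set_intersection(next.begin(), next.end(),
                                      dp.begin(), dp.end(),
                                      std::inserter(tmp, tmp.begin()));
                next.swap(tmp);
            }
            next.insert(static_cast<int>(b));
            if (next != dom[b]) {
                dom[b] = next;
                changed = true;
            }
        }
    }
    return dom;
}
```

In a diamond-shaped graph (block 0 branches to 1 and 2, which both join at 3), the join block is dominated only by the entry and itself, which is why neither arm can serve as the entry point for the remaining blocks.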
SubkernelFormationPass 5 <ul><li>Why do it? </li><ul><li>Reduce kernel loading and increase parallelism. </li>
<li>Note: this is a trade-off, because the “fork” and “join” between subkernels require kernel communication. </li></ul></ul>
RemoveBarrierPass 1 Ref: ocelot-pact.pdf
RemoveBarrierPass 2 <ul><li>How to do it? </li><ul><li>Replace the barrier instruction with a function call. </li></ul><li>Definition </li><ul><li>The call instruction stores the address of the next instruction, so execution can resume at that point after executing a ret instruction. A call is assumed to be divergent unless the .uni suffix is present, indicating that the call is guaranteed to be non-divergent, meaning that all threads in a warp have identical values for the guard predicate and call target. </li></ul></ul><ul>Ref: PTX ISA 2.1 </ul>
RemoveBarrierPass 3 <ul><li>Example </li><ul><li>Different threads in a CTA execute a = a + 1 and b = b + 1 around a synchronization point: </li>
<li>a = a+1; sync(); //@ sync mem reg .... b = b+1; </li><ul><li>b+1 = thread1 </li>
<li>a+1 = thread2 </li></ul><li>thread1 waits for thread2 to finish... </li><ul><li>bar.sync() </li></ul></ul></ul>
RemoveBarrierPass 3 <ul><li>Sample steps (pseudocode), with loads/stores to a new memory address </li><ul><li>Find the branch location and replace it with a function call </li><ul><li>Brn = Kernel->cfg()->terminator()->Branch(); </li>
<li>Instruction *IT = new Instruction(IR::FunctionCall); </li>
<li>Kernel->cfg()->insert(IT); </li>
<li>Kernel->cfg()->remove(Brn); </li></ul><li>Assign the function call type </li><ul><li>IT->d() = IR::addressType // dest register </li>
<li>IT->a() = IR::addressType // source register </li>
<li>IT->type() = IR::FunctionCall </li></ul><li>Update links </li><ul><li>IT->Predecessor()->update() </li>
<li>IT->Successor()->update() </li></ul><li>Return to the original end pointer </li><ul><li>new end pointer = org end pointer </li></ul></ul></ul>
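One way to picture what removing a barrier means is to cut the kernel body at the barrier and let a scheduler resume each thread at the saved point. The sketch below is only an illustration of that idea, not Ocelot's RemoveBarrierPass: `ThreadContext`, `runUntilBarrier`, and `runCta` are hypothetical names, and the second statement reads a neighbouring thread's `a` (a variation on the a = a+1 / b = b+1 example) so that the ordering enforced by the barrier becomes observable.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-thread state. resumePoint plays the role of the saved
// return address: it records where the thread continues after the barrier.
struct ThreadContext {
    std::size_t tid;
    int resumePoint;
};

// Kernel body cut at the barrier. Each call runs one thread up to the next
// barrier (or to the end) and returns true once the thread has finished.
bool runUntilBarrier(ThreadContext& t, std::vector<int>& a, std::vector<int>& b) {
    switch (t.resumePoint) {
    case 0:
        a[t.tid] = a[t.tid] + 1;
        t.resumePoint = 1;  // the barrier becomes "yield to the scheduler"
        return false;
    case 1:
        // Reads the neighbour's a, so every thread must have passed case 0.
        b[t.tid] = a[(t.tid + 1) % a.size()] + 1;
        t.resumePoint = 2;
        return true;
    default:
        return true;
    }
}

// Scheduler: keeps sweeping over all threads of the CTA until each one has
// finished. One sweep corresponds to one barrier interval.
void runCta(std::vector<int>& a, std::vector<int>& b) {
    std::vector<ThreadContext> threads;
    for (std::size_t tid = 0; tid < a.size(); ++tid) {
        threads.push_back(ThreadContext{tid, 0});
    }
    bool allDone = false;
    while (!allDone) {
        allDone = true;
        for (ThreadContext& t : threads) {
            if (!runUntilBarrier(t, a, b)) allDone = false;
        }
    }
}
```

Because every thread completes the first segment before any thread enters the second, the result matches what bar.sync() would guarantee, even though the threads run one after another.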
RemoveBarrierPass 4 <ul><li>Why do it? </li><ul><li>Reduce the time threads spend waiting at each barrier synchronization check. </li></ul></ul>
<ul><ul><li>LinearScanRegisterAllocationPass 1 </li></ul></ul>@ %r1 Ref: ocelot-pact.pdf
<ul><ul><li>LinearScanRegisterAllocationPass 2 </li></ul></ul><ul><li>Sample steps (pseudocode) </li><ul><li>Based on the SSA graph </li><ul><li>Find PHI nodes </li><ul><li>kernel->dfg()->hasPHINode()? </li></ul><li>Replace all alive-in values of each PHI node </li><ul><li>Foreach (kernel->dfg->PHINode()->aliveIn())... </li></ul><li>Update the graph </li><ul><li>kernel->cfg()->update() </li>
<li>Predecessor </li>
<li>Successor </li></ul></ul></ul></ul>
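The slides do not show the allocation step itself. As a rough sketch of the textbook linear-scan idea (live intervals visited in order of their start, with the interval that ends last spilled when registers run out), the following may help. It is not Ocelot's LinearScanRegisterAllocationPass: `Interval` and `linearScan` are hypothetical names, live intervals are assumed to be precomputed, and a spilled value is simply marked with -1 rather than rewritten into loads and stores to local memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical live interval of one virtual register.
struct Interval {
    int vreg;      // virtual register id
    int start;     // first instruction index where the value is live
    int end;       // last instruction index where the value is live
    int physical;  // assigned physical register, or -1 if spilled
};

// Sorts the intervals by start and assigns one of numPhysical registers to
// each, spilling the interval with the furthest end when none is free.
void linearScan(std::vector<Interval>& intervals, int numPhysical) {
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& x, const Interval& y) { return x.start < y.start; });

    std::vector<int> freeRegs;
    for (int r = numPhysical - 1; r >= 0; --r) freeRegs.push_back(r);  // register 0 is handed out first

    std::vector<std::size_t> active;  // indices of intervals currently holding a register
    for (std::size_t i = 0; i < intervals.size(); ++i) {
        Interval& cur = intervals[i];

        // Expire intervals that ended before cur starts and free their registers.
        for (std::size_t k = 0; k < active.size();) {
            if (intervals[active[k]].end < cur.start) {
                freeRegs.push_back(intervals[active[k]].physical);
                active.erase(active.begin() + static_cast<std::ptrdiff_t>(k));
            } else {
                ++k;
            }
        }

        if (!freeRegs.empty()) {
            cur.physical = freeRegs.back();
            freeRegs.pop_back();
            active.push_back(i);
            continue;
        }

        // No free register: find the active interval that ends last.
        std::size_t victimPos = 0;
        for (std::size_t k = 1; k < active.size(); ++k) {
            if (intervals[active[k]].end > intervals[active[victimPos]].end) victimPos = k;
        }
        if (!active.empty() && intervals[active[victimPos]].end > cur.end) {
            // Spill the victim and give its register to the current interval.
            Interval& victim = intervals[active[victimPos]];
            cur.physical = victim.physical;
            victim.physical = -1;
            active[victimPos] = i;
        } else {
            cur.physical = -1;  // the current interval itself is spilled
        }
    }
}
```

With two physical registers and intervals [0,10], [1,3], [2,8], and [4,6], the long-lived first value is the one that gets spilled, and the register freed by the second value is reused by the fourth.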
<ul><ul><li>LinearScanRegisterAllocationPass 3 </li></ul></ul><ul><li>Why do it? </li><ul><li>Replace registers with local shared memory. </li><ul><li>More parallelism in thread access. </li>
<li>More data sharing. </li></ul></ul></ul>
MIMDThreadSchedulingPass 1 <ul><li>Definition </li><ul><li>Predicated execution </li><ul><li>.reg .pred p, q, r; </li></ul><li>Example </li><ul><li>if (i < n)
j = j + 1;
setp.lt.s32 p, i, n; // compare i to n
@!p bra L1; // if false, branch over
add.s32 j, j, 1;
L1: ... </li></ul></ul></ul>
MIMDThreadSchedulingPass 2 <ul><li>Sample steps (pseudocode) </li><ul><li>Find the branch instruction and the dominators </li><ul><li>Dom = kernel->dominator_tree(); </li>
<li>Post = kernel->post_dominator_tree(); </li>
<li>kernel->terminator()->hasBranch()? </li></ul><li>Replace the branch with a predicated instruction </li><ul><li>Instruction *IT = new Instruction(IR::Instruction::Pred); </li>
<li>kernel->Instruction->Insert(IT); </li>
<li>kernel->Instruction->erase(Bn); </li></ul><li>Update the graph </li><ul><li>kernel->cfg()->update(); </li>
<li>kernel->PTX()->update(); </li></ul></ul></ul>
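Replacing a branch with predication can be sketched on a toy instruction list. This is an assumption-laden illustration, not Ocelot's IR or its MIMDThreadSchedulingPass: `Inst`, `negateGuard`, and `ifConvert` are hypothetical names, and only the simplest pattern is handled, namely a guarded forward `bra` that skips straight-line, unguarded, unlabeled instructions which do not themselves write a predicate with `setp`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy instruction; operands are kept as plain text.
struct Inst {
    std::string label;     // label attached to this instruction, "" if none
    std::string guard;     // "", "p" or "!p"
    std::string opcode;    // e.g. "setp.lt.s32", "bra", "add.s32"
    std::string operands;  // for "bra": the target label
};

// "!p" -> "p" and "p" -> "!p".
static std::string negateGuard(const std::string& g) {
    return g[0] == '!' ? g.substr(1) : "!" + g;
}

// Rewrites "@g bra L; <skipped code>; L:" so that the skipped instructions
// carry the negated guard and the branch disappears.
std::vector<Inst> ifConvert(const std::vector<Inst>& code) {
    std::vector<Inst> out;
    std::size_t i = 0;
    while (i < code.size()) {
        const Inst& br = code[i];
        bool convertible = br.opcode == "bra" && !br.guard.empty() && br.label.empty();

        // Scan forward to the branch target and check the skipped instructions.
        std::size_t target = i + 1;
        while (convertible && target < code.size() && code[target].label != br.operands) {
            const Inst& skipped = code[target];
            if (!skipped.guard.empty() || !skipped.label.empty() ||
                skipped.opcode == "bra" || skipped.opcode.rfind("setp", 0) == 0) {
                convertible = false;
            }
            ++target;
        }
        if (convertible && target == code.size()) convertible = false;  // target not found ahead

        if (!convertible) {
            out.push_back(br);
            ++i;
            continue;
        }
        // Drop the branch and guard every skipped instruction instead.
        for (std::size_t k = i + 1; k < target; ++k) {
            Inst guarded = code[k];
            guarded.guard = negateGuard(br.guard);
            out.push_back(guarded);
        }
        i = target;
    }
    return out;
}
```

Applied to the four-instruction example (setp, @!p bra L1, add, L1), the branch is removed and the add becomes `@p add.s32 j, j, 1;`, so all threads follow the same instruction stream and the predicate decides which of them update j.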
MIMDThreadSchedulingPass 3 <ul><li>Why do it? </li><ul><li>More parallelism in thread access. </li></ul></ul>
Reference <ul><li>gpuocelot </li><ul><li>http://code.google.com/p/gpuocelot/wiki/References </li></ul></ul>