ocelot

Published in: Technology
Transcript

  • 1.
      PTXOptimizer @ ocelot by Sean-Chen 2011/04/15
  • 2.
    • PTXOptimizer
      • SubkernelFormationPass
      • RemoveBarrierPass
      • LinearScanRegisterAllocationPass
      • MIMDThreadSchedulingPass
  • 6. SubkernelFormationPass 2: Step 1. Assign the top kernel for all CFGs. [Figure: a Module containing a Kernel with DFG0 and CFG0; CFG0, CFG1, CFG2; nodes a, b, c between entry and exit]
  • 7. SubkernelFormationPass 3: assign subkernels and schedule them. [Figure: a Module with Kernel 0 becomes a Module with Kernel 0 and Kernel 1 (kernel_0, kernel_1); CFG0, CFG1, CFG2]
  • 8. SubkernelFormationPass 1
    • Algorithm
    • 1) Start at a kernel entry point that dominates all remaining blocks.
    • 2) Create a strongly connected subgraph with N instructions and no barriers.
      • a) This is a new kernel.
    • 3) For all edges leaving the graph:
      • a) save all live registers
      • b) save the target block's id
      • c) create a new scheduler block that includes an indirect branch to each of the targets
      • d) redirect each edge to the kernel exit point
      • e) create a new kernel rooted in the new scheduler block, go to 1
    Ref: SubkernelFormationPass.cpp
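The partitioning steps above can be sketched as a small standalone C++ function. This is a minimal illustration under stated assumptions, not code from SubkernelFormationPass.cpp: the names `Block`, `Subkernel`, and `formSubkernels` are hypothetical, a region is grown breadth-first (connected from its root, which is weaker than strongly connected), a barrier is assumed to be the last instruction of its block, and saving live registers is left out.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <vector>

// Hypothetical, simplified CFG block (not Ocelot's data structure).
struct Block {
    int instructions;             // number of instructions in the block
    bool endsWithBarrier;         // assumption: a barrier is the block's last instruction
    std::vector<int> successors;  // ids of successor blocks
};

struct Subkernel {
    std::vector<int> blocks;       // blocks owned by this subkernel, in visit order
    std::vector<int> exitTargets;  // block ids the scheduler must dispatch to next
};

std::vector<Subkernel> formSubkernels(const std::vector<Block>& cfg,
                                      int entry, int maxInstructions) {
    assert(entry >= 0 && entry < static_cast<int>(cfg.size()));
    std::vector<int> owner(cfg.size(), -1);  // subkernel id per block, -1 = unassigned
    std::vector<Subkernel> result;
    std::deque<int> roots{entry};

    while (!roots.empty()) {
        int root = roots.front();
        roots.pop_front();
        if (owner[root] != -1) continue;  // already absorbed by an earlier region

        int id = static_cast<int>(result.size());
        Subkernel sk;
        int budget = maxInstructions;

        // Step 2: grow a region of at most maxInstructions that never
        // continues past a barrier.
        std::deque<int> frontier{root};
        while (!frontier.empty()) {
            int b = frontier.front();
            frontier.pop_front();
            if (owner[b] != -1) continue;
            // The root is always taken so that every region makes progress.
            if (b != root && cfg[b].instructions > budget) continue;
            owner[b] = id;
            budget -= cfg[b].instructions;
            sk.blocks.push_back(b);
            if (!cfg[b].endsWithBarrier) {
                for (int s : cfg[b].successors) frontier.push_back(s);
            }
        }

        // Step 3: every edge that leaves the region (or crosses a barrier)
        // goes to the exit; its target id is recorded for the scheduler.
        std::set<int> targets;
        for (int b : sk.blocks) {
            for (int s : cfg[b].successors) {
                if (owner[s] != id || cfg[b].endsWithBarrier) targets.insert(s);
            }
        }
        sk.exitTargets.assign(targets.begin(), targets.end());

        // Step 3e: unassigned targets become roots of new subkernels.
        for (int t : sk.exitTargets) {
            if (owner[t] == -1) roots.push_back(t);
        }
        result.push_back(sk);
    }
    return result;
}
```

With a five-block CFG and a budget of four instructions, the sketch produces three subkernels; the recorded exit targets play the role of the saved target block ids that the scheduler block branches on.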
  • 19. SubkernelFormationPass 4
    • Sample steps (pseudocode)
      • Create a new kernel
        • New_kernel = new Kernel();
      • Assign a new CFG to the kernel
        • New_kernel->cfg() = new CFG();
        • Org_Kernel->cfg()->update();
      • Update the PTX graph
        • PTX->cfg()->update();
      • Update the module
        • module->update();
      • Re-schedule
  • 23. SubkernelFormationPass 5
    • Why do it?
    • Reduce the load of each kernel and allow subkernels to run in parallel.
    • P.S. There is a trade-off: the "fork" and "join" between subkernels add kernel communication cost.
  • 26. RemoveBarrierPass 1 Ref: ocelot-pact.pdf
  • 27. RemoveBarrierPass 2
    • How to do it?
      • Replace each barrier instruction with a function call.
    • Definition
      • The call instruction stores the address of the next instruction, so execution can resume at that point after executing a ret instruction. A call is assumed to be divergent unless the .uni suffix is present, indicating that the call is guaranteed to be non-divergent, meaning that all threads in a warp have identical values for the guard predicate and call target.
      Ref: PTX ISA 2.1
  • 28. RemoveBarrierPass 3
    • Example
      • Different threads in a CTA compute a = a + 1 and b = b + 1.
      • a = a + 1; sync(); // @ sync mem reg .... b = b + 1;
        • b + 1 = thread1
        • a + 1 = thread2
      • thread1 waits for thread2 to finish...
        • bar.sync()
  • 31.  
  • 32. RemoveBarrierPass 3
    • Sample steps (pseudocode): load/store to a new memory address
      • Find the branch location and replace it with a function call
        • Brn = Kernel->cfg()->terminator()->Branch();
        • Instruction *IT = new Instruction(IR::FunctionCall);
        • Kernel->cfg()->insert(IT);
        • Kernel->cfg()->remove(Brn);
      • Assign the function call type
        • IT->d() = IR::addressType; // dest register
        • IT->a() = IR::addressType; // source register
        • IT->type() = IR::FunctionCall;
      • Update links
        • IT->Predecessor()->update();
        • IT->Successor()->update();
      • Call back to the original pointer
        • new end pointer = org end pointer
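The idea of replacing a barrier with a call and return can be illustrated with a small C++ simulation. This is a sketch under stated assumptions, not Ocelot's implementation: the kernel is assumed to be already split at its barrier into two barrier-free phases, returning from a phase stands in for reaching the barrier, and the hypothetical `ThreadContext` struct stands in for the registers that are stored to memory so they survive across the split. All names (`ThreadContext`, `Phase`, `runPhases`, `neighbourSum`) are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Per-thread state that must survive across the removed barrier
// (stands in for live registers saved to memory).
struct ThreadContext {
    std::size_t tid;
    int partial;  // value that is live across the barrier
    int result;
};

// One barrier-free piece of the original kernel.
using Phase = std::function<void(ThreadContext&, std::vector<int>&)>;

// Run each phase for every thread before starting the next phase.
// Because no thread enters phase k+1 before all threads have left
// phase k, the ordering guarantee of the barrier is preserved.
void runPhases(const std::vector<Phase>& phases,
               std::vector<ThreadContext>& threads,
               std::vector<int>& shared) {
    for (const Phase& phase : phases) {
        for (ThreadContext& t : threads) {
            phase(t, shared);
        }
    }
}

// Example kernel: each thread writes its slot, "barrier", then reads
// the slot written by its right-hand neighbour.
std::vector<int> neighbourSum(std::size_t threadCount) {
    std::vector<ThreadContext> threads(threadCount);
    for (std::size_t i = 0; i < threadCount; ++i) {
        threads[i] = ThreadContext{i, 0, 0};
    }
    std::vector<int> shared(threadCount, 0);

    std::vector<Phase> phases = {
        // Code before the barrier.
        [](ThreadContext& t, std::vector<int>& s) {
            t.partial = static_cast<int>(t.tid) + 1;
            s[t.tid] = t.partial;
        },
        // Code after the barrier: other threads' writes are now visible.
        [](ThreadContext& t, std::vector<int>& s) {
            std::size_t next = (t.tid + 1) % s.size();
            t.result = t.partial + s[next];
        },
    };
    runPhases(phases, threads, shared);

    std::vector<int> out;
    for (const ThreadContext& t : threads) out.push_back(t.result);
    return out;
}
```

Running the threads one after another inside each phase means no thread ever waits at a barrier; the cost is the extra stores and loads of the values kept in `ThreadContext`.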
  • 39. RemoveBarrierPass 4
    • Why do it?
    • Reduce the time threads spend waiting at each barrier synchronization check.
  • 41.
      • LinearScanRegisterAllocationPass 1
    [Figure: register %r1] Ref: ocelot-pact.pdf
  • 42.
      • LinearScanRegisterAllocationPass 2
    • Sample steps (pseudocode)
      • Based on the SSA graph
        • Find PHI nodes
          • kernel->dfg()->hasPHINode()?
        • Replace all alive-in values of each PHI node
          • Foreach (kernel->dfg()->PHINode()->aliveIn())...
        • Update the graph
          • kernel->cfg()->update()
          • Predecessor
          • Successor
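The slides do not show the allocation loop itself. As background for the pass name, here is a minimal sketch of the classic linear-scan algorithm (in the style of Poletto and Sarkar) over live intervals. It is not taken from Ocelot: `Interval` and `linearScan` are hypothetical names, live intervals are assumed to be precomputed, and a spilled value (`preg == -1`) is assumed to live in memory instead of a register.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical live interval of one virtual register.
struct Interval {
    int vreg;   // virtual register id
    int start;  // first instruction index where the value is live
    int end;    // last instruction index where the value is live
    int preg;   // assigned physical register, or -1 when spilled
};

// Assigns one of registerCount physical registers to each interval,
// spilling the interval that ends last when no register is free.
// Sorts the vector by start point as a side effect.
void linearScan(std::vector<Interval>& intervals, int registerCount) {
    assert(registerCount > 0);
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });

    std::vector<int> freeRegs;
    for (int r = registerCount - 1; r >= 0; --r) freeRegs.push_back(r);

    std::vector<Interval*> active;  // intervals holding a register, sorted by end

    for (Interval& cur : intervals) {
        // Expire intervals that ended before the current one starts.
        while (!active.empty() && active.front()->end < cur.start) {
            freeRegs.push_back(active.front()->preg);
            active.erase(active.begin());
        }

        if (freeRegs.empty()) {
            Interval* last = active.back();  // active interval that ends last
            if (last->end > cur.end) {
                // Steal its register and spill it instead of cur.
                cur.preg = last->preg;
                last->preg = -1;
                active.pop_back();
            } else {
                cur.preg = -1;  // spill the current interval
                continue;
            }
        } else {
            cur.preg = freeRegs.back();
            freeRegs.pop_back();
        }

        // Keep the active list sorted by increasing end point.
        auto pos = std::upper_bound(
            active.begin(), active.end(), &cur,
            [](const Interval* a, const Interval* b) { return a->end < b->end; });
        active.insert(pos, &cur);
    }
}
```

With two physical registers and four overlapping intervals, the longest-lived interval is the one that gets spilled, which matches the stated goal of moving some register values into memory.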
  • 45.
      • LinearScanRegisterAllocationPass 3
    • Why do it?
    • Replace registers with local shared memory.
      • More parallelism in thread access.
      • More data sharing.
  • 48. MIMDThreadSchedulingPass 1
    • Definition
      • Predicated execution
        • .reg .pred p, q, r;
      • Example
        • if (i < n)
        •     j = j + 1;
        • setp.lt.s32 p, i, n; // compare i to n
        • @!p bra L1; // if false, branch over
        • add.s32 j, j, 1;
        • L1: ...
    [Figure labels: j=j+1, j=j+1]
  • 54.  
  • 55. MIMDThreadSchedulingPass 2
    • Sample steps (pseudocode)
      • Find the branch instruction and the dominators
        • Dom = kernel->dominator_tree();
        • Post = kernel->post_dominator_tree();
        • kernel->terminator()->hasBranch()?
      • Replace the branch with a predicated instruction
        • Instruction *IT = new Instruction(IR::Instruction::Pred);
        • kernel->Instruction->Insert(IT);
        • kernel->Instruction->erase(Bn);
      • Update the graph
        • kernel->cfg()->update();
        • kernel->PTX()->update();
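The branch-to-predicate replacement described above can be sketched on a toy instruction list. This is an illustration that follows the slide's description, not Ocelot's pass: `Inst`, `negateGuard`, and `ifConvert` are hypothetical names, and only the simplest pattern is handled, a guarded forward branch that skips straight-line, unguarded, unlabeled instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy instruction, loosely modelled on PTX text.
struct Inst {
    std::string text;    // opcode and operands, e.g. "add.s32 j, j, 1"
    std::string guard;   // guard predicate such as "p" or "!p"; empty if none
    std::string label;   // label attached to this instruction; empty if none
    std::string target;  // branch target label; empty unless this is a branch
};

// "!p" -> "p", "p" -> "!p". Expects a non-empty guard.
std::string negateGuard(const std::string& guard) {
    return guard[0] == '!' ? guard.substr(1) : "!" + guard;
}

// Replaces "@g bra L; <simple insts>; L:" with the same instructions
// guarded by the negation of g, and drops the branch.
std::vector<Inst> ifConvert(const std::vector<Inst>& in) {
    std::vector<Inst> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const Inst& br = in[i];
        bool simple = !br.target.empty() && !br.guard.empty() && br.label.empty();

        // Scan forward to the branch target, checking the skipped region.
        std::size_t j = i + 1;
        while (simple && j < in.size() && in[j].label != br.target) {
            simple = in[j].guard.empty() && in[j].target.empty() && in[j].label.empty();
            ++j;
        }

        if (simple && j < in.size()) {
            // Predicate the skipped instructions instead of branching over them.
            for (std::size_t k = i + 1; k < j; ++k) {
                Inst guarded = in[k];
                guarded.guard = negateGuard(br.guard);
                out.push_back(guarded);
            }
            i = j;  // continue at the (still labeled) branch target
        } else {
            out.push_back(br);
            ++i;
        }
    }
    return out;
}
```

Applied to the PTX example from the previous slide, the `@!p bra L1` disappears and the add becomes `@p add.s32 j, j, 1;`, so all threads follow the same instruction stream and only the guard decides whether the add takes effect.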
  • 61. MIMDThreadSchedulingPass 3
    • Why do it?
    • More parallelism in thread access.
  • 63. Reference
    • gpuocelot
      • http://code.google.com/p/gpuocelot/wiki/References
