Compiler optimizations based on call-graph flattening

Presentation for my thesis dissertation on compiler optimizations based on call-graph flattening.

Thesis: http://cafxx.strayorange.com/app/cv/addendum/thesis/ferraris_compiler_optimizations_call_graph_flattening.pdf
Code repository: https://github.com/CAFxX/cgf


  1. Compiler optimizations based on call-graph flattening
     Carlo Alberto Ferraris
     Professor Silvano Rivoira
     Master of Science in Telecommunication Engineering
     Third School of Engineering: Information Technology
     Politecnico di Torino
     July 6th, 2011
  2. Increasing complexities
     Everyday objects are becoming multi-purpose, networked, interoperable, customizable, reusable, upgradeable
  3. Increasing complexities
     Everyday objects are becoming more and more complex
  4. Increasing complexities
     The software that runs smart objects is becoming more and more complex
  7. Diminishing resources
     Systems have to be resource-efficient. Resources come in many different flavours.
     Power: especially valuable in battery-powered scenarios such as mobile, sensor, and third-world applications
  8. Diminishing resources
     Power, density: a critical factor in data-center and product design
  9. Diminishing resources
     Power, density, computational: CPU, RAM, storage, etc. often grow more slowly than the potential applications
 10. Diminishing resources
     Power, density, computational, development: development time and costs should be as low as possible for low time-to-market and profitability
 11. Diminishing resources
     Systems have to be resource-efficient. Resources come in many non-orthogonal flavours: power, density, computational, development.
 12. Do more with less
 14. Abstractions
     We need to modularize and hide the complexity: operating systems, frameworks, libraries, managed languages, virtual machines, …
     All of this comes with a cost: generic solutions are generally less efficient than ad-hoc ones
 15. Abstractions
     We need to modularize and hide the complexity
     Example: Palm webOS, a user interface running on HTML+CSS+JavaScript
 16. Abstractions
     We need to modularize and hide the complexity
     Example: a JavaScript PC emulator running Linux inside a browser
 18. Optimizations
     We need to modularize and hide the complexity without sacrificing performance
     Compiler optimizations trade compilation time for development and execution time
 19. Vestigial abstractions
     The natural subdivision of code into functions is maintained in the compiler and all the way down to the processor
     Each function is self-contained, with strict conventions regulating how it relates to other functions
 20. Vestigial abstractions
     Processors don't care about functions; respecting the conventions is just additional work
     Push the contents of the registers and the return address on the stack, jump to the callee; execute the callee, jump to the return address; restore the registers from the stack
 21. Vestigial abstractions
     Many optimizations are simply not feasible when functions are present:

        int replace(int* ptr, int value) {
            int tmp = *ptr;
            *ptr = value;
            return tmp;
        }

        int A(int* ptr, int value) {
            return replace(ptr, value);
        }

        int B(int* ptr, int value) {
            replace(ptr, value);
            return value;
        }

        void *malloc(size_t size) {
            void *ret;
            // [various checks]
            ret = imalloc(size);
            if (ret == NULL)
                errno = ENOMEM;
            return ret;
        }

        // ...
        type *ptr = malloc(size);
        if (ptr == NULL)
            return NOT_ENOUGH_MEMORY;
        // ...
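To make the point concrete, here is a hedged C sketch of the contextual simplifications a compiler could apply once the call boundary around `replace` is gone: `A` collapses into the body of `replace`, and in `B` the load of `*ptr` dies because the return value is discarded. The names `A_flat` and `B_flat` are invented for this illustration; they show by hand what the optimizer could derive.

```c
#include <assert.h>

/* Original helper, as on the slide. */
static int replace(int *ptr, int value) {
    int tmp = *ptr;
    *ptr = value;
    return tmp;
}

static int A(int *ptr, int value) { return replace(ptr, value); }

/* A_flat: what A can become once the call boundary is removed —
 * simply the body of replace. */
static int A_flat(int *ptr, int value) {
    int tmp = *ptr;
    *ptr = value;
    return tmp;
}

/* B_flat: B ignores replace's return value, so the now-dead load
 * of *ptr can be optimized away entirely. */
static int B_flat(int *ptr, int value) {
    *ptr = value;
    return value;
}
```

With the function boundary intact, the compiler must assume `replace`'s return value matters at every callsite; after flattening, each callsite is optimized in its own context.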
 22. Vestigial abstractions
     Many optimizations are simply not feasible when functions are present:

        interpreter_setup();
        while (opcode = get_next_instruction())
            interpreter_step(opcode);
        interpreter_shutdown();

        function interpreter_step(opcode) {
            switch (opcode) {
                case opcode_instruction_A:
                    execute_instruction_A();
                    break;
                case opcode_instruction_B:
                    execute_instruction_B();
                    break;
                // ...
                default:
                    abort("illegal opcode!");
            }
        }
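As a hedged sketch of what merging buys here: once the body of `interpreter_step` sits inside the fetch loop (the effect CGF obtains without textual inlining), fetch and dispatch become one control flow that the optimizer can reason about as a whole. The opcode set and handlers below are invented for illustration, and the slide's `interpreter_setup`/`interpreter_shutdown` are elided.

```c
#include <assert.h>

enum { OP_HALT, OP_INC, OP_DBL };  /* made-up opcode set */

static int run(const int *program) {
    int acc = 0, opcode, pc = 0;
    /* The dispatch switch now lives inside the fetch loop: the
     * compiler sees both together and can, e.g., thread the jumps. */
    while ((opcode = program[pc++]) != OP_HALT) {
        switch (opcode) {
        case OP_INC: acc += 1; break;
        case OP_DBL: acc *= 2; break;
        default:     return -1;   /* illegal opcode */
        }
    }
    return acc;
}
```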
 23. Vestigial abstractions
     Many optimization efforts are directed at working around the overhead caused by functions
     Inlining clones the body of the callee into the caller; it is the optimal solution w.r.t. calling overhead, but it increases code size and pollutes the cache, so it is useful only on small, hot functions
 24. Call-graph flattening
 26. Call-graph flattening
     What if we dismiss functions during early compilation and track the control flow explicitly instead?
 30. Call-graph flattening
     We get most benefits of inlining, including the ability to perform contextual code optimizations, without the code duplication and code size issues
     Where's the catch?
 31. Call-graph flattening
     The load on the compiler increases greatly, both directly due to CGF itself and indirectly due to subsequent optimizations
     Worst-case complexity (number of edges) is quadratic w.r.t. the number of callsites being transformed (heuristics may help)
 32. Call-graph flattening
     During CGF we need to statically keep track of all live values across all callsites in all functions
     A value is alive if it will be needed by subsequent instructions:

        A = 5, B = 9, C = 0;   // live: A, B
        C = sqrt(B);           // live: A, C
        return A + C;
 33. Call-graph flattening
     Basically, the compiler has to statically emulate ahead-of-time all the possible stack usages of the program
     This has already been done on microcontrollers, where it resulted in a 23% decrease in stack usage (and a 5% performance increase)
 34. Call-graph flattening
     The indirect cause of increased compiler load comes from the standard optimizations that are run after CGF
     CGF does not create new branches (each call and return instruction is turned into a jump), but other optimizations can
 35. Call-graph flattening
     The indirect cause of increased compiler load comes from the standard optimizations that are run after CGF
     Most optimizations are designed to operate on small functions with limited amounts of branches
 37. Call-graph flattening
     Many possible application scenarios besides inlining
     Code motion: move instructions across function boundaries; avoid unneeded computations, alleviate register pressure, improve cache locality
 38. Call-graph flattening
     Code motion, macro compression: find similar code sequences in different parts of the code and merge them; reduce code size and cache pollution
 39. Call-graph flattening
     Code motion, macro compression, nonlinear control flow: CGF natively supports nonlinear control flows; almost-zero-cost exception handling and coroutines
 40. Call-graph flattening
     Code motion, macro compression, nonlinear control flow, stackless execution: no runtime stack is needed in fully-flattened programs
 41. Call-graph flattening
     Code motion, macro compression, nonlinear control flow, stackless execution, stack protection: effective stack-poisoning attacks are much harder or even impossible
 42. Implementation
     To test whether CGF is applicable to complex architectures as well, and to validate some of the ideas presented in the thesis, a pilot implementation was written against the open-source LLVM compiler framework
 43. Implementation
     It operates on LLVM-IR and is host- and target-architecture agnostic; roughly 800 lines of C++ code in 4 classes
     The pilot implementation cannot flatten recursive, indirect or variadic callsites, but they can still be used
 44. Implementation
     Enumerate suitable functions
     Enumerate suitable callsites (and their live values)
     Create the dispatch function and populate it with code
     Transform the callsites
     Propagate live values
     Remove the original functions or create wrappers
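The steps above can be modelled by hand in C, as a hedged illustration only: the real pass rewrites LLVM-IR into jumps and an `indirectbr`-based dispatcher, not a `switch`. Here the calls between the `b` and `a` of the example on the following slides become explicit state inside one dispatch function; the label names are invented.

```c
#include <assert.h>

/* Explicit program points replacing the call/return edges. */
enum label { B_ENTRY, B_LOOP, A_BODY, B_AFTER_CALL, B_RET };

static int dispatch(int n) {
    enum label next = B_ENTRY;       /* outer dispatcher: enter b  */
    enum label ret_to = B_RET;       /* explicit "return address"  */
    int i = 0;
    for (;;) {
        switch (next) {
        case B_ENTRY:      i = 0; next = B_LOOP; break;
        case B_LOOP:                  /* for (i=0; i<10000; i++)   */
            if (i < 10000) { ret_to = B_AFTER_CALL; next = A_BODY; }
            else           next = B_RET;
            break;
        case A_BODY:       n = n + 1; next = ret_to; break; /* a(n) */
        case B_AFTER_CALL: i++; next = B_LOOP; break;
        case B_RET:        return n;
        }
    }
}
```

Note how the callee's "return" is just a jump through `ret_to`, and how every value live across a call (`n`, `i`) has become explicit state that the compiler must track, as discussed on slide 32.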
 45. Examples

        int a(int n) {
            return n+1;
        }

        int b(int n) {
            int i;
            for (i=0; i<10000; i++)
                n = a(n);
            return n;
        }
 48. Examples

        int a(int n) {
            return n+1;
        }

        int b(int n) {
            n = a(n);
            n = a(n);
            n = a(n);
            n = a(n);
            return n;
        }
 50.
        .type .Ldispatch,@function
        .Ldispatch:
            movl $.Ltmp4, %eax   # store the return dispatcher of a in rax
            jmpq *%rdi           # jump to the requested outer dispatcher
        .Ltmp2:                  # outer dispatcher of b
            movl $.LBB2_4, %eax  # store the address of %10
        .Ltmp0:                  # outer dispatcher of a
            movl (%rsi), %ecx    # load the argument n in ecx
            jmp .LBB2_4
        .Ltmp8:                  # block %17
            movl $.Ltmp6, %eax
            jmp .LBB2_4
        .Ltmp6:                  # block %18
            movl $.Ltmp7, %eax
        .LBB2_4:                 # block %10
            movq %rax, %rsi
            incl %ecx            # n = n + 1
            movl $.Ltmp8, %eax
            jmpq *%rsi           # indirectbr
        .Ltmp4:                  # return dispatcher of a
            movl %ecx, (%rdx)    # store in pointer rdx the return value
            ret                  # in ecx and return to the wrapper
        .Ltmp7:                  # return dispatcher of b
            movl %ecx, (%rdx)
            ret
 51. Fuzzing
     To stress-test the pilot implementation and to perform benchmarks, a tunable fuzzer has been written:

        int f_1_2(int a) {
            a += 1;
            switch (a%3) {
                case 0: a += f_0_2(a); break;
                case 1: a += f_0_4(a); break;
                case 2: a += f_0_6(a); break;
            }
            return a;
        }
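To show the shape of the generated call trees concretely, here is the slide's `f_1_2` with hypothetical depth-0 leaf functions filled in. The real fuzzer emits its own leaf bodies; these stand-ins exist only to make the snippet runnable.

```c
#include <assert.h>

/* Stand-in leaves; the actual generated bodies differ. */
static int f_0_2(int a) { return a + 2; }
static int f_0_4(int a) { return a + 4; }
static int f_0_6(int a) { return a + 6; }

/* Generated function from the slide: one call level deep,
 * selecting among three callees on a data-dependent switch. */
static int f_1_2(int a) {
    a += 1;
    switch (a % 3) {
    case 0: a += f_0_2(a); break;
    case 1: a += f_0_4(a); break;
    case 2: a += f_0_6(a); break;
    }
    return a;
}
```

The naming scheme `f_<depth>_<index>` lets the fuzzer tune the depth and fan-out of the call graph, which is exactly what stresses a flattening pass.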
 52. Benchmarks
     Due to shortcomings in the currently available optimizations in LLVM, the only meaningful benchmarks that can be done are those concerning code size and stack usage
     In the literature, average code size increases of 13% due to CGF have been reported
 53. Benchmarks
     Using our tunable fuzzer, different programs were generated and key statistics of the compiled code were gathered
 56. Benchmarks
     In short, when the optimizations work, the resulting code size is better than that found in the literature
     When they don't, the register spiller and allocator perform so badly that most instructions simply shuffle data around on the stack
 57. Benchmarks
 58. Next steps
     Reduce live-value verbosity
     Alternative indirection schemes
     Tune the available optimizations for CGF constructs
     Better register spiller and allocator
     Ad-hoc optimizations (code threader, adaptive fl.)
     Support recursion and indirect calls; better wrappers
 59. Conclusions
     "Do more with less": optimizations are required
     CGF removes unneeded overhead due to low-level abstractions and enables powerful global optimizations
     Benchmark results of the pilot implementation are better than those in the literature whenever the available LLVM optimizations can cope
 60. Compiler optimizations based on call-graph flattening
     Carlo Alberto Ferraris
     Professor Silvano Rivoira
 61.
        .type wrapper,@function
        subq $24, %rsp       # allocate space on the stack
        movl %edi, 16(%rsp)  # store the argument n on the stack
        movl $.Ltmp0, %edi   # address of the outer dispatcher
        leaq 16(%rsp), %rsi  # address of the incoming argument(s)
        leaq 12(%rsp), %rdx  # address of the return value(s)
        callq .Ldispatch     # call to the dispatch function
        movl 12(%rsp), %eax  # load the ret value from the stack
        addq $24, %rsp       # deallocate space on the stack
        ret                  # return
