This document discusses various compiler optimizations including constant folding, hoisting loop invariant code, scalarization, loop unswitching, peeling and sentinels, strength reduction, loop induction variable elimination, and auto-vectorization. It provides code examples and the generated assembly for each optimization. It explains that many optimizations are performed by compilers automatically at high optimization levels, while some more advanced optimizations like loop peeling and sentinels require manual intervention.
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
OUTLINE
Constant folding
Hoisting loop invariant code
Scalarization
Loop unswitching
Loop peeling and sentinels
Strength reduction
Loop-induction variable elimination
Auto-vectorization
Function body inlining
Built-in detection
Case study
CONSTANT FOLDING
Constant folding evaluates constant expressions at compile time.
It is one of the simplest and most widely used compiler
transformations. The folded expressions can be quite complicated,
but they must be free of side effects.
double b = exp(sqrt(M_PI/2));
fldd d17, .L5
// other code
.L5:
.word 2911864813
.word 1074529267
May be performed in global as well as local scope.
Applicable to any code pattern.
HOISTING LOOP INVARIANT CODE
The goal of hoisting — also called loop-invariant code motion
— is to avoid recomputing loop-invariant code each time
through the body of a loop.
void scale(double* x, double* y, int len)
{
for (int i = 0; i < len; i++)
y[i] = x[i] * exp(sqrt(M_PI/2));
}
... is the same as...
void scale(double* x, double *y, int len)
{
double factor = exp(sqrt(M_PI/2));
for (int i = 0; i < len; i++)
y[i] = x[i] * factor;
}
SCALARIZATION
Scalarization replaces branchy code with a branchless equivalent.
Usually it is performed on branches inside a loop body to
allow further optimization. Machine-independent, local.
int branchy(int i)
{
if (i > 4 && i < 42) return 1;
return 0;
}
int branchless(int i)
{
return ((unsigned)i - 5 < 37);
}
$ gcc -march=armv8-a+simd -O3 1.c -S -o 1.s
branchy:
sub w0, w0, #5
cmp w0, 36
cset w0, ls
ret
branchless:
sub w0, w0, #5
cmp w0, 36
cset w0, ls
ret
LOOP UNSWITCHING
Moves loop-invariant conditionals or switches which are
independent of the loop index out of the loop. Increases ILP,
enables further optimizations.
for (int i = 0; i < len; i++)
{
if (a > 32)
arr[i] = a;
else
arr[i] = 42;
}
int i = 0;
if (a > 32)
for (; i < len; i++)
arr[i] = a;
else
for (; i < len; i++)
arr[i] = 42;
$ gcc -march=armv8-a+simd -O3 -S -o 1.s
> -fno-tree-vectorize 1.c
Auto-vectorization is disabled just to keep the example readable.
Loop unswitching can be done by a compiler under high optimization levels.
LOOP PEELING
Loop peeling takes a small number of iterations from the beginning
and/or the end of a loop and executes them separately,
which eliminates if-statements.
for(int i=0; i<len; i++)
{
if (i == 0)
b[i] = a[i+1];
else if(i == len-1 )
b[i] = a[i-1];
else
b[i] = a[i+1]+a[i-1];
}
b[0] = a[1];
for(int i=1; i<len-1; i++)
b[i] = a[i+1]+a[i-1];
b[len-1] = a[len-2];
For the example above, both tested compilers (gcc 4.9 and clang 3.5) won't perform this optimization.
Make sure that your compiler was able to apply the optimization to such a code pattern, or do it manually.
ADVANCED UNSWITCHING: SENTINELS
Sentinels are special dummy values placed in a data structure
to simplify the logic of handling boundary conditions
(in particular, handling of loop-exit tests or elimination of
out-of-bounds checks).
This optimization cannot be performed by the compiler
because it requires changes in the use of data structures.
Let's assume that array a contains two extra elements: one at
the left, a[-1], and one at the right, a[len]. With these
assumptions the code can be rewritten as follows.
for(int i=0; i<len; i++)
{
if (i == 0)
b[i] = a[i+1];
else if(i == len-1 )
b[i] = a[i-1];
else
b[i] = a[i+1]+a[i-1];
}
// setup boundaries
a[-1] = 0;
a[len] = 0;
for (int i=0; i<len; i++)
b[i] = a[i+1] + a[i-1];
STRENGTH REDUCTION
Strength reduction replaces complex expressions with simpler equivalents.
double usePow(double x)
{
return pow(x, 2.);
}
float usePowf(float x)
{
return powf(x, 2.f);
}
Machine-independent transformation in most cases.
It may be machine-dependent if it relies on a specific
feature set implemented in the HW (e.g. built-in detection).
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
usePow:
fmdrr d16, r0, r1
fmuld d16, d16, d16
fmrrd r0, r1, d16
bx lr
usePowf:
fmsr s15, r0
fmuls s15, s15, s15
fmrs r0, s15
bx lr
Usually it is performed in a local scope
for a dependency chain.
STRENGTH REDUCTION (ADVANCED CASE)
float useManyPowf(float a, float b, float c,
float d, float e, float f, float x)
{
return
a * powf(x, 5.f) +
b * powf(x, 4.f) +
c * powf(x, 3.f) +
d * powf(x, 2.f) +
e * x + f;
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
STRENGTH REDUCTION (MANUAL)
Let's further reduce latency by applying Horner's rule.
float applyHornerf(float a, float b, float c,
float d, float e, float f, float x)
{
return ((((a * x + b) * x + c) * x + d) * x + e) * x + f;
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
The compiler is not capable of this optimization because it
does not produce a result bit-identical to the original formula.
LOOP-INDUCTION VARIABLE ELIMINATION
In most cases the compiler is able to replace
address arithmetic with pointer arithmetic.
void function(int* arr, int len)
{
for (int i = 0; i < len; i++)
arr[i] = 1;
}
void function(int* arr, int len)
{
for (int* p = arr; p < arr + len; p++)
*p = 1;
}
$ gcc -march=armv7-a -mthumb -O1 -S -o 1.s
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
cmp r1, #0
ble .L1
mov r3,r0
add r0,r0,r1,lsl #2
movs r2, #1
.L3:
str r2, [r3], #4
cmp r3, r0
bne .L3
.L1:
bx lr
add r1,r0,r1,lsl #2
cmp r0, r1
bcs .L5
movs r3, #1
.L8:
str r3, [r0], #4
cmp r1, r0
bhi .L8
.L5:
bx lr
Most hand-written pointer optimizations make no sense
when using optimization levels higher than -O0.
AUTO-VECTORIZATION
Auto-vectorization is machine-code generation
that takes advantage of vector instructions.
Most modern architectures have vector extensions,
either as a co-processor or as dedicated pipes:
MMX, SSE, SSE2, SSE4, AVX, AVX-512
AltiVec, VSX
ASIMD (NEON), MSA
Enabled by inlining, unrolling, fusion, software pipelining,
inter-procedural optimization, and other machine-independent
transformations.
void vectorizeMe(float *a, float *b, int len)
{
int i;
for (i = 0; i < len; i++)
a[i] += b[i];
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s
> -mfpu=neon-vfpv4 -mfloat-abi=softfp
> -fopt-info-missed 1.c
But NEON does not implement full IEEE 754,
so gcc cannot vectorize the loop, as it tells us:
1.c:64:3: note: not vectorized:
relevant stmt not supported: _13 = _9 + _12;
Here is the generated assembly (scalar, as predicted):
.L3:
fldmias r1!, {s14}
flds s15, [r0]
fadds s15, s15, s14
fstmias r0!, {s15}
cmp r0, r2
bne .L3
But armv8-a does support it, so let's check:
$ gcc -march=armv8-a+simd -O3 -S -o 1.s
> -fopt-info-all 1.c
1.c:64:3: note: loop vectorized
Here is the generated assembly and the full optimizer's report:
.L6:
ldr q1, [x3],16
add w6, w6, 1
ldr q0, [x7],16
cmp w6, w4
fadd v0.4s, v0.4s, v1.4s
str q0, [x8],16
bcc .L6
1.c:66:3: note: loop vectorized
1.c:66:3: note: loop versioned for vectorization because of possible aliasing
1.c:66:3: note: loop peeled for vectorization to enhance alignment
1.c:66:3: note: loop with 3 iterations completely unrolled
1.c:61:6: note: loop with 3 iterations completely unrolled
The compiler versions the loop to allow optimized paths
in the case of aligned and non-aliasing pointers.
KEYWORDS
Let's follow the advice and add some keywords.
The optimizer's report shrinks.
void vectorizeMe(float* __restrict a_, float* __restrict b_, int len)
{
float *a=__builtin_assume_aligned(a_, 16);
float *b=__builtin_assume_aligned(b_, 16);
for (int i = 0; i < len; i++)
a[i] += b[i];
}
1.c:66:3: note: loop vectorized
1.c:66:3: note: loop with 3 iterations completely unrolled
__restrict and __builtin_assume_aligned
keywords only eliminate some loop versioning, but are not
very useful from the performance perspective
FUNCTION BODY INLINING
Replaces a function call with the function body itself.
int square(int x) { return x*x; }
for (int i = 0; i < len; i++)
arr[i] = square(i);
Becomes
for (int i = 0; i < len; i++)
arr[i] = i*i;
Enables all further optimizations.
Machine-independent, interprocedural.
CASE STUDY: FLOATING POINT
double power( double d, unsigned n)
{
double x = 1.0;
for (unsigned j = 1; j<=n; j++, x *= d) ;
return x;
}
int main ()
{
double a = 1./0x80000000U, sum = 0.0;
for (unsigned i=1; i<= 0x80000000U; i++)
sum += power( i*a, i % 8);
printf ("sum = %g\n", sum);
}
flags-dp.c
1. Compile it without optimization
2. Compile it with O1: ~3.26 speedup
$ gcc -std=c99 -Wall -O0 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m24.550s
$ gcc -std=c99 -Wall -O1 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m7.529s
3. Compile it with O2: ~1.24 speedup
4. Compile it with O3: ~1.00 speedup
$ gcc -std=c99 -Wall -O2 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m6.069s
$ gcc -std=c99 -Wall -O3 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m6.067s
Total speedup is ~4.05
CASE STUDY: INTEGER
int power( int d, unsigned n)
{
int x = 1;
for (unsigned j = 1; j<=n; j++, x*=d) ;
return x;
}
int main ()
{
int64_t sum = 0;
for (unsigned i=1; i<0x80000000U; i++)
sum += power( i, i % 8);
printf ("sum = %ld\n", sum);
}
flags.c
1. Compile it without optimization
2. Compile it with O1: ~2.64 speedup
$ gcc -std=c99 -Wall -O0 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m18.750s
$ gcc -std=c99 -Wall -O1 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.092s
3. Compile it with O2:~0.97 speedup
4. Compile it with O3: ~1.00 speedup
$ gcc -std=c99 -Wall -O2 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.300s
$ gcc -std=c99 -Wall -O3 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.082s
WHY IS THERE NO IMPROVEMENT?
HELPING A COMPILER
The compiler usually applies optimizations to inner loops. In
this example the number of iterations of the inner loop depends
on the value of an induction variable of the outer loop.
Let's help the compiler (flags-dp-tuned.c).
/*power function is not changed*/
int main ()
{
double a = 1./0x80000000U, s = -1.;
for (double i=0; i<=0x80000000U-8; i += 8)
{
s+=power((i+0)*a,0);s+=power((i+1)*a,1);
s+=power((i+2)*a,2);s+=power((i+3)*a,3);
s+=power((i+4)*a,4);s+=power((i+5)*a,5);
s+=power((i+6)*a,6);s+=power((i+7)*a,7);
}
printf ("sum = %g\n", s);
}
$ gcc -std=c99 -Wall -O3 flags-dp-tuned.c -o flags-dp-tuned
$ time ./flags-dp-tuned
sum = 7.29569e+08
real 0m2.448s
Speedup is ~2.48x.
Let's try it for integers (flags-tuned.c).
/*power function is not changed*/
int main ()
{
int64_t sum = -1;
for (unsigned i=0; i<=0x80000000U-8; i += 8)
{
sum+=power(i+0, 0); sum+=power(i+1,1);
sum+=power(i+2, 2); sum+=power(i+3,3);
sum+=power(i+4, 4); sum+=power(i+5,5);
sum+=power(i+6, 6); sum+=power(i+7,7);
}
printf ("sum = %ld\n", sum);
}
$ gcc -std=c99 -Wall -O3 flags-tuned.c -o flags-tuned
$ time ./flags-tuned
sum = 288231861405089791
real 0m1.286s
Speedup is ~5.5x.
DETAILED ANALYSIS
This is how the compiler sees the outer loop
after inner loop inlining
int64_t sum = -1;
for (unsigned i=0; i<=0x80000000U-8; i += 8)
{
int d = (int)i;
sum+=1;
sum+=(1+d);
sum+=(2+d)*(2+d);
sum+=(3+d)*(3+d)*(3+d);
sum+=(4+d)*(4+d)*(4+d)*(4+d);
sum+=(5+d)*(5+d)*(5+d)*(5+d)*(5+d);
sum+=(6+d)*(6+d)*(6+d)*(6+d)*(6+d)*(6+d);
sum+=(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d);
}
The (i+7) term: the compiler keeps a multiply loop rather than unrolling the power computation.
leal 5(%r8), %esi
movl $7, %ecx
movl $1, %edx
.L2:
imull %esi, %edx
subl $1, %ecx
jne .L2
movslq %edx, %rdx
addq %rdx, %rax
The full listing is available on GitHub.
SUMMARY
Copy propagation and other basic transformations enable
more complicated ones.
Hoisting loop-invariant code and other strength-reduction
techniques minimize the number of operations.
The compiler is very conservative about hoisting floating-point math.
Scalarization and loop unswitching make loop bodies flat.
Compilers' loop-peeling capabilities are still poor; use the
sentinel technique to eliminate extra checks and control flow.
Most hand-written pointer optimizations make no sense
when using optimization levels higher than -O0.
The __restrict and __builtin_assume_aligned keywords only
eliminate some loop versioning, but are not very useful from
the performance perspective nowadays.
Function body inlining is essential for further loop optimization.
The compiler will detect memcpy and memset patterns even if you
implement them manually, and will call the library function
instead; library functions are usually more efficient.
Use knowledge about the semantics of your code to help the
compiler optimize it.