1
PRAGMATIC
OPTIMIZATION
IN MODERN PROGRAMMING
MASTERING COMPILER OPTIMIZATIONS
Created by Marina (geek) Kolpakova for UNN / 2015-2016
2
COURSE TOPICS
Ordering optimization approaches
Demystifying a compiler
Mastering compiler optimizations
Modern computer architectures concepts
3
OUTLINE
Constant folding
Hoisting loop invariant code
Scalarization
Loop unswitching
Loop peeling and sentinels
Strength reduction
Loop-induction variable elimination
Auto-vectorization
Function body inlining
Built-in detection
Case study
4
CONSTANT FOLDING
evaluates expressions made of constants at compile time
It is one of the simplest and most widely used compiler
transformations. The expressions can be quite complicated, but
the absence of any kind of side effects is always required.
double b = exp(sqrt(M_PI/2));
fldd d17, .L5
// other code
.L5:
.word 2911864813
.word 1074529267
May be performed in global as well as local scope.
Applicable to any code pattern.
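As a hedged, self-contained sketch of the rule (the function name is illustrative): every operand below is a compile-time constant and the math calls are side-effect-free, so an optimizing compiler is free to fold the whole expression into a single literal, exactly as the assembly on this slide shows.

```c
#include <assert.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

// All operands are constants and exp()/sqrt() have no side effects,
// so a compiler may evaluate this entire expression at build time.
static double folded_constant(void) {
    return exp(sqrt(M_PI / 2.0));
}
```

Compiling with -O1 and inspecting the assembly (as on the slide) shows the value loaded as a precomputed literal rather than computed at run time.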
5 . 1
HOISTING LOOP INVARIANT CODE
The goal of hoisting — also called loop-invariant code motion
— is to avoid recomputing loop-invariant code each time
through the body of a loop.
void scale(double* x, double* y, int len)
{
for (int i = 0; i < len; i++)
y[i] = x[i] * exp(sqrt(M_PI/2));
}
... is the same as...
void scale(double* x, double *y, int len)
{
double factor = exp(sqrt(M_PI/2));
for (int i = 0; i < len; i++)
y[i] = x[i] * factor;
}
5 . 2
HOISTING LOOP INVARIANT CODE
$ gcc -march=armv7-a -mthumb -O1 -S -o 1.s
-mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
fldd d17, .L5
.L3:
fldmiad r3!, {d16}
fmuld d16, d16, d17
fstmiad r1!, {d16}
cmp r3, r0
bne .L3
.L1:
bx lr
.L5:
.word 2911864813
.word 1074529267
fldd d17, .L11
.L9:
fldmiad r3!, {d16}
fmuld d16, d16, d17
fstmiad r1!, {d16}
cmp r3, r0
bne .L9
.L7:
bx lr
.L11:
.word 2911864813
.word 1074529267
5 . 3
HOISTING NON-CONSTANTS
Let's make the expression variable
void scale(double* x, double* y, int len)
{
for (int i = 0; i < len; i++)
y[i] = x[i] * exp(sqrt((double)len)/2.);
}
void scale(double* x, double *y, int len)
{
double factor = exp(sqrt((double)len)/2.);
for (int i = 0; i < len; i++)
y[i] = x[i] * factor;
}
$ gcc -march=armv7-a -mthumb -O1 -S -o 1.s
-mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
5 . 4
HOISTING NON-CONSTANTS
fmsr s15, r2 @ int
fsitod d9, s15
fsqrtd d9, d9
.L4:
fldmiad r6!, {d8}
fcpyd d16, d9
fcmpd d9, d9
fmstat
beq .L3
fmsr s15, r7 @ int
fsitod d16, s15
fmrrd r0, r1, d16
bl sqrt(PLT)
fmdrr d16, r0, r1
.L3:
fmuld d8, d8, d16
fstmiad r5!, {d8}
adds r4, r4, #1
cmp r4, r7
bne .L4
fmsr s15, r2 @ int
fsitod d17, s15
fsqrtd d17, d17
fcmpd d17, d17
fmstat
beq .L9
fsitod d16, s15
fmrrd r0, r1, d16
bl sqrt(PLT)
fmdrr d17, r0, r1
.L9:
mov r3, r5
mov r1, r4
add r0, r5, r6, lsl #3
.L11:
fldmiad r3!, {d16}
fmuld d16, d16, d17
fstmiad r1!, {d16}
cmp r3, r0
bne .L11
The compiler isn't sure that floating-point flags aren't affected!
6
SCALARIZATION
Scalarization replaces branchy code with a branchless analog.
Usually it is performed on branches inside a loop body to
allow further optimizations. Machine-independent, local.
int branchy(int i)
{
if (i > 4 && i < 42) return 1;
return 0;
}
int branchless(int i)
{
return (((unsigned)i) - 5 <= 36);
}
$ gcc -march=armv8-a+simd -O3 1.c -S -o 1.s
branchy:
sub w0, w0, #5
cmp w0, 36
cset w0, ls
ret
branchless:
sub w0, w0, #5
cmp w0, 36
cset w0, ls
ret
7 . 1
LOOP UNSWITCHING
Moves loop-invariant conditionals or switches which are
independent of the loop index out of the loop. Increases ILP,
enables further optimizations.
for (int i = 0; i < len; i++)
{
if (a > 32)
arr[i] = a;
else
arr[i] = 42;
}
int i = 0;
if (a > 32)
for (; i < len; i++)
arr[i] = a;
else
for (; i < len; i++)
arr[i] = 42;
$ gcc -march=armv8-a+simd -O3 -S -o 1.s 
-fno-tree-vectorize 1.c
Auto-vectorization is disabled just to make the example readable.
7 . 2
LOOP UNSWITCHING
cmp w1, wzr
ble .L1
cmp w2, 32
bgt .L6
mov x2, 0
mov w3, 42
.L4:
str w3,[x0,x2,lsl 2]
add x2, x2, 1
cmp w1, w2
bgt .L4
.L1:
ret
.L6:
mov x3, 0
.L3:
str w2,[x0,x3,lsl 2]
add x3, x3, 1
cmp w1, w3
bgt .L3
ret
7 . 3
LOOP PEELING
Loop peeling takes a small number of iterations from the
beginning and/or the end of a loop out of the loop body and
executes them separately, which eliminates if-statements.
Can be done by a compiler under high optimization levels.
for(int i=0; i<len; i++)
{
if (i == 0)
b[i] = a[i+1];
else if(i == len-1 )
b[i] = a[i-1];
else
b[i] = a[i+1]+a[i-1];
}
b[0] = a[1];
for(int i=1; i<len-1; i++)
b[i] = a[i+1]+a[i-1];
b[len-1] = a[len-2];
For this example, both tested compilers (gcc 4.9 and clang 3.5) won't perform the optimization.
Make sure that the compiler was able to apply the optimization to such a code pattern, or do it manually.
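Since the tested compilers won't peel this loop, a manual version is worth writing out in full. This sketch (names are illustrative) pairs the branchy original with its peeled equivalent; it assumes len >= 2.

```c
#include <assert.h>

// Branchy original: the first and last iterations are special-cased.
static void smooth_branchy(const double* a, double* b, int len) {
    for (int i = 0; i < len; i++) {
        if (i == 0)            b[i] = a[i + 1];
        else if (i == len - 1) b[i] = a[i - 1];
        else                   b[i] = a[i + 1] + a[i - 1];
    }
}

// First and last iterations peeled out; the loop body is branch-free.
static void smooth_peeled(const double* a, double* b, int len) {
    b[0] = a[1];
    for (int i = 1; i < len - 1; i++)
        b[i] = a[i + 1] + a[i - 1];
    b[len - 1] = a[len - 2];
}
```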
7 . 4
ADVANCED UNSWITCHING: SENTINELS
Sentinels are special dummy values placed in a data structure
which are added to simplify the logic of handling boundary
conditions (in particular, handling of loop-exit tests or
elimination out-of-bound checks).
This optimization cannot be performed by the compiler
because it requires changes in the use of data structures.
7 . 5
ADVANCED UNSWITCHING: SENTINELS
Let's assume that array a contains two extra elements: one at
the left, a[-1], and one at the right, a[len]. With these
assumptions the code can be rewritten as follows.
for(int i=0; i<len; i++)
{
if (i == 0)
b[i] = a[i+1];
else if(i == len-1 )
b[i] = a[i-1];
else
b[i] = a[i+1]+a[i-1];
}
// setup boundaries
a[-1] = 0;
a[len] = 0;
for (int i=0; i<len; i++)
b[i] = a[i+1] + a[i-1];
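A minimal runnable sketch of the sentinel idea, with hypothetical helper names: the buffer is over-allocated by two elements and the returned pointer is shifted by one, so a[-1] and a[len] are valid slots for the dummy values. The zero sentinels reproduce exactly what the branchy original computes at the boundaries.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical helpers: allocate len doubles plus two sentinel slots,
// handing back a pointer to the first real element.
static double* alloc_with_sentinels(int len) {
    double* raw = malloc((size_t)(len + 2) * sizeof(double));
    return raw ? raw + 1 : NULL;
}

static void free_with_sentinels(double* a) {
    if (a) free(a - 1);
}

// Branch-free body: the zeroed sentinels stand in for the special
// first and last iterations of the original loop.
static void smooth_sentinel(double* a, double* b, int len) {
    a[-1] = 0.0;
    a[len] = 0.0;
    for (int i = 0; i < len; i++)
        b[i] = a[i + 1] + a[i - 1];
}
```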
8 . 1
STRENGTH REDUCTION
replaces complex expressions with simpler analogs
double usePow(double x)
{
return pow(x, 2.);
}
float usePowf(float x)
{
return powf(x, 2.f);
}
A machine-independent transformation in most cases.
It may be machine-dependent if it relies on a specific
feature set implemented in the HW (e.g. built-in detection).
8 . 2
STRENGTH REDUCTION
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s 
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
usePow:
fmdrr d16, r0, r1
fmuld d16, d16, d16
fmrrd r0, r1, d16
bx lr
usePowf:
fmsr s15, r0
fmuls s15, s15, s15
fmrs r0, s15
bx lr
Usually it is performed in a local scope
for a dependency chain.
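A related, hedged sketch with illustrative names: strength reduction on an induction variable, where the per-iteration multiply `i * stride` is replaced by a running add. Compilers typically do this automatically, but the before/after pair shows the shape of the transformation.

```c
#include <assert.h>

// Before: the index is recomputed with a multiply each iteration.
static long sum_mul(const int* a, int n, int stride) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i * stride];
    return s;
}

// After: the multiply is strength-reduced to an add on a running index.
static long sum_add(const int* a, int n, int stride) {
    long s = 0;
    int idx = 0;
    for (int i = 0; i < n; i++, idx += stride)
        s += a[idx];
    return s;
}
```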
8 . 3
STRENGTH REDUCTION (ADVANCED CASE)
float useManyPowf(float a, float b, float c,
float d, float e, float f, float x)
{
return
a * powf(x, 5.f) +
b * powf(x, 4.f) +
c * powf(x, 3.f) +
d * powf(x, 2.f) +
e * x + f;
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s 
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
8 . 4
The compiler was able to only partly reduce the complexity,
generating vfma where possible.
STRENGTH REDUCTION (ADVANCED)
useManyPowf:
push {r3, lr}
flds s17, [sp, #56]
fmsr s24, r1
movs r1, #0
fmsr s22, r0
movt r1, 16544
fmrs r0, s17
fmsr s21, r2
fmsr s20, r3
flds s19, [sp, #48]
flds s18, [sp, #52]
bl powf(PLT)
mov r1, #1082130432
fmsr s23, r0
fmrs r0, s17
bl powf(PLT)
movs r1, #0
movt r1, 16448
fmsr s16, r0
fmrs r0, s17
bl powf(PLT)
fmuls s16, s16, s24
vfma.f32 s16, s23, s22
fmsr s15, r0
vfma.f32 s16, s15, s21
fmuls s15, s17, s17
vfma.f32 s16, s20, s15
vfma.f32 s16, s19, s17
fadds s15, s16, s18
fldmfdd sp!, {d8-d12}
fmrs r0, s15
pop {r3, pc}
8 . 5
STRENGTH REDUCTION (MANUAL)
Let's further reduce latency by applying Horner's rule.
float applyHornerf(float a, float b, float c,
float d, float e, float f, float x)
{
return ((((a * x + b) * x + c) * x + d) * x + e) * x + f;
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s 
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
The compiler is not capable of this optimization because it
doesn't produce a bit-exact result with the original formula.
8 . 6
STRENGTH REDUCTION (MANUAL)
applyHornerf:
flds s15, [sp, #8]
fmsr s11, r0
fmsr s12, r1
flds s14, [sp]
vfma.f32 s12, s11, s15
fmsr s11, r2
flds s13, [sp, #4]
vfma.f32 s11, s12, s15
fcpys s12, s11
fmsr s11, r3
vfma.f32 s11, s12, s15
vfma.f32 s14, s11, s15
vfma.f32 s13, s14, s15
fmrs r0, s13
bx lr
9 . 1
LOOP-INDUCTION VARIABLE ELIMINATION
In most cases the compiler is able to replace
address arithmetic with pointer arithmetic.
void function(int* arr, int len)
{
for (int i = 0; i < len; i++)
arr[i] = 1;
}
void function(int* arr, int len)
{
for (int* p = arr; p < arr + len; p++)
*p = 1;
}
$ gcc -march=armv7-a -mthumb -O1 -S -o 1.s 
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
9 . 2
LOOP-INDUCTION VARIABLE ELIMINATION
cmp r1, #0
ble .L1
mov r3,r0
add r0,r0,r1,lsl #2
movs r2, #1
.L3:
str r2, [r3], #4
cmp r3, r0
bne .L3
.L1:
bx lr
add r1,r0,r1,lsl #2
cmp r0, r1
bcs .L5
movs r3, #1
.L8:
str r3, [r0], #4
cmp r1, r0
bhi .L8
.L5:
bx lr
Most hand-written pointer optimizations do not make sense
at optimization levels higher than -O0.
10 . 1
AUTO-VECTORIZATION
Auto-vectorization is machine code generation
that takes advantage of vector instructions.
Most modern architectures have vector extensions, either as
a co-processor or as dedicated pipes:
MMX, SSE, SSE2, SSE4, AVX, AVX-512
AltiVec, VSX
ASIMD (NEON), MSA
Enabled by inlining, unrolling, fusion, software pipelining,
inter-procedural optimization, and other machine
independent transformations.
10 . 2
AUTO-VECTORIZATION
void vectorizeMe(float *a, float *b, int len)
{
int i;
for (i = 0; i < len; i++)
a[i] += b[i];
}
$ gcc -march=armv7-a -mthumb -O3 -S -o 1.s 
> -mfpu=neon-vfpv4 -mfloat-abi=softfp 
> -fopt-info-missed 1.c
But NEON does not support full IEEE 754,
so gcc cannot vectorize the loop, which is what it tells us:
1.c:64:3: note: not vectorized:
relevant stmt not supported:_13 =_9+_12;
10 . 3
Here is the generated assembly:
AUTO-VECTORIZATION
.L3:
fldmias r1!, {s14}
flds s15, [r0]
fadds s15, s15, s14
fstmias r0!, {s15}
cmp r0, r2
bne .L3
But armv8-a does support it, so let's check!
$ gcc -march=armv8-a+simd -O3 -S -o 1.s 
> -fopt-info-all 1.c
1.c:64:3: note: loop vectorized
10 . 4
Here is the generated assembly and the full optimizer's report
AUTO-VECTORIZATION
.L6:
ldr q1, [x3],16
add w6, w6, 1
ldr q0, [x7],16
cmp w6, w4
fadd v0.4s, v0.4s, v1.4s
str q0, [x8],16
bcc .L6
1.c:66:3: note: loop vectorized
1.c:66:3: note: loop versioned for vectorization because of possible aliasing
1.c:66:3: note: loop peeled for vectorization to enhance alignment
1.c:66:3: note: loop with 3 iterations completely unrolled
1.c:61:6: note: loop with 3 iterations completely unrolled
Compiler versions the loop to allow optimized paths
in case of aligned and non-aliasing pointers
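Conceptually, the versioning the report mentions looks like this hand-written sketch: a runtime disjointness test selects between a path the vectorizer may use and a conservative fallback. Both branches are written identically in C here; the point is the guard, and the compiler emits vector code only for the first path.

```c
#include <assert.h>

static void add_arrays(float* a, const float* b, int len) {
    // Runtime aliasing check: if the two ranges cannot overlap,
    // the vector-friendly version is safe to execute.
    if (a + len <= b || b + len <= a) {
        for (int i = 0; i < len; i++)   // vectorizable version
            a[i] += b[i];
    } else {
        for (int i = 0; i < len; i++)   // conservative scalar version
            a[i] += b[i];
    }
}
```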
10 . 5
Let's follow the advice and add some keywords.
The optimizer's report shrinks.
KEYWORDS
void vectorizeMe(float* __restrict a_, float* __restrict b_, int len)
{
float *a=__builtin_assume_aligned(a_, 16);
float *b=__builtin_assume_aligned(b_, 16);
for (int i = 0; i < len; i++)
a[i] += b[i];
}
1.c:66:3: note: loop vectorized
1.c:66:3: note: loop with 3 iterations completely unrolled
__restrict and __builtin_assume_aligned
keywords only eliminate some loop versioning, but are not
very useful from the performance perspective
11 . 1
FUNCTION BODY INLINING
Replaces a function call with the function body itself.
int square(int x) { return x*x; }
for (int i = 0; i < len; i++)
arr[i] = square(i);
Becomes
for (int i = 0; i < len; i++)
arr[i] = i*i;
Enables all further optimizations.
Machine-independent, interprocedural.
11 . 2
FUNCTION BODY INLINING
gcc -march=armv8-a+nosimd -O3 -S -o 1.s 1.c
.L2:
add x2, x4, :lo12:.LANCHOR0
mov x1, 34464
mul w3, w0, w0
movk x1, 0x1, lsl 16
str w3, [x2,x0,lsl 2]
add x0, x0, 1
cmp x0, x1
bne .L2
Let's compile the code with vector extension enabled
-march=armv8-a+simd
11 . 3
FUNCTION BODY INLINING
add x0, x0, :lo12:.LANCHOR0
movi v2.4s, 0x4
ldr q0, [x1]
add x1, x0, 397312
add x1, x1, 2688
.L2:
mul v1.4s, v0.4s, v0.4s
add v0.4s, v0.4s, v2.4s
str q1, [x0],16
cmp x0, x1
bne .L2
Auto-vectorization is enabled by the possibility
to inline the function call inside the loop.
12 . 1
BUILT-IN DETECTION
Compiler automatically uses the library functions memset
and memcpy to initialize and copy memory blocks
static char a[100000];
static char b[100000];
static void at(int idx, char val)
{
if (idx>=0 && idx<100000)
a[idx] = val;
}
int main()
{
for (int i=0; i<100000; ++i) a[i]=42;
for (int i=0; i<100000; ++i) at(i,-1);
for (int i=0; i<100000; ++i) b[i] = a[i];
}
12 . 2
Compiler knows what you mean
FILLING/COPYING MEMORY BLOCKS
main:
.LFB1:
.cfi_startproc
subq $8, %rsp
.cfi_def_cfa_offset 16
movl $100000, %edx
movl $42, %esi
movl $a, %edi
call memset
movl $100000, %edx
movl $255, %esi
movl $a, %edi
call memset
movl $100000, %edx
movl $a, %esi
movl $b, %edi
call memcpy
addq $8, %rsp
.cfi_def_cfa_offset 8
ret
12 . 3
The same picture if the code is compiled for an ARM target
FILLING/COPYING MEMORY BLOCKS
main:
ldr r3, .L3
mov r1, #42
stmfd sp!, {r4, lr}
add r3, pc, r3
movw r4, #34464
movt r4, 1
mov r0, r3
mov r2, r4
bl memset(PLT)
mov r2, r4
mov r1, #255
bl memset(PLT)
mov r2, r4
mov r3, r0
ldr r0, .L3+4
mov r1, r3
add r0, pc, r0
add r0, r0, #1792
bl memcpy(PLT)
ldmfd sp!, {r4, pc}
13 . 1
CASE STUDY: FLOATING POINT
double power( double d, unsigned n)
{
double x = 1.0;
for (unsigned j = 1; j<=n; j++, x *= d) ;
return x;
}
int main ()
{
double a = 1./0x80000000U, sum = 0.0;
for (unsigned i=1; i<= 0x80000000U; i++)
sum += power( i*a, i % 8);
printf ("sum = %g\n", sum);
}
flags-dp.c
13 . 2
CASE STUDY: FLOATING POINT
1. Compile it without optimization
2. Compile it with O1: ~3.26x speedup
$ gcc -std=c99 -Wall -O0 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m24.550s
$ gcc -std=c99 -Wall -O1 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m7.529s
13 . 3
CASE STUDY: FLOATING POINT
3. Compile it with O2: ~1.24x speedup
4. Compile it with O3: ~1.00x speedup
$ gcc -std=c99 -Wall -O2 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m6.069s
$ gcc -std=c99 -Wall -O3 flags-dp.c -o flags-dp
$ time ./flags-dp
sum = 7.29569e+08
real 0m6.067s
Total speedup is ~4.05x
14 . 1
CASE STUDY: INTEGER
int power( int d, unsigned n)
{
int x = 1;
for (unsigned j = 1; j<=n; j++, x*=d) ;
return x;
}
int main ()
{
int64_t sum = 0;
for (unsigned i=1; i<0x80000000U; i++)
sum += power( i, i % 8);
printf ("sum = %ld\n", sum);
}
flags.c
14 . 2
CASE STUDY: INTEGER
1. Compile it without optimization
2. Compile it with O1: ~2.64x speedup
$ gcc -std=c99 -Wall -O0 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m18.750s
$ gcc -std=c99 -Wall -O1 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.092s
14 . 3
CASE STUDY: INTEGER
3. Compile it with O2: ~0.97x speedup
4. Compile it with O3: ~1.00x speedup
$ gcc -std=c99 -Wall -O2 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.300s
$ gcc -std=c99 -Wall -O3 flags.c -o flags
$ time ./flags
sum = 288231861405089791
real 0m7.082s
WHY IS THERE NO IMPROVEMENT?
14 . 4
EXAMPLE: INTEGER
-O1 -O2
THE COMPILER OVERDID THE BRANCH TWIDDLING!
movl $1, %r8d
movl $0, %edx
.L9:
movl %r8d, %edi
movl %r8d, %esi
andl $7, %esi
je .L10
movl $1, %ecx
movl $1, %eax
.L8: @
addl $1, %eax @
imull %edi, %ecx @
cmpl %eax, %esi @
jae .L8
jmp .L7
.L10:
movl $1, %ecx
.L7:
movslq %ecx, %rcx
addq %rcx, %rdx
addl $1, %r8d
jns .L9
@ printing is here
movl $1, %esi
movl $1, %r8d
xorl %edx, %edx
.p2align 4,,10
.p2align 3
.L10:
movl %r8d, %edi
andl $7, %edi
je .L11
movl $1, %ecx
movl $1, %eax
.p2align 4,,10
.p2align 3
.L9:
addl $1, %eax
imull %esi, %ecx
cmpl %eax, %edi
jae .L9
movslq %ecx, %rcx
.L8:
addl $1, %r8d
addq %rcx, %rdx
testl %r8d, %r8d
movl %r8d, %esi
jns .L10
subq $8, %rsp
@ printing is here
ret
.L11:
movl $1, %ecx
jmp .L8
15 . 1
HELPING A COMPILER
The compiler usually applies optimizations to the inner loops. In
this example the number of iterations of the inner loop depends on
the value of an induction variable of the outer loop.
Let's help the compiler (flags-dp-tuned.c)
15 . 2
HELPING A COMPILER
/*power function is not changed*/
int main ()
{
double a = 1./0x80000000U, s = -1.;
for (double i=0; i<=0x80000000U-8; i += 8)
{
s+=power((i+0)*a,0);s+=power((i+1)*a,1);
s+=power((i+2)*a,2);s+=power((i+3)*a,3);
s+=power((i+4)*a,4);s+=power((i+5)*a,5);
s+=power((i+6)*a,6);s+=power((i+7)*a,7);
}
printf ("sum = %g\n", s);
}
$ gcc -std=c99 -Wall -O3 flags-dp-tuned.c -o flags-dp-tuned
$ time ./flags-dp-tuned
sum = 7.29569e+08
real 0m2.448s
Speedup is ~2.48x.
15 . 3
HELPING A COMPILER
Let's try it for integers (flags-tuned.c)
/*power function is not changed*/
int main ()
{
int64_t sum = -1;
for (unsigned i=0; i<=0x80000000U-8; i += 8)
{
sum+=power(i+0, 0); sum+=power(i+1,1);
sum+=power(i+2, 2); sum+=power(i+3,3);
sum+=power(i+4, 4); sum+=power(i+5,5);
sum+=power(i+6, 6); sum+=power(i+7,7);
}
printf ("sum = %ld\n", sum);
}
$ gcc -std=c99 -Wall -O3 flags-tuned.c -o flags-tuned
$ time ./flags-tuned
sum = 288231861405089791
real 0m1.286s
Speedup is ~5.5x.
16 . 1
DETAILED ANALYSIS
This is how the compiler sees the outer loop
after the inner loop is inlined.
int64_t sum = -1;
for (unsigned i=0; i<=0x80000000U-8; i += 8)
{
int d = (int)i;
sum+=1;
sum+=(1+d);
sum+=(2+d)*(2+d);
sum+=(3+d)*(3+d)*(3+d);
sum+=(4+d)*(4+d)*(4+d)*(4+d);
sum+=(5+d)*(5+d)*(5+d)*(5+d)*(5+d);
sum+=(6+d)*(6+d)*(6+d)*(6+d)*(6+d)*(6+d);
sum+=(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d);
}
16 . 2
5 LOOP COUNTERS INSTEAD OF 1
movl $6, %r11d
movl $5, %r10d
movl $4, %ebx
movl $3, %r9d
movl $2, %r8d
movq $-1, %rax // <-accumulator
.L3:
// loop body
addl $8, %r8d
addl $8, %r9d
addl $8, %ebx
addl $8, %r10d
addl $8, %r11d
cmpl $-2147483646, %r8d
jne .L3
16 . 3
(I+1), (I+2), (I+3)
leal -1(%r8), %edx // 2+i-1
movslq %edx, %rdx // to long
leaq 1(%rax,%rdx), %rdi // sum += 1+i
movl %r8d, %edx // 2+i
imull %r8d, %edx // (2+i)**2
movslq %edx, %rdx // to long
leaq (%rdx,%rdi), %rsi // sum += (2+i)**2
movl %r9d, %ecx
imull %r9d, %ecx
imull %r9d, %ecx
movslq %ecx, %rcx
leaq (%rcx,%rsi), %rdx
16 . 4
(I+4), (I+5), (I+6)
movl %ebx, %eax
imull %ebx, %eax
imull %eax, %eax
cltq
leaq (%rax,%rdx), %rcx
movl %r10d, %eax
imull %r10d, %eax
imull %eax, %eax
imull %r10d, %eax
cltq
addq %rcx, %rax
movl %r11d, %edx
imull %r11d, %edx
imull %r11d, %edx
imull %edx, %edx
movslq %edx, %rdx
addq %rdx, %rax
16 . 5
(I+7)
leal 5(%r8), %esi
movl $7, %ecx
movl $1, %edx
.L2:
imull %esi, %edx
subl $1, %ecx
jne .L2
movslq %edx, %rdx
addq %rdx, %rax
The full listing is available on GitHub.
17 . 1
SUMMARY
Copy propagation and other basic transformations enable
more complicated ones
Hoisting loop-invariant code and other strength reduction
techniques minimize the number of operations
The compiler is very conservative about hoisting floating-point math
Scalarization and loop unswitching make loop bodies flat
Compiler capabilities for loop peeling are still poor; use the
sentinel technique to eliminate extra checks and control
flow
Most hand-written pointer optimizations do not make sense
at optimization levels higher than -O0
17 . 2
SUMMARY
__restrict and __builtin_assume_aligned
keywords only eliminate some loop versioning, but are not
very useful from the performance perspective nowadays
Function body inlining is essential for further loop
optimization
The compiler will detect memset and memcpy patterns even if
you implement them manually, and will call the library
functions instead; library functions are usually more efficient
Use knowledge about the semantics of your code to help the
compiler optimize it
18
THE END
MARINA KOLPAKOVA / 2015-2016

openMP loop parallelization
 
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06Lect-06
 
Evgeniy Muralev, Mark Vince, Working with the compiler, not against it
Evgeniy Muralev, Mark Vince, Working with the compiler, not against itEvgeniy Muralev, Mark Vince, Working with the compiler, not against it
Evgeniy Muralev, Mark Vince, Working with the compiler, not against it
 
2.1 ### uVision Project, (C) Keil Software .docx
2.1   ### uVision Project, (C) Keil Software    .docx2.1   ### uVision Project, (C) Keil Software    .docx
2.1 ### uVision Project, (C) Keil Software .docx
 
Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIWLec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW
 
An evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loopsAn evaluation of LLVM compiler for SVE with fairly complicated loops
An evaluation of LLVM compiler for SVE with fairly complicated loops
 

Recently uploaded

Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
Nguyen Thanh Tu Collection
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
AzmatAli747758
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
Celine George
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
Fundacja Rozwoju Społeczeństwa Przedsiębiorczego
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
Excellence Foundation for South Sudan
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
PedroFerreira53928
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
Steve Thomason
 

Recently uploaded (20)

Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
GIÁO ÁN DẠY THÊM (KẾ HOẠCH BÀI BUỔI 2) - TIẾNG ANH 8 GLOBAL SUCCESS (2 CỘT) N...
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
How to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS ModuleHow to Split Bills in the Odoo 17 POS Module
How to Split Bills in the Odoo 17 POS Module
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
 
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdfESC Beyond Borders _From EU to You_ InfoPack general.pdf
ESC Beyond Borders _From EU to You_ InfoPack general.pdf
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
The Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve ThomasonThe Art Pastor's Guide to Sabbath | Steve Thomason
The Art Pastor's Guide to Sabbath | Steve Thomason
 

Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations

  • 1. 1 PRAGMATIC OPTIMIZATION IN MODERN PROGRAMMING MASTERING COMPILER OPTIMIZATIONS Created by Marina (geek) Kolpakova for UNN / 2015-2016
  • 2. 2 COURSE TOPICS Ordering optimization approaches Demystifying a compiler Mastering compiler optimizations Modern computer architectures concepts
  • 3. 3 OUTLINE Constant folding Hoisting loop invariant code Scalarization Loop unswitching Loop peeling and sentinels Strength reduction Loop-induction variable elimination Auto-vectorization Function body inlining Built-in detection Case study
  • 4. 4 CONSTANT FOLDING evaluates expressions from constants at compile time. It is one of the simplest and most widely used compiler transformations. Expressions can be quite complicated, but they must be free of any kind of side effects. double b = exp(sqrt(M_PI/2)); fldd d17, .L5 // other code .L5: .word 2911864813 .word 1074529267 May be performed in global as well as local scope. Applicable to any code pattern.
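A minimal compilable sketch of the idea (the names `FOLDED` and `computed` are illustrative, not from the slides): an initializer built only from literals can be evaluated at compile time, while the same arithmetic on runtime arguments cannot, unless the compiler can prove the inputs constant.

```c
/* Built only from literals, so an optimizing compiler may evaluate the
 * whole expression at compile time and emit a single constant load. */
static const double FOLDED = 1.0 + 2.0 * 3.0; /* folds to 7.0 */

/* Same arithmetic on runtime values: folding happens only if the compiler
 * can prove the arguments are constants (e.g. after inlining a call site
 * that passes literals). */
double computed(double a, double b, double c) {
    return a + b * c;
}
```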
  • 5. 5 . 1 HOISTING LOOP INVARIANT CODE The goal of hoisting — also called loop-invariant code motion — is to avoid recomputing loop-invariant code each time through the body of a loop. void scale(double* x, double* y, int len) { for (int i = 0; i < len; i++) y[i] = x[i] * exp(sqrt(M_PI/2)); } ... is the same as... void scale(double* x, double *y, int len) { double factor = exp(sqrt(M_PI/2)); for (int i = 0; i < len; i++) y[i] = x[i] * factor; }
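The hoisted form above can be written as a self-contained unit (the function name `scale_hoisted` is illustrative; the `M_PI` fallback is only there because strict `-std=c99` does not guarantee the macro):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Manually hoisted version of the slide's scale(): the loop-invariant
 * factor exp(sqrt(M_PI/2)) is computed once, before the loop, instead of
 * on every iteration. */
void scale_hoisted(const double *x, double *y, int len) {
    const double factor = exp(sqrt(M_PI / 2)); /* invariant, hoisted */
    for (int i = 0; i < len; i++)
        y[i] = x[i] * factor;
}
```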
  • 6. 5 . 2 HOISTING LOOP INVARIANT CODE $ gcc -march=armv7-a -mthumb -O1 -S -o 1.s -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c fldd d17, .L5 .L3: fldmiad r3!, {d16} fmuld d16, d16, d17 fstmiad r1!, {d16} cmp r3, r0 bne .L3 .L1: bx lr .L5: .word 2911864813 .word 1074529267 fldd d17, .L11 .L9: fldmiad r3!, {d16} fmuld d16, d16, d17 fstmiad r1!, {d16} cmp r3, r0 bne .L9 .L7: bx lr .L11: .word 2911864813 .word 1074529267
  • 7. 5 . 3 HOISTING NON-CONSTANTS Let's make the expression variable void scale(double* x, double* y, int len) { for (int i = 0; i < len; i++) y[i] = x[i] * exp(sqrt((double)len)/2.); } void scale(double* x, double *y, int len) { double factor = exp(sqrt((double)len)/2.); for (int i = 0; i < len; i++) y[i] = x[i] * factor; } $ gcc -march=armv7-a -mthumb -O1 -S -o 1.s -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
  • 8. 5 . 4 HOISTING NON-CONSTANTS fmsr s15, r2 @ int fsitod d9, s15 fsqrtd d9, d9 .L4: fldmiad r6!, {d8} fcpyd d16, d9 fcmpd d9, d9 fmstat beq .L3 fmsr s15, r7 @ int fsitod d16, s15 fmrrd r0, r1, d16 bl sqrt(PLT) fmdrr d16, r0, r1 .L3: fmuld d8, d8, d16 fstmiad r5!, {d8} adds r4, r4, #1 cmp r4, r7 bne .L4 fmsr s15, r2 @ int fsitod d17, s15 fsqrtd d17, d17 fcmpd d17, d17 fmstat beq .L9 fsitod d16, s15 fmrrd r0, r1, d16 bl sqrt(PLT) fmdrr d17, r0, r1 .L9: mov r3, r5 mov r1, r4 add r0, r5, r6, lsl #3 .L11: fldmiad r3!, {d16} fmuld d16, d16, d17 fstmiad r1!, {d16} cmp r3, r0 bne .L11 The compiler isn't sure that floating point flags aren't affected!
  • 9. 6 Usually it is performed on branches inside a loop body to allow further optimization. Machine-independent, local. SCALARIZATION Scalarization replaces branchy code with a branchless analog int branchy(int i) { if (i > 4 && i < 42) return 1; return 0; } int branchless(int i) { return (((unsigned)i) - 5 <= 36); } $ gcc -march=armv8-a+simd -O3 1.c -S -o 1.s branchy: sub w0, w0, #5 cmp w0, 36 cset w0, ls ret branchless: sub w0, w0, #5 cmp w0, 36 cset w0, ls ret
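The two forms can be checked against each other. Note that the branchless comparison must be `<= 36` to include i == 41; this matches the `cset w0, ls` (unsigned lower-or-same) in the generated assembly, and a `< 36` variant would wrongly exclude 41.

```c
/* Branchy range test: returns 1 for i in [5, 41]. */
int branchy(int i) {
    if (i > 4 && i < 42) return 1;
    return 0;
}

/* Branchless rewrite: one unsigned subtract folds both bounds into a
 * single comparison. Negative i wraps around to a huge unsigned value,
 * so it correctly falls outside the range. */
int branchless(int i) {
    return ((unsigned)i - 5u) <= 36u;
}
```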
  • 10. 7 . 1 LOOP UNSWITCHING Moves loop-invariant conditionals or switches which are independent of the loop index out of the loop. Increases ILP, enables further optimizations. for (int i = 0; i < len; i++) { if (a > 32) arr[i] = a; else arr[i] = 42; } int i = 0; if (a > 32) for (; i < len; i++) arr[i] = a; else for (; i < len; i++) arr[i] = 42; $ gcc -march=armv8-a+simd -O3 -S -o 1.s -fno-tree-vectorize 1.c Auto-vectorization is disabled just to make the example readable.
  • 11. 7 . 2 LOOP UNSWITCHING cmp w1, wzr ble .L1 cmp w2, 32 bgt .L6 mov x2, 0 mov w3, 42 .L4: str w3,[x0,x2,lsl 2] add x2, x2, 1 cmp w1, w2 bgt .L4 .L1: ret .L6: mov x3, 0 .L3: str w2,[x0,x3,lsl 2] add x3, x3, 1 cmp w1, w3 bgt .L3 ret
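A compilable sketch of the unswitched form (the function name `fill_unswitched` is illustrative):

```c
/* Unswitched version of the slide's loop: the loop-invariant a > 32 test
 * is evaluated once, and each outcome gets its own specialized loop. */
void fill_unswitched(int *arr, int len, int a) {
    int i = 0;
    if (a > 32)
        for (; i < len; i++) arr[i] = a;   /* store a in every slot  */
    else
        for (; i < len; i++) arr[i] = 42;  /* store 42 in every slot */
}
```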
  • 12. 7 . 3 Can be done by a compiler under high optimization levels. LOOP PEELING Loop peeling takes a small number of iterations from the beginning or/and the end of a loop out of the loop and executes them separately, which eliminates if-statements. for(int i=0; i<len; i++) { if (i == 0) b[i] = a[i+1]; else if(i == len-1 ) b[i] = a[i-1]; else b[i] = a[i+1]+a[i-1]; } b[0] = a[1]; for(int i=1; i<len-1; i++) b[i] = a[i+1]+a[i-1]; b[len-1] = a[len-2]; For the example below both tested compilers, gcc 4.9 and clang 3.5, won't perform this optimization. Make sure that the compiler was able to apply the optimization to such a code pattern, or do it manually.
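The peeled rewrite is easy to get wrong at the boundaries, so here is a checkable version (function names are illustrative; assumes len >= 2):

```c
/* Original stencil with per-iteration boundary tests. */
void stencil_original(const int *a, int *b, int len) {
    for (int i = 0; i < len; i++) {
        if (i == 0)            b[i] = a[i + 1];
        else if (i == len - 1) b[i] = a[i - 1];
        else                   b[i] = a[i + 1] + a[i - 1];
    }
}

/* Peeled version: the first and last iterations are executed separately,
 * leaving a branch-free steady-state loop. Assumes len >= 2. */
void stencil_peeled(const int *a, int *b, int len) {
    b[0] = a[1];                       /* peeled first iteration */
    for (int i = 1; i < len - 1; i++)  /* branch-free interior   */
        b[i] = a[i + 1] + a[i - 1];
    b[len - 1] = a[len - 2];           /* peeled last iteration  */
}
```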
  • 13. 7 . 4 ADVANCED UNSWITCHING: SENTINELS Sentinels are special dummy values placed in a data structure which are added to simplify the logic of handling boundary conditions (in particular, handling of loop-exit tests or elimination of out-of-bound checks). This optimization cannot be performed by the compiler because it requires changes in the use of data structures.
  • 14. 7 . 5 ADVANCED UNSWITCHING: SENTINELS Let's assume that array a contains two extra elements: one at the left, a[-1], and one at the right, a[N]. With such assumptions the code can be rewritten as follows. for(int i=0; i<len; i++) { if (i == 0) b[i] = a[i+1]; else if(i == len-1 ) b[i] = a[i-1]; else b[i] = a[i+1]+a[i-1]; } // setup boundaries a[-1] = 0; a[len] = 0; for (int i=0; i<len; i++) b[i] = a[i+1] + a[i-1];
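A sketch under the stated assumption that the buffer has one padding slot on each side: here `base` points one element into a larger buffer, so `base[-1]` and `base[len]` are valid (the names are illustrative).

```c
/* Sentinel version of the stencil: writing zeros into the padding slots
 * makes the boundary iterations identical to interior ones, so the loop
 * needs no if-statements at all. base must point one element into a
 * buffer of at least len + 2 elements. */
void stencil_sentinel(int *base, int *b, int len) {
    base[-1]  = 0;   /* left sentinel  */
    base[len] = 0;   /* right sentinel */
    for (int i = 0; i < len; i++)
        b[i] = base[i + 1] + base[i - 1];
}
```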
  • 15. 8 . 1 STRENGTH REDUCTION replaces complex expressions with simpler equivalents double usePow(double x) { return pow(x, 2.); } float usePowf(float x) { return powf(x, 2.f); } Machine-independent transformation in most cases. It may be machine-dependent if it relies on a specific feature set implemented in the HW (e.g. built-in detection).
  • 16. 8 . 2 STRENGTH REDUCTION $ gcc -march=armv7-a -mthumb -O3 -S -o 1.s > -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c usePow: fmdrr d16, r0, r1 fmuld d16, d16, d16 fmrrd r0, r1, d16 bx lr usePowf: fmsr s15, r0 fmuls s15, s15, s15 fmrs r0, s15 bx lr Usually it is performed in a local scope for a dependency chain.
  • 17. 8 . 3 STRENGTH REDUCTION (ADVANCED CASE) float useManyPowf(float a, float b, float c, float d, float e, float f, float x) { return a * powf(x, 5.f) + b * powf(x, 4.f) + c * powf(x, 3.f) + d * powf(x, 2.f) + e * x + f; } $ gcc -march=armv7-a -mthumb -O3 -S -o 1.s > -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
  • 18. 8 . 4 The compiler was only able to partly reduce the complexity, generating vfma where possible. STRENGTH REDUCTION (ADVANCED) useManyPowf: push {r3, lr} flds s17, [sp, #56] fmsr s24, r1 movs r1, #0 fmsr s22, r0 movt r1, 16544 fmrs r0, s17 fmsr s21, r2 fmsr s20, r3 flds s19, [sp, #48] flds s18, [sp, #52] bl powf(PLT) mov r1, #1082130432 fmsr s23, r0 fmrs r0, s17 bl powf(PLT) movs r1, #0 movt r1, 16448 fmsr s16, r0 fmrs r0, s17 bl powf(PLT) fmuls s16, s16, s24 vfma.f32 s16, s23, s22 fmsr s15, r0 vfma.f32 s16, s15, s21 fmuls s15, s17, s17 vfma.f32 s16, s20, s15 vfma.f32 s16, s19, s17 fadds s15, s16, s18 fldmfdd sp!, {d8-d12} fmrs r0, s15 pop {r3, pc}
  • 19. 8 . 5 STRENGTH REDUCTION (MANUAL) Let's further reduce latency by applying the Horner rule. float applyHornerf(float a, float b, float c, float d, float e, float f, float x) { return ((((a * x + b) * x + c) * x + d) * x + e) * x + f; } $ gcc -march=armv7-a -mthumb -O3 -S -o 1.s > -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c The compiler is not capable of this optimization because the result is not bit-exact with the original formula.
  • 20. 8 . 6 STRENGTH REDUCTION (MANUAL) applyHornerf: flds s15, [sp, #8] fmsr s11, r0 fmsr s12, r1 flds s14, [sp] vfma.f32 s12, s11, s15 fmsr s11, r2 flds s13, [sp, #4] vfma.f32 s11, s12, s15 fcpys s12, s11 fmsr s11, r3 vfma.f32 s11, s12, s15 vfma.f32 s14, s11, s15 vfma.f32 s13, s14, s15 fmrs r0, s13 bx lr
  • 21. 9 . 1 LOOP-INDUCTION VARIABLE ELIMINATION In most cases the compiler is able to replace address arithmetic with pointer arithmetic. void function(int* arr, int len) { for (int i = 0; i < len; i++) arr[i] = 1; } void function(int* arr, int len) { for (int* p = arr; p < arr + len; p++) *p = 1; } $ gcc -march=armv7-a -mthumb -O1 -S -o 1.s > -mfpu=neon-vfpv4 -mfloat-abi=softfp 1.c
  • 22. 9 . 2 LOOP-INDUCTION VARIABLE ELIMINATION cmp r1, #0 ble .L1 mov r3,r0 add r0,r0,r1,lsl #2 movs r2, #1 .L3: str r2, [r3], #4 cmp r3, r0 bne .L3 .L1: bx lr add r1,r0,r1,lsl #2 cmp r0, r1 bcs .L5 movs r3, #1 .L8: str r3, [r0], #4 cmp r1, r0 bhi .L8 .L5: bx lr Most hand-written pointer optimizations do not make sense at optimization levels higher than -O0.
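Both forms from the slide as a compilable sketch (renamed `fill_indexed`/`fill_pointer` so they can coexist); with optimization on, the compiler produces the pointer form from the indexed one itself, so hand-writing the latter buys nothing:

```c
/* Indexed form: the induction variable i drives the address computation. */
void fill_indexed(int *arr, int len) {
    for (int i = 0; i < len; i++)
        arr[i] = 1;
}

/* Pointer form: roughly what the compiler generates from the loop above
 * at -O1 and higher, with the induction variable eliminated. */
void fill_pointer(int *arr, int len) {
    for (int *p = arr; p < arr + len; p++)
        *p = 1;
}
```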
  • 23. 10 . 1 AUTO-VECTORIZATION Auto-vectorization is machine code generation that takes advantage of vector instructions. Most modern architectures have vector extensions as a co-processor or as dedicated pipes: MMX, SSE, SSE2, SSE4, AVX, AVX-512; AltiVec, VSX; ASIMD (NEON), MSA. Enabled by inlining, unrolling, fusion, software pipelining, inter-procedural optimization, and other machine-independent transformations.
  • 24. 10 . 2 AUTO-VECTORIZATION void vectorizeMe(float *a, float *b, int len) { int i; for (i = 0; i < len; i++) a[i] += b[i]; } $ gcc -march=armv7-a -mthumb -O3 -S -o 1.s > -mfpu=neon-vfpv4 -mfloat-abi=softfp > -fopt-info-missed 1.c But NEON does not support full IEEE 754, so gcc cannot vectorize the loop, which it tells us: 1.c:64:3: note: not vectorized: relevant stmt not supported: _13 = _9 + _12;
  • 25. 10 . 3 Here is the generated assembly and... here we are! AUTO-VECTORIZATION .L3: fldmias r1!, {s14} flds s15, [r0] fadds s15, s15, s14 fstmias r0!, {s15} cmp r0, r2 bne .L3 But armv8-a does support it, so let's check! $ gcc -march=armv8-a+simd -O3 -S -o 1.s > -fopt-info-all 1.c 1.c:64:3: note: loop vectorized
  • 26. 10 . 4 Here is the generated assembly and the full optimizer's report AUTO-VECTORIZATION .L6: ldr q1, [x3],16 add w6, w6, 1 ldr q0, [x7],16 cmp w6, w4 fadd v0.4s, v0.4s, v1.4s str q0, [x8],16 bcc .L6 1.c:66:3: note: loop vectorized 1.c:66:3: note: loop versioned for vectorization because of possible aliasing 1.c:66:3: note: loop peeled for vectorization to enhance alignment 1.c:66:3: note: loop with 3 iterations completely unrolled 1.c:61:6: note: loop with 3 iterations completely unrolled The compiler versions the loop to allow optimized paths in case of aligned and non-aliasing pointers
  • 27. 10 . 5 Let's follow the advice and add some keywords. The optimizer's report shrinks. KEYWORDS void vectorizeMe(float* __restrict a_, float* __restrict b_, int len) { float *a=__builtin_assume_aligned(a_, 16); float *b=__builtin_assume_aligned(b_, 16); for (int i = 0; i < len; i++) a[i] += b[i]; } 1.c:66:3: note: loop vectorized 1.c:66:3: note: loop with 3 iterations completely unrolled The __restrict and __builtin_assume_aligned keywords only eliminate some loop versioning, but are not very useful from the performance perspective.
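A portable sketch of the annotated kernel using the standard C99 `restrict` qualifier (the slide uses GCC's `__restrict` spelling; `__builtin_assume_aligned` is GCC-specific and omitted here):

```c
/* restrict promises a and b never alias, which lets the vectorizer drop
 * its runtime aliasing check, one of the sources of loop versioning in
 * the report above. */
void vectorizeMe(float *restrict a, const float *restrict b, int len) {
    for (int i = 0; i < len; i++)
        a[i] += b[i];
}
```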
  • 28. 11 . 1 FUNCTION BODY INLINING Replaces a function call with the function body itself. int square(int x) { return x*x; } for (int i = 0; i < len; i++) arr[i] = square(i); Becomes for (int i = 0; i < len; i++) arr[i] = i*i; Enables all further optimizations. Machine-independent, interprocedural.
  • 29. 11 . 2 FUNCTION BODY INLINING gcc -march=armv8-a+nosimd -O3 -S -o 1.s 1.c .L2: add x2, x4, :lo12:.LANCHOR0 mov x1, 34464 mul w3, w0, w0 movk x1, 0x1, lsl 16 str w3, [x2,x0,lsl 2] add x0, x0, 1 cmp x0, x1 bne .L2 Let's compile the code with vector extension enabled -march=armv8-a+simd
  • 30. 11 . 3 FUNCTION BODY INLINING add x0, x0, :lo12:.LANCHOR0 movi v2.4s, 0x4 ldr q0, [x1] add x1, x0, 397312 add x1, x1, 2688 .L2: mul v1.4s, v0.4s, v0.4s add v0.4s, v0.4s, v2.4s str q1, [x0],16 cmp x0, x1 bne .L2 Auto-vectorization is enabled because the function call inside the loop could be inlined.
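A compilable sketch of the example (`fill_squares` is an illustrative wrapper; the `static inline` hint is optional, since at -O3 GCC inlines such a small function on its own):

```c
static inline int square(int x) { return x * x; }

/* After inlining, the body becomes arr[i] = i * i, a plain multiply with
 * no call, which is what allows the vectorized mul v1.4s loop above. */
void fill_squares(int *arr, int len) {
    for (int i = 0; i < len; i++)
        arr[i] = square(i);
}
```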
  • 31. 12 . 1 BUILT-IN DETECTION The compiler automatically uses the library functions memset and memcpy to initialize and copy memory blocks static char a[100000]; static char b[100000]; static void at(int idx, char val) { if (idx>=0 && idx<100000) a[idx] = val; } int main() { for (int i=0; i<100000; ++i) a[i]=42; for (int i=0; i<100000; ++i) at(i,-1); for (int i=0; i<100000; ++i) b[i] = a[i]; }
  • 32. 12 . 2 Compiler knows what you mean FILLING/COPYING MEMORY BLOCKS main: .LFB1: .cfi_startproc subq $8, %rsp .cfi_def_cfa_offset 16 movl $100000, %edx movl $42, %esi movl $a, %edi call memset movl $100000, %edx movl $255, %esi movl $a, %edi call memset movl $100000, %edx movl $a, %esi movl $b, %edi call memcpy addq $8, %rsp .cfi_def_cfa_offset 8 ret
  • 33. 12 . 3 The same picture, if the code is compiled for ARM target FILLING/COPYING MEMORY BLOCKS main: ldr r3, .L3 mov r1, #42 stmfd sp!, {r4, lr} add r3, pc, r3 movw r4, #34464 movt r4, 1 mov r0, r3 mov r2, r4 bl memset(PLT) mov r2, r4 mov r1, #255 bl memset(PLT) mov r2, r4 mov r3, r0 ldr r0, .L3+4 mov r1, r3 add r0, pc, r0 add r0, r0, #1792 bl memcpy(PLT) ldmfd sp!, {r4, pc}
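The recognized patterns can be isolated as stand-alone loops (function names are illustrative); GCC's idiom recognition turns each into a single library call, as both listings above show:

```c
/* A byte-fill loop: recognized and replaced with memset(dst, val, n). */
void fill_bytes(char *dst, char val, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = val;
}

/* A byte-copy loop: recognized and replaced with memcpy(dst, src, n). */
void copy_bytes(char *dst, const char *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}
```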
  • 34. 13 . 1 CASE STUDY: FLOATING POINT double power( double d, unsigned n) { double x = 1.0; for (unsigned j = 1; j<=n; j++, x *= d) ; return x; } int main () { double a = 1./0x80000000U, sum = 0.0; for (unsigned i=1; i<= 0x80000000U; i++) sum += power( i*a, i % 8); printf ("sum = %g\n", sum); } flags-dp.c
  • 35. 13 . 2 CASE STUDY: FLOATING POINT 1. Compile it without optimization 2. Compile it with O1: ~3.26 speedup $ gcc -std=c99 -Wall -O0 flags-dp.c -o flags-dp $ time ./flags-dp sum = 7.29569e+08 real 0m24.550s $ gcc -std=c99 -Wall -O1 flags-dp.c -o flags-dp $ time ./flags-dp sum = 7.29569e+08 real 0m7.529s
  • 36. 13 . 3 CASE STUDY: FLOATING POINT 3. Compile it with O2: ~1.24 speedup 4. Compile it with O3: ~1.00 speedup $ gcc -std=c99 -Wall -O2 flags-dp.c -o flags-dp $ time ./flags-dp sum = 7.29569e+08 real 0m6.069s $ gcc -std=c99 -Wall -O3 flags-dp.c -o flags-dp $ time ./flags-dp sum = 7.29569e+08 real 0m6.067s Total speedup is ~4.05
  • 37. 14 . 1 CASE STUDY: INTEGER int power( int d, unsigned n) { int x = 1; for (unsigned j = 1; j<=n; j++, x*=d) ; return x; } int main () { int64_t sum = 0; for (unsigned i=1; i<0x80000000U; i++) sum += power( i, i % 8); printf ("sum = %ld\n", sum); } flags.c
  • 38. 14 . 2 CASE STUDY: INTEGER 1. Compile it without optimization 2. Compile it with O1: ~2.64 speedup $ gcc -std=c99 -Wall -O0 flags.c -o flags $ time ./flags sum = 288231861405089791 real 0m18.750s $ gcc -std=c99 -Wall -O1 flags.c -o flags $ time ./flags sum = 288231861405089791 real 0m7.092s
  • 39. 14 . 3 CASE STUDY: INTEGER 3. Compile it with O2: ~0.97 speedup 4. Compile it with O3: ~1.00 speedup $ gcc -std=c99 -Wall -O2 flags.c -o flags $ time ./flags sum = 288231861405089791 real 0m7.300s $ gcc -std=c99 -Wall -O3 flags.c -o flags $ time ./flags sum = 288231861405089791 real 0m7.082s WHAT? THERE IS NO IMPROVEMENT?
  • 40. 14 . 4 EXAMPLE: INTEGER -O1 -O2 THE COMPILER OVERDID THE BRANCH TWIDDLING! movl $1, %r8d movl $0, %edx .L9: movl %r8d, %edi movl %r8d, %esi andl $7, %esi je .L10 movl $1, %ecx movl $1, %eax .L8: @ addl $1, %eax @ imull %edi, %ecx @ cmpl %eax, %esi @ jae .L8 jmp .L7 .L10: movl $1, %ecx .L7: movslq %ecx, %rcx addq %rcx, %rdx addl $1, %r8d jns .L9 @ printing is here movl $1, %esi movl $1, %r8d xorl %edx, %edx .p2align 4,,10 .p2align 3 .L10: movl %r8d, %edi andl $7, %edi je .L11 movl $1, %ecx movl $1, %eax .p2align 4,,10 .p2align 3 .L9: addl $1, %eax imull %esi, %ecx cmpl %eax, %edi jae .L9 movslq %ecx, %rcx .L8: addl $1, %r8d addq %rcx, %rdx testl %r8d, %r8d movl %r8d, %esi jns .L10 subq $8, %rsp @ printing is here ret .L11: movl $1, %ecx jmp .L8
  • 41. 15 . 1 HELPING A COMPILER The compiler usually applies optimization to the inner loops. In this example the number of iterations in the inner loop depends on the value of the induction variable of the outer loop. Let's help the compiler (flags-dp-tuned.c)
  • 42. 15 . 2 HELPING A COMPILER /*power function is not changed*/ int main () { double a = 1./0x80000000U, s = -1.; for (double i=0; i<=0x80000000U-8; i += 8) { s+=power((i+0)*a,0);s+=power((i+1)*a,1); s+=power((i+2)*a,2);s+=power((i+3)*a,3); s+=power((i+4)*a,4);s+=power((i+5)*a,5); s+=power((i+6)*a,6);s+=power((i+7)*a,7); } printf ("sum = %g\n", s); } $ gcc -std=c99 -Wall -O3 flags-dp-tuned.c -o flags-dp-tuned $ time ./flags-dp-tuned sum = 7.29569e+08 real 0m2.448s Speedup is ~2.48x.
  • 43. 15 . 3 Let's try it for integers (flags-tuned.c) HELPING A COMPILER /*power function is not changed*/ int main () { int64_t sum = -1; for (unsigned i=0; i<=0x80000000U-8; i += 8) { sum+=power(i+0, 0); sum+=power(i+1,1); sum+=power(i+2, 2); sum+=power(i+3,3); sum+=power(i+4, 4); sum+=power(i+5,5); sum+=power(i+6, 6); sum+=power(i+7,7); } printf ("sum = %ld\n", sum); } $ gcc -std=c99 -Wall -O3 flags-tuned.c -o flags-tuned $ time ./flags-tuned sum = 288231861405089791 real 0m1.286s Speedup is ~5.5x.
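The rewrite is easy to validate on a miniature bound (N = 16 below stands in for 0x80000000U; the sketch assumes N is a multiple of 8 and small enough that power() does not overflow). Starting the unrolled sum at -1 cancels the extra power(0, 0) == 1 term that the i = 0 iteration adds:

```c
#include <stdint.h>

/* The slides' power function, unchanged. */
static int power(int d, unsigned n) {
    int x = 1;
    for (unsigned j = 1; j <= n; j++, x *= d)
        ;
    return x;
}

/* Straightforward form: the inner trip count depends on the outer index. */
int64_t sum_plain(unsigned N) {
    int64_t sum = 0;
    for (unsigned i = 1; i < N; i++)
        sum += power(i, i % 8);
    return sum;
}

/* Manually unrolled by 8: every power call now has a constant exponent,
 * so the compiler can fully inline and strength-reduce each one. */
int64_t sum_unrolled(unsigned N) {
    int64_t sum = -1; /* cancels the extra power(0, 0) == 1 term */
    for (unsigned i = 0; i <= N - 8; i += 8) {
        sum += power(i + 0, 0); sum += power(i + 1, 1);
        sum += power(i + 2, 2); sum += power(i + 3, 3);
        sum += power(i + 4, 4); sum += power(i + 5, 5);
        sum += power(i + 6, 6); sum += power(i + 7, 7);
    }
    return sum;
}
```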
  • 44. 16 . 1 DETAILED ANALYSIS This is how the compiler sees the outer loop after inner loop inlining int64_t sum = -1; for (unsigned i=0; i<=0x80000000U-8; i += 8) { int d = (int)i; sum+=1; sum+=(1+d); sum+=(2+d)*(2+d); sum+=(3+d)*(3+d)*(3+d); sum+=(4+d)*(4+d)*(4+d)*(4+d); sum+=(5+d)*(5+d)*(5+d)*(5+d)*(5+d); sum+=(6+d)*(6+d)*(6+d)*(6+d)*(6+d)*(6+d); sum+=(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d)*(7+d); }
  • 45. 16 . 2 5 LOOP COUNTERS INSTEAD OF 1 movl $6, %r11d movl $5, %r10d movl $4, %ebx movl $3, %r9d movl $2, %r8d movq $-1, %rax // <-accumulator .L3: // loop body addl $8, %r8d addl $8, %r9d addl $8, %ebx addl $8, %r10d addl $8, %r11d cmpl $-2147483646, %r8d jne .L3
  • 46. 16 . 3 (I+1), (I+2), (I+3) leal -1(%r8), %edx // 2+i-1 movslq %edx, %rdx // to long leaq 1(%rax,%rdx), %rdi // sum += 1+i movl %r8d, %edx // 2+i imull %r8d, %edx // (2+i)**2 movslq %edx, %rdx // to long leaq (%rdx,%rdi), %rsi // sum += (2+i)**2 movl %r9d, %ecx imull %r9d, %ecx imull %r9d, %ecx movslq %ecx, %rcx leaq (%rcx,%rsi), %rdx
  • 47. 16 . 4 (I+4), (I+5), (I+6) movl %ebx, %eax imull %ebx, %eax imull %eax, %eax cltq leaq (%rax,%rdx), %rcx movl %r10d, %eax imull %r10d, %eax imull %eax, %eax imull %r10d, %eax cltq addq %rcx, %rax movl %r11d, %edx imull %r11d, %edx imull %r11d, %edx imull %edx, %edx movslq %edx, %rdx addq %rdx, %rax
  • 48. 16 . 5 (I+7) leal 5(%r8), %esi movl $7, %ecx movl $1, %edx .L2: imull %esi, %edx subl $1, %ecx jne .L2 movslq %edx, %rdx addq %rdx, %rax The full listing is available on GitHub
  • 49. 17 . 1 SUMMARY Copy propagation and other basic transformations enable more complicated ones. Hoisting loop-invariant code and other strength reduction techniques minimize the number of operations. The compiler is very conservative about hoisting floating point math. Scalarization and loop unswitching make loop bodies flat. Compilers' loop peeling capabilities are still poor; use the sentinel technique to eliminate extra checks and control flow. Most hand-written pointer optimizations do not make sense at optimization levels higher than -O0.
  • 50. 17 . 2 SUMMARY The __restrict and __builtin_assume_aligned keywords only eliminate some loop versioning, but are not very useful from the performance perspective nowadays. Function body inlining is essential for further loop optimization. The compiler will detect memcpy and memset, even if you implement them manually, and call the library function instead; library functions are usually more efficient. Use knowledge about the semantics of your code to help the compiler optimize it.