TMPA-2017: Simple Type Based Alias Analysis for a VLIW Processor

MCST
Simple Type-Based Alias Analysis for a VLIW Processor
Markin A. L. Alex.L.Markin@mcst.ru
Ermolitsky A. V. Alexander.V.Ermolitsky@mcst.ru
4 March 2017

Elbrus
Elbrus is a general-purpose VLIW (Very Long Instruction Word) microprocessor.
Features:
Up to 23 instructions per clock cycle
In-order instruction execution
Array Access Unit (AAU): asynchronous array loading from memory to the Array Prefetch Buffer (APB)
Hardware support for loop pipelining
Disambiguation Access Memory (DAM): hardware support for pointer disambiguation
All these features depend critically on good compiler optimization.

Pointer analysis
void foo(int * a, float * b) {
    for (int i = 1; i < N; i++) {
        a[0] += a[i];
        b[0] *= b[i];
    }
}
The purpose of pointer analysis is to detect whether a and b may refer to the same memory area.
It is difficult because of:
Lack of information about the whole program (in per-module build mode)
High resource consumption of pointer analysis (in whole-program mode)
The complexity of pointer analysis algorithms
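A minimal C sketch of what the answer buys the compiler in this example. The loop bound N is turned into a parameter n, and foo_no_alias is a hypothetical, hand-written version of the code a compiler may produce once it knows that a and b do not overlap. Because a is int * and b is float *, type information alone is enough to justify it: a store through a float lvalue cannot modify an int object, and vice versa.

```c
/* The slide's loop, with the bound N passed in as n. */
void foo(int *a, float *b, int n) {
    for (int i = 1; i < n; i++) {
        a[0] += a[i];
        b[0] *= b[i];
    }
}

/* Hypothetical result of the optimization: if a and b may overlap, every
   store to b[0] could change a[0] or a[i], forcing reloads each iteration.
   Once they are known to be disjoint, a[0] and b[0] are kept in local
   accumulators (registers) and written back once after the loop. */
void foo_no_alias(int *a, float *b, int n) {
    int acc_a = a[0];
    float acc_b = b[0];
    for (int i = 1; i < n; i++) {
        acc_a += a[i];
        acc_b *= b[i];
    }
    a[0] = acc_a;
    b[0] = acc_b;
}
```

For non-overlapping arrays both functions produce the same results; they differ only when a and b share memory, which is exactly the case the analysis has to rule out.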

Strict-aliasing
The C language allows pointers to be disambiguated by their types (C standard, §6.5, paragraph 7):
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
a type compatible with the effective type of the object,
a qualified version of a type compatible with the effective type of the object,
a type that is the signed or unsigned type corresponding to the effective type of the object,
a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
a character type.
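A small C sketch of what the rule lets a compiler assume. The function names are hypothetical illustrations, not taken from lcc. In store_then_load, *i and *f have incompatible types, so a compiler may return the constant 1 without reloading *i; byte_sum relies on the last item of the list, which allows any object to be read through a character type.

```c
#include <stddef.h>

/* *i (int) and *f (float) cannot designate the same object under the rule,
   so an optimizing compiler may fold "return *i" to "return 1". */
int store_then_load(int *i, float *f) {
    *i = 1;
    *f = 2.0f;
    return *i;
}

/* Character types may access any object, so summing the bytes of an
   object of arbitrary type through unsigned char is well-defined. */
unsigned byte_sum(const void *obj, size_t size) {
    const unsigned char *p = obj;
    unsigned sum = 0;
    for (size_t k = 0; k < size; k++)
        sum += p[k];
    return sum;
}
```

If a caller passed the same storage as both *i and *f (through casts), the program would break the rule, and the folded result would no longer match what the stores did.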

Algorithm
The strict-aliasing implementation in lcc (the Elbrus C compiler) works on the architecture-independent IR (EIR).
General description:
1. Gather all relevant READ and WRITE operations
2. Generate a compatibility vector for the type of each operation
3. Assign the analysis results to the corresponding operations
Type-based alias analysis is implemented in all major compilers.
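A minimal C sketch of the three steps, under loudly stated assumptions: the type set (TypeId), the operation record (Op) and all function names are hypothetical, compatibility is reduced to "same type or signed/unsigned counterpart", and two operations are assumed to be disambiguated when the bitwise AND of their vectors is zero. The slides do not describe how lcc compares vectors, so this is an illustration rather than the lcc implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified IR types; T_CHAR must stay last. */
typedef enum { T_INT32, T_UINT32, T_FLOAT, T_STR1_PTR, T_STR2_PTR, T_CHAR } TypeId;
typedef enum { OP_READ, OP_WRITE, OP_OTHER } OpKind;

typedef struct {
    OpKind kind;
    TypeId type;         /* type of the accessed lvalue */
    uint32_t alias_vec;  /* filled in by the analysis (step 3) */
} Op;

/* Simplified rule among non-character types: identical types, or the
   signed/unsigned counterpart of the same integer type. */
bool compatible(TypeId a, TypeId b) {
    if (a == b)
        return true;
    return (a == T_INT32 && b == T_UINT32) || (a == T_UINT32 && b == T_INT32);
}

/* Step 2: the compatibility vector is the type's row of the compatibility
   table. Character types may access any object, so they get every bit. */
uint32_t compat_vector(TypeId t) {
    if (t == T_CHAR)
        return (1u << T_CHAR) - 1u;   /* all non-character columns */
    uint32_t v = 0;
    for (int u = 0; u < T_CHAR; u++)
        if (compatible(t, (TypeId)u))
            v |= 1u << u;
    return v;
}

/* Steps 1 and 3: only READ/WRITE operations take part; each gets the
   vector of its type. Other operations get an empty vector. */
void assign_alias_vectors(Op *ops, int n) {
    for (int i = 0; i < n; i++)
        ops[i].alias_vec = (ops[i].kind == OP_OTHER) ? 0u : compat_vector(ops[i].type);
}

/* Assumed query: two memory operations may alias iff their vectors intersect. */
bool may_alias(const Op *a, const Op *b) {
    return (a->alias_vec & b->alias_vec) != 0;
}
```

The AND test is exact here only because the simplified compatibility relation is an equivalence on non-character types; a richer type system would need a different vector layout or query.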

Implementation characteristics
Pointer analysis: answers whether two pointers can refer to the same memory area
Intraprocedural: does not require whole-program information
Flow-insensitive: does not use information about the program control flow
Context-insensitive: does not use information from function call sites
No memory modeling
Results are represented as vectors

Runtime results
Figure: Integer SPEC CPU2006 execution speedup (> 1 is better) for four configurations (lcc module, lcc whole, gcc module, gcc lto) on 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar and 483.xalancbmk; the off-scale 462.libquantum bar is labeled 17.49.

Runtime results
Figure: Floating point SPEC CPU2006 execution speedup (> 1 is better) for four configurations (lcc module, lcc whole, gcc module, gcc lto) on 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto and 482.sphinx3.

Runtime results
GMean speedup gained with the help of strict-aliasing:
                    lcc -O3 -ffast   lcc -O3 -ffast -fwhole   gcc -O3   gcc -O3 -flto
SPEC CPU2006 INT    28.6%            1.9%                     1%        0%
SPEC CPU2006 FP     13.3%            4.3%                     1.5%      1.1%
Testing environment:
lcc: Elbrus 4C (Elbrus v3 ISA)
gcc: Intel Xeon E5-2650 (x86-64 ISA)
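A short worked example of why a single benchmark can dominate the INT column. The geometric mean of n per-benchmark speedups s_i is given below. The numbers that follow are a hypothetical scenario, not a decomposition of the measured 28.6%: if 462.libquantum (17.49 times faster) were the only one of the 12 SPEC CPU2006 integer benchmarks whose run time changed, the geometric mean would already be about 1.27.

```latex
\mathrm{GMean} = \Bigl(\prod_{i=1}^{n} s_i\Bigr)^{1/n},
\qquad
\bigl(17.49 \cdot 1^{11}\bigr)^{1/12} = e^{\ln 17.49 / 12} \approx e^{0.2385} \approx 1.27
```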

Implementation Aspects
Problem: strict-aliasing violations are common, so a separate analysis for detecting strict-aliasing errors was implemented
Problem: unions are hard to analyse at compile time, so they are ignored
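A minimal C sketch of a typical violation that such a detector has to deal with, followed by two conforming alternatives. The function names are hypothetical, the code assumes a 32-bit IEEE 754 float, and the slides do not say which patterns the lcc detector actually reports. The union variant is shown because it is the construct the analysis leaves unmodeled.

```c
#include <stdint.h>
#include <string.h>

/* A typical violation (undefined behavior, shown only in this comment):
 *
 *     uint32_t bits = *(uint32_t *)&x;   // reads a float object via uint32_t
 *
 * Type-based alias analysis may then reorder or drop this load relative
 * to stores into x. */

_Static_assert(sizeof(float) == sizeof(uint32_t), "assumes a 32-bit float");

/* Conforming alternative 1: copy the object representation. */
uint32_t float_bits_memcpy(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Conforming alternative 2 (C99 TC3 and later): read the other member of
   a union. This is the kind of access the analysis does not try to model. */
uint32_t float_bits_union(float x) {
    union { float f; uint32_t u; } pun;
    pun.f = x;
    return pun.u;
}
```

Both functions return the same bit pattern, for example 0x3F800000 for 1.0f on IEEE 754 targets.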

462.libquantum
This benchmark ran 17.49 times faster after strict-aliasing analysis was enabled in per-module build mode!
Its three hottest functions share the same pattern:
void foo(str_1 * str) {
    for (int i = 0; i < N; i++)
    {
        str->arr[i].field; // LOAD of arr and LOAD of field
        ...
        str->arr[i].field = val; // STORE to field
    }
}
Without alias information, the STORE to field might overwrite str->arr; this possible dependence prevents the compiler from hoisting the loop-invariant LOAD of arr out of the loop.
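A C sketch of the pattern, assuming hypothetical definitions of str_1 and str_2 (the slide gives only the names) and turning N into a parameter n. The XOR update is an arbitrary stand-in for the elided "..." part. The second function is the transformation strict aliasing permits, written by hand: a store through an int lvalue cannot modify an object of type str_2 *, so str->arr is loop-invariant.

```c
/* Hypothetical definitions matching the names on the slide. */
typedef struct str_2 { int field; } str_2;
typedef struct str_1 { str_2 *arr; } str_1;

/* As written: str->arr is reloaded on every iteration unless the compiler
   can prove that the store to ->field never overwrites str->arr. */
void foo_pattern(str_1 *str, int n, int val) {
    for (int i = 0; i < n; i++) {
        int v = str->arr[i].field;        /* LOAD of arr and LOAD of field */
        str->arr[i].field = v ^ val;      /* STORE to field (int) */
    }
}

/* After type-based disambiguation: int and str_2 * are incompatible,
   so the load of str->arr is hoisted out of the loop. */
void foo_pattern_hoisted(str_1 *str, int n, int val) {
    str_2 *arr = str->arr;                /* single, loop-invariant LOAD */
    for (int i = 0; i < n; i++) {
        int v = arr[i].field;
        arr[i].field = v ^ val;
    }
}
```

Note that n is passed as a parameter on purpose: a count stored in an int member of str_1 would not be provably invariant, since a store through an int lvalue may modify it.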

462.libquantum
In the lcc architecture-independent representation (EIR), the loop contains the following operations:
loop:
...
o1. READ str : str_1 *
o2. RD_FIELD o1.arr : str_2 *
o3. ADD_P o2, i : str_2 *
o4. RD_FIELD o3.field : int32
...
o5. WR_FIELD o3.field <- val : int32

462.libquantum
The strict-aliasing analysis builds a type-compatibility table for the three types:
          str_1 *   str_2 *   int32
str_1 *      1         0        0
str_2 *      0         1        0
int32        0         0        1
In this example all three types are mutually incompatible, so operations that access different types cannot refer to the same memory area.
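One way to read the table, assuming (as the alias-vector comments on the next slide suggest) that each memory operation carries its type's row as a bit vector and that two operations are disambiguated when the bitwise AND of their vectors is zero:

```latex
\begin{aligned}
v(\texttt{str\_1 *}) &= 100, \quad v(\texttt{str\_2 *}) = 010, \quad v(\texttt{int32}) = 001 \\
v(\text{load of } \texttt{arr}) \wedge v(\text{store to } \texttt{field}) &= 010 \wedge 001 = 000 \;\Rightarrow\; \text{no alias} \\
v(\text{load of } \texttt{field}) \wedge v(\text{store to } \texttt{field}) &= 001 \wedge 001 = 001 \;\Rightarrow\; \text{may alias}
\end{aligned}
```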

462.libquantum
The speedup comes from Elbrus-specific optimizations. The architecture-dependent IR of the loop is as follows:
loop:
...
o1. LOAD str->arr 0 -> r1 // Alias vector: 010
o2. ADD_P r1 i -> r2
o3. LOAD r2 offset(field) -> r3 // Alias vector: 001
...
o4. STORE r2 offset(field) val // Alias vector: 001
The strict-aliasing results (alias vectors 010 and 001) make it possible to disambiguate o1. LOAD and o4. STORE and to hoist the loop-invariant o1. LOAD out of the loop.

462.libquantum
With only one LOAD left in the loop, several further optimizations become applicable:
o1. LOAD str->arr 0 -> r1 // Alias vector: 010
loop:
...
o2. MOVA arr_buff
...
o3. ADD_P r1 i -> r2
o4. STORE r2 offset(field) val // Alias vector: 001
Before strict-aliasing:
weak pipelining
DAM applied
no APB
After strict-aliasing:
improved pipelining
no DAM
APB applied

Other tests
Almost all other tests (except 453.povray) have code patterns similar to those of 462.libquantum, but more complicated.
459.GemsFDTD and 437.leslie3d are Fortran benchmarks, but lcc translates them to C, so they benefit from the analysis as well.
The hot functions of 453.povray contain no loops: its 16% speedup comes solely from peephole improvements!

Strict-aliasing clients
Strict-aliasing results are used by:
Redundant Load/Store Elimination
Memory Runtime Optimizations (DAM, RTMD)
Loop Optimizations (APB, Pipelining)
Peephole

Compile Time
In general, the analysis has little impact on compilation time.
Compilation time speedup:
         lcc -O3 -ffast   lcc -O3 -ffast -fwhole   gcc -O3   gcc -O3 -flto
GMean    -3%              1%                       1%        2%
The size of the stored analysis results is linear in the number of operations in the procedure.

Summary
Advantages of strict-aliasing:
Relatively easy implementation
Works in per-module build mode
In some cases works with object fields
High scalability
Large execution speedups on a VLIW processor
Disadvantages of strict-aliasing:
Needs a complicated analysis for detecting strict-aliasing errors
Low precision

Conclusion
In this work:
A simple type-based alias analysis algorithm was described and implemented in the Elbrus compiler
Its impact on run-time and compile-time characteristics was analyzed
Further work
Extending the algorithm to disambiguate structure fields
Detailed study of strict-aliasing errors in a GNU/Linux distribution
Comparison of the precision of different pointer analyses
