MCST
Simple Type-Based Alias Analysis for a VLIW
Processor
Markin A. L. Alex.L.Markin@mcst.ru
Ermolitsky A. V. Alexander.V.Ermolitsky@mcst.ru
4 March 2017
Elbrus
Elbrus is a general-purpose VLIW (Very Long Instruction Word)
microprocessor.
Features:
23 instructions per tick
In-Order instruction execution
Array Access Unit (AAU) — asynchronous array loading from
memory to the Array Prefetch Buffer (APB)
Hardware support of loop pipelining
Disambiguation Access Memory (DAM) — hardware support
of pointer disambiguation
All these features vitally depend on good compiler optimization.
2 / 20
Pointer analysis
void foo(int * a, float * b) {
    for (int i = 1; i < N; i++) {
        a[0] += a[i];
        b[0] *= b[i];
    }
}
The purpose of pointer analysis is to detect whether a and b may
refer to the same memory area.
It is difficult because:
Lack of information about the program (in per-module build
mode)
Pointer analysis needs a lot of resources (in whole-program
mode)
Pointer analysis algorithms are complicated
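To see what is at stake, the sketch below contrasts the loop above with a hand-optimized version that keeps a[0] and b[0] in local accumulators. The second form is equivalent only when a and b do not overlap, which is exactly the fact pointer analysis has to establish. The value of N and the names foo_naive and foo_regs are assumptions made for this illustration.

```c
enum { N = 8 };  /* assumed loop bound; the slide leaves N undefined */

/* Loop as written on the slide: a[0] and b[0] are read and written
 * through memory on every iteration. */
void foo_naive(int *a, float *b) {
    for (int i = 1; i < N; i++) {
        a[0] += a[i];
        b[0] *= b[i];
    }
}

/* What a compiler may effectively produce once it knows that the
 * accesses through a and b cannot touch the same memory: both
 * accumulators live in registers and are stored once after the loop. */
void foo_regs(int *a, float *b) {
    int acc_a = a[0];
    float acc_b = b[0];
    for (int i = 1; i < N; i++) {
        acc_a += a[i];
        acc_b *= b[i];
    }
    a[0] = acc_a;
    b[0] = acc_b;
}
```

For non-overlapping arrays both functions leave the same values in a[0] and b[0]. If the two pointers could overlap, a store to a[0] might change a later b[i], and the second version would no longer be a legal transformation.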
3 / 20
Strict-aliasing
The C language allows pointers to be disambiguated by type:
7 An object shall have its stored value accessed only by an
lvalue expression that has one of the following types:
a type compatible with the effective type of the object,
a qualified version of a type compatible with the effective
type of the object,
a type that is the signed or unsigned type corresponding to
the effective type of the object,
a type that is the signed or unsigned type corresponding to
a qualified version of the effective type of the object,
an aggregate or union type that includes one of the
aforementioned types among its members (including, recursively, a
member of a subaggregate or contained union), or
a character type.
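As an illustration of the rule, the sketch below shows two accesses that stay within it: copying an object's bytes with memcpy and inspecting an object through a character type. The function names float_bits and byte_at are hypothetical, and the first function assumes a 32-bit float. Reading a float object through a uint32_t lvalue, as in *(uint32_t *)&f, would fall outside the list above and is what a strict-aliasing optimizer is entitled to assume never happens.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Well-defined: memcpy copies the object representation, so no float
 * object is accessed through a uint32_t lvalue.
 * Assumes sizeof(float) == sizeof(uint32_t). */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Well-defined: a character type may be used to access any object. */
unsigned char byte_at(const void *obj, size_t i) {
    return ((const unsigned char *)obj)[i];
}
```

On a platform with IEEE 754 single-precision floats, float_bits(1.0f) yields 0x3f800000.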
4 / 20
Algorithm
The strict-aliasing implementation for lcc (Elbrus C Compiler)
works with the architecture-independent IR (EIR).
General description:
1. Gather all relevant READ and WRITE operations
2. Generate a compatibility vector for each type the operations access
3. Assign the analysis results to the corresponding operations
Type-based alias analysis is implemented in all major compilers.
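The three steps can be sketched in C as follows. This is only a minimal model under stated assumptions, not the lcc implementation: the type universe (type_id), the mem_op record, and the simplified compatibility test are all hypothetical, and the real analysis works on EIR operations.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type universe; a real compiler enumerates its IR types. */
enum type_id { T_INT32, T_UINT32, T_FLOAT, T_CHAR, T_COUNT };

/* Step 1 result: one record per gathered READ or WRITE operation. */
struct mem_op {
    enum type_id type;   /* type of the accessed lvalue */
    uint32_t alias_vec;  /* bit t is set if type t may alias this access */
};

/* Simplified stand-in for the C compatibility rules: identical types,
 * signed/unsigned counterparts, and character types are compatible. */
static int types_compatible(enum type_id a, enum type_id b) {
    if (a == b) return 1;
    if (a == T_CHAR || b == T_CHAR) return 1;
    if ((a == T_INT32 && b == T_UINT32) ||
        (a == T_UINT32 && b == T_INT32)) return 1;
    return 0;
}

/* Step 2: build one compatibility bit vector per type. */
static void build_vectors(uint32_t vec[T_COUNT]) {
    for (int a = 0; a < T_COUNT; a++) {
        vec[a] = 0;
        for (int b = 0; b < T_COUNT; b++)
            if (types_compatible((enum type_id)a, (enum type_id)b))
                vec[a] |= UINT32_C(1) << b;
    }
}

/* Step 3: attach the vector of its type to every operation. */
void annotate_ops(struct mem_op *ops, size_t n) {
    uint32_t vec[T_COUNT];
    build_vectors(vec);
    for (size_t i = 0; i < n; i++)
        ops[i].alias_vec = vec[ops[i].type];
}

/* Client query: may two annotated operations touch the same memory? */
int may_alias(const struct mem_op *x, const struct mem_op *y) {
    return (int)((x->alias_vec >> y->type) & 1u);
}
```

With this encoding a client optimization answers an alias query with one shift and one mask, and the stored result per operation is a single machine word.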
5 / 20
Implementation characteristics
Pointer analysis: answers whether two pointers can refer to
the same memory area
Intraprocedural: does not require whole-program
information
Flow-insensitive: does not use information about the
program's control flow
Context-insensitive: does not use information from
function call sites
No memory modeling
Results are represented as a vector
6 / 20
Runtime results
[Bar chart. Benchmarks: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf,
445.gobmk, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar,
483.xalancbmk. Series: lcc module, lcc whole, gcc module, gcc lto.
Vertical axis: speedup, 0.90 to 1.15; one off-scale bar is labelled 17.49.]
Figure: Integer SPEC CPU2006 execution speedup (> 1 is better)
7 / 20
Runtime results
[Bar chart. Benchmarks: 416.gamess, 433.milc, 434.zeusmp, 435.gromacs,
436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex,
453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 482.sphinx3.
Series: lcc module, lcc whole, gcc module, gcc lto.
Vertical axis: speedup, 0.8 to 2.2.]
Figure: Floating point SPEC CPU2006 execution speedup (> 1 is better)
8 / 20
Runtime results
GMean speedup gained with the help of strict-aliasing:
                   lcc -O3 -ffast   lcc -O3 -ffast -fwhole   gcc -O3   gcc -O3 -flto
SPEC CPU2006 INT   28.6%            1.9%                     1%        0%
SPEC CPU2006 FP    13.3%            4.3%                     1.5%      1.1%
Testing environment:
lcc: Elbrus 4C (Elbrus v3 ISA)
gcc: Intel Xeon E5-2650 (x86-64 ISA)
9 / 20
Implementation Aspects
Problem: strict-aliasing violations are common, so a separate
analysis for detecting strict-aliasing errors was implemented
Problem: unions are hard to analyse at compile time, so they
are ignored
10 / 20
462.libquantum
This test ran 17.49 times faster after strict-aliasing analysis
was enabled in per-module build mode!
Three hottest functions have the same pattern:
void foo(str_1 * str) {
    for (int i = 0; i < N; i++)
    {
        str->arr[i].field;        // LOAD of arr and LOAD of field
        ...
        str->arr[i].field = val;  // STORE to field
    }
}
The dependence between the STORE to field and the LOAD of arr
prevents the invariant LOAD from being eliminated.
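At source level, the effect the compiler is after can be sketched as below. The struct layouts and the val parameter are assumptions (the slide does not define str_1 or str_2), and foo_hoisted is a hypothetical name for the form the loop takes once the LOAD of arr is known not to depend on the STORE to field.

```c
enum { N = 4 };  /* assumed loop bound */

/* Assumed layouts; the slide only shows the access pattern. */
typedef struct { int field; } str_2;
typedef struct { str_2 *arr; } str_1;

/* Pattern from the slide: str->arr is reloaded in every iteration,
 * because the store to field might, in principle, overwrite it. */
void foo(str_1 *str, int val) {
    for (int i = 0; i < N; i++)
        str->arr[i].field += val;   /* LOAD arr, LOAD field, STORE field */
}

/* With the dependence removed, the invariant LOAD moves out of the loop. */
void foo_hoisted(str_1 *str, int val) {
    str_2 *arr = str->arr;          /* loaded once */
    for (int i = 0; i < N; i++)
        arr[i].field += val;        /* only field accesses remain */
}
```

Both functions compute the same result whenever the int stores cannot modify the pointer str->arr, which is what the type-based rule guarantees for a conforming program.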
11 / 20
462.libquantum
In the lcc architecture-independent representation (EIR) we have
the following operations:
loop:
...
o1. READ str : str_1 *
o2. RD_FIELD o1.arr : str_2 *
o3. ADD_P o2, i : str_2 *
o4. RD_FIELD o3.field : int32
...
o4. WR_FIELD o3.field <- val : int32
12 / 20
462.libquantum
The strict-aliasing analysis builds a type-compatibility table for
the three types:
           str_1 *   str_2 *   int32
str_1 *    1         0         0
str_2 *    0         1         0
int32      0         0         1
In this example all three types are mutually incompatible, so the
operations working with them cannot refer to the same memory area.
13 / 20
462.libquantum
The speedup comes from Elbrus-specific optimizations. The
architecture-dependent IR of the loop is the following:
loop:
...
o1. LOAD str->arr 0 -> r1 // Alias vector: 010
o2. ADD_P r1 i -> r2
o3. LOAD r2 offset(field) -> r3 // Alias vector: 001
...
o4. STORE r2 offset(field) val // Alias vector: 001
The strict-aliasing results make it possible to disambiguate the
operations o1. LOAD and o4. STORE and to move the invariant
o1. LOAD out of the loop.
14 / 20
462.libquantum
With a single LOAD left in the loop, further optimizations
become possible:
o1. LOAD str->arr 0 -> r1 // Alias vector: 010
loop:
...
o2. MOVA arr_buff
...
o3. ADD_P r1 i -> r2
o4. STORE r2 offset(field) val // Alias vector: 001
Before strict-aliasing:
weak pipelining
DAM applied
no APB
After strict-aliasing:
improved pipelining
No DAM
APB
15 / 20
Other tests
Almost all other tests (except 453.povray) have code patterns similar
to those in 462.libquantum, but more complicated.
The tests 459.GemsFDTD and 437.leslie3d are Fortran tests, but
lcc translates them to C, so we can also see their speedup.
There are no loops in the hot functions of 453.povray. Its 16%
speedup comes solely from improved peephole optimization!
16 / 20
Strict-aliasing clients
Strict-aliasing results are used by:
Redundant Load/Store Elimination
Memory Runtime Optimizations: DAM, RTMD
Loop Optimizations: APB, Pipelining
Peephole
17 / 20
Compile Time
In general the impact of the analysis on the compilation time is low.
Compilation time speedup:
        lcc -O3 -ffast   lcc -O3 -ffast -fwhole   gcc -O3   gcc -O3 -flto
GMean   -3%              1%                       1%        2%
The size of the stored analysis results is linear in the number of
operations in the procedure.
18 / 20
Summary
Advantages of strict-aliasing:
Relatively easy implementation
Works in per-module build mode
In some cases works with object fields
High scalability
Large execution speedup on a VLIW processor
Disadvantages of strict-aliasing:
Needs complicated analysis for detecting strict-aliasing errors
Low precision
19 / 20
Conclusion
In this work:
A simple type-based alias analysis algorithm was described and
implemented for the Elbrus compiler
Its impact on run time and compile time was
analyzed
Further work
Extending the algorithm to disambiguate fields of structures
Detailed study of strict-aliasing errors in a GNU/Linux
distribution
Comparison of the precision of different pointer analyses
20 / 20

TMPA-2017: Simple Type Based Alias Analysis for a VLIW Processor
