GCC porting
Use instruction patterns to describe the target ISA
Shiva Chen
shiva0217@gmail.com
May 2013

Outline
- Compiler structure
- Intermediate languages in GCC
- Optimization passes in GCC
- Define instruction pattern
- Operand constraints
- Match instruction pattern
- Strict RTL
- Target defined constraints
- Emit assembly code
- Target information usage
- Reserved words to describe instruction patterns
- Example of instruction pattern
- Split instruction pattern
- Instruction attribute
- Peephole pattern
- Instruction scheduling

- Three main intermediate language formats in GCC
- GENERIC
  Language-independent representation generated by each front end.
  Common representation for all the languages supported by GCC.
- GIMPLE
  Used for language-independent and target-independent optimization.
- RTL
  Used for optimizations that take target features into account, as described by the porting code.

GIMPLE optimization passes in GCC 4.6.2
004t.gimple
006t.vcg
009t.omplower
010t.lower
012t.eh
013t.cfg
017t.ssa
018t.veclower
019t.inline_param1
020t.einline
021t.early_optimizations
022t.copyrename1
023t.ccp1
024t.forwprop1
025t.ealias
026t.esra
027t.copyprop1
028t.mergephi1
029t.cddce1
030t.eipa_sra
031t.tailr1
032t.switchconv
034t.profile
035t.local-pure-const1
036t.fnsplit
037t.release_ssa
038t.inline_param2
057t.copyrename2
058t.cunrolli
059t.ccp2
060t.forwprop2
062t.alias
063t.retslot
064t.phiprop
065t.fre
066t.copyprop2
067t.mergephi2
068t.vrp1
069t.dce1
070t.cselim
071t.ifcombine
072t.phiopt1
073t.tailr2
074t.ch
076t.cplxlower
077t.sra
078t.copyrename3
079t.dom1
080t.phicprop1
081t.dse1
082t.reassoc1
083t.dce2
084t.forwprop3
085t.phiopt2
086t.objsz
087t.ccp3
088t.copyprop3
090t.bswap
091t.crited
092t.pre
093t.sink
094t.loop
095t.loopinit
096t.lim1
097t.copyprop4
…
143t.optimized

RTL optimization passes in GCC 4.6.2
004t.gimple
(other GIMPLE passes)
144r.expand
145r.sibling
147r.initvals
148r.unshare
149r.vregs
150r.into_cfglayout
151r.jump
152r.subreg1
153r.dfinit
154r.cse1
155r.fwprop1
156r.cprop1
158r.hoist
159r.cprop2
162r.ce1
163r.reginfo
164r.loop2
165r.loop2_init
166r.loop2_invariant
170r.loop2_done
172r.cprop3
173r.cse2
174r.dse1
175r.fwprop2
176r.auto_inc_dec
177r.init-regs
178r.dce
179r.combine
180r.ce2
182r.regmove
183r.outof_cfglayout
184r.split1
185r.subreg2
188r.asmcons
190r.sched1
191r.ira
192r.postreload
194r.split2
198r.pro_and_epilogue
199r.dse2
200r.csa
201r.peephole2
202r.ce3
204r.cprop_hardreg
205r.dce
206r.bbro
208r.split4
209r.sched2
212r.alignments
215r.mach
216r.barriers
217r.dbr
218r.split5
220r.shorten
221r.nothrow
222r.final
223r.dfinish
224t.statistics

- Why are the optimization passes divided into GIMPLE passes and RTL passes?
- GIMPLE passes still see high-level semantics
  Ex: switch, array, structure, variable
  Some optimizations are easier to design while the high-level semantics still exist
- However, GIMPLE passes lack target information
  Ex: instruction length (size), supported ISA
  Therefore, RTL optimization passes are needed as well

Define instruction pattern
- Every RTL pattern must match the target ISA
- How do we tell GCC to generate RTL that matches the ISA?
  Instruction patterns
- Use define_expand and define_insn to describe the instruction patterns that the target supports
(define_insn "addsi3"
  [(set (match_operand:SI 0 "register_operand" "=r,r")
        (plus:SI (match_operand:SI 1 "register_operand" "%r,r")
                 (match_operand:SI 2 "nonmemory_operand" "r,i")))]
  ... )

Define instruction pattern
- GCC already defines a set of instruction pattern names and the semantics of each pattern
  - addsi3: addition with three SI-mode operands
- GCC does not know the operand constraints of the target
- How do we tell GCC the target's operand constraints for each instruction?
  Predicates
  Constraints

Operand Constraints
- Multiple alternative constraints (see the addsi3 pattern)
Predicates: register_operand, nonmemory_operand
Constraints: r, i
The predicate must accept everything that any constraint of the operand accepts
For operand 2 with SI mode:
r (register) belongs to nonmemory_operand
i (immediate) belongs to nonmemory_operand

Operand Constraints
- GCC already has predicates to restrict operands
- Why is the constraint field needed?
  It gives GCC the opportunity to change an operand during optimization
- Ex:
movi $r0, 4
add $r1, $r1, $r0 {addsi3}
After constant propagation:
=> addi $r1, $r1, 4 {addsi3}

Operand Constraints
- GCC uses two levels of operand restriction (predicate and constraint)
- Instructions with the same semantics are grouped into a single instruction pattern (addsi3)
- Many ISAs have several assembly instructions with the same semantics but different operand constraints
- Grouping them reduces the number of instruction patterns needed when porting

Operand Constraints
- GCC uses the instruction patterns to check ISA support whenever it generates a new RTL pattern
- Does the back end define the pattern? Checked against define_insn
- Is the operand type supported? Checked by the predicate
- Which alternative does the operand belong to? Checked by the constraint

Operand Constraints
In the addsi3 pattern:
- The first alternative of the constraints matches "add"
- The second alternative of the constraints matches "addi"
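The alternative selection described above can be sketched in C. This is only an illustration, not GCC's own matcher: the operand kinds, the function names `constraint_accepts` and `find_alternative`, and the one-letter-per-operand encoding ("rrr" for add, "rri" for addi) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operand kinds; real GCC operands are RTL expressions. */
enum operand_kind { OPERAND_REG, OPERAND_IMM, OPERAND_MEM };

/* Return nonzero if a single constraint letter accepts the operand kind.
   Only the letters 'r', 'i' and 'm' used on the slides are modelled. */
int constraint_accepts(char letter, enum operand_kind kind)
{
    switch (letter) {
    case 'r': return kind == OPERAND_REG;
    case 'i': return kind == OPERAND_IMM;
    case 'm': return kind == OPERAND_MEM;
    default:  return 0;
    }
}

/* Each alternative holds one constraint letter per operand.
   Returns the index of the first alternative whose letters accept
   every operand, or -1 if no alternative matches. */
int find_alternative(const char *const *alternatives, size_t n_alternatives,
                     const enum operand_kind *operands, size_t n_operands)
{
    assert(alternatives != NULL && operands != NULL);
    for (size_t alt = 0; alt < n_alternatives; alt++) {
        size_t op = 0;
        while (op < n_operands && alternatives[alt][op] != '\0' &&
               constraint_accepts(alternatives[alt][op], operands[op]))
            op++;
        if (op == n_operands)
            return (int)alt;
    }
    return -1;
}

/* The two addsi3 alternatives from the slides: add and addi. */
const char *const addsi3_alternatives[] = { "rrr", "rri" };
```

With three register operands the sketch selects alternative 0 (add); with an immediate as operand 2 it selects alternative 1 (addi); a memory operand matches neither alternative.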

Match instruction pattern
Ex:
(set (reg/f:SI 88)
     (plus:SI (reg:SI 87)
              (reg/v:SI 55)))
1. Parse the RTL pattern
(set (op0)
     (plus:SI (op1)
              (op2)))

Match instruction pattern
- When is a new RTL pattern generated?
- In the RTL expand phase (GIMPLE to RTL)
- During optimization
Ex: before the combine phase
(set (reg/f:SI 47)
     (lshiftrt:SI (reg:SI 60)
                  (const_int 2)))
(set (reg/f:SI 88)
     (plus:SI (reg:SI 47)
              (reg:SI 55)))
srli $r47, $r60, 2
add $r88, $r47, $r55
After the combine phase
(set (reg/f:SI 88)
     (plus:SI (lshiftrt:SI (reg:SI 60)
                           (const_int 2))
              (reg/v:SI 55)))
add_srli $r88, $r55, $r60, 2

Strict RTL
- Does a newly generated RTL pattern always satisfy the constraints?
- GCC allows certain constraint mismatches that reload can fix later
- The predicate must always be satisfied
Without optimization1: RTL1 -> RTL1 -> Reload -> RTL3
With optimization1: RTL1 -> RTL2 -> Reload -> RTL4
RTL2 does not satisfy the constraints
1. RTL3 and RTL4 satisfy the constraints
2. RTL4 is better than RTL3

Strict RTL
- A constraint may be left unmatched before reload, in the hope that reload will fix it
- Ex: the constraint is m (memory) but the current operand is a constant; GCC allows this before reload
- The reload phase runs after register allocation
  In fact, during register allocation GCC calls reload repeatedly while an operand does not fit its constraint.
- After reload, every operand must satisfy one of its constraint alternatives (strict RTL)

Strict RTL
(define_insn "movsi"
  [(set (match_operand:SI 0 "register_operand" "=r,m")
        (match_operand:SI 1 "register_operand" "r,r"))]
  ... )
Before register allocation:
(set (reg/f:SI 47)
     (reg:SI 60))
After register allocation (RA), assuming pseudo register r60 is assigned to r3 and the hardware registers are exhausted:
(set (reg/f:SI 47)
     (reg:SI 3))
After reload:
(set (mem:SI (plus (sp) (const)))
     (reg:SI 3))

Target defined constraints
- A target can define its own predicates and constraints
- Target defined predicate
(define_predicate "index_operand"
  (ior (match_operand 0 "register_operand")
       (and (match_operand 0 "const_int_operand")
            (match_test "INTVAL (op) < 4096
                         && INTVAL (op) > -4096"))))

Target defined constraints
 Target defined constraint
(define_register_constraint "l"
"LO_REGS"
"registers r0->r7.")
(define_memory_constraint "Uv"
"@internal In ARM/Thumb-2 state a valid VFP load/store address."
(and (match_code "mem")
(match_test "TARGET_32BIT
&& arm_coproc_mem_operand (op, FALSE)")))

Emit assembly code
- Multiple alternative constraints
(define_insn "addsi3"
  [(set (match_operand:SI 0 "register_operand" "=r,r")
        (plus:SI (match_operand:SI 1 "register_operand" "%r,r")
                 (match_operand:SI 2 "nonmemory_operand" "r,i")))]
  ""
  "@
   add %0, %1, %2
   addi %0, %1, %2"
)
Ex:
(set (reg/f:SI 3)
     (plus:SI (reg:SI 4)
              (reg:SI 5)))
The operands match the first alternative of the constraints, which selects "add"
Output assembly code: "add $r3, $r4, $r5"

Target information usage
- When does GCC use the target information obtained from the instruction patterns?
- RTL instruction pattern generation
  insn-emit.c is generated by parsing the instruction patterns when GCC is built
- RTL instruction validation (is the instruction supported by the target?)
  insn-recog.c is generated by parsing the instruction patterns when GCC is built
- Emitting target assembly code
  insn-output.c is generated by parsing the instruction patterns when GCC is built

Reserved words to describe instruction patterns
                                RTL generation  RTL validation  Emit assembly
define_insn "named pattern"     yes             yes             yes
define_expand "named pattern"   yes             no              no
define_insn "*.."               no              yes             yes
- GCC defines a number of named patterns, each with fixed semantics, and uses them to generate RTL during the RTL expand phase
  ex: addsi3, subsi3, movsi, movhi ...
- Some target instructions have semantics that no GCC named pattern covers, but their RTL can still be generated by an optimization
  ex: add_slli can be generated by the combine phase
- Defining an unnamed pattern makes such an instruction valid
  define_insn "*add_slli"
- A define_insn whose name has the * prefix is treated as an unnamed pattern

Example of instruction pattern
;; These control RTL generation for conditional jump insns
(define_expand "cbranchsi4"
  [(set (pc)
        (if_then_else (match_operator 0 "ordered_comparison_operator"
                        [(match_operand:SI 1 "nonmemory_nonsymbol_operand" "")
                         (match_operand:SI 2 "nonmemory_nonsymbol_operand" "")])
                      (label_ref (match_operand 3 "" ""))
                      (pc)))]
  ""
{
  sh_expand_cbranchsi4 (operands);
  DONE;
}
)
Semantics of "cbranchsi4":
compare operand 1 and operand 2 using operator 0,
and branch to label 3 if the comparison is true
The predicate "ordered_comparison_operator" accepts EQ, NE, LT, LTU, LE, LEU, GT, GTU, GE, GEU.
The porting function sh_expand_cbranchsi4 generates the RTL pattern

(define_insn "*bcondz"
  [(set (pc)
        (if_then_else (match_operator 0 "bcondz_operator"
                        [(match_operand:SI 1 "register_operand" "r")
                         (const_int 0)])
                      (label_ref (match_operand 2 "" ""))
                      (pc)))]
  ""
{
  switch (GET_CODE (operands[0]))
    {
    case EQ:
      return "beqz %1, %2";
    case NE:
      return "bnez %1, %2";
    case LT:
      return "bltz %1, %2";
    case LE:
      return "blez %1, %2";
    case GT:
      return "bgtz %1, %2";
    case GE:
      return "bgez %1, %2";
    default:
      gcc_unreachable ();
    }
}
)
The unnamed pattern "*bcondz" is used to validate RTL and to emit
assembly code for branches that compare with zero

(define_insn "one_cmplsi2"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (not:SI (match_operand:SI 1 "register_operand" "r")))]
  ""
  "nor\t%0, %1, %1")
Semantics of "one_cmplsi2":
bitwise NOT of operand 1, stored in operand 0
The named pattern "one_cmplsi2" is used to generate RTL, validate RTL
and output assembly code
It outputs the assembly "nor ra, rb, rb" to match the semantics

Split instruction pattern
- When is splitting an instruction pattern needed?
- When a const_int value is too big to be encoded in a single assembly instruction
  Split the const_int into a high part and a low part
  The constant could be split in define_expand
- But that is not good enough. Why?
- Splitting the constant too early loses opportunities to optimize the RTL pattern

- The optimization phase "move2add" can do the following (assembly code is used to present the RTL semantics for convenience)
Before move2add:
move $r0, 123456
move $r1, 123457
move $r2, 123458
After move2add:
move $r0, 123456
addi $r1, $r0, 1
addi $r2, $r0, 2
If the const_int is split into high and low parts too early:
sethi $r0, hi20(123456)
ori $r0, lo12(123456)
sethi $r1, hi20(123457)
ori $r1, lo12(123457)
sethi $r2, hi20(123458)
ori $r2, lo12(123458)
move2add will fail to turn the moves into adds
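The hi20/lo12 decomposition used in the sethi/ori sequence can be sketched as follows. This assumes that sethi places its 20-bit immediate in bits 31..12 and clears the low 12 bits, and that ori merges an unsigned 12-bit value; the slides do not spell out the encoding, so treat both as assumptions. The function names are chosen for this sketch only.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 20 bits of a 32-bit constant, as loaded by sethi. */
uint32_t hi20(uint32_t value)
{
    return value >> 12;
}

/* Lower 12 bits of a 32-bit constant, as merged in by ori. */
uint32_t lo12(uint32_t value)
{
    return value & 0xFFFu;
}

/* Rebuild the constant the way the sethi/ori pair is assumed to:
   sethi fills bits 31..12, ori fills bits 11..0. */
uint32_t sethi_ori(uint32_t high, uint32_t low)
{
    assert(high <= 0xFFFFFu && low <= 0xFFFu);
    return (high << 12) | low;
}
```

For 123456, 123457 and 123458 the high parts are identical and only the low parts differ, but once each move has been split into two instructions the simple "constant plus small offset" relation that move2add looks for is no longer visible.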

- How can an instruction pattern be split later than the RTL expand phase?
- Use define_split or define_insn_and_split

Split passes in the RTL pass list: 184r.split1, 194r.split2, 208r.split4, 218r.split5

(define_insn_and_split "*movsi_const"
  [(set (match_operand:WORD 0 "register_operand" "=r,r")
        (match_operand:WORD 1 "immediate_operand" "P,i"))]
  ""
{
  if (GET_CODE (operands[1]) == CONST_INT
      && SIGNED_INT_FITS_N_BITS (INTVAL (operands[1]), 20))
    {
      return "movi\t%0, %1";
    }
  else
    return "#";
}
  "reload_completed && GET_CODE (operands[1]) == CONST_INT
   && ! SIGNED_INT_FITS_N_BITS (INTVAL (operands[1]), 20)"
  [(set (match_dup 0) (high:SI (match_dup 1)))
   (set (match_dup 0) (lo_sum:SI (match_dup 0) (match_dup 1)))]
  ...)
If the const_int does not fit in a signed 20-bit field, the output code returns "#",
which means the pattern will be split in a split phase

Split condition:
reload_completed (after reload)
&& the const_int does not fit in a signed 20-bit field
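The range test behind the split condition can be sketched in C. The real definition of SIGNED_INT_FITS_N_BITS is not shown on the slides, so the function below is an assumed equivalent based on the usual two's-complement range, and `needs_split` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed equivalent of a check like SIGNED_INT_FITS_N_BITS: true when
   value lies in the range of an n-bit signed field,
   i.e. -2^(n-1) <= value <= 2^(n-1) - 1. */
bool signed_int_fits_n_bits(int64_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    int64_t limit = (int64_t)1 << (bits - 1);
    return value >= -limit && value < limit;
}

/* Decision taken by the "*movsi_const" pattern: small constants use a
   single movi, larger ones are split after reload. */
bool needs_split(int64_t value)
{
    return !signed_int_fits_n_bits(value, 20);
}
```

Under this definition a 20-bit field covers -524288 to 524287, so a constant such as 123456 is emitted with a single movi, while 1048576 would be split.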

New RTL patterns after the split:
the first sets the high part, the second adds the low part with lo_sum
(match_dup 0) means that operand 0 is duplicated into this field

(define_split
  [(set (match_operand:ANY64 0 "register_operand" "")
        (match_operand:ANY64 1 "register_operand" ""))]
  "reload_completed &&
   (! USE_V3_SERISE_ISA)"
  [(set (match_dup 0) (match_dup 1))
   (set (match_dup 2) (match_dup 3))]
  "…
The split condition is reload_completed && not V3 ISA
V3 has movd44, which can do a 64-bit register move
ANY64: DI, DF
DI: double int
DF: double float
                        Split RTL  RTL validation  Emit assembly
define_split            yes        no              no
define_insn_and_split   yes        yes             yes

Instruction attribute
(define_attr "attribute_name" "value domain" (default value))
(define_attr "type"
  "unknown,load,store,bequal, alu, .."
  (const_string "unknown"))
…
(define_insn "cmovn"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (if_then_else (ne:SI (match_operand:SI 1 "register_operand" "r")
                             (const_int 0))
                      (match_operand:SI 2 "register_operand" "r")
                      (match_operand:SI 3 "register_operand" "0")))]
  ""
  "cmovn\t%0, %2, %1"
  [(set_attr "type" "alu")
   (set_attr "length" "4")])

Instruction attribute
- The "type" attribute divides the instructions into several instruction groups
- This helps when writing the instruction scheduling porting code
- The "length" attribute gives the length (size) of each instruction, so that GCC can calculate branch distances correctly

Peephole pattern
;; Merge move 0 to bcondz
(define_peephole2
  [(set (match_operand:SI 0 "register_operand" "") (const_int 0))
   (set (pc)
        (if_then_else (match_operator 1 "bcondz_operator"
                        [(match_dup 0)
                         (match_operand:SI 2 "register_operand" "r")])
                      (label_ref (match_operand 3 "" ""))
                      (pc)))]
  "peep2_reg_dead_p (2, operands[0])"
  [(set (pc)
        (if_then_else:SI (match_dup 1)
                         (label_ref (match_dup 3)) (pc)))]
  "
{
  operands[1] = gen_rtx_fmt_ee (swap_condition (GET_CODE (operands[1])),
                                SImode, operands[2], GEN_INT (0));
}")
Old RTL:
movi $r0, 0
bne $r0, $r1, L3
New RTL:
bnez $r1, L3
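The peephole moves the zero from the left-hand side of the comparison to the right-hand side, which is why the comparison code has to go through swap_condition. The sketch below models that idea on a small, hypothetical enum instead of GCC's rtx codes; `swap_cmp` and `eval_cmp` are names invented for this illustration.

```c
#include <assert.h>

/* Hypothetical subset of comparison codes; GCC's enum rtx_code is larger. */
enum cmp_code { CMP_EQ, CMP_NE, CMP_LT, CMP_LE, CMP_GT, CMP_GE };

/* Code to use after exchanging the two compared operands:
   "a < b" is the same test as "b > a"; EQ and NE are symmetric. */
enum cmp_code swap_cmp(enum cmp_code code)
{
    switch (code) {
    case CMP_EQ: return CMP_EQ;
    case CMP_NE: return CMP_NE;
    case CMP_LT: return CMP_GT;
    case CMP_LE: return CMP_GE;
    case CMP_GT: return CMP_LT;
    case CMP_GE: return CMP_LE;
    }
    assert(0 && "unknown comparison code");
    return code;
}

/* Evaluate "lhs code rhs" so the swap can be checked on real values. */
int eval_cmp(enum cmp_code code, int lhs, int rhs)
{
    switch (code) {
    case CMP_EQ: return lhs == rhs;
    case CMP_NE: return lhs != rhs;
    case CMP_LT: return lhs < rhs;
    case CMP_LE: return lhs <= rhs;
    case CMP_GT: return lhs > rhs;
    case CMP_GE: return lhs >= rhs;
    }
    return 0;
}
```

In the slide's example the code is NE, which is symmetric, so "bne $r0, $r1" with $r0 = 0 becomes "bnez $r1" unchanged; a comparison such as "0 < $r1" would instead become "$r1 > 0".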

Instruction scheduling
- Instruction scheduling is an optimization pass in GCC
- It reorders instructions without changing the semantics of the code
- The goal is to reduce pipeline stalls and so improve performance
- Instruction scheduling belongs to the RTL phase

- GCC has two scheduling passes
- Sched1
  Interblock scheduling before register allocation
  - Tries to find the innermost loop as a region
  - Schedules the instructions in the region
  - Improves the performance of hot spots (innermost loops)
  - Extends the scope to a region to find more scheduling opportunities
- Sched2
  Single basic block scheduling after register allocation
  Register allocation may produce spill code (load/store)
  - The code needs to be scheduled again

- Instruction scheduling resolves the following hazards to prevent pipeline stalls
- Structural hazard
  A structural hazard occurs when two or more instructions need the same function unit at the same time
- Data hazard
  RAW (read after write): a true dependency
  WAR (write after read): an anti-dependency
  WAW (write after write): an output dependency
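The three data hazards can be told apart by comparing the registers that two instructions read and write. The following C sketch is a simplified model, assuming each instruction has one destination and at most two source registers; `struct insn_regs` and `classify_hazard` are hypothetical names and do not correspond to GCC's dependence analysis.

```c
#include <assert.h>
#include <stddef.h>

enum hazard { HAZARD_NONE, HAZARD_RAW, HAZARD_WAR, HAZARD_WAW };

/* Hypothetical instruction summary: one destination register and up to
   two source registers, with -1 meaning "no register". */
struct insn_regs {
    int dest;
    int src1;
    int src2;
};

/* Nonzero if the instruction reads the given register. */
static int reads(const struct insn_regs *insn, int reg)
{
    return reg >= 0 && (insn->src1 == reg || insn->src2 == reg);
}

/* Classify the dependency of 'later' on 'earlier' in program order.
   RAW is checked first because it is the true dependency. */
enum hazard classify_hazard(const struct insn_regs *earlier,
                            const struct insn_regs *later)
{
    assert(earlier != NULL && later != NULL);
    if (reads(later, earlier->dest))
        return HAZARD_RAW;
    if (reads(earlier, later->dest))
        return HAZARD_WAR;
    if (earlier->dest >= 0 && earlier->dest == later->dest)
        return HAZARD_WAW;
    return HAZARD_NONE;
}
```

For example, "movi $r0, 0" followed by "add $r0, $r0, $r1" is a RAW dependency on $r0, while "add $r0, $r0, $r1" followed by "movi $r1, 1" is a WAR dependency on $r1.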

- GCC provides several interfaces to describe the pipeline model
- After parsing the pipeline description in the porting code, GCC generates an automaton that serves as a pipeline hazard recognizer
  The automaton determines whether the processor can issue an instruction on a given simulated cycle
(define_automaton "name")

(define_automaton "a1")
(define_cpu_unit "decode1,decode2" "a1")
(define_cpu_unit "div" "a1")
(define_insn_reservation "alu_class" 1
  (eq_attr "type" "alu")
  "decode + alu")
(define_insn_reservation "mult_class" 1
  (eq_attr "type" "mult")
  "decode + mult")
a1: the automaton name
decode1, decode2, div: the CPU units (function units) in the processor
define_insn_reservation: describes the pipeline rule for each instruction class
alu_class, mult_class: insn-name (insn class)
(eq_attr "type" "alu"): the rule matches when the type attribute of the instruction pattern is alu
"decode + alu": a regular expression that describes the function unit usage
1: the default latency in cycles when a data dependency occurs

(define_insn "addsi3"
  [(set (match_operand:SI 0 "register_operand" "=r,r")
        (plus:SI (match_operand:SI 1 "register_operand" "%r,r")
                 (match_operand:SI 2 "nonmemory_operand" "r,i")))]
  ""
  "@
   add %0, %1, %2
   addi %0, %1, %2"
  [(set_attr "type" "alu")])
The type attribute "alu" selects the alu_class reservation "decode + alu"

(define_automaton "a1")
(define_cpu_unit "decode1,decode2" "a1")
(define_cpu_unit "alu" "a1")
(define_cpu_unit "mult" "a1")
(define_insn_reservation "alu_class" ... "decode + alu")
(define_insn_reservation "mult_class" ... "decode + mult")
[Figure: automaton state diagram with the states "nothing", "decode + alu" and "decode + mult", and transitions labelled alu_class, mult_class and next_cycle]
A state is the current CPU function unit usage; a transition leads to the function unit usage of the next cycle
State transition:
1. Occupy some function units
2. Release some function units

(define_bypass 2 "alu_class" "alu_class")
(define_bypass 3 "mult_class" "mult_class")
producer    consumer    t: 1 2 3 4 5
alu_class   alu_class      1 0 0 0 0
alu_class   mult_class     0 0 0 0 0
mult_class  alu_class      0 0 0 0 0
mult_class  mult_class     1 1 0 0 0
1 means the consumer will stall at cycle t
t is the number of cycles after the producer

[Figure: the producer/consumer stall table combined with the current state, which holds one row of stall bits over t = 1..5 for each consumer class (alu_class, mult_class)]

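1. movi $r0, 0 {alu}
2. movi $r1, 1 {alu}
3. add $r0, $r0, $r1 {alu}
4. lwi $r4, [$sp + 4] {load}
5. mul $r5, $r0, $r4 {mul}
(define_insn_reservation "alu_class" ... "decode + alu")
(define_insn_reservation "mult_class" ... "decode + mult")
(define_insn_reservation "load_class" 1
  (eq_attr "type" "load")
  "decode + mem")
The priority of each instruction is calculated bottom-up:
P = max over successors of (latency + priority of the successor)
Priorities: insn 5: 1; insns 3 and 4: 2; insns 1 and 2: 3
Dataflow graph: 1 -> 3, 2 -> 3, 3 -> 5, 4 -> 5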
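The bottom-up priority calculation can be sketched in C for the five-instruction example. The edge list is derived from the register dependencies of the example (insn 3 reads the results of insns 1 and 2, insn 5 reads the results of insns 3 and 4); the unit latency for every instruction is an assumption of this sketch, and the names are hypothetical.

```c
#include <assert.h>

#define N_INSNS 5

/* Dependence edges of the example, 0-based: {producer, consumer}. */
static const int edges[][2] = { {0, 2}, {1, 2}, {2, 4}, {3, 4} };
#define N_EDGES (sizeof edges / sizeof edges[0])

/* Assumed unit latency for every instruction (an assumption of this
   sketch; a real port takes latencies from its pipeline description). */
static const int latency[N_INSNS] = { 1, 1, 1, 1, 1 };

/* Fill priority[] bottom-up: an insn without successors gets its own
   latency, otherwise its latency plus the largest successor priority.
   Edges only go from lower to higher indices, so one reverse sweep
   visits every successor before its producers. */
void compute_priorities(int priority[N_INSNS])
{
    for (int insn = N_INSNS - 1; insn >= 0; insn--) {
        int best = 0;
        for (unsigned e = 0; e < N_EDGES; e++) {
            assert(edges[e][0] < edges[e][1]);
            if (edges[e][0] == insn && priority[edges[e][1]] > best)
                best = priority[edges[e][1]];
        }
        priority[insn] = latency[insn] + best;
    }
}
```

With these assumptions insn 5 gets priority 1, insns 3 and 4 get priority 2, and insns 1 and 2 get priority 3, so the instructions at the start of the longest dependency chain are preferred when the scheduler picks from the ready list.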

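Dataflow graph: 1 -> 3, 2 -> 3, 3 -> 5, 4 -> 5
Ready list: 1 2 4
Pending list: 3 5
Queued list:
Scheduled list:
[Figure: insns move between the Ready, Pending, Queued and Scheduled lists; the arrows are labelled "Scheduled", "Dependency resolved" and "Data hazard"]
Pick the max-priority insn from the Ready list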
Scheduling the example cycle by cycle (the lists show the state after each cycle):
cycle  issued insn                   Ready  Pending  Queued  Scheduled
1      1. movi $r0, 0 {alu}          4      3 5      2       1
2      4. lwi $r4, [$sp + 4] {load}  2      3 5              1 4
3      2. movi $r1, 1 {alu}                 5        3       1 4 2
4      (none)                        3      5                1 4 2
5      3. add $r0, $r0, $r1 {alu}    5                       1 4 2 3
6      5. mul $r5, $r0, $r4 {mul}                            1 4 2 3 5
At cycle 1, insn 2 goes to the Queued list because of (define_bypass 2 "alu_class" "alu_class")

Switch initialization conversion in the GIMPLE optimization passes
int a, b;

switch (argc)
  {
  case 1:
  case 2:
    a = 8;
    b = 6;
    break;
  case 3:
    a = 9;
    b = 5;
    break;
  case 12:
    a = 10;
    b = 4;
    break;
  default:
    a = 16;
    b = 1;
  }
The pass tries to transform the switch statement into static array accesses:
static const int CSWTCH01[] = {6, 6, 5, 1, 1, 1, 1, 1, 1, 1, 1, 4};
static const int CSWTCH02[] = {8, 8, 9, 16, 16, 16, 16, 16, 16, 16, 16, 10};

/* The unsigned comparison also rejects argc < 1; index 11 is case 12.  */
if (((unsigned) argc) - 1 <= 11)
  {
    a = CSWTCH02[argc - 1];
    b = CSWTCH01[argc - 1];
  }
else
  {
    a = 16;
    b = 1;
  }
