F Cg
Usage Rights

© All Rights Reserved

F Cg Presentation Transcript

  • 1. Section F Code Generation
  • 2. Target Information Tables (targ_info)
    • Originated from Cydrome, with major enhancements
    • Parameterized description of the target instruction set, ABI and scheduling information
    • In the form of C++ code
      − Compiled and linked with table-generating routines
      − Generated tables are C++ files used by CG
    • Separates architecture details from optimization algorithms
    • Minimizes compiler changes when retargeting to a new architecture
    • Target-specific content lives in target-specific sub-directories
    • Supports ISA subsets
    • Scheduling info for different processor implementations is built as separate .so's
      − A compilation flag controls which .so to use in dlopen()
  • 3. Code Generator Intermediate Representation (CGIR)
    • Each target op corresponds to a machine instruction
    • Format established by targ_info
    • Operands and results in a target op are TNs (or symbolic registers)
    • Different types of TNs:
      − Symbols
      − Literals
      − Registers
    • Types of TNs must conform to the instruction format
    • Each operation takes two operands and writes to one result (RISC-like)
    • Special instructions can write two results (e.g. mul)
  • 4. Code Generator Phase Structure
    [Figure: phase diagram from WHIRL input to assembly (.s) output. Phases named in the diagram: CG-expand (WHIRL to CGIR), extended block optimizer, control flow optimization, if-conversion, loop optimizations (software pipelining, loop unrolling), scheduling pre-pass, global register allocation, local register allocation, scheduling post-pass, prolog and epilog, code emission.]
  • 5. CG Expansion
    • Expand WHIRL into machine operations in CGIR (whirl2ops.cxx, cgexp.cxx, x8664/exp*.cxx)
    • TNs created to store intermediate results
    • Single assignment for each TN to minimize dependencies
    • Rest of CG works on CGIR
    • Each machine op contains a pointer to WHIRL for alias and dependency information look-up
    • Very target-specific
    • For X86, a pass before LRA enforces the two-operand instruction format by generating copy instructions:
      − r = s + t becomes r = s; r += t
    • Register coalescing in LRA removes unnecessary copies
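The two-operand rewrite on this slide can be sketched as follows. This is only an illustration: the tuple layout `(dest, opcode, src1, src2)`, the `copy` opcode name, the `COMMUTATIVE` set and the function name are assumptions made for the example and are not taken from the compiler sources.

```python
# Hypothetical opcode names; only used to decide when operands may be swapped.
COMMUTATIVE = {"add", "mul", "and", "or", "xor"}

def to_two_operand(ops):
    """Lower (dest, opcode, src1, src2) tuples to two-operand form,
    where the destination of every op is also its first source."""
    lowered = []
    for dest, opcode, src1, src2 in ops:
        if dest == src2 and dest != src1:
            if opcode not in COMMUTATIVE:
                # Copying src1 into dest would clobber src2; not handled in this sketch.
                raise NotImplementedError("dest aliases src2 of a non-commutative op")
            src1, src2 = src2, src1  # swap so that dest is the first source
        if dest != src1:
            lowered.append((dest, "copy", src1, None))  # r = s
        lowered.append((dest, opcode, dest, src2))      # r += t
    return lowered

lowered = to_two_operand([("r", "add", "s", "t")])
```

The copy emitted here is the kind of instruction that register coalescing can later delete when `r` and `s` end up in the same register.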
  • 6. Extended Block Optimizer
    • Performs peephole optimization within extended basic blocks (ebo.cxx)
    • An extended basic block is constructed by concatenating basic blocks
      − Single entry
      − Multiple side exits allowed
    • Optimizations:
      − Forward propagation
      − Common subexpression elimination
      − Constant folding
      − Dead code elimination (redundant stores and loads)
      − Target-specific optimizations (x86 load-execute instructions, x86 LEA for add)
    • [Figure: control flow graph with blocks A, B, C and D. Extended blocks: AB, AC]
  • 7. Extended Block Optimizer Algorithm
    • Build TNINFO to track TN values
      − Unique TNINFO for each TN definition
      − Tracks availability of the TN value, identifies the defining op, etc.
      − Facilitates replacement of TN operands by earlier TNs
    • Ops hash table
      − Hashes similar ops into the same bucket
      − Allows quick identification of "matching" ops
      − Memory ops match if they access the same memory location: remove duplicate memory accesses
      − ALU ops match if they perform the same function: common subexpression elimination
    • EBO is called many times throughout CG
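A minimal sketch of the hash-table idea for ALU ops, assuming each TN is defined exactly once (as in the single-assignment form produced by CG expansion) and that ops are plain tuples. The function name and data layout are hypothetical; the real TNINFO bookkeeping is richer than this.

```python
def eliminate_common_subexpressions(ops):
    """ops: list of (result, opcode, operand1, operand2); each result TN is defined once.
    Returns the ops that are kept and a map from removed TNs to the earlier TN
    that already holds the same value."""
    available = {}  # (opcode, operand1, operand2) -> TN holding that value
    replaced = {}   # redundant TN -> earlier equivalent TN
    kept = []
    for result, opcode, a, b in ops:
        # Rewrite operands to refer to the earliest TN with the same value.
        a = replaced.get(a, a)
        b = replaced.get(b, b)
        key = (opcode, a, b)
        if key in available:
            replaced[result] = available[key]  # matching op found: drop this one
        else:
            available[key] = result
            kept.append((result, opcode, a, b))
    return kept, replaced

kept, replaced = eliminate_common_subexpressions([
    ("t1", "add", "x", "y"),
    ("t2", "add", "x", "y"),   # same function and operands as t1
    ("t3", "mul", "t2", "z"),
])
```

A Python dict plays the role of the ops hash table here; matching memory ops would additionally need alias information, which this sketch leaves out.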
  • 8. Control Flow Optimization
    • Control flow graph built after CG expansion
    • Optimizations performed (cflow.cxx):
      − Branch removal
        · Convert branches with a constant condition into gotos
        · Fold branch to branch, branch to goto, branch to return, etc.
      − Frequency-guided block placement
        · Use feedback, estimated frequencies or user-provided hints
        · Rearrange blocks to favor fall-through over taken branches
        · Grow basic block chains
      − Clone basic blocks to reduce branching (tail duplication)
      − Unreachable code removal
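Two of the listed transformations, folding a branch whose target is itself only a goto and removing unreachable blocks, can be sketched on a toy control flow graph. The dictionary-based CFG and the function name are assumptions for illustration only.

```python
def simplify_cfg(successors, goto_only, entry):
    """successors: block -> list of successor blocks.
    goto_only: block -> target, for blocks containing nothing but a goto.
    Returns the retargeted CFG restricted to blocks reachable from entry."""
    def final_target(block):
        seen = set()
        # Follow chains of goto-only blocks; 'seen' guards against goto cycles.
        while block in goto_only and block not in seen:
            seen.add(block)
            block = goto_only[block]
        return block

    retargeted = {b: [final_target(s) for s in succs] for b, succs in successors.items()}

    reachable, worklist = set(), [entry]
    while worklist:
        block = worklist.pop()
        if block not in reachable:
            reachable.add(block)
            worklist.extend(retargeted.get(block, []))
    return {b: retargeted[b] for b in successors if b in reachable}

cfg = simplify_cfg(
    successors={"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["E"], "E": [], "F": ["E"]},
    goto_only={"B": "D", "D": "E"},
    entry="A",
)
```

In this example the branch from A to B is folded straight to E, after which B and D become unreachable and are dropped together with the already dead block F.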
  • 9. Inner Loop Optimizations
    • Cross-iteration loop optimizations (cio_rwtrans.cxx)
    • Unrolling for innermost loops (cg_loop.cxx)
    • Prefetch pruning (cg_loop.cxx)
  • 10. Cross-iteration Loop Optimizations
    • Redundancy elimination for array elements across iteration boundaries
    • Build dependencies for memory access instructions in loop bodies
    • Omega: the iteration distance of the dependency
    • Read redundancies:
      [Figure: before/after code fragments: t = a[i-1] | = a[i-1] | = t | = a[i] | = u | u = t]
    • Register moves inserted if live ranges overlap
    • Temporaries initialized before loop entry
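The effect of removing a read redundancy with omega = 1 can be illustrated at source level. This is a sketch of the idea rather than of the CGIR transformation itself; the function names and the temporaries `prev` and `cur` are chosen for this example.

```python
def sums_naive(a):
    # Each iteration loads both a[i-1] and a[i].
    return [a[i - 1] + a[i] for i in range(1, len(a))]

def sums_one_load_per_iteration(a):
    out = []
    if len(a) < 2:
        return out
    prev = a[0]                  # temporary initialized before loop entry
    for i in range(1, len(a)):
        cur = a[i]               # the only load left in the loop body
        out.append(prev + cur)   # a[i-1] is taken from the temporary instead of memory
        prev = cur               # register move carrying the value into the next iteration
    return out

data = [3, 1, 4, 1, 5]
```

The value loaded as `a[i]` in one iteration is the `a[i-1]` of the next, so a single load plus a register move replaces two loads per iteration.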
  • 11. Cross-iteration Loop Optimizations
    • Read redundancies from writes:
      [Figure: before/after code fragments: = a[i] | = v | t = a[i-2] | = a[i-2] | = t | v = u | u = t]
    • Start with redundancies with small omegas and continue until register pressure reaches a threshold
    • Write-write redundancies:
      [Figure: code fragments: a[i] = | a[i-3] = | a[i-3] =]
    • Requires insertion of residual stores at loop exit
  • 12. Cross-iteration CSE
    • Built on the results of cross-iteration read redundancies
    • Removes operations and reduces register pressure
      [Figure: before/after code fragments: t = a[i-1] + b[i-1] | = a[i-1] + b[i-1] | = t | = a[i] + b[i] | = u | u = t]
  • 13. Instruction Scheduling
    • Performed at basic block scope (hb_sched.cxx)
    • Construct dependence graph (cg_dep_graph.cxx) based on:
      − Alias information from wopt
      − Dependency information from LNO
    • Good instruction scheduling may increase register pressure
      − Instruction scheduling and register allocation are interdependent
    • Instruction scheduling is applied twice:
      − The first scheduling pass estimates register requirements under a good instruction schedule
      − The global register allocator grants the requested number of registers in each basic block
      − The second scheduling pass is performed after register allocation
    • After register allocation, dependencies are based on real registers
    • Uses a list scheduling algorithm
    • Can work in either the forward or the backward direction
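A minimal forward list scheduler for one basic block, as a sketch of the general technique. It assumes a single-issue machine, a critical-path priority and made-up latencies; none of these choices is confirmed by the slides, and the function and op names are hypothetical.

```python
def list_schedule(latency, deps):
    """latency: op -> cycles until its result is available.
    deps: op -> set of ops it depends on (must form a DAG).
    Returns {op: issue cycle} for a single-issue machine."""
    succs = {op: [] for op in latency}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    height = {}
    def critical_path(op):
        # Longest latency path from op to the end of the block.
        if op not in height:
            height[op] = latency[op] + max((critical_path(s) for s in succs[op]), default=0)
        return height[op]

    issue, cycle = {}, 0
    while len(issue) < len(latency):
        ready = [op for op in latency
                 if op not in issue
                 and all(p in issue and issue[p] + latency[p] <= cycle
                         for p in deps.get(op, ()))]
        if ready:
            best = max(ready, key=critical_path)  # ties go to the earliest op in program order
            issue[best] = cycle
        cycle += 1  # advance even when nothing is ready (a stall cycle)
    return issue

schedule = list_schedule(
    latency={"load1": 3, "load2": 3, "add": 1, "mov": 1},
    deps={"add": {"load1", "load2"}},
)
```

In the example the independent `mov` is issued in cycle 2, filling a slot that would otherwise be a stall while `add` waits for the second load.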
  • 14. Roles of Register Allocation
    • Make best use of the available physical registers to improve run-time performance
    • Generate spill code when necessary
    • Obey the ABI and ISA regarding register usage:
      − Parameter registers
      − Function return registers
      − Callee-saved registers: save/restore at procedure entry/exit
      − Caller-saved registers: save/restore around calls
      − Special register usage in special instructions
  • 15. Register Allocation
    • Performed according to targ_info's register file parameters
    • Applied to register TNs (symbolic registers)
    • Liveness analysis identifies global TNs (GTNs), whose live ranges span basic block boundaries (gra_live.cxx)
    • Global Register Allocation (GRA) is applied to GTNs only (gra_mon/*.cxx)
      − BB-level granularity
      − Priority-based coloring algorithm
      − Grants the requested number of registers to LRA
    • Local Register Allocation (LRA) is applied to local TNs in each BB, using the registers granted by GRA (lra.cxx)
      − Instruction-level granularity
      − One backward pass over instructions
    • If GRA is not applied, a Localization pass transforms GTNs into local TNs by inserting spill code
  • 16. Liveness Analysis
    • Uses traditional data flow analysis
    • Bit vectors based on TN numbers for each BB
    • Pass over code to set local information:
      − live_use_i: TNs with a local upward-exposed use
      − defreach_gen_i: TNs defined in BB_i
    • Use data flow analysis to compute:
      − defreach_in_i: some def reaches the top of BB_i
      − defreach_out_i: some def reaches the bottom of BB_i
      − live_in_i: use upward-exposed at the top of BB_i
      − live_out_i: use upward-exposed at the bottom of BB_i
    • At BB_i, the set of global TNs is given by:
      $(\mathit{defreach\_in}_i \wedge \mathit{live\_in}_i) \vee (\mathit{defreach\_out}_i \wedge \mathit{live\_out}_i)$
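The data flow equations can be sketched with Python sets standing in for the bit vectors. This assumes the usual formulation (liveness solved backward, reaching definitions solved forward, iterated to a fixed point); the function name and the input layout are hypothetical.

```python
def global_tns(blocks, succs):
    """blocks: name -> (live_use, defreach_gen), both sets of TN names.
    succs: name -> list of successor block names.
    Returns name -> set of TNs that are global at that block."""
    preds = {b: [] for b in blocks}
    for b in blocks:
        for s in succs.get(b, []):
            preds[s].append(b)

    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    defreach_in = {b: set() for b in blocks}
    defreach_out = {b: set() for b in blocks}

    changed = True
    while changed:  # iterate until nothing changes
        changed = False
        for b, (live_use, defreach_gen) in blocks.items():
            new_live_out = set().union(*(live_in[s] for s in succs.get(b, [])))
            new_live_in = live_use | (new_live_out - defreach_gen)
            new_def_in = set().union(*(defreach_out[p] for p in preds[b]))
            new_def_out = new_def_in | defreach_gen
            new = (new_live_in, new_live_out, new_def_in, new_def_out)
            old = (live_in[b], live_out[b], defreach_in[b], defreach_out[b])
            if new != old:
                live_in[b], live_out[b], defreach_in[b], defreach_out[b] = new
                changed = True

    # Global at a block: a def reaches and a use is exposed, at the top or at the bottom.
    return {b: (defreach_in[b] & live_in[b]) | (defreach_out[b] & live_out[b])
            for b in blocks}

gtns = global_tns(
    blocks={
        "B1": (set(), {"x", "t1"}),   # defines x and a purely local t1
        "B2": ({"x"}, {"y"}),         # uses x, defines y
        "B3": ({"y"}, set()),         # uses y
    },
    succs={"B1": ["B2"], "B2": ["B3"], "B3": []},
)
```

Here `t1` never becomes global because no use of it is exposed outside B1, so it would be left to local register allocation.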
  • 17. GRA via Priority-based Register Allocation
    • Uses coarse-grained interference
      − Unit of allocation is the BB
      − Each register is assigned to 1 TN in each BB
      − Live ranges are made up of BBs
      − Two TNs interfere if their live ranges overlap
    • Region-driven
      − Manages register resources over BBs
      − Splits live ranges to fit regions where there are available registers
    • No backtracking
  • 18. Register Allocation Data Structures
    • LV: a live range
    • LU: a unit (BB) within a live range (live unit)
    • An LV points to a list of its LUs
    • To test interference, use a bit vector of BBs (based on BB numbers) called BBmem
      − Two LVs interfere if their BBmems intersect
    • dft_LUs:
      − LUs with default information: 0 use, 0 def, 0 call, no preference, etc.
      − Do not require allocation of LU nodes
    • Interference graph: a linked list of pointers to interfering LVs for each LV
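The BBmem interference test amounts to a bitwise AND. The sketch below uses a Python integer as the bit vector; the class and function names are illustrative and do not correspond to the actual GRA data structures.

```python
class LiveRange:
    def __init__(self, tn, bb_numbers):
        self.tn = tn
        self.bbmem = 0
        for n in bb_numbers:
            self.bbmem |= 1 << n  # set the bit for each BB in the live range

    def interferes_with(self, other):
        # Two live ranges interfere if they share at least one BB.
        return (self.bbmem & other.bbmem) != 0

def build_interference(lvs):
    """Returns tn -> list of interfering TNs."""
    graph = {lv.tn: [] for lv in lvs}
    for i, a in enumerate(lvs):
        for b in lvs[i + 1:]:
            if a.interferes_with(b):
                graph[a.tn].append(b.tn)
                graph[b.tn].append(a.tn)
    return graph

graph = build_interference([
    LiveRange("TN1", [1, 2, 3]),
    LiveRange("TN2", [3, 4]),
    LiveRange("TN3", [5]),
])
```

Because the unit of allocation is the BB, TN1 and TN2 interfere merely by both being live somewhere in BB 3, even if their instruction-level ranges do not overlap.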
  • 19. Cost/Benefit Modeling
    • Compare the instruction count difference between allocating and not allocating
    • If not allocating:
      − In each BB where the TN is locally upward-exposed, 1 load
      − In each BB where the TN is defined, 1 store
    • If allocating:
      − May need to reload on entry to the live range
      − May need to spill on exit from the live range
    • $\mathrm{Benefit}_{bb} = (\text{\# avoided loads/stores}) \times \mathit{freq}_{bb}$
    • $\mathrm{Cost}_{bb} = (\text{\# required spills/reloads}) \times \mathit{freq}_{bb}$
    • $\mathrm{Priority}(LV) = \dfrac{\sum_{bb} \left( \mathrm{Benefit}_{bb}(TN) - \mathrm{Cost}_{bb}(TN) \right)}{\text{\# of BBs}}$
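The priority formula translates directly into a few lines of code. The tuple layout per live unit and the sample frequencies are made up for illustration, and the live range is assumed to contain at least one BB.

```python
def priority(live_units):
    """live_units: one (avoided_mem_ops, required_spill_ops, freq) tuple per BB
    in the live range. Returns the frequency-weighted net saving, normalized
    by the number of BBs."""
    total = sum((avoided - required) * freq
                for avoided, required, freq in live_units)
    return total / len(live_units)

# A hot BB that saves two memory ops, plus a cold BB that breaks even.
hot = priority([(2, 0, 10.0), (1, 1, 1.0)])

# A live range whose spill cost in a frequent BB outweighs its benefit.
cold = priority([(0, 1, 4.0), (1, 0, 1.0)])
```

Dividing by the number of BBs favors short live ranges that deliver the same saving while occupying a register over fewer blocks.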
  • 20. Register Cost Modelling
    • Caller-saved registers: need saves/restores around calls
    • Callee-saved registers: need saves/restores at PU entries/exits if used for the first time
    • Parameter registers: preferencing for a TN being passed as a parameter (negative cost)
    • Function return registers: preferencing for a TN that contains the function return value
  • 21. Register Allocation Algorithm
    • Definition: an unconstrained LV has fewer neighbors in the interference graph than the number of registers assignable to it
    • A. Compute the forbidden register set for each LV; separate LVs into constrained and unconstrained pools
    • B. For each register class, repeat until all unspilled LVs are allocated:
      1. Compute priorities for new LVs; spill if negative
      2. Pick the unallocated LV with the highest priority as LV_best
      3. For each register r in ~forbidden(LV_best), compute Regcost(LV_best, r)
      4. Pick r_best with the least Regcost
      5. If Priority(LV_best) < Regcost(LV_best, r_best), split LV_best; otherwise:
         a) Assign r_best to LV_best
         b) In the list of LVs that interfere with LV_best, update the forbidden register set, split uncolorable LVs and spill unsplittable LVs
    • C. Assign registers to unconstrained LVs
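A heavily simplified sketch of the priority-driven loop in step B. It assumes priorities are already computed, ignores Regcost (the first free register is taken), treats all LVs as constrained, and spills where the full algorithm would first try splitting. The function name and inputs are hypothetical.

```python
def priority_color(priorities, interference, registers):
    """priorities: lv -> priority value.
    interference: lv -> iterable of interfering lvs.
    registers: list of register names in one register class.
    Returns lv -> assigned register, or None if the lv is spilled."""
    forbidden = {lv: set() for lv in priorities}
    result = {}
    # Highest priority first; no backtracking once a register is assigned.
    for lv in sorted(priorities, key=priorities.get, reverse=True):
        free = [r for r in registers if r not in forbidden[lv]]
        if priorities[lv] < 0 or not free:
            result[lv] = None  # spill (a full allocator would try to split first)
            continue
        reg = free[0]
        result[lv] = reg
        for neighbor in interference.get(lv, ()):
            forbidden[neighbor].add(reg)  # neighbors may no longer use this register
    return result

allocation = priority_color(
    priorities={"a": 9.0, "b": 5.0, "c": 2.0, "d": -1.0},
    interference={"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]},
    registers=["r1", "r2"],
)
```

With two registers and three mutually interfering live ranges, the lowest-priority one (`c`) loses out, while `d` is spilled purely because its priority is negative.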
  • 22. Purpose of Splitting
    • To enhance colorability:
      − Split when forbidden(LV) = the full set of registers
      − If no register is available in any of the BBs where the TN appears, spill the entire LV
    • To change the shape of live ranges:
      − Automatically separate LVs with non-contiguous blocks
      − Split at calls when out of callee-saved registers
      − Split-out regions with 0 occurrences are spilled
  • 23. Splitting Live Ranges
    • Carve the largest possible LV_new out of LV_orig such that forbidden(LV_new) is not full
    • What remains of LV_orig may or may not be allocatable
    • Maintain the list of LUs in topological order according to the control flow graph, ignoring back edges
    • Algorithm:
      1. Go through the LU list to pick an LU such that the TN appears and a register is available
      2. For each LU in LV_new:
         For each BB_succ of the LU:
           If BB_succ is a member of LV_orig, and adding BB_succ to LV_new does not cause forbidden(LV_new) to become full:
             Move BB_succ from LV_orig to LV_new;
             Update forbidden(LV_new) and LV_new's LU list
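The two-step splitting algorithm can be sketched with sets. Here the forbidden set of a live range is modeled as the union of the registers already taken in its BBs, which is an assumption made for this example; the function and parameter names are hypothetical.

```python
def split_live_range(lus, tn_appears, used_regs, succs, all_regs):
    """lus: BBs of LV_orig in topological order (back edges ignored).
    tn_appears: set of BBs in which the TN is referenced.
    used_regs: bb -> set of registers already taken in that BB.
    succs: bb -> list of successor BBs.
    Returns (BBs of LV_new, BBs remaining in LV_orig)."""
    # Step 1: seed LV_new with a BB where the TN appears and a register is free.
    seed = next((bb for bb in lus
                 if bb in tn_appears and used_regs[bb] != all_regs), None)
    if seed is None:
        return [], list(lus)  # nothing can be carved out

    orig = set(lus) - {seed}
    new = [seed]
    forbidden = set(used_regs[seed])

    # Step 2: grow LV_new along successors while some register stays free.
    for bb in new:  # 'new' grows during iteration, so added BBs are visited too
        for s in succs.get(bb, []):
            if s in orig and (forbidden | used_regs[s]) != all_regs:
                orig.remove(s)
                new.append(s)
                forbidden |= used_regs[s]
    return new, [bb for bb in lus if bb in orig]

lv_new, lv_rest = split_live_range(
    lus=["B1", "B2", "B3", "B4"],
    tn_appears={"B1", "B4"},
    used_regs={"B1": {"r1"}, "B2": set(), "B3": {"r2"}, "B4": {"r1"}},
    succs={"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"]},
    all_regs={"r1", "r2"},
)
```

B3 stays behind because adding it would make every register forbidden for the new live range, so `r2` remains assignable to LV_new.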
  • 24. Local Register Allocation
    • Allocate registers to local TNs in one backward pass through the instructions in a BB
    • The set of registers available in each class for the BB is updated on the fly
    • In the backward pass:
      − In each instruction, process the result TN before the operand TNs
      − The first encounter of each local TN must be the last use of the TN for its live range
      − Assign a register to the TN by picking round-robin from the available set
      − For a local TN appearing as a result, free its register by adding the register back to the available set
      − The same register can be allocated to the result and an operand in the same instruction
      − If this happens in a register move instruction, the instruction can be deleted
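The backward pass can be sketched as below, assuming every TN in the block is local and defined exactly once. A FIFO list is used as a simple stand-in for round-robin selection, and the error raised when registers run out is a placeholder; the function name and instruction layout are hypothetical.

```python
def local_register_allocate(instructions, registers):
    """instructions: list of (result_tn_or_None, [operand_tns]) in program order.
    registers: registers available to this BB.
    Returns tn -> register."""
    available = list(registers)
    assignment = {}
    for result, operands in reversed(instructions):
        # Process the result first: its live range begins here, so going
        # backward its register becomes free for earlier instructions.
        if result is not None:
            if result not in assignment:  # result is never used in the block
                if not available:
                    raise RuntimeError("out of registers")
                assignment[result] = available.pop(0)
            available.append(assignment[result])
        for tn in operands:
            if tn not in assignment:  # first encounter going backward = last use
                if not available:
                    raise RuntimeError("out of registers")
                assignment[tn] = available.pop(0)
    return assignment

assignment = local_register_allocate(
    instructions=[
        ("t1", []),            # t1 = load ...
        ("t2", []),            # t2 = load ...
        ("t3", ["t1", "t2"]),  # t3 = t1 + t2
        (None, ["t3"]),        # store t3
    ],
    registers=["r1", "r2"],
)
```

Freeing the result's register before visiting the operands is what lets `t3` and `t2` share `r1` within the same instruction.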
  • 25. Handling X86 Register Peculiarities
    • Extra copies with extra TNs are introduced at the start of LRA for each BB
    • To conform to the 2-operand instruction format:
      − TN100 = TN101 + TN102 becomes TN100 = TN101; TN100 = TN100 + TN102
    • To conform to usage of the 8-bit register subclass:
      − TN100 = sete … becomes TN101 = sete …; TN100 = TN101
      − TN101 is restricted to the 8-bit register subclass
      − No restriction on TN100
  • 26. Handling X86 Special Registers
    • Some instructions require specific registers
    • Single-register subclasses are created for rax, rcx and rdx
    • Introduce copies involving TNs restricted to these subclasses:
      − TN100 TN101 = mul32 TN200 TN201 becomes:
        eax = TN200
        eax edx = mul32 eax TN201
        TN100 = eax
        TN101 = edx
  • 27. LRA can run out of registers
    • Every TN must be allocated to a register
    • When no register is available for a TN:
      − Call Fix_LRA_Blues() to free up extra registers
      − Re-try LRA for the entire BB
    • Fix_LRA_Blues() can use different strategies:
      − Spill a register allocated by GRA
      − Re-do instruction scheduling to reduce register pressure
      − Spill a previously allocated local TN over its live range
  • 28. Handling X87 Register Stack
    • X87 is supported only under –m32
    • X87 stack registers are modelled as a normal register file
    • After the last instruction scheduling phase, X87 registers are converted to stack-like operations (x8664/cg_convert_x87.cxx)
      − Stack is maintained in the same state at transitions between BBs
      − Insert fxch instructions
      − Use the pop version of X87 instructions if dead
  • 29. Global Code Motion
    • Performed after local register allocation and before the final instruction scheduling phase (gcm.cxx)
    • Computes critical paths in each basic block
    • Looks for opportunities to shorten critical paths by moving instructions to adjacent blocks
      − Beginning instructions are moved to predecessors
      − Ending instructions are moved to successors
  • 30. Handling Asm()
    • In WHIRL:
      − ASM_STMT represents an asm statement
      − Inputs and outputs of ASM_STMT maintain its interfaces in WHIRL
    • Expanded to an asm pseudo-op inside CG (whirl2ops.cxx)
      − Input and output interfaces become TNs
      − TNs are assigned real registers by GRA/LRA
    • In the code emission phase, operands in the asm string are substituted by the assigned registers (cgemit.cxx, x8664/cgtarget.cxx)