General Structure of a Compiler
Lecture 2
General Structure of a Compiler
Conceptual Structure
• Front-end performs the analysis of the source language:
– Breaks up the source code into pieces and imposes a grammatical
structure.
Front-End Back-End
Intermediate
Representation
Source code Target code
9-Jan-15 CS 346 Lecture 2 2
– Using this, creates a generic Intermediate Representation (IR) of the
source code.
– Checks for syntax / semantics and provides informative messages
back to user in case of errors.
– Builds a symbol table to collect and store information about source
program.
• Back-end does the target language synthesis:
– Chooses instructions to implement each IR operation.
Conceptual Structure
Front-End Back-End
Intermediate
Representation
Source code Target code
9-Jan-15 CS 346 Lecture 2 3
– Translates IR into target code.
– Needs to conform with system interfaces.
– Automation has been less successful.
What is the implication of this separation (front-end: analysis; back-
end:synthesis) in building a compiler for, say, a new language?
m×n compilers with m+n components!
Front-end
Front-end
Front-end
Front-end
Back-end
Back-end
Back-end
Back-end
I.R.
Fortran
Smalltalk
target 1
C
Java
target 2
target 3
target 4
9-Jan-15 4
• All language specific knowledge must be encoded in the front-end
• All target specific knowledge must be encoded in the back-end
But: in practice, this strict separation is not trivial.
Front-end Back-end
CS 346 Lecture 2
General Structure of a Compiler
Lexical
Analysis
Syntax
Analysis
I.C.
Optimisation
Code
Generation
I.R.
tokens
Abstract Syntax Tree (AST)
IR
symbolic instructions
Source
9-Jan-15 5
Semantic
Analysis
Intermediate
code generat.
Target code
Generation
Target code
Optimisation
Annotated AST optimised symbolic instr.
front-end back-end
Target
CS 346 Lecture 2
Lexical Analysis (Scanning)
• Reads characters in the source program and groups them into
meaningful sequences called lexemes.
• Produces as output a token of the form <Token-class, Attribute>
and passes it to the next phase syntax analysis
9-Jan-15 6
• Token-class: The symbol for this token to be used during syntax
Analysis
• Attribute: Points to symbol table entry for this token
e.g.: The tokens for: fin = ini + rate * 60 are:
<id, 1> <=> <id, 2> <+> <id, 3> <*> <const, 60>
CS 346 Lecture 2
Symbol Table
1 pos …
2 ini …
3 rate …
… … …
Syntax Analysis (Parsing)
• Imposes a hierarchical structure on the token stream.
• This hierarchical structure is usually expressed by recursive rules.
• Context-free grammars formalise these recursive rules and guide
9-Jan-15 7
• Context-free grammars formalise these recursive rules and guide
syntax analysis to flag syntax error in case of any mismatch.
• The IR is usually represented as syntax trees
– Interior nodes represent operation
– Leaves depict the arguments of the operation
CS 346 Lecture 2
• Parse tree for: fin = ini + rate * 60
• Tokens: <id, 1> <=> <id, 2> <+> <id, 3> <*> <const, 60>
=
Syntax Analysis (Parsing)
+
<id,3>
<id,1>
*
60
<id,2>
9-Jan-15 8CS 346 Lecture 2
Semantic Analysis (context handling)
• Collects context (semantic) information from syntax tree and
symbol table and checks for semantic consistency with
language definition.
• Annotates nodes of the tree with the results.
• Semantic errors:
9-Jan-15 9
• Semantic errors:
– Type mismatches, incompatible operands, function called with
improper arguments, undeclared variable, etc.
– e.g: int ary[10], x; x = ary * 20;
• Type checkers in the semantic analyzer may also do automatic
type conversions if permitted by the language specification
(Coercions).
CS 346 Lecture 2
• Annotated Syntax Tree (AST) for the expression fin = ini + rate * 60
• Tokens: <id, 1> <=> <id, 2> <+> <id, 3> <*> <const, 60>
+
= float
float
Semantic Analysis (context handling)
*
+
60
<id,1>
<id,2>
<id,3>
float
float
float
float
float
intofloat
9-Jan-15 10CS 346 Lecture 2
• Translate language-specific constructs in the AST into
more general constructs.
• A criterion for the level of “generality”: it should be
straightforward to generate the target code from the
intermediate representation chosen.
Intermediate code generation
• Example of a form of IR (3-address code):
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
9-Jan-15 11CS 346 Lecture 2
• Generates streamlined code still in intermediate representation.
• A range of optimization techniques may be applied. For example,
– Removing unused variables
– Suppressing generation of unreachable code segments
– Constant Folding
– Common Sub-expression Elimination
– Loop optimization (Removing unmodified statements in a
Code Optimisation
– Loop optimization (Removing unmodified statements in a
loop)... etc.
• Example:
t1 = inttofloat(60) t1 = id3 * 60.0
t2 = id3 * t1 id1 = id2 + t1
t3 = id2 + t2
id1 = t3
9-Jan-15 12CS 346 Lecture 2
• Map 3-address code into assembly code of the target
architecture
– Instruction selection: A pattern matching problem.
– Register allocation: Each value should be in a register when it is
used (but there is only a limited number).
– Instruction scheduling: take advantage of multiple functional
units.
• Example — A possible translation using two registers:
Code Generation
• Example — A possible translation using two registers:
t1 = id3 * 60.0 MOVF id3, R2
id1 = id2 + t1 MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
9-Jan-15 13CS 346 Lecture 2
9-Jan-15 14CS 346 Lecture 2
Lecture2 general structure of a compiler

Lecture2 general structure of a compiler

  • 1.
    General Structure ofa Compiler Lecture 2 General Structure of a Compiler
  • 2.
    Conceptual Structure • Front-endperforms the analysis of the source language: – Breaks up the source code into pieces and imposes a grammatical structure. Front-End Back-End Intermediate Representation Source code Target code 9-Jan-15 CS 346 Lecture 2 2 – Using this, creates a generic Intermediate Representation (IR) of the source code. – Checks for syntax / semantics and provides informative messages back to user in case of errors. – Builds a symbol table to collect and store information about source program.
  • 3.
    • Back-end doesthe target language synthesis: – Chooses instructions to implement each IR operation. Conceptual Structure Front-End Back-End Intermediate Representation Source code Target code 9-Jan-15 CS 346 Lecture 2 3 – Translates IR into target code. – Needs to conform with system interfaces. – Automation has been less successful. What is the implication of this separation (front-end: analysis; back- end:synthesis) in building a compiler for, say, a new language?
  • 4.
    m×n compilers withm+n components! Front-end Front-end Front-end Front-end Back-end Back-end Back-end Back-end I.R. Fortran Smalltalk target 1 C Java target 2 target 3 target 4 9-Jan-15 4 • All language specific knowledge must be encoded in the front-end • All target specific knowledge must be encoded in the back-end But: in practice, this strict separation is not trivial. Front-end Back-end CS 346 Lecture 2
  • 5.
    General Structure ofa Compiler Lexical Analysis Syntax Analysis I.C. Optimisation Code Generation I.R. tokens Abstract Syntax Tree (AST) IR symbolic instructions Source 9-Jan-15 5 Semantic Analysis Intermediate code generat. Target code Generation Target code Optimisation Annotated AST optimised symbolic instr. front-end back-end Target CS 346 Lecture 2
  • 6.
    Lexical Analysis (Scanning) •Reads characters in the source program and groups them into meaningful sequences called lexemes. • Produces as output a token of the form <Token-class, Attribute> and passes it to the next phase syntax analysis 9-Jan-15 6 • Token-class: The symbol for this token to be used during syntax Analysis • Attribute: Points to symbol table entry for this token e.g.: The tokens for: fin = ini + rate * 60 are: <id, 1> <=> <id, 2> <+> <id, 3> <*> <const, 60> CS 346 Lecture 2 Symbol Table 1 pos … 2 ini … 3 rate … … … …
  • 7.
    Syntax Analysis (Parsing) •Imposes a hierarchical structure on the token stream. • This hierarchical structure is usually expressed by recursive rules. • Context-free grammars formalise these recursive rules and guide 9-Jan-15 7 • Context-free grammars formalise these recursive rules and guide syntax analysis to flag syntax error in case of any mismatch. • The IR is usually represented as syntax trees – Interior nodes represent operation – Leaves depict the arguments of the operation CS 346 Lecture 2
  • 8.
    • Parse treefor: fin = ini + rate * 60 • Tokens: <id, 1> <=> <id, 2> <+> <id, 3> <*> <const, 60> = Syntax Analysis (Parsing) + <id,3> <id,1> * 60 <id,2> 9-Jan-15 8CS 346 Lecture 2
  • 9.
    Semantic Analysis (contexthandling) • Collects context (semantic) information from syntax tree and symbol table and checks for semantic consistency with language definition. • Annotates nodes of the tree with the results. • Semantic errors: 9-Jan-15 9 • Semantic errors: – Type mismatches, incompatible operands, function called with improper arguments, undeclared variable, etc. – e.g: int ary[10], x; x = ary * 20; • Type checkers in the semantic analyzer may also do automatic type conversions if permitted by the language specification (Coercions). CS 346 Lecture 2
  • 10.
    • Annotated SyntaxTree (AST) for the expression fin = ini + rate * 60 • Tokens: <id, 1> <=> <id, 2> <+> <id, 3> <*> <const, 60> + = float float Semantic Analysis (context handling) * + 60 <id,1> <id,2> <id,3> float float float float float intofloat 9-Jan-15 10CS 346 Lecture 2
  • 11.
    • Translate language-specificconstructs in the AST into more general constructs. • A criterion for the level of “generality”: it should be straightforward to generate the target code from the intermediate representation chosen. Intermediate code generation • Example of a form of IR (3-address code): t1 = inttofloat(60) t2 = id3 * t1 t3 = id2 + t2 id1 = t3 9-Jan-15 11CS 346 Lecture 2
  • 12.
    • Generates streamlinedcode still in intermediate representation. • A range of optimization techniques may be applied. For example, – Removing unused variables – Suppressing generation of unreachable code segments – Constant Folding – Common Sub-expression Elimination – Loop optimization (Removing unmodified statements in a Code Optimisation – Loop optimization (Removing unmodified statements in a loop)... etc. • Example: t1 = inttofloat(60) t1 = id3 * 60.0 t2 = id3 * t1 id1 = id2 + t1 t3 = id2 + t2 id1 = t3 9-Jan-15 12CS 346 Lecture 2
  • 13.
    • Map 3-addresscode into assembly code of the target architecture – Instruction selection: A pattern matching problem. – Register allocation: Each value should be in a register when it is used (but there is only a limited number). – Instruction scheduling: take advantage of multiple functional units. • Example — A possible translation using two registers: Code Generation • Example — A possible translation using two registers: t1 = id3 * 60.0 MOVF id3, R2 id1 = id2 + t1 MULF #60.0, R2 MOVF id2, R1 ADDF R2, R1 MOVF R1, id1 9-Jan-15 13CS 346 Lecture 2
  • 14.