Compiler design
  1. Compiler design
  2. Structure of a Compiler
     [Diagram: Source code enters the Compiler, which produces Object code and reports Errors.]
  3. Conceptual Structure: two major phases
     • Front-end performs the analysis of the source language:
       • Recognises legal and illegal programs and reports errors.
       • "Understands" the input program and collects its semantics in an IR.
       • Produces the IR and shapes the code for the back-end.
     • Back-end does the target language synthesis:
       • Translates the IR into target code.
     [Diagram: Source code -> Front-End -> Intermediate Representation -> Back-End -> Target code]
  4. General Structure of a compiler
     [Diagram: Source -> Lexical Analysis -> tokens -> Syntax Analysis -> Syntax Tree ->
      Semantic Analysis -> Syntax Tree -> Intermediate code generation -> Intermediate
      Representation -> I.C. Optimisation -> Intermediate Representation -> Code Generation ->
      Target Machine code -> Target code Optimisation. The Error Handler and the Symbol Table
      Manager interact with all phases.]
  5. Lexical Analysis (Scanning)
     • Reads characters in the source program and groups them into words - lexemes (the basic units of syntax).
     • Produces words and recognises what sort they are.
     • The output is called a token and is a pair of the form <token_name, attribute_value> (a C sketch of such a pair follows this slide).
     • E.g.: position = initial + rate * 60
     • Keeps a symbol table.
     • Lexical analysis eliminates white space.
     • Speed is important - use a specialised tool: e.g., flex - a tool for generating scanners: programs which recognise lexical patterns in text.
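As an editor's illustration of the <token_name, attribute_value> pair, here is a minimal C sketch; the names TokenClass and Token are invented for this example and do not come from the slides:

       /* One token = a pair <token_name, attribute_value>. */
       typedef enum { TOK_ID, TOK_ASSIGN, TOK_PLUS, TOK_TIMES, TOK_NUM, TOK_EOF } TokenClass;

       typedef struct {
           TokenClass class;      /* the token name, e.g. TOK_ID                  */
           int        attribute;  /* e.g. a symbol-table index for TOK_ID,        */
                                  /* or the numeric value for TOK_NUM             */
       } Token;

Under this encoding, position = initial + rate * 60 would be scanned as the stream <TOK_ID,1> <TOK_ASSIGN> <TOK_ID,2> <TOK_PLUS> <TOK_ID,3> <TOK_TIMES> <TOK_NUM,60>, where 1, 2 and 3 are the symbol-table entries for position, initial and rate.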
  6. Syntax (or syntactic) Analysis (Parsing)
     • Imposes a hierarchical structure on the token stream.
     • Uses the output of the lexical analyzer to create a tree-like structure of the token stream - the syntax tree.
     • The tree shows the order in which operations are performed.
     • The ordering is based on the precedence of mathematical operations.
     [Syntax tree for position = initial + rate * 60: '=' at the root with children <id,1> and '+';
      '+' has children <id,2> and '*'; '*' has children <id,3> and 60.]
  7. Semantic Analysis (context handling)
     • Collects context (semantic) information, checks for semantic errors, and annotates the nodes of the tree with the results.
     • Ensures that the components of a program fit together meaningfully.
     • Also gathers type information and saves it either in the syntax tree or in the symbol table.
     [Annotated syntax tree: as on the previous slide, but the leaf 60 is wrapped in an
      inttofloat node under '*'.]
  8. Semantic Analysis (context handling)
     • An important part is type checking, where the compiler checks that each operator has matching operands.
     • Examples:
       • Type checking: report an error if an operator is applied to an incompatible operand. E.g.: an array index must be an integer; report an error on a floating-point index (a small sketch follows this slide).
       • Check flow of control.
       • Uniqueness or name-related checks.
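To make the array-index check concrete, here is a small hedged sketch; the Node and Type definitions are invented for illustration and are not the slides' own representation:

       #include <stdio.h>

       typedef enum { TY_INT, TY_FLOAT } Type;

       typedef struct Node {
           enum { N_LEAF, N_INDEX } kind;
           Type type;                   /* annotation produced by this pass   */
           struct Node *array, *index;  /* children of an N_INDEX node        */
       } Node;

       /* Report an error if an array-index expression has a non-integer type. */
       void check_index(Node *n) {
           if (n->kind == N_INDEX && n->index->type != TY_INT)
               fprintf(stderr, "semantic error: array index is not an integer\n");
       }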
  9. Intermediate code generation
     • Translate language-specific constructs in the AST into more general constructs.
     • Generate an explicit low-level intermediate representation.
     • The intermediate representation should have two important properties:
       • It should be easy to produce.
       • It should be easy to translate into target machine code.
       t1 = inttofloat(60)
       t2 = id3 * t1
       t3 = id2 + t2
       id1 = t3
 10. Intermediate code generation
     • Generate an explicit machine-like intermediate representation.
     • Two important properties:
       • It should be easy to produce.
       • It should be easy to translate into target code.
     • The output of the intermediate code generator consists of a three-address code sequence (a quadruple sketch follows this slide).
     • Each three-address assignment instruction has at most one operator on the right-hand side.
     • The compiler must generate temporary names to hold the values computed by three-address instructions.
     • Some instructions may have fewer than three operands.
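A common concrete form of three-address code is the quadruple. The following minimal C sketch (the names Quad, Op and opname are invented) encodes the sequence from the previous slide:

       #include <stdio.h>

       typedef enum { OP_MUL, OP_ADD, OP_INTTOFLOAT, OP_COPY } Op;

       typedef struct {
           Op op;                             /* at most one operator per instruction */
           const char *arg1, *arg2, *result;  /* the three addresses                  */
       } Quad;

       int main(void) {
           Quad code[] = {
               { OP_INTTOFLOAT, "60",  NULL, "t1"  },
               { OP_MUL,        "id3", "t1", "t2"  },
               { OP_ADD,        "id2", "t2", "t3"  },
               { OP_COPY,       "t3",  NULL, "id1" },
           };
           const char *opname[] = { "*", "+", "inttofloat", "copy" };
           for (int i = 0; i < 4; i++) {
               const Quad *q = &code[i];
               if (q->arg2)
                   printf("%s = %s %s %s\n", q->result, q->arg1, opname[q->op], q->arg2);
               else
                   printf("%s = %s(%s)\n", q->result, opname[q->op], q->arg1);
           }
           return 0;
       }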
 11. Code Optimisation
     • The goal is to improve the intermediate code and, thus, the effectiveness of code generation and the performance of the target code.
     • Faster, shorter code - even code that consumes less power.
     • The inttofloat conversion can be done once, at compile time (see the folding sketch after this slide):
       t1 = id3 * 60.0
       id1 = id2 + t1
     • A significant amount of time is spent on this phase.
     [Diagram: Source code -> Front-End -> IR -> Middle-End (optimiser) -> IR -> Back-End -> Target code]
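As a hedged sketch of the improvement described above, the following pass folds inttofloat of a literal at compile time and substitutes the result into later uses. All names are invented, and the string-based quadruples below are an illustration rather than the slides' own IR:

       #include <stdio.h>
       #include <string.h>
       #include <ctype.h>

       typedef struct { char op[16], arg1[16], arg2[16], result[16]; } Quad;

       static int is_literal(const char *s) { return isdigit((unsigned char)s[0]); }

       /* Fold t = inttofloat(c) for a literal c, rewriting later uses of t. */
       void fold(Quad *code, int n) {
           for (int i = 0; i < n; i++) {
               if (strcmp(code[i].op, "inttofloat") == 0 && is_literal(code[i].arg1)) {
                   char lit[32];
                   snprintf(lit, sizeof lit, "%s.0", code[i].arg1);   /* 60 -> 60.0 */
                   for (int j = i + 1; j < n; j++) {
                       if (strcmp(code[j].arg1, code[i].result) == 0) strcpy(code[j].arg1, lit);
                       if (strcmp(code[j].arg2, code[i].result) == 0) strcpy(code[j].arg2, lit);
                   }
                   code[i].op[0] = '\0';   /* the conversion instruction is now dead */
               }
           }
       }

       int main(void) {
           Quad code[] = {
               { "inttofloat", "60",  "",   "t1" },
               { "*",          "id3", "t1", "t2" },
               { "+",          "id2", "t2", "t3" },
           };
           fold(code, 3);
           for (int i = 0; i < 3; i++)
               if (code[i].op[0])
                   printf("%s = %s %s %s\n", code[i].result, code[i].arg1,
                          code[i].op, code[i].arg2);
           return 0;
       }

Running it prints t2 = id3 * 60.0 and t3 = id2 + t2; a copy-propagation pass of the same shape would then produce the two-instruction version shown on the slide.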
 12. Code Generation Phase
     • Map the AST onto a linear list of target machine instructions in a symbolic form:
       • Instruction selection: a pattern-matching problem.
       • Register allocation: each value should be in a register when it is used.
       • Instruction scheduling: take advantage of multiple functional units.
     • Target, machine-specific properties may be used to optimise the code.
     • Finally, machine code and the associated information required by the Operating System are generated.
 13. • t1 = id3 * 60.0
     • id1 = id2 + t1
     can be converted into target code as:
       LDF  R2, id3
       MULF R2, R2, #60.0
       LDF  R1, id2
       ADDF R1, R1, R2
       STF  id1, R1
 14. Compiler Construction Tools
     • Parser Generators: automatically produce syntax analysers from a grammatical description of a programming language.
     • Scanner Generators: produce lexical analysers from a regular-expression description of the tokens of the language.
     • Syntax-Directed Translation Engines: produce collections of routines for walking a parse tree and generating intermediate code.
     • Code-Generator Generators: produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language of a target machine.
     • Data-Flow Engines: facilitate the gathering of information about how values are transmitted from one part of a program to each other part.
     • Compiler-Construction Toolkits: provide an integrated set of routines for constructing the various phases of a compiler.
 15. C compilation Phases
 16. Lexical Analyzer
 17. Lexical Analysis Versus Parsing
     • Simplicity of design.
     • Efficiency: with the tasks separated it is easier to apply specialized techniques.
     • Portability: only the lexer need communicate with the outside.
 18. Token
     • A pattern is the description of the form that the lexemes may take.
 19. Token Example
 20. Lexical Errors
     • fi ( a == f(x) ) - the scanner cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier, so it returns it as an id and leaves the decision to a later phase.
     • The simplest recovery strategy is "panic mode" recovery: delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left (see the sketch after this slide).
     • Other possible error-recovery actions:
       • Delete one character from the remaining input.
       • Insert a missing character into the remaining input.
       • Replace a character by another character.
       • Transpose two adjacent characters.
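A minimal C sketch of panic-mode recovery; the assumption that every token starts with a letter, a digit, or one of a few operator/separator characters is hypothetical, chosen only for illustration:

       #include <ctype.h>
       #include <stdio.h>
       #include <string.h>

       static int can_start_token(int c) {
           /* assumed token-start set for this sketch */
           return isalpha(c) || isdigit(c) || strchr("+-*/=<>(){};,", c) != NULL;
       }

       /* Delete successive characters until a plausible token start appears. */
       void panic_mode_recover(FILE *in) {
           int c;
           while ((c = fgetc(in)) != EOF && !can_start_token(c))
               ;                      /* discard */
           if (c != EOF)
               ungetc(c, in);         /* resume normal scanning from here */
       }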
 21. Input Buffering
     • Buffer pairs.
     • Pointer lexemeBegin marks the beginning of the current lexeme.
     • Pointer forward scans ahead until a pattern match is found.
     • Sentinel: a special character that is not part of the source program (sketched after this slide).
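The classic realisation of buffer pairs with sentinels is sketched below; the buffer size, the names, and the choice of NUL as the sentinel are assumptions for illustration:

       #include <stdio.h>

       #define N 4096                       /* size of each buffer half            */
       #define SENTINEL '\0'                /* assumes NUL never occurs in source  */

       static char buf[2 * N + 2];          /* two halves, a sentinel slot after each */
       static char *forward;

       void init_buffer(FILE *in) {
           size_t n = fread(buf, 1, N, in);
           buf[n] = SENTINEL;               /* sentinel ends the valid data        */
           buf[2 * N + 1] = SENTINEL;
           forward = buf;
       }

       int next_char(FILE *in) {
           char c = *forward++;
           if (c != SENTINEL)
               return c;                     /* the common case: a single test     */
           if (forward == buf + N + 1) {     /* sentinel after the first half      */
               size_t n = fread(buf + N + 1, 1, N, in);
               buf[N + 1 + n] = SENTINEL;
               forward = buf + N + 1;
               return next_char(in);
           }
           if (forward == buf + 2 * N + 2) { /* sentinel after the second half     */
               size_t n = fread(buf, 1, N, in);
               buf[n] = SENTINEL;
               forward = buf;
               return next_char(in);
           }
           return EOF;                       /* sentinel inside a half: real end of input */
       }

A real scanner would also maintain the lexemeBegin pointer and refuse to reload a half that still holds the start of the current lexeme.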
 22. Specification of tokens
     • Regular expressions.
     • A vocabulary (alphabet) is a finite set of symbols.
     • A string is any finite sequence of symbols from a vocabulary.
     • A language is any set of strings over a fixed vocabulary.
     • A grammar is a finite way of describing a language.
 23. Strings
     • A PREFIX of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban, banana, and ε are prefixes of banana.
     • A SUFFIX of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana, banana, and ε are suffixes of banana.
     • A SUBSTRING of s is obtained by deleting any prefix and any suffix from s. For instance, banana, nan, and ε are substrings of banana.
     • The PROPER prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
     • A SUBSEQUENCE of s is any string formed by deleting zero or more not-necessarily-consecutive positions of s. For example, baan is a subsequence of banana.
 24. Operations on Languages
     • Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}.
     • L ∪ D is the set of letters and digits - strictly speaking, the language with 62 strings of length one, each of which is either one letter or one digit.
     • LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
     • L⁴ is the set of all 4-letter strings.
     • L* is the set of all strings of letters, including ε, the empty string.
     • L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
     • D+ is the set of all strings of one or more digits.
 25. Regular Expressions
     Patterns form a regular language. A regular expression is a way of specifying a regular language. It is a formula that describes a possibly infinite set of strings. (Have you ever tried ls [x-z]* ?)
     Regular Expression (RE) (over a vocabulary V):
     • ε is a RE denoting the set {ε}.
     • If a ∈ V then a is a RE denoting {a}.
     • If r1, r2 are REs then:
       • r1* denotes zero or more occurrences of r1;
       • r1r2 denotes concatenation;
       • r1 | r2 denotes either r1 or r2.
     • Shorthands: [a-d] for a | b | c | d; r+ for rr*; r? for r | ε.
     • Describe the languages denoted by the following REs: a; a | b; a*; (a | b)*; (a | b)(a | b); (a*b*)*; (a | b)*baa.
     (What about ls [x-z]* above? Hmm... not a good example?)
 26. Examples
     • integer → (+ | - | ε) (0 | 1 | 2 | ... | 9)+
     • integer → (+ | - | ε) (0 | (1 | 2 | ... | 9) (0 | 1 | 2 | ... | 9)*)
     • decimal → integer . (0 | 1 | 2 | ... | 9)*
     • identifier → [a-zA-Z] [a-zA-Z0-9]*
     • Real-life application (perl regular expressions):
       • [+-]?(\d+\.\d+|\d+\.|\.\d+)
       • [+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?
       (for more information read: % man perlre)
     (Not all languages can be described by regular expressions. But we don't care for now.)
 27. Building a Lexical Analyser by hand
     Based on the specifications of tokens through regular expressions we can write a lexical analyser. One approach is to check case by case and split into smaller problems that can be solved ad hoc. Example:

       void get_next_token() {
           c = input_char();
           if (is_eof(c)) { token ← (EOF, "eof"); return; }
           if (is_letter(c)) { recognise_id(); }
           else if (is_digit(c)) { recognise_number(); }
           else if (is_operator(c) || is_separator(c)) { token ← (c, c); }  // single char assumed
           else { token ← (ERROR, c); }
           return;
       }
       ...
       do {
           get_next_token();
           print(token.class, token.attribute);
       } while (token.class != EOF);

     Can be efficient; but requires a lot of work and may be difficult to modify!
 28. Building Lexical Analysers "automatically"
     Idea: try the regular expressions one by one and find the longest match:

       set (token.class, token.length) ← (NULL, 0)
       // first: find max_length such that the input matches T1 → RE1
       if max_length > token.length
           set (token.class, token.length) ← (T1, max_length)
       // second: find max_length such that the input matches T2 → RE2
       if max_length > token.length
           set (token.class, token.length) ← (T2, max_length)
       ...
       // n-th: find max_length such that the input matches Tn → REn
       if max_length > token.length
           set (token.class, token.length) ← (Tn, max_length)
       // error
       if (token.class == NULL) { handle no_match }

     Disadvantage: linearly dependent on the number of token classes, and requires restarting the search for each regular expression.
 29. We study REs to automate scanner construction!
     Consider the problem of recognising register names starting with r and requiring at least one digit:
       Register → r (0|1|2|...|9) (0|1|2|...|9)*   (or, Register → r Digit Digit*)
     The RE corresponds to a transition diagram:
     [Transition diagram: start → S0 -r→ S1 -digit→ S2, with a digit loop on S2.]
     The diagram depicts the actions that take place in the scanner.
     • A circle represents a state; S0: start state; S2: final state (double circle).
     • An arrow represents a transition; the label specifies the cause of the transition.
     A string is accepted if, going through the transitions, one ends in a final state (for example, r345, r0, r29, as opposed to a, r, rab).
 30. Towards Automation
     An easy (computerised) implementation of a transition diagram is a transition table: a column for each input symbol and a row for each state. An entry is the set of states that can be reached from a state on some input symbol. E.g.:

       state      | 'r' | digit
       -----------+-----+------
       0          |  1  |  -
       1          |  -  |  2
       2 (final)  |  -  |  2

     If we know the transition table and the final state(s) we can build directly a recogniser that detects acceptance (a runnable version follows this slide):

       char = input_char();
       state = 0;                        // starting state
       while (char != EOF) {
           state ← table(state, char);
           if (state == '-') return failure;
           word = word + char;
           char = input_char();
       }
       if (state == FINAL) return acceptance;
       else return failure;
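For completeness, here is a runnable C version of that recogniser, hard-coding the table for r digit digit*; the step function plays the role of the table, and the dead '-' entries become the ERR value:

       #include <stdio.h>
       #include <ctype.h>

       #define ERR -1
       enum { S0, S1, S2, NSTATES };
       static const int FINAL[NSTATES] = { 0, 0, 1 };   /* S2 is the final state */

       static int step(int state, int c) {              /* the transition table  */
           if (state == S0 && c == 'r') return S1;
           if ((state == S1 || state == S2) && isdigit(c)) return S2;
           return ERR;
       }

       static int accepts(const char *word) {
           int state = S0;
           for (int i = 0; word[i] != '\0'; i++) {
               state = step(state, (unsigned char)word[i]);
               if (state == ERR) return 0;              /* a '-' entry: failure  */
           }
           return FINAL[state];
       }

       int main(void) {
           const char *tests[] = { "r345", "r0", "r29", "a", "r", "rab" };
           for (int i = 0; i < 6; i++)
               printf("%-4s -> %s\n", tests[i],
                      accepts(tests[i]) ? "acceptance" : "failure");
           return 0;
       }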
 31. DFA Minimisation: the problem
     • Problem: can we minimise the number of states?
     • Answer: yes, if we can find groups of states where, for each input symbol, every state of such a group will have transitions to the same group.
     [DFA with states A-E for (a|b)*abb; the worked example two slides below uses this automaton.]
 32. DFA minimisation: the algorithm (Hopcroft's algorithm: simple version)
     Divide the states of the DFA into two groups: those containing final states and those containing non-final states.
       while there are group changes
           for each group
               for each input symbol
                   if, for any two states of the group and a given input symbol, their
                   transitions do not lead to the same group, these states must belong
                   to different groups.
     For the curious, there is an alternative approach: create a graph in which there is an edge between each pair of states which cannot coexist in a group because of the conflict above. Then use a graph-colouring algorithm to find the minimum number of colours needed so that any two nodes connected by an edge do not have the same colour. (We'll examine graph-colouring algorithms later on, in register allocation.)
 33. How does it work? Recall (a | b)* abb
     In each iteration, we consider any non-single-member groups and we consider the partitioning criterion for all pairs of states in the group (a C sketch of this loop follows the slide).

       Iteration | Current groups       | Split on a | Split on b
       ----------+----------------------+------------+-------------
       0         | {E}, {A,B,C,D}       | None       | {A,B,C}, {D}
       1         | {E}, {D}, {A,B,C}    | None       | {A,C}, {B}
       2         | {E}, {D}, {B}, {A,C} | None       | None

     [Diagrams: the original five-state DFA (A-E) and the minimised four-state DFA in which
      A and C are merged.]
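A hedged C sketch of this grouping loop (a Moore-style refinement rather than full Hopcroft), hard-coding the (a|b)*abb DFA; the transition table is reconstructed from the standard textbook automaton:

       #include <stdio.h>

       #define NSTATES 5   /* A..E = 0..4 */
       #define NSYMS   2   /* 0 = 'a', 1 = 'b' */

       static const int delta[NSTATES][NSYMS] = {
           /* A */ {1, 2},  /* B */ {1, 3},  /* C */ {1, 2},
           /* D */ {1, 4},  /* E */ {1, 2},
       };
       static const int is_final[NSTATES] = { 0, 0, 0, 0, 1 };

       int main(void) {
           int group[NSTATES];
           for (int s = 0; s < NSTATES; s++)   /* initial split: final vs non-final */
               group[s] = is_final[s];
           int ngroups = 2, changed = 1;

           while (changed) {                   /* repeat while there are group changes */
               changed = 0;
               int newgroup[NSTATES], nnew = 0;
               for (int s = 0; s < NSTATES; s++) newgroup[s] = -1;
               for (int s = 0; s < NSTATES; s++) {
                   if (newgroup[s] != -1) continue;
                   newgroup[s] = nnew;
                   for (int t = s + 1; t < NSTATES; t++) {
                       if (newgroup[t] != -1 || group[t] != group[s]) continue;
                       int same = 1;           /* t stays with s only if every symbol
                                                  leads both to the same group */
                       for (int c = 0; c < NSYMS; c++)
                           if (group[delta[s][c]] != group[delta[t][c]]) same = 0;
                       if (same) newgroup[t] = nnew;
                   }
                   nnew++;
               }
               if (nnew != ngroups) changed = 1;
               ngroups = nnew;
               for (int s = 0; s < NSTATES; s++) group[s] = newgroup[s];
           }

           for (int s = 0; s < NSTATES; s++)
               printf("state %c -> group %d\n", 'A' + s, group[s]);
           return 0;
       }

On this DFA the loop reproduces the table above, ending with the four groups {A,C}, {B}, {D}, {E}.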
 34. From NFA to minimised DFA: recall a (b | c)*
     [Thompson NFA for a(b|c)* with states S0-S9: S0 -a→ S1, followed by ε-transitions into a
      (b | c) loop between S3 and S8, with exit state S9.]
     Each table entry below is ε-closure(move(s, symbol)); a sketch of the ε-closure computation follows the slide:

       DFA state | NFA states         | on a               | on b               | on c
       ----------+--------------------+--------------------+--------------------+-------------------
       A         | S0                 | S1,S2,S3,S4,S6,S9  | None               | None
       B         | S1,S2,S3,S4,S6,S9  | None               | S5,S8,S9,S3,S4,S6  | S7,S8,S9,S3,S4,S6
       C         | S5,S8,S9,S3,S4,S6  | None               | S5,S8,S9,S3,S4,S6  | S7,S8,S9,S3,S4,S6
       D         | S7,S8,S9,S3,S4,S6  | None               | S5,S8,S9,S3,S4,S6  | S7,S8,S9,S3,S4,S6

     [Resulting DFA: A -a→ B; each of B, C, D goes to C on b and to D on c. B, C and D all
      contain the accepting NFA state S9, so all three are final.]
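The ε-closure entries can be computed with a simple fixed-point pass; below is a hedged C sketch whose ε-edge wiring is an assumption, chosen to be consistent with the table above:

       #include <stdio.h>

       #define NSTATES 10

       static int eps[NSTATES][NSTATES];   /* eps[s][t]: ε-edge S_s -> S_t */

       /* Grow the set until no ε-edge leads outside it. */
       static void eps_closure(int set[NSTATES]) {
           int changed = 1;
           while (changed) {
               changed = 0;
               for (int s = 0; s < NSTATES; s++)
                   if (set[s])
                       for (int t = 0; t < NSTATES; t++)
                           if (eps[s][t] && !set[t]) { set[t] = 1; changed = 1; }
           }
       }

       int main(void) {
           /* Assumed ε-edges of Thompson's NFA for a(b|c)*. */
           static const int edges[][2] = { {1,2},{2,3},{2,9},{3,4},{3,6},
                                           {5,8},{7,8},{8,9},{8,3} };
           for (int i = 0; i < 9; i++) eps[edges[i][0]][edges[i][1]] = 1;

           int set[NSTATES] = { [1] = 1 };   /* move(A, 'a') = {S1} */
           eps_closure(set);
           printf("ε-closure({S1}) = {");
           for (int s = 0; s < NSTATES; s++) if (set[s]) printf(" S%d", s);
           printf(" }\n");                   /* prints S1 S2 S3 S4 S6 S9 = state B */
           return 0;
       }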
 35. DFA minimisation: recall a (b | c)*
     Apply the minimisation algorithm to produce the minimal DFA:

       Iteration | Current groups | Split on a | Split on b | Split on c
       ----------+----------------+------------+------------+-----------
       0         | {B,C,D}, {A}   | None       | None       | None

     [Minimal DFA: A -a→ {B,C,D}, which loops on b | c.]
     Remember, last time, we said that a human could construct a simpler automaton than Thompson's construction? Well, algorithms can produce the same DFA!
 36. Lex/Flex: Generating Lexical Analysers
     Flex is a tool for generating scanners: programs which recognise lexical patterns in text (from the online manual pages: % man flex). A minimal example follows this slide.
     • Lex input consists of 3 sections:
       • regular expressions;
       • pairs of regular expressions and C code;
       • auxiliary C code.
     • When the lex input is compiled, it generates as output a C source file lex.yy.c that contains a routine yylex(). After compiling the C file, the executable will start isolating tokens from the input according to the regular expressions and, for each token, will execute the code associated with it. The array char yytext[] contains the representation of a token.
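A minimal flex input file along these lines might look as follows; the token set is invented for illustration (build with: flex scan.l && cc lex.yy.c -o scan):

       %{
       #include <stdio.h>
       %}
       %option noyywrap
       %%
       [0-9]+                { printf("NUMBER  %s\n", yytext); }
       [a-zA-Z][a-zA-Z0-9]*  { printf("ID      %s\n", yytext); }
       [-+*/=()]             { printf("OP      %s\n", yytext); }
       [ \t\n]+              ;  /* skip white space */
       .                     { printf("ERROR   %s\n", yytext); }
       %%
       int main(void) { yylex(); return 0; }

Each rule pairs a regular expression with the C code to run on a match; the generated yylex() applies the longest-match policy discussed earlier.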