Compiler Design Material


Published in: Technology


1. Explain the different phases of a compiler.<br />Compiler:<br />A compiler is a program that translates a program written in one language (the source language) into an equivalent program in another language (the target language).<br />Source program → Compiler → Target program<br />Typically, a compiler is software that translates a high-level language (HLL) into machine-level language.<br />It is a translator, so it must know both the high-level language and the architecture of the target computer.<br />Most compilers are machine dependent, but some are machine independent, e.g. the Java compiler, which emits platform-independent bytecode.<br />Turbo C, for example, is common to both C and C++.<br />Need for a compiler:<br />We need compilers because the computer cannot understand the source program directly. The program has to be converted into a language the machine understands, and a compiler does this conversion.<br />We cannot use the same compiler for every language and every computer, because every HLL has its own syntax and every machine has its own instruction set.<br />Phases of a compiler:<br />A compiler takes a source program as input and produces an equivalent sequence of machine instructions as output.<br />This process is so complex that it is divided into a series of subprocesses called phases.<br />The phases of a compiler are as follows:<br />Analysis phases: 1. Lexical analysis<br /> 2. Syntax analysis<br /> 3. Semantic analysis<br />Synthesis phases: 4. Intermediate code generation<br /> 5. Code optimization<br /> 6. Code generation<br />1. Lexical Analysis:<br />This is the first phase of a compiler. The lexical analyzer, or scanner, reads the characters of the source program and groups them into a stream of tokens.<br />The usual tokens are identifiers, keywords, constants, operators, and punctuation symbols such as commas and parentheses.<br />Each token is a substring of the source program that is treated as a single unit. Tokens are of two types:<br /><ul><li>Specific strings, e.g. if, semicolon
</li><li>Classes of strings, e.g. identifier, constant, label.</li></ul>A token is treated as a pair consisting of two parts:<br /><ul><li>Token type
</li><li>Token value.</li></ul>The character sequence forming a token is called the lexeme for the token.<br />Certain tokens are augmented by a lexical value. The lexical analyzer not only generates a token but also enters the lexeme into the symbol table.<br /> Symbol table<br /><ul><li>a
</li><li>b
</li><li>c</li></ul>A token is represented by a pair in square brackets. The second component of the pair is an index into the symbol table, where the information about the token is kept.<br />For example, consider the expression<br />a = b + c * 20<br />After lexical analysis it becomes<br />id1 = id2 + id3 * 20<br />The lexical phase can detect errors where the characters remaining in the input do not form any token of the language, e.g. an unrecognized keyword.<br />2. Syntax Analysis:<br />It groups tokens together into syntactic structures such as expressions.<br />Expressions may further be combined to form statements.<br />Often the syntactic structure can be regarded as a tree whose leaves are the tokens; such a tree is called a parse tree.<br />The parser has two functions. First, it checks that the tokens occur in patterns permitted by the specification of the source language, i.e. syntax checking.<br />For example, consider the expression A + / B. After lexical analysis it reaches the syntax analyzer as the token sequence id + / id.<br />On seeing the /, the syntax analyzer should detect an error, because the presence of two adjacent binary operators violates the formation rules of an expression.<br />Second, it makes explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped.<br />The syntax analysis phase detects syntax errors, e.g. the two adjacent operators in A + / B.<br />3. Semantic Analysis:<br />An important role of semantic analysis is type checking.<br />Here the compiler checks that each operator has operands that are permitted by the source language specification.<br />Consider, for example, x = a + b<br />Diagram<br />The language specification may permit some operand coercions, for example when a binary arithmetic operator is applied to an integer and a real. In this case, the compiler may need to convert the integer to a real.<br />In this phase, the compiler detects type mismatch errors.<br />4. 
Intermediate Code Generation:<br />It uses the structure produced by the syntax analyzer to create a stream of simple instructions.<br />Many styles are possible. One common style uses instructions with one operator and a small number of operands.<br />The output of the previous phase is some representation of a parse tree. This phase transforms the parse tree into an intermediate language.<br />One popular type of intermediate language is called three-address code.<br />A typical three-address code statement is A = B op C,<br />where A, B, and C are operands and op is a binary operator.<br />Eg: A = B + C * 20<br />T1 = inttoreal(20)<br />T2 = id3 * T1<br />T3 = id2 + T2<br />id1 = T3<br />Here T1, T2, T3 are temporary variables, and id1, id2, id3 are the identifiers corresponding to A, B, C.<br />5. Code Optimization:<br />It is designed to improve the intermediate code so that the object program runs faster or takes less space.<br />Optimization may involve:<br />1. Detection and removal of dead code.<br />2. Calculation of constant expressions and terms.<br />3. Collapsing of repeated expressions into temporary storage.<br />4. Loop unrolling.<br />5. Moving code outside loops.<br />6. Removal of unnecessary temporary variables.<br />Eg: the intermediate code for A := B + C * 20 becomes<br />T1 = id3 * 20.0<br />id1 = id2 + T1<br />6. Code Generation:<br />Once optimizations are completed, the intermediate code is mapped into the target language. This involves:<br />Allocation of registers and memory.<br />Generation of correct references.<br />Generation of correct types.<br />Generation of machine code.<br />Eg: MOVF id3, R2<br /> MULF #20.0, R2<br /> MOVF id2, R1<br /> ADDF R2, R1<br /> MOVF R1, id1<br />2. Compiler Construction Tools:<br />A number of tools have been developed to help implement the various phases of a compiler. 
Some useful compiler construction tools are as follows:<br />1. Parser generators.<br />2. Scanner generators.<br />3. Syntax-directed translation engines.<br />4. Automatic code generators.<br />5. Data-flow engines.<br />1. Parser Generators:<br />These produce syntax analyzers, normally from input that is based on a context-free grammar.<br />In early compilers, syntax analysis consumed not only a large fraction of the running time of a compiler but also a large fraction of the intellectual effort of writing one.<br />With parser generators, this phase is now easy to implement. Many parser generators use powerful parsing algorithms that are too complex to be carried out by hand.<br />2. Scanner Generators:<br />These automatically generate lexical analyzers, normally from a specification based on regular expressions.<br />The basic organization of the resulting lexical analyzer is in effect a finite automaton.<br />3. Syntax-Directed Translation Engines:<br />These produce collections of routines that walk the parse tree and generate output such as intermediate code.<br />The basic idea is that one or more translations are associated with each node of the parse tree.<br />Each translation is defined in terms of the translations at its neighbour nodes in the tree.<br />4. Automatic Code Generators:<br />Such a tool takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for the target machine.<br />The rules must include sufficient detail to handle the different possible access methods for data, e.g. variables may be in registers, in a fixed location in memory, or allocated a position on a stack.<br />The basic technique is “template matching”. 
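As a rough illustration of template matching, the Python sketch below expands one three-address statement into machine instructions by filling in a per-operator template. The table `TEMPLATES`, the function `expand`, the mnemonics (MOV, ADD, SUB, MUL) and the register name R0 are all assumptions made for this example; a real code generator would also track where each variable is stored.

```python
# Hypothetical templates: one instruction sequence per intermediate-code operator.
# The placeholders {dest}, {left}, {right} and {reg} are filled in per statement.
TEMPLATES = {
    "+": ["MOV {left}, {reg}", "ADD {right}, {reg}", "MOV {reg}, {dest}"],
    "-": ["MOV {left}, {reg}", "SUB {right}, {reg}", "MOV {reg}, {dest}"],
    "*": ["MOV {left}, {reg}", "MUL {right}, {reg}", "MOV {reg}, {dest}"],
}


def expand(dest, left, op, right, reg="R0"):
    """Replace the statement `dest := left op right` by its template."""
    return [
        line.format(dest=dest, left=left, right=right, reg=reg)
        for line in TEMPLATES[op]
    ]


if __name__ == "__main__":
    for instruction in expand("X", "Y", "+", "Z"):
        print(instruction)
```

Every template here assumes its operands live in memory and that R0 is free, which is the kind of storage assumption that has to stay consistent across templates.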
The intermediate code statements are replaced by templates.<br />These templates represent sequences of machine instructions.<br />The assumptions about the storage of variables must match from template to template.<br />5. Data-Flow Engines:<br />Much of the information needed to perform good code optimization involves “data-flow analysis”,<br />the gathering of information about how values are transmitted from one part of a program to each other part.<br />3. Issues in the design of a Code Generator:<br />Since the code generation phase is system dependent, the following issues arise during code generation:<br /><ul><li>Input to the code generator.
</li><li>Target program.
</li><li>Memory management.
</li><li>Instruction selection.
</li><li>Register allocation.
</li><li>Evaluation order.</li></ul>1. Input to the code generator:<br />The input is intermediate code, which may take several forms:<br /><ul><li>Linear representation: postfix notation
</li><li>Three-address representation: quadruples
</li><li>Virtual machine representation: stack machine code
</li><li>Graphical representation: syntax trees, DAGs.</li></ul>The values of names in the intermediate language are assumed to be representable by quantities that the target machine can directly manipulate.<br />Type checking is assumed to have taken place already, so type-conversion operations have been inserted where needed and semantic errors have been detected.<br />Thus the input to the code generator must be free of errors.<br />2. Target Programs:<br />The output of the code generator is the target program, which may take several forms:<br /><ul><li>Absolute machine language.
</li><li>Relocatable machine language.
</li><li>Assembly language.</li></ul>Absolute machine language can be placed in a fixed memory location and executed immediately.<br />Examples of compilers that produce absolute code are WATFIV and PL/C.<br />Producing a relocatable machine language program allows subprograms to be compiled separately.<br />A set of relocatable object modules can be linked together and loaded for execution by a linking loader. Linking and loading add some expense.<br />Producing an assembly-language program as output makes code generation easier, but the output has to be assembled after code generation.<br />3. Memory Management:<br />Mapping names in the source program to addresses of data objects in run-time memory is done jointly by the front end and the code generator.<br />The details about a name, such as its type, width, and amount of storage needed, are available in the symbol table.<br />From the symbol table, a relative address can be determined for the name in a data area for the procedure.<br />If machine code is being generated, labels in three-address statements have to be converted to addresses of instructions. This process is analogous to backpatching.<br />Eg: when we encounter the quadruple<br />j: goto i<br />we generate the jump instruction as follows:<br /><ul><li>If i < j (a backward jump), generate a jump instruction whose target address is the machine location of the first instruction in the code for quadruple i.
</li><li>If i > j (a forward jump), store on a list for quadruple i the location of the first machine instruction generated for quadruple j.</li></ul>When we process quadruple i, we fill in the proper machine location in all instructions that are forward jumps to i.<br />4. Instruction Selection:<br /> The uniformity and completeness of the instruction set are important factors. Where the instruction set is not uniform, special handling of the exceptions is needed.<br /> Instruction speeds and machine idioms are also important factors.<br /> A sample target code sequence for the three-address statement<br /> X := Y + Z can be<br /> MOV Y, R0 // load Y into register R0<br /> ADD Z, R0 // add Z to R0<br /> MOV R0, X // store R0 into X<br />But this kind of statement-by-statement code generation often produces poor code. For example, the sequence of statements<br /> a := b + c<br /> d := a + e<br />would be translated into<br /> MOV b, R0<br /> ADD c, R0<br /> MOV R0, a<br /> MOV a, R0<br /> ADD e, R0<br /> MOV R0, d<br />Here the fourth instruction is redundant, because R0 already holds the value of a.<br />Similarly, the statement<br /> a := a + 1<br />can be translated as<br /> MOV a, R0<br /> ADD #1, R0<br /> MOV R0, a<br />but if the target machine has an increment instruction, the single instruction INC a does the same work in less time.<br />5. Register Allocation:<br />Instructions involving register operands are usually faster than those involving operands in memory.<br />Long-lived values that are used often should be kept in registers.<br />Certain machines require even-odd register pairs for some operands and results.<br />For example, on the IBM System/370 the division instruction has the form<br /> D x, y<br />where x, the dividend, occupies an even/odd register pair whose even register is x, and y is the divisor. After the division:<br />Even register: remainder<br />Odd register: quotient<br />6. Evaluation Order:<br />The order in which computations are performed (i.e. the order of instruction execution) can affect the efficiency of the target code.<br />Picking the best order is a difficult problem.<br />Initially, we shall avoid the problem by generating code for the three-address statements in the order in which they were produced by the intermediate code generator.<br />4. Discuss parameter-passing mechanisms:<br />Parameters provide the communication between the caller and the callee.<br />There are four 
methods for associating actual and formal parameters:<br /><ul><li>Call-by-value
</li><li>Call-by-reference
</li><li>Copy-restore
</li><li>Call-by-name</li></ul>1. Call-by-Value:<br />Call-by-value is the simplest method of passing parameters.<br />The actual parameters are evaluated and their r-values are passed to the called procedure.<br />This method is used in Pascal and C.<br />l-value: refers to the storage represented by an expression.<br />r-value: refers to the value contained in the storage.<br />Call-by-value can be implemented as follows:<br />(i) A formal parameter is treated just like a local name, so the storage for the formals is in the activation record of the called procedure.<br />(ii) The caller evaluates the actual parameters and places their r-values in the storage for the formals.<br />2. Call-by-Reference:<br />This method is also called call-by-address or call-by-location.<br />The caller passes to the called procedure a pointer to the storage address of each actual parameter.<br />If an actual parameter is a name or an expression having an l-value, then that l-value itself is passed.<br />However, if the actual parameter is an expression such as a + b or 5 that has no l-value, then the expression is evaluated in a new location and the address of that location is passed.<br />3. Copy-Restore:<br />This method is a hybrid between call-by-value and call-by-reference. It is also known as copy-in copy-out or value-result.<br />Before the call, the calling procedure evaluates the actual parameters and copies their r-values into the activation record of the called procedure, as in call-by-value.<br />In addition, the l-values of those actual parameters that have l-values are determined before the call.<br />When control returns, the current r-values of the formal parameters are copied back into the l-values of the actuals, using the l-values computed before the call.<br />4. 
Call-by-Name:<br />The procedure is treated as if it were a macro; that is, its body is substituted for the call in the caller, with the actual parameters literally substituted for the formals.<br />Such a literal substitution is called macro expansion or in-line expansion.<br />The local names of the called procedure are kept distinct from the names of the calling procedure.<br />The actual parameters are surrounded by parentheses if necessary to preserve their integrity.<br />6. Storage allocation strategies:<br />Different storage-allocation strategies are used in run-time memory organization.<br />They are:<br />1. Static allocation: lays out storage for all data objects at compile time.<br />2. Stack allocation: manages the run-time storage as a stack.<br />3. Heap allocation: allocates and de-allocates storage as needed at run time from the heap.<br />These allocation strategies are applied to allocate memory for activation records. Different languages use different strategies for this purpose.<br />Eg: FORTRAN: static allocation<br /> Algol: stack allocation<br /> LISP: heap allocation<br />1. Static allocation:<br /> The fundamental characteristics of static allocation are as follows:<br />(i) Name binding occurs during compilation, so there is no need for a run-time support package.<br />(ii) Bindings do not change at run time.<br />(iii) On every invocation of a procedure, its names are bound to the same storage locations. 
When control returns to a procedure, the values of the locals are the same as they were when control left the last time (i.e. this property allows the values of local names to be retained across activations of a procedure).<br />Eg: function F()<br />{<br /> int a;<br /> print(a);<br /> a = 10;<br />}<br />After F() has been called once, on a second call the value of ‘a’ is initially 10, and this value is printed.<br />(iv) The type of a name determines its storage requirement.<br /> The address of this storage is an offset from the procedure's activation record, and the compiler must decide where the activation records go, relative to the target code and to one another.<br /> Once this position has been decided, the addresses of the activation records, and hence of the storage for each name in the records, are fixed.<br /> Thus at compile time, the addresses at which the target code can find the data it operates on can be filled in. The addresses at which information is to be saved when a procedure call takes place are also known at compile time.<br />Static allocation has some limitations:<br />(i) The size of a data object, and any constraints on its position in memory, must be known at compile time.<br />(ii) No recursion, because all activations of a given procedure use the same bindings for local names.<br />(iii) No dynamic data structures, since no mechanism is provided for run-time storage allocation.<br />2. Stack allocation:<br />It is based on the idea of a control stack. Storage is organized as a stack, and activation records are pushed and popped as activations begin and end, respectively.<br />Storage for the locals in each call of a procedure is contained in the activation record for that call. Thus locals are bound to fresh storage in each activation, because a new activation record is pushed onto the stack when a call is made.<br />The values of locals are deleted when the activation ends. 
That is, the values are lost because the storage for locals disappears when the activation record is popped.<br />Eg: Activation tree<br /> DIAGRAM<br />Calling Sequences:<br />A call sequence allocates an activation record and enters information into its fields.<br />A return sequence restores the state of the machine so that the calling procedure can continue execution.<br />Calling sequences and activation records differ, even for the same language.<br />The code in the calling sequence is often divided between the calling procedure and the procedure it calls.<br />There is no exact division of run-time tasks between the caller and the callee.<br /> DIAGRAM<br />The register top-SP (stack top) points to the end of the machine status field in the activation record.<br />This position is known to the caller, so the caller can be made responsible for setting top-SP before control flows to the called procedure.<br />The code for the callee can access its temporaries and local data using offsets from top-SP.<br />The call sequence is:<br />Caller: Evaluates the actuals.<br /> Stores the return address and the old value of top-SP.<br /> Increments top-SP.<br />Callee: Saves register values and other status information.<br /> Initializes local data and begins execution.<br />The return sequence is:<br />Callee: Places the return value next to the caller's activation record.<br /> Restores top-SP and other registers.<br /> Branches to the return address.<br />Caller: Copies the returned value into its own activation record.<br />Limitations of stack allocation:<br />Values of locals cannot be retained when an activation ends.<br />A called activation cannot outlive the caller.<br />3. 
Heap Allocation:<br />Heap allocation is needed in the cases listed above as limitations of stack allocation, where de-allocation of activation records cannot occur in last-in, first-out fashion.<br />The heap gives out pieces of contiguous storage for activation records.<br />Pieces may be de-allocated in any order, so over time the heap will consist of alternating areas that are free and in use.<br />The heap manager is supposed to make use of the free space.<br />For efficiency reasons it may be helpful to handle small activation records as a special case:<br />For each size of interest, keep a linked list of free blocks of that size.<br />Fill a request of size s with a block of size s', where s' is the smallest size greater than or equal to s.<br /> DIAGRAM<br />For large blocks of storage, use the general heap manager.<br />A computation that needs a large amount of storage is likely to take some time to use it up, so the time taken by the heap manager is negligible compared with the computation.<br />The heap manager allocates memory dynamically, which adds run-time overhead, because it also has to take care of defragmentation and garbage collection.<br />
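The size-class scheme described above (one free list per size of interest, with each request rounded up to the smallest sufficient size) might be sketched in Python as follows. The class `SizeClassHeap`, its method names, and the bump pointer that stands in for the general heap manager are assumptions for illustration only; fragmentation handling and garbage collection are not modelled.

```python
import bisect


class SizeClassHeap:
    """Toy heap manager: one free list per block size of interest."""

    def __init__(self, size_classes):
        self.sizes = sorted(size_classes)
        # One list of free block addresses for each size class.
        self.free_lists = {size: [] for size in self.sizes}
        # Bump pointer standing in for the general heap manager.
        self.next_address = 0

    def _class_for(self, request):
        # Smallest size class s' with s' >= request, or None for a large request.
        i = bisect.bisect_left(self.sizes, request)
        return self.sizes[i] if i < len(self.sizes) else None

    def allocate(self, request):
        """Return (address, block_size) for a request of `request` units."""
        size = self._class_for(request) or request
        free = self.free_lists.get(size)
        if free:
            return free.pop(), size  # reuse a freed block of this class
        address = self.next_address  # otherwise carve out fresh storage
        self.next_address += size
        return address, size

    def release(self, address, size):
        """Return a block to the free list of its size class, if it has one."""
        if size in self.free_lists:
            self.free_lists[size].append(address)
        # Large blocks would go back to the general heap manager (not modelled).


if __name__ == "__main__":
    heap = SizeClassHeap([16, 32, 64])
    block = heap.allocate(20)   # rounded up to the 32-unit class
    heap.release(*block)
    print(heap.allocate(30))    # reuses the block that was just released
```

Rounding up wastes some space inside each block, but allocation and release for the common small sizes become constant-time list operations.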