Unit 1
- Mr. J. B. Patil
-AISSMS’s IOIT Pune
Syllabus
• Introduction to compilers - Design issues,
passes, phases, symbol table
• Preliminaries - Memory management,
Operating system support for compiler,
• Compiler support for garbage collection
• Lexical Analysis - Tokens, Regular Expressions,
Process of Lexical analysis, Block Schematic,
Automatic construction of lexical analyzer
using LEX, LEX features and specification.
Compiler
• Programming languages are notations for describing computations to
people and to machines. The world as we know it depends on
programming languages, because all the software running on all the
computers was written in some programming language. But, before a
program can be run, it first must be translated into a form in which it can
be executed by a computer. The software systems that do this translation
are called compilers.
• A compiler is a program that can read a program in one language - the
source language - and translate it into an equivalent program in another
language - the target language; An important role of the compiler is to
report any errors in the source program that it detects during the
translation process.
Compiler
• If the target program is an executable
machine-language program, it can then be
called by the user to process inputs and
produce outputs.
Interpreter
• An interpreter is another common kind of language
processor. Instead of producing a target program as a
translation, an interpreter appears to directly
execute the operations specified in the source
program on inputs supplied by the user.
• The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.
• Java language processors combine compilation and
interpretation, as shown in Fig. 1.4. A Java source program
may first be compiled into an intermediate form called
bytecodes. The bytecodes are then interpreted by a virtual
machine. A benefit of this arrangement is that bytecodes
compiled on one machine can be interpreted on another
machine, perhaps across a network.
Language Processing System
The Structure of a Compiler
• The Structure of compiler is divided into
two parts:
1. Analysis
2. Synthesis
The Structure of a Compiler
• The analysis part breaks up the source program into constituent pieces
and imposes a grammatical structure on them. It then uses this
structure to create an intermediate representation of the source
program.
• If the analysis part detects that the source program is either
syntactically ill formed or semantically unsound, then it must provide
informative messages, so the user can take corrective action.
• The analysis part also collects information about the source program
and stores it in a data structure called a symbol table, which is passed
along with the intermediate representation to the synthesis part.
• The synthesis part constructs the desired target program from the
intermediate representation and the information in the symbol table.
The analysis part is often called the front end of the compiler; the
synthesis part is the back end.
Phases of a compiler
The compilation process operates as a sequence of phases, each of which transforms one representation of the source program into another. The symbol table, which stores information about the entire source program, is used by all phases of the compiler.
Phases of Compiler
• 1. Lexical Analysis Phase: The first phase of a compiler is
called lexical analysis or scanning. The lexical analyzer
reads the stream of characters making up the source
program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical
analyzer produces as output a token of the form
(token-name, attribute-value)
that it passes on to the subsequent phase, syntax analysis.
• In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token. Information from the symbol-table entry is needed for semantic analysis and code generation.
For example, suppose a source program contains the assignment statement
position = initial + rate * 60 ………………………… (1.1)
The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token needs no attribute-value, we have omitted the second component.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table entry for initial.
4. + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry for rate.
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).
Blanks separating the lexemes would be discarded by the lexical analyzer. Figure 1.7 shows the representation of the assignment statement (1.1) after lexical analysis as the sequence of tokens
(id, 1) (=) (id, 2) (+) (id, 3) (*) (60) ………………………… (1.2)
In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and multiplication operators, respectively.
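To make the token stream concrete, here is a minimal C sketch (the enum and struct names are hypothetical, not taken from any real compiler) that stores and prints the sequence (1.2):

#include <stdio.h>

/* Hypothetical token representation; names are illustrative only. */
enum TokenName { ID, ASSIGN, PLUS, TIMES, NUM };

struct Token {
    enum TokenName name;  /* abstract symbol used by the parser     */
    int attribute;        /* symbol-table index, or a literal value */
};

int main(void) {
    /* Token stream (1.2) for: position = initial + rate * 60 */
    struct Token stream[] = {
        {ID, 1}, {ASSIGN, 0}, {ID, 2}, {PLUS, 0},
        {ID, 3}, {TIMES, 0}, {NUM, 60}
    };
    for (int i = 0; i < 7; i++)
        printf("(%d, %d)\n", stream[i].name, stream[i].attribute);
    return 0;
}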
Phases of Compiler
• Syntax Analysis: The second phase of the
compiler is syntax analysis or parsing. The parser
uses the first components of the tokens produced
by the lexical analyzer to create a tree-like
intermediate representation that depicts the
grammatical structure of the token stream. A
typical representation is a syntax tree in which
each interior node represents an operation and
the children of the node represent the arguments
of the operation.
Phases of Compiler
• Semantic Analysis: The semantic analyzer uses the syntax tree and
the information in the symbol table to check the source program for
semantic consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table, for
subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands. For example,
many programming language definitions require an array index to be an
integer; the compiler must report an error if a floating-point number is
used to index an array.
• The language specification may permit some type conversions called
coercions.
• For example, a binary arithmetic operator may be applied to either a pair
of integers or to a pair of floating-point numbers. If the operator is applied
to a floating-point number and an integer, the compiler may convert or
coerce the integer into a floating-point number.
Phases of Compiler
• Intermediate Code Generation: In the process of
translating a source program into target code, a compiler may
construct one or more intermediate representations, which
can have a variety of forms. Syntax trees are a form of
intermediate representation; they are commonly used during
syntax and semantic analysis.
• After syntax and semantic analysis of the source program,
many compilers generate an explicit low-level or machine-like
intermediate representation, which we can think of as a
program for an abstract machine. This intermediate
representation should have two important properties: it
should be easy to produce and it should be easy to translate
into the target machine.
Phases of Compiler
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
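As a sketch of how a compiler might store such three-address instructions, the following C fragment (the Quad structure is an illustrative assumption; real compilers use quadruples, triples, or similar forms) holds the four instructions above and prints them:

#include <stdio.h>

/* Hypothetical quadruple form for three-address instructions. */
struct Quad {
    const char *op;      /* operator, e.g. "inttofloat", "*", "+", "=" */
    const char *arg1;
    const char *arg2;    /* empty string if the operator is unary      */
    const char *result;
};

int main(void) {
    /* The intermediate code for: position = initial + rate * 60 */
    struct Quad code[] = {
        {"inttofloat", "60",  "",   "t1"},
        {"*",          "id3", "t1", "t2"},
        {"+",          "id2", "t2", "t3"},
        {"=",          "t3",  "",   "id1"},
    };
    for (int i = 0; i < 4; i++)
        printf("%-10s %-4s %-4s %s\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}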
Phases of Compiler
• Code Optimization: The machine-independent code-optimization
phase attempts to improve the intermediate code so that better target
code will result. Usually better means faster, but other objectives may be
desired, such as shorter code, or target code that consumes less power.
For example, a straightforward algorithm generates the intermediate
code, using an instruction for each operator in the tree representation
that comes from the semantic analyzer.
• A simple intermediate code generation algorithm followed by code
optimization is a reasonable way to generate good target code. The
optimizer can deduce that the conversion of 60 from integer to floating
point can be done once and for all at compile time, so the inttofloat
operation can be eliminated by replacing the integer 60 by the floating-
point number 60.0. Moreover, t3 is used only once to transmit its value to
id1 so the optimizer can transform it into the shorter sequence
• t1 = id3 * 60.0
• id1 = id2 + t1
Phases of Compiler
• There is a great variation in the amount of code optimization
different compilers perform. In those that do the most, the
so-called "optimizing compilers," a significant amount of time
is spent on this phase. There are simple optimizations that
significantly improve the running time of the target program
without slowing down compilation too much.
Phases of Compiler
• Code Generation: The code generator takes as input an
intermediate representation of the source program and maps
it into the target language. If the target language is machine
code, registers or memory locations are selected for each of
the variables used by the program. Then, the intermediate
instructions are translated into sequences of machine
instructions that perform the same task. A crucial aspect of
code generation is the judicious assignment of registers to
hold variables.
Phases of Compiler
• LDF R2, id3
• MULF R2, R2, #60.0
• LDF R1, id2
• ADDF R1, R1, R2
• STF id1, R1
Phases of Compiler
• Symbol-Table Management: An essential function of a
compiler is to record the variable names used in the source
program and collect information about various attributes of each
name. These attributes may provide information about the storage
allocated for a name, its type, its scope (where in the program its
value may be used), and in the case of procedure names, such
things as the number and types of its arguments, the method of
passing each argument (for example, by value or by reference), and
the type returned.
• The symbol table is a data structure containing a record for each
variable name, with fields for the attributes of the name. The data
structure should be designed to allow the compiler to find the
record for each name quickly and to store or retrieve data from that
record quickly.
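A minimal sketch of such a symbol table, assuming a chained hash table (the bucket count, field set, and function names are illustrative, not from a particular compiler):

#include <stdlib.h>
#include <string.h>

#define BUCKETS 211

struct Symbol {
    char *name;            /* lexeme                           */
    char *type;            /* e.g. "float"                     */
    int   scope;           /* nesting depth of the declaration */
    struct Symbol *next;   /* chain for hash collisions        */
};

static struct Symbol *table[BUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % BUCKETS;
}

/* Insert a name; returns the record so later phases can attach attributes. */
struct Symbol *st_insert(const char *name, const char *type, int scope) {
    unsigned h = hash(name);
    struct Symbol *sym = malloc(sizeof *sym);
    sym->name  = strdup(name);
    sym->type  = strdup(type);
    sym->scope = scope;
    sym->next  = table[h];
    table[h] = sym;
    return sym;
}

/* Look a name up; returns NULL if it was never entered. */
struct Symbol *st_lookup(const char *name) {
    for (struct Symbol *s = table[hash(name)]; s; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;
    return NULL;
}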
The Grouping of Phases into Passes
• The discussion of phases deals with the logical organization of
a compiler. In an implementation, activities from several
phases may be grouped together into a pass that reads an
input file and writes an output file. For example, the front-end
phases of lexical analysis, syntax analysis, semantic analysis,
and intermediate code generation might be grouped together
into one pass. Code optimization might be an optional pass.
Then there could be a back-end pass consisting of code
generation for a particular target machine.
Compiler-Construction Tools
1. Parser generators that automatically produce syntax analyzers from a
grammatical description of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description of the tokens of a language.
3. Syntax-directed translation engines that produce collections of routines
for walking a parse tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a
collection of rules for translating each operation of the intermediate language
into the machine language for a target machine.
5. Data-flow analysis engines that facilitate the gathering of information
about how values are transmitted from one part of a program to each
other part. Data-flow analysis is a key part of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for constructing various phases of a compiler.
Runtime Environments
• A compiler must accurately implement the
abstractions embodied in the source language
definition.
– such as names, scopes, bindings, data types,
operators, procedures, parameters, and flow-of-
control constructs
• The compiler creates and manages a run-time
environment in which it assumes its target
programs are being executed.
Runtime Environments
1. Program vs. program execution
• A program consists of several procedures (functions)
• An execution of a program is called a process
• An execution of a program causes the activation of the related procedures
• A name (e.g., a variable name) in a procedure may be bound to different data objects
Notes: An activation may manipulate data objects allocated for its use.
Runtime Environments
• Each execution of a procedure is referred to as an activation of the procedure. If the procedure is recursive, several of its activations may be alive at the same time.
Runtime Environments
2. Allocation and de-allocation of data objects
– Managed by the run-time support package
– Allocation of a data object reserves storage space for the data object related to a name in the procedure
Notes: The representation of a data object at run time is determined by
its type.
Runtime Environments
Source Language Issues
1. Procedures: a procedure is a declaration that, in its simplest form, associates an identifier with a statement or a block of statements.
2. Functions: a procedure that returns a value is usually called a function.
3. A complete program is also treated as a procedure.
• When a procedure name appears within an executable statement, we say that the procedure is called at that point. The basic idea is that a procedure call executes the procedure body.
• Some of the identifiers appearing in a procedure definition are special and are called the formal parameters of the procedure.
Runtime Environments
Source Language Issues
2. Activation Trees
A tree that depicts how control enters and leaves the activations of a procedure.
Activation tree for f(4):
f(4)
 ├─ f(3)
 │   ├─ f(2)
 │   │   ├─ f(1)
 │   │   └─ f(0)
 │   └─ f(1)
 └─ f(2)
     ├─ f(1)
     └─ f(0)
Corresponding trace:
Enter f(4), Enter f(3), Enter f(2), Enter f(1), Leave f(1), Enter f(0), Leave f(0), Leave f(2), Enter f(1), Leave f(1), Leave f(3), Enter f(2), Enter f(1), Leave f(1), Enter f(0), Leave f(0), Leave f(2), Leave f(4)
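This is exactly the activation pattern of a recursive Fibonacci-style function. A small C program (a sketch; the printing exists only to expose the activations) reproduces the trace:

#include <stdio.h>

/* Each call is one activation; recursion creates several
 * simultaneously live activations of f. */
int f(int n) {
    printf("Enter f(%d)\n", n);
    int r = (n <= 1) ? n : f(n - 1) + f(n - 2);
    printf("Leave f(%d)\n", n);
    return r;
}

int main(void) {
    f(4);
    return 0;
}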
Runtime Environments
Source Language Issues
3. Control Stacks
A stack to keep track of live procedure
activations
Runtime Environments
Storage organization
1. Subdivision of Runtime Memory
Typical subdivision of run-time memory into code and data areas:
• Code: the generated target code
• Static data: data objects
• Stack: a counterpart of the control stack, used to keep track of procedure activations
• Heap: a separate area of run-time memory that holds all other information
Runtime Environments
Storage organization
2. Activation Records/Frames
1) Definition: a contiguous block of storage that stores information needed by a single execution of a procedure.
Runtime Environments
Storage organization
2. Activation Records
2) A general activation record contains:
• Returned value
• Actual parameters
• Optional control link
• Optional access link
• Saved machine status
• Local data
• Temporaries
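Sketched as a C struct, purely for illustration (a real compiler lays these fields out at fixed offsets within a stack frame, and the array sizes below are arbitrary assumptions):

/* Hypothetical layout of a general activation record. */
struct ActivationRecord {
    long  returned_value;    /* space for the value returned to the caller */
    long  actual_params[4];  /* actual parameters (size is illustrative)   */
    void *control_link;      /* optional: points to the caller's record    */
    void *access_link;       /* optional: record of the enclosing scope    */
    long  saved_status[8];   /* saved machine status: PC, registers        */
    long  local_data[16];    /* the procedure's local variables            */
    long  temporaries[8];    /* compiler-generated temporary values        */
};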
Runtime Environments
Storage allocation strategies
1. Storage allocation strategies
– Static allocation
• Lays out storage for all data objects at compile time.
– Stack allocation
• Manages the run-time storage as a stack.
– Heap allocation
• Allocates and de-allocates storage as needed at run
time from a heap.
Runtime Environments
Storage allocation strategies
1. Static allocation
– The size of a data object and constraints on its
position in memory must be known at compile
time.
– Recursive procedures are restricted.
– Data structures cannot be created dynamically.
– Fortran permits static storage allocation
Runtime Environments
Storage allocation strategies
2. Stack allocation
Main idea
– Based on the idea of a control stack; storage is
organized as a stack, and activation records are
pushed and popped as activations begin and end,
respectively.
– Storage for the locals in each call of a procedure is
contained in the activation record for that call.
– Locals are bound to fresh storage in each
activation
Runtime Environments
Storage allocation strategies
3. Heap allocation
– A separate area of run-time memory, called a
heap.
– Heap allocation parcels out pieces of contiguous
storage, as needed for activation records or other
objects.
– Pieces may be de-allocated in any order.
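The C heap interface shows this directly; a trivial sketch:

#include <stdlib.h>

int main(void) {
    /* Heap pieces, unlike stack frames, need not be freed LIFO. */
    int *a = malloc(100 * sizeof *a);
    int *b = malloc(200 * sizeof *b);
    int *c = malloc(50  * sizeof *c);

    free(b);   /* the middle piece goes first...          */
    free(a);   /* ...then the oldest...                   */
    free(c);   /* ...deallocation order is unconstrained. */
    return 0;
}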
Runtime Environments
Storage allocation strategies
2. Stack allocation
2) Calling sequences
– A calling sequence allocates an activation record and enters information into its fields.
– A return sequence restores the state of the
machine so the calling procedure can continue
execution.
e.g., the calling sequence is like the following:
main()
{
    ...
    q();
}
p()
{ ... }
q()
{
    ...
    p();
    ...
}
Run-time stack while p executes:
• Activation record for main (bottom)
• Activation record for q
• Activation record for p (top)
TOP and top_sp mark the top of the stack and of the current activation record.
Runtime Environments
Storage allocation strategies
2. Stack allocation
2) Calling sequences
Steps:
(1) The caller evaluates the actual parameters.
(2) The caller stores a return address and the old value of top_sp into the callee's activation record.
(3) The callee saves register values and other status information.
(4) The callee initializes its local data and begins execution.
Runtime Environments
Storage allocation strategies
2. Stack allocation
2) Calling sequences
A possible return sequence:
– (1) The callee places the return value next to the activation record of the caller.
– (2) Using the information in the status field, the callee restores top_sp and other registers and branches to the return address in the caller's code.
– (3) Although top_sp has been decremented, the caller can copy the returned value into its own activation record and use it to evaluate an expression.
Variable-length data
• In some languages, array size can depend on a
value passed to the procedure as a parameter.
• This and any other variable-sized data can still
be allocated on the stack, but BELOW the
callee’s activation record.
• In the activation record itself, we simply store
POINTERS to the to-be-allocated data.
Example of variable-length data
• All variable-length data is pointed to from the local data area.
Runtime Environments
Storage allocation strategies
2. Stack allocation
3) Stack allocation for non-nested procedures
– When entering a procedure, the activation record of the procedure is set up on the top of the stack
– Initially, push the global (static) variables and the activation record of the Main function
Runtime Environments
Storage allocation strategies
2. Stack allocation
3) Stack allocation for non-nested procedures
Stack layout: global variables at the bottom, then the activation records for Main, Q, and P; SP and TOP delimit the activation record currently on top.
Runtime Environments
Storage allocation strategies
2. Stack allocation
3) Stack allocation for non-nested procedures
A typical activation record for such a procedure contains:
• Caller's SP
• Return address
• Number of actual parameters
• Formal parameters
• Local data
• Array data
• Temporaries
• Returned value
Garbage Collection
Lexical Analysis
• The main task of the lexical analyzer is to read the
input characters of the source program, group
them into lexemes, and produce as output a
sequence of tokens for each lexeme in the source
program. The stream of tokens is sent to the
parser for syntax analysis. It is common for the
lexical analyzer to interact with the symbol table
as well.
• When the lexical analyzer discovers a lexeme
constituting an identifier, it needs to enter that
lexeme into the symbol table.
Lexical Analysis
• Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute value. The
token name is an abstract symbol representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input characters denoting an identifier. The
token names are the input symbols that the parser processes.
• A pattern is a description of the form that the lexemes of a token may take. In the
case of a keyword as a token, the pattern is just the sequence of characters that
form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.
• A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that
token.
Lexical Analysis
Table gives some typical tokens, their informally described patterns, and some sample lexemes. To see how these concepts are used in practice, consider the C statement printf("Total = %d\n", score);
Both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for token literal.
Lexical Analysis
• Attributes for Tokens:
• E = M * C ** 2
– <id, pointer to symbol-table entry for E>
– <assign_op>
– <id, pointer to symbol-table entry for M>
– <mult_op>
– <id, pointer to symbol-table entry for C>
– <exp_op>
– <number, integer value 2>
Lexical Analysis
• Lexical Errors: It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, suppose the string fi is encountered for the first time in a C program in the context:
• fi ( a == f(x) ) ...
• A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler - probably the parser in this case - handle an error due to transposition of the letters.
Lexical Analysis
• However, suppose a situation arises in which the lexical analyzer is
unable to proceed because none of the patterns for tokens matches
any prefix of the remaining input. The simplest recovery strategy is
"panic mode" recovery. We delete successive characters from the
remaining input, until the lexical analyzer can find a well-formed
token at the beginning of what input is left. This recovery technique
may confuse the parser, but in an interactive computing
environment it may be quite adequate.
• Other possible error-recovery actions are:
– 1. Delete one character from the remaining input.
– 2. Insert a missing character into the remaining input.
– 3. Replace a character by another character.
– 4. Transpose two adjacent characters.
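A minimal sketch of panic-mode deletion in C; the matcher here is a stand-in that recognizes only identifiers and numbers, where a real lexer would try all of its token patterns:

#include <ctype.h>

/* Stand-in matcher: returns the length of the longest token that
 * matches a prefix of s, or 0 if no pattern matches. */
static int match_token(const char *s) {
    int n = 0;
    if (isalpha((unsigned char)*s)) {         /* identifier pattern */
        while (isalnum((unsigned char)s[n])) n++;
    } else if (isdigit((unsigned char)*s)) {  /* number pattern     */
        while (isdigit((unsigned char)s[n])) n++;
    }
    return n;
}

/* Panic mode: delete successive characters until a well-formed
 * token can start at the beginning of what input is left. */
const char *panic_mode(const char *input) {
    while (*input && match_token(input) == 0)
        input++;
    return input;
}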
Lexical Analysis
• Specification of Tokens: The patterns corresponding to a token are generally specified using a compact notation known as a regular expression. Regular expressions of a language are created by combining members of its alphabet. A regular expression r corresponds to a set of strings L(r), where L(r) is called a regular set or a regular language and may be infinite. A regular expression is defined as follows:
• A basic regular expression a denotes the set {a}, where a ∈ Σ; L(a) = {a}
• The regular expression ε denotes the set {ε}
• If r and s are two regular expressions denoting the sets L(r) and L(s), then the following rules apply
Specification of Tokens:
Regular Definition
Token Recognition
• The previous section described the specification of the tokens of a language using regular expressions. This section elaborates how to construct recognizers that can identify the tokens occurring in an input stream. These recognizers are known as Finite Automata.
• A Finite Automaton (FA) consists of:
• A finite set of states
• A set of transitions (or moves) between states: the transitions are labeled by characters from the alphabet
• A special start state
• A set of final or accepting states
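As a sketch, here is a hand-coded recognizer in C for the identifier pattern letter (letter | digit)*, a two-state DFA (state numbering is illustrative):

#include <ctype.h>

/* States: 0 = start, 1 = accepting (inside an identifier), -1 = reject. */
int is_identifier(const char *s) {
    int state = 0;
    for (; *s; s++) {
        switch (state) {
        case 0:  /* start: the first character must be a letter */
            state = isalpha((unsigned char)*s) ? 1 : -1;
            break;
        case 1:  /* the rest may be letters or digits           */
            state = isalnum((unsigned char)*s) ? 1 : -1;
            break;
        }
        if (state == -1) return 0;
    }
    return state == 1;  /* accept only if we end in the final state */
}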
Input Buffering
• Buffer Pairs: Because of the amount of time taken to process characters
and the large number of characters that must be processed during the
compilation of a large source program, specialized buffering techniques
have been developed to reduce the amount of overhead required to
process a single input character. An important scheme involves two
buffers that are alternately reloaded, as suggested in Fig. 3.3.
Input Buffering
• Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system
read command we can read N characters into a buffer,
rather than using one system call per character. If fewer
than N characters remain in the input file, then a
special character, represented by eof, marks the end of
the source file and is different from any possible
character of the source program.
• Two pointers to the input are maintained:
1. Pointer lexemeBegin, marks the beginning of the current
lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
Input Buffering
• Once the next lexeme is determined, forward is set to the
character at its right end. Then, after the lexeme is recorded
as an attribute value of a token returned to the parser,
lexemeBegin is set to the character immediately after the
lexeme just found.
• Advancing forward requires that we first test whether we
have reached the end of one of the buffers, and if so, we must
reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer.
Input Buffering
• Sentinels: In the buffer-pair scheme we must check, each time we advance forward, that we have not moved off one of the buffers; if we do, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read. We can combine the buffer-end test with the test
for the current character if we extend each buffer to hold a
sentinel character at the end. The sentinel is a special
character that cannot be part of the source program, and
a natural choice is the character eof.
• Figure 3.4 shows the arrangement with the sentinels
added. Note that eof retains its use as a marker for the end
of the entire input. Any eof that appears other than at the
end of a buffer means that the input is at an end.
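A C sketch of the scheme, assuming '\0' stands in for the special eof character and omitting details such as lexemeBegin management:

#include <stdio.h>

#define N 4096          /* each buffer holds one disk block          */
#define SENTINEL '\0'   /* stand-in for the book's special eof mark  */

static char buffers[2][N + 1];   /* one extra slot for the sentinel  */
static char *forward;            /* the scanning pointer             */
static int  cur;                 /* index of the buffer being read   */
static FILE *src;

/* Read up to N characters and plant the sentinel right after them. */
static void reload(int which) {
    size_t got = fread(buffers[which], 1, N, src);
    buffers[which][got] = SENTINEL;
}

/* Advance forward: only hitting the sentinel triggers extra tests. */
static char advance(void) {
    char c = *forward++;
    if (c == SENTINEL) {
        if (forward == buffers[cur] + N + 1) {  /* end of a buffer  */
            cur = 1 - cur;                      /* switch buffers   */
            reload(cur);
            forward = buffers[cur];
            c = *forward++;
        }
        /* a sentinel not at a buffer end marks the real end of input */
    }
    return c;
}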
Automatic construction of lexical
analyzer-(LEX)
• Also Known as Scanner Generator.
• Lex, or in its more recent implementation Flex, allows one to specify a lexical analyzer by writing regular expressions to describe patterns for tokens. The input notation for the Lex tool is referred to as the Lex language and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
Automatic construction of lexical
analyzer-(LEX)
• Structure of Lex Programs:
• A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
Automatic construction of lexical
analyzer-(LEX)
• The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions.
• The translation rules each have the form Pattern { Action }. Each pattern is a regular expression, which may use the regular definitions of the declarations section. The actions are fragments of code, typically written in C, although many variants of Lex using other languages have been created.
• The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
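A complete, minimal Lex specification in this three-section form (a sketch; the patterns and messages are illustrative only):

%{
/* Declarations section: copied verbatim into the generated lex.yy.c. */
#include <stdio.h>
%}
digit   [0-9]
letter  [a-zA-Z]
%%
{digit}+                     { printf("NUMBER: %s\n", yytext); }
{letter}({letter}|{digit})*  { printf("ID: %s\n", yytext); }
[ \t\n]+                     { /* skip whitespace */ }
.                            { printf("UNKNOWN: %s\n", yytext); }
%%
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }   /* stop at end of input */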
Lex
Definitions section
%%
Rules section
%%
C code section (subroutines)
• The input structure to Lex.
• ECHO is a predefined action (macro) in Lex that writes out the text matched by the pattern.
Lex Regular Expressions (Extended Regular
Expressions)
• A regular expression matches a set of strings
• Regular expression
– Operators
– Character classes
– Arbitrary character
– Optional expressions
– Alternation and grouping
– Context sensitivity
– Repetitions and definitions
Operators
" \ [ ] ^ - ? . * + | ( ) $ / { } % < >
Character Classes []
• [abc] matches a single character, which may be a,
b, or c
• Inside brackets, every operator loses its special meaning except \ - and ^
• e.g.
[ab] => a or b
[a-z] => a or b or c or … or z
[-+0-9] => all the digits and the two signs
[^a-zA-Z] => any character which is not a letter
Arbitrary Character .
• To match almost any character, use the operator . : it denotes the class of all characters except newline
Optional & Repeated Expressions
• a? => zero or one instance of a
• a* => zero or more instances of a
• a+ => one or more instances of a
• E.g.
ab?c => ac or abc
[a-z]+ => all strings of lower case letters
[a-zA-Z][a-zA-Z0-9]* => all alphanumeric
strings with a leading alphabetic character
Pattern Matching Primitives
Metacharacter   Matches
.               any character except newline
\n              newline
*               zero or more copies of the preceding expression
+               one or more copies of the preceding expression
?               zero or one copy of the preceding expression
^               beginning of line / complement
$               end of line
a|b             a or b
(ab)+           one or more copies of ab (grouping)
[ab]            a or b
a{3}            exactly 3 instances of a
"a+b"           the literal string a+b
Lex Predefined Variables
• yytext -- a string containing the lexeme
• yyleng -- the length of the lexeme
• yyin -- the input stream pointer
– the default input of default main() is stdin
• yyout -- the output stream pointer
– the default output of default main() is stdout.
• % ./a.out < inputfile > outfile
• E.g.
[a-z]+ printf("%s", yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
Lex Library Routines
• yylex()
– The default main() contains a call of yylex()
• yymore()
– appends the next matched string to the current yytext
• yyless(n)
– retains the first n characters in yytext and returns the rest to the input
• yywrap()
– is called whenever Lex reaches an end-of-file
– The default yywrap() always returns 1
Review of Lex Predefined Variables
Name              Function
char *yytext      pointer to matched string
int yyleng        length of matched string
FILE *yyin        input stream pointer
FILE *yyout       output stream pointer
int yylex(void)   call to invoke lexer, returns token
yymore()          append the next matched string to yytext
yyless(int n)     retain the first n characters in yytext
int yywrap(void)  wrap-up; return 1 if done, 0 if not done
ECHO              write matched string
REJECT            go to the next alternative rule
INITIAL           initial start condition
BEGIN condition   switch start condition
User Subroutines Section
• You can use your Lex routines in the same ways
you use routines in other programming languages.
%{
void foo();
%}
letter [a-zA-Z]
%%
{letter}+ foo();
%%
…
void foo() {
…
}
User Subroutines Section (cont’d)
• The section where main() is placed
%{
int counter = 0;
%}
letter [a-zA-Z]
%%
{letter}+ { printf("a word\n"); counter++; }
%%
main() {
yylex();
printf("There are total %d words\n", counter);
}
Lex
• Whitespace must separate the defining term and the associated expression.
• Code in the definitions section is simply copied as-is to the top of the generated C file and must be bracketed with %{ and %} markers.
• Substitutions in the rules section are surrounded by braces ({letter}) to distinguish them from literals.
When used with Yacc, Lex includes the token-definitions file generated by Yacc (typically y.tab.h) and uses its definitions for token values. To obtain tokens, yacc calls yylex. Function yylex has a return type of int and returns a token. Values associated with the token are returned by lex in the variable yylval.
Usage
• To run Lex on a source file, type
lex scanner.l
• It produces a file named lex.yy.c which is a C
program for the lexical analyzer.
• To compile lex.yy.c, type
cc lex.yy.c -ll
• To run the lexical analyzer program, type
./a.out < inputfile
END of Unit 1

More Related Content

What's hot

Advanced Operating System Lecture Notes
Advanced Operating System Lecture NotesAdvanced Operating System Lecture Notes
Advanced Operating System Lecture Notes
Anirudhan Guru
 

What's hot (20)

Syntax directed translation
Syntax directed translationSyntax directed translation
Syntax directed translation
 
Top down parsing
Top down parsingTop down parsing
Top down parsing
 
Relationship Among Token, Lexeme & Pattern
Relationship Among Token, Lexeme & PatternRelationship Among Token, Lexeme & Pattern
Relationship Among Token, Lexeme & Pattern
 
Intermediate code generation in Compiler Design
Intermediate code generation in Compiler DesignIntermediate code generation in Compiler Design
Intermediate code generation in Compiler Design
 
Deadlock dbms
Deadlock dbmsDeadlock dbms
Deadlock dbms
 
Target language in compiler design
Target language in compiler designTarget language in compiler design
Target language in compiler design
 
Lex
LexLex
Lex
 
Specification-of-tokens
Specification-of-tokensSpecification-of-tokens
Specification-of-tokens
 
1.Role lexical Analyzer
1.Role lexical Analyzer1.Role lexical Analyzer
1.Role lexical Analyzer
 
Lecture 01 introduction to compiler
Lecture 01 introduction to compilerLecture 01 introduction to compiler
Lecture 01 introduction to compiler
 
Translation of expression(copmiler construction)
Translation of expression(copmiler construction)Translation of expression(copmiler construction)
Translation of expression(copmiler construction)
 
Advanced Operating System Lecture Notes
Advanced Operating System Lecture NotesAdvanced Operating System Lecture Notes
Advanced Operating System Lecture Notes
 
Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
 
Phases of Compiler
Phases of CompilerPhases of Compiler
Phases of Compiler
 
Principle source of optimazation
Principle source of optimazationPrinciple source of optimazation
Principle source of optimazation
 
Phases of compiler
Phases of compilerPhases of compiler
Phases of compiler
 
Types of grammer - TOC
Types of grammer - TOCTypes of grammer - TOC
Types of grammer - TOC
 
Compiler construction
Compiler constructionCompiler construction
Compiler construction
 
Code generation in Compiler Design
Code generation in Compiler DesignCode generation in Compiler Design
Code generation in Compiler Design
 
TM - Techniques
TM - TechniquesTM - Techniques
TM - Techniques
 

Similar to Compiler Design

2-Design Issues, Patterns, Lexemes, Tokens-28-04-2023.docx
2-Design Issues, Patterns, Lexemes, Tokens-28-04-2023.docx2-Design Issues, Patterns, Lexemes, Tokens-28-04-2023.docx
2-Design Issues, Patterns, Lexemes, Tokens-28-04-2023.docx
venkatapranaykumarGa
 

Similar to Compiler Design (20)

what is compiler and five phases of compiler
what is compiler and five phases of compilerwhat is compiler and five phases of compiler
what is compiler and five phases of compiler
 
Concept of compiler in details
Concept of compiler in detailsConcept of compiler in details
Concept of compiler in details
 
The Phases of a Compiler
The Phases of a CompilerThe Phases of a Compiler
The Phases of a Compiler
 
Compiler Construction
Compiler ConstructionCompiler Construction
Compiler Construction
 
unit1pdf__2021_12_14_12_37_34.pdf
unit1pdf__2021_12_14_12_37_34.pdfunit1pdf__2021_12_14_12_37_34.pdf
unit1pdf__2021_12_14_12_37_34.pdf
 
Chapter 1.pptx
Chapter 1.pptxChapter 1.pptx
Chapter 1.pptx
 
Chapter1pdf__2021_11_23_10_53_20.pdf
Chapter1pdf__2021_11_23_10_53_20.pdfChapter1pdf__2021_11_23_10_53_20.pdf
Chapter1pdf__2021_11_23_10_53_20.pdf
 
Introduction to Compilers
Introduction to CompilersIntroduction to Compilers
Introduction to Compilers
 
Compiler Design Material
Compiler Design MaterialCompiler Design Material
Compiler Design Material
 
Compiler Construction introduction
Compiler Construction introductionCompiler Construction introduction
Compiler Construction introduction
 
Chapter#01 cc
Chapter#01 ccChapter#01 cc
Chapter#01 cc
 
Unit 1.pptx
Unit 1.pptxUnit 1.pptx
Unit 1.pptx
 
Assignment1
Assignment1Assignment1
Assignment1
 
System software module 4 presentation file
System software module 4 presentation fileSystem software module 4 presentation file
System software module 4 presentation file
 
Dineshmaterial1 091225091539-phpapp02
Dineshmaterial1 091225091539-phpapp02Dineshmaterial1 091225091539-phpapp02
Dineshmaterial1 091225091539-phpapp02
 
Compiler an overview
Compiler  an overviewCompiler  an overview
Compiler an overview
 
Chapter-1.pptx compiler Design Course Material
Chapter-1.pptx compiler Design Course MaterialChapter-1.pptx compiler Design Course Material
Chapter-1.pptx compiler Design Course Material
 
Cpcs302 1
Cpcs302  1Cpcs302  1
Cpcs302 1
 
2-Design Issues, Patterns, Lexemes, Tokens-28-04-2023.docx
2-Design Issues, Patterns, Lexemes, Tokens-28-04-2023.docx2-Design Issues, Patterns, Lexemes, Tokens-28-04-2023.docx
2-Design Issues, Patterns, Lexemes, Tokens-28-04-2023.docx
 
Pros and cons of c as a compiler language
  Pros and cons of c as a compiler language  Pros and cons of c as a compiler language
Pros and cons of c as a compiler language
 

Recently uploaded

Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
MayuraD1
 

Recently uploaded (20)

Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 

Compiler Design

  • 1. Unit 1 - Mr. J. B. Patil -AISSMS’s IOIT Pune 1By Jaydeep Patil
  • 2. Syllabus • Introduction to compilers - Design issues, passes, phases, symbol table • Preliminaries - Memory management, Operating system support for compiler, • Compiler support for garbage collection • Lexical Analysis - Tokens, Regular Expressions, Process of Lexical analysis, Block Schematic, Automatic construction of lexical analyzer using LEX, LEX features and specification. 2By Jaydeep Patil
  • 3. Compiler • Programming languages are notations for describing computations to people and to machines. The world as we know it depends on programming languages, because all the software running on all the computers was written in some programming language. But, before a program can be run, it first must be translated into a form in which it can be executed by a computer. The software systems that do this translation are called compilers. • A compiler is a program that can read a program in one language - the source language - and translate it into an equivalent program in another language - the target language; An important role of the compiler is to report any errors in the source program that it detects during the translation process. 3By Jaydeep Patil
  • 5. Compiler • If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs 5By Jaydeep Patil
  • 6. Interpreter • An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user. 6By Jaydeep Patil
  • 7. • The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs . An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement. • Java language processors combine compilation and interpretation, as shown in Fig. 1.4. A Java source program may first be compiled into an intermediate form called bytecodes. The bytecodes are then interpreted by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one machine can be interpreted on another machine , perhaps across a network. 7By Jaydeep Patil
  • 10. The Structure of a Compiler • The Structure of compiler is divided into two parts: 1. Analysis 2. Synthesis 10By Jaydeep Patil
  • 11. The Structure of a Compiler • The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them. It then uses this structure to create an intermediate representation of the source program. • If the analysis part detects that the source program is either syntactically ill formed or semantically unsound, then it must provide informative messages, so the user can take corrective action. • The analysis part also collects information about the source program and stores it in a data structure called a symbol table, which is passed along with the intermediate representation to the synthesis part. • The synthesis part constructs the desired target program from the intermediate representation and the information in the symbol table. The analysis part is often called the front end of the compiler; the synthesis part is the back end. 11By Jaydeep Patil
  • 12. Phases of a compiler The compilation process is operates as a sequence of phases, each of which transforms one representation of the source program to another. The symbol table, which stores information about entire source program, is used by all phases of the compiler. 12By Jaydeep Patil
  • 14. Phases of Compiler • 1. Lexical Analysis Phase: The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form {token- name, attribute-value} • That it passes on to the subsequent phase, syntax analysis . In the token, the first component token- name is an abstract symbol that is used during syntax analysis , and the second component attribute-value points to an entry in the symbol table for this token. Information from the symbol- table entry 'is needed for semantic analysis and code generation. 14By Jaydeep Patil
  • 15. For example, suppose a source program contains the assignment statement position = initial + rate * 60………………………………………………………. (1.1) The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer: 1. position is a lexeme that would be mapped into a token {id, 1 ), where id is an abstract symbol standing for identifier and 1 points to the symbol table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type. 2. The assignment symbol = is a lexeme that is mapped into the token {=}. Since this token needs no attribute-value, we have omitted the second component 3. initial is a lexeme that is mapped into the token (id, 2 ) , where 2 points to the symbol-table entry for initial . 4. + is a lexeme that is mapped into the token (+). 5. rate is a lexeme that is mapped into the token ( id, 3 ) , where 3 points to the symbol-table entry for rate. 6. * is a lexeme that is mapped into the token (* ) . 7. 60 is a lexeme that is mapped into the token (60) . Blanks separating the lexemes would be discarded by the lexical analyzer. Figure 1.7 shows the representation of the assignment statement (1.1) after lexical analysis as the sequence of tokens ( id, l) ( = ) (id, 2) (+) (id, 3) ( * ) (60) ………………………..(1.2) In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and multiplication operators, respectively. 15By Jaydeep Patil
  • 16. Phases of Compiler 16By Jaydeep Patil
  • 17. Phases of Compiler • Syntax Analysis: The second phase of the compiler is syntax analysis or parsing. The parser uses the first components of the tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the grammatical structure of the token stream. A typical representation is a syntax tree in which each interior node represents an operation and the children of the node represent the arguments of the operation. 17By Jaydeep Patil
  • 18. Phases of Compiler 18By Jaydeep Patil
  • 19. Phases of Compiler • Semantic Analysis: The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition. It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation. • An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands. For example, many programming language definitions require an array index to be an integer; the compiler must report an error if a floating-point number is used to index an array. • The language specification may permit some type conversions called coercions. • For example, a binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point numbers. If the operator is applied to a floating-point number and an integer, the compiler may convert or coerce the integer into a floating-point number. 19By Jaydeep Patil
  • 20. Phases of Compiler 20By Jaydeep Patil
  • 21. Phases of Compiler • Intermediate Code Generation: In the process of translating a source program into target code, a compiler may construct one or more intermediate representations, which can have a variety of forms. Syntax trees are a form of intermediate representation; they are commonly used during syntax and semantic analysis. • After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine. This intermediate representation should have two important properties: it should be easy to produce and it should be easy to translate into the target machine. 21By Jaydeep Patil
  • 22. Phases of Compiler t1=inttofloat (60) t2=id3 * t1 t3=id2 + t2 id1 = t3 22By Jaydeep Patil
  • 23. Phases of Compiler 23By Jaydeep Patil
  • 24. Phases of Compiler • Code Optimization: The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code, or target code that consumes less power. For example, a straightforward algorithm generates the intermediate code, using an instruction for each operator in the tree representation that comes from the semantic analyzer. • A simple intermediate code generation algorithm followed by code optimization is a reasonable way to generate good target code. The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 by the floating- point number 60.0. Moreover, t3 is used only once to transmit its value to id1 so the optimizer can transform into the shorter sequence • t 1 = id3 * 60 . 0 • id1 = id2 + t 1 24By Jaydeep Patil
  • 25. Phases of Compiler • There is a great variation in the amount of code optimization different compilers perform. In those that do the most, the so-called "optimizing compilers," a significant amount of time is spent on this phase. There are simple optimizations that significantly improve the running time of the target program without slowing down compilation too much. 25By Jaydeep Patil
  • 27. Phases of Compiler • Code Generation: The code generator takes as input an intermediate representation of the source program and maps it into the target language. If the target language is machine code, registers Or memory locations are selected for each of the variables used by the program. Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task. A crucial aspect of code generation is the judicious assignment of registers to hold variables. 27By Jaydeep Patil
  • 28. Phases of Compiler • LDF R2 , id3 • MULF R2 , R2 , #60 . 0 • LDF R1 , id2 • ADDF R1 , R 1 , R2 • STF id1 , R1 28By Jaydeep Patil
  • 30. Phases of Compiler • Symbol-Table Management: An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name. These attributes may provide information about the storage allocated for a name, its type, its scope (where in the program its value may be used), and in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (for example, by value or by reference), and the type returned. • The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name. The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly. 30By Jaydeep Patil
  • 31. The Grouping of Phases into Passes • The discussion of phases deals with the logical organization of a compiler. In an implementation, activities from several phases may be grouped together into a pass that reads an input file and writes an output file. For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be grouped together into one pass. Code optimization might be an optional pass. Then there could be a back-end pass consisting of code generation for a particular target machine. 31By Jaydeep Patil
  • 32. Compiler- Construction Tools 1. Parser generators that automatically produce syntax analyzers from a grammatical description of a programming language. 2. Scanner generators that produce lexical analyzers from a regular- expression description of the tokens of a language. 3. Syntax-directed translation engines that produce collections of routines for walking a parse tree and generating intermediate code. 4. Code-generator generators that produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine. 5. Data-flow analysis engines that facilitate the gathering of information about how values are transmitted from one part of a program to each other part. Data-flow analysis is a key part of code optimization. 6. Compiler- construction toolkits that provide an integrated set of routines for constructing various phases of a compiler. 32By Jaydeep Patil
• 33. Runtime Environments • A compiler must accurately implement the abstractions embodied in the source language definition – such as names, scopes, bindings, data types, operators, procedures, parameters, and flow-of-control constructs. • The compiler creates and manages a run-time environment in which it assumes its target programs are being executed. 35By Jaydeep Patil
• 34. Runtime Environments 1. Program vs. program execution • A program consists of several procedures (functions). • An execution of a program is called a process. • An execution of a program causes the activation of the related procedures. • A name (e.g., a variable name) in a procedure may refer to different data objects in different activations. Notes: An activation may manipulate data objects allocated for its use. 36By Jaydeep Patil
• 35. Runtime Environments • Each execution of a procedure is referred to as an activation of the procedure. If the procedure is recursive, several of its activations may be alive at the same time. 37By Jaydeep Patil
• 36. Runtime Environments 2. Allocation and de-allocation of data objects – Managed by the run-time support package. – Allocation of a data object means reserving storage space for the data object bound to a name in the procedure. Notes: The representation of a data object at run time is determined by its type. 38By Jaydeep Patil
• 37. Runtime Environments Source Language Issues 1. Procedures: a procedure is a declaration that, in its simplest form, associates an identifier with a statement (or block of statements). 2. Functions: procedures that return values. 3. A complete program is also treated as a procedure. • When a procedure name appears within an executable statement, we say that the procedure is called at that point. The basic idea is that a procedure call executes the procedure body. • Some of the identifiers appearing in a procedure definition are special and are called the formal parameters of the procedure. 39By Jaydeep Patil
• 38. Runtime Environments Source Language Issues 2. Activation Trees • A tree used to depict the way control flow enters and leaves activations of a procedure. For a call f(4), the activation tree has root f(4) with children f(3) and f(2); f(3) in turn activates f(2) and f(1), and each activation of f(2) activates f(1) and f(0). The corresponding trace of control flow is:
Enter f(4)
Enter f(3)
Enter f(2)
Enter f(1)
Leave f(1)
Enter f(0)
Leave f(0)
Leave f(2)
Enter f(1)
Leave f(1)
Leave f(3)
Enter f(2)
Enter f(1)
Leave f(1)
Enter f(0)
Leave f(0)
Leave f(2)
Leave f(4)
40By Jaydeep Patil
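The trace above is exactly what a recursive Fibonacci-style function produces; a minimal sketch in C, assuming f is defined this way:

int f(int n) {                  /* each call creates a new activation of f */
    if (n < 2) return n;        /* f(0) and f(1) terminate the recursion   */
    return f(n - 1) + f(n - 2); /* f(n) activates f(n-1), then f(n-2)      */
}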
• 39. Runtime Environments Source Language Issues 3. Control stacks • A stack used to keep track of live procedure activations. 41By Jaydeep Patil
• 40. Runtime Environments Storage organization 1. Subdivision of Runtime Memory • Typical subdivision of run-time memory into code and data areas:
– Code: the generated target code
– Static data: statically allocated data objects
– Stack: a counterpart of the control stack that keeps track of procedure activations
– Heap: a separate area of run-time memory that holds all other information
42By Jaydeep Patil
• 41. Runtime Environments Storage organization 2. Activation Records/Frames 1) Definition: a contiguous block of storage that stores information needed by a single execution of a procedure. 43By Jaydeep Patil
• 42. Runtime Environments Storage organization 2. Activation Records 2) A general activation record contains the following fields:
– Returned value
– Actual parameters
– Optional control link
– Optional access link
– Saved machine status
– Local data
– Temporaries
44By Jaydeep Patil
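One way to visualize this layout is as a C struct; the field types and array sizes below are purely illustrative assumptions, not a fixed format:

struct activation_record {
    long  returned_value;                     /* value handed back to the caller     */
    long  actual_params[4];                   /* parameters supplied by the caller   */
    struct activation_record *control_link;   /* optional: points to caller's record */
    struct activation_record *access_link;    /* optional: record of enclosing scope */
    void *saved_machine_status;               /* return address, saved registers     */
    long  local_data[8];                      /* locals of this activation           */
    long  temporaries[8];                     /* compiler-generated temporaries      */
};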
• 43. Runtime Environments Storage allocation strategies 1. Storage allocation strategies:
– Static allocation: lays out storage for all data objects at compile time.
– Stack allocation: manages the run-time storage as a stack.
– Heap allocation: allocates and de-allocates storage as needed at run time from a heap.
46By Jaydeep Patil
• 44. Runtime Environments Storage allocation strategies 1. Static allocation
– The size of a data object and constraints on its position in memory must be known at compile time.
– Recursive procedures are restricted.
– Data structures cannot be created dynamically.
– Fortran permits static storage allocation.
47By Jaydeep Patil
• 45. Runtime Environments Storage allocation strategies 2. Stack allocation • Main idea:
– Based on the idea of a control stack; storage is organized as a stack, and activation records are pushed and popped as activations begin and end, respectively.
– Storage for the locals in each call of a procedure is contained in the activation record for that call.
– Locals are bound to fresh storage in each activation.
48By Jaydeep Patil
• 46. Runtime Environments Storage allocation strategies 3. Heap allocation
– A separate area of run-time memory, called a heap.
– Heap allocation parcels out pieces of contiguous storage, as needed for activation records or other objects.
– Pieces may be de-allocated in any order.
49By Jaydeep Patil
• 48. Runtime Environments Storage allocation strategies 4. Stack allocation 2) Calling sequences
– A call sequence allocates an activation record and enters information into its fields.
– A return sequence restores the state of the machine so the calling procedure can continue execution.
51By Jaydeep Patil
• 50. e.g. the calling sequence is like the following:
main() { …… q(); }
p()    { …… }
q()    { …… p(); …… }
53By Jaydeep Patil
• 51. The run-time stack at this point (TOP marks the top of the stack; top_sp points into the topmost record):
Activation record for main
Activation record for Q
Activation record for P   ← TOP / top_sp
54By Jaydeep Patil
• 52. Runtime Environments Storage allocation strategies 4. Stack allocation 2) Calling sequences. Steps:
(1) The caller evaluates the actual parameters.
(2) The caller stores a return address and the old value of top_sp into the callee's activation record.
(3) The callee saves register values and other status information.
(4) The callee initializes its local data and begins execution.
55By Jaydeep Patil
• 53. Runtime Environments Storage allocation strategies 4. Stack allocation 2) Calling sequences. A possible return sequence:
(1) The callee places the return value next to the activation record of the caller.
(2) Using the information in the status field, the callee restores top_sp and other registers, and branches to the return address in the caller's code.
(3) Although top_sp has been decremented, the caller can copy the returned value into its own activation record and use it to evaluate an expression.
56By Jaydeep Patil
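These steps can be modelled with a toy explicit stack in C; the frame layout and call_square below are hypothetical simplifications to show the mechanism, not how a real compiler emits code:

#include <stdio.h>

struct frame {
    int param;         /* actual parameter, stored by the caller (step 1)      */
    int saved_top;     /* old top_sp, stored into the callee's record (step 2) */
    int return_value;  /* placed next to the caller's record on return         */
};

static struct frame stack[64];
static int top_sp = 0;                   /* marks the top of the run-time stack */

int call_square(int x) {
    struct frame *ar = &stack[top_sp];   /* allocate the activation record      */
    ar->param = x;                       /* (1) caller evaluates parameters     */
    ar->saved_top = top_sp;              /* (2) caller saves the old top_sp     */
    top_sp++;                            /* control enters the callee           */
    ar->return_value = ar->param * ar->param; /* (3)-(4) callee does its work   */
    top_sp = ar->saved_top;              /* return sequence: restore top_sp     */
    return ar->return_value;             /* caller copies the returned value    */
}

int main(void) {
    printf("%d\n", call_square(6));      /* prints 36 */
    return 0;
}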
• 54. Variable-length data • In some languages, array size can depend on a value passed to the procedure as a parameter. • This and any other variable-sized data can still be allocated on the stack, but BELOW the callee's activation record. • In the activation record itself, we simply store POINTERS to the to-be-allocated data. 57By Jaydeep Patil
• 55. Example of variable-length data • All variable-length data is pointed to from the local data area. 58By Jaydeep Patil
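C99 variable-length arrays are one concrete instance of this scheme: the storage for the array is carved out of the stack at run time, while the frame effectively keeps a pointer and size for it. A minimal sketch:

#include <stdio.h>

void p(int n) {
    int a[n];                 /* size known only at run time; allocated on the stack */
    for (int i = 0; i < n; i++)
        a[i] = i * i;
    printf("a[%d] = %d\n", n - 1, a[n - 1]);
}                             /* a's storage vanishes when this activation ends */

int main(void) {
    p(5);                     /* prints a[4] = 16 */
    return 0;
}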
• 56. Runtime Environments Storage allocation strategies 4. Stack allocation 3) Stack allocation for non-nested procedures – When entering a procedure, the activation record of the procedure is set up on the top of the stack. – Initially, the global (static) variables and the activation record of the Main function are placed on the stack. 59By Jaydeep Patil
• 57. Runtime Environments Storage allocation strategies 4. Stack allocation 3) Stack allocation for non-nested procedures (stack layout, bottom to top):
global variables
Main AR
Q AR
P AR   ← TOP / SP
60By Jaydeep Patil
• 58. Runtime Environments Storage allocation strategies 4. Stack allocation 3) Stack allocation for non-nested procedures. A typical activation record layout:
– Caller's SP
– Return address of the call
– Number of parameters
– Formal parameters
– Local data
– Array data
– Temporaries
– Returned value
61By Jaydeep Patil
  • 60. Lexical Analysis • The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well. • When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. 68By Jaydeep Patil
  • 62. Lexical Analysis • Tokens, Patterns, and Lexemes • A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. • A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings. • A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token. 70By Jaydeep Patil
• 63. Lexical Analysis The table gives some typical tokens, their informally described patterns, and some sample lexemes. To see how these concepts are used in practice, consider the C statement: printf("Total = %d\n", score); Both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for token literal. 71By Jaydeep Patil
  • 64. Lexical Analysis • Attributes for Tokens: • E = M * C ** 2 – <id, pointer to symbol-table entry for E> – <assign_op> – <id, pointer to symbol-table entry for M> – <mult_op> – <id, pointer to symbol-table entry for C> – <exp_op> – <number, integer value 2> 72By Jaydeep Patil
• 65. Lexical Analysis • Lexical Errors: It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, if the string fi is encountered for the first time in a C program in the context: • fi ( a == f(x) ) . . . • a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler - probably the parser in this case - handle the error due to transposition of the letters. 73By Jaydeep Patil
• 66. Lexical Analysis • However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left. This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate. • Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
74By Jaydeep Patil
• 67. Lexical Analysis • Specification of Tokens: The patterns corresponding to a token are generally specified using a compact notation known as a regular expression. Regular expressions of a language are created by combining members of its alphabet. A regular expression r corresponds to a set of strings L(r), where L(r) is called a regular set or a regular language and may be infinite. A regular expression is defined as follows: • A basic regular expression a denotes the set {a} where a ∈ ∑; L(a) = {a} • The regular expression ε denotes the set {ε} • If r and s are two regular expressions denoting the sets L(r) and L(s), then they can be combined by the rules listed below. 75By Jaydeep Patil
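The standard combining rules (as in any compiler text) are:
• (r)|(s) is a regular expression denoting L(r) ∪ L(s)
• (r)(s) is a regular expression denoting L(r)L(s), the concatenation of the two sets
• (r)* is a regular expression denoting (L(r))*, the Kleene closure
• (r) is a regular expression denoting L(r), so parentheses only group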
• 70. Token Recognition • The previous section described the specification of tokens using regular expressions. This section elaborates how to construct recognizers that identify the tokens occurring in the input stream. These recognizers are known as Finite Automata. • A Finite Automaton (FA) consists of:
– A finite set of states
– A set of transitions (or moves) between states; the transitions are labeled by characters from the alphabet
– A special start state
– A set of final or accepting states
78By Jaydeep Patil
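As a concrete illustration (not part of the original slides), a finite automaton for the identifier pattern [a-zA-Z][a-zA-Z0-9]* can be hand-coded in C; state 1 is the single accepting state:

#include <ctype.h>

/* Simulate a DFA: 0 = start, 1 = inside an identifier (accepting), 2 = dead. */
int is_identifier(const char *s) {
    int state = 0;
    for (; *s; s++) {
        switch (state) {
        case 0: state = isalpha((unsigned char)*s) ? 1 : 2; break;
        case 1: state = isalnum((unsigned char)*s) ? 1 : 2; break;
        case 2: return 0;              /* no transition possible: reject */
        }
    }
    return state == 1;                 /* accept iff we end in a final state */
}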
  • 72. Lexical Analysis • Token Recognition: 80By Jaydeep Patil
  • 75. Input Buffering • Buffer Pairs: Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. An important scheme involves two buffers that are alternately reloaded, as suggested in Fig. 3.3. 83By Jaydeep Patil
• 76. Input Buffering • Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file and is different from any possible character of the source program. • Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
84By Jaydeep Patil
  • 77. Input Buffering • Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. • Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer. 85By Jaydeep Patil
• 78. Input Buffering • Sentinels: With buffer pairs we must check, each time we advance forward, that we have not moved off one of the buffers; if we do, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof. • Figure 3.4 shows the arrangement with the sentinels added. Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a buffer means that the input is at an end. 86By Jaydeep Patil
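The advance of forward with sentinels can be sketched in C-like code; reload(), the buffer names, and terminate_lexical_analysis() are hypothetical helpers standing in for the real machinery:

switch (*forward++) {
case EOF_SENTINEL:
    if (forward == end_of_buffer1) {        /* hit the sentinel of buffer 1 */
        reload(buffer2);                    /* refill the other buffer      */
        forward = buffer2;
    } else if (forward == end_of_buffer2) { /* hit the sentinel of buffer 2 */
        reload(buffer1);
        forward = buffer1;
    } else {
        terminate_lexical_analysis();       /* eof inside a buffer: real end of input */
    }
    break;
default:
    break;                                  /* ordinary character: one test sufficed */
}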
• 80. Automatic construction of lexical analyzer-(LEX) • Also known as a Scanner Generator. • Lex, or its more recent implementation Flex, allows one to specify a lexical analyzer by writing regular expressions to describe patterns for tokens. The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram. 88By Jaydeep Patil
  • 81. Automatic construction of lexical analyzer-(LEX) 89By Jaydeep Patil
• 82. Automatic construction of lexical analyzer-(LEX) • Structure of Lex Programs: • A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
90By Jaydeep Patil
• 83. Automatic construction of lexical analyzer-(LEX) • The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions. • The translation rules each have the form: Pattern { Action } Each pattern is a regular expression, which may use the regular definitions of the declarations section. The actions are fragments of code, typically written in C, although many variants of Lex using other languages have been created. • The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer. 91By Jaydeep Patil
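Putting the three sections together, a small complete example: a hypothetical Flex program that counts identifiers and integers (the names ids and nums are ours, not from the slides):

%{
/* declarations: copied verbatim to the top of the generated lex.yy.c */
#include <stdio.h>
int ids = 0, nums = 0;
%}
letter [a-zA-Z]
digit  [0-9]
%%
{letter}({letter}|{digit})*   { ids++; }
{digit}+                      { nums++; }
.|\n                          { /* ignore anything else */ }
%%
int yywrap(void) { return 1; }
int main(void) {
    yylex();
    printf("identifiers = %d, integers = %d\n", ids, nums);
    return 0;
}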
• 85. Lex • The input structure to Lex:
……Definitions section……
%%
……Rules section……
%%
……C code section (subroutines)……
• ECHO is an action and predefined macro in Lex that writes out the text matched by the pattern. 93By Jaydeep Patil
  • 87. Lex Regular Expressions (Extended Regular Expressions) • A regular expression matches a set of strings • Regular expression – Operators – Character classes – Arbitrary character – Optional expressions – Alternation and grouping – Context sensitivity – Repetitions and definitions 95By Jaydeep Patil
  • 88. Operators “ [ ] ^ - ? . * + | ( ) $ / { } % < > 96By Jaydeep Patil
• 89. Character Classes [] • [abc] matches a single character, which may be a, b, or c. • Inside brackets, every operator loses its special meaning except - and ^. • e.g.
[ab]      => a or b
[a-z]     => a or b or c or … or z
[-+0-9]   => all the digits and the two signs
[^a-zA-Z] => any character which is not a letter
97By Jaydeep Patil
• 90. Arbitrary Character . • To match almost any character, use the operator character . , which matches the class of all characters except newline. 98By Jaydeep Patil
• 91. Optional & Repeated Expressions
• a? => zero or one instance of a
• a* => zero or more instances of a
• a+ => one or more instances of a
• E.g.
ab?c => ac or abc
[a-z]+ => all strings of lower case letters
[a-zA-Z][a-zA-Z0-9]* => all alphanumeric strings with a leading alphabetic character
99By Jaydeep Patil
• 92. Pattern Matching Primitives
Metacharacter   Matches
.               any character except newline
\n              newline
*               zero or more copies of the preceding expression
+               one or more copies of the preceding expression
?               zero or one copy of the preceding expression
^               beginning of line (complement inside a character class)
$               end of line
a|b             a or b
(ab)+           one or more copies of ab (grouping)
[ab]            a or b
a{3}            exactly 3 instances of a
"a+b"           the literal string a+b
100By Jaydeep Patil
  • 93. Lex • Pattern Matching examples. 101By Jaydeep Patil
• 94. Lex Predefined Variables
• yytext -- a string containing the lexeme
• yyleng -- the length of the lexeme
• yyin -- the input stream pointer; the default input of the default main() is stdin
• yyout -- the output stream pointer; the default output of the default main() is stdout
% ./a.out < inputfile > outfile
• E.g.
[a-z]+     printf("%s", yytext);
[a-z]+     ECHO;
[a-zA-Z]+  {words++; chars += yyleng;}
102By Jaydeep Patil
• 95. Lex Library Routines
• yylex() – the default main() contains a call of yylex()
• yymore() – append the next matched string to the current yytext
• yyless(n) – retain the first n characters in yytext
• yywrap() – is called whenever Lex reaches an end-of-file; the default yywrap() always returns 1
103By Jaydeep Patil
• 96. Review of Lex Predefined Variables
Name               Function
char *yytext       pointer to the matched string
int yyleng         length of the matched string
FILE *yyin         input stream pointer
FILE *yyout        output stream pointer
int yylex(void)    call to invoke the lexer; returns a token
yymore()           append the next matched string to yytext
yyless(int n)      retain the first n characters in yytext
int yywrap(void)   wrap-up; return 1 if done, 0 if not done
ECHO               write the matched string
REJECT             go to the next alternative rule
INITIAL            initial start condition
BEGIN condition    switch start condition
104By Jaydeep Patil
• 97. User Subroutines Section • You can use your Lex routines in the same ways you use routines in other programming languages.
%{
void foo();
%}
letter [a-zA-Z]
%%
{letter}+ foo();
%%
…
void foo() {
…
}
105By Jaydeep Patil
• 98. User Subroutines Section (cont'd) • The section where main() is placed:
%{
int counter = 0;
%}
letter [a-zA-Z]
%%
{letter}+ {printf("a word\n"); counter++;}
%%
main() {
    yylex();
    printf("There are total %d words\n", counter);
}
106By Jaydeep Patil
• 99. Lex • Whitespace must separate the defining term and the associated expression. • Code in the definitions section is simply copied as-is to the top of the generated C file and must be bracketed with "%{" and "%}" markers. • Substitutions in the rules section are surrounded by braces ({letter}) to distinguish them from literals. 107By Jaydeep Patil
• 100. Lex includes this header file (y.tab.h, generated by yacc) and utilizes its definitions for token values. To obtain tokens, yacc calls yylex(). Function yylex() has a return type of int and returns a token. Values associated with the token are returned by lex in the variable yylval. 108By Jaydeep Patil
• 101. Usage • To run Lex on a source file, type:
lex scanner.l
• It produces a file named lex.yy.c, which is a C program for the lexical analyzer. • To compile lex.yy.c, type:
cc lex.yy.c -ll
• To run the lexical analyzer program, type:
./a.out < inputfile
109By Jaydeep Patil
  • 102. END of Unit 1 110By Jaydeep Patil