In this presentation we cover: Introduction to compilers - design issues, passes, phases, symbol table; Preliminaries - memory management, operating-system support for compilers, compiler support for garbage collection; and Lexical Analysis - tokens, regular expressions, the process of lexical analysis, block schematic, automatic construction of a lexical analyzer using LEX, and LEX features and specification.
1. Unit 1
- Mr. J. B. Patil
- AISSMS's IOIT Pune
By Jaydeep Patil
2. Syllabus
• Introduction to compilers - Design issues,
passes, phases, symbol table
• Preliminaries - Memory management,
Operating system support for compiler,
• Compiler support for garbage collection
• Lexical Analysis - Tokens, Regular Expressions,
Process of Lexical analysis, Block Schematic,
Automatic construction of lexical analyzer
using LEX, LEX features and specification.
3. Compiler
• Programming languages are notations for describing computations to
people and to machines. The world as we know it depends on
programming languages, because all the software running on all the
computers was written in some programming language. But, before a
program can be run, it first must be translated into a form in which it can
be executed by a computer. The software systems that do this translation
are called compilers.
• A compiler is a program that can read a program in one language - the
source language - and translate it into an equivalent program in another
language - the target language. An important role of the compiler is to
report any errors in the source program that it detects during the
translation process.
5. Compiler
• If the target program is an executable
machine-language program, it can then be
called by the user to process inputs and
produce outputs.
6. Interpreter
• An interpreter is another common kind of language
processor. Instead of producing a target program as a
translation, an interpreter appears to directly
execute the operations specified in the source
program on inputs supplied by the user.
7. • The machine-language target program produced by a
compiler is usually much faster than an interpreter at
mapping inputs to outputs. An interpreter, however, can
usually give better error diagnostics than a compiler,
because it executes the source program statement by
statement.
• Java language processors combine compilation and
interpretation, as shown in Fig. 1.4. A Java source program
may first be compiled into an intermediate form called
bytecodes. The bytecodes are then interpreted by a virtual
machine. A benefit of this arrangement is that bytecodes
compiled on one machine can be interpreted on another
machine, perhaps across a network.
10. The Structure of a Compiler
• The Structure of compiler is divided into
two parts:
1. Analysis
2. Synthesis
11. The Structure of a Compiler
• The analysis part breaks up the source program into constituent pieces
and imposes a grammatical structure on them. It then uses this
structure to create an intermediate representation of the source
program.
• If the analysis part detects that the source program is either
syntactically ill formed or semantically unsound, then it must provide
informative messages, so the user can take corrective action.
• The analysis part also collects information about the source program
and stores it in a data structure called a symbol table, which is passed
along with the intermediate representation to the synthesis part.
• The synthesis part constructs the desired target program from the
intermediate representation and the information in the symbol table.
The analysis part is often called the front end of the compiler; the
synthesis part is the back end.
12. Phases of a compiler
The compilation process operates as a sequence of phases, each of which transforms one representation of the source program into another. The symbol table, which stores information about the entire source program, is used by all phases of the compiler.
14. Phases of Compiler
• 1. Lexical Analysis Phase: The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form
⟨token-name, attribute-value⟩
that it passes on to the subsequent phase, syntax analysis. In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token. Information from the symbol-table entry is needed for semantic analysis and code generation.
15. For example, suppose a source program contains the assignment statement
position = initial + rate * 60 ………… (1.1)
The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token ⟨id, 1⟩, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token ⟨=⟩. Since this token needs no attribute-value, we have omitted the second component.
3. initial is a lexeme that is mapped into the token ⟨id, 2⟩, where 2 points to the symbol-table entry for initial.
4. + is a lexeme that is mapped into the token ⟨+⟩.
5. rate is a lexeme that is mapped into the token ⟨id, 3⟩, where 3 points to the symbol-table entry for rate.
6. * is a lexeme that is mapped into the token ⟨*⟩.
7. 60 is a lexeme that is mapped into the token ⟨60⟩.
Blanks separating the lexemes would be discarded by the lexical analyzer. Figure 1.7 shows the representation of the assignment statement (1.1) after lexical analysis as the sequence of tokens
⟨id, 1⟩ ⟨=⟩ ⟨id, 2⟩ ⟨+⟩ ⟨id, 3⟩ ⟨*⟩ ⟨60⟩ ………… (1.2)
In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and multiplication operators, respectively.
17. Phases of Compiler
• Syntax Analysis: The second phase of the
compiler is syntax analysis or parsing. The parser
uses the first components of the tokens produced
by the lexical analyzer to create a tree-like
intermediate representation that depicts the
grammatical structure of the token stream. A
typical representation is a syntax tree in which
each interior node represents an operation and
the children of the node represent the arguments
of the operation.
19. Phases of Compiler
• Semantic Analysis: The semantic analyzer uses the syntax tree and
the information in the symbol table to check the source program for
semantic consistency with the language definition. It also gathers type
information and saves it in either the syntax tree or the symbol table, for
subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands. For example,
many programming language definitions require an array index to be an
integer; the compiler must report an error if a floating-point number is
used to index an array.
• The language specification may permit some type conversions called
coercions.
• For example, a binary arithmetic operator may be applied to either a pair
of integers or to a pair of floating-point numbers. If the operator is applied
to a floating-point number and an integer, the compiler may convert or
coerce the integer into a floating-point number.
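The type-checking and coercion rules just described can be sketched as a small function over type codes. The type names and the coerced flag here are assumptions made for illustration; a real semantic analyzer works over the syntax tree and symbol table.

```c
/* Hypothetical type codes for this sketch. */
typedef enum { TY_INT, TY_FLOAT, TY_ERROR } Type;

/* Result type of a binary arithmetic operator.  When the operand
 * types are mixed, set *coerced: the compiler would insert an
 * inttofloat conversion at this point. */
Type check_binary(Type left, Type right, int *coerced) {
    *coerced = 0;
    if (left == right) return left;   /* int op int, or float op float */
    if ((left == TY_INT && right == TY_FLOAT) ||
        (left == TY_FLOAT && right == TY_INT)) {
        *coerced = 1;                 /* coerce the int operand to float */
        return TY_FLOAT;
    }
    return TY_ERROR;
}

/* Array indexing: the index must be an integer. */
Type check_index(Type array_elem, Type index) {
    return (index == TY_INT) ? array_elem : TY_ERROR;
}
```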
21. Phases of Compiler
• Intermediate Code Generation: In the process of
translating a source program into target code, a compiler may
construct one or more intermediate representations, which
can have a variety of forms. Syntax trees are a form of
intermediate representation; they are commonly used during
syntax and semantic analysis.
• After syntax and semantic analysis of the source program,
many compilers generate an explicit low-level or machine-like
intermediate representation, which we can think of as a
program for an abstract machine. This intermediate
representation should have two important properties: it
should be easy to produce and it should be easy to translate
into the target machine.
24. Phases of Compiler
• Code Optimization: The machine-independent code-optimization
phase attempts to improve the intermediate code so that better target
code will result. Usually better means faster, but other objectives may be
desired, such as shorter code, or target code that consumes less power.
For example, a straightforward algorithm generates the intermediate
code, using an instruction for each operator in the tree representation
that comes from the semantic analyzer.
• A simple intermediate code generation algorithm followed by code
optimization is a reasonable way to generate good target code. The
optimizer can deduce that the conversion of 60 from integer to floating
point can be done once and for all at compile time, so the inttofloat
operation can be eliminated by replacing the integer 60 by the floating-
point number 60.0. Moreover, t3 is used only once to transmit its value to
id1, so the optimizer can transform the code into the shorter sequence
• t1 = id3 * 60.0
• id1 = id2 + t1
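The two rewrites described above - folding the inttofloat of a constant, and eliminating the single-use temporary - can be sketched over a toy textual three-address representation. The Instr layout and the "copy" opcode are assumptions made for this illustration, not a real optimizer's data structures.

```c
#include <stdio.h>
#include <string.h>

/* A three-address instruction in a toy textual form:  dst = a1 op a2. */
typedef struct { char dst[8], op[12], a1[8], a2[8]; } Instr;

/* Two tiny optimizations; returns the new instruction count. */
int optimize(Instr code[], int n) {
    /* Fold  t = inttofloat(c):  substitute the floating-point constant
     * for t in later instructions and drop the conversion. */
    for (int i = 0; i < n; i++) {
        if (strcmp(code[i].op, "inttofloat") == 0) {
            char folded[8];
            snprintf(folded, sizeof folded, "%s.0", code[i].a1);
            for (int j = i + 1; j < n; j++) {
                if (strcmp(code[j].a1, code[i].dst) == 0) strcpy(code[j].a1, folded);
                if (strcmp(code[j].a2, code[i].dst) == 0) strcpy(code[j].a2, folded);
            }
            memmove(&code[i], &code[i + 1], (n - i - 1) * sizeof(Instr));
            n--; i--;
        }
    }
    /* Eliminate a trailing copy  x = t  when t is defined by the
     * instruction just above (t used only once). */
    if (n >= 2 && strcmp(code[n - 1].op, "copy") == 0 &&
        strcmp(code[n - 1].a1, code[n - 2].dst) == 0) {
        strcpy(code[n - 2].dst, code[n - 1].dst);
        n--;
    }
    return n;
}
```

Applied to the four-instruction sequence for position = initial + rate * 60, this leaves exactly the two instructions shown on the slide (modulo temporary names).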
25. Phases of Compiler
• There is a great variation in the amount of code optimization
different compilers perform. In those that do the most, the
so-called "optimizing compilers," a significant amount of time
is spent on this phase. There are simple optimizations that
significantly improve the running time of the target program
without slowing down compilation too much.
27. Phases of Compiler
• Code Generation: The code generator takes as input an
intermediate representation of the source program and maps
it into the target language. If the target language is machine
code, registers or memory locations are selected for each of
the variables used by the program. Then, the intermediate
instructions are translated into sequences of machine
instructions that perform the same task. A crucial aspect of
code generation is the judicious assignment of registers to
hold variables.
30. Phases of Compiler
• Symbol-Table Management: An essential function of a
compiler is to record the variable names used in the source
program and collect information about various attributes of each
name. These attributes may provide information about the storage
allocated for a name, its type, its scope (where in the program its
value may be used), and in the case of procedure names, such
things as the number and types of its arguments, the method of
passing each argument (for example, by value or by reference), and
the type returned.
• The symbol table is a data structure containing a record for each
variable name, with fields for the attributes of the name. The data
structure should be designed to allow the compiler to find the
record for each name quickly and to store or retrieve data from that
record quickly.
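A minimal symbol-table sketch in C, assuming a linear array of records; production compilers typically use hash tables and store far richer attributes (storage, scope, parameter information) than shown here.

```c
#include <string.h>

#define MAXSYMS 64
/* One record per name, with fields for its attributes (sketch). */
typedef struct { char name[32]; char type[16]; int scope; } SymEntry;

static SymEntry table[MAXSYMS];
static int count = 0;

/* Insert a name with its attributes; returns its index, or -1 if full. */
int st_insert(const char *name, const char *type, int scope) {
    if (count >= MAXSYMS) return -1;
    strcpy(table[count].name, name);
    strcpy(table[count].type, type);
    table[count].scope = scope;
    return count++;
}

/* Find the record for a name; returns its index, or -1 if absent.
 * Searching newest-first gives innermost-scope-wins behavior. */
int st_lookup(const char *name) {
    for (int i = count - 1; i >= 0; i--)
        if (strcmp(table[i].name, name) == 0) return i;
    return -1;
}
```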
31. The Grouping of Phases into Passes
• The discussion of phases deals with the logical organization of
a compiler. In an implementation, activities from several
phases may be grouped together into a pass that reads an
input file and writes an output file. For example, the front-end
phases of lexical analysis, syntax analysis, semantic analysis,
and intermediate code generation might be grouped together
into one pass. Code optimization might be an optional pass.
Then there could be a back-end pass consisting of code
generation for a particular target machine.
32. Compiler- Construction Tools
1. Parser generators that automatically produce syntax analyzers from a
grammatical description of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-
expression description of the tokens of a language.
3. Syntax-directed translation engines that produce collections of routines
for walking a parse tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a
collection of rules for translating each operation of the intermediate language
into the machine language for a target machine.
5. Data-flow analysis engines that facilitate the gathering of information
about how values are transmitted from one part of a program to each
other part. Data-flow analysis is a key part of code optimization.
6. Compiler- construction toolkits that provide an integrated set of routines
for constructing various phases of a compiler.
33. Runtime Environments
• A compiler must accurately implement the
abstractions embodied in the source language
definition.
– such as names, scopes, bindings, data types,
operators, procedures, parameters, and flow-of-
control constructs
• The compiler creates and manages a run-time
environment in which it assumes its target
programs are being executed.
34. Runtime Environments
1. Program vs. program execution
• A program consists of several procedures (functions)
• An execution of a program is called a process
• An execution of a program causes activation of the related procedures
• A name (e.g., a variable name) in a procedure may be bound to different data objects
Notes: An activation may manipulate data objects allocated for its use.
35. Runtime Environments
• Each execution of a procedure is referred to as an activation of the procedure. If the procedure is recursive, several of its activations may be alive at the same time.
36. Runtime Environments
2. Allocation and de-allocation of data objects
– Managed by the run-time support package
– Allocation of a data object binds storage space to the name denoting it in the procedure
Notes: The representation of a data object at run time is determined by its type.
37. Runtime Environments
Source Language Issues
1. Procedures: a procedure is a declaration that, in its simplest form, associates an identifier with a statement or block of statements.
2. Functions: a procedure that returns a value is often called a function.
3. A complete program is also treated as a procedure.
• When a procedure name appears within an executable statement, we say that the procedure is called at that point. The basic idea is that a procedure call executes the procedure body.
• Some of the identifiers appearing in a procedure definition are special and are called the formal parameters of the procedure.
38. Runtime Environments
Source Language Issues
2. Activation Trees
A tree depicting how control enters and leaves activations of a procedure:
f(4)
f(3) f(2)
f(2) f(1) f(1) f(0)
f(1) f(0)
Enter f(4)
Enter f(3)
Enter f(2)
Enter f(1)
Leave f(1)
Enter f(0)
Leave f(0)
Leave f(2)
Enter f(1)
Leave f(1)
Leave f(3)
Enter f(2)
Enter f(1)
Leave f(1)
Enter f(0)
Leave f(0)
Leave f(2)
Leave f(4)
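The activation tree above can be reproduced by instrumenting a recursive C function that follows the same call pattern f(n) = f(n-1) + f(n-2). The counter is an illustrative way to confirm that nine activations occur for f(4), one per node of the tree.

```c
/* Count activations of a recursive procedure (sketch). */
static int activations = 0;

int f(int n) {
    activations++;               /* "Enter f(n)": a new activation begins */
    if (n < 2) return n;         /* base cases f(0) = 0, f(1) = 1 */
    return f(n - 1) + f(n - 2);  /* two child activations in the tree */
}                                /* returning corresponds to "Leave f(n)" */
```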
40. Runtime Environments
Storage organization
1. Subdivision of Runtime Memory
Typical subdivision of run-time memory into code and data areas:
• Code – the generated target code
• Static data – static data objects
• Stack – a counterpart of the control stack, to keep track of procedure activations
• Heap – a separate area of run-time memory that holds all other information
42. Runtime Environments
Storage organization
2. Activation Records
A general activation record contains:
• Returned value
• Actual parameters
• Optional control link
• Optional access link
• Saved machine status
• Local data
• Temporaries
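As a rough illustration, the fields listed above can be written down as a C struct. The field sizes and types here are arbitrary assumptions; a real compiler lays these out as offsets in the target machine's stack frame, not as a source-level struct.

```c
/* A sketch of a general activation record.  Field order follows the
 * slide; sizes are illustrative only. */
typedef struct ActivationRecord {
    int returned_value;                    /* value handed back to the caller  */
    int actual_params[4];                  /* arguments passed by the caller   */
    struct ActivationRecord *control_link; /* caller's record (optional)       */
    struct ActivationRecord *access_link;  /* enclosing scope (optional)       */
    int saved_machine_status;              /* return address, saved registers  */
    int local_data[8];                     /* the procedure's local variables  */
    int temporaries[4];                    /* compiler-generated temporaries   */
} ActivationRecord;
```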
43. Runtime Environments
Storage allocation strategies
– Static allocation
• Lays out storage for all data objects at compile time.
– Stack allocation
• Manages the run-time storage as a stack.
– Heap allocation
• Allocates and de-allocates storage as needed at run
time from a heap.
44. Runtime Environments
Storage allocation strategies
1. Static allocation
– The size of a data object and constraints on its
position in memory must be known at compile
time.
– Recursive procedures are restricted.
– Data structures cannot be created dynamically.
– Fortran permits static storage allocation.
45. Runtime Environments
Storage allocation strategies
2. Stack allocation
Main idea
– Based on the idea of a control stack; storage is
organized as a stack, and activation records are
pushed and popped as activations begin and end,
respectively.
– Storage for the locals in each call of a procedure is
contained in the activation record for that call.
– Locals are bound to fresh storage in each activation.
46. Runtime Environments
Storage allocation strategies
3. Heap allocation
– A separate area of run-time memory, called a
heap.
– Heap allocation parcels out pieces of contiguous
storage, as needed for activation records or other
objects.
– Pieces may be de-allocated in any order.
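The key property stated above - pieces may be de-allocated in any order - is exactly what the C heap interface provides. A small sketch; the allocation sizes and the string are arbitrary:

```c
#include <stdlib.h>
#include <string.h>

/* Heap allocation sketch: storage obtained from the heap may be
 * released in any order, unlike stack-held activation records. */
int demo_heap_order(void) {
    char *a = malloc(16), *b = malloc(16), *c = malloc(16);
    if (!a || !b || !c) return -1;
    strcpy(b, "middle");
    free(a);                            /* first piece released first... */
    free(c);                            /* ...then the last piece...     */
    int ok = strcmp(b, "middle") == 0;  /* b is still valid              */
    free(b);                            /* ...and the middle piece last  */
    return ok;
}
```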
48. Runtime Environments
Storage allocation strategies
4. Stack allocation
2) Calling sequences
– A calling sequence allocates an activation record and enters information into its fields.
– A return sequence restores the state of the machine so the calling procedure can continue execution.
52. Runtime Environments
Storage allocation strategies
4. Stack allocation
2) Calling sequences
Steps:
(1) The caller evaluates the actual parameters.
(2) The caller stores a return address and the old value of top-sp into the callee's activation record.
(3) The callee saves register values and other status information.
(4) The callee initializes its local data and begins execution.
53. Runtime Environments
Storage allocation strategies
4. Stack allocation
2) Calling sequences
A possible return sequence:
– (1) The callee places a return value next to the activation record of the caller.
– (2) Using the information in the status field, the callee restores top-sp and other registers and branches to a return address in the caller's code.
– (3) Although top-sp has been decremented, the caller can copy the returned value into its own activation record and use it to evaluate an expression.
54. Variable-length data
• In some languages, array size can depend on a
value passed to the procedure as a parameter.
• This and any other variable-sized data can still
be allocated on the stack, but BELOW the
callee’s activation record.
• In the activation record itself, we simply store
POINTERS to the to-be-allocated data.
55. Example of variable-length data
• All variable-length data is pointed to from the local data area.
56. Runtime Environments
Storage allocation strategies
4. Stack allocation
3) Stack allocation for non-nested procedures
– When entering a procedure, the activation record
of the procedure is set up on the top of the stack
– Initially, the global (static) variables and the activation record of the Main function are placed on the stack
57. Runtime Environments
Storage allocation strategies
4. Stack allocation
3) Stack allocation for non-nested procedures
Stack contents (bottom to top): global variables, Main AR, Q AR, P AR, with TOP/SP marking the top of the stack.
58. Runtime Environments
Storage allocation strategies
4. Stack allocation
3) Stack allocation for non-nested procedures
A typical activation record layout:
• Caller's SP
• Return address
• Number of parameters
• Formal parameters
• Local data
• Array data
• Temporaries
• Returned value
60. Lexical Analysis
• The main task of the lexical analyzer is to read the
input characters of the source program, group
them into lexemes, and produce as output a
sequence of tokens for each lexeme in the source
program. The stream of tokens is sent to the
parser for syntax analysis. It is common for the
lexical analyzer to interact with the symbol table
as well.
• When the lexical analyzer discovers a lexeme
constituting an identifier, it needs to enter that
lexeme into the symbol table.
62. Lexical Analysis
• Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute value. The
token name is an abstract symbol representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input characters denoting an identifier. The
token names are the input symbols that the parser processes.
• A pattern is a description of the form that the lexemes of a token may take. In the
case of a keyword as a token, the pattern is just the sequence of characters that
form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.
• A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that
token.
63. Lexical Analysis
The table gives some typical tokens, their informally described patterns, and some sample lexemes. To see how these concepts are used in practice, consider the C statement
printf("Total = %d\n", score);
Both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for literal.
64. Lexical Analysis
• Attributes for Tokens:
• E = M * C ** 2
– <id, pointer to symbol-table entry for E>
– <assign_op>
– <id, pointer to symbol-table entry for M>
– <mult_op>
– <id, pointer to symbol-table entry for C>
– <exp_op>
– <number, integer value 2>
65. Lexical Analysis
• Lexical Errors: It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, if the string fi is encountered for the first time in a C program in the context:
• fi ( a == f(x) ) . . .
• a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler - probably the parser in this case - handle the error due to transposition of the letters.
66. Lexical Analysis
• However, suppose a situation arises in which the lexical analyzer is
unable to proceed because none of the patterns for tokens matches
any prefix of the remaining input. The simplest recovery strategy is
"panic mode" recovery. We delete successive characters from the
remaining input, until the lexical analyzer can find a well-formed
token at the beginning of what input is left. This recovery technique
may confuse the parser, but in an interactive computing
environment it may be quite adequate.
• Other possible error-recovery actions are:
– 1. Delete one character from the remaining input.
– 2. Insert a missing character into the remaining input.
– 3. Replace a character by another character.
– 4. Transpose two adjacent characters.
67. Lexical Analysis
• Specification of Tokens: The patterns corresponding to a token are generally specified using a compact notation known as a regular expression. Regular expressions of a language are created by combining members of its alphabet. A regular expression r corresponds to a set of strings L(r), where L(r) is called a regular set or a regular language and may be infinite. A regular expression is defined as follows:
• A basic regular expression a denotes the set {a}, where a ∈ Σ; L(a) = {a}
• The regular expression ε denotes the set {ε}
• If r and s are two regular expressions denoting the sets L(r) and L(s), then the following rules apply for building regular expressions
70. Token Recognition
• The previous section described the specification of the tokens of a language using regular expressions. This section elaborates how to construct recognizers that can identify the tokens occurring in the input stream. These recognizers are known as finite automata.
• A Finite Automaton (FA) consists of:
• A finite set of states
• A set of transitions (or moves) between states: the transitions are labeled by characters from the alphabet
• A special start state
• A set of final or accepting states
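One such recognizer - a DFA for the identifier pattern [A-Za-z][A-Za-z0-9]* - can be written as an explicit state machine in C. The state names are illustrative; a generated scanner would use transition tables instead.

```c
#include <ctype.h>

/* States of a DFA recognizing identifiers [A-Za-z][A-Za-z0-9]*. */
enum { START, IN_ID, DEAD };

int accepts_identifier(const char *s) {
    int state = START;
    for (; *s; s++) {
        unsigned char ch = (unsigned char)*s;
        switch (state) {
        case START: state = isalpha(ch) ? IN_ID : DEAD; break;
        case IN_ID: state = isalnum(ch) ? IN_ID : DEAD; break;
        }
        if (state == DEAD) return 0;   /* no transition: reject */
    }
    return state == IN_ID;             /* IN_ID is the accepting state */
}
```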
75. Input Buffering
• Buffer Pairs: Because of the amount of time taken to process characters
and the large number of characters that must be processed during the
compilation of a large source program, specialized buffering techniques
have been developed to reduce the amount of overhead required to
process a single input character. An important scheme involves two
buffers that are alternately reloaded, as suggested in Fig. 3.3.
76. Input Buffering
• Each buffer is of the same size N, and N is usually the
size of a disk block, e.g., 4096 bytes. Using one system
read command we can read N characters into a buffer,
rather than using one system call per character. If fewer
than N characters remain in the input file, then a
special character, represented by eof, marks the end of
the source file and is different from any possible
character of the source program.
• Two pointers to the input are maintained:
1. Pointer lexemeBegin, marks the beginning of the current
lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
77. Input Buffering
• Once the next lexeme is determined, forward is set to the
character at its right end. Then, after the lexeme is recorded
as an attribute value of a token returned to the parser,
lexemeBegin is set to the character immediately after the
lexeme just found.
• Advancing forward requires that we first test whether we
have reached the end of one of the buffers, and if so, we must
reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer.
78. Input Buffering
• Sentinels: In Buffered Pair we must check, each time we
advance forward, that we have not moved off one of the
buffers; if we do, then we must also reload the other buffer.
Thus, for each character read, we make two tests: one for
the end of the buffer, and one to determine what character
is read. We can combine the buffer-end test with the test
for the current character if we extend each buffer to hold a
sentinel character at the end. The sentinel is a special
character that cannot be part of the source program, and
a natural choice is the character eof.
• Figure 3.4 shows the arrangement with the sentinels
added. Note that eof retains its use as a marker for the end
of the entire input. Any eof that appears other than at the
end of a buffer means that the input is at an end.
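The buffer pair with sentinels can be sketched in C. In this sketch '\0' stands in for the eof sentinel and a string stands in for the input file; the buffer size N is deliberately tiny so that reloads happen often. These simplifications are assumptions for the illustration.

```c
/* Two half-buffers, each with room for N characters plus a sentinel. */
#define N 8
static char buf[2 * (N + 1)];
static const char *src;       /* stands in for the input file */
static char *forward;

/* Fill one half-buffer from the input and park forward at its start. */
static void reload(int half) {
    char *p = buf + half * (N + 1);
    int i = 0;
    while (i < N && *src) p[i++] = *src++;
    p[i] = '\0';              /* sentinel: plays the role of eof */
    forward = p;
}

/* Return the next input character.  The common case is a single test;
 * hitting a sentinel triggers either a reload or real end of input. */
char next_char(void) {
    if (*forward == '\0') {
        if (!*src) return '\0';                    /* true end of input */
        reload(forward < buf + N + 1 ? 1 : 0);     /* reload other half */
    }
    return *forward++;
}

void init_input(const char *text) { src = text; reload(0); }
```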
80. Automatic construction of lexical
analyzer-(LEX)
• Also known as a scanner generator.
• Lex (or, in a more recent implementation, Flex) allows one to specify a lexical analyzer by giving regular expressions to describe patterns for tokens. The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
82. Automatic construction of lexical
analyzer-(LEX)
• Structure of Lex Programs:
• A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
83. Automatic construction of lexical
analyzer-(LEX)
• The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions.
• The translation rules each have the form Pattern { Action }. Each pattern is a regular expression, which may use the regular definitions of the declarations section. The actions are fragments of code, typically written in C, although many variants of Lex using other languages have been created.
• The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
89. Character Classes []
• [abc] matches a single character, which may be a,
b, or c
• Inside a character class, every operator loses its special meaning except - and ^
• e.g.
[ab] => a or b
[a-z] => a or b or c or … or z
[-+0-9] => all the digits and the two signs
[^a-zA-Z] => any character which is not a
letter
90. Arbitrary Character .
• To match almost any character, use the operator . , which is the class of all characters except newline
91. Optional & Repeated Expressions
• a? => zero or one instance of a
• a* => zero or more instances of a
• a+ => one or more instances of a
• E.g.
ab?c => ac or abc
[a-z]+ => all strings of lower case letters
[a-zA-Z][a-zA-Z0-9]* => all alphanumeric
strings with a leading alphabetic character
92. Pattern Matching Primitives
Metacharacter Matches
. any character except newline
\n newline
* zero or more copies of the preceding expression
+ one or more copies of the preceding expression
? zero or one copy of the preceding expression
^ beginning of line / complement
$ end of line
a|b a or b
(ab)+ one or more copies of ab (grouping)
[ab] a or b
a{3} 3 instances of a
“a+b” literal “a+b”
94. Lex Predefined Variables
• yytext -- a string containing the lexeme
• yyleng -- the length of the lexeme
• yyin -- the input stream pointer
– the default input of default main() is stdin
• yyout -- the output stream pointer
– the default output of default main() is stdout.
• cs20: %./a.out < inputfile > outfile
• E.g.
[a-z]+ printf(“%s”, yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
95. Lex Library Routines
• yylex()
– The default main() contains a call of yylex()
• yymore()
– appends the next matched text to the current yytext
• yyless(n)
– retain the first n characters in yytext
• yywrap()
– is called whenever Lex reaches an end-of-file
– The default yywrap() always returns 1
96. Review of Lex Predefined Variables
Name Function
char *yytext pointer to matched string
int yyleng length of matched string
FILE *yyin input stream pointer
FILE *yyout output stream pointer
int yylex(void) call to invoke lexer, returns token
char* yymore(void) append the next matched string to yytext
int yyless(int n) retain the first n characters in yytext
int yywrap(void) wrapup, return 1 if done, 0 if not done
ECHO write matched string
REJECT go to the next alternative rule
INITIAL initial start condition
BEGIN condition switch start condition
97. User Subroutines Section
• You can use your Lex routines in the same ways
you use routines in other programming languages.
%{
void foo();
%}
letter [a-zA-Z]
%%
{letter}+ foo();
%%
…
void foo() {
…
}
98. User Subroutines Section (cont’d)
• The section where main() is placed
%{
int counter = 0;
%}
letter [a-zA-Z]
%%
{letter}+ {printf("a word\n"); counter++;}
%%
main() {
yylex();
printf("There are total %d words\n", counter);
}
99. Lex
Whitespace must separate the defining term and the associated expression.
Code in the definitions section is simply copied as-is to the top of the generated C file and
must be bracketed with “%{“ and “%}” markers.
Substitutions in the rules section are surrounded by braces ({letter}) to distinguish them
from literals.
100. Lex includes this file and utilizes the definitions for token values. To obtain tokens, yacc calls yylex. Function yylex has a return type of int and returns a token. Values associated with the token are returned by lex in the variable yylval.
101. Usage
• To run Lex on a source file, type
lex scanner.l
• It produces a file named lex.yy.c which is a C
program for the lexical analyzer.
• To compile lex.yy.c, type
cc lex.yy.c –ll
• To run the lexical analyzer program, type
./a.out < inputfile