SS & CD Module 3

CompilerDesign
Module 2
Module 3
Chapter 1: Introduction
Chapter 2: Lexical Analysis

CompilerDesign
Module 3
Chapter 1: Introduction

Compilers
• “Compilation”
Translation of a program written in a source language into a semantically
equivalent program written in a target language
Compiler
Error messages
Source
Program
Target
Program
Input
Output

Interpreters
• “Interpretation”
• Performing the operations implied by the source program
Interpreter
Source
Program
Input
Output
Error messages

Preprocessors, Compilers, Assemblers,
and Linkers
Preprocessor
Compiler
Assembler
Linker
Skeletal Source Program
Source Program
Target Assembly Program
Relocatable Object Code
Absolute Machine Code
Libraries and
Relocatable Object Files

Phases
of
Compiler lexical analyzer
syntax analyzer
semantic analyzer
source program
tokens
parse trees
parse trees
intermediate code generator
code optimizer
code generator
intermediate code
optimized intermediate code
target program

Lexical Analysis
• The first phase of compiler is called as lexical analysis or
scanning
• The lexical analyser news live stream of characters making of
the source program and groups the character into meaningful
sequence of lexemes.
• The lexical analyser produces has an output of a token in the
form
<token name, attribute value>
• In the token the first component token name is the abstract
sample that is used during the syntax analysis and the
component attribute value points to the entry in the simple
table for this token
• Ex: Source input is
position =initial + rate * 60

• Position : the lexeme will match this as <id, 1>
• = : the lexeme will match this as <=>, because = is an abstract
symbol.
• Initial : the lexeme will match this as <id, 2>
• + : the lexeme will match this as <+>
• Rate: the lexeme will match this as <id, 3>
• * : the lexeme will match this as <*>
• 60: the lexeme will match this as <60>
<id,1> <=> <id, 2> <+> <id, 3> <*> <60> => lexmes

Syntax Analysis
parse tree
<id,1> <=> <id, 2> <+> <id, 3> <*> <60>
Syntax Analysis
<id, 1>
=
*
+
<id, 2>
60
<id, 3>

Semantic Analysis
<id, 1>
=
*
+
<id, 2>
60
<id, 3>
Semantic Analysis
<id, 1>
=
*
+
<id, 2>
60.0
<id, 3> int to float

Symbol Table
• There is a record for each identifier
• The attributes include name, type, location, etc.

Intermediate Code Generation
<id, 1>
=
*
+
<id, 2>
60.0
<id, 3> int to float
Intermediate Code Generation
t1= inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

Code Optimizer
Code Optimizer
t1= inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
t1= id3*60.0
id1 = id2+t1

Code Generation
Code Generation
t1 = id3*60.0
id1 = id2+t1
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1,R1,R2
STF id1, R1

Qualities of a Good Compiler
What qualities would you want in a compiler?
• generates correct code (first and foremost!)
• generates fast code
• conforms to the specifications of the input language
• copes with essentially arbitrary input size, variables, etc.
• compilation time (linearly)proportional to size of source
• good diagnostics
• consistent optimisations
• works well with the debugger

The Evolution of Programing
Language
• The Move to High Level Language
1. A major step towards higher-level languages was nade in the
lat the 1930's with the development of Fortran for scientiic co
for business data processing, and Lisp for symbollc
computation.
2. Classification based on generation
a) First generation : machine languages
b) Second Generation : assembly languages
c) Third generation : high level languages like fortran, Cobol,C,C++
and Java
d) Fourth Generation : languages designed for specific applications
like SQL for database, NOMAD for report generation etc
e) Fifth generation : languages applied to logic and constraints based
like prolog andOPS5

3. Classification of languages uses the term imperative and
declarative
a) imperative in which a program specifies how a computation is to
be done
b) declarative for languages in which a program specifies what
computation is to be done
c) Languages such as C. C++, C#, and Java are imperative languages
d) Functional languages such as ML and Haskell and constrain
languages such as Prolog are often considered to be declarative
languages
4. The term on Neumann language is applied to programming a whose
computational mode is based on the von Neumann computer
architecture , Many of today's languages, such as Fortran and C are von
Neumann languages
5. Based object oriented languages and scripting languages

CompilerDesign
Module 2
Module 3
Chapter 2: Lexical Analysis

Lexical Analysis
The first phase of the compiler, the main task of the lexical analyser is
to read the input characters of the source program,
group them into lexemes
produce the output as a sequence of tokens for each lexeme in the
source program
• As shown in the figure the call suggested by get next token command
causes the lexical analyser to read the characters from its input until it can
identify the next legs and produce for it the next token which in returns to
the parser.
Lexical
analyzer
symbol
table
parser
Source
program
token
getNexttoken()

Some Terminology
• A token is a pair consisting of a token name and optional
attribute value
• A pattern is a description of the form that the lexemes of a
token may take
• A Lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical
analyser has a instance of that token

Lexical analyser is divided into cascade of two processor scanning
and lexical analysis
• Scanning: consists of a multiple processes that do not require
tokenization of the input such as deletion of comments
compaction of consecutive whitespace characters into one
• Lexical analysis: is a proper is the more complex portion which
produces the tokens from the output of the scanner
• Token syntax is
<token name, attribute value>

Ex: E = M * C ** 2
• For the above source program, the tokens are generated as by
using attribute value itself.
• <id, E>
• < assign_op>
• <id, M>
• < multi_op>
• <id, C>
• <exp_op>
• <number, 2>
Or
• <2>

Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of other
components, that there is a source-code error. For instance, if
the string fi is encountered for the first time in a C program in
the context:
Ex: fi (a=s f(x))
• a lexical analyzer cannot tell whether fi is a misspelling of the
keyword if or an undeclared function identifier.
• Since fi is a valid lexeme for the token id, the lexical analyzer
must return the token id to the parser and let some other
phase of the compiler—probably the parser in this case handle
an error due to transposition of the letters.

• The simplest recovery strategy is "panic mode recovery”.
• We delete successive characters from the remaining input,
until the lexical analyzer can find a well-formed token at the
beginning of what input is left.
• This recover technique may confuse the parser, but in an
interactive computing environment it may be quite adequate.
• Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.

Input Buffering
• For instance we cannot be sure we've seen the end of an
identifier until we see a character that is not a letter or digit, and
therefore is not part of the lexeme for id.
• In C. single-character operators like -, , or < could also be the
beginning of a two-character operator like ->, s, or <#.
• Thus, we shall introduce a two-buffer scheme that handles large
lookhead safely and sentinels that saves the time checking for
the end of buffers.

Buffer Pair
• Specialized buffering techniques have been developed to reduce
the amount of overhead required to process a single input
character.
• An important scheme involves two buffers that are alternately
reloaded, as suggested in Fig.
forward
lexemeBegin
Fig: using a pair of input buffer
E = M * C * * 2 eof

• Two pointers to the input are maintained:
1. Pointer lexemeBegin, marks the beginning of the current lexeme,
whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
• Once the next lexeme is determined, forward is set to the
character at its right end. Then, after the lexeme is recorded as an
attribute value of a token returned to the parser, lexemebegin is
set to character immediately after the lexeme just found

Sentinels
• We must check, each time we advance forward, that we have not
moved off one of the buffers; if we do, then we must also reload the
other buffer.
• Thus for each character read, we make two tests:
one for the end of the buffer, &
one to determine what character is read.
• We can combine the buffer-end test with the test for the current
character if we extend each buffer to hold a sentinel character at the
end.
• The sentinel is a special character that cannot be part of the source
program, and a natural choice is the character eof.
E = M eof * C * * 2 eof eof
lexemeBegin forward

Specification of Tokens
Strings and Languages
1. {0, 1} is an binary alphabet.
2. A string over alphabet is a finite sequence of symbols drawn from that
alphabet.
3. The empty string is denoted by , the sting length is zero.
4. A language is any countable set of strings over some fixed alphabets
5. The language L containing an empty string is represented by { }

Specification of Tokens
Regular Expression
• Given an alphabet ,
1. is a regular expression and L { } is { }, the language whose member is
an empty string.
2. For each a , a is a regular expression denote {a}, the set containing
the string a.
3. r and s are regular expressions denoting the language L(r ) and L(s ). Then
 ( r ) | ( s ) is a regular expression denoting L( r ) U L( s )
 ( r ) ( s ) is a regular expression denoting L( r ) L ( s )
 ( r )* is a regular expression denoting (L ( r )) *

• Let = {a, b}
• a *
• a +
• a | b
• (a | b) (a | b)
• (a | b)*
• a | a*b
• The Grammar a|b identifies the language {a,b}
• The Grammar (a|b) (a|b) identifies the language {aa,ab,ba,bb}
• The Grammar a* identifies the language consisting string of zero or
more occurrences of a {, a, aa,aaaa,aaaa,aaaaa,}
• The Grammar a+ identifies the language consisting string of one or
more occurrences of a { a,aa,aaaa,aaaa,aaaaa,}
• The Grammar (a|b)* identifies the language {, a,b, aa, ab, ba, bb….}
• The Grammar a|a*b identifies the language {a, ab,aab,aaaab,….}

• Ex 1: To identify letters, digits, underscore
letters A|B|……|Z|a|b|…..|z|_
digit 0|1|2|…|9
id letter (letter|digit)*

• Ex 2: To identify unsigned numbers (integers or floating point)
such as 48618, 516.14,166.2-4e13, 0.15456E9
digit 0|1|2|….|9
digits digit digit*
optionalFraction . digits |
optinalExponent (e|E(+|-| )) digits |
number digits optionalFraction optinalExponent

Ex 1:
letters A|B|……|Z|a|b|…..|z|_
digit 0|1|2|…|9
id letter (letter|digit)
Updated to
• letter [A-Za-z_]
• digit [0-9]
• id letter (letter|digit)*

Ex 2:
digit 0|1|2|….|9
digits digit digit*
optionalFraction . digit |
optinalExponent (e|E(+|-| )digits) |
number digits optionalFraction optinalExponent
Updated to
• digit [0-9]
• digits digit+
• number digits(.digits)? (e|E[+-]? digits )?

• For the given regular expression grammar , describe the language
1. a(a|b)*a
2. (a|b)*a(a|b)(a|b)
3. a*ba*ba*ba*

Recognition of Token
• In our discussion we will make use of Dandling if loop statement.
stmt if expr then stmt
| if expr then stmt else stmt
|
expr term relop term
| term
term id
| number
A grammar for branching statement

• The grammar fragment describes a simple form of branching
statements and conditional expressions.
• For relop, we use the comparison operators of languages like
Pascal or SQL where = is "equals" and <> is "not equals," because
it presents an interesting structure of lexemes.
• The terminals of the grammar, which are if, then, else, relop, id,
and number, are the names of tokens as far as the lexical
analyzer is concerned.
• The patterns for these tokens are described using regular
definitions, as shown below.

digit [0-9]
digits digit+
number digits(.digits)? (e[+-]? digits )?
letter [A-Za-z_]
id letter (letter|digit)*
if if
then then
else else
relop < | > | <= | >= | = | < >
Patterns for token for if-else-then statement
For this language, the lexical analyzer will recognize the keywords
if, then, and else, as well as lexemes that match the
patterns for relop, id, and number

Lexemes Token name Attribute Value
if if -
then then -
else else -
any id id Pointer to table entry
any number number Pointer to table entry
< relop LT
<= relop LE
= relop EQ
< > relop NE
> relop GT
>= relop GE

Transition Diagrams
1. We always indicate an accepting state by a double circle.
2. If it is necessary to retract the forward pointer one position,
then we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state it is
indicated by an edge labeled "start " entering from nowhere.

0
6
5
4
3
2
1
8
7
return ( relop, LE)
return ( relop, NE)
return ( relop, LT)
return ( relop, EQ)
return ( relop, GE)
return ( relop, GT)
others
others
start < =
=
=
>
>
*
*
Transition diagram for relop

• We begin in state 0, the start state.
• If we see < as the first input symbol, then among the lexemes that
match the pattern for relop, we can be looking at <,<>, or <=.
• We therefore go to state 1, and look at the next character.
• If it is =, then we recognize lexeme <=, enter state 2, and return the
token relop with attribute LE, the symbolic constant representing
this particular comparison operator.
• If in state 1 the next character is >, then instead we have lexeme < >,
and enter state 3 to return an indication that the not-equals
operator has been found.
• On any other character, the lexeme is < and we enter state 4 to
return that information.
• Note, however, that state 4 has a* to indicate that we must retract
the input one position.

• Transition diagram for id’s and keywords
• Id-> letter(letter|digit)*
9 11
10 return (getToken(),
installID)
start letter other
letter or digit
*

• Transition diagram for unsigned numbers
• number  digits(.digits)? (e[+-]? digits )?
12 14
13
start digit
other
digit
*
15
21
20
19
18
17
16
digit
digit
digit
digit
digit
other
other
*
*
.
E
E + or -

Lex
• Lex and YACC helps you write programs that transforms
structured input.
• Lex generates C code for lexical analyzer whereas YACC generates
Code for Syntax analyzer.
• Lexical analyzer is build using a tool called LEX.
• Input is given to LEX and lexical analyzer is generated.
• Lex is a UNIX utility.
• It is a program generator designed for lexical processing of
character input streams.
• It uses the patterns that match strings in the input and converts
the strings to tokens.
• Lex helps you by taking a set of descriptions of possible tokens
and producing a C routine, which we call a lexical analyzer

Lex
compiler
C
compiler
a.out
Lex source program
Prog.l
lex.yy.c
lex.yy.c a.out
Input stream
Sequence of
token
Figure creating a lexical analyzer with Lex

Steps in writing LEX Program:
• 1st step: Using gedit create a file with extension .l
• For example: prg1.l
• 2nd Step: converting to c file for compilation : lex prg1.l
• 3rd Step compiling : gcc lex.yy.c –ll
• 4th Step output : ./a.out

Structure of LEX source program:
DEFINITION SECTION
%%
RULE SECTION
%%
CODE SECTION

• The declarations section includes declarations of variables,
manifest constant (identifiers declared to stand for a constant,
e.g, the name of a token)
• The translation rules or rule section each have the form
Pattern { Action}
• Each pattern is a regular expression, which may use the regular
definitions of the declaration section.
• The actions are fragments of code, typically written in C.
• The third code section holds whatever additional functions are
used in the actions. Alternatively, these functions can be
compiled separately and loaded with the lexical analyzer.

Lex terms
Lex variables
yyin OfthetypeFILE*.Thispointstothecurrentfilebeingparsedbythelexer.
yyout
OfthetypeFILE*.Thispointstothelocationwheretheoutputofthelexer
willbewritten.By default,bothyyinandyyoutpointtostandardinputand
output.
yytext The text of the matched pattern is stored in this variable (char*).
yyleng Gives the length of the matched pattern.
Lex Functions
yylex() The functionthatstartsthe analysis.ItisautomaticallygeneratedbyLex.
yywrap()
Thisfunction iscalled when endoffile(orinput) isencountered.Ifthis
function returns 1,the parsing stops. So, this can be used to parse multiple
files.

Write a LEX program to recognize valid arithmetic expression
Prog1.l
%{
int op=0, id=0, br=0;
%}
%%
[+] {op++;}
[-] {op++;}
[*] {op++;}
[/] {op++;}
[a-z A-Z0-9]+ {id++;}
[(] {br++;}
[)] {br--;}
%%

main()
{
printf(" enter an arithmetic expressionn");
yylex();
printf(" no.of identifiers= %dn",id);
printf(" no.of operators= %dn", op);
if(op>=id || br!=0 || id==1)
printf(" invalid expressionn");
else
printf(" valid expressionn");
}

[student@localhost ~]$ vi 1a.l
[student@localhost ~]$ lex 1a.l
[student@localhost ~]$ gcc lex.yy.c -ll
[student@localhost ~]$ ./a.out
Enter the arithmetic expression: a+b*(c/d)
no of identifiers are= 4
no of operators are =3
valid expression
[student@localhost ~]$ ./a.out
Enter the arithmetic expression : ab+(
no of identifiers are=2
no of operators are=3
invalid expression
[student@localhost ~]$

Finite Automata
• These are essentially graphs, like transition diagrams, with a few
differences:
1. Finite automata are recognizers; they simply say "yes" or "no"
about each possible input string.
2. Finite automata come in two flavours:
(a) Nondeterministic finite automata (NFA) have no restrictions on
the labels of their edges. A symbol can label several edges out of
the same state, and , the empty string, is a possible label.
(b) Deterministic finite automata (DFA) have, for each state, and for
each symbol of its input alphabet exactly one edge with that
symbol leaving that state.
Both deterministic and nondeterministic finite automata are capable
of rec organizing the same languages

Nondeterministic Finite Automata
(NFA)
• A nondeterministic finite automaton (NFA) consists of:
1. A finite set of states S.
2. A set of input symbols , the input alphabet. We assume that
, which stands for the empty string, is never a member of .
3. A transition function that gives, for each state, and for each
symbol in U{ } a set of next states.
4. A state so from S that is distinguished as the start state (or
initial state).
5. A set of states F, a subset of S, that is distinguished as the
accepting states (or final states).
We can represent either an NFA or DFA by a transition graph,
where the nodes are states and the labelled edges represent the
transition function.

NFA
• EX: the transition graph for an NFA accepting the language of
regular expression (a|b)*abb.
0 3
1
start
a
b
2
b
a
b
State a b
0 {0,1} {0}
1 {2}
2 {3]
3
Transition table for NFA for above diagram
Transtion diagarm for
(a|b)*abb.

NFA
• NFA accepting L(aa*|bb*)
0
2
1
3 4
start
a
b


a
b
State a b 
0
1
2
3
4

Deterministic Finite Automata
(DFA)
• A DFA is a special case of NFA where.
1. There are no moves on input
2. For each state ‘s’ and input symbol ‘a’, there is exactly one
edge out of ‘s’ labelled ‘a’.
(Hint to remember: both incoming and outgoing edges names
should be same form one state)
We are using a transition table to represent a DFA, then each entry
is a single state. We may therefore represent this state without the
curly braces that we use to form set.

DFA
• EX: the transition graph for an DFA accepting the language of
regular expression (a|b)*abb.
0 3
1
start
b
b
2
b
a b
State a b
0 1 0
1 1 2
2 1 3
3 1 0
Transition table for DFA for above diagram
Transtion diagarm for
(a|b)*abb.
a
a
a

SS & CD Module 3

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to SS & CD Module 3

Similar to SS & CD Module 3 (20)

Recently uploaded

Recently uploaded (20)

SS & CD Module 3