Lexical Analyzer / Scanner
P. Kuppusamy - Lexical Analyzer
Outline
• Role of lexical analyzer
• Specification of tokens
• Recognition of tokens
• Lexical analyzer generator
• Finite automata
• Design of lexical analyzer generator
Lexical analyzer terminologies
Lexeme
Lexemes are the words derived from the character input stream.
A lexeme is a particular instance of a token.
Ex. cout << 3+2-3;
Processing by scanner: {cout} | space | {<<} | space | {3} {+} {2} {-} {3} {;}
Lexemes: cout, <<, 3, +, 2, -, 3, ;
Token
Group of characters having a collective meaning.
Tokens are lexemes mapped into a token-name and an attribute-value
Syntax: <token-name, attribute-value> // attribute-value is optional
Tokens: <id1, cout > < operator1, << > < number1, 3 > < operator, + >
< number2, 2 > < operator2, - > < number3, 3 > < punctuator, ; >
Pattern:
Rule that describes how a token can be formed.
E.g.
identifier: ([a-z] | [A-Z]) ([a-z]|[A-Z]|[0-9])*
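The identifier pattern above can be checked directly in code. A minimal sketch in C (the function name is illustrative, not from the slides):

```c
#include <ctype.h>

/* Checks whether a string matches the identifier pattern
   ([a-z] | [A-Z]) ([a-z] | [A-Z] | [0-9])* :
   a letter followed by any number of letters or digits. */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0]))   /* first symbol must be a letter */
        return 0;
    for (int i = 1; s[i] != '\0'; i++)   /* the rest: letters or digits */
        if (!isalnum((unsigned char)s[i]))
            return 0;
    return 1;
}
```

Note that this pattern, unlike the C identifier definition given later, does not admit underscores.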
Lexical analyzer
 The scanner analyses the source program by reading it character by character.
 It then groups the characters into individual words and symbols (tokens).
 It removes comment lines and whitespace (blank space, tab, newline, etc.).
 Identifiers and values are stored in the symbol table.
 Types of tokens: constant, identifier, keyword, operator and symbol.
• E.g.
const pi = 3.14159;
Token 1: (const, -)
Token 2: (identifier, ‘pi’)
Token 3: (=, -)
Token 4: (realnumber, 3.14159)
Token 5: (; , -)
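The grouping-and-classification step above can be sketched in C. This is a hedged sketch: the function name and token-type strings are ours, chosen to match the example, not an API from the slides:

```c
#include <ctype.h>
#include <string.h>

/* Classify a single lexeme into one of the token types listed above
   (constant/keyword/identifier/operator/symbol). */
const char *classify(const char *lex) {
    static const char *keywords[] = { "const", "if", "then", "else" };
    for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
        if (strcmp(lex, keywords[i]) == 0)
            return "keyword";                /* e.g. const          */
    if (isalpha((unsigned char)lex[0]))
        return "identifier";                 /* e.g. pi             */
    if (isdigit((unsigned char)lex[0]))
        return strchr(lex, '.') ? "realnumber" : "number"; /* 3.14159 */
    if (strcmp(lex, "=") == 0)
        return "operator";
    return "symbol";                         /* e.g. ;              */
}
```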
Role of Lexical analyzer
• Source program → Lexical Analyzer → token → Parser → to semantic analysis
• The parser drives the lexical analyzer by calling getNextToken().
• Both the lexical analyzer and the parser interact with the symbol table.
• If the LA identifies a lexeme constituting an identifier, the lexeme needs
to be stored in the symbol table.
• Sometimes the type of an identifier may be read from the symbol table.
Why separate lexical analysis and parsing?
• The lexical analyzer need not be an individual phase.
• But having a separate phase simplifies the design and
improves the compiler's efficiency and portability.
Example of Tokens

Token      | Informal description                  | Sample lexemes
if         | characters i, f                       | if
else       | characters e, l, s, e                 | else
comparison | < or > or <= or >= or == or !=        | <=, !=
id         | letter followed by letters and digits | pi, score, D2
number     | any numeric constant                  | 3.14159, 0, 6.02e23
literal    | anything but “ surrounded by “        | “core dumped”, “(“, “{“, “,”, “}”, “[“, “;”, etc.
Tokens, their patterns, and attribute values
ws – whitespace (\t, \n, blank space, etc.)
relop – relational operator
Example: Find the tokens & lexemes
printf(“total = %d\n”, score);

Token   | Informal description                  | Lexemes
id      | letter followed by letters and digits | printf, score
literal | anything but “ surrounded by “        | “total = %d\n”

2. Find the tokens & lexemes for E = M*C^2
Attributes for tokens
• E = M * C ^ 2
 <id, pointer to symbol table entry for E>
 <assign-op>
 <id, pointer to symbol table entry for M>
 <mult-op>
 <id, pointer to symbol table entry for C>
 <exp-op>
 <number, integer value 2>
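The token/attribute pairs above can be represented with a tag plus a union, so each kind of token carries the attribute it needs. A sketch, with illustrative names (the slides do not prescribe this layout):

```c
/* Token tags for E = M * C ^ 2 (values are illustrative). */
enum tag { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUMBER };

struct token {
    enum tag kind;
    union {
        int symtab_index;  /* for ID: which symbol-table entry        */
        int value;         /* for NUMBER: the integer value, e.g. 2   */
    } attr;
};

/* Build the last token of the sequence: <number, integer value 2>. */
struct token make_number(int v) {
    struct token t;
    t.kind = NUMBER;
    t.attr.value = v;
    return t;
}
```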
RE Terminology:
 alphabet : a finite set of symbols.
E.g. ∑ ={a, b, c}
 A string is a finite sequence of symbols over an alphabet ∑
(sometimes a string is also called a sentence or a word).
 A language is a set of strings over an alphabet ∑.
Tokens are effectively specified by regular expressions.
Regular expressions
• Ɛ is a regular expression, L(Ɛ) = {Ɛ}
• If a is a symbol in ∑ then a is a regular expression, L(a) = {a}
• Operations on languages (a set):
• Union
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• Concatenation
(r) . (s) is a regular expression denoting the language L(r) . L(s)
• Kleene closure
(r)* is a regular expression denoting (L(r))*
• Positive closure
(r)+ is a regular expression denoting (L(r))+
where L* = L⁰ ∪ L¹ ∪ L² ∪ …  (the union of Lⁱ for i ≥ 0)
and L⁺ = L¹ ∪ L² ∪ …  (the union of Lⁱ for i ≥ 1)
Regular definition (Pattern)
• Regular expressions are used to represent the patterns for the tokens.
• The regular definition for C identifiers (letters, digits, and underscores):
 letter → A | B | C | … | Z | a | b | … | z | _
 digit → 0 | 1 | 2 | … | 9
 id → letter ( letter | digit )*
• The regular definition for unsigned numbers (integer or floating point)
such as 5280, 0.01234, 6.336E4, or 1.89E-4:
 digit → 0 | 1 | 2 | … | 9
 digits → digit digit*
 optionalFraction → . digits | ε
 optionalExponent → ( E ( + | - | ε ) digits ) | ε
 number → digits optionalFraction optionalExponent
• More examples: integer constant, string constants, reserved words, operator,
real constant.
RECOGNITION OF TOKENS
• Tokens are recognized by state transition diagrams.
• Recognition of identifiers
• Recognition of Whitespace (delimiters)
• Recognition of Relational operator
• Recognition of Keywords
• Recognition of Numbers (integer / float)
RECOGNITION OF TOKENS
Given the grammar of a branching statement,
the patterns for whitespace (ws), keywords & operators are:
 ws → (blankspace | tab | newline)+
 if → if
 then → then
 else → else
 relop → < | > | <= | >= | = | <>
Transition diagrams
• Transition diagram for relop
* indicates input retraction (pull back): the state was reached on a lookahead
character (e.g. not >, not =) that is not part of the lexeme and must be reread.
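The relop transition diagram can be coded directly as nested tests on the current and lookahead characters. A sketch, assuming the diagram shown in the slide (token codes and names are illustrative):

```c
#include <stddef.h>

/* Token codes for relop (illustrative values). */
enum relop { LT = 1, LE, NE, GT, GE, EQ, RELOP_ERR = -1 };

size_t relop_len;   /* how many input characters the lexeme consumed */

/* At the states marked * the extra character examined is "retracted":
   it is not counted in relop_len, so the scanner can reread it. */
enum relop scan_relop(const char *s) {
    if (s[0] == '<') {
        if (s[1] == '=') { relop_len = 2; return LE; }
        if (s[1] == '>') { relop_len = 2; return NE; }
        relop_len = 1; return LT;   /* retraction: s[1] pushed back */
    }
    if (s[0] == '=') { relop_len = 1; return EQ; }
    if (s[0] == '>') {
        if (s[1] == '=') { relop_len = 2; return GE; }
        relop_len = 1; return GT;   /* retraction */
    }
    relop_len = 0; return RELOP_ERR;
}
```

Note the Pascal-style operators from the pattern above: = for equality and <> for inequality.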
Transition diagrams (cont.)
• Transition diagram for reserved words and identifiers
• Transition diagram for whitespace
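The reserved-words-and-identifiers diagram is usually implemented as one routine that gathers letter (letter | digit)* and then consults a keyword table; any word not in the table is an identifier. A sketch with illustrative names:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static const char *keywords[] = { "if", "then", "else" };

char scanned_word[32];   /* the lexeme most recently gathered */

/* Gather letter (letter | digit)* into scanned_word; return its length.
   The character that ends the run is retracted (examined, not consumed). */
size_t scan_word(const char *s) {
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) { scanned_word[0] = '\0'; return 0; }
    while (isalnum((unsigned char)s[n]) && n + 1 < sizeof scanned_word) {
        scanned_word[n] = s[n];
        n++;
    }
    scanned_word[n] = '\0';
    return n;
}

/* Keyword-table lookup: reserved word or ordinary identifier? */
const char *word_kind(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return "keyword";
    return "identifier";
}
```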
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
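The unsigned-number diagram walks through three stages — digits, optional fraction, optional exponent — retracting the first character that does not fit. A sketch (the function name is ours):

```c
#include <ctype.h>
#include <stddef.h>

/* Accept digits ( . digits )? ( E (+|-)? digits )? at the start of s.
   Returns the number of characters in the lexeme (0 if none); the first
   rejected character is simply left unconsumed (retraction). */
size_t scan_number(const char *s) {
    size_t i = 0;
    if (!isdigit((unsigned char)s[i])) return 0;
    while (isdigit((unsigned char)s[i])) i++;            /* digits           */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) /* optionalFraction */
        for (i++; isdigit((unsigned char)s[i]); i++) ;
    if (s[i] == 'E') {                                   /* optionalExponent */
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-') j++;
        if (isdigit((unsigned char)s[j])) {
            for (j++; isdigit((unsigned char)s[j]); j++) ;
            i = j;
        }
    }
    return i;
}
```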
Major Data Structures in a Compiler
Data Structure for Communication among Phases
• TOKENS
• A scanner collects characters into a token, represented as a value of an
enumerated data type for tokens
• It also preserves the string of characters or other derived
information, such as the name of an identifier or the value of a number token
• A single global variable or an array of tokens
• SYNTAX TREE
• A standard pointer-based structure generated by the parser
• Each node represents information collected by the parser, and
may be dynamically allocated or stored in the symbol table
• The nodes require different attributes depending on the kind of
language structure, which may be represented as a variant record.
• SYMBOL TABLE
• Keeps information associated with identifiers: functions, variables,
constants, and data types
• Interacts with almost every phase of the compiler.
• Access operations need to be constant-time.
• One or several hash tables are often used.
• LITERAL TABLE
• Stores constants and strings, reducing the size of the program
• Quick insertion and lookup are essential
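The constant-time access described above is typically achieved with a chained hash table. A minimal sketch; all names are illustrative, not a standard compiler API:

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211

struct entry {
    char *name;
    char *type;              /* example attribute, e.g. "float" */
    struct entry *next;      /* chain for collisions            */
};

static struct entry *buckets[NBUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Find an identifier's entry, or NULL if it is not in the table. */
struct entry *lookup(const char *name) {
    for (struct entry *e = buckets[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Install an identifier; re-inserting returns the existing entry. */
struct entry *insert(const char *name, const char *type) {
    struct entry *e = lookup(name);
    if (e != NULL) return e;
    e = malloc(sizeof *e);
    e->name = malloc(strlen(name) + 1); strcpy(e->name, name);
    e->type = malloc(strlen(type) + 1); strcpy(e->type, type);
    e->next = buckets[hash(name)];
    buckets[hash(name)] = e;
    return e;
}
```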
• INTERMEDIATE CODE
• Kept as an array of text strings, a temporary text file, or a linked list
of structures, depending on the kind of intermediate code (e.g. three-address
code and p-code (Pascal code))
• Should be easy to reorganize
• TEMPORARY FILES
• Hold the products of intermediate steps during compiling
• The problems of memory constraints and back-patching are addressed
during code generation
Bootstrapping in Compiler Design
• Bootstrapping is the process of using a simple language to build a translator
for a more complicated language, which in turn can handle an even more
complicated language, and so on.
• A full bootstrap is necessary when building a new compiler from scratch.
• Writing a compiler for any high-level language is a complicated process,
and writing one from scratch takes a long time.
• Hence a simple language is used to generate the target code in some
stages (phases).
Tombstone (T) Diagram
• T diagrams are a set of “puzzle pieces” that represent compilers and other
related language-processing programs.
• A box labelled P over L: program P implemented in language L.
• A box labelled M: machine M implemented in hardware.
• A box with a language over L: an interpreter for that language, written in L.
• A T-shaped piece labelled S → T over L: a translator implemented in L, where
– S is the language the compiler accepts as input,
– T is the language the compiler outputs,
– L is the language the compiler is written in.
Tombstone diagram: Combination rules
• A program P implemented in M, placed on machine M: ok.
• A program P implemented in L, placed directly on machine M: wrong.
• A program P implemented in L, fed into an S → T translator: wrong
(the program's language and the translator's source language differ).
• A program P implemented in S, fed into an S → T translator implemented in M
and running on machine M, yields P implemented in T: ok.
Both should be the same: the program's language and the translator's
source language S.
Example: Bootstrapping in Compiler Design
• Write a cross compiler for a new language X. Let the implementation
language of this compiler be Y and the target code being generated be in
language Z. That is, we create XYZ (Compiler1: X → Z, written in Y).
• Now if an existing compiler for Y runs on machine M and generates code for
M, then it is denoted YMM (Y → M, written in M, running on machine M).
Example: Bootstrapping in Compiler Design
• Now if we run XYZ using YMM then we get a compiler XMZ (Compiler2). That
is, a compiler for source language X that generates target code in
language Z and runs on machine M: the cross compiler X → Z implemented in M.
• This cross compiler can be used for bootstrapping on machine M, but we do
not want to rely on it permanently.
Bootstrap to improve efficiency
The efficiency of programs and compilers:
Efficiency of programs:
- memory usage
- runtime
Efficiency of compilers:
- Efficiency of the compiler itself
- Efficiency of the emitted code
Lexical Analyzer Generator - Lex
• Lex is a language/tool for specifying lexical analyzers.
• A Lex specification is written with regular expressions, which represent
the patterns for the tokens.
Creating a Lexical Analyzer with Lex
• Lex source program (lex.l) → Lex compiler → lex.yy.c
• lex.yy.c → C compiler → a.out (object code)
• Input stream → a.out → sequence of tokens
Structure of Lex programs
A Lex program contains 3 sections:
1. Declarations 2. Translation rules 3. Auxiliary functions

declarations
%%   ← denotes start of translation rules
translation rules (each of the form: Pattern {Action})
%%   ← denotes end of translation rules
auxiliary functions
Lex : Example
1. Declaration section
i) Declare identifiers, constants:
%{
Declare identifiers, constants, definitions of manifest constants (LT, LE, EQ, NE, GT,
GE, IF, THEN, ELSE, ID, NUMBER, RELOP) here
%}
E.g.:
%{
int a,b;
float s;
%}
Declaration section (cont’d)
ii) Declare regular definitions (without % symbols)
/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
? -- denotes that the preceding item occurs 0 or 1 time
2. Translation rules section
• It starts with %% and ends with %%
• Syntax:
Form of Pattern followed by action
%%
Pattern1 {Action1}
Pattern2 {Action2}
………
Pattern n {Action n}
%%
Pattern is Regular Expression & Action is C statements
E.g.
%%
{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID); }
{number} {yylval = (int) installNum(); return(NUMBER); }
…
%%
int installID()
{/* function to install the lexeme,
whose first character is pointed
to by yytext, and whose length
is yyleng, into the symbol table
and return a pointer thereto */
}
int installNum()
{ /* similar to installID, but puts
numerical constants into a
separate table */
}
3. Auxiliary function section
All required helper functions are defined in this section.
This program uses two such functions, installID() and
installNum(), defined below.
(Alternatively, such functions may be compiled separately and loaded with the lexical analyzer.)
E.g.
Some user defined functions are
add()
{
// statements
}
mul()
{
//statements
}
int installID()
{/* function to install the lexeme,
whose first character is pointed
to by yytext, and whose length is
yyleng, into the symbol table and
return a pointer thereto */
}
int installNum()
{ /* similar to installID, but puts
numerical constants into a
separate table */
}
Example - Lex program to recognize tokens:
Identify keywords, relational operators, numbers
%{
/* no variable declarations */
%}
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
if {printf("%s is a keyword", yytext); } /* keyword rules must precede {id}, or {id} would match them first */
else {printf("%s is a keyword", yytext); }
{id} {printf("%s is an identifier", yytext); } /* yytext holds the matched lexeme; { } around the action is optional */
"<" {printf("%s is the less-than operator", yytext); }
Lex program to recognize tokens (cont’d)
">" {printf("%s is the greater-than operator", yytext); }
"<=" {printf("%s is the less-than-or-equal operator", yytext); }
{number} {printf("%s is a number", yytext); }
%%
Reference
• A.V. Aho, M.S. Lam, R. Sethi, J.D. Ullman, Compilers: Principles,
Techniques and Tools, Pearson, 2013.
• J.R. Levine, T. Mason, D. Brown, lex & yacc, O'Reilly.