Lexical Analyzer / Scanner
P. Kuppusamy - Lexical Analyzer
Outline
• Role of lexical analyzer
• Specification of tokens
• Recognition of tokens
• Lexical analyzer generator
• Finite automata
• Design of lexical analyzer generator
Lexical analyzer terminologies
Lexeme
Lexemes are the words derived from the character input stream.
A lexeme is a particular instance of a token.
Ex. cout << 3+2-3;
Processing by scanner: {cout} | space | {<<} | space | {3} {+} {2} {-} {3} {;}
Lexemes: cout, <<, 3, +, 2, -, 3, ;
Token
Group of characters having a collective meaning.
Tokens are lexemes mapped into a token-name and an attribute-value
Syntax: <token-name, attribute-value> // attribute-value is optional
Tokens: <id1, cout > < operator1, << > < number1, 3 > < operator, + >
< number2, 2 > < operator2, - > < number3, 3 > < punctuator, ; >
Pattern:
Rule that describes how a token can be formed.
E.g.
identifier: ([a-z] | [A-Z]) ([a-z]|[A-Z]|[0-9])*
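The identifier pattern above can be checked directly in code. A minimal sketch in C (the function name is illustrative, not from the slides):

```c
#include <ctype.h>

/* Checks whether a string matches the identifier pattern
   ([a-z] | [A-Z]) ([a-z] | [A-Z] | [0-9])* :
   a letter followed by any number of letters or digits. */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0]))   /* first symbol must be a letter */
        return 0;
    for (int i = 1; s[i] != '\0'; i++)   /* the rest: letters or digits */
        if (!isalnum((unsigned char)s[i]))
            return 0;
    return 1;
}
```

Note that this pattern, unlike the C identifier definition given later, does not admit underscores.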
Lexical analyzer
 The scanner analyses the source program by reading it character by character.
 It then groups the characters into individual words and symbols (tokens).
 It removes comment lines and whitespace (blank space, tab, newline, etc.).
 Identifiers and values are stored in the symbol table.
 Types of tokens: constant, identifier, keyword, operator and symbol.
• E.g.
const pi = 3.14159;
Token 1: (const, -)
Token 2: (identifier, ‘pi’)
Token 3: (=, -)
Token 4: (realnumber, 3.14159)
Token 5: (; , -)
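The grouping-and-classification step above can be sketched in C. This is a hedged sketch: the function name and token-type strings are ours, chosen to match the example, not an API from the slides:

```c
#include <ctype.h>
#include <string.h>

/* Classify a single lexeme into one of the token types listed above
   (constant/keyword/identifier/operator/symbol). */
const char *classify(const char *lex) {
    static const char *keywords[] = { "const", "if", "then", "else" };
    for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
        if (strcmp(lex, keywords[i]) == 0)
            return "keyword";                /* e.g. const          */
    if (isalpha((unsigned char)lex[0]))
        return "identifier";                 /* e.g. pi             */
    if (isdigit((unsigned char)lex[0]))
        return strchr(lex, '.') ? "realnumber" : "number"; /* 3.14159 */
    if (strcmp(lex, "=") == 0)
        return "operator";
    return "symbol";                         /* e.g. ;              */
}
```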
Role of Lexical analyzer
• Source program → Lexical Analyzer → token → Parser → to semantic analysis
• The parser drives the lexical analyzer by calling getNextToken().
• Both the lexical analyzer and the parser interact with the symbol table.
• If the LA identifies a lexeme constituting an identifier, the lexeme needs
to be stored in the symbol table.
• Sometimes the type of an identifier may be read from the symbol table.
Why separate lexical analysis and parsing?
• The lexical analyzer need not be an individual phase.
• But having a separate phase simplifies the design and
improves the compiler's efficiency and portability.
Example of Tokens

Token      | Informal description                  | Sample lexemes
if         | characters i, f                       | if
else       | characters e, l, s, e                 | else
comparison | < or > or <= or >= or == or !=        | <=, !=
id         | letter followed by letters and digits | pi, score, D2
number     | any numeric constant                  | 3.14159, 0, 6.02e23
literal    | anything but “ surrounded by “        | “core dumped”, “(“, “{“, “,”, “}”, “[“, “;”, etc.
Tokens, their patterns, and attribute values
ws – whitespace (\t, \n, blank space, etc.)
relop – relational operator
Example: Find the tokens & lexemes
printf(“total = %d\n”, score);

Token   | Informal description                  | Lexemes
id      | letter followed by letters and digits | printf, score
literal | anything but “ surrounded by “        | “total = %d\n”

2. Find the tokens & lexemes for E = M*C^2
Attributes for tokens
• E = M * C ^ 2
 <id, pointer to symbol table entry for E>
 <assign-op>
 <id, pointer to symbol table entry for M>
 <mult-op>
 <id, pointer to symbol table entry for C>
 <exp-op>
 <number, integer value 2>
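The token/attribute pairs above can be represented with a tag plus a union, so each kind of token carries the attribute it needs. A sketch, with illustrative names (the slides do not prescribe this layout):

```c
/* Token tags for E = M * C ^ 2 (values are illustrative). */
enum tag { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUMBER };

struct token {
    enum tag kind;
    union {
        int symtab_index;  /* for ID: which symbol-table entry        */
        int value;         /* for NUMBER: the integer value, e.g. 2   */
    } attr;
};

/* Build the last token of the sequence: <number, integer value 2>. */
struct token make_number(int v) {
    struct token t;
    t.kind = NUMBER;
    t.attr.value = v;
    return t;
}
```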
RE Terminology:
 alphabet : a finite set of symbols.
E.g. ∑ ={a, b, c}
 A string is a finite sequence of symbols over an alphabet ∑
(sometimes a string is also called a sentence or a word).
 A language is a set of strings over an alphabet ∑.
Tokens are effectively specified by regular expressions.
Regular expressions
• Ɛ is a regular expression, L(Ɛ) = {Ɛ}
• If a is a symbol in ∑ then a is a regular expression, L(a) = {a}
• Operations on languages (a set):
• Union
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• Concatenation
(r) . (s) is a regular expression denoting the language L(r) . L(s)
• Kleene closure
(r)* is a regular expression denoting (L(r))*
• Positive closure
(r)+ is a regular expression denoting (L(r))+
where L* = L⁰ ∪ L¹ ∪ L² ∪ …  (the union of Lⁱ for i ≥ 0)
and L⁺ = L¹ ∪ L² ∪ …  (the union of Lⁱ for i ≥ 1)
Regular definition (Pattern)
• Regular expressions are used to represent the patterns for the tokens.
• The regular definition for C identifiers (letters, digits, and underscores):
 letter → A | B | C | … | Z | a | b | … | z | _
 digit → 0 | 1 | 2 | … | 9
 id → letter ( letter | digit )*
• The regular definition for unsigned numbers (integer or floating point)
such as 5280, 0.01234, 6.336E4, or 1.89E-4:
 digit → 0 | 1 | 2 | … | 9
 digits → digit digit*
 optionalFraction → . digits | ε
 optionalExponent → ( E ( + | - | ε ) digits ) | ε
 number → digits optionalFraction optionalExponent
• More examples: integer constant, string constants, reserved words, operator,
real constant.
RECOGNITION OF TOKENS
• Tokens are recognized by state transition diagrams.
• Recognition of identifiers
• Recognition of Whitespace (delimiters)
• Recognition of Relational operator
• Recognition of Keywords
• Recognition of Numbers (integer / float)
RECOGNITION OF TOKENS
Given the grammar of a branching statement,
the patterns for whitespace (ws), keywords & operators are:
 ws → (blankspace | tab | newline)+
 if → if
 then → then
 else → else
 relop → < | > | <= | >= | = | <>
Transition diagrams
• Transition diagram for relop
* indicates input retraction (pull back): the state was reached on a lookahead
character (e.g. not >, not =) that is not part of the lexeme and must be reread.
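The relop transition diagram can be coded directly as nested tests on the current and lookahead characters. A sketch, assuming the diagram shown in the slide (token codes and names are illustrative):

```c
#include <stddef.h>

/* Token codes for relop (illustrative values). */
enum relop { LT = 1, LE, NE, GT, GE, EQ, RELOP_ERR = -1 };

size_t relop_len;   /* how many input characters the lexeme consumed */

/* At the states marked * the extra character examined is "retracted":
   it is not counted in relop_len, so the scanner can reread it. */
enum relop scan_relop(const char *s) {
    if (s[0] == '<') {
        if (s[1] == '=') { relop_len = 2; return LE; }
        if (s[1] == '>') { relop_len = 2; return NE; }
        relop_len = 1; return LT;   /* retraction: s[1] pushed back */
    }
    if (s[0] == '=') { relop_len = 1; return EQ; }
    if (s[0] == '>') {
        if (s[1] == '=') { relop_len = 2; return GE; }
        relop_len = 1; return GT;   /* retraction */
    }
    relop_len = 0; return RELOP_ERR;
}
```

Note the Pascal-style operators from the pattern above: = for equality and <> for inequality.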
Transition diagrams (cont.)
• Transition diagram for reserved words and identifiers
• Transition diagram for whitespace
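The reserved-words-and-identifiers diagram is usually implemented as one routine that gathers letter (letter | digit)* and then consults a keyword table; any word not in the table is an identifier. A sketch with illustrative names:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static const char *keywords[] = { "if", "then", "else" };

char scanned_word[32];   /* the lexeme most recently gathered */

/* Gather letter (letter | digit)* into scanned_word; return its length.
   The character that ends the run is retracted (examined, not consumed). */
size_t scan_word(const char *s) {
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) { scanned_word[0] = '\0'; return 0; }
    while (isalnum((unsigned char)s[n]) && n + 1 < sizeof scanned_word) {
        scanned_word[n] = s[n];
        n++;
    }
    scanned_word[n] = '\0';
    return n;
}

/* Keyword-table lookup: reserved word or ordinary identifier? */
const char *word_kind(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return "keyword";
    return "identifier";
}
```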
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
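The unsigned-number diagram walks through three stages — digits, optional fraction, optional exponent — retracting the first character that does not fit. A sketch (the function name is ours):

```c
#include <ctype.h>
#include <stddef.h>

/* Accept digits ( . digits )? ( E (+|-)? digits )? at the start of s.
   Returns the number of characters in the lexeme (0 if none); the first
   rejected character is simply left unconsumed (retraction). */
size_t scan_number(const char *s) {
    size_t i = 0;
    if (!isdigit((unsigned char)s[i])) return 0;
    while (isdigit((unsigned char)s[i])) i++;            /* digits           */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) /* optionalFraction */
        for (i++; isdigit((unsigned char)s[i]); i++) ;
    if (s[i] == 'E') {                                   /* optionalExponent */
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-') j++;
        if (isdigit((unsigned char)s[j])) {
            for (j++; isdigit((unsigned char)s[j]); j++) ;
            i = j;
        }
    }
    return i;
}
```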
Major Data Structures in a Compiler
Data Structure for Communication among Phases
• TOKENS
• A scanner collects characters into a token, represented as a value of an
enumerated data type for tokens
• It also preserves the string of characters or other derived
information, such as the name of an identifier or the value of a number token
• A single global variable or an array of tokens
• SYNTAX TREE
• A standard pointer-based structure generated by the parser
• Each node represents information collected by the parser, and
may be dynamically allocated or stored in the symbol table
• The nodes require different attributes depending on the kind of
language structure, which may be represented as a variant record.
• SYMBOL TABLE
• Keeps information associated with identifiers: functions, variables,
constants, and data types
• Interacts with almost every phase of the compiler.
• Access operations need to be constant-time.
• One or several hash tables are often used.
• LITERAL TABLE
• Stores constants and strings, reducing the size of the program
• Quick insertion and lookup are essential
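The constant-time access described above is typically achieved with a chained hash table. A minimal sketch; all names are illustrative, not a standard compiler API:

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211

struct entry {
    char *name;
    char *type;              /* example attribute, e.g. "float" */
    struct entry *next;      /* chain for collisions            */
};

static struct entry *buckets[NBUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Find an identifier's entry, or NULL if it is not in the table. */
struct entry *lookup(const char *name) {
    for (struct entry *e = buckets[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Install an identifier; re-inserting returns the existing entry. */
struct entry *insert(const char *name, const char *type) {
    struct entry *e = lookup(name);
    if (e != NULL) return e;
    e = malloc(sizeof *e);
    e->name = malloc(strlen(name) + 1); strcpy(e->name, name);
    e->type = malloc(strlen(type) + 1); strcpy(e->type, type);
    e->next = buckets[hash(name)];
    buckets[hash(name)] = e;
    return e;
}
```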
• INTERMEDIATE CODE
• Kept as an array of text strings, a temporary text file, or a linked list
of structures, depending on the kind of intermediate code (e.g. three-address
code and p-code (Pascal code))
• Should be easy to reorganize
• TEMPORARY FILES
• Hold the products of intermediate steps during compiling
• The problems of memory constraints and back-patching are addressed
during code generation
Bootstrapping in Compiler Design
• Bootstrapping is the process of using a simple language to build a translator
for a more complicated language, which in turn can handle an even more
complicated language, and so on.
• A full bootstrap is necessary when building a new compiler from scratch.
• Writing a compiler for any high-level language is a complicated process,
and writing one from scratch takes a long time.
• Hence a simple language is used to generate the target code in some
stages (phases).
Tombstone (T) Diagram
• T diagrams are a set of “puzzle pieces” that represent compilers and other
related language-processing programs.
• A box labelled P over L: program P implemented in language L.
• A box labelled M: machine M implemented in hardware.
• A box with a language over L: an interpreter for that language, written in L.
• A T-shaped piece labelled S → T over L: a translator implemented in L, where
– S is the language the compiler accepts as input,
– T is the language the compiler outputs,
– L is the language the compiler is written in.
Tombstone diagram: Combination rules
• A program P implemented in M, placed on machine M: ok.
• A program P implemented in L, placed directly on machine M: wrong.
• A program P implemented in L, fed into an S → T translator: wrong
(the program's language and the translator's source language differ).
• A program P implemented in S, fed into an S → T translator implemented in M
and running on machine M, yields P implemented in T: ok.
Both should be the same: the program's language and the translator's
source language S.
Example: Bootstrapping in Compiler Design
• Write a cross compiler for a new language X. Let the implementation
language of this compiler be Y and the target code being generated be in
language Z. That is, we create XYZ (Compiler1: X → Z, written in Y).
• Now if an existing compiler for Y runs on machine M and generates code for
M, then it is denoted YMM (Y → M, written in M, running on machine M).
Example: Bootstrapping in Compiler Design
• Now if we run XYZ using YMM then we get a compiler XMZ (Compiler2). That
is, a compiler for source language X that generates target code in
language Z and runs on machine M: the cross compiler X → Z implemented in M.
• This cross compiler can be used for bootstrapping on machine M, but we do
not want to rely on it permanently.
Bootstrap to improve efficiency
The efficiency of programs and compilers:
Efficiency of programs:
- memory usage
- runtime
Efficiency of compilers:
- Efficiency of the compiler itself
- Efficiency of the emitted code
Lexical Analyzer Generator - Lex
• Lex is a language/tool for specifying lexical analyzers.
• A Lex specification is written with regular expressions, which represent
the patterns for the tokens.
Creating a Lexical Analyzer with Lex
• Lex source program (lex.l) → Lex compiler → lex.yy.c
• lex.yy.c → C compiler → a.out (object code)
• Input stream → a.out → sequence of tokens
Structure of Lex programs
A Lex program contains 3 sections:
1. Declarations 2. Translation rules 3. Auxiliary functions

declarations
%%   ← denotes start of translation rules
translation rules (each of the form: Pattern {Action})
%%   ← denotes end of translation rules
auxiliary functions
Lex : Example
1. Declaration section
i) Declare identifiers, constants:
%{
Declare identifiers, constants, definitions of manifest constants (LT, LE, EQ, NE, GT,
GE, IF, THEN, ELSE, ID, NUMBER, RELOP) here
%}
E.g.:
%{
int a,b;
float s;
%}
Declaration section (cont’d)
ii) Declare regular definitions (without % symbols)
/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
? -- denotes that the preceding item occurs 0 or 1 time
2. Translation rules section
• It starts with %% and ends with %%
• Syntax:
Form of Pattern followed by action
%%
Pattern1 {Action1}
Pattern2 {Action2}
………
Pattern n {Action n}
%%
Pattern is Regular Expression & Action is C statements
E.g.
%%
{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID); }
{number} {yylval = (int) installNum(); return(NUMBER); }
…
%%
int installID()
{/* function to install the lexeme,
whose first character is pointed
to by yytext, and whose length
is yyleng, into the symbol table
and return a pointer thereto */
}
int installNum()
{ /* similar to installID, but puts
numerical constants into a
separate table */
}
3. Auxiliary function section
All required helper functions are defined in this section.
This program uses two such functions, installID() and
installNum(), defined below.
(Alternatively, such functions may be compiled separately and loaded with the lexical analyzer.)
E.g.
Some user defined functions are
add()
{
// statements
}
mul()
{
//statements
}
int installID()
{/* function to install the lexeme,
whose first character is pointed
to by yytext, and whose length is
yyleng, into the symbol table and
return a pointer thereto */
}
int installNum()
{ /* similar to installID, but puts
numerical constants into a
separate table */
}
Example - Lex program to recognize tokens:
Identify keywords, relational operators, numbers
%{
/* no variable declarations */
%}
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
if {printf("%s is a keyword", yytext); } /* keyword rules must precede {id}, or {id} would match them first */
else {printf("%s is a keyword", yytext); }
{id} {printf("%s is an identifier", yytext); } /* yytext holds the matched lexeme; { } around the action is optional */
"<" {printf("%s is the less-than operator", yytext); }
Lex program to recognize tokens (cont’d)
">" {printf("%s is the greater-than operator", yytext); }
"<=" {printf("%s is the less-than-or-equal operator", yytext); }
{number} {printf("%s is a number", yytext); }
%%
Reference
• A.V. Aho, M.S. Lam, R. Sethi, J.D. Ullman, Compilers: Principles,
Techniques and Tools, Pearson, 2013.
• J.R. Levine, T. Mason, D. Brown, lex & yacc, O'Reilly.