LEX Tool
by
Dr. R. A. Deshmukh
LEX TOOL
 A tool widely used to specify lexical
analyzers for a variety of languages or the
breaking up of an input stream into
meaningful units, or tokens.
 For example, consider breaking a text file
up into individual words.
We refer to the tool as Lex compiler and to
its input specification as the Lex language.
LEX TOOL Lexical Analyzer Generator
Lexical
Compiler
Lex
Source
program
lex.l
lex.yy.c
C
compiler
lex.yy.
c
a.out
a.out
Input
stream
Sequence
of tokens
Structure of Lex programs
declarations
%%
translation rules
%%
auxiliary
functions
Pattern
{Action}
Internal Structure of Lex
Lex
Regular
expressions
NFA DFA
Minimal
DFA
The final states of the DFA are
associated with actions
Declaration
The declarations section
includes declarations of
variables,manifest constants(A
manifest constant is an
identifier that is declared to
represent a constant e.g.
# define PIE 3.14), the files to be
included. (written in %{ %} )
 and definitions of regular
Translation rules
The translation rules of a Lex program are
statements of the form :
p1 {action 1}
p2 {action 2}
p3 {action 3}
… …
where each p is a regular expression and
each action is a program fragment
describing what action the lexical
analyzer should take when a string
pattern matches with p matches a lexeme.
In Lex the actions are written in C.
Auxiliary procedures
The third section holds whatever
auxiliary procedures are needed by the
actions.
Alternatively these procedures can be
compiled separately and loaded with
the lexical analyzer. The auxiliary
procedures are written in C language.
How does this Lexical analyzer work?
The lexical analyzer created by Lex behaves
in concert with a parser in the following
manner.
When activated by the parser, the lexical
analyzer begins reading its remaining
input , one character at a time, until it has
found the longest prefix of the input that is
matched by one of the regular expressions p
contd…
How does this Lexical analyzer work?
then it executes the corresponding action. Typically
the action will return control to the parser. However, if
it does not, then the lexical analyzer proceeds to find
more lexemes, until an action causes control to
return to the parser. The repeated search for lexemes
until an explicit return allows the lexical analyzer to
process white space and comments conveniently.
 The lexical analyzer returns a single quantity,
the token, to the parser. To pass an attribute
value with information about the lexeme, we
can set the global variable yylval.
contd…
 e.g. Suppose the lexical analyzer
returns a single token for all the
relational operators, in which case
the parser won’t be able to
distinguish between
”<=”, ”>=”, ”<”, ”>”, ”==” etc. We
can set yylval appropriately to
specify the nature of the operator.
The two variables yytext and yyleng
 Lex makes the lexeme available to the
routines appearing in the third
section through two variables yytext
and yyleng
1) yytext is a variable that is a pointer
to the first character of the lexeme.
2)yyleng is an integer telling how long
the lexeme is.
A lexeme may match more than one
patterns. How is this problem resolved?
 Take for example the lexeme if. It matches the
patterns for both keyword if and identifier. If the
pattern for keyword if precedes the pattern for
identifier in the declaration list of the lex program
the conflict is resolved in favour of the keyword.
In general this ambiguity-resolving strategy makes
it easy to reserve keywords by listing them ahead of
the pattern for identifiers.
The Lex’s strategy of selecting the longest
prefix matched by a pattern makes it easy to
resolve other conflicts like the one between “<” and
“<=”.
Regular Expression Basics
 . : matches any single character except n
 * : matches 0 or more instances of the preceding regular
expression
 + : matches 1 or more instances of the preceding regular
expression
 ? : matches 0 or 1 instance of the preceding regular
expression
 | : matches the preceding or following regular
expression
 [ ] : defines a character class
 () : groups enclosed regular expression into a new
regular expression
 “…”: matches everything within the “ “ literally
Regular Expression Basics
x|y : x or y
{i} : definition of i
x/y : x, only if followed by y (y not removed
from input)
x{m,n} : m to n occurrences of x
 x :x, but only at beginning of line
x$ : x, but only at end of line
“s” : exactly what is in the quotes
A regular expression finishes with a space,
tab or
newline.
Pattern Matching Examples
Expression Matches
abc* ab abc abcc abccc ...
abc+ abc abcc abccc ...
a(bc)+ abc abcbc abcbcbc ...
a(bc)? a abc
[abc] one of: a, b, c
[a-z] any letter, a-z
[a-z] one of: a, -, z
[-az] one of: -, a, z
[A-Za-z0-9]+ one or more alphanumeric
characters
[ tn]+ whitespace
[^ab] anything except: a, b
[a^b] one of: a, ^, b
Lex Predefined Variables
Name Function
int yylex(void) call to invoke lexer, returns
token
char *yytext pointer to matched string
yyleng length of matched string
yylval value associated with token
int yywrap(void) wrapup, return 1 if done, 0 if
not done
FILE *yyout output file
FILE *yyin input file
ECHO write matched string
Usage
 To run Lex on a source file, type
lex scanner.l
 It produces a file named lex.yy.c which is a C
program for the lexical analyzer.
 To compile lex.yy.c, type
gcc lex.yy.c
 To run the lexical analyzer program, type
1) ./a.out <inputfile
2) ./a.out
3) ./a.out <inputfile >outputfile
Program to count no. of words
%{
int counter = 0;
%}
letter [a-zA-Z]
%%
{letter}+ {printf(“a wordn”); counter++;}
%%
main() {
yylex();
printf(“There are total %d wordsn”, counter);
}
int yywrap(void) {
return 1;
}
[root@localhost ~]# lex a1.l
[root@localhost ~]# gcc
lex.yy.c
[root@localhost ~]# ./a.out
pune
a word
bombay
a word
Program to count no. of identifiers
digit [0-9]
letter [a-zA-Z]
%{
# include<stdio.h>
%}
%%
{letter}({letter}|{digit})* {printf(“id: %sn”, yytext);}
n{printf(“new linen”);}
%%
main() {
yylex();
}
int yywrap(void) {
return 1;
}
[root@localhost ~]# lex a3.l
[root@localhost ~]# gcc -o a3
lex.yy.c
[root@localhost ~]# ./a3
abc ghg jkhjk
id: abc
id: ghg
id: jkhjk
new line
[root@localhost ~]# ./a3 <a.txt
id: Today
id: is
id: Friday
new line
id: Environment
id: is
id: good
new line
[root@localhost ~]# cat
a.txt
Today is Friday
Environment is good
Program to count no. of characters,
words and lines in a program
%{
int nchar=0, nword=0, nline=0;
%}
%%
n { nline++; nchar++; }
[^ tn]+ { nword++, nchar += yyleng; }
. { nchar++; }
%%
int main(void) {
yylex();
printf("%dt%dt%dn", nchar, nword, nline);
return 0;
}
[root@localhost ~]# lex a2.l
[root@localhost ~]# gcc -o a2
lex.yy.c
[root@localhost ~]# ./a2
hello good morning
how r u
28 6 2
[root@localhost ~]#
Program to count no. of lines, tabs, characters,spaces
and words in a program
%{
# include<stdio.h>
int lc=0,tc=0,sc=0,wc=0,cc=0;
%}
%%
[^ tn]+ {cc+=yyleng; wc++;}
[ ] {cc++;sc++;}
[t] {cc++; tc++;}
n {cc++; lc++;}
%%
main()
{
yyin = fopen("a1.txt","r");
yyout=fopen("a2.txt","w");
yylex();
fclose(yyin);
fclose(yyout);
}
int yywrap()
{
fprintf(yyout,"no. of lines =%dn",lc);
fprintf(yyout,"no. of tabs=%dn",tc);
fprintf(yyout,"no. of characters = %d
n",cc);
fprintf(yyout,"no. of spaces = %d
n",sc);
fprintf(yyout,"no. of words=%dn",wc);
return 1;
}
[root@localhost ~]# lex l1.l
[root@localhost ~]# gcc -o l1 lex.yy.c
[root@localhost ~]# ./l1
[root@localhost ~]# cat a1.txt
Today is Friday
Environment is good
[root@localhost ~]# cat a2.txt
no. of lines =2
no. of tabs=4
no. of characters = 39
no. of spaces = 3
no. of words=6
[root@localhost ~]#
%{
# include<stdio.h>
int pcount=0,ncount=0;
%}
pos_num [+]?[0-9]+(.[0-9]+)?([eE][-+]?[0-9]+)?
neg_num [-][0-9]+(.[0-9]+)?([eE][-+]?[0-9]+)?
%%
{pos_num} {printf("positive number = %sn",yytext);pcount+
+;}
{neg_num} {printf("negative number = %sn",yytext);ncount+
+;}
.|n {}
%%
main()
{
yylex();
}
int yywrap()
{
printf("no. of positive numbers = %dn",pcount);
printf("no. of negative numbers = %dn",ncount);
Program to count positive and negative numbers in a program
[root@localhost ~]# lex num.l
[root@localhost ~]# gcc -o num lex.yy.c
[root@localhost ~]# ./num
23 -90
positive number = 23
negative number = -90
90.78E+4
positive number = 90.78E+4
no. of positive numbers = 2
no. of negative numbers = 1
[root@localhost ~]#
%{
# include<stdio.h>
# include<string.h>
struct symtab
{
char *name;
};
struct symtab sym[10],*k;
struct symtab *install_id(char
*s);
void disp();
%}
L [a-zA-Z]
D [0-9]
id {L}({L}|{D})*
num {D}+(.{D}+)?([eE][-+]?{D}+)?
bop [-+*/=]
uop "++"|"--"
relop "<"|"<="|">"|">="|"!="|"=="
lop "&&"|"||"
bitlop [&|!]
kew
“if"|"else"|"while"|"do"|"for"|"int"|"char"|"float"
pun [,;'"[]{})(]
comment "/*"(.|n)*"*/"|"//"(.)*
ws [ tn]+
st "(.)*"
%%
{ws} { }
{kew} {printf("keyword =%sn",yytext);}
{id} {k=install_id(yytext);printf("identifier =%sn",yytext);}
{num} {printf("constant =%sn",yytext);}
{bop} {printf("binary op =%sn",yytext);}
{uop} {printf("unary op =%sn",yytext);}
{relop} {printf("relational op=%sn",yytext);}
{lop} {printf("logical op =%sn",yytext);}
{pun} {printf("punct =%sn",yytext);}
{bitlop} {printf("bitwise logical op=%sn",yytext);}
{comment} {printf("comment =%sn",yytext);}
{st} {printf("string =%sn",yytext);}
%%
main()
{
yylex();
disp();
}
struct symtab *install_id(char *s)
{
struct symtab *p;
//printf("in symbol tablen");
for(p=&sym[0];p<&sym[10];p++)
{
if(p->name && !strcmp(s,p->name))
return p;
if(!p->name)
{
p->name=strdup(s);
return p;
}
}
}
void disp()
{
struct symtab *p;
printf("symbol tablen");
for(p=&sym[0];p<&sym[10];p++)
{
if(p->name)
printf("%sn",p->name);
}
}
int yywrap()
{
return 1;
}
[root@localhost ~]# lex la.l
[root@localhost ~]# gcc -o la lex.yy.c
[root@localhost ~]# ./la <b.txt
identifier =main
punct =(
punct =)
punct ={
keyword =int
identifier =a
punct =,
identifier =b
punct =;
identifier =a
binary op ==
identifier =b
binary op =+
identifier =c
punct =;
keyword =if
punct =(
identifier =a
relational op=<=
identifier =m
punct =)
[root@localhost ~]# lex la.l
[root@localhost ~]# gcc -o la lex.yy.c
[root@localhost ~]# ./la <b.txt
identifier =main
punct =(
punct =)
punct ={
keyword =int
identifier =a
punct =,
identifier =b
punct =;
identifier =a
binary op ==
identifier =b
binary op =+
identifier =c
punct =;
keyword =if
punct =(
identifier =a
relational op=<=
identifier =m
punct =)
[root@localhost ~]# lex la.l
[root@localhost ~]# gcc -o la lex.yy.c
[root@localhost ~]# ./la <b.txt
identifier =main
punct =(
punct =)
punct ={
keyword =int
identifier =a
punct =,
identifier =b
punct =;
identifier =a
binary op ==
identifier =b
binary op =+
identifier =c
punct =;
keyword =if
punct =(
identifier =a
relational op=<=
identifier =m
punct =)
identifier =a
unary op =++
punct =;
comment =/*dfdfdf*/
punct =}
symbol table
main
a
b
c
m
[root@localhost ~]# cat b.txt
main()
{
int a,b;
a=b+c;
if(a<=m)
a++;
/*dfdfdf*/
}
[root@localhost ~]#

Compiler Design_LEX Tool for Lexical Analysis.pptx

  • 1.
    LEX Tool by Dr. R.A. Deshmukh
  • 2.
    LEX TOOL  Atool widely used to specify lexical analyzers for a variety of languages or the breaking up of an input stream into meaningful units, or tokens.  For example, consider breaking a text file up into individual words. We refer to the tool as Lex compiler and to its input specification as the Lex language.
  • 3.
    LEX TOOL LexicalAnalyzer Generator Lexical Compiler Lex Source program lex.l lex.yy.c C compiler lex.yy. c a.out a.out Input stream Sequence of tokens
  • 4.
    Structure of Lexprograms declarations %% translation rules %% auxiliary functions Pattern {Action}
  • 5.
    Internal Structure ofLex Lex Regular expressions NFA DFA Minimal DFA The final states of the DFA are associated with actions
  • 6.
    Declaration The declarations section includesdeclarations of variables,manifest constants(A manifest constant is an identifier that is declared to represent a constant e.g. # define PIE 3.14), the files to be included. (written in %{ %} )  and definitions of regular
  • 7.
    Translation rules The translationrules of a Lex program are statements of the form : p1 {action 1} p2 {action 2} p3 {action 3} … … where each p is a regular expression and each action is a program fragment describing what action the lexical analyzer should take when a string pattern matches with p matches a lexeme. In Lex the actions are written in C.
  • 8.
    Auxiliary procedures The thirdsection holds whatever auxiliary procedures are needed by the actions. Alternatively these procedures can be compiled separately and loaded with the lexical analyzer. The auxiliary procedures are written in C language.
  • 9.
    How does thisLexical analyzer work? The lexical analyzer created by Lex behaves in concert with a parser in the following manner. When activated by the parser, the lexical analyzer begins reading its remaining input , one character at a time, until it has found the longest prefix of the input that is matched by one of the regular expressions p contd…
  • 10.
    How does thisLexical analyzer work? then it executes the corresponding action. Typically the action will return control to the parser. However, if it does not, then the lexical analyzer proceeds to find more lexemes, until an action causes control to return to the parser. The repeated search for lexemes until an explicit return allows the lexical analyzer to process white space and comments conveniently.  The lexical analyzer returns a single quantity, the token, to the parser. To pass an attribute value with information about the lexeme, we can set the global variable yylval. contd…
  • 11.
     e.g. Supposethe lexical analyzer returns a single token for all the relational operators, in which case the parser won’t be able to distinguish between ”<=”, ”>=”, ”<”, ”>”, ”==” etc. We can set yylval appropriately to specify the nature of the operator.
  • 12.
    The two variablesyytext and yyleng  Lex makes the lexeme available to the routines appearing in the third section through two variables yytext and yyleng 1) yytext is a variable that is a pointer to the first character of the lexeme. 2)yyleng is an integer telling how long the lexeme is.
  • 13.
    A lexeme maymatch more than one patterns. How is this problem resolved?  Take for example the lexeme if. It matches the patterns for both keyword if and identifier. If the pattern for keyword if precedes the pattern for identifier in the declaration list of the lex program the conflict is resolved in favour of the keyword. In general this ambiguity-resolving strategy makes it easy to reserve keywords by listing them ahead of the pattern for identifiers. The Lex’s strategy of selecting the longest prefix matched by a pattern makes it easy to resolve other conflicts like the one between “<” and “<=”.
  • 14.
    Regular Expression Basics . : matches any single character except n  * : matches 0 or more instances of the preceding regular expression  + : matches 1 or more instances of the preceding regular expression  ? : matches 0 or 1 instance of the preceding regular expression  | : matches the preceding or following regular expression  [ ] : defines a character class  () : groups enclosed regular expression into a new regular expression  “…”: matches everything within the “ “ literally
  • 15.
    Regular Expression Basics x|y: x or y {i} : definition of i x/y : x, only if followed by y (y not removed from input) x{m,n} : m to n occurrences of x  x :x, but only at beginning of line x$ : x, but only at end of line “s” : exactly what is in the quotes A regular expression finishes with a space, tab or newline.
  • 16.
    Pattern Matching Examples ExpressionMatches abc* ab abc abcc abccc ... abc+ abc abcc abccc ... a(bc)+ abc abcbc abcbcbc ... a(bc)? a abc [abc] one of: a, b, c [a-z] any letter, a-z [a-z] one of: a, -, z [-az] one of: -, a, z [A-Za-z0-9]+ one or more alphanumeric characters [ tn]+ whitespace [^ab] anything except: a, b [a^b] one of: a, ^, b
  • 17.
    Lex Predefined Variables NameFunction int yylex(void) call to invoke lexer, returns token char *yytext pointer to matched string yyleng length of matched string yylval value associated with token int yywrap(void) wrapup, return 1 if done, 0 if not done FILE *yyout output file FILE *yyin input file ECHO write matched string
  • 18.
    Usage  To runLex on a source file, type lex scanner.l  It produces a file named lex.yy.c which is a C program for the lexical analyzer.  To compile lex.yy.c, type gcc lex.yy.c  To run the lexical analyzer program, type 1) ./a.out <inputfile 2) ./a.out 3) ./a.out <inputfile >outputfile
  • 19.
    Program to countno. of words %{ int counter = 0; %} letter [a-zA-Z] %% {letter}+ {printf(“a wordn”); counter++;} %% main() { yylex(); printf(“There are total %d wordsn”, counter); } int yywrap(void) { return 1; } [root@localhost ~]# lex a1.l [root@localhost ~]# gcc lex.yy.c [root@localhost ~]# ./a.out pune a word bombay a word
  • 20.
    Program to countno. of identifiers digit [0-9] letter [a-zA-Z] %{ # include<stdio.h> %} %% {letter}({letter}|{digit})* {printf(“id: %sn”, yytext);} n{printf(“new linen”);} %% main() { yylex(); } int yywrap(void) { return 1; }
  • 21.
    [root@localhost ~]# lexa3.l [root@localhost ~]# gcc -o a3 lex.yy.c [root@localhost ~]# ./a3 abc ghg jkhjk id: abc id: ghg id: jkhjk new line [root@localhost ~]# ./a3 <a.txt id: Today id: is id: Friday new line id: Environment id: is id: good new line [root@localhost ~]# cat a.txt Today is Friday Environment is good
  • 22.
    Program to countno. of characters, words and lines in a program %{ int nchar=0, nword=0, nline=0; %} %% n { nline++; nchar++; } [^ tn]+ { nword++, nchar += yyleng; } . { nchar++; } %% int main(void) { yylex(); printf("%dt%dt%dn", nchar, nword, nline); return 0; }
  • 23.
    [root@localhost ~]# lexa2.l [root@localhost ~]# gcc -o a2 lex.yy.c [root@localhost ~]# ./a2 hello good morning how r u 28 6 2 [root@localhost ~]#
  • 24.
    Program to countno. of lines, tabs, characters,spaces and words in a program %{ # include<stdio.h> int lc=0,tc=0,sc=0,wc=0,cc=0; %} %% [^ tn]+ {cc+=yyleng; wc++;} [ ] {cc++;sc++;} [t] {cc++; tc++;} n {cc++; lc++;} %% main() { yyin = fopen("a1.txt","r"); yyout=fopen("a2.txt","w"); yylex(); fclose(yyin); fclose(yyout); } int yywrap() { fprintf(yyout,"no. of lines =%dn",lc); fprintf(yyout,"no. of tabs=%dn",tc); fprintf(yyout,"no. of characters = %d n",cc); fprintf(yyout,"no. of spaces = %d n",sc); fprintf(yyout,"no. of words=%dn",wc); return 1; }
  • 25.
    [root@localhost ~]# lexl1.l [root@localhost ~]# gcc -o l1 lex.yy.c [root@localhost ~]# ./l1 [root@localhost ~]# cat a1.txt Today is Friday Environment is good [root@localhost ~]# cat a2.txt no. of lines =2 no. of tabs=4 no. of characters = 39 no. of spaces = 3 no. of words=6 [root@localhost ~]#
  • 26.
    %{ # include<stdio.h> int pcount=0,ncount=0; %} pos_num[+]?[0-9]+(.[0-9]+)?([eE][-+]?[0-9]+)? neg_num [-][0-9]+(.[0-9]+)?([eE][-+]?[0-9]+)? %% {pos_num} {printf("positive number = %sn",yytext);pcount+ +;} {neg_num} {printf("negative number = %sn",yytext);ncount+ +;} .|n {} %% main() { yylex(); } int yywrap() { printf("no. of positive numbers = %dn",pcount); printf("no. of negative numbers = %dn",ncount); Program to count positive and negative numbers in a program
  • 27.
    [root@localhost ~]# lexnum.l [root@localhost ~]# gcc -o num lex.yy.c [root@localhost ~]# ./num 23 -90 positive number = 23 negative number = -90 90.78E+4 positive number = 90.78E+4 no. of positive numbers = 2 no. of negative numbers = 1 [root@localhost ~]#
  • 28.
    %{ # include<stdio.h> # include<string.h> structsymtab { char *name; }; struct symtab sym[10],*k; struct symtab *install_id(char *s); void disp(); %}
  • 29.
    L [a-zA-Z] D [0-9] id{L}({L}|{D})* num {D}+(.{D}+)?([eE][-+]?{D}+)? bop [-+*/=] uop "++"|"--" relop "<"|"<="|">"|">="|"!="|"==" lop "&&"|"||" bitlop [&|!] kew “if"|"else"|"while"|"do"|"for"|"int"|"char"|"float" pun [,;'"[]{})(] comment "/*"(.|n)*"*/"|"//"(.)* ws [ tn]+ st "(.)*"
  • 30.
    %% {ws} { } {kew}{printf("keyword =%sn",yytext);} {id} {k=install_id(yytext);printf("identifier =%sn",yytext);} {num} {printf("constant =%sn",yytext);} {bop} {printf("binary op =%sn",yytext);} {uop} {printf("unary op =%sn",yytext);} {relop} {printf("relational op=%sn",yytext);} {lop} {printf("logical op =%sn",yytext);} {pun} {printf("punct =%sn",yytext);} {bitlop} {printf("bitwise logical op=%sn",yytext);} {comment} {printf("comment =%sn",yytext);} {st} {printf("string =%sn",yytext);} %%
  • 31.
    main() { yylex(); disp(); } struct symtab *install_id(char*s) { struct symtab *p; //printf("in symbol tablen"); for(p=&sym[0];p<&sym[10];p++) { if(p->name && !strcmp(s,p->name)) return p; if(!p->name) { p->name=strdup(s); return p; } } } void disp() { struct symtab *p; printf("symbol tablen"); for(p=&sym[0];p<&sym[10];p++) { if(p->name) printf("%sn",p->name); } } int yywrap() { return 1; }
  • 32.
    [root@localhost ~]# lexla.l [root@localhost ~]# gcc -o la lex.yy.c [root@localhost ~]# ./la <b.txt identifier =main punct =( punct =) punct ={ keyword =int identifier =a punct =, identifier =b punct =; identifier =a binary op == identifier =b binary op =+ identifier =c punct =; keyword =if punct =( identifier =a relational op=<= identifier =m punct =) [root@localhost ~]# lex la.l [root@localhost ~]# gcc -o la lex.yy.c [root@localhost ~]# ./la <b.txt identifier =main punct =( punct =) punct ={ keyword =int identifier =a punct =, identifier =b punct =; identifier =a binary op == identifier =b binary op =+ identifier =c punct =; keyword =if punct =( identifier =a relational op=<= identifier =m punct =) [root@localhost ~]# lex la.l [root@localhost ~]# gcc -o la lex.yy.c [root@localhost ~]# ./la <b.txt identifier =main punct =( punct =) punct ={ keyword =int identifier =a punct =, identifier =b punct =; identifier =a binary op == identifier =b binary op =+ identifier =c punct =; keyword =if punct =( identifier =a relational op=<= identifier =m punct =)
  • 33.
    identifier =a unary op=++ punct =; comment =/*dfdfdf*/ punct =} symbol table main a b c m [root@localhost ~]# cat b.txt main() { int a,b; a=b+c; if(a<=m) a++; /*dfdfdf*/ } [root@localhost ~]#