2. Lex
Lex is a program generator that produces lexical analyzers; it is widely used on Unix and is most often paired with the Yacc parser generator.
It was written by Mike Lesk and Eric Schmidt.
Lex reads an input specification describing the lexical analyzer and outputs source code implementing that analyzer in the C programming language.
Given patterns (regular expressions), Lex produces C code for a lexical analyzer that scans the input for matches, such as identifiers.
Hathal & Ahmad
3. Lex
◦ A simple pattern: letter(letter|digit)*
Lex translates regular expressions into a C program that mimics an FSA (finite state automaton).
This pattern matches a string of characters that begins with a single letter followed by zero or more letters or digits.
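As an illustration of what such an automaton does (this is a hand-written sketch, not the table-driven scanner Lex actually generates), the pattern letter(letter|digit)* can be checked by a small C function:

```c
#include <ctype.h>

/* Sketch: returns 1 if the whole string s matches letter(letter|digit)*,
   else 0. Lex would generate an FSA-based scanner doing the same check. */
int is_identifier(const char *s)
{
    if (!isalpha((unsigned char)*s))      /* must begin with a single letter */
        return 0;
    for (s++; *s; s++)                    /* then zero or more letters or digits */
        if (!isalnum((unsigned char)*s))
            return 0;
    return 1;
}
```

The function name `is_identifier` is illustrative; the real generated scanner lives in `lex.yy.c` and is driven by state-transition tables.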
4. Lex
Lex has some limitations: it cannot recognize nested structures such as balanced parentheses, since it only has states and transitions between states (no stack).
So Lex is good at pattern matching, while Yacc handles the more challenging tasks.
9. Lex
Whitespace must separate the defining term and the associated expression.
Code in the definitions section is copied as-is to the top of the generated C file and must be bracketed with “%{” and “%}” markers.
Substitutions in the rules section are surrounded by braces ({letter}) to distinguish them from literals.
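A small sketch of a definitions section following these conventions (the names `letter` and `digit` and the action are illustrative, not from the slides):

```lex
%{
/* copied verbatim to the top of the generated C file */
#include <stdio.h>
%}
letter  [A-Za-z]
digit   [0-9]
%%
{letter}({letter}|{digit})*   { printf("identifier: %s\n", yytext); }
```

Note how `{letter}` and `{digit}` in the rule refer back to the definitions, while an unbraced `letter` would match the literal six characters.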
10. Yacc
Theory:
◦ Yacc reads the grammar and generates C code for a parser.
◦ Grammars are written in Backus-Naur Form (BNF).
◦ BNF grammars are used to express context-free languages.
◦ E.g., to parse an expression, the parser does the reverse operation (reducing the expression).
◦ This is known as bottom-up or shift-reduce parsing.
◦ A stack (LIFO) is used for storage.
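As a worked illustration of shift-reduce parsing (using the hypothetical rule expr → expr + expr | id, not a grammar from this slide), parsing the input `id + id` proceeds as follows:

```text
stack          remaining input   action
(empty)        id + id           shift id
id             + id              reduce id → expr
expr           + id              shift +
expr +         id                shift id
expr + id                        reduce id → expr
expr + expr                      reduce expr + expr → expr
expr                             accept
```

Each shift pushes a token onto the stack; each reduce pops a rule's right-hand side and pushes its left-hand side, until only the start symbol remains.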
11. Yacc
Input to yacc is divided into three sections.
... definitions ...
%%
... rules ...
%%
... subroutines ...
12. Yacc
The definitions section consists of:
◦ token declarations.
◦ C code bracketed by “%{” and “%}”.
The rules section consists of:
◦ the BNF grammar.
The subroutines section consists of:
◦ user subroutines.
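A minimal sketch of this three-section layout (the token name `ID` and the single rule are illustrative):

```yacc
%{
#include <stdio.h>        /* C code copied into the generated parser */
%}
%token ID                  /* token declaration */
%%
expr : expr '+' expr       /* BNF grammar rule */
     | ID
     ;
%%
/* user subroutines, e.g. yyerror() and main(), go here */
```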
13. Yacc & Lex Together
The grammar:
program -> program expr | ε
expr -> expr + expr | expr - expr | id
program and expr are nonterminals.
id is a terminal (a token returned by lex).
An expression may be:
◦ the sum of two expressions,
◦ the difference of two expressions,
◦ or an identifier.
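The grammar above can be written in Yacc's notation as follows (a sketch; the `%token ID` declaration is assumed, and ε becomes an empty alternative):

```yacc
%token ID
%%
program : program expr
        |                  /* empty alternative: the ε production */
        ;
expr    : expr '+' expr
        | expr '-' expr
        | ID               /* terminal returned by the lex scanner */
        ;
```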
17. The Lex and Flex Scanner Generators
Lex and its newer cousin flex are scanner generators.
They systematically translate regular definitions into C source code for efficient scanning.
The generated code is easy to integrate into C applications.
18. Creating a Lexical Analyzer with Lex and Flex
lex source program (lex.l) → lex or flex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens
19. Lex Specification
A lex specification consists of three parts:
regular definitions and C declarations in %{ %}
%%
translation rules
%%
user-defined auxiliary procedures
The translation rules are of the form:
p1 { action1 }
p2 { action2 }
…
pn { actionn }
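Concretely, each translation rule pairs a pattern with a C action (the patterns and actions below are illustrative):

```lex
%%
[0-9]+        { printf("number\n"); }
[ \t\n]+      { /* skip whitespace: empty action */ }
.             { printf("other: %c\n", yytext[0]); }
```

When the scanner matches a pattern pi, it executes the corresponding C code in actioni.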
20. Regular Expressions in Lex
x      match the character x
\.     match the character .
“string” match the contents of string of characters
.      match any character except newline
^      match the beginning of a line
$      match the end of a line
[xyz]  match one character x, y, or z (use \ to escape -)
[^xyz] match any character except x, y, and z
[a-z]  match one of a to z
r*     closure (match zero or more occurrences)
r+     positive closure (match one or more occurrences)
r?     optional (match zero or one occurrence)
r1r2   match r1 then r2 (concatenation)
r1|r2  match r1 or r2 (union)
( r )  grouping
r1/r2  match r1 when followed by r2 (lookahead)
{d}    match the regular expression defined by d
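A few illustrative patterns combining these operators (all are sketches, not rules from the slides):

```lex
[A-Za-z_][A-Za-z0-9_]*     /* identifier: character class, then closure */
[0-9]+(\.[0-9]+)?          /* number with optional fraction: +, ?, grouping, \. */
"if"|"else"                /* literal strings combined with union */
^#.*$                      /* whole line starting with #: ^, ., *, $ */
```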