2. WHAT IS LEXICAL ANALYSIS
Lexical analysis is the first phase of a compiler.
The input is a high-level language program, such as a 'C' program, in the form of a sequence of characters.
The output is a sequence of tokens that is sent to the parser for syntax analysis.
The lexical analyzer strips blanks, tabs, newlines, and comments from the source program.
It also keeps track of line numbers.
3. TOKENS, PATTERNS, AND LEXEMES
Token (also called a word)
A class of similar lexemes that logically belong together,
e.g., identifiers, keywords, constants.
Pattern
A rule that describes the set of strings making up a token;
the pattern is said to match each string in the set.
E.g., the pattern for an identifier: a letter followed by
letters or digits.
Lexeme
The actual sequence of characters matched by a pattern to
form an instance of the corresponding token.
4. EXAMPLES OF TOKENS, LEXEMES, AND PATTERNS
Classes of Tokens
Identifiers: names chosen by the programmer
Keywords: names already in the programming
language
Separators: punctuation characters
Operators: symbols that operate on arguments
and produce results
Literals: numeric, logical, textual literals
Token     Lexeme      Pattern
ID        x, y, n0    letter followed by letters and digits
NUM       -123, 4.5   any numeric constant
IF        if          if
LPAREN    (           (
LITERAL   "Hello"     any string of characters between " and "
6. TOKENS IN PROGRAMMING LANGUAGE
Keywords, operators, identifiers, constants, literal
strings, punctuation symbols such as parentheses,
brackets, commas, semicolons, and colons, etc.
A unique integer representing the token is passed by
the lexical analyzer to the parser.
Attributes for tokens (apart from the integer
representing the token)
identifier: the lexeme of the token, or a pointer into the
symbol table where the lexeme is stored by the lexical
analyzer.
intnum: the value of the integer (similarly for
floatnum, etc.)
string: the string itself.
The exact set of attributes depends on the
compiler designer.
7. SPECIFICATION AND RECOGNITION OF TOKENS
Regular definitions, a mechanism based on regular
expressions, are a very popular way of specifying tokens.
They are implemented in the lexical analyzer generator
tool, LEX.
We study regular expressions first and then token
specification using LEX.
Transition diagrams, a variant of finite state automata,
are used to implement regular definitions and to
recognize tokens.
Transition diagrams are usually used to model a lexical
analyzer before translating it to a program by hand.
LEX automatically generates an optimized FSA from regular
definitions.
We study FSA and their generation from regular
expressions in order to understand transition
diagrams and LEX.
8. SPECIFYING AND RECOGNIZING TOKENS
Alphabet: any finite set of symbols.
{0,1} is the binary alphabet
{a-z, A-Z} is the alphabet of English letters
{0-9, A-F} is the hexadecimal alphabet
Strings: any finite sequence of symbols from an alphabet.
The length of a string is the total number of symbols in it
The string of length zero is known as the empty string and is denoted by ε
(epsilon)
Special Symbols (in a C-like language):
Arithmetic: +, -, %, *, /
Assignment: =, +=, /=, *=, -=
Punctuation: , ; . ->
Logical: &, &&, |, ||, !
Preprocessor: #
Location specifier (address-of): &
9. LANGUAGE
A language is a set of strings over some finite
alphabet; the set itself may be finite or infinite.
Mathematical set operations can be performed on
computer languages.
Regular expressions describe the regular languages,
which include every finite language.
10. REGULAR EXPRESSION
A regular expression is an expression that describes a set of strings.
It is an important notation for specifying tokens.
The lexical analyzer scans and identifies only those valid
strings/tokens/lexemes that belong to the language at hand.
Regular grammar: a grammar that defines the same
languages as regular expressions.
Regular language: a language defined by a regular
grammar.
Regular expressions are used to specify the patterns of
tokens.
11. Most formalisms provide the following operations to
construct regular expressions:
Alternation:
A vertical bar separates alternatives
E.g., gray|grey can match "gray" or "grey"
Grouping:
Use parentheses to define the scope and precedence of
the operators.
E.g., gray|grey and gr(a|e)y are equivalent
Quantification:
Specifies how often an element is allowed to occur.
12. SYNTAX OF REGULAR EXPRESSION
metasequence   description
.       Matches any single character except newline
[ ]     Any single character contained within the brackets
        [abc] = {a, b, c}
[^ ]    Any single character not contained within the brackets
        [^abc] = { x : x is a character other than a, b, or c }
*       Zero or more times
        ab*c matches ac, abc, abbc, ...
+       One or more times
        [0-9]+ matches 1, 10, 116, ...
?       Zero or one time
        [0-9]? matches "" (empty) or a single digit such as 8
|       Choice (aka alternation or set union)
        abc|def matches "abc" or "def"
( )     Groups a subexpression
        (01) denotes the string "01"
13. AUTOMATA
An automaton is a machine that accepts a language.
Finite state automata accept the regular languages,
which correspond to regular expressions.
Applications of automata:
Switching circuit design
Lexical analyzers in compilers
String processing (grep, awk), etc.
State charts used in object-oriented design
Modeling control applications, e.g., elevator operation
Parsers of all types
Compilers
14. FINITE STATE AUTOMATON
An FSA is an acceptor, or recognizer, of regular languages.
It is a 5-tuple (Q, Σ, δ, q0, F):
Q : finite set of states
Σ : input alphabet
δ : transition function, δ : Q × Σ → Q
q0 : the start state
F : the set of final (accepting) states
In one move from some state q, an FSA reads an input
symbol, changes the state based on δ, and gets ready to
read the next input symbol.
If the last state reached is not a final state, then the input
string is rejected.
16. TYPES OF FSA
Non-deterministic Finite Automata (NFA)
There may be multiple possible transitions on the same
input, or transitions that require no input (ε-transitions).
Deterministic Finite Automata (DFA)
The transition from each state is uniquely determined by
the current input character:
for each state, at most one edge labeled "a" leaves that state
there are no ε-transitions
17. NON-DETERMINISTIC FINITE AUTOMATA
An NFA is a 5-tuple (Q, Σ, δ, q0, F) where
δ : Q × Σ → 2^Q (each move leads to a set of states)
Given the current state, there can be multiple next
states:
the next state may be chosen at random, or
all the next states may be explored in parallel.
Examples:
L = { all strings that end with 0 }
L = { all strings that start with 0 }
L = { all strings over {0,1} of length 2 }
18. DETERMINISTIC FINITE AUTOMATA
Given the current state and input symbol, we know what
the next state will be:
there is exactly one next state
there are no choices and no randomness
DFAs are simple and easy to design
19. NFA TO DFA
Every DFA is an NFA, but not vice versa.
There is an equivalent DFA for every NFA.
A dead configuration in an NFA is equivalent to a
dead/trap state in the DFA.
Exercise: find the equivalent DFA for the NFA given by
M = ({A,B,C}, {a,b}, δ, A, {C}), where δ is given by:
δ      a      b
A     A,B     C
B      A      B
C      -     A,B
20. EXAMPLES
L = { all strings over {0,1} that end with '01' }
Design an NFA for a language that accepts all
strings over {0,1} in which the second-to-last symbol is
always '1'. Then convert it to its equivalent DFA.
21. ERROR RECOVERY
Certain languages do not have any reserved words;
e.g., while, do, if, else, etc., are reserved in 'C' and
'C++', but not in PL/1.
In FORTRAN, some keywords are context-dependent.
In the statement DO 10 I = 10.86, DO10I is an
identifier, and DO is not a keyword.
But in the statement DO 10 I = 10, 86, DO is a
keyword.
Such features require substantial lookahead for
resolution.
The lexical analyzer skips characters in the input until a
well-formed token is found.
22. When an error occurs, the lexical analyzer recovers by:
skipping (deleting) successive characters from the
remaining input until the lexical analyzer can find a
well-formed token (panic-mode recovery),
deleting extraneous characters,
inserting missing characters,
replacing an incorrect character with a correct
character, or
transposing two adjacent characters.
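Panic-mode recovery, the first strategy above, can be sketched as a loop that discards characters until one that could begin a token appears (the helper name and the notion of "can start a token" are illustrative simplifications):

```c
#include <ctype.h>
#include <string.h>

/* Return a pointer to the first character in p that could begin a
   well-formed token; everything before it is skipped (panic mode).
   "Can start a token" is simplified here to: a letter, a digit, or
   one of a few operator/punctuation characters. */
const char *panic_skip(const char *p) {
    while (*p && !isalnum((unsigned char)*p)
              && strchr("+-*/=();,", *p) == NULL)
        ++p;                     /* delete one offending character */
    return p;
}
```

For input like `@@x = 1`, the stray `@` characters are consumed and scanning resumes at `x`.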
23. LEXICAL ANALYZER GENERATOR (LEX)
Lexer or scanner
The algorithm that divides the program into lexical units.
Lex
A program that takes a set of descriptions of possible
tokens and produces a C routine that implements a
scanner.
24. LEX STRUCTURE
%{
<c global variables, prototypes, comments>
%}
[ Definition Section ]
%%
[ Rules Section ] – define how to scan and what
action to take for each token
%%
[ C auxiliary subroutines ] – any user code
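A minimal complete specification following this structure might look as follows (built with `flex`; the token names and actions are illustrative, not prescribed):

```lex
%{
/* C globals visible to the generated scanner */
#include <stdio.h>
int num_count = 0;
%}
DIGIT   [0-9]

%%
{DIGIT}+        { num_count++; printf("NUM(%s)\n", yytext); }
[ \t\n]+        { /* skip whitespace */ }
.               { printf("OTHER(%c)\n", yytext[0]); }
%%

int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }
```

The definition section names the pattern `DIGIT`, the rules section pairs patterns with C actions, and the auxiliary section supplies `main` and `yywrap`.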
25. RULES SECTION
Format
pattern { corresponding actions }
---
pattern { corresponding actions }
pattern:  a regular expression;  action:  C code
Example
[0-9][0-9]* { printf("number"); }
26. TWO NOTES ON LEX
1. Lex matches the longest possible token.
Input: abc
Rule: [a-z]+
-> Token: abc (not "a" or "ab")
2. Lex uses the first applicable rule.
Input: post
Rule 1: "post" { printf("hello"); }
Rule 2: [a-zA-Z]+ { printf("world"); }
-> It will print "hello" (not "world")
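Both disambiguation rules can be mimicked in plain C: match every pattern maximally at the current position, keep the longest match, and break ties by rule order. A sketch with two hard-coded "rules" mirroring the example above (the function names are ours):

```c
#include <string.h>
#include <ctype.h>

/* Rule 1: the literal keyword "post". Returns the match length (0 or 4). */
static int match_kw(const char *s) {
    return strncmp(s, "post", 4) == 0 ? 4 : 0;
}

/* Rule 2: identifier [a-zA-Z]+. Returns the length of the longest match. */
static int match_id(const char *s) {
    int n = 0;
    while (isalpha((unsigned char)s[n])) n++;
    return n;
}

/* Lex-style disambiguation at position s: the longest match wins;
   on a tie, the earlier rule wins. Returns 1 for the keyword rule,
   2 for the identifier rule, 0 if no rule matches. */
int which_rule(const char *s) {
    int k = match_kw(s), i = match_id(s);
    if (k == 0 && i == 0) return 0;
    return (k >= i) ? 1 : 2;    /* >= : rule 1, listed first, wins ties */
}
```

On "post" both rules match 4 characters and rule 1 wins the tie; on "poster" the identifier rule matches 6 characters and wins outright.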
27. VARIABLES OF A LEX PROGRAM
yytext
Whenever the scanner matches a token, the text of the
token is stored in the null-terminated string yytext.
It is a variable that points to the first character of the
lexeme.
yyleng
The length of the string in yytext.
yylex( )
The scanner created by Lex has the entry point yylex( ).