3. Lexical Analysis 3
The Goals of Chapter 3
See how to construct a lexical analyzer
– Lexeme description
– Code to recognize lexemes
Lexical Analyzer Generator
– Lex and Flex
Regular Expressions
– To nondeterministic finite automata (NFAs)
– To deterministic finite automata (DFAs)
– Then to code
4. Lexical Analysis 4
What is a Lexical Analyzer
supposed to do?
Read the input characters of the source
program,
Group them into lexemes,
– Consult (or update) the symbol table for
this lexeme
Produce as output a sequence of tokens
– This stream is sent to the parser
5. Lexical Analysis 5
What is a Lexical Analyzer
supposed to do?
It may also
– strip out comments and whitespace.
– Keep track of line numbers in the source file
(for error messages)
6. Lexical Analysis 6
Terms
Token
– A pair (token name, [attribute value])
– Ex: (integer, 32)
Pattern
– A description of the form the lexemes will take
– Ex: regular expressions
Lexeme
– An instance matched by the pattern.
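The three terms can be seen side by side in a short Python sketch (the integer token and its regular expression are illustrative, not from any particular language):

```python
# Token, pattern, lexeme for a hypothetical integer token:
# the pattern is a regular expression, the lexeme is the matched
# text, and the token is the (name, attribute) pair.
import re

pattern = re.compile(r'[0-9]+')           # the pattern
lexeme = pattern.match("32 + x").group()  # the lexeme: "32"
token = ("integer", int(lexeme))          # the token: (integer, 32)
```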
8. Lexical Analysis 8
Token Attributes
When more than one lexeme can match a
pattern, the lexical analyzer must provide
the next phase with additional information.
– Number – needs a value
– Id – needs the identifier name, or more likely a
pointer into the symbol table
9. Lexical Analysis 9
Specification of Tokens
Regular expressions are an important
notation for specifying lexeme patterns.
We are going to
– first study the formal notation for regular
expressions,
– then show how to build a recognizer from a
regular expression,
– then see how they are used in a lexical
analyzer.
10. Lexical Analysis 10
Buffering
In principle, the analyzer goes through the
source string a character at a time;
In practice, it must be able to access
substrings of the source.
Hence the source is normally read into a
buffer
The scanner needs two subscripts to note
places in the buffer
– lexeme start & current position
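A minimal sketch of the two-subscript idea in Python (the function and variable names are our own):

```python
# Two subscripts into the buffer: lexeme_begin marks where the current
# lexeme starts; forward scans ahead one character at a time.
def scan_identifier(buffer: str, lexeme_begin: int):
    """Return (lexeme, position just past it) for an identifier
    starting at lexeme_begin."""
    forward = lexeme_begin
    while forward < len(buffer) and (buffer[forward].isalnum()
                                     or buffer[forward] == '_'):
        forward += 1
    return buffer[lexeme_begin:forward], forward

lexeme, pos = scan_identifier("count1 = 0", 0)   # lexeme is "count1"
```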
11. Lexical Analysis 11
Strings and Alphabets
Def: Alphabet – a finite set of symbols
– Typically letters, digits, and punctuation
Example:
– {0,1} is a binary alphabet.
12. Lexical Analysis 12
Strings and Alphabets
Def: String – (over an alphabet) is a finite
sequence of symbols drawn from that
alphabet.
– Sometimes called a “word”
– The empty string e
13. Lexical Analysis 13
Strings and Alphabets
Def: Language – any countable set of
strings over some fixed alphabet.
Notes:
– Abstract languages like ∅ (the empty set)
and {e} fit this definition, but so do C programs
– This definition also does not require any
meaning – that will come later.
15. Lexical Analysis 15
Finite State Automata
The compiler writer defines tokens in the
language by means of regular
expressions.
Informally a regular expression is a
compact notation that indicates what
characters may go together into lexemes
belonging to that particular token and how
they go together.
We will see regular expressions later
16. Lexical Analysis 16
The lexical analyzer is best
implemented as a finite state machine
or a finite state automaton.
Informally a Finite-State Automaton is a
system that has a finite set of states
with rules for making transitions
between states.
The goal now is to explain these two
things in detail and bridge the gap from
the first to the second.
17. Lexical Analysis 17
State Diagrams
and State Tables
Def: State Diagram -- is a directed graph
where the vertices in the graph represent
states, and the edges indicate transitions
between the states.
Let’s consider a vending machine that
sells candy bars for 25 cents and accepts
nickels, dimes, and quarters.
19. Lexical Analysis 19
Def: State Table -- a table with states
down the left side and inputs across the
top; the entry for a given current state
and input gives the state to transition
to.
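As a concrete illustration, the 25-cent vending machine's state table can be built directly from this description (the state names, and the simplification of capping at 25 cents instead of giving change, are our own):

```python
# States are the cents inserted so far; 25 is the accepting state.
# Inputs are the three coin values: nickel, dime, quarter.
COINS = (5, 10, 25)
STATES = (0, 5, 10, 15, 20, 25)

# table[state][coin] -> next state (overpayment capped at 25 here)
table = {s: {c: min(s + c, 25) for c in COINS} for s in STATES}

def run(coins):
    """Feed a sequence of coins to the machine; True if we can buy."""
    state = 0
    for c in coins:
        state = table[state][c]
    return state == 25
```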
20. Lexical Analysis 20
Formal Definition
Def: FSA -- A FSA, M consists of
– a finite set of input symbols S (the input alphabet)
– a finite set of states Q
– A starting state q0 (which is an element of Q)
– A set of accepting states F (a subset of Q)
(these are sometimes called final states)
– A state-transition function N: (Q x S) -> Q
M = (S, Q, q0, F, N)
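The five-tuple translates almost literally into code. Here is a sketch in Python; the example machine (accepting binary strings with an even number of 1s) is our own illustration, not from the slides:

```python
# M = (S, Q, q0, F, N) rendered directly as a small class.
class FSA:
    def __init__(self, S, Q, q0, F, N):
        self.S = S      # input alphabet
        self.Q = Q      # states
        self.q0 = q0    # start state
        self.F = F      # accepting states
        self.N = N      # transition function as a dict: (state, symbol) -> state

    def accepts(self, w):
        q = self.q0
        for a in w:
            q = self.N[(q, a)]
        return q in self.F

# Example: even number of 1s ('e' = even so far, 'o' = odd so far).
even_ones = FSA(S={'0', '1'}, Q={'e', 'o'}, q0='e', F={'e'},
                N={('e', '0'): 'e', ('e', '1'): 'o',
                   ('o', '0'): 'o', ('o', '1'): 'e'})
```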
23. Lexical Analysis 23
Example 1 (cont.):
– State 1 is the Starting State. This is shown
by the arrow in the Machine, and the fact
that it is the first state in the table.
– States 3 and 4 are the accepting states.
This is shown by the double circles in the
machine, and the fact that they are
underlined in the table.
26. Lexical Analysis 26
Example 2 (cont.):
– This machine shows that it is entirely
possible to have unreachable states in an
automaton.
– These states can be removed without
affecting the automaton.
27. Lexical Analysis 27
Acceptance
We use FSAs to recognize tokens.
A character string is recognized (or
accepted) by FSA M if, when the last
character has been read, M is in one of
the accepting states.
– If we pass through an accepting state, but end
in a non-accepting state, the string is NOT
accepted.
28. Lexical Analysis 28
Def: language -- A language is any set
of strings.
Def: a language over an alphabet S is
any set of strings made up only of the
characters from S
Def: L(M) -- the language accepted by
M is the set of all strings over S that are
accepted by M
29. Lexical Analysis 29
if we have an FSA for every token, then
the language accepted by that FSA is
the set of all lexemes embraced by that
token.
Def: equivalent --
M1 == M2 iff L(M1) = L(M2).
30. Lexical Analysis 30
An FSA can be easily programmed if the state
table is stored in memory as a two-
dimensional array.
table : array[1..nstates, 1..ninputs] of byte;
Given an input string w, the code would look
something like this:
state := 1;
for i := 1 to length(w) do
begin
  col   := char_to_col(w[i]);
  state := table[state, col]
end;
accept := state in accepting_states;
31. Lexical Analysis 31
Nondeterministic Finite-State
Automata
So far, the behavior of our FSAs has
always been predictable. But there is
another type of FSA in which the state
transitions are not predictable.
In these machines, the state transition
function is of the form:
N: Q x (S U {e}) -> P(Q)
– Note: some authors write the empty string
as a Greek lambda (l or L) instead of e
32. Lexical Analysis 32
This means two things:
– There can be transitions without input.
(That is why the e is in the domain)
– Input can transition to a number of states.
(That is the significance of the power set
in the codomain)
33. Lexical Analysis 33
Since this makes the behavior
unpredictable, we call it a
nondeterministic FSA
– So now we have DFAs and NFAs (or
NDFAs)
A string is accepted if there is at least
one path from the start state to an
accepting state.
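This "at least one path" rule is exactly what the standard set-based NFA simulation computes: track the set of all states the machine could be in. A sketch in Python, with an illustrative NFA for (a|b)*ab (the state numbering is ours; e-moves are written as transitions on the empty string ''):

```python
def eps_closure(states, delta):
    """All states reachable from `states` via e-moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ''), set()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def nfa_accepts(w, delta, start, accepting):
    """Accept iff at least one path over w ends in an accepting state."""
    current = eps_closure({start}, delta)
    for a in w:
        nxt = set()
        for q in current:
            nxt |= delta.get((q, a), set())
        current = eps_closure(nxt, delta)
    return bool(current & accepting)

# NFA for (a|b)*ab: loop in state 0, nondeterministically guess
# that the final "ab" has started.
delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
```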
38. Lexical Analysis 38
Equivalence
For every non-deterministic machine M we
can construct an equivalent deterministic
machine M'
Therefore, why study N-FSA?
– 1.Theory.
– 2.Tokens -> Reg.Expr. -> N-FSA -> D-FSA
40. Lexical Analysis 40
The Subset Construction
Constructing an equivalent DFA from a
given NFA hinges on the fact that
transitions between state sets are
deterministic even if the transitions
between states are not.
Acceptance states of the subset machine
are those subsets that contain at least 1
accepting state.
41. Lexical Analysis 41
Generic brute-force construction is
impractical.
– As the number of states in M increases,
the number of states in M' can increase
drastically (n vs. 2^n). If we have an NFA
with 20 states, |P(Q)| = 2^20 -- something
over a million.
– It also creates many unreachable states
(which can be omitted).
The trick is to only create subset states
as you need them.
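The lazy, create-subsets-only-as-needed version can be sketched as follows (the example NFA is again the illustrative (a|b)*ab machine; it has no e-moves, so closure computation is omitted for brevity):

```python
def subset_construct(delta, start, accepting, alphabet):
    """Build a DFA whose states are *sets* of NFA states, creating
    each subset only when it is first reached."""
    start_set = frozenset({start})
    dfa = {}                       # subset -> {symbol: subset}
    worklist = [start_set]
    while worklist:
        S = worklist.pop()
        if S in dfa:               # already expanded
            continue
        dfa[S] = {}
        for a in alphabet:
            T = frozenset(r for q in S for r in delta.get((q, a), set()))
            dfa[S][a] = T
            worklist.append(T)     # expanded later only if new
    # A subset accepts iff it contains at least one NFA accepting state.
    dfa_accepting = {S for S in dfa if S & accepting}
    return dfa, start_set, dfa_accepting

# The (a|b)*ab NFA again; only 3 of the 2^3 possible subsets appear.
delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
dfa, start, acc = subset_construct(delta, 0, {2}, 'ab')
```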
44. Lexical Analysis 44
Why do we care?
Lexical Analysis and Syntactic Analysis
are typically run off of tables.
These tables are large and laborious to
build.
Therefore, we use a program to build the
tables.
45. Lexical Analysis 45
But there are two major problems:
– How do we represent a token for the table
generating program?
– How does the program convert this into the
corresponding FSA?
Tokens are described using regular
expressions.
47. Lexical Analysis 47
Regular Expressions
Informally, a regular expression over an
alphabet S is a combination of characters
from S and certain operators indicating
concatenation, selection, or repetition.
– b* -- 0 or more b's (Kleene Star)
– b+ -- 1 or more b's
– | -- a|b -- choice
48. Lexical Analysis 48
Regular Expressions
Def: Regular Expression:
– any character in S is an RE
– e is an RE
– if R and S are RE's so are
RS, R|S, R*, R+, S*, S+.
Only expressions formed by these rules
are regular.
49. Lexical Analysis 49
Regular Expressions
REs can be used to describe only a limited
variety of languages, but they are powerful
enough to be used to define tokens.
One limitation: many languages put length
limits on their tokens, and REs have no
means of enforcing such limits.
50. Lexical Analysis 50
Extensions to REs
One or more Instances +
Zero or One Instance ?
Character Classes
– Digit -> [0-9]
– Digits -> Digit+
– Number -> Digits (. Digits)? (E [+-]? Digits)?
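These extended operators map directly onto Python's re syntax, so the Number definition above can be checked as written (the translation into one flat regex is ours):

```python
# Digit  -> [0-9]
# Digits -> Digit+
# Number -> Digits (. Digits)? (E [+-]? Digits)?
import re

NUMBER = re.compile(r'[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?')

assert NUMBER.fullmatch('42')        # Digits alone
assert NUMBER.fullmatch('3.14')      # optional fraction
assert NUMBER.fullmatch('6.02E+23')  # optional exponent
assert not NUMBER.fullmatch('.5')    # Number requires digits before the point
```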
56. Lexical Analysis 56
The Pumping Lemma
Given a machine M with n states,
suppose a string w in L(M) has length at least n.
Reading w, M passes through n+1 or more states,
so by the pigeonhole principle some state repeats;
call the substring read between the repeats y.
Therefore w = xyz, and the loop on y can be taken
any number of times,
so xy*z is also part of the language.
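The argument above is the standard statement of the lemma; in the usual notation:

```latex
% Pumping lemma for regular languages (standard statement).
\textbf{Lemma.} If $L = L(M)$ for a finite automaton $M$ with $n$ states,
then every $w \in L$ with $|w| \ge n$ can be written $w = xyz$ with
$|xy| \le n$ and $|y| \ge 1$, such that $xy^k z \in L$ for every $k \ge 0$.
```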
57. Lexical Analysis 57
The goal of the pumping lemma is to
show that there are some languages
that are not regular.
For Example:
– LR = {wcwR | w in {0,1}*}
– LP -- matching parens
this is handled in syntax analysis.
58. Lexical Analysis 58
Application to Lexical Analysis
Now you are ready to put it all together:
– Given two tokens' regular expressions
X = aa*(b|c)
Y = (b|c)c*
– Construct the NDFA
– Construct the DFA
60. Lexical Analysis 60
Recognizing Tokens
The scanner must ignore white space
(except to note the end of a token)
– Add white space transition from Start state to
Start state.
When you enter an accept state,
announce it
– (therefore you cannot pass through accept
states)
– The string may be the entire program.
61. Lexical Analysis 61
One accept state for each token, so we
know what we found.
Identifier/Keyword differences
– Accept everything as an identifier, and then
look up keywords in table. Or pre-load the
Symbol Table with Keywords.
When you read an identifier, you only know
it has ended after reading the character
that follows it. That extra character must
be backed up (put back on the input stream).
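The read-one-too-far-then-push-back step can be sketched as follows (the one-character pushback buffer and the class names are a minimal illustration of the idea, not a full scanner):

```python
class CharStream:
    """Input stream with one character of pushback."""
    def __init__(self, text):
        self.text, self.pos, self.pushed = text, 0, None

    def get(self):
        if self.pushed is not None:
            c, self.pushed = self.pushed, None
            return c
        if self.pos < len(self.text):
            c = self.text[self.pos]
            self.pos += 1
            return c
        return ''                    # end of input

    def unget(self, c):
        self.pushed = c

def read_identifier(s):
    """Read an identifier; we only see its end by reading past it."""
    name = ''
    c = s.get()
    while c.isalnum() or c == '_':
        name += c
        c = s.get()
    s.unget(c)                       # went one char too far: push it back
    return name
```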
62. Lexical Analysis 62
Comments
– Recognize the beginning of comment, and
then ignore everything until the end of
comment.
– What if there are multiple types of
comments?
Character Strings
– single or double quotes?