Challenge the future
Delft University of Technology
Course IN4303
Compiler Construction
Eduardo Souza, Guido Wachsmuth, Eelco Visser
Lexical Analysis
Overview
today’s lecture
Lexical analysis
Regular languages
• regular grammars
• regular expressions
• finite state automata
Equivalence of formalisms
• constructive approach
Tool generation
I. Regular Grammars
Recap: A Theory of Language
formal languages
vocabulary Σ
finite, nonempty set of elements (words, letters)
alphabet
string over Σ
finite sequence of elements chosen from Σ
word, sentence, utterance
formal language λ
set of strings over a vocabulary Σ
λ ⊆ Σ*
Recap: A Theory of Language
formal grammars
formal grammar G = (N, Σ, P, S)
nonterminal symbols N
terminal symbols Σ
production rules P ⊆ (N∪Σ)* N (N∪Σ)* × (N∪Σ)*
• left-hand side: a nonterminal symbol from N, surrounded by context from (N∪Σ)*
• right-hand side: the replacement from (N∪Σ)*
start symbol S∈N
grammar classes
type-0, unrestricted
type-1, context-sensitive: (a A c, a b c)
type-2, context-free: P ⊆ N × (N∪Σ)*
type-3, regular: (A, x) or (A, xB)
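The type-3 restriction can be checked mechanically. The following is a minimal sketch, assuming productions are encoded as (left, right) pairs of symbol tuples. The function name `is_right_regular`, the encoding, and both sample grammars are illustrative assumptions, not part of the lecture. It accepts only the two rule shapes listed above, (A, x) and (A, xB).

```python
def is_right_regular(productions, nonterminals):
    """Check that every rule has the form (A, x) or (A, xB).

    Assumed encoding: each production is (lhs, rhs), where lhs and rhs
    are tuples of symbols; any symbol not in `nonterminals` is a terminal.
    """
    for lhs, rhs in productions:
        # The left-hand side must be a single nonterminal.
        if len(lhs) != 1 or lhs[0] not in nonterminals:
            return False
        # (A, x): exactly one terminal on the right-hand side.
        if len(rhs) == 1 and rhs[0] not in nonterminals:
            continue
        # (A, xB): one terminal followed by one nonterminal.
        if len(rhs) == 2 and rhs[0] not in nonterminals and rhs[1] in nonterminals:
            continue
        return False
    return True


# Hypothetical two-digit number grammar: Num -> "0" Num | "1" Num | "0" | "1"
num_rules = [(("Num",), ("0", "Num")), (("Num",), ("1", "Num")),
             (("Num",), ("0",)), (("Num",), ("1",))]
# Hypothetical context-free rule that is not regular: S -> "(" S ")"
paren_rules = [(("S",), ("(", "S", ")"))]

print(is_right_regular(num_rules, {"Num"}))
print(is_right_regular(paren_rules, {"S"}))
```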
Decimal Numbers
right regular grammar
Num → “0” Num
Num → “1” Num
Num → “2” Num
Num → “3” Num
Num → “4” Num
Num → “5” Num
Num → “6” Num
Num → “7” Num
Num → “8” Num
Num → “9” Num
Num → “0”
Num → “1”
Num → “2”
Num → “3”
Num → “4”
Num → “5”
Num → “6”
Num → “7”
Num → “8”
Num → “9”
Identifiers
right regular grammar
Id → “a” R
…
Id → “z” R
R → “a” R
…
R → “z” R
R → “0” R
…
R → “9” R
Id → “a”
…
Id → “z”
R → “a”
…
R → “z”
R → “0”
…
R → “9”
Recap: A Theory of Language
formal languages
formal grammar G = (N, Σ, P, S)
derivation relation ⇒_G ⊆ (N∪Σ)* × (N∪Σ)*
w ⇒_G w’ ⇔ ∃(p, q)∈P: ∃u,v∈(N∪Σ)*: w = u p v ∧ w’ = u q v
formal language L(G) ⊆ Σ*
L(G) = {w∈Σ* | S ⇒_G* w}
classes of formal languages
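The derivation relation and L(G) can be explored by brute force for small grammars. The sketch below assumes sentential forms and both sides of a rule are tuples of symbols. The names `derive_step` and `language_up_to`, the length bound, and the two-digit grammar are hypothetical choices made for illustration.

```python
from collections import deque


def derive_step(w, productions):
    """Return all w' with w => w', i.e. w = u p v and w' = u q v for (p, q) in P."""
    results = set()
    for p, q in productions:
        for i in range(len(w) - len(p) + 1):
            if w[i:i + len(p)] == p:
                results.add(w[:i] + q + w[i + len(p):])
    return results


def language_up_to(productions, start, terminals, max_len):
    """Enumerate the words of L(G) with at most max_len symbols.

    Assumption: no rule shrinks a sentential form, so forms longer than
    max_len can be pruned safely (true for rules of the form A -> x, A -> xB).
    """
    seen = {(start,)}
    queue = deque(seen)
    words = set()
    while queue:
        w = queue.popleft()
        if all(s in terminals for s in w):
            words.add("".join(w))
            continue
        for w2 in derive_step(w, productions):
            if len(w2) <= max_len and w2 not in seen:
                seen.add(w2)
                queue.append(w2)
    return words


# Hypothetical grammar: B -> "0" B | "1" B | "0" | "1"
rules = [(("B",), ("0", "B")), (("B",), ("1", "B")),
         (("B",), ("0",)), (("B",), ("1",))]
print(sorted(language_up_to(rules, "B", {"0", "1"}, 2)))
```

The search is breadth-first over sentential forms, so it only terminates because of the length bound; it is a way to inspect a language, not a recognizer.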
Example
G:
S → a
S → aA
A → aB
B → aC
C → aD
D → aS
Is G regular?
L(G) = ?
II. Regular Expressions
Recap: Regular Expressions
overview
basics
• symbol from an alphabet
• ε
combinators
• alternation: E1 | E2
• concatenation: E1 E2
• repetition: E*
• optional: E? = E | ε
• one or more: E+ = E E*
Decimal Numbers & Identifiers
regular expressions
Num: (0|1|2|3|4|5|6|7|8|9)+
Id: (a|…|z)(a|…|z|0|…|9)*
Regular Expressions
formal languages
basics
• L(a) = {“a”}
• L(ε) = {“”}
combinators
• L(E1 | E2) = L(E1) ∪ L(E2)
• L(E1 E2) = L(E1) · L(E2)
• L(E*) = L(E)*
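The language equations above translate almost directly into a recursive function. This is a sketch under stated assumptions: regular expressions are encoded as nested tuples, symbols are single characters, and the result is cut off at a maximum word length so that L(E*) stays finite. The function name `lang` and the encoding are hypothetical.

```python
def lang(e, max_len):
    """Words of L(e) with length <= max_len, for a regex given as a nested tuple.

    Assumed encoding: ("sym", "a"), ("eps",), ("alt", e1, e2),
    ("cat", e1, e2), ("star", e1).
    """
    kind = e[0]
    if kind == "sym":                      # L(a) = {"a"}
        return {e[1]} if max_len >= 1 else set()
    if kind == "eps":                      # L(ε) = {""}
        return {""}
    if kind == "alt":                      # L(E1 | E2) = L(E1) ∪ L(E2)
        return lang(e[1], max_len) | lang(e[2], max_len)
    if kind == "cat":                      # L(E1 E2) = L(E1) · L(E2)
        left, right = lang(e[1], max_len), lang(e[2], max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if kind == "star":                     # L(E*) = L(E)*
        base = lang(e[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:
            # Append one more factor to the words found in the last round.
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown regex node: {kind}")


a_or_b = ("alt", ("sym", "a"), ("sym", "b"))
print(sorted(lang(("star", a_or_b), 2)))
```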
Decimal Numbers & Identifiers
regular expressions
Num: (0|1|2|3|4|5|6|7|8|9)+
A valid Num is any word that consists of a sequence of one or more digits.
Id: (a|…|z)(a|…|z|0|…|9)*
A valid Id is any word that starts with a lowercase letter, followed by a sequence of zero or more lowercase letters or digits.
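Both definitions can be tried out with Python's standard `re` module. In this sketch the character classes `[0-9]` and `[a-z]` stand in for the alternations 0|…|9 and a|…|z, and `fullmatch` is used because a valid word must match as a whole. The helper names `is_num` and `is_id` are hypothetical.

```python
import re

# Character classes [0-9] and [a-z] abbreviate the alternations 0|…|9 and a|…|z.
NUM = re.compile(r"[0-9]+")
ID = re.compile(r"[a-z][a-z0-9]*")


def is_num(word):
    """True if the whole word is a sequence of one or more digits."""
    return NUM.fullmatch(word) is not None


def is_id(word):
    """True if the word is a lowercase letter followed by letters or digits."""
    return ID.fullmatch(word) is not None


print(is_num("2024"), is_num("12a"), is_id("x1"), is_id("1x"))
```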
Example
Gr:
S → ('A' | … | 'Z')*(0 | … | 9)*
L(Gr) = the set of words that start with zero or more capital letters, followed by zero or more decimal digits.
III. Finite Automata
Finite Automata
formal definition
finite automaton M = (Q, Σ, T, q0, F)
states Q
input symbols Σ
transition function T
start state q0∈Q
final states F⊆Q
transition function
• nondeterministic FA: T : Q × Σ → P(Q)
• NFA with ε-moves: T : Q × (Σ ∪ {ε}) → P(Q)
• deterministic FA: T : Q × Σ → Q
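Decimal Numbers & Identifiers
finite automata
[Diagram: automaton for decimal numbers with states 1 and 2; edges labeled 0-9 and 0-9]
[Diagram: automaton for identifiers with states 1 and 2; edges labeled a-z, a-z, and 0-9]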
Nondeterministic Finite Automata
formal languages
finite automaton M = (Q, Σ, T, q0, F)
transition function T : Q × Σ → P(Q)
T({q1,..., qn}, x) := T(q1, x) ∪ … ∪ T(qn, x)
T*({q1,..., qn}, ε) := {q1,..., qn}
T*({q1,..., qn}, xw) := T*(T({q1,..., qn}, x), w)
formal language L(M) ⊆ Σ*
L(M) = {w∈Σ* | T*({q0}, w) ∩ F ≠ ∅}
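The extended transition function T* amounts to tracking a set of current states. Below is a minimal sketch, assuming T is stored as a dictionary from (state, symbol) to a set of states, with missing entries meaning the empty set. The sample automaton (a keyword path i, f next to an identifier path) and its final states {2, 4} are an assumed example, not taken verbatim from the slides.

```python
def nfa_step(T, states, x):
    """T({q1,...,qn}, x): union of the successor sets of all current states."""
    result = set()
    for q in states:
        result |= T.get((q, x), set())
    return result


def nfa_accepts(T, q0, F, w):
    """w is in L(M) iff T*({q0}, w) shares a state with F."""
    states = {q0}
    for x in w:
        states = nfa_step(T, states, x)
    return bool(states & F)


# Assumed NFA: 1 -i-> 3 -f-> 4, and 1 -letter-> 2 with loops on letters and digits.
LETTERS = "abcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"
T = {}
for c in LETTERS:
    T.setdefault((1, c), set()).add(2)
    T.setdefault((2, c), set()).add(2)
for d in DIGITS:
    T.setdefault((2, d), set()).add(2)
T.setdefault((1, "i"), set()).add(3)
T.setdefault((3, "f"), set()).add(4)
F = {2, 4}

print(nfa_accepts(T, 1, F, "if"), nfa_accepts(T, 1, F, "9x"))
```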
Nondeterministic Finite Automata
formal languages
[Diagram: NFA with states 1, 2, 3, and 4; edges labeled i, f, a-z, a-z, and 0-9]
Deterministic Finite Automata
formal languages
finite automaton M = (Q, Σ, T, q0, F)
transition function T : Q × Σ → Q
T*(q, ε) := q
T*(q, xw) := T*(T(q, x), w)
formal language L(M) ⊆ Σ*
L(M) = {w∈Σ* | T*(q0, w) ∈ F}
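For a DFA, T* reduces to following a single state through a table. The sketch below assumes T is a dictionary from (state, symbol) to one state. Treating a missing entry as a rejecting dead state is an assumption made here to keep the table small, and the number automaton is an assumed example.

```python
def dfa_accepts(T, q0, F, w):
    """w is in L(M) iff T*(q0, w) is a final state."""
    q = q0
    for x in w:
        if (q, x) not in T:
            # Assumption: a missing entry stands for a non-final dead state,
            # so the partial table still describes a total function.
            return False
        q = T[(q, x)]
    return q in F


# Assumed DFA for decimal numbers: 1 -digit-> 2, 2 -digit-> 2, final state 2.
DIGITS = "0123456789"
NUM_T = {(q, d): 2 for q in (1, 2) for d in DIGITS}

print(dfa_accepts(NUM_T, 1, {2}, "42"), dfa_accepts(NUM_T, 1, {2}, "4a"))
```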
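Deterministic Finite Automata
formal languages
[Diagram: DFA for decimal numbers with states 1 and 2; edges labeled 0-9 and 0-9]
[Diagram: DFA for identifiers with states 1 and 2; edges labeled a-z, a-z, and 0-9]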
IV. Equivalence
Regular Languages
formalisms
right regular grammars
left regular grammars
regular expressions
NFAs
DFAs
NFAs with ε-moves
NFA construction
right regular grammar
formal grammar G = (N, Σ, P, S)
finite automaton M = (N ∪ {f}, Σ, T, S, F)
transition function T
(X, aY)∈P : (X, a, Y)∈T
(X, a)∈P : (X, a, f)∈T
final states F
(S, ε)∈P : F = {S, f}
else: F = {f}
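The construction can be written down in a few lines. This is a sketch, assuming each production is a pair (X, rhs) with rhs a tuple: (a, Y) for X → aY, (a,) for X → a, and () for X → ε. The function name `grammar_to_nfa`, the encoding, and the two-digit grammar are hypothetical.

```python
def grammar_to_nfa(nonterminals, productions, start):
    """Build M = (N ∪ {f}, Σ, T, S, F) from a right regular grammar.

    Assumed encoding: productions are (X, rhs) pairs where rhs is a tuple:
    (a, Y) for X → aY, (a,) for X → a, and () for X → ε.
    """
    f = "f"  # fresh final state; assumed not to clash with a nonterminal name
    assert f not in nonterminals
    T = {}
    for X, rhs in productions:
        if len(rhs) == 2:        # (X, aY) ∈ P gives (X, a, Y) ∈ T
            a, Y = rhs
            T.setdefault((X, a), set()).add(Y)
        elif len(rhs) == 1:      # (X, a) ∈ P gives (X, a, f) ∈ T
            T.setdefault((X, rhs[0]), set()).add(f)
    # (S, ε) ∈ P makes the start state final as well.
    F = {start, f} if (start, ()) in productions else {f}
    return T, start, F


# Hypothetical two-digit number grammar: Num → "0" Num | "1" Num | "0" | "1"
rules = [("Num", ("0", "Num")), ("Num", ("1", "Num")),
         ("Num", ("0",)), ("Num", ("1",))]
T, q0, F = grammar_to_nfa({"Num"}, rules, "Num")
print(T[("Num", "0")], q0, F)
```

The result is nondeterministic as soon as a nonterminal has both X → aY and X → a, which is the case for every digit here.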
NFA construction
Example
Num → “0” Num
…
Num → “9” Num
Num → “0”
…
Num → “9”
[Diagram: NFA with states N and f; edges labeled 0-9 and 0-9]
NFA construction
Example
G:
S → a
S → aA
A → aB
B → aC
C → aD
D → aS
NFA construction
regular expressions
[Diagrams: NFA fragments for a, (x), and x*; edges labeled a, x, and ε]
NFA construction
regular expressions
[Diagrams: NFA fragments for x|y and xy; edges labeled x, y, and ε]
NFA construction
regular expressions
Gr:
S → ('A' | … | 'Z')*(0 | … | 9)*
NFA construction
ε elimination
additional final states
• states with ε-moves into final states become final states themselves
additional transitions
• for each ε-move from a source state to a target state, add the transitions from the target state to the source state
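Both rules can be applied in one pass once the ε-reachable states of each state are known. The sketch below computes that ε-closure first, so that chains of ε-moves are covered as well. This goes slightly beyond the single-step wording above and is an assumption of the sketch. The names `epsilon_closure` and `eliminate_epsilon`, the dictionary encoding, and the sample automaton are hypothetical.

```python
def epsilon_closure(eps, q):
    """All states reachable from q using only ε-moves (including q itself)."""
    closure, stack = {q}, [q]
    while stack:
        for r in eps.get(stack.pop(), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure


def eliminate_epsilon(states, T, eps, F):
    """Return (T2, F2) for an equivalent NFA without ε-moves.

    Assumed encoding: T maps (state, symbol) to a set of states,
    eps maps a state to the set of its ε-successors.
    """
    T2, F2 = {}, set()
    for q in states:
        closure = epsilon_closure(eps, q)
        # A state that reaches a final state via ε-moves becomes final.
        if closure & F:
            F2.add(q)
        # Copy the transitions of every ε-reachable target to the source q.
        for (p, x), targets in T.items():
            if p in closure:
                T2.setdefault((q, x), set()).update(targets)
    return T2, F2


# Hypothetical ε-NFA for a*b*: 0 loops on "a", 0 -ε-> 1, 1 loops on "b".
T = {(0, "a"): {0}, (1, "b"): {1}}
eps = {0: {1}}
T2, F2 = eliminate_epsilon({0, 1}, T, eps, {1})
print(T2, F2)
```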
NFA construction
ε elimination
Gr:
S → ('A' | … | 'Z')*(0 | … | 9)*
Powerset construction
eliminating nondeterminism
nondeterministic finite automaton M = (Q, Σ, T, q0, F)
deterministic finite automaton M’ = (P(Q), Σ, T’, {q0}, F’)
transition function T’
• T’({q1,..., qn}, x) = T({q1,..., qn}, x) = T(q1, x) ∪ … ∪ T(qn, x)
final states F’ = {S | S⊆Q, S∩F ≠ ∅}
• all states that include a final state of the original NFA
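A practical variant of this construction builds only the subsets that are reachable from {q0} instead of all of P(Q). The sketch below takes that shortcut, which is an assumption of the sketch rather than the definition above. DFA states are represented as frozensets, the empty subset is left implicit as a dead state, and the small NFA over the alphabet {i, f, x} is a hypothetical example.

```python
def powerset_construction(T, q0, F, alphabet):
    """Build a DFA from an ε-free NFA, generating only reachable subsets.

    DFA states are frozensets of NFA states; T maps (state, symbol) to a set.
    """
    start = frozenset({q0})
    dfa_T, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for x in alphabet:
            # T'(S, x) = T(q1, x) ∪ … ∪ T(qn, x)
            target = frozenset(r for q in S for r in T.get((q, x), set()))
            if not target:
                continue  # the empty set is a dead state; left implicit here
            dfa_T[(S, x)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # F' contains every subset that includes a final state of the NFA.
    dfa_F = {S for S in seen if S & F}
    return dfa_T, start, dfa_F


# Hypothetical NFA over {i, f, x}: 1 -i-> 3 -f-> 4 and 1 -letter-> 2 -letter-> 2.
T = {(1, "i"): {2, 3}, (1, "f"): {2}, (1, "x"): {2},
     (2, "i"): {2}, (2, "f"): {2}, (2, "x"): {2},
     (3, "f"): {4}}
dfa_T, start, dfa_F = powerset_construction(T, 1, {2, 4}, "ifx")
print(len({S for S, _ in dfa_T}), "reachable DFA states")
```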
Powerset construction
example
[Diagram: NFA with states 1, 2, 3, and 4; edges labeled i, f, a-z, a-z, and 0-9]
[Diagram: resulting DFA with states 1, 2, 23, and 24; edges labeled i, f, a-h, j-z, a-e,g-z, a-z, and 0-9]
V. Summary
Summary
lessons learned
What are the formalisms to describe regular languages?
• regular grammars
• regular expressions
• finite state automata
Why are these formalisms equivalent?
• constructive proofs
How can we generate compiler tools from that?
• implement DFAs
Literature
learn more
Formal languages
Noam Chomsky: Three models for the description of language. 1956
J. E. Hopcroft, R. Motwani, J. D. Ullman: Introduction to Automata Theory, Languages, and Computation. 2006
Lexical analysis
Andrew W. Appel, Jens Palsberg: Modern Compiler Implementation in Java, 2nd edition. 2002
Copyrights
Pictures
Slide 1: Book Scanner by Ben Woosley, some rights reserved
Slides 4, 5, 8: Noam Chomsky by Fellowsisters, some rights reserved
Slides 6, 7, 11, 15: Tiger by Bernard Landgraf, some rights reserved
Slide 18: Coffee Primo by Dominica Williamson, some rights reserved
Slide 32: Pine Creek by Nicholas, some rights reserved