Specification of Tokens
• Definitions:
• The ALPHABET (often written ∑) is the set of legal input symbols
• A STRING over some alphabet ∑ is a finite sequence of symbols
from ∑
• The LENGTH of string s is written |s|
• The EMPTY STRING is a special 0-length string denoted ε
• REGULAR EXPRESSIONS (REs) are the most
common notation for pattern specification.
• Every pattern specifies a set of strings, so an RE
names a set of strings.
More definitions: strings and
substrings
• A PREFIX of s is formed by removing 0 or more
trailing symbols of s
• A SUFFIX of s is formed by removing 0 or more
leading symbols of s
• A SUBSTRING of s is formed by deleting a
prefix and a suffix from s
• A PROPER prefix, suffix, or substring is a
nonempty string x that is, respectively, a prefix,
suffix, or substring of s but with x ≠ s.
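A minimal Python sketch of these definitions, assuming strings are ordinary Python str values and ε is the empty string "". The helper names (prefixes, suffixes, substrings, proper) are made up for illustration.

```python
def prefixes(s):
    """All prefixes of s: remove 0 or more trailing symbols."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s):
    """All suffixes of s: remove 0 or more leading symbols."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s):
    """All substrings of s: delete a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(candidates, s):
    """Keep only the nonempty candidates x with x != s."""
    return {x for x in candidates if x != "" and x != s}


print(sorted(prefixes("ban")))                 # ['', 'b', 'ba', 'ban']
print(sorted(proper(prefixes("ban"), "ban")))  # ['b', 'ba']
```

Note that every prefix and every suffix of s is also a substring of s, and ε is a prefix, suffix, and substring of every string.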
More definitions
• A LANGUAGE is a set of strings over a fixed
alphabet ∑.
• Example languages:
– Ø (the empty set)
– { ε }
– { a, aa, aaa, aaaa }
• The CONCATENATION of two strings x and y is
written xy
• String EXPONENTIATION is written s^i, where s^0
= ε and s^i = s^(i-1) s for i > 0.
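The recursive definition of exponentiation can be sketched directly in Python, assuming ε is the empty string "" and concatenation is the + operator on str. The function name power is hypothetical.

```python
def power(s, i):
    """Return s to the i-th power: s^0 = ε and s^i = s^(i-1) s for i > 0."""
    if i == 0:
        return ""               # ε, the 0-length string
    return power(s, i - 1) + s  # concatenate one more copy of s


print(power("ab", 3))                     # ababab
print({power("a", i) for i in range(1, 5)} == {"a", "aa", "aaa", "aaaa"})  # True
```

The second line rebuilds the example language { a, aa, aaa, aaaa } as the first four positive powers of a.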
Regular expressions
• REs let us precisely define a set of strings.
• For C identifiers, we might use
letter ( letter | digit )*
• Parentheses are for grouping, | means “OR”,
and * means zero or more instances.
• Every RE ‘r’ defines a language L(r).
Regular expressions
• Here are the rules for writing REs over an
alphabet ∑ :
1. ε is an RE denoting { ε }, the language containing
only the empty string.
2. If ‘a’ is in ∑, then a is an RE denoting { a }.
3. If r and s are REs denoting L(r) and L(s), then
   1. (r)|(s) is an RE denoting L(r) ∪ L(s)
   2. (r)(s) is an RE denoting L(r)L(s)
   3. (r)* is an RE denoting (L(r))*
   4. (r) is an RE denoting L(r)
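The language operations behind rule 3 can be sketched with Python sets of strings. This is an illustration under stated assumptions: union is the built-in | on sets, the names concat and star are hypothetical, and because (L(r))* is usually infinite, star only returns the members up to a chosen length bound.

```python
def concat(L, M):
    """L M: every string of L followed by every string of M."""
    return {x + y for x in L for y in M}


def star(L, max_len):
    """The strings of L* that have at most max_len symbols."""
    result = {""}    # ε is always in L*
    frontier = {""}
    while frontier:
        # Append one more string of L to the newest members, keep the new short ones.
        frontier = {w for w in concat(frontier, L) if len(w) <= max_len} - result
        result |= frontier
    return result


a, b = {"a"}, {"b"}
print(sorted(a | b))                 # L(a|b): ['a', 'b']
print(sorted(concat(a | b, a | b)))  # L((a|b)(a|b)): ['aa', 'ab', 'ba', 'bb']
print(sorted(star(a, 3)))            # part of L(a*): ['', 'a', 'aa', 'aaa']
```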
Additional conventions
• To avoid too many parentheses, we assume:
1. * has the highest precedence, and is left
associative.
2. Concatenation has the 2nd highest precedence,
and is left associative.
3. | has the lowest precedence and is left
associative.
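These conventions mean that a|b*c is read as (a)|((b*)c). As a quick check, here is a sketch that uses Python's re module as a stand-in for the notation. That choice is an assumption: re has many extra features, but it gives *, concatenation, and | the same relative precedence. The helper name accepts is hypothetical.

```python
import re


def accepts(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None


implicit = r"a|b*c"        # relies on the precedence conventions
explicit = r"(a)|((b*)c)"  # fully parenthesized reading
wrong = r"((a|b)*)c"       # what we would get if | bound tightest

samples = ["", "a", "b", "c", "bc", "bbc", "ac", "abc"]
print(all(accepts(implicit, s) == accepts(explicit, s) for s in samples))  # True
print(accepts(implicit, "ac"), accepts(wrong, "ac"))                       # False True
```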
Example REs
1. a | b
2. ( a | b ) ( a | b )
3. a*
4. (a | b )*
5. a | a*b
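To see which strings each example denotes, one option is to enumerate short strings over { a, b } and keep those the RE matches in full. The sketch below assumes Python's re module as the matcher and writes the REs without spaces, since re treats a space as a literal symbol. The function name language is hypothetical.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=2):
    """Strings over alphabet with at most max_len symbols that pattern matches in full."""
    words = ("".join(p) for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    matched = (w for w in words if re.fullmatch(pattern, w))
    return sorted(matched, key=lambda w: (len(w), w))


print(language("a|b"))         # ['a', 'b']
print(language("(a|b)(a|b)"))  # ['aa', 'ab', 'ba', 'bb']
print(language("a*"))          # ['', 'a', 'aa']
print(language("(a|b)*"))      # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(language("a|a*b"))       # ['a', 'b', 'ab']
```

Examples 1 and 2 denote finite languages, so the lists are complete. Examples 3 to 5 denote infinite languages, so the lists show only the members of length 2 or less.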
Equivalence of REs
Axiom                Description
r|s = s|r            | is commutative
r|(s|t) = (r|s)|t    | is associative
(rs)t = r(st)        Concatenation is associative
r(s|t) = rs|rt       Concatenation distributes over |
(s|t)r = sr|tr
εr = r               ε is the identity element for concatenation
rε = r
r* = (r|ε)*          Relation between * and ε
r** = r*             * is idempotent
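The axioms can be spot-checked by comparing two REs on every string up to some length. This is only a sanity check under stated assumptions, not a proof: Python's re module is the matcher, r, s, and t are arbitrary sample REs, and agreement is tested only for strings of at most max_len symbols over { a, b }. The name same_language is hypothetical.

```python
import re
from itertools import product


def same_language(p, q, alphabet="ab", max_len=4):
    """True if p and q accept exactly the same strings of length <= max_len."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            w = "".join(symbols)
            if bool(re.fullmatch(p, w)) != bool(re.fullmatch(q, w)):
                return False
    return True


# Sample REs, each wrapped in a non-capturing group so it can be combined safely.
r, s, t = "(?:ab)", "(?:a*)", "(?:b)"

checks = {
    "| is commutative": (f"{r}|{s}", f"{s}|{r}"),
    "| is associative": (f"{r}|(?:{s}|{t})", f"(?:{r}|{s})|{t}"),
    "concatenation is associative": (f"(?:{r}{s}){t}", f"{r}(?:{s}{t})"),
    "left distribution": (f"{r}(?:{s}|{t})", f"{r}{s}|{r}{t}"),
    "right distribution": (f"(?:{s}|{t}){r}", f"{s}{r}|{t}{r}"),
    "r* = (r|ε)*": (f"{r}*", f"(?:{r}|)*"),
    "* is idempotent": (f"(?:{r}*)*", f"{r}*"),
}
for name, (lhs, rhs) in checks.items():
    print(name, same_language(lhs, rhs))
```

An empty alternative, as in (?:ab|), plays the role of ε here. For contrast, same_language("a|b*", "(?:a|b)*") returns False because only the second RE accepts ab.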
Regular definitions
• Example for identifiers in C:
letter -> A | B | … | Z | a | b | … | z
digit -> 0 | 1 | … | 9
id -> letter ( letter | digit )*
• Example for numbers in Pascal:
digit -> 0 | 1 | … | 9
digits -> digit digit*
optional_fraction -> . digits | ε
optional_exponent -> ( E ( + | - | ε ) digits ) | ε
num -> digits optional_fraction optional_exponent
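Regular definitions translate naturally into named pattern fragments that are substituted into later ones. The sketch below assumes Python's re module. The character classes [A-Za-z] and [0-9] stand in for the long alternations, an empty alternative stands in for ε, and the variable is called ident rather than id to avoid shadowing a Python built-in.

```python
import re

# Identifiers
letter = r"[A-Za-z]"
digit = r"[0-9]"
ident = rf"{letter}(?:{letter}|{digit})*"

# Unsigned numbers
digits = rf"{digit}{digit}*"
optional_fraction = rf"(?:\.{digits}|)"           # . digits | ε
optional_exponent = rf"(?:E(?:\+|-|){digits}|)"   # ( E ( + | - | ε ) digits ) | ε
num = rf"{digits}{optional_fraction}{optional_exponent}"

for text in ["count1", "1count", "6.02E+23", "3.", "2E10"]:
    kind = ("id" if re.fullmatch(ident, text)
            else "num" if re.fullmatch(num, text)
            else "no match")
    print(text, "->", kind)
```

Each name is defined only in terms of names introduced before it, which is what lets every definition expand into a plain RE.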
