Specification-of-tokens

Specification of Tokens
• Definitions:
• The ALPHABET (often written ∑) is the set of legal input symbols
• A STRING over some alphabet ∑ is a finite sequence of symbols
from ∑
• The LENGTH of string s is written |s|
• The EMPTY STRING is a special 0-length string denoted ε
• REGULAR EXPRESSIONS (REs) are the most
common notation for pattern specification.
• Every pattern specifies a set of strings, so an RE
names a set of strings.

More definitions: strings and substrings
• A PREFIX of s is formed by removing 0 or more
trailing symbols of s
• A SUFFIX of s is formed by removing 0 or more
leading symbols of s
• A SUBSTRING of s is formed by deleting a
prefix and a suffix from s
• A PROPER prefix, suffix, or substring is a
nonempty string x that is, respectively, a prefix,
suffix, or substring of s but with x ≠ s.
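
The definitions above can be tried out directly on Python strings. This is a minimal sketch, not part of the original slides; the function names `prefixes`, `suffixes`, `substrings`, and `proper` are made up for illustration.

```python
def prefixes(s):
    """All prefixes of s: remove 0 or more trailing symbols."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s: remove 0 or more leading symbols."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """All substrings of s: delete a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def proper(xs, s):
    """Keep only the proper ones: nonempty and different from s."""
    return [x for x in xs if x != "" and x != s]

print(prefixes("ban"))                 # ['', 'b', 'ba', 'ban']
print(suffixes("ban"))                 # ['ban', 'an', 'n', '']
print(sorted(substrings("ban")))       # ['', 'a', 'an', 'b', 'ba', 'ban', 'n']
print(proper(prefixes("ban"), "ban"))  # ['b', 'ba']
```

Note that ε (the empty string) and s itself are both prefixes, suffixes, and substrings of s, but neither is proper.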

More definitions
• A LANGUAGE is a set of strings over a fixed
alphabet ∑.
• Example languages:
– Ø (the empty set)
– { ε }
– { a, aa, aaa, aaaa }
• The CONCATENATION of two strings x and y is
written xy
• String EXPONENTIATION is written sⁱ, where s⁰
= ε and sⁱ = sⁱ⁻¹s for i > 0.
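
As a rough illustration of concatenation and exponentiation, the sketch below follows the recursive definition literally. The names `EPSILON` and `power` are hypothetical helpers, and Python's `+` on strings stands in for concatenation.

```python
EPSILON = ""  # the empty string, written ε in the text

def power(s, i):
    """Return s^i using s^0 = ε and s^i = s^(i-1) s for i > 0."""
    if i == 0:
        return EPSILON
    return power(s, i - 1) + s

x, y = "ab", "c"
print(x + y)            # abc    (the concatenation xy)
print(power("ab", 3))   # ababab
print(len(power("ab", 3)) == 3 * len("ab"))  # True, since |s^i| = i * |s|

# The example language { a, aa, aaa, aaaa } is { a^i : 1 <= i <= 4 }.
print({power("a", i) for i in range(1, 5)} == {"a", "aa", "aaa", "aaaa"})  # True
```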

Regular expressions
• REs let us precisely define a set of strings.
• For C identifiers (ignoring the underscore, which C also allows), we might use
letter ( letter | digit )*
• Parentheses are for grouping, | means “OR”,
and * means zero or more instances.
• Every RE ‘r’ defines a language L(r).
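
One way to experiment with this pattern is Python's standard `re` module, whose `|`, `*`, and parentheses behave like the operators described above. The sketch below assumes that letter means an ASCII letter and digit means 0 to 9; the names `LETTER`, `DIGIT`, `IDENTIFIER`, and `is_identifier` are chosen only for illustration.

```python
import re

LETTER = "[A-Za-z]"   # assumed meaning of "letter"
DIGIT = "[0-9]"       # assumed meaning of "digit"

# letter ( letter | digit )*
IDENTIFIER = re.compile(f"{LETTER}({LETTER}|{DIGIT})*")

def is_identifier(s):
    """True if the whole string s is in L(letter (letter | digit)*)."""
    return IDENTIFIER.fullmatch(s) is not None

print(is_identifier("count"))  # True
print(is_identifier("x1"))     # True
print(is_identifier("1x"))     # False: must start with a letter
print(is_identifier(""))       # False: at least one letter is required
```

`fullmatch` is used so that the entire string must belong to L(r), rather than just some substring of it.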

Regular expressions
• Here are the rules for writing REs over an
alphabet ∑ :
1. ε is an RE denoting { ε }, the language containing
only the empty string.
2. If ‘a’ is in ∑, then a is an RE denoting { a }.
3. If r and s are REs denoting L(r) and L(s), then
   1. (r)|(s) is an RE denoting L(r) ∪ L(s)
   2. (r)(s) is an RE denoting L(r) L(s)
   3. (r)* is an RE denoting (L(r))*
   4. (r) is an RE denoting L(r)
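
Because the rules are inductive, L(r) can be computed by recursion on the structure of r. The sketch below is one possible encoding, not a standard API: REs are nested tuples with made-up tags `"eps"`, `"sym"`, `"alt"`, `"cat"`, and `"star"`, and since (L(r))* is usually infinite, the hypothetical function `lang` only returns the strings of length at most n.

```python
def lang(r, n):
    """Strings of length <= n in L(r), for r built by the rules above."""
    kind = r[0]
    if kind == "eps":        # rule 1: ε denotes { ε }
        return {""}
    if kind == "sym":        # rule 2: a denotes { a }
        return {r[1]} if len(r[1]) <= n else set()
    if kind == "alt":        # rule 3.1: (r)|(s) denotes the union
        return lang(r[1], n) | lang(r[2], n)
    if kind == "cat":        # rule 3.2: (r)(s) denotes the concatenation
        return {x + y for x in lang(r[1], n) for y in lang(r[2], n)
                if len(x + y) <= n}
    if kind == "star":       # rule 3.3: (r)* denotes the closure
        base = lang(r[1], n)
        result, frontier = {""}, {""}
        while frontier:      # keep appending until nothing new fits in n symbols
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown RE node: {kind}")

# Rule 3.4 needs no case: parentheses only group, and the tuple nesting
# already records the grouping.
a_or_b = ("alt", ("sym", "a"), ("sym", "b"))
print(sorted(lang(("star", a_or_b), 2), key=lambda x: (len(x), x)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```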

Additional conventions
• To avoid too many parentheses, we assume:
1. * has the highest precedence, and is left
associative.
2. Concatenation has the 2nd highest precedence,
and is left associative.
3. | has the lowest precedence and is left
associative.
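
Python's `re` module follows the same precedence order, so it can be used to see how an unparenthesized RE is grouped. The sketch below is only a bounded check: the hypothetical helper `language` enumerates strings over {a, b, c} up to length 4 and collects those that match.

```python
import re
from itertools import product

def language(pattern, max_len=4, alphabet="abc"):
    """All strings over alphabet, up to max_len symbols, that match pattern."""
    found = set()
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            s = "".join(symbols)
            if re.fullmatch(pattern, s):
                found.add(s)
    return found

# a|b*c is read as (a)|((b*)c): * binds tightest, then concatenation, then |.
print(language("a|b*c") == language("(a)|((b*)c)"))  # True
# Other groupings give different languages.
print(language("a|b*c") == language("(a|b)*c"))      # False
print(language("a|b*c") == language("a|(bc)*"))      # False
```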

Example REs
1. a | b
2. ( a | b ) ( a | b )
3. a*
4. (a | b )*
5. a | a*b
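
To see which languages these five REs denote, one option is to list the strings over {a, b} that each one matches. The sketch below uses Python's `re` module and a made-up helper `matches`; it only looks at strings of length at most 2, so the starred languages are shown truncated.

```python
import re
from itertools import product

def matches(pattern, max_len, alphabet="ab"):
    """Strings over alphabet, shortest first, that are in L(pattern)."""
    out = []
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            s = "".join(symbols)
            if re.fullmatch(pattern, s):
                out.append(s)
    return out

for p in ["a|b", "(a|b)(a|b)", "a*", "(a|b)*", "a|a*b"]:
    print(p, matches(p, 2))
# a|b ['a', 'b']
# (a|b)(a|b) ['aa', 'ab', 'ba', 'bb']
# a* ['', 'a', 'aa']
# (a|b)* ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
# a|a*b ['a', 'b', 'ab']
```

The last line reflects the precedence conventions: a | a*b is either a single a, or any number of a's followed by one b.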

Equivalence of REs
Axiom                             Description
r|s = s|r                         | is commutative
r|(s|t) = (r|s)|t                 | is associative
(rs)t = r(st)                     Concatenation is associative
r(s|t) = rs|rt ; (s|t)r = sr|tr   Concatenation distributes over |
εr = r ; rε = r                   ε is the identity element for concatenation
r* = (r|ε)*                       Relation between * and ε
r** = r*                          * is idempotent
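
These axioms are statements about languages, so they can be sanity-checked by treating L(r), L(s), and L(t) as Python sets. The sketch below is not a proof: it picks three small example languages, compares only strings of length at most `N`, and uses made-up helpers `cat` and `star` for concatenation and closure.

```python
N = 4  # compare languages only on strings of length <= N

def cat(A, B):
    """Concatenation of languages A and B, truncated to length N."""
    return {x + y for x in A for y in B if len(x + y) <= N}

def star(A):
    """Closure A*, truncated to length N."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = cat(frontier, A) - result
        result |= frontier
    return result

EPS = {""}                             # L(ε)
r, s, t = {"a"}, {"b", "ab"}, {"ba"}   # arbitrary example languages

assert r | s == s | r                              # | is commutative
assert r | (s | t) == (r | s) | t                  # | is associative
assert cat(cat(r, s), t) == cat(r, cat(s, t))      # concatenation is associative
assert cat(r, s | t) == cat(r, s) | cat(r, t)      # left distributivity
assert cat(s | t, r) == cat(s, r) | cat(t, r)      # right distributivity
assert cat(EPS, r) == r == cat(r, EPS)             # ε is the identity
assert star(r) == star(r | EPS)                    # r* = (r|ε)*
assert star(star(r)) == star(r)                    # * is idempotent
print("all axioms hold on this bounded example")
```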

Regular definitions
• Example for identifiers in C:
letter -> A | B | … | Z | a | b | … | z
digit -> 0 | 1 | … | 9
id -> letter ( letter | digit )*
• Example for numbers in Pascal:
digit -> 0 | 1 | … | 9
digits -> digit digit*
optional_fraction -> . digits | ε
optional_exponent -> ( E ( + | - | ε ) digits ) | ε
num -> digits optional_fraction optional_exponent
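
Regular definitions translate naturally into code by building each named pattern from the ones defined before it. The sketch below does this for the number example with Python's `re` module; the variable names mirror the definitions, while `NUM` and `is_num` are names introduced here for illustration. An empty alternative such as `(x|)` plays the role of `| ε`.

```python
import re

digit = "[0-9]"                                     # digit -> 0 | 1 | … | 9
digits = f"{digit}{digit}*"                         # digits -> digit digit*
optional_fraction = rf"(\.{digits}|)"               # . digits | ε
optional_exponent = rf"(E(\+|-|){digits}|)"         # ( E ( + | - | ε ) digits ) | ε
NUM = re.compile(f"{digits}{optional_fraction}{optional_exponent}")

def is_num(s):
    """True if the whole string s is generated by num."""
    return NUM.fullmatch(s) is not None

for text in ["42", "3.14", "6.02E23", "1E-9", "1.", ".5", "1E"]:
    print(text, is_num(text))
# 42 True
# 3.14 True
# 6.02E23 True
# 1E-9 True
# 1. False
# .5 False
# 1E False
```

The three rejected strings show that, under these definitions, a fraction or exponent must contain at least one digit and a number must start with a digit.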
