1
SPECIFICATION OF
TOKENS
2
Strings and Languages
• Regular Expressions are an important notation for specifying patterns.
• Alphabet – any finite set of symbols
e.g. ASCII, binary alphabet, Unicode, EBCDIC, Latin-1
• String – A finite sequence of symbols drawn from an alphabet
– e.g. banana (over the ASCII alphabet)
– Length of a string => |s|
– Empty String => ε
• Other terms relating to strings: prefix; suffix; substring; proper prefix,
suffix, or substring (non-empty, not entire string); subsequence
• Language – A set of strings over a fixed alphabet
3
Languages
• A language, L, is simply any set of strings over a
fixed alphabet.
Alphabet                          Languages
{0,1}                             {0,10,100,1000,100000…}
                                  {0,1,00,11,000,111,…}
{a,b,c}                           {abc,aabbcc,aaabbbccc,…}
{A,…,Z}                           {FOR,WHILE,GOTO,…}
{A,…,Z,a,…,z,0,…,9,+,-,…,<,>,…}   {All legal PASCAL programs}
Special Languages:
∅ - the EMPTY LANGUAGE (contains no strings)
{ε} - contains the empty string ε only
4
String operations
• Given String: banana
• Prefix : ban, banana
• Suffix : ana, banana
• Substring : nan, ban, ana, banana
• Subsequence: bnan, nn
• Proper prefix and suffix: non-empty and not the entire string, e.g. ban and ana, but not banana
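The terms on this slide can be made concrete with a short Python sketch. The function names here are illustrative and not part of the slides; `proper` follows the deck's convention that a proper part is non-empty and not the entire string.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]


def substrings(s):
    """All substrings of s (obtained by deleting a prefix and a suffix)."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def is_subsequence(t, s):
    """True if t can be obtained from s by deleting zero or more symbols."""
    remaining = iter(s)
    # 'ch in remaining' advances the iterator, so order is respected.
    return all(ch in remaining for ch in t)


def proper(parts, s):
    """Keep only the parts that are non-empty and differ from s."""
    return [p for p in parts if p and p != s]
```

For example, `is_subsequence("bnan", "banana")` is true, while `"bnan"` is not in `substrings("banana")` because its symbols are not contiguous in banana.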
5
String Operations
• Concatenation
– xy; sε = εs = s; ε is the identity for concatenation
– s⁰ = ε; for i > 0, sⁱ = sⁱ⁻¹s
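As a minimal sketch, the recursive definition of string exponentiation can be written directly in Python, where the empty string `""` plays the role of ε. The name `power` is an assumption made for this example; Python's built-in `s * i` gives the same result.

```python
def power(s, i):
    """s to the power 0 is the empty string; for i > 0 it is power(s, i - 1) followed by s."""
    if i == 0:
        return ""
    return power(s, i - 1) + s
```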
6
Operations on Languages
OPERATION                              DEFINITION
union of L and M, written L ∪ M        L ∪ M = {s | s is in L or s is in M}
concatenation of L and M, written LM   LM = {st | s is in L and t is in M}
Kleene closure of L, written L*        L* = ⋃ Lⁱ for i = 0 to ∞
                                       L* denotes “zero or more concatenations of” L
positive closure of L, written L+      L+ = ⋃ Lⁱ for i = 1 to ∞
                                       L+ denotes “one or more concatenations of” L
Exponentiation: L⁰ = {ε}, L¹ = L, L² = LL
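For finite languages these operations map onto Python sets. The sketch below is illustrative only: the function names are assumptions, and because L* and L+ are infinite in general, `closure` truncates the union at a caller-chosen `max_power`.

```python
def union(L, M):
    """Strings that are in L or in M."""
    return L | M


def concat(L, M):
    """Every string of L followed by every string of M."""
    return {s + t for s in L for t in M}


def power(L, i):
    """L to the power 0 is {ε}; each further power concatenates one more copy of L."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result


def closure(L, max_power, positive=False):
    """Union of the powers of L from 0 (or 1 if positive) up to max_power.

    This is only a finite approximation of L* or L+.
    """
    start = 1 if positive else 0
    result = set()
    for i in range(start, max_power + 1):
        result |= power(L, i)
    return result
```

With `L = {"a"}`, `closure(L, 3)` gives `{"", "a", "aa", "aaa"}`, and passing `positive=True` drops the empty string.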
7
Operations on Languages
• Let L be the set of letters and D the set of digits
• L ∪ D is the set of letters and digits
• LD is the set of strings consisting of a letter followed by a digit
• L⁴ is the set of all four-letter strings
• L* is the set of all strings of letters, including ε
• D+ is the set of strings of one or more digits.
8
Say What?
L = {A, B, C, D}   D = {1, 2, 3}
• L ∪ D
{A, B, C, D, 1, 2, 3}
• LD
{A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
• L²
{AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD}
• L*
{All possible strings over L, plus ε}
• L+
L* - {ε}
• L (L ∪ D)
Valid: {A1, AA, B3, CD}  Invalid: {321, 4A2, AA2}
• L (L ∪ D)*
Valid: {A, A1, A23, D3, DA3, …}  Invalid: {31}
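Membership in the last two languages can be checked mechanically. The following sketch uses the slide's L = {A, B, C, D} and D = {1, 2, 3}; the helper names are hypothetical and chosen for this example only.

```python
L = {"A", "B", "C", "D"}
D = {"1", "2", "3"}

# Finite languages can be enumerated directly.
LD = {x + y for x in L for y in D}   # a letter followed by a digit
L2 = {x + y for x in L for y in L}   # two letters


def in_L_LuD(s):
    """s is in L(L ∪ D): exactly two symbols, a letter then a letter or digit."""
    return len(s) == 2 and s[0] in L and s[1] in (L | D)


def in_L_LuD_star(s):
    """s is in L(L ∪ D)*: a letter followed by zero or more letters or digits."""
    return len(s) >= 1 and s[0] in L and all(c in (L | D) for c in s[1:])
```

Note that only symbols drawn from L and D count: a string containing 4 or 5 is rejected by both checks, because those digits are not in D.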
9
Regular Expressions
• A regular expression is a set of rules for constructing sequences of symbols (strings) from an alphabet.
• Let Σ be an alphabet and r a regular expression over Σ. Then L(r) is the language characterized by the rules of r.
10
Regular Expressions
• Defined over an alphabet Σ
• ε represents {ε}, the set containing the empty string
• If a is a symbol in Σ, then a is a regular expression
denoting {a}, the set containing the string a
• If r and s are regular expressions denoting the
languages L(r) and L(s), then:
– (r)|(s) is a regular expression denoting L(r) ∪ L(s)
– (r)(s) is a regular expression denoting L(r)L(s)
– (r)* is a regular expression denoting (L(r))*
– (r) is a regular expression denoting L(r)
• Precedence: * (left associative), then concatenation (left
associative), then | (left associative)
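The inductive definition above translates almost line for line into a small interpreter. This is a sketch under stated assumptions: the class names (`Eps`, `Sym`, `Alt`, `Cat`, `Star`) and the `lang` function are invented for illustration, and since L(r) may be infinite, `lang` returns only the strings of length at most `max_len`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Eps:            # ε, denoting {ε}
    pass


@dataclass(frozen=True)
class Sym:            # a single symbol a in Σ, denoting {a}
    ch: str


@dataclass(frozen=True)
class Alt:            # (r)|(s), denoting L(r) ∪ L(s)
    r: object
    s: object


@dataclass(frozen=True)
class Cat:            # (r)(s), denoting L(r)L(s)
    r: object
    s: object


@dataclass(frozen=True)
class Star:           # (r)*, denoting (L(r))*
    r: object


def lang(r, max_len):
    """Strings of L(r) whose length is at most max_len."""
    if isinstance(r, Eps):
        return {""}
    if isinstance(r, Sym):
        return {r.ch} if max_len >= 1 else set()
    if isinstance(r, Alt):
        return lang(r.r, max_len) | lang(r.s, max_len)
    if isinstance(r, Cat):
        return {x + y
                for x in lang(r.r, max_len)
                for y in lang(r.s, max_len)
                if len(x + y) <= max_len}
    if isinstance(r, Star):
        base = lang(r.r, max_len) - {""}   # ε adds nothing new under *
        result = {""}
        frontier = {""}
        while frontier:                    # ends: strings only grow longer
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")
```

For instance, `lang(Star(Sym("a")), 3)` yields `{"", "a", "aa", "aaa"}`. Precedence and associativity do not appear here because the tree structure already fixes the grouping; they matter only when parsing the textual notation.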
11
Regular Expressions
Alphabet = {a, b}
1. a|b denotes {a, b}
2. (a|b)(a|b) denotes {ab, aa, ba, bb}
3. a* denotes {ε, a, aa, …}
4. (a|b)* denotes all strings of a’s and b’s, including the empty string ε
5. a|a*b denotes the string a, together with all strings of zero or more a’s followed by a b
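These examples can be tried out with Python's `re` module, whose `|`, `*`, and parentheses behave like the operators defined here. This is a rough sketch: `matches` is a helper name assumed for this example, and `re` also supports many extensions that go beyond regular expressions in the formal sense.

```python
import re


def matches(pattern, s):
    """True if the whole string s is in the language denoted by pattern."""
    return re.fullmatch(pattern, s) is not None


examples = {
    "a|b": ["a", "b"],
    "(a|b)(a|b)": ["ab", "aa", "ba", "bb"],
    "a*": ["", "a", "aa"],
    "(a|b)*": ["", "abba"],
    "a|a*b": ["a", "b", "aab"],
}

for pattern, strings in examples.items():
    for s in strings:
        print(repr(pattern), repr(s), matches(pattern, s))
```

`re.fullmatch` is used instead of `re.search` because membership in L(r) concerns the entire string, not a substring of it.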
12
Algebraic Properties of Regular
Expressions
AXIOM                        DESCRIPTION
r | s = s | r                | is commutative
r | (s | t) = (r | s) | t    | is associative
(r s) t = r (s t)            concatenation is associative
r (s | t) = r s | r t        concatenation distributes over |
(s | t) r = s r | t r
εr = r                       ε is the identity element for concatenation
rε = r
r* = (r | ε)*                relation between * and ε
r** = r*                     * is idempotent
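The laws can be spot-checked by comparing the strings two expressions accept, up to a bounded length. The sketch below uses Python's `re` module with concrete symbols standing in for r, s and t; the helper names and the bound are assumptions. Agreement up to a bound is evidence, not a proof of equivalence.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """Strings over alphabet, up to max_len symbols, matched in full by pattern."""
    found = set()
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            s = "".join(letters)
            if re.fullmatch(pattern, s):
                found.add(s)
    return found


def same_up_to(p, q, alphabet="abc", max_len=4):
    """True if p and q accept exactly the same strings up to max_len symbols."""
    return language(p, alphabet, max_len) == language(q, alphabet, max_len)


checks = [
    ("a|b", "b|a"),          # | is commutative
    ("a|(b|c)", "(a|b)|c"),  # | is associative
    ("a(b|c)", "ab|ac"),     # concatenation distributes over |
    ("(b|c)a", "ba|ca"),
    ("a*", "(a|)*"),         # relation between * and ε (empty alternative)
    ("(a*)*", "a*"),         # * is idempotent
]

for p, q in checks:
    print(p, "=", q, same_up_to(p, q))
```

The empty alternative in `(a|)` is how ε is written in this sketch, since Python's pattern syntax has no symbol for it.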
13
Regular Definitions
• Names may be given to regular expressions; these names can then be used like symbols
• Let Σ be an alphabet of basic symbols. A regular definition is a sequence of definitions of the form
d₁ → r₁
d₂ → r₂
. . .
dₙ → rₙ
where each dᵢ is a distinct name, and each rᵢ is a regular expression over the symbols in Σ ∪ {d₁, d₂, …, dᵢ₋₁}
14
Regular Definitions
• Example 1:
– letter → A|B|…|Z|a|b|…|z
– digit → 0|1|…|9
– id → letter (letter | digit)*
• Example 2:
– digit → 0 | 1 | 2 | … | 9
– digits → digit digit*
– optional_fraction → . digits | ε
– optional_exponent → ( E ( + | - | ε ) digits ) | ε
– num → digits optional_fraction optional_exponent
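Because each definition may use only earlier names, a regular definition can be expanded into a single expression by substituting in order. The sketch below does this for Example 1 with Python's `re` module. The `{name}` placeholder syntax and the `expand` helper are assumptions made for this illustration, and the character classes `[A-Za-z]` and `[0-9]` stand in for the long alternations.

```python
import re

# Ordered list: each body may refer only to names defined before it.
definitions = [
    ("letter", "[A-Za-z]"),
    ("digit", "[0-9]"),
    ("id", "{letter}({letter}|{digit})*"),
]


def expand(definitions):
    """Substitute earlier names into later bodies, in definition order."""
    expanded = {}
    for name, body in definitions:
        for earlier, pattern in expanded.items():
            # Parenthesize so the substituted expression stays one unit.
            body = body.replace("{" + earlier + "}", "(" + pattern + ")")
        expanded[name] = body
    return expanded


patterns = expand(definitions)
print(patterns["id"])
print(bool(re.fullmatch(patterns["id"], "count1")))
```

Parenthesizing each substituted body matters: without it, a name defined as an alternation could bind incorrectly once placed next to other symbols.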
15
Regular Definitions
• Shorthand
– One or more instances: r+ denotes rr*
– Zero or one Instance: r? denotes r|ε
– Character classes: [a-z] denotes a|b|…|z
16
Example
• digit → 0 | 1 | 2 | … | 9
• digits → digit+
• optional_fraction → ( . digits )?
• optional_exponent → ( E ( + | - )? digits )?
• num → digits optional_fraction optional_exponent
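The shorthand form carries over to Python's `re` syntax nearly unchanged. This is one possible transcription, not a definitive one: the variable names mirror the definition names, `[0-9]` replaces the digit alternation, `is_num` is a helper assumed for the example, and the literal `.` and `+` must be escaped because they are operators in Python patterns.

```python
import re

# Each pattern is built from the ones defined before it.
digit = r"[0-9]"
digits = rf"{digit}+"
optional_fraction = rf"(\.{digits})?"
optional_exponent = rf"(E(\+|-)?{digits})?"
num = rf"{digits}{optional_fraction}{optional_exponent}"


def is_num(s):
    """True if the whole string s matches the num definition."""
    return re.fullmatch(num, s) is not None


for s in ["42", "3.14", "6.02E23", "1E-9", "1.", ".5"]:
    print(s, is_num(s))
```

Under this definition a fraction needs at least one digit after the point and a number needs at least one digit before it, so `1.` and `.5` are rejected.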
17
Limitations of Regular
Expressions
• Some languages cannot be described by any regular
expression
• Cannot describe balanced or nested constructs
– Example, all valid strings of balanced parentheses
– This can be done with CFG
• Cannot describe repeated strings
– Example: {wcw | w is a string of a’s and b’s}
– This language cannot be described by a CFG either
• Can denote only a fixed number or an unspecified number of repetitions of a given construct.
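To see why balanced parentheses fall outside regular expressions, consider what a recognizer has to remember. The sketch below (the function name is illustrative) keeps a counter of currently open parentheses. That counter can grow without bound, which is exactly the kind of unbounded memory a regular expression cannot express.

```python
def balanced(s):
    """True if s consists only of '(' and ')' and the parentheses are balanced."""
    depth = 0                      # number of currently open parentheses
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ')' with no matching '('
                return False
        else:
            return False           # symbol outside the alphabet {(, )}
    return depth == 0              # every '(' must have been closed
```

A regular expression can handle nesting up to any fixed depth chosen in advance, but no single expression works for every depth.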
