2_2Specification of Tokens.ppt

2
Strings and Languages
• Regular Expressions are an important notation for specifying patterns.
• Alphabet – any finite set of symbols
e.g. ASCII, binary alphabet, UNICODE, EBCDIC,LATIN-1
• String – A finite sequence of symbols drawn from an alphabet
– Banana (ASCII Alphabet)
– Length of a string => |s|
– Empty String => ε
• Other terms relating to strings: prefix; suffix; substring; proper prefix,
suffix, or substring (non-empty, not entire string); subsequence
• Language – A set of strings over a fixed alphabet

3
Languages
• A language, L, is simply any set of strings over a
fixed alphabet.
Alphabet Languages
{0,1} {0,10,100,1000,100000…}
{0,1,00,11,000,111,…}
{a,b,c} {abc,aabbcc,aaabbbccc,…}
{A, … ,Z} {FOR,WHILE,GOTO,…}
{A,…,Z,a,…,z,0,…9, { All legal PASCAL progs}
+,-,…,<,>,…}
Special Languages:  - EMPTY LANGUAGE
 - contains  string only

4
String operations
• Given String: banana
• Prefix : ban, banana
• Suffix : ana, banana
• Substring : nan, ban, ana, banana
• Subsequence: bnan, nn
• Proper Prefix and Suffix

5
String Operations
• Concatenation
– xy; s = s = s;  - identity for concatenation
– s0 =  if i > 0 si = si-1s

6
Operations on Languages
OPERATION DEFINITION
union of L and M
written L  M
concatenation of L
and M written LM
Kleene closure of L
written L*
positive closure of L
written L+
L  M = {s | s is in L or s is in M}
LM = {st | s is in L and t is in M}
L+=


0
i
i
L
L* denotes “zero or more concatenations of “ L
L*=


1
i
i
L
L+ denotes “one or more concatenations of “ L
Exponentiation Lo={ε}, L1=L,L2=LL

7
Operations on Languages
• LUD is the set of letters and digits
• LD is the set of strings consisting of a
letter followed by a digit
• L4 is the set of all four strings
• L* is the set of strings including ε
• D+ is the set of strings of one or more
digits.

8
Say What?
L = {A, B, C, D } D = {1, 2, 3}
• L  D
{A, B, C, D, 1, 2, 3 }
• LD
{A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }
• L2
{ AA, AB, AC, AD, BA, BB, BC, BD, CA, … DD}
• L*
{ All possible strings of L plus  }
• L+
L* - 
• L (L  D )
Valid :{ A1,AA2,B345,CD45} Invlaid:{321,4A2}
• L (L  D )*
Valid:{ A,A1,A23,D3,DA5..} Invalid:{31}

9
Regular Expressions
• A Regular Expression is a Set of Rules /
Techniques for Constructing Sequences of
Symbols (Strings) from an Alphabet.
• Let  Be an Alphabet, r a Regular Expression
Then L(r) is the Language That is characterized
by the Rules of r

10
Regular Expressions
• Defined over an alphabet Σ
• ε represents {ε}, the set containing the empty string
• If a is a symbol in Σ, then a is a regular expression
denoting {a}, the set containing the string a
• If r and s are regular expressions denoting the
languages L(r) and L(s), then:
– (r)|(s) is a regular expression denoting L(r)U L(s)
– (r)(s) is a regular expression denoting L(r)L(s)
– (r)* is a regular expression denoting (L(r))*
– (r) is a regular expression denoting L(r)
• Precedence: * (left associative), then concatenation (left
associative), then | (left associative)

11
Regular Expressions
Alphabet = {a, b}
1. a|b denotes {a, b}
2. (a|b)(a|b) denotes {ab, aa, ba, bb}
3. a* denotes {, a, aa, …}
4. (a|b)* - Strings of a’s and b’s including the 
5. a|a*b – a followed by zero/more a’s followed by b

12
Algebraic Properties of Regular
Expressions
AXIOM DESCRIPTION
r | s = s | r
r | (s | t) = (r | s) | t
(r s) t = r (s t)
r = r
r = r
r* = ( r |  )*
r ( s | t ) = r s | r t
( s | t ) r = s r | t r
r** = r*
| is commutative
| is associative
concatenation is associative
concatenation distributes over |
relation between * and 
 Is the identity element for concatenation
* is idempotent

13
Regular Definitions
• Names maybe given to regular expressions; these
names can be used like symbols
• Let  is an alphabet of basic symbols. The regular
definition is a sequence of definitions of the form
d1 r1
d2 r2
. . .
dn rn
Where, each di is a distinct name, and each ri is a
regular expression over the symbols in   {d1, d2,
…, di-1 }

14
Regular Definitions
• Example 1:
– letter  A|B|…|Z|a|b|…|z
– digit  0|1|…|9
– id  letter (letter | digit)*
• Example 2
– digit  0 | 1 | 2 | … | 9
– digits  digit digit*
– optional_fraction  . digits | 
– optional_exponent  ( E ( + | -| ) digits) | 
– num  digits optional_fraction optional_exponent

15
Regular Definitions
• Shorthand
– One or more instances: r+ denotes rr*
– Zero or one Instance: r? denotes r|ε
– Character classes: [a-z] denotes
[a|b|…|z]

16
Example
• digit  0 | 1 | 2 | … | 9
• digits  digit+
• optional_fraction  (. digits ) ?
• optional_exponent  ( E ( + | -) ? digits) ?
• num  digits optional_fraction optional_exponent

17
Limitations of Regular
Expression
• Some languages cannot be described by any regular
expression
• Cannot describe balanced or nested constructs
– Example, all valid strings of balanced parentheses
– This can be done with CFG
• Cannot describe repeated strings
– Example: {wcw|w is a string of a’s and b’s}
– This can be done with CFG
• Can be used to denote only a fixed or unspecified
number of repetitions.

2_2Specification of Tokens.ppt

More Related Content

What's hot

Similar to 2_2Specification of Tokens.ppt

More from Ratnakar Mikkili

Recently uploaded

2_2Specification of Tokens.ppt