# automata7.ppt

## by peterbuck on May 10, 2010

### Upload Details

Uploaded via SlideShare as Microsoft PowerPoint

### Usage Rights

© All Rights Reserved

## automata7.ppt Presentation Transcript

• Discrete Maths
• Recognising input using:
• automata : a graph-based technique
• regular expressions : an algebraic technique
• equivalent to automata
241-303, Semester 1 2009-2010. 7. Automata and Regular Expressions
• Overview
• 1. Introduction to Automata
• 2. Representing Automata
• 3. The ‘aeiou’ Automaton
• 4. Generating Output
• 5. Bounce Filter Example
• 6. Deterministic and Nondeterministic Automata
• 7. ‘washington’ Partial Anagrams
• 8. Regular Expressions
• 9. UNIX Regular Expressions
• 10. From REs to Automata
• 11. More Information
• 1. Introduction to Automata
• A finite state automaton represents a problem as a series of states and transitions between the states
• the automaton starts in an initial state
• input causes a transition from the current state to another;
• a state may be accepting
• the automaton can terminate successfully when it enters an accepting state (if it wants to)
• 1.1. An Example
• The states are the ovals.
• The transitions are the arrows
• labelled with the input that ‘trigger’ them
• The ‘oddA’ state is accepting.
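• [Figure: the ‘even-odd’ automaton. The start state is evenA. Input b loops on evenA and on oddA; input a moves from evenA to oddA and from oddA back to evenA.]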
• Execution Sequence
• Input: b a b a a
• start in evenA (initial state)
• b => evenA
• a => oddA (the automaton could choose to terminate here)
• b => oddA
• a => evenA
• a => oddA (stops since no more input)
• 1.2. Why are Automata Useful?
• Automata are a very good way of modeling finite-state systems which change state due to input. Examples:
• text editors, compilers, UNIX tools like grep
• communications protocols
• digital hardware components
• e.g. adders, RAM
• (these are very different applications)
• 2. Representing Automata
• Automata have a mathematical basis which allows them to be analysed, e.g.:
• prove that they accept correct input
• prove that they do not accept incorrect input
• Automata can be manipulated to simplify them, and they can be automatically converted into code.
• 2.1. A Mathematical Coding
• We can represent an automaton in terms of sets and mathematical functions.
• The ‘even-odd’ automaton is:
• startSet = { evenA }
• acceptSet = { oddA }
• nextState(evenA, b) => evenA
• nextState(evenA, a) => oddA
• nextState(oddA, b) => oddA
• nextState(oddA, a) => evenA
• Analysis of the mathematical form can show that the ‘even-odd’ automaton only accepts strings which:
• contain an odd number of ‘a’s
• e.g.
• babaa abb abaab aabba aaaaba …
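• The sets-and-functions form of the ‘even-odd’ automaton can also be held as a lookup table. The following is a minimal sketch in C, not the lecture's own coding; the names `next_state` and `accepts`, and the choice to return -1 for characters other than ‘a’ and ‘b’, are assumptions made for illustration.

```c
enum state { EVEN_A, ODD_A, NUM_STATES };

/* nextState as a table: rows are states, columns are inputs
   (column 0 is 'a', column 1 is 'b'). Illustrative layout only. */
static const enum state next_state[NUM_STATES][2] = {
    /* EVEN_A */ { ODD_A,  EVEN_A },
    /* ODD_A  */ { EVEN_A, ODD_A  },
};

/* Returns 1 if the automaton ends in the accepting state, 0 if it does not,
   and -1 if s contains a character other than 'a' or 'b' (an assumption:
   the slides leave this case open). */
int accepts(const char *s)
{
    enum state curr = EVEN_A;               /* startSet = { evenA } */
    for (; *s != '\0'; s++) {
        if (*s != 'a' && *s != 'b')
            return -1;
        curr = next_state[curr][*s == 'b']; /* 'a' -> column 0, 'b' -> column 1 */
    }
    return curr == ODD_A;                   /* acceptSet = { oddA } */
}
```

• For example, `accepts("babaa")` gives 1 because the string contains three ‘a’s, while `accepts("abab")` gives 0.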
• 2.2. Automaton in Code
• It is easy to (automatically) translate an automaton into code, but ...
• an automaton graph does not contain all the details needed for a program
• The main extra coding issues:
• what to do when we enter an accepting state?
• what to do when the input cannot be processed?
• e.g. abzz is entered
• Encoding the ‘even-odd’ Automaton
    #include <stdio.h>

    enum state {evenA, oddA};               // possible states

    enum state nextState(enum state s, int ch);
    int acceptable(enum state s);

    int main(void)
    {
        enum state currState = evenA;       // start state
        int isAccepting = 0;                // false
        int ch;

        while ((ch = getchar()) != EOF) {
            currState = nextState(currState, ch);
            isAccepting = acceptable(currState);
        }
        if (isAccepting)
            printf("accepted\n");
        else
            printf("not accepted\n");
        return 0;
    }
• The accepting state is only used at the end of the input.
    #include <stdlib.h>                     // for exit()

    enum state nextState(enum state s, int ch)
    {
        if ((s == evenA) && (ch == 'b')) return evenA;
        if ((s == evenA) && (ch == 'a')) return oddA;
        if ((s == oddA) && (ch == 'b')) return oddA;
        if ((s == oddA) && (ch == 'a')) return evenA;
        printf("Illegal Input\n");          // any other character, including a newline
        exit(1);
    }
• This is a simple way of handling incorrect input.
    int acceptable(enum state s)
    {
        if (s == oddA) return 1;            // oddA is an accepting state
        return 0;
    }
• 3. The ‘aeiou’ Automaton
• What English words contain the five vowels (a, e, i, o, u) in order?
• Some words that match:
• abstemious
• facetious
• sacrilegious
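• 3.1. Automaton Graph
• [Figure: states 0 to 5 in a chain, with start state 0. State 0 loops on L - a and moves to state 1 on a; state 1 loops on L - e and moves to state 2 on e; state 2 loops on L - i and moves to state 3 on i; state 3 loops on L - o and moves to state 4 on o; state 4 loops on L - u and moves to state 5 on u. L = all letters.]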
• 3.2. Execution Sequence (1)
• Input: f a c e t i o u s
• start in state 0
• f => 0
• a => 1
• c => 1
• e => 2
• t => 2
• i => 3
• o => 4
• u => 5 (the automaton can terminate here; no need to process more input)
• Execution Sequence (2)
• Input: a n d r e w
• start in state 0
• a => 1
• n => 1
• d => 1
• r => 1
• e => 2
• w => 2, and end of input means failure
• 3.3. Translation to Code
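    #include <stdio.h>

    // C enumerators cannot be bare numbers, so the states are named S0 to S5;
    // their values are still 0 to 5.
    enum state {S0, S1, S2, S3, S4, S5};    // poss. states

    enum state nextState(enum state s, int ch);
    int acceptable(enum state s);

    int main(void)
    {
        enum state currState = S0;          // start state
        int isAccepting = 0;                // false
        int ch;

        while (((ch = getchar()) != EOF) && !isAccepting) {
            currState = nextState(currState, ch);
            isAccepting = acceptable(currState);
        }
        if (isAccepting)
            printf("accepted\n");
        else
            printf("not accepted\n");
        return 0;
    }
• Processing stops when the accepting state is entered.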
    enum state nextState(enum state s, int ch)
    {
        if (s == 0) {
            if (ch == 'a') return 1;
            else return 0;                  // input is L-a
        }
        if (s == 1) {
            if (ch == 'e') return 2;
            else return 1;                  // input is L-e
        }
        if (s == 2) {
            if (ch == 'i') return 3;
            else return 2;                  // input is L-i
        }
        if (s == 3) {
            if (ch == 'o') return 4;
            else return 3;                  // input is L-o
        }
        if (s == 4) {
            if (ch == 'u') return 5;
            else return 4;                  // input is L-u
        }
        printf("Illegal Input\n");
        exit(1);                            // needs <stdlib.h>
    }   // end of nextState()
• This is a simple way of handling incorrect input.
    int acceptable(enum state s)
    {
        if (s == 5) return 1;               // 5 is an accepting state
        return 0;
    }
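• The same automaton can be run over a string held in memory rather than over standard input. This is a minimal sketch under that assumption; the function name `vowels_in_order` and the `wanted` array are hypothetical, and any character that is not the awaited vowel is treated as part of L.

```c
/* Returns 1 if word contains a, e, i, o, u in that order (other
   characters may appear in between), otherwise 0. */
int vowels_in_order(const char *word)
{
    static const char wanted[] = "aeiou";   /* wanted[k] is the vowel state k waits for */
    int state = 0;                          /* states 0..5; state 5 is accepting */

    for (; *word != '\0' && state < 5; word++) {
        if (*word == wanted[state])
            state++;                        /* follow the vowel transition */
        /* any other character loops on the current state */
    }
    return state == 5;
}
```

• As in the slides' loop, scanning stops as soon as state 5 is reached.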
• 4. Generating Output
• One possible extension to the basic automaton idea is to allow output:
• when a transition is ‘triggered’ there can be optional output as well
• Automata which generate output are sometimes called Finite State Machines (FSMs).
• 4.1. ‘even-odd’ with Output
• When the ‘a’ transition is triggered out of the evenA state, then a ‘1’ is output.
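• [Figure: the ‘even-odd’ automaton with output. The transitions are as before, except that the a transition from evenA to oddA is labelled a/1.]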
• 4.2. Mathematical Coding
• Add an ‘output’ mathematical function to the automaton representation:
• output( evenA, a ) => 1
• 4.3. Extending the C Coding
• The while loop for ‘even-odd’ will become:
    :
    while ((ch = getchar()) != EOF) {
        output(currState, ch);
        currState = nextState(currState, ch);
        isAccepting = acceptable(currState);
    }
    :
• The output() C function:
    void output(enum state s, int ch)
    {
        if ((s == evenA) && (ch == 'a'))
            putchar('1');
    }
• 5. Bounce Filter Example
• A signal processing problem:
• a stream of 1’s and 0’s is ‘smoothed’ by the filter so that:
• a single 0 surrounded by 1’s becomes a 1: ...1111 0 1111... => ...111111111...
• a single 1 surrounded by 0’s becomes a 0: ...0000 1 0000... => ...000000000...
• This kind of filtering is used in image processing to reduce ‘noise’.
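• 5.1. The ‘bounce’ Automaton
• [Figure: four states, with a and b on the left and d and c on the right; a is the start state. Each transition is labelled input/output. The labels in the figure are 0/0 (three times), 1/0, 1/1 (three times) and 0/1.]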
• Notes
• There is no accepting state
• the code will simply terminate at EOF
• The ‘a’ and ‘b’ states (left side) mostly have transitions that output ‘0’s.
• The ‘c’ and ‘d’ states (right side) mostly have transitions that output ‘1’s.
• 5.2. Execution Sequence
• Input: 0 1 0 1 1 0 1
• start in state a
• 0 => move to a, output 0
• 1 => move to b, output 0
• 0 => move to a, output 0
• 1 => move to b, output 0
• 1 => move to c, output 1 (moved to right hand side)
• 0 => move to d, output 1
• 1 => move to c, output 1
• 5.3. I/O Behaviour
• Input: 0 1 0 1 1 0 1
• Output: 0 0 0 0 1 1 1
• It takes 2 bits of the same type before the automaton realises that it has a new bit sequence rather than a ‘noise’ bit.
• The noise bits are smoothed away in the output.
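• The bounce filter can be sketched as a table of (next state, output) pairs. This is an illustrative sketch, not the lecture's code. Six of the eight table entries follow the execution sequence in 5.2; the entries for state c on input 1 and state d on input 0 are not exercised there and are assumptions, chosen to fit the remaining figure labels and the two-bits rule. The names `bounce`, `table` and `struct step` are hypothetical.

```c
#include <string.h>     /* size_t; strcmp is handy when checking the result */

enum bstate { A, B, C, D };

struct step { enum bstate next; char out; };

/* table[state][input bit]. Entries (C, 1) and (D, 0) are assumptions. */
static const struct step table[4][2] = {
    /* A */ { { A, '0' }, { B, '0' } },
    /* B */ { { A, '0' }, { C, '1' } },
    /* C */ { { D, '1' }, { C, '1' } },
    /* D */ { { A, '0' }, { C, '1' } },
};

/* Filters the bit string in into out, which must have room for
   strlen(in) + 1 characters. Returns 0 on success, or -1 if in
   contains a character other than '0' or '1'. */
int bounce(const char *in, char *out)
{
    enum bstate s = A;                      /* start state */
    size_t i;

    for (i = 0; in[i] != '\0'; i++) {
        if (in[i] != '0' && in[i] != '1')
            return -1;
        const struct step t = table[s][in[i] - '0'];
        out[i] = t.out;                     /* output belongs to the transition */
        s = t.next;
    }
    out[i] = '\0';                          /* no accepting state: just stop at the end */
    return 0;
}
```

• With the input 0101101 from 5.3, this sketch writes 0000111 into `out`.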
• 6. Deterministic and Nondeterministic Automata
• We have been writing deterministic automata so far:
• for an input read by a state there is at most one transition that can be fired
• state S can process input ‘a’ and ‘w’, and fails for anything else
• [Figure: state S with two outgoing transitions, labelled a and w]
• Nondeterministic Automata
• A nondeterministic (ND) automaton can have 2 or more transitions with the same label leaving a state.
• Problem : if state S sees input ‘x’, then which transition should it use?
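• [Figure: state S with three outgoing transitions leading to states U, T and V; one is labelled a and two are labelled x]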
• 6.1. The ‘man’ Automaton
• Accept all strings that contain “man”
• this is hard to write as a deterministic automaton. The following has bugs:
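• [Figure, marked WRONG: states 0 to 3 with start state 0. State 0 loops on L - m and moves to state 1 on m; state 1 moves to state 2 on a and back to state 0 on L - a; state 2 moves to state 3 on n and back to state 0 on L - n.]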
• The input string command will get stuck at state 0:
• start in 0; c => 0; o => 0; m => 1; m => 0 (the problem starts here); a => 0; n => 0; d => 0
• 6.2. An ND Automaton Solution
• It is nondeterministic because an ‘m’ input in state 0 can be dealt with by two transitions:
• a transition back to state 0, or
• a transition to state 1
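• [Figure: states 0 to 3 with start state 0. State 0 loops on L (all letters, including m) and also moves to state 1 on m; state 1 moves to state 2 on a; state 2 moves to state 3 on n.]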
• Processing command input:
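• Path that always loops on state 0: c => 0; o => 0; m => 0; m => 0; a => 0; n => 0; d => 0
• Path that takes the first m to state 1: the next input is m, which state 1 cannot process (fail: reject the input)
• Path that takes the second m to state 1: a => 2; n => 3 (accepting state)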
• 6.3. Executing an ND Automaton
• It is difficult to code ND automata in conventional languages, such as C.
• Two different coding approaches:
• 1. When an input arrives, execute all transitions in parallel . See which succeeds.
• 2. When an input arrives, try one transition . If it leads to failure then backtrack and try another transition.
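• Both approaches can be imitated in C for the ‘man’ automaton, although with more bookkeeping than in a logic language. The sketch below is illustrative only: the function names are hypothetical, every character is treated as a member of L, and acceptance happens as soon as state 3 is entered.

```c
/* States 0..3 of the 'man' automaton; state 3 is accepting. */

/* Approach 1: keep the set of all states the automaton could be in
   (one bit per state) and advance every member on each input. */
int contains_man_parallel(const char *s)
{
    unsigned current = 1u << 0;                 /* start in { 0 } */

    for (; *s != '\0'; s++) {
        unsigned next = 0;
        if (current & (1u << 0)) {
            next |= 1u << 0;                    /* 0 loops on any input */
            if (*s == 'm') next |= 1u << 1;     /* 0 moves to 1 on m */
        }
        if ((current & (1u << 1)) && *s == 'a') next |= 1u << 2;
        if ((current & (1u << 2)) && *s == 'n') next |= 1u << 3;
        if (next & (1u << 3))
            return 1;                           /* some path reached state 3 */
        current = next;
    }
    return 0;
}

/* Approach 2: try one transition; if it leads to failure,
   backtrack and try another. */
static int run_from(int state, const char *s)
{
    if (state == 3) return 1;                   /* accepting: ignore the rest */
    if (*s == '\0') return 0;                   /* input used up without accepting */
    switch (state) {
    case 0:
        if (*s == 'm' && run_from(1, s + 1))    /* first choice: move to 1 */
            return 1;
        return run_from(0, s + 1);              /* otherwise: loop on 0 */
    case 1: return *s == 'a' && run_from(2, s + 1);
    case 2: return *s == 'n' && run_from(3, s + 1);
    }
    return 0;
}

int contains_man_backtrack(const char *s)
{
    return run_from(0, s);
}
```

• For the input command, both functions report success: the path through the first m fails, and the path through the second m reaches state 3.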
• Approach (1) in Parlog
• A concurrent logic programming language.
    state0([X|Rest]) :- state0(Rest) : true.
    state0([m|Rest]) :- state1(Rest) : true.
    state1([a|Rest]) :- state2(Rest).
    state2([n|Rest]).
• The two state0 clauses are tested concurrently.
• Call: ?- state0([c,o,m,m,a,n,d]).
• Approach (2) in Prolog
    nextState(0, _, 0).         % these two facts are
    nextState(0, 'm', 1).       % the nondeterministic part
    nextState(1, 'a', 2).
    nextState(2, 'n', 3).

    nda(State, [Ch|Input]) :-
        nextState(State, Ch, NewState),
        nda(NewState, Input).
    nda(3, _).                  % accepting state; the rest of the input is ignored,
                                % otherwise the trailing d of command would cause failure
• Call: ?- nda(0, [c,o,m,m,a,n,d]).
• Prolog is a sequential logic programming language.
• 6.4. Why use ND Automata?
• With nondeterminism, some problems are easier to solve/model.
• Nondeterminism is common in some application areas, such as AI, graph search, and compilers.
• It is possible to translate an ND automaton into a (larger, more complex) deterministic one.
• In mathematical terms, ND automata and deterministic automata are equivalent
• they can be used to model all the same problems
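• As an illustration of this equivalence, here is a hand-built deterministic automaton for “contains man”, sketched in C. It is one possible construction, not taken from the slides: it differs from the WRONG automaton of 6.1 only in that an m read in state 1 or state 2 leads to state 1 instead of state 0. The function name `contains_man_dfa` is hypothetical.

```c
/* Deterministic automaton for "the input contains man".
   state 0: nothing useful seen; state 1: just seen "m";
   state 2: just seen "ma";      state 3: seen "man" (accepting). */
int contains_man_dfa(const char *s)
{
    int state = 0;

    for (; *s != '\0' && state != 3; s++) {
        switch (state) {
        case 0: state = (*s == 'm') ? 1 : 0; break;
        case 1: state = (*s == 'a') ? 2 : (*s == 'm') ? 1 : 0; break;
        case 2: state = (*s == 'n') ? 3 : (*s == 'm') ? 1 : 0; break;
        }
    }
    return state == 3;
}
```

• Each state has exactly one transition per input character, so no backtracking or parallel execution is needed.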
• 7. ‘washington’ Partial Anagrams
• Find all the words which can be made from the letters in “washington”.
• There are over 240 words. Some of the 7-letter words:
• agonist
• goatish
• showing
• washing
• 7.1. A Two Stage Process
• 1. Select all the words from a dictionary (e.g. /usr/share/dict/words on calvin ) which use the letters in “washington”
• use a deterministic automaton
• 2. Delete the words which use the “washington” letters too many times (e.g. “hash”)
• use a nondeterministic automaton
• 7.2. Stage 1: Deterministic Automaton
• Send each word in the dictionary through the automaton:
• If state 1 is reached, then the word is passed to stage 2.
• [Figure: start state 0 loops on any letter in S = {w,a,s,h,i,n,g,t,o}; a newline moves from state 0 to state 1.]
• For example, “hash” followed by a newline is accepted:
• start in 0; h => 0; a => 0; s => 0; h => 0; newline => 1
• 7.3. Stage 2: ND Automaton
• Check if a word uses a “washington” letter too often:
• e.g. delete “hash”
• The ND automaton succeeds if a word uses too many letters.
• Then the program will not output the word.
• Checking each Letter
• There are 9 different letters in “washington”.
• Nine deterministic automata can be used to detect if the given word has:
• more than 1 ‘a’
• more than 1 ‘g’
• ...
• more than 2 ‘n’s
• Check for more than 1 ‘a’
• If this succeeds then the program will not output the word.
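• [Figure: start state 0 loops on L - a and moves to state 1 on a; state 1 loops on L - a and moves to state 2 on a. E.g. ‘nana’ reaches state 2.]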
• Checking all the Letters at Once
• The 9 deterministic automata can be applied to the same word at the same time.
• Combine the 9 deterministic automata to create a single nondeterministic automaton.
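• Nondeterministic Checking
• [Figure: a single ND automaton whose start state 0 loops on L (all letters). One branch leaves state 0 for each letter:]
• two a's: a moves 0 to 1, state 1 loops on L - a, a moves 1 to 2
• two g's: g moves 0 to 3, state 3 loops on L - g, g moves 3 to 4
• two h's: h moves 0 to 5, state 5 loops on L - h, h moves 5 to 6
• two i's: i moves 0 to 7, state 7 loops on L - i, i moves 7 to 8
• three n's: n moves 0 to 9, states 9 and 10 loop on L - n, n moves 9 to 10 and 10 to 11
• two o's: o moves 0 to 12, state 12 loops on L - o, o moves 12 to 13
• two s's: s moves 0 to 14, state 14 loops on L - s, s moves 14 to 15
• two t's: t moves 0 to 16, state 16 loops on L - t, t moves 16 to 17
• two w's: w moves 0 to 18, state 18 loops on L - w, w moves 18 to 19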
• Processing “hash”
• Reaching an accepting state means that the program will not output “hash”.
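• [Figure: the states that are active after each input. Start: {0}; h: {0, 5}; a: {0, 1, 5}; s: {0, 1, 5, 14}; h: {0, 1, 5, 6, 14}. State 6 (two h's) is accepting.]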
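• The combined effect of stages 1 and 2 can be cross-checked with a small C function. Note that this sketch swaps in a different technique: it counts letters in an array instead of running automata. The function name `is_partial_anagram` is hypothetical, and the code assumes lower-case ASCII words (the job done by tr in the UNIX version).

```c
#include <stddef.h>

/* Returns 1 if word can be made from the letters of "washington", using
   no letter more often than it occurs there; otherwise 0.
   This is a letter-counting check, not an automaton. */
int is_partial_anagram(const char *word)
{
    static const char source[] = "washington";
    int available[26] = {0};

    for (size_t i = 0; source[i] != '\0'; i++)
        available[source[i] - 'a']++;       /* n gets 2, the other 8 letters get 1 */

    for (; *word != '\0'; word++) {
        if (*word < 'a' || *word > 'z')
            return 0;                       /* not a lower-case letter */
        if (available[*word - 'a'] == 0)
            return 0;                       /* letter absent (stage 1) or used up (stage 2) */
        available[*word - 'a']--;
    }
    return 1;
}
```

• With this check, “washing” and “agonist” pass, while “hash” (two h's) and “nana” (two a's) are rejected.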
• 7.4. UNIX Coding
• Stages 0, 1 and 2, piped together:
    tr A-Z a-z < /usr/share/dict/words |
      grep '^[washingto]*$' |
      egrep -v 'a.*a|g.*g|h.*h|i.*i|n.*n.*n|o.*o|s.*s|t.*t|w.*w'
• The call to tr translates all the words taken from the dictionary into lower case.
• [Figure: the pipeline /usr/share/dict/words => tr => grep => egrep -v]
• 8. Regular Expressions (REs)
• REs are an algebraic way of specifying how to recognise input
• ‘ algebraic’ means that the recognition pattern is defined using RE operands and operators
• REs are equivalent to automata
• REs and automata can be used on all the same problems
• 8.1. REs in grep
• grep searches input lines, a line at a time.
• If the line contains a string that matches grep's RE (pattern), then the line is output.
• [Figure: grep "RE" reads input lines (e.g. from a file) and outputs the matching lines (e.g. to a file).]
• Input lines used in the examples:
    hello andy
    my name is andy
    my bye byhe
• Examples
• grep "and" outputs:
    hello andy
    my name is andy
• grep -E "an|my" outputs all three lines ("|" means "or"):
    hello andy
    my name is andy
    my bye byhe
• grep "hel*" outputs ("*" means "0 or more"):
    hello andy
    my bye byhe
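• The line-by-line test that grep performs can be imitated from C with the POSIX regular expression functions. This is a minimal sketch, assuming a POSIX system that provides `<regex.h>`; the helper name `line_matches` and its `extended` parameter are hypothetical, and this is not how grep itself is implemented.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if line contains a string matching pattern, 0 if it does not,
   and -1 if the pattern cannot be compiled. A non-zero extended selects
   extended REs, as grep -E does. */
int line_matches(const char *pattern, const char *line, int extended)
{
    regex_t re;
    int flags = REG_NOSUB | (extended ? REG_EXTENDED : 0);

    if (regcomp(&re, pattern, flags) != 0)
        return -1;
    int found = (regexec(&re, line, 0, NULL, 0) == 0);
    regfree(&re);                           /* release the compiled pattern */
    return found;
}
```

• For instance, `line_matches("hel*", "my bye byhe", 0)` gives 1, because "he" followed by zero l's occurs inside "byhe".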
• 8.2. Why use REs?
• They are very useful for expressing patterns that recognise textual input.
• For example, REs are used in:
• editors
• compilers
• web-based search engines
• communication protocols
• 8.3. The RE Language
• A RE defines a pattern which recognises (matches) a set of strings
• e.g. a RE can be defined that recognises the strings { aa, aba, abba, abbba, abbbba, …}
• These recognisable strings are sometimes called the RE’s language .
• RE Operands
• There are 4 basic kinds of operands:
• characters (e.g. ‘a’, ‘1’, ‘(‘)
• the symbol ε (means an empty string ‘’)
• the symbol {} (means the empty set)
• variables, which can be assigned a RE
• variable = RE
• RE Operators
• There are three basic operators:
• union ‘|’
• concatenation
• closure *
• Union
• S | T
• this RE can use the S or T RE to match strings
• Example REs:
• a | b matches strings {a, b}
• a | b | c matches strings {a, b, c }
• Concatenation
• S T
• this RE will use the S RE followed by the T RE to match against strings
• Example REs:
• a b matches the string { ab }
• w | (a b) matches the strings {w, ab}
• What strings are matched by the RE (a | ab) (c | bc)?
• Equivalent to:
• {a, ab} followed by {c, bc}
• => {ac, abc, abc, abbc}
• => {ac, abc, abbc}
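• The set calculation above can be reproduced mechanically. This is a small illustrative sketch, with the hypothetical names `concat_sets` and `MAX_LEN`: it joins every string of the first set to every string of the second and drops duplicates, as a set would.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LEN 16      /* assumed upper bound on the length of a result, plus 1 */

/* Concatenates every string in s with every string in t, skipping results
   already produced. Stores them in out (which needs room for ns * nt
   entries) and returns how many distinct strings there are. */
size_t concat_sets(const char *s[], size_t ns, const char *t[], size_t nt,
                   char out[][MAX_LEN])
{
    size_t count = 0;

    for (size_t i = 0; i < ns; i++) {
        for (size_t j = 0; j < nt; j++) {
            char candidate[MAX_LEN];
            snprintf(candidate, sizeof candidate, "%s%s", s[i], t[j]);

            size_t k = 0;
            while (k < count && strcmp(out[k], candidate) != 0)
                k++;
            if (k == count)                 /* not seen before */
                strcpy(out[count++], candidate);
        }
    }
    return count;
}
```

• Called with {a, ab} and {c, bc}, it produces four concatenations but keeps three strings, because abc arises twice.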
• Closure
• S*
• this RE can use the S RE 0 or more times to match against strings
• Example RE:
• a* matches the strings: {ε, a, aa, aaa, aaaa, aaaaa, ... }, where ε is the empty string
• 8.4. REs for C Identifiers
• We define two RE variables, letter and digit :
• letter = A | B | C | D ... Z | a | b | c | d .... z
• digit = 0 | 1 | 2 | ... 9
• ident is defined using letter and digit :
• ident = letter ( letter | digit )*
• Strings matched by ident include:
• ab345 w h5g
• Strings not matched:
• 2 \$abc ****
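• The ident RE corresponds to a two-state automaton: the first character must be a letter, and every later character must be a letter or a digit. A minimal sketch in C follows; the function names are hypothetical, and the code follows the slide's definition exactly, so the underscore that real C identifiers allow is not accepted.

```c
/* letter = A | B | ... | Z | a | b | ... | z */
static int is_letter(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* digit = 0 | 1 | ... | 9 */
static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* ident = letter ( letter | digit )*
   Returns 1 if the whole of s matches ident, otherwise 0. */
int is_ident(const char *s)
{
    if (!is_letter(*s))
        return 0;                           /* must start with a letter */
    for (s++; *s != '\0'; s++)
        if (!is_letter(*s) && !is_digit(*s))
            return 0;                       /* closure part: letters and digits only */
    return 1;
}
```

• The empty string is rejected, because ident requires at least one letter.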
• 9. UNIX Regular Expressions
• Different UNIX tools use slightly different extensions of the basic RE notation
• vi , awk , sed , grep , egrep , etc.
• Extra features include:
• character classes
• line start ‘^’ and end ‘\$’ symbols
• the wild card symbol ‘.’
• additional operators, R? and R+
• 9.1. Character Classes
• The character class $[a_1 a_2 \dots a_n]$ stands for $a_1 \mid a_2 \mid \dots \mid a_n$
• $a_1$-$a_n$ stands for the set of characters between $a_1$ and $a_n$
• e.g. [A-Z] [a-z0-9]
• 9.2. Line Start and End
• The ‘^’ matches the beginning of the line, ‘\$’ matches the end
• e.g.
    grep '^andr' /usr/share/dict/words
    grep '^[washingto]*$' /usr/share/dict/words
• Example as a Diagram
• [Figure: grep "^andr" reads /usr/share/dict/words (A, A's, AOL, AOL's, ...) and outputs androgen, androgen's, androgynous, android, android's, androids]
• 9.3. Wild Card Symbol
• The ‘.’ stands for any character except the newline
• e.g.
    grep '^a..b.$' chapter1.txt
    grep 't.*t.*t' manual
• [Figure: grep "^a..b.$" reads /usr/share/dict/words (A, A's, AOL, AOL's, ...) and outputs adobe, alibi, ameba]
• 9.4. R? and R+
• R? stands for ε | R (0 or 1 R)
• R+ stands for R | RR | RRR | ... which can also be written as R R*
• one or more occurrences of R
• 9.5. Operator Precedence
• The operators *, +, and ? have the highest precedence.
• Then comes concatenation
• Union ‘|’ is the lowest precedence
• Example:
• a | bc? means a | (b(c?)), and matches the strings {a, b, bc}
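• The precedence rules can be checked against a real RE library. The sketch below assumes a POSIX system with `<regex.h>`; the function name `matches_a_or_bc_opt` is hypothetical. The pattern is wrapped in ^( )$ so that the whole string must match. If a | bc? meant (a | b)c? then "ac" would match, and if it meant (a | bc)? then the empty string would match; neither should happen.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the whole of s matches the extended RE a|bc?, 0 if it
   does not, and -1 if the pattern fails to compile. */
int matches_a_or_bc_opt(const char *s)
{
    regex_t re;

    /* ^( ... )$ forces the RE to cover the entire string. */
    if (regcomp(&re, "^(a|bc?)$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

• The strings a, b and bc are the only ones this function accepts.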
• 10. From REs to Automata
• The translation uses a special kind of ND automaton which uses ε-transitions. Automata of this type are sometimes called ε-NFAs.
• The translation steps are:
• RE => ε-NFA
• ε-NFA => ND automaton
• ND automaton => deterministic automaton
• deterministic automaton => code
• 10.1. ε-NFAs
• An ε-NFA allows a transition to use an ε label.
• A transition using an ε label can be triggered without having to match any input.
• ε-NFA Example
• a*b | b*a is accepted by the following ε-NFA:
• [Figure: an ε-NFA with states 1 to 6 and start state 1; its transitions are labelled ε, a and b, and an annotation marks where the nondeterminism occurs. Example input: "bbba"]
• 10.2. RE to ε-NFA
• The resulting ε-NFA has:
• one start state and one accepting state
• at most two transitions out of any state
• The construction uses standard automata ‘pieces’ corresponding to RE operands and operators.
• The pieces are put together based on an expression tree for the RE.
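• Automata Pieces for RE Operands
• Automaton for a character x: [Figure: a start state with a transition labelled x]
• Automaton for ε: [Figure: a start state with a transition labelled ε]
• Automaton for {}: [Figure: a start state with no transition] This automaton does not accept any strings.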
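• Automata Pieces for RE Operators
• Union S | T: [Figure: the automata for S and T side by side, joined to a shared start and a shared end by four ε-transitions]
• Concatenation S T: [Figure: the automaton for S followed by the automaton for T, joined by one ε-transition]
• Closure S*: [Figure: the automaton for S with four ε-transitions added around it]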
• 10.3. Translating a | bc*
• The first step in building the automaton is to draw a | bc* as an expression tree:
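• [Figure: expression tree. The root is |, with children a and . (the concatenate symbol); . has children b and *; * has the child c.]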
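• Translate the 3 leaves
• Automaton for a: start state 1, with an a transition to state 2
• Automaton for b: start state 4, with a b transition to state 5
• Automaton for c: start state 7, with a c transition to state 8
• Automaton for c*: [Figure: start state 6 and states 7, 8, 9; the c transition from 7 to 8 is surrounded by four ε-transitions]
• Automaton for bc*: [Figure: the b transition from 4 to 5, joined by an ε-transition to the automaton for c* (states 6, 7, 8, 9)]
• Final Automaton for a | bc*: [Figure: states 0 to 9 with start state 0; the automaton for a (states 1, 2) and the automaton for bc* (states 4 to 9) are joined through states 0 and 3 by four more ε-transitions]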
• 10.4. From ε-NFA to ND Automaton
• The ε-transitions can be removed by combining the states that use them.
• If we are in a state S with outgoing ε-transitions, then we are also in any state that can be reached from S by following those ε-transitions.
• Example: simplify the lower branch of a|bc*
• [Figure: a sequence of diagrams in which the lower branch (states 0, 4, 5, 6, 7, 8, 9, 3) is simplified step by step by merging states joined by ε-transitions. The combined states that appear are labelled "9,3", "6,9,3", "0,4", "5,6,9,3", "8,9,3", "7,8,9,3" and finally "5,6,7,8,9,3". The result has two states, "0,4" and "5,6,7,8,9,3", with transitions labelled b and c. Simplifying the labels gives the states 0 and 5.]
• All of a|bc* simplified:
• [Figure: states 0, 2 and 5 with start state 0; the transitions are labelled a, b and c.]
• This also happens to be a deterministic automaton, so the translation is finished.
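• The idea behind ε-removal (being in a state means also being in every state reachable through ε-transitions) can be tried out by simulating the ε-NFA for a | bc* directly. This is a sketch under stated assumptions: the state numbers 0 to 9 are the ones used in 10.3, but the exact ε wiring in the `eps` table is reconstructed from the usual union, concatenation and closure pieces and is not copied from the figures. All function names are hypothetical.

```c
#define NSTATES 10

/* eps[i] is a bit set of the states reachable from state i by one
   ε-transition. The wiring is an assumption (see the text above). */
static const unsigned eps[NSTATES] = {
    [0] = (1u << 1) | (1u << 4),    /* union: start to the a branch and the bc* branch */
    [2] = 1u << 3,                  /* union: end of the a branch to state 3 */
    [5] = 1u << 6,                  /* concatenation: after b, enter c* */
    [6] = (1u << 7) | (1u << 9),    /* closure: take c, or skip it */
    [8] = (1u << 7) | (1u << 9),    /* closure: repeat c, or leave */
    [9] = 1u << 3,                  /* union: end of the bc* branch to state 3 */
};

/* Adds to set every state reachable from it through ε-transitions. */
static unsigned eps_closure(unsigned set)
{
    unsigned prev;
    do {
        prev = set;
        for (int i = 0; i < NSTATES; i++)
            if (set & (1u << i))
                set |= eps[i];
    } while (set != prev);              /* repeat until nothing new is added */
    return set;
}

/* Follows the transitions labelled with a real input character. */
static unsigned step(unsigned set, char ch)
{
    unsigned next = 0;
    if ((set & (1u << 1)) && ch == 'a') next |= 1u << 2;
    if ((set & (1u << 4)) && ch == 'b') next |= 1u << 5;
    if ((set & (1u << 7)) && ch == 'c') next |= 1u << 8;
    return next;
}

/* Returns 1 if the whole of s is matched by a | bc*, otherwise 0.
   State 3 is taken to be the accepting state. */
int accepts_a_or_bcstar(const char *s)
{
    unsigned set = eps_closure(1u << 0);
    for (; *s != '\0'; s++)
        set = eps_closure(step(set, *s));
    return (set & (1u << 3)) != 0;
}
```

• The simulation accepts a, b, bc, bcc and so on, which agrees with the simplified automaton: a leads to one state, b to another, and c loops there.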
• 11. More Information
• Johnsonbaugh, R. 1997. Discrete Mathematics, Prentice Hall, chapter 10.