Lecture2 B

477 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
477
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
12
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Lecture2 B

  1. 1. CMSC 723 / LING 645: Intro to Computational Linguistics September 8, 2004: Monz Regular Expressions and Finite State Automata (J&M 2) Prof. Bonnie J. Dorr Dr. Christof Monz TA: Adam Lee
  2. 2. Regular Expressions and Finite State Automata <ul><li>REs: Language for specifying text strings </li></ul><ul><li>Search for document containing a string </li></ul><ul><ul><li>Searching for “woodchuck” </li></ul></ul><ul><li>Finite-state automata (FSA) (singular: automaton) </li></ul><ul><ul><ul><li>How much wood would a woodchuck chuck if a woodchuck would chuck wood? </li></ul></ul></ul><ul><ul><li>Searching for “woodchucks” with an optional final “s” </li></ul></ul>
  3. 3. Regular Expressions <ul><li>Basic regular expression patterns </li></ul><ul><li>Perl-based syntax (slightly different from other notations for regular expressions) </li></ul><ul><li>Disjunctions /[wW]oodchuck/ </li></ul>
  4. 4. Regular Expressions <ul><li>Ranges [A-Z] </li></ul><ul><li>Negations [^Ss] </li></ul>
  5. 5. Regular Expressions <ul><li>Optional characters ? , * and + </li></ul><ul><ul><li>? (0 or 1) </li></ul></ul><ul><ul><ul><li>/colou ? r/  color or colour </li></ul></ul></ul><ul><ul><li>* (0 or more) </li></ul></ul><ul><ul><ul><li>/oo * h!/  oh! or Ooh! or Ooooh! </li></ul></ul></ul><ul><ul><li>+ (1 or more) </li></ul></ul><ul><ul><ul><li>/o + h!/  oh! or Ooh! or Ooooh! </li></ul></ul></ul><ul><li>Wild cards . - /beg . n/  begin or began or begun </li></ul>* + Stephen Cole Kleene
  6. 6. Regular Expressions <ul><li>Anchors ^ and $ </li></ul><ul><ul><li>/ ^ [A-Z]/  “ R amallah, P alestine” </li></ul></ul><ul><ul><li>/ ^ [^A-Z]/  “ ¿ verdad?” “ r eally?” </li></ul></ul><ul><ul><li>/. $ /  “It is over . ” </li></ul></ul><ul><ul><li>/. $ /  ? </li></ul></ul><ul><li>Boundaries  and B </li></ul><ul><ul><li>/  on  /  “ on my way” “M on day” </li></ul></ul><ul><ul><li>/ B on  /  “automat on ” </li></ul></ul><ul><li>Disjunction | </li></ul><ul><ul><li>/yours|mine/  “it is either yours or mine ” </li></ul></ul>
  7. 7. Disjunction, Grouping, Precedence <ul><li>Column 1 Column 2 Column 3 … How do we express this? </li></ul><ul><li>/ Column [0-9]+ * / </li></ul><ul><li>/ ( Column [0-9]+ + )* / </li></ul><ul><li>Precedence </li></ul><ul><ul><li>Parenthesis () </li></ul></ul><ul><ul><li>Counters * + ? {} </li></ul></ul><ul><ul><li>Sequences and anchors the ^my end$ </li></ul></ul><ul><ul><li>Disjunction | </li></ul></ul><ul><li>REs are greedy! </li></ul>
  8. 8. Perl Commands While ( $line=<STDIN> ){ if ( $line =~ /the/ ){ print “MATCH: $line”; } }
  9. 9. Writing correct expressions <ul><li>Exercise: Write a Perl regular expression to match the English article “the”: </li></ul>/the/ / [tT] he/ /  [tT] he  / / [^a-zA-Z] [tT] he [^a-zA-Z] / / ( ^ | [^a-zA-Z] ) [tT] he [^a-zA-Z] /
  10. 10. A more complex example <ul><li>Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”: </li></ul>/ $ [0-9]+/ / $ [0-9]+ .[0-9][0-9] / /  $ [0-9]+ ( .[0-9][0-9] )?  / /  $ [0-9][0-9]?[0-9]? ( .[0-9][0-9] )?  / /  [0-9]+ * ([MG]Hz|[Mm]egahertz| [Gg]igahertz)  / /  [0-9]+ * (Mb|[Mm]egabytes ? )  / /  [0-9](.[0-9]+) * (Gb|[Gg]igabytes ? )  /
  11. 11. Advanced operators should be _
  12. 12. Substitutions and Memory <ul><li>Substitutions </li></ul>s/colour/color/ s/colour/color/g s/([Cc]olour)/$1olor/ /the (.*)er they were, the $1er they will be/ /the (.*)er they (.*), the $1er they $2/ Substitute as many times as possible! Case insensitive matching s/colour/color/i <ul><li>Memory ( $1 , $2 , etc. refer back to matches) </li></ul>
  13. 13. Eliza [Weizenbaum, 1966] <ul><li>User: Men are all alike </li></ul><ul><li>ELIZA: IN WHAT WAY </li></ul><ul><li>User: They’re always bugging us about something or other </li></ul><ul><li>ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE? </li></ul><ul><li>User: Well, my boyfriend made me come here </li></ul><ul><li>ELIZA: YOUR BOYFRIEND MADE YOU COME HERE </li></ul><ul><li>User: He says I’m depressed much of the time </li></ul><ul><li>ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED </li></ul>
  14. 14. Eliza-style regular expressions <ul><li>s/ .* YOU ARE (depressed|sad) .* /I AM SORRY TO HEAR YOU ARE 1/ </li></ul><ul><li>s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE 1/ </li></ul><ul><li>s/.* all .*/IN WHAT WAY/ </li></ul><ul><li>s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ </li></ul>Step 1: replace first person with second person references s/  I(’m| am)  /YOU ARE/g s/  my  /YOUR/g S/  mine  /YOURS/g Step 2: use additional regular expressions to generate replies Step 3: use scores to rank possible transformations
  15. 15. Finite-state Automata <ul><li>Finite-state automata (FSA) </li></ul><ul><li>Regular languages </li></ul><ul><li>Regular expressions </li></ul>
  16. 16. Finite-state Automata (Machines) /^baa+!$/ q 0 q 1 q 2 q 3 q 4 b a a ! a state transition final state baa! baaa! baaaa! baaaaa! ...
  17. 17. Input Tape REJECT a b a ! b q 0 0 1 2 3 4 b a a ! a
  18. 18. Input Tape ACCEPT b a a a q 0 q 1 q 2 q 3 q 3 q 4 ! 0 1 2 3 4 b a a ! a
  19. 19. Finite-state Automata <ul><li>Q: a finite set of N states </li></ul><ul><ul><li>Q = {q 0 , q 1 , q 2 , q 3 , q 4 } </li></ul></ul><ul><li> : a finite input alphabet of symbols </li></ul><ul><ul><li> = {a, b, !} </li></ul></ul><ul><li>q 0 : the start state </li></ul><ul><li>F: the set of final states </li></ul><ul><ul><li>F = {q 4 } </li></ul></ul><ul><li> (q,i): transition function </li></ul><ul><ul><li>Given state q and input symbol i, return new state q' </li></ul></ul><ul><ul><li> (q 3 ,!)  q 4 </li></ul></ul>
  20. 20. State-transition Tables 4: 3 2 1 0 State Ø 3 Ø 4 3 Ø Ø Ø Ø Ø 2 Ø Ø Ø 1 ! a b Input
  21. 21. D-RECOGNIZE function D-RECOGNIZE ( tape , machine ) returns accept or reject index  Beginning of tape current-state  Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table [current-state, tape[index]] is empty then return reject else current-state  transition-table [current-state, tape[index]] index  index + 1 end
  22. 22. Adding a failing state q 0 q 1 q 2 q 3 q 4 b a a ! a q F a ! b ! b ! b b a !
  23. 23. Adding an “all else” arc q 0 q 1 q 2 q 3 q 4 b a a ! a q F = = = =
  24. 24. Languages and Automata <ul><li>Can use FSA as a generator as well as a recognizer </li></ul><ul><li>Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. </li></ul><ul><ul><li>L(M) = {baa!, baaa!, baaaa!, …} </li></ul></ul><ul><li>Regular languages vs. non-regular languages </li></ul>
  25. 25. Languages and Automata <ul><li>Deterministic vs. Non-deterministic FSAs </li></ul><ul><li>Epsilon (  ) transitions </li></ul>
  26. 26. Using NFSAs to accept strings <ul><li>Backup : add markers at choice points, then possibly revisit unexplored arcs at marked choice point. </li></ul><ul><li>Look-ahead : look ahead in input </li></ul><ul><li>Parallelism : look at alternatives in parallel </li></ul>
  27. 27. Using NFSAs Input Ø 4 Ø Ø Ø ! 4: 3 2 1 0 State Ø 2,3 Ø Ø Ø Ø Ø Ø Ø Ø 2 Ø Ø Ø 1  a b
  28. 28. Readings for next time <ul><li>J&M Chapter 3 </li></ul>

×