Lecture2 B
Lecture2 B Presentation Transcript

  • CMSC 723 / LING 645: Intro to Computational Linguistics
    • September 8, 2004: Monz
    • Regular Expressions and Finite State Automata (J&M 2)
    • Prof. Bonnie J. Dorr
    • Dr. Christof Monz
    • TA: Adam Lee
  • Regular Expressions and Finite State Automata
    • REs: Language for specifying text strings
    • Search for document containing a string
      • Searching for “woodchuck”
      • Searching for “woodchucks” with an optional final “s”
      • Example text: How much wood would a woodchuck chuck if a woodchuck would chuck wood?
    • Finite-state automata (FSA) (singular: automaton)
  • Regular Expressions
    • Basic regular expression patterns
    • Perl-based syntax (slightly different from other notations for regular expressions)
    • Disjunctions /[wW]oodchuck/
  • Regular Expressions
    • Ranges [A-Z]
    • Negations [^Ss]
  • Regular Expressions
    • Optional characters and counters ?, * and +
      • ? (0 or 1)
        • /colou?r/ → color or colour
      • * (0 or more)
        • /oo*h!/ → oh! or Ooh! or Ooooh!
      • + (1 or more)
        • /o+h!/ → oh! or Ooh! or Ooooh!
    • Wild card . : /beg.n/ → begin or began or begun
    • * and + are named after Stephen Cole Kleene (Kleene star, Kleene plus)
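The basic operators above can be tried out in a few lines. This is only a sketch: it uses Python's `re` module as a stand-in for the Perl syntax on the slides, and the helper name `matches` and the test words are made up for illustration. `re.fullmatch` is used so that each word must match as a whole.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Disjunction over characters: [wW] accepts either capitalisation.
print(matches(r"[wW]oodchuck", ["woodchuck", "Woodchuck", "goodchuck"]))

# ? makes the preceding character optional (0 or 1 occurrences).
print(matches(r"colou?r", ["color", "colour", "colouur"]))

# * allows 0 or more repetitions, + requires at least one.
print(matches(r"oo*h!", ["oh!", "ooh!", "h!"]))
print(matches(r"o+h!", ["oh!", "ooooh!", "h!"]))

# The wild card . stands for any single character.
print(matches(r"beg.n", ["begin", "began", "begun", "begn"]))

# A negated class [^Ss] matches any single character except S or s.
print(re.findall(r"[^Ss]", "Sass"))
```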
  • Regular Expressions
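    • Anchors ^ and $
      • /^[A-Z]/ → “Ramallah, Palestine” (matches the line-initial “R”)
      • /^[^A-Z]/ → “¿verdad?” “really?” (matches “¿” and “r”)
      • /\.$/ → “It is over.” (matches the final period)
      • /.$/ → ? (without the backslash, “.” is a wild card, so any line-final character matches)
    • Boundaries \b and \B
      • /\bon\b/ → matches “on my way”, but not “Monday”
      • /\Bon\b/ → matches “automaton”
    • Disjunction |
      • /yours|mine/ → “it is either yours or mine”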
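A short check of the anchor, boundary and disjunction examples, again using Python's `re` module as an assumed stand-in for Perl. The variable names are illustrative only.

```python
import re

# ^ anchors the match at the start of the line, so only the first capital is found.
initial_capital = re.findall(r"^[A-Z]", "Ramallah, Palestine")

# \.$ matches a literal period at the end of the line.
final_period = re.search(r"\.$", "It is over.") is not None

# .$ (no backslash) matches whatever character comes last.
final_char = re.search(r".$", "really?").group()

# \b is a word boundary, \B is a position that is not a word boundary.
on_as_word = re.search(r"\bon\b", "on my way") is not None
on_in_monday = re.search(r"\bon\b", "Monday") is not None
on_word_final = re.search(r"\Bon\b", "automaton") is not None

# | chooses between whole alternatives.
pronouns = re.findall(r"yours|mine", "it is either yours or mine")

print(initial_capital, final_period, final_char)
print(on_as_word, on_in_monday, on_word_final, pronouns)
```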
  • Disjunction, Grouping, Precedence
    • Column 1 Column 2 Column 3 … How do we express this?
    • /Column [0-9]+ */ (a single column, followed by optional spaces)
    • /(Column [0-9]+ +)*/ (the parenthesised group as a whole is repeated)
    • Precedence
      • Parenthesis ()
      • Counters * + ? {}
      • Sequences and anchors: the ^my end$
      • Disjunction |
    • REs are greedy!
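The effect of grouping, precedence and greediness can be seen on the column example. This sketch uses Python's `re` module in place of Perl, and it uses ` *` inside the group (an assumption of the example) so that the last column, which has no trailing space, is still covered.

```python
import re

line = "Column 1 Column 2 Column 3"

# Without parentheses the final * applies only to the space,
# so the pattern covers a single column.
single = re.match(r"Column [0-9]+ *", line).group()

# With parentheses the outer * applies to the whole group,
# so the pattern covers the repeated columns.
repeated = re.match(r"(Column [0-9]+ *)*", line).group()

# Sequence binds tighter than |: this reads "the" or "any", not "th(e|a)ny".
alternatives = re.findall(r"the|any", "the many")

# Greedy matching: [a-z]* takes as many characters as it can.
greedy = re.match(r"[a-z]*", "once upon a time").group()

print(repr(single), repr(repeated), alternatives, repr(greedy))
```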
  • Perl Commands
    while ( $line = <STDIN> ) {
        if ( $line =~ /the/ ) {
            print "MATCH: $line";
        }
    }
  • Writing correct expressions
    • Exercise: Write a Perl regular expression to match the English article “the”:
      • /the/
      • /[tT]he/
      • /\b[tT]he\b/
      • /[^a-zA-Z][tT]he[^a-zA-Z]/
      • /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
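Counting matches on a test sentence shows how each successive attempt trades false positives against false negatives. This is a sketch in Python's `re` module rather than Perl; the sentence and the helper `count_matches` are invented for the example.

```python
import re

text = "The other day, the man bathed there; then the"

attempts = [
    r"the",                              # misses "The", hits "other", "bathed", ...
    r"[tT]he",                           # adds "The", still hits word-internal "the"
    r"\b[tT]he\b",                       # whole words only
    r"[^a-zA-Z][tT]he[^a-zA-Z]",         # needs a non-letter on both sides
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",     # also allows "The" at the start of the line
]

def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text (hypothetical helper)."""
    return sum(1 for _ in re.finditer(pattern, text))

counts = [count_matches(p, text) for p in attempts]
for pattern, n in zip(attempts, counts):
    print(f"{pattern:35} {n}")
```

On this sentence the last two patterns still miss the line-final “the”, because they require a following non-letter character.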
  • A more complex example
    • Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:
      • /\$[0-9]+/
      • /\$[0-9]+\.[0-9][0-9]/
      • /\b\$[0-9]+(\.[0-9][0-9])?\b/
      • /\b\$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b/
      • /\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b/
      • /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
      • /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
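A rough sketch of how such patterns might be applied to an advertisement, using Python's `re` module in place of Perl. Several things here are assumptions of the example, not of the slides: the ad text is invented, the price pattern drops the `\b` next to `\$` (a space followed by `$` is not a word boundary), the disk pattern makes the decimal part optional so that whole numbers like “40” match, and the numeric comparison is done in code rather than in the regular expression.

```python
import re

ad = "Fast PC: 600 MHz, 40 Gb disk, 256 Mb RAM, only $999.99"

price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")
speed = re.compile(r"\b[0-9]+ *([MG]Hz|[Mm]egahertz|[Gg]igahertz)\b")
disk = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

def price_below(ad_text, limit):
    """True if the first price found in ad_text is below limit (hypothetical helper)."""
    m = price.search(ad_text)
    # m.group() looks like "$999.99"; strip the dollar sign before converting.
    return m is not None and float(m.group()[1:]) < limit

print(price.search(ad).group())
print(speed.search(ad).group())
print(disk.search(ad).group())
print(price_below(ad, 1000))
```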
  • Advanced operators (operator table not preserved in this transcript; surviving slide annotation: “should be _”)
  • Substitutions and Memory
    • Substitutions
      • s/colour/color/
      • s/colour/color/g (substitute as many times as possible!)
      • s/colour/color/i (case-insensitive matching)
    • Memory ($1, $2, etc. refer back to matches)
      • s/([Cc])olour/$1olor/
      • /the (.*)er they were, the \1er they will be/
      • /the (.*)er they (.*), the \1er they \2/
      • Inside the pattern itself, Perl writes the back-reference as \1, \2; $1, $2 are used in the replacement and after the match
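The substitution and memory operators have close counterparts in Python's `re.sub`, shown here as a sketch rather than as the Perl commands themselves. Python writes back-references as `\1` in both the pattern and the replacement; the sample strings are invented.

```python
import re

text = "colour, Colour, colour"

# s/colour/color/ replaces only the first match.
first_only = re.sub(r"colour", "color", text, count=1)

# s/colour/color/g replaces every match.
all_matches = re.sub(r"colour", "color", text)

# s/colour/color/gi additionally ignores case.
ignore_case = re.sub(r"colour", "color", text, flags=re.IGNORECASE)

# Memory: group 1 remembers the first letter so its case is preserved.
keep_case = re.sub(r"([Cc])olour", r"\1olor", text)

# A back-reference inside the pattern forces both gaps to hold the same string.
memory = re.compile(r"the (.*)er they were, the \1er they will be")
same = memory.fullmatch("the bigger they were, the bigger they will be")
different = memory.fullmatch("the bigger they were, the faster they will be")

print(first_only, all_matches, ignore_case, keep_case, sep="\n")
print(same is not None, different is not None)
```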
  • Eliza [Weizenbaum, 1966]
    • User: Men are all alike
    • ELIZA: IN WHAT WAY
    • User: They’re always bugging us about something or other
    • ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
    • User: Well, my boyfriend made me come here
    • ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
    • User: He says I’m depressed much of the time
    • ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
  • Eliza-style regular expressions
    • Step 1: replace first person with second person references
      • s/\bI('m| am)\b/YOU ARE/g
      • s/\bmy\b/YOUR/g
      • s/\bmine\b/YOURS/g
    • Step 2: use additional regular expressions to generate replies
      • s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
      • s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
      • s/.* all .*/IN WHAT WAY/
      • s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
    • Step 3: use scores to rank possible transformations
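Steps 1 and 2 can be sketched as a tiny responder. This is not Weizenbaum's program: it is a minimal illustration in Python's `re` module, it upper-cases the input first, it takes the first matching rule instead of ranking by score (step 3 is omitted), and the fallback reply `PLEASE GO ON` and the names `RULES` and `respond` are assumptions of this example.

```python
import re

# Step 2 rules: (pattern over the whole utterance, reply template).
RULES = [
    (r".*\bYOU ARE (DEPRESSED|SAD)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\bALL\b.*", "IN WHAT WAY"),
    (r".*\bALWAYS\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    text = utterance.upper()
    # Step 1: replace first person with second person references.
    text = re.sub(r"\bI('M| AM)\b", "YOU ARE", text)
    text = re.sub(r"\bMY\b", "YOUR", text)
    text = re.sub(r"\bMINE\b", "YOURS", text)
    # Step 2: answer with the first rule whose pattern matches.
    for pattern, reply in RULES:
        m = re.fullmatch(pattern, text)
        if m:
            return m.expand(reply)   # fills in \1 from the match
    return "PLEASE GO ON"            # hypothetical fallback reply

print(respond("Men are all alike"))
print(respond("They're always bugging us about something or other"))
print(respond("He says I'm depressed much of the time"))
```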
  • Finite-state Automata
    • Finite-state automata (FSA)
    • Regular languages
    • Regular expressions
  • Finite-state Automata (Machines)
    • /^baa+!$/
    • States: q0, q1, q2, q3, q4 (final state: q4)
    • Transitions: q0 --b--> q1, q1 --a--> q2, q2 --a--> q3, q3 --a--> q3, q3 --!--> q4
    • Accepted strings: baa! baaa! baaaa! baaaaa! ...
  • Input Tape: a b a ! b → REJECT (the machine starts in q0, which has no transition on a)
  • Input Tape: b a a a ! → ACCEPT (state sequence: q0 q1 q2 q3 q3 q4)
  • Finite-state Automata
    • Q: a finite set of N states
      • Q = {q0, q1, q2, q3, q4}
    • Σ: a finite input alphabet of symbols
      • Σ = {a, b, !}
    • q0: the start state
    • F: the set of final states
      • F = {q4}
    • δ(q,i): transition function
      • Given state q and input symbol i, return new state q'
      • δ(q3,!) → q4
  • State-transition Tables
        State |       Input
              |   b    a    !
          0   |   1    Ø    Ø
          1   |   Ø    2    Ø
          2   |   Ø    3    Ø
          3   |   Ø    3    4
          4:  |   Ø    Ø    Ø
    (the colon marks the final state; Ø marks a missing transition)
  • D-RECOGNIZE
    function D-RECOGNIZE(tape, machine) returns accept or reject
      index ← Beginning of tape
      current-state ← Initial state of machine
      loop
        if End of input has been reached then
          if current-state is an accept state then
            return accept
          else
            return reject
        elsif transition-table[current-state, tape[index]] is empty then
          return reject
        else
          current-state ← transition-table[current-state, tape[index]]
          index ← index + 1
      end
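The D-RECOGNIZE pseudocode translates almost line for line into Python. In this sketch the transition table for /baa+!/ is stored as a dictionary, a missing key plays the role of an empty (Ø) cell, and the names `TRANSITIONS`, `START_STATE`, `FINAL_STATES` and `d_recognize` are chosen for the example.

```python
# Transition table for the sheep language /baa+!/.
# A (state, symbol) pair that is absent corresponds to an empty cell (Ø).
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
START_STATE = 0
FINAL_STATES = {4}

def d_recognize(tape, transitions=TRANSITIONS, start=START_STATE, finals=FINAL_STATES):
    """Return True (accept) or False (reject) for the input tape."""
    current_state = start
    for symbol in tape:
        # Empty table cell: no legal move, so reject.
        if (current_state, symbol) not in transitions:
            return False
        current_state = transitions[(current_state, symbol)]
    # End of input: accept only if we stopped in a final state.
    return current_state in finals

for tape in ["baaa!", "aba!b", "baa"]:
    print(tape, "accept" if d_recognize(tape) else "reject")
```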
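  • Adding a failing state: the machine q0 ... q4 gets an extra fail state qF, and every missing transition leads to qF (q0 on a, !; q1 on b, !; q2 on b, !; q3 on b; q4 on b, a, !)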
  • Adding an “all else” arc: the individual arcs into qF are replaced by arcs labelled = (“all else”), which cover every symbol that has no explicit transition
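Adding a fail state amounts to filling every empty table cell, which can be sketched as follows. The names `add_fail_state` and `"qF"` are illustrative, and letting qF loop back to itself on every symbol is an assumption made here so that the resulting transition function is total.

```python
ALPHABET = ["a", "b", "!"]
STATES = [0, 1, 2, 3, 4]
FAIL = "qF"

# Original (partial) transition table for /baa+!/.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

def add_fail_state(transitions, states, alphabet, fail=FAIL):
    """Return a copy of the table in which every missing cell points to fail."""
    total = dict(transitions)
    for state in list(states) + [fail]:
        for symbol in alphabet:
            # setdefault keeps existing transitions and fills only the gaps.
            total.setdefault((state, symbol), fail)
    return total

TOTAL = add_fail_state(TRANSITIONS, STATES, ALPHABET)

# Arcs that now lead from an original state into the fail state.
new_arcs = [key for key, target in TOTAL.items() if target == FAIL and key[0] != FAIL]
print(len(new_arcs), sorted(new_arcs, key=str))
```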
  • Languages and Automata
    • Can use FSA as a generator as well as a recognizer
    • Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language.
      • L(M) = {baa!, baaa!, baaaa!, …}
    • Regular languages vs. non-regular languages
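Using the same table as a generator can be sketched with a breadth-first walk over the states. This is one possible approach, not one prescribed by the slides; the function name `generate` and the length bound are assumptions needed because L(M) is infinite.

```python
from collections import deque

TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

def generate(transitions, start, finals, max_length):
    """List every string of at most max_length symbols that the machine accepts."""
    accepted = []
    queue = deque([(start, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in finals:
            accepted.append(prefix)
        if len(prefix) == max_length:
            continue  # do not extend beyond the length bound
        for (source, symbol), target in sorted(transitions.items()):
            if source == state:
                queue.append((target, prefix + symbol))
    return accepted

print(generate(TRANSITIONS, 0, {4}, 6))
```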
  • Languages and Automata
    • Deterministic vs. Non-deterministic FSAs
    • Epsilon (ε) transitions
  • Using NFSAs to accept strings
    • Backup : add markers at choice points, then possibly revisit unexplored arcs at marked choice point.
    • Look-ahead : look ahead in input
    • Parallelism : look at alternatives in parallel
  • Using NFSAs
        State |          Input
              |   b    a     !    ε
          0   |   1    Ø     Ø    Ø
          1   |   Ø    2     Ø    Ø
          2   |   Ø    2,3   Ø    Ø
          3   |   Ø    Ø     4    Ø
          4:  |   Ø    Ø     Ø    Ø
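Of the three strategies for non-deterministic machines, parallelism is the shortest to sketch: keep the set of all states the machine could be in after each symbol. The code below is an illustration under that choice (it is not the backup-based search), it handles no ε-transitions because the table above has none, and the names `NFSA` and `nd_recognize` are made up for the example.

```python
# Non-deterministic table: each cell holds a set of possible next states.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the choice point: stay in 2 or move on to 3
    (3, "!"): {4},
}

def nd_recognize(tape, transitions=NFSA, start=0, finals=frozenset({4})):
    """Accept if at least one path through the machine ends in a final state."""
    current = {start}
    for symbol in tape:
        # Follow every possible arc from every state we might be in.
        current = {
            target
            for state in current
            for target in transitions.get((state, symbol), set())
        }
        if not current:
            return False  # every alternative has died out
    return bool(current & finals)

for tape in ["baa!", "baaaa!", "ba!"]:
    print(tape, "accept" if nd_recognize(tape) else "reject")
```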
  • Readings for next time
    • J&M Chapter 3