Lec03

  1. Speech and Language Processing, Lecture 03: Chapters 2 and 3 of SLP
  2. Today
     • Finite-state methods
     • English Morphology
     • Finite-State Transducers
  3. Regular Expressions and Text Searching
     • Everybody does it: Emacs, vi, grep, Microsoft Word, etc.
     • Regular expressions are a compact textual representation of a set of strings, i.e. of a language.
     • Regular expressions are used for specifying text strings in all sorts of text-processing and information-extraction applications.
  4. Regular Expressions and Text Searching
     • A string is a sequence of symbols.
     • For most text-based search techniques, a string is any sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation).
     • A regular expression is an algebraic notation for characterizing a set of strings.
  5. Regular Expressions (figure only)
  6. Example
     • Find all the instances of the word "the" in a text:
       - /the/
       - /[tT]he/
       - /\b[tT]he\b/
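     A minimal sketch (using Python's re module) of how the three patterns above behave; the sample text is illustrative:

         import re

         text = "The other day the thief hid their theme in the theater."

         print(re.findall(r"the", text))         # overmatches: other, their, theme, ...
         print(re.findall(r"[tT]he", text))      # now also catches "The", still overmatches
         print(re.findall(r"\b[tT]he\b", text))  # word boundaries: only "the"/"The"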
  7. Errors
     • The process we just went through was based on fixing two kinds of errors:
       - Matching strings that we should not have matched (there, then, other): false positives (Type I)
       - Not matching things that we should have matched (The): false negatives (Type II)
  8. Errors
     • Reducing the error rate for an application often involves two antagonistic efforts:
       - Increasing accuracy, or precision (minimizing false positives)
       - Increasing coverage, or recall (minimizing false negatives)
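     A minimal sketch of the two measures, computed from raw match counts; the counts in the example are made up for illustration:

         def precision(true_pos: int, false_pos: int) -> float:
             # Fraction of the matches we returned that were correct.
             return true_pos / (true_pos + false_pos)

         def recall(true_pos: int, false_neg: int) -> float:
             # Fraction of the correct matches that we actually returned.
             return true_pos / (true_pos + false_neg)

         # e.g. /the/ on a toy text: 10 real hits, 4 spurious, 2 missed
         print(precision(10, 4))  # 0.714...
         print(recall(10, 2))     # 0.833...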
  9. Finite-State Automata
     • Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata (FSAs).
     • FSAs and their probabilistic relatives capture significant aspects of what linguists say we need for morphology and parts of syntax.
  10. FSAs as Graphs
     • Let's start with the sheep language from Chapter 2: /baa+!/
     • A directed graph is a finite set of vertices (nodes), together with a set of directed links between pairs of vertices called arcs.
  11. Sheep FSA
     • We can say the following things about this machine:
       - It has 5 states
       - b, a, and ! are in its alphabet
       - q0 is the start state
       - q4 is an accept state
       - It has 5 transitions
  12. More Formally
     • You can specify an FSA by enumerating the following things:
       - The set of states: Q
       - A finite alphabet: Σ
       - A start state
       - A set of accept/final states
       - A transition function that maps Q × Σ to Q
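     A minimal sketch of those five components in Python, instantiated for the sheep FSA /baa+!/ (state names q0..q4 follow the earlier slide; the dictionary layout is illustrative):

         sheep_fsa = {
             "states":   {"q0", "q1", "q2", "q3", "q4"},
             "alphabet": {"b", "a", "!"},
             "start":    "q0",
             "accept":   {"q4"},
             # transition function: (state, symbol) -> next state
             "delta": {
                 ("q0", "b"): "q1",
                 ("q1", "a"): "q2",
                 ("q2", "a"): "q3",
                 ("q3", "a"): "q3",   # the a+ self-loop
                 ("q3", "!"): "q4",
             },
         }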
  13. About Alphabets
     • Don't take the term alphabet too narrowly; it just means we need a finite set of symbols in the input.
     • These symbols can and will stand for bigger objects that can have internal structure.
  14. Dollars and Cents (figure only)
  15. Yet Another View
     • The guts of FSAs can ultimately be represented as tables:

           State |  b  |  a   |  !  |  ε
           ------+-----+------+-----+-----
             0   |  1  |      |     |
             1   |     |  2   |     |
             2   |     | 2,3  |     |
             3   |     |      |  4  |
             4   |     |      |     |

     • If you're in state 1 and you're looking at an a, go to state 2.
  16. Recognition
     • Recognition is the process of determining if a string should be accepted by a machine.
     • Or... it's the process of determining if a string is in the language we're defining with the machine.
     • Or... it's the process of determining if a regular expression matches a string.
     • Those all amount to the same thing in the end.
  17. Recognition
     • Traditionally (Turing's notion), this process is depicted with a tape.
  18. Recognition
     • Recognition is simply a process of:
       - Starting in the start state
       - Examining the current input
       - Consulting the table
       - Going to a new state and updating the tape pointer
     • ... until you run out of tape.
  19. D-Recognize (figure only: the D-RECOGNIZE algorithm)
  20. Key Points
     • Deterministic means that at each point in processing there is always one unique thing to do (no choices).
     • D-recognize is a simple table-driven interpreter.
     • The algorithm is universal for all unambiguous regular languages: to change the machine, you simply change the table.
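     A minimal sketch of a table-driven D-RECOGNIZE in Python; the transition table is the deterministic sheep FSA /baa+!/ from the earlier slides, and the function name is illustrative:

         DELTA = {("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q3",
                  ("q3", "a"): "q3", ("q3", "!"): "q4"}
         START, ACCEPT = "q0", {"q4"}

         def d_recognize(tape: str) -> bool:
             state = START
             for symbol in tape:
                 if (state, symbol) not in DELTA:
                     return False                  # no legal transition: reject
                 state = DELTA[(state, symbol)]
             return state in ACCEPT                # out of tape: in an accept state?

         print(d_recognize("baa!"), d_recognize("baaaa!"), d_recognize("ba!"))
         # True True False

     To change the machine, you change only DELTA, START, and ACCEPT; the interpreter loop stays the same.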
  21. Recognition as Search
     • You can view this algorithm as a trivial kind of state-space search.
     • States are pairings of tape positions and state numbers.
     • Operators are compiled into the table.
     • The goal state is a pairing of the end-of-tape position and a final accept state.
  22. Generative Formalisms
     • Formal languages are sets of strings composed of symbols from a finite set of symbols.
     • Finite-state automata define formal languages (without having to enumerate all the strings in the language).
     • The term generative is based on the view that you can run the machine as a generator to get strings from the language.
  23. Generative Formalisms
     • FSAs can be viewed from two perspectives:
       - Acceptors that can tell you if a string is in the language
       - Generators that produce all and only the strings in the language
  24. Non-Determinism (figure only: DFSA vs. NDFSA)
  25. Non-Determinism cont.
     • Yet another technique: epsilon transitions.
     • Key point: these transitions do not examine or advance the tape during recognition.
  26. Equivalence
     • Non-deterministic machines can be converted to deterministic ones with a fairly simple construction.
     • That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones in terms of the languages they can accept.
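     A minimal sketch of that construction (the classic subset construction, here ignoring epsilon transitions; names are illustrative): each state of the new deterministic machine is a set of states of the non-deterministic one.

         def determinize(nd_delta, start, alphabet):
             """nd_delta: {(state, symbol): set of next states}; returns a DFA table."""
             d_delta, agenda, seen = {}, [frozenset({start})], set()
             while agenda:
                 state_set = agenda.pop()
                 if state_set in seen:
                     continue
                 seen.add(state_set)
                 for sym in alphabet:
                     # union of everywhere the ND machine could go on sym
                     nxt = frozenset(q for s in state_set
                                     for q in nd_delta.get((s, sym), ()))
                     if nxt:
                         d_delta[(state_set, sym)] = nxt
                         agenda.append(nxt)
             return d_delta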
  27. ND Recognition
     • Two basic approaches (used in all major implementations of regular expressions; see Friedl 2006):
       1. Take an ND machine and convert it to a D machine, then do recognition with that.
       2. Explicitly manage the process of recognition as a state-space search (leaving the machine as is).
  28. Non-Deterministic Recognition: Search
     • In an ND FSA there exists at least one path through the machine for any string that is in the language defined by the machine.
     • But not all paths through the machine for an acceptable string lead to an accept state.
     • No path through the machine leads to an accept state for a string not in the language.
  29. Non-Deterministic Recognition
     • So success in non-deterministic recognition occurs when a path is found through the machine that ends in an accept state.
     • Failure occurs when all of the possible paths for a given string lead to failure.
  30. Example
     • Input tape:      b    a    a    a    !
     • States visited:  q0   q1   q2   q2   q3   q4
  31.-38. Example (figure-only slides: step-by-step trace of the search)
  39. Key Points
     • States in the search space are pairings of tape positions and states in the machine.
     • By keeping track of as-yet-unexplored states, a recognizer can systematically explore all the paths through the machine given an input.
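     A minimal sketch of that search in Python: search states are (machine state, tape position) pairs kept on an agenda of unexplored states. The table is the ND sheep FSA, in which state q2 can go to either q2 or q3 on an a; the function name is illustrative.

         ND_DELTA = {("q0", "b"): {"q1"}, ("q1", "a"): {"q2"},
                     ("q2", "a"): {"q2", "q3"}, ("q3", "!"): {"q4"}}
         START, ACCEPT = "q0", {"q4"}

         def nd_recognize(tape: str) -> bool:
             agenda = [(START, 0)]                 # unexplored (state, position) pairs
             while agenda:
                 state, pos = agenda.pop()         # pop from the end: depth-first
                 if pos == len(tape):
                     if state in ACCEPT:
                         return True               # one path ending in accept suffices
                     continue
                 for nxt in ND_DELTA.get((state, tape[pos]), ()):
                     agenda.append((nxt, pos + 1))
             return False                          # every path failed

         print(nd_recognize("baaa!"), nd_recognize("ba!"))   # True False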
  40. Why Bother?
     • Non-determinism doesn't get us more formal power, and it causes headaches, so why bother?
       - More natural (understandable) solutions
  41. Today
     • Finite-state methods
     • English Morphology
     • Finite-State Transducers
  42. Words
     • Finite-state methods are particularly useful in dealing with a lexicon.
     • Many devices, most with limited memory, need access to large lists of words.
     • And they need to perform fairly sophisticated tasks with those lists.
     • So we'll first talk about some facts about words and then come back to computational methods.
  43. English Morphology
     • Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes.
     • We can usefully divide morphemes into two classes:
       - Stems: the core meaning-bearing units
       - Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
  44. English Morphology
     • We can further divide morphology into two broad classes:
       - Inflectional
       - Derivational
  45. Inflectional Morphology
     • Inflectional morphology concerns the combination of stems and affixes where the resulting word:
       - Has the same word class as the original
       - Serves a grammatical/semantic purpose that is different from the original, but nevertheless transparently related to it
  46. Nouns and Verbs in English
     • Nouns are simple: markers for plural and possessive.
     • Verbs are only slightly more complex: markers appropriate to the tense of the verb.
  47. Regulars and Irregulars
     • Things are a little complicated by the fact that some words misbehave (refuse to follow the rules):
       - Mouse/mice, goose/geese, ox/oxen
       - Go/went, fly/flew
     • The terms regular and irregular are used to refer to words that follow the rules and those that don't.
  48. Regular and Irregular Verbs
     • Regulars:
       - Walk, walks, walking, walked, walked
     • Irregulars:
       - Eat, eats, eating, ate, eaten
       - Catch, catches, catching, caught, caught
       - Cut, cuts, cutting, cut, cut
  49. Inflectional Morphology
     • So inflectional morphology in English is fairly straightforward.
     • But it is complicated by the fact that there are irregularities.
  50. Derivational Morphology
     • Derivational morphology is the messy stuff that no one ever taught you:
       - Quasi-systematicity
       - Irregular meaning change
       - Changes of word class
  51. Derivational Examples
     • Verbs and adjectives to nouns:

           -ation   computerize   computerization
           -ee      appoint       appointee
           -er      kill          killer
           -ness    fuzzy         fuzziness
  52. Derivational Examples
     • Nouns and verbs to adjectives:

           -al      computation   computational
           -able    embrace       embraceable
           -less    clue          clueless
  53. Example: Compute
     • Many paths are possible...
     • Start with compute:
       - computer -> computerize -> computerization
       - computer -> computerize -> computerizable
     • But not all paths/operations are equally good (allowable?):
       - clue
       - clue -> *clueable
  54. Morphology and FSAs
     • We'd like to use the machinery provided by FSAs to capture these facts about morphology:
       - Accept strings that are in the language
       - Reject strings that are not
       - And do so in a way that doesn't require us to, in effect, list all the words in the language
  55. Start Simple
     • Regular singular nouns are ok
     • Regular plural nouns have an -s on the end
     • Irregulars are ok as is
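     A minimal sketch of those three rules in Python; the mini-lexicons and function name are illustrative stand-ins, not the lexicon from the slides:

         reg_noun = {"cat", "dog", "fox"}
         irreg_sg = {"goose", "mouse"}
         irreg_pl = {"geese", "mice"}

         def accepts(word: str) -> bool:
             if word in reg_noun | irreg_sg | irreg_pl:
                 return True               # singulars and irregulars are ok as is
             # regular plural: stem + s (naive; the fox/foxes spelling
             # change is handled later in the lecture)
             return word.endswith("s") and word[:-1] in reg_noun

         print(accepts("cats"), accepts("geese"), accepts("gooses"))  # True True False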
  56. Simple Rules (figure only)
  57. Now Plug in the Words (figure only)
  58. Derivational Rules (figure only)
     • If everything is an accept state, how do things ever get rejected?
  59. Parsing/Generation vs. Recognition
     • We can now run strings through these machines to recognize strings in the language.
     • But recognition is usually not quite what we need:
       - Often if we find some string in the language, we might like to assign a structure to it (parsing)
       - Or we might have some structure and want to produce a surface form for it (production/generation)
     • Example: from "cats" to "cat +N +PL"
  60. Today
     • Finite-state methods
     • English Morphology
     • Finite-State Transducers
  61. Finite-State Transducers
     • The simple story:
       - Add another tape
       - Add extra symbols to the transitions
       - On one tape we read "cats"; on the other we write "cat +N +PL"
  62. FSTs (figure only)
  63. Applications
     • The kind of parsing we're talking about is normally called morphological analysis.
     • It can either be:
       - An important stand-alone component of many applications (spelling correction, information retrieval)
       - Or simply a link in a chain of further linguistic analysis
  64. Transitions
     • c:c  a:a  t:t  +N:ε  +PL:s
     • c:c means read a c on one tape and write a c on the other
     • +N:ε means read a +N symbol on one tape and write nothing on the other
     • +PL:s means read +PL and write an s
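     A minimal sketch of that transition chain in Python, with the empty string standing in for ε. Since this toy machine is a single linear path, a plain zip suffices; the names are illustrative.

         # one (read, write) pair per transition of the cat machine
         CAT_FST = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+PL", "s")]

         def generate(lexical):
             """Read lexical symbols on one tape, write surface symbols on the other."""
             out = []
             for symbol, (read, write) in zip(lexical, CAT_FST):
                 assert symbol == read, f"no transition for {symbol}"
                 out.append(write)
             return "".join(out)

         print(generate(["c", "a", "t", "+N", "+PL"]))   # cats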
  65. Typical Uses
     • Typically, we'll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).
     • And we'll write to the second tape using the other symbols on the transitions.
  66. Ambiguity
     • Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
     • It didn't matter which path was actually traversed.
     • In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result.
  67. Ambiguity
     • What's the right parse (segmentation) for unionizable?
       - Union-ize-able
       - Un-ion-ize-able
     • Each represents a valid path through the derivational morphology machine.
  68. Ambiguity
     • There are a number of ways to deal with this problem:
       - Simply take the first output found
       - Find all the possible outputs (all paths) and return them all (without choosing)
       - Bias the search so that only one or a few likely paths are explored
  69. The Gory Details
     • Of course, it's not as easy as "cat +N +PL" <-> "cats"
     • As we saw earlier, there are geese, mice and oxen
     • But there is also a whole host of spelling/pronunciation changes that go along with inflectional changes:
       - Cats vs. dogs
       - Fox and foxes
  70. Multi-Tape Machines
     • To deal with these complications, we will add more tapes and use the output of one tape machine as the input to the next.
     • So to handle irregular spelling changes we'll add intermediate tapes with intermediate symbols.
  71. Multi-Level Tape Machines
     • We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape.
  72. Lexical to Intermediate Level (figure only)
  73. Intermediate to Surface
     • The "add an e" rule, as in fox^s# <-> foxes#
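     A minimal sketch of that e-insertion rule as a rewrite on the intermediate tape, where ^ is the morpheme boundary and # the word boundary (simplified: the full rule also covers stems ending in sh and ch; the function name is illustrative):

         import re

         def e_insertion(intermediate: str) -> str:
             # insert an e between a stem-final x, s, or z and a plural s
             surface = re.sub(r"([xsz])\^s#", r"\1es#", intermediate)
             return surface.replace("^", "")   # boundaries vanish everywhere else

         print(e_insertion("fox^s#"))   # foxes#
         print(e_insertion("cat^s#"))   # cats#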
  74. Foxes (figure only)
  75. Note
     • A key feature of this machine is that it doesn't do anything to inputs to which it doesn't apply.
     • Meaning that they are written out unchanged to the output tape.
  76. Overall Scheme
     • We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity):
       - Lexical level to intermediate forms
     • We have a larger set of machines that capture orthographic/spelling rules:
       - Intermediate forms to surface forms
  77. Overall Scheme (figure only)
  78. Cascades
     • This is an architecture that we'll see again and again:
       - Overall processing is divided up into distinct rewrite steps
       - The output of one layer serves as the input to the next
       - The intermediate tapes may or may not wind up being useful in their own right
  79. Overall Plan (figure only)
  80. Final Scheme (figure only)
  81. Composition
     • Create a set of new states that correspond to each pair of states from the original machines (new states are called (x,y), where x is a state from M1 and y is a state from M2).
     • Create a new FST transition table for the new machine according to the following intuition...
  82. Composition
     • There should be a transition between two states in the new machine if the output of a transition from a state of M1 is the same as the input to a transition from a state of M2, or...
  83. Composition
     • δ3((xa, ya), i:o) = (xb, yb) iff there exists a c such that δ1(xa, i:c) = xb and δ2(ya, c:o) = yb
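     A minimal sketch of that definition in Python, for machines given as dictionaries mapping (state, (in_symbol, out_symbol)) to a next state; the composed machine pairs up states and links an i:c transition in M1 with a c:o transition in M2 (names and the toy machines are illustrative):

         from itertools import product

         def compose(delta1, delta2):
             delta3 = {}
             for ((xa, (i, c1)), xb), ((ya, (c2, o)), yb) in product(
                     delta1.items(), delta2.items()):
                 if c1 == c2:              # M1's output feeds M2's input
                     delta3[((xa, ya), (i, o))] = (xb, yb)
             return delta3

         # tiny illustrative machines: a:b in M1 composed with b:c in M2
         d1 = {("p0", ("a", "b")): "p1"}
         d2 = {("r0", ("b", "c")): "r1"}
         print(compose(d1, d2))   # {(('p0', 'r0'), ('a', 'c')): ('p1', 'r1')}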
