NLP (Fall 2013): Syntax & Parsing, Context-Free Grammars, Cocke-Younger-Kasami (CYK) Algorithm, Earley Algorithm: Input Representation & Dotted Rules
 

    Presentation Transcript

    • Natural Language Processing: Syntax & Parsing, Context-Free Grammars, Cocke-Younger-Kasami (CYK) & Earley Algorithms. Vladimir Kulyukin
    • Outline: Syntax & Parsing; Context-Free Grammars (definition, epsilon productions, useful & useless symbols, Chomsky Normal Form (CNF)); more efficient parsing approaches: Cocke-Younger-Kasami algorithm; Earley algorithm (input representation & dotted rules)
    • Syntax & Parsing. Syntax, in the NLP context, refers to the study of sentence or text structure. Parsing is the process of assigning a parse tree to a string. A grammar is required to generate parse trees. Grammars for natural languages consist of syntactic categories and parts of speech.
    • Context-Free Grammars
    • Context-Free Grammar (CFG): Definition. A CFG G is a 4-tuple G = (V, Σ, S, P), where V is the nonterminal alphabet; Σ is the terminal alphabet; S is the start symbol; P is a finite set of productions; each production is of the form X → y, where X ∈ V and y is a string over V ∪ Σ.
    • A Sample Context-Free Grammar: S → NP VP; S → AUX NP VP; S → VP; NP → DET NOMINAL; NOMINAL → NOUN; NOMINAL → NOUN NOMINAL; NP → ProperNoun; NP → VERB; DET → that | this | a; NOUN → left; AUX → does; VERB → make; PREP → from | to | on; ProperNoun → USU; NOMINAL → NOMINAL PP; PP → PREP NP
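      For readers who want to experiment with the later algorithms, a grammar like this can be written down directly as data. The minimal Python encoding below is an illustrative choice of representation (not part of the slides); it maps each nonterminal to the list of its right-hand sides.

        # Hypothetical Python encoding of the sample grammar above.
        # Each key is a nonterminal; each value lists its right-hand sides.
        SAMPLE_GRAMMAR = {
            "S":          [["NP", "VP"], ["AUX", "NP", "VP"], ["VP"]],
            "NP":         [["DET", "NOMINAL"], ["ProperNoun"], ["VERB"]],
            "NOMINAL":    [["NOUN"], ["NOUN", "NOMINAL"], ["NOMINAL", "PP"]],
            "PP":         [["PREP", "NP"]],
            "DET":        [["that"], ["this"], ["a"]],
            "NOUN":       [["left"]],
            "AUX":        [["does"]],
            "VERB":       [["make"]],
            "PREP":       [["from"], ["to"], ["on"]],
            "ProperNoun": [["USU"]],
        }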
    • Formal Context-Free Languages
    • Example 01. Claim: Let L = { a^n b^n | n ≥ 0 }. Show that L is context-free. Proof: Consider the following CFG: S → aSb | ε. By induction on n, show that a^n b^n is derived from S.
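      A concrete instance of the induction (added here for illustration): for n = 2, S ⇒ aSb ⇒ aaSbb ⇒ aabb, i.e., two applications of S → aSb followed by one application of S → ε.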
    • Example 02. Claim: Let L = { a^n b^(3n) | n ≥ 0 }. Show that L is context-free. Proof: Consider the following CFG: S → aSbbb | ε. By induction on n, show that a^n b^(3n) is derived from S.
    • Example 03: CF Language Concatenation. Claim: Let L = { a^n b^n c^m d^m | n ≥ 0, m ≥ 0 }. L is context-free. Proof: Let L1 = { a^n b^n | n ≥ 0 } and L2 = { c^m d^m | m ≥ 0 }. By Example 01, both L1 and L2 are context-free. Observe that L = L1 L2. Let G1 be a CFG for L1: S1 → aS1b | ε. Let G2 be a CFG for L2: S2 → cS2d | ε. Now we can construct a CFG G for L = L1 L2: S → S1 S2; S1 → aS1b | ε; S2 → cS2d | ε.
    • Example 04. Claim: Let L1 and L2 be context-free languages and let L = L1 ∪ L2. Then L is context-free. Proof: Since L1 and L2 are context-free, there exist two context-free grammars G1 and G2 for L1 and L2, respectively. Let S1 and S2 be the start symbols of G1 and G2, respectively. Then we can construct a CFG for L = L1 ∪ L2 by adding S → S1 | S2 to the rules of G1 and G2.
    • Example 05. Claim: L = { u a v | u, v ∈ {a, b}*, |u| = |v| = k, k ∈ N }, i.e., the strings of odd length 2k + 1 whose middle symbol is a, is context-free. Proof: Let G be the following CFG: S → XSX | a; X → a | b. Then L(G) = L, by induction on the length of strings in L.
    • Useful & Useless Symbols in CFGs
    • Useful & Useless Symbols. Let G = (V, T, P, S) be a CFG. A symbol X is useful if there is a derivation S ⇒* αXβ ⇒* w for some α, β in (V ∪ T)* and some w in T*. A symbol X is useless if there is no such derivation.
    • Example: Useful & Useless Symbols. Suppose CFG G has the following productions: S → AB | a; A → a. A and B are useless symbols; S is a useful symbol.
    • Elimination of Symbols that Do Not Derive Terminal Strings. Lemma 1: Let G = (V, T, P, S) be a CFG and L(G) ≠ ∅. There exists an equivalent CFG G' = (V', T, P', S) such that for each A in V' there is some w in T* for which A ⇒* w.
    • Computation of V'
      OLDVars = { };
      // Put into NEWVars those variables that derive terminals in one step.
      NEWVars = { A | A → w for some w in T* };
      // Keep looping until no more variables can be added to NEWVars.
      while OLDVars != NEWVars {
        OLDVars = NEWVars;
        // Put into NEWVars only those variables on the left-hand side of
        // grammar rules whose right-hand side consists of terminals
        // or variables in OLDVars.
        NEWVars = OLDVars U { A | A → α, α in (T U OLDVars)* };
      }
      V' = NEWVars; // V' contains only those variables that generate terminal strings
      P' is the set of productions in P whose symbols are in V' U T
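      A minimal Python sketch of this fixpoint computation (the representation of productions as (lhs, rhs) pairs and the function name are illustrative assumptions, not from the slides):

        def generating_variables(productions, terminals):
            """Return the set of variables that derive some terminal string.

            productions: iterable of (lhs, rhs) pairs, rhs a tuple of symbols.
            terminals:   set of terminal symbols.
            """
            new_vars = set()
            while True:
                old_vars = set(new_vars)
                for lhs, rhs in productions:
                    # lhs derives a terminal string if every right-hand-side symbol
                    # is a terminal or is already known to derive one.
                    if all(s in terminals or s in old_vars for s in rhs):
                        new_vars.add(lhs)
                if new_vars == old_vars:
                    return new_vars

        # Example grammar from the previous slide: S -> AB | a, A -> a.
        # B never derives a terminal string, so only S and A are returned.
        prods = [("S", ("A", "B")), ("S", ("a",)), ("A", ("a",))]
        print(generating_variables(prods, {"a"}))   # {'S', 'A'} (order may vary)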
    • Elimination of Useless Symbols & ɛ-Productions Every non-empty context-free language that does not contain ɛ can be generated by a grammar with no useless symbols or ɛ-productions
    • Chomsky Normal Form (CNF). A grammar G = (V, T, S, P) is said to be in Chomsky Normal Form (CNF) if each production in P has one of the following forms: 1) A → BC; 2) A → a, where A, B, C are in V and a is in T.
    • CNF Theorem Let G be a grammar with no useless symbols and no ε-productions. There is a CNF grammar G’ such that L(G) = L(G’).
    • CNF Theorem: Proof. The set of new productions P' for G' is obtained in 3 stages. Stage 1: Iterate through the productions P of grammar G, remove all productions of the form A → a or A → BC, and place them into P', because these productions are already in CNF.
    • CNF Theorem: Proof. Stage 2: For each production A → α, where α is a string over the set of terminals and non-terminals of length at least 2, do the following: for each symbol s in α, if s is a terminal, add Cs → s to P'; rewrite A → α as A → α', where α' is obtained from α by replacing each terminal s in α with Cs.
    • CNF Theorem: Proof. Note that before stage 3 begins, all remaining productions are of the form A → B1 … Bk, where k ≥ 2. Stage 3: Iterate through the remaining productions. If there is a production A → B1B2, place it into P', because it is already in CNF. For each production A → B1 … Bk, where k > 2, place into P' the following productions: A → B1D1; D1 → B2D2; …; Dk-2 → Bk-1Bk. For example, if A → B1B2B3, then we would add to P' the productions A → B1D1 and D1 → B2B3.
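      A rough Python sketch of stages 2 and 3, assuming productions are (lhs, rhs) pairs with rhs a tuple of symbols; the helper names and fresh-variable naming scheme are illustrative, not from the slides:

        def to_cnf_long_rules(productions, terminals):
            """Stages 2 and 3 of the CNF construction sketched above."""
            new_prods = []
            fresh = 0
            for lhs, rhs in productions:
                if len(rhs) < 2:                          # A -> a rules pass through unchanged (stage 1)
                    new_prods.append((lhs, rhs))
                    continue
                # Stage 2: replace every terminal s in a long right-hand side with C_s.
                rhs2 = []
                for s in rhs:
                    if s in terminals:
                        new_prods.append(("C_" + s, (s,)))   # C_s -> s
                        rhs2.append("C_" + s)
                    else:
                        rhs2.append(s)
                # Stage 3: binarize A -> B1 B2 ... Bk (k > 2) into a chain of binary rules.
                while len(rhs2) > 2:
                    fresh += 1
                    d = "D" + str(fresh)
                    new_prods.append((lhs, (rhs2[0], d)))    # A -> B1 D
                    lhs, rhs2 = d, rhs2[1:]                  # continue with D -> B2 ... Bk
                new_prods.append((lhs, tuple(rhs2)))
            return list(dict.fromkeys(new_prods))            # drop duplicate C_s -> s rules

      Applied to the sample CFG on the following slides, this reproduces the CNF grammar shown in the conversion result, up to the names chosen for C_a, C_b, D1, D2.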
    • CNF Conversion Example
    • Sample CFG: 1. S → bA; 2. S → aB; 3. A → bAA; 4. A → aS; 5. A → a; 6. B → aBB; 7. B → bS; 8. B → b
    • CNF Conversion: Stage 1. We iterate through the productions and see which ones are already in CNF. There are only two, so we place them into P': A → a; B → b.
    • CNF Conversion: Stage 2. For S → bA, add Cb → b to P'; rewrite S → bA as S → CbA. For S → aB, add Ca → a to P'; rewrite S → aB as S → CaB. For A → bAA, rewrite A → bAA as A → CbAA. For A → aS, rewrite A → aS as A → CaS. For B → aBB, rewrite B → aBB as B → CaBB. For B → bS, rewrite B → bS as B → CbS.
    • Grammar after Stages 1 & 2: 1. S → CbA // rewritten at stage 2; 2. S → CaB // rewritten at stage 2; 3. A → CbAA // rewritten at stage 2; 4. A → CaS // rewritten at stage 2; 5. A → a // removed at stage 1; 6. B → CaBB // rewritten at stage 2; 7. B → CbS // rewritten at stage 2; 8. B → b // removed at stage 1
    • CNF Conversion: Stage 3. Add productions 1, 2, 4, 7 to P', because they are in CNF. Replace A → CbAA (production 3) with A → CbD1 and D1 → AA. Replace B → CaBB (production 6) with B → CaD2 and D2 → BB.
    • CNF Conversion: Result. CNF Grammar: S → CbA | CaB; A → CaS | CbD1 | a; B → CbS | CaD2 | b; D1 → AA; D2 → BB; Ca → a; Cb → b
    • Cocke-Younger-Kasami (CYK) Algorithm
    • CYK Algorithm's Problem. Problem: Given a CFG G = (V, T, P, S) and a string x in T*, determine whether x is in L(G). The Cocke-Younger-Kasami (CYK) algorithm takes a CFG in CNF and a string x and determines if S is one of the symbols that derive x.
    • Substring Notation xsl. Let x be a string such that |x| = n ≥ 1. Let xsl be the substring of x of length l that starts at position s, 1 ≤ s ≤ n and 1 ≤ l ≤ n. For example, if x = aabbabb, then x13 = aab = x[1]x[2]x[3] and x24 = abba = x[2]x[3]x[4]x[5]. In general, if we do 1-based array indexing and the length of the substring is l, the last available position s at which the substring can start is n – l + 1. For example, if |x| = 4 and l = 2, the possible values for s in xs2 are 1, 2, and 3 = 4 – 2 + 1.
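      In a 0-indexed language such as Python, the 1-based substring xsl corresponds to the slice x[s-1 : s-1+l]. A small illustrative check (not from the slides):

        x = "aabbabb"
        def sub(x, s, l):
            # xsl: the length-l substring of x starting at 1-based position s
            return x[s - 1 : s - 1 + l]
        print(sub(x, 1, 3))   # 'aab'  (x13)
        print(sub(x, 2, 4))   # 'abba' (x24)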
    • CYK Algorithm: Basic Insight. (Diagram: the substring xsl, starting at position s, is split at position s + k into xsk, derived by B, and x(s+k)(l-k), derived by C.) A ⇒* xsl iff 1) A → BC; 2) B ⇒* xsk; 3) C ⇒* x(s+k)(l-k), for some k, 1 ≤ k < l. In other words, to determine if A ⇒* xsl, there must be a rule A → BC and some k, 1 ≤ k < l, for which B ⇒* xsk and C ⇒* x(s+k)(l-k).
    • Table D[s, l]. CYK is a dynamic programming algorithm that, given a CNF grammar G = (V, T, S, P) and a string x over a specific alphabet such that |x| = n > 0, incrementally builds an n x n table D (D stands for 'derives'). D[s, l] is a set, possibly empty, of symbols A in V such that A ⇒* xsl. In other words, D[s, l] records all variables in G that derive xsl.
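      Written as a recurrence over the table (added here for reference; it restates the basic insight above in terms of D):
        D[s, 1] = { A in V | A → x[s] is in P }
        D[s, l] = { A in V | A → BC is in P, B in D[s, k], C in D[s+k, l-k], for some 1 ≤ k < l },  for l ≥ 2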
    • D[s, l] Initialization. Let G = (V, T, S, P) be a CNF grammar and x be a string such that |x| = n > 0. Let xsl be the substring of x of length l that starts at position s. If l = 1, then, for each 1 ≤ s ≤ n, we can check if xs1 can be derived directly from some variable A of G. How? By checking if G has a production A → xs1.
    • D[s, l] Initialization. Assume that our CNF grammar is as follows: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a. Assume that the input is x = baaba. What does D[s, l] look like?
    • 5 x 5 D[s, l]: an empty 5 x 5 table, with columns indexed by the start position s = 1, …, 5 and rows by the substring length l = 1, …, 5.
    • Computing D[1,1]. The input is x = baaba. The 1st symbol of the input is b. Thus, D[1,1] = {A | A → b}, where A is in V. There is only one production that qualifies: B → b. So D[1,1] = {B}. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row: D[1,1] = {B}; all other cells are still empty.
    • Computing D[2,1]. The input is x = baaba. The 2nd symbol of the input is a. We compute {A | A → a}, where A is in V. There are two such productions: A → a, C → a. So D[2,1] = {A, C}. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row (s = 1, 2): {B}, {A, C}.
    • Computing D[3,1]. The input is x = baaba. The 3rd symbol of the input is a. We compute {A | A → a}, where A is in V. There are two such productions: A → a, C → a. So D[3,1] = {A, C}. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row (s = 1, 2, 3): {B}, {A, C}, {A, C}.
    • Computing D[4,1]. The input is x = baaba. The 4th symbol of the input is b. Thus, D[4,1] = {A | A → b}, where A is in V. There is only one production that qualifies: B → b. So D[4,1] = {B}. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row (s = 1, …, 4): {B}, {A, C}, {A, C}, {B}.
    • Computing D[5,1]. The input is x = baaba. The 5th symbol of the input is a. We compute {A | A → a}, where A is in V. There are two such productions: A → a and C → a. So D[5,1] = {A, C}. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row (s = 1, …, 5): {B}, {A, C}, {A, C}, {B}, {A, C}.
    • Computing D[1,2]. We need to find k, such that 1 ≤ k < 2, and look for productions A → BC where B is in D[1,1] and C is in D[2,1]. Since D[1,1] = {B} and D[2,1] = {A, C}, the possibilities for the right-hand sides are {B} x {A, C} = {BA, BC}. The rules that match these possibilities are S → BC and A → BA. So D[1,2] = {S, A}. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row: {B}, {A, C}, {A, C}, {B}, {A, C}; l = 2 row (s = 1): {S, A}.
    • Computing D[2,2]. We need to find k, such that 1 ≤ k < 2, and the rules A → BC, where B is in D[2,1] and C is in D[3,1]. Since D[2,1] = D[3,1] = {A, C}, the right-hand side possibilities are AA, AC, CA, CC. There is only one rule that qualifies: B → CC. So D[2,2] = {B}. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row: {B}, {A, C}, {A, C}, {B}, {A, C}; l = 2 row (s = 1, 2): {S, A}, {B}.
    • Computing D[3,2]. We look for k, such that 1 ≤ k < 2, and rules of the form A → BC, where B is in D[3,1] and C is in D[4,1]. D[3,1] = {A, C} and D[4,1] = {B}. So the right-hand side (RHS) possibilities are AB, CB. The rules whose RHS's match these possibilities are S → AB and C → AB. So D[3,2] = {S, C}. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row: {B}, {A, C}, {A, C}, {B}, {A, C}; l = 2 row (s = 1, 2, 3): {S, A}, {B}, {S, C}.
    • Computing D[4,2]. We look for k, such that 1 ≤ k < 2, and rules of the form A → BC, where B is in D[4,1] and C is in D[5,1]. D[4,1] = {B}; D[5,1] = {A, C}. So the RHS possibilities are BA and BC. The rules whose RHS's match these possibilities are S → BC and A → BA. So D[4,2] = {S, A}. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row: {B}, {A, C}, {A, C}, {B}, {A, C}; l = 2 row (s = 1, …, 4): {S, A}, {B}, {S, C}, {S, A}.
    • Computing D[1,3]. We look for k, such that 1 ≤ k < 3, and rules of the form A → BC where, for k = 1, B is in D[1,1] and C is in D[2,2], or, for k = 2, B is in D[1,2] and C is in D[3,1]. For k = 1, D[1,1] = {B} and D[2,2] = {B}, so there is only one right-hand side possibility: BB. The grammar does not have any productions whose right-hand side is BB. For k = 2, D[1,2] = {S, A} and D[3,1] = {A, C}, so the RHS possibilities are SA, SC, AA, AC. The grammar does not have any productions whose RHS's are SA, SC, AA, or AC. So D[1,3] = { }. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row: {B}, {A, C}, {A, C}, {B}, {A, C}; l = 2 row: {S, A}, {B}, {S, C}, {S, A}; l = 3 row (s = 1): { }.
    • Computing D[2,3]. We look for k, such that 1 ≤ k < 3, and rules of the form A → BC where, if k = 1, B is in D[2,1] and C is in D[3,2], or, if k = 2, B is in D[2,2] and C is in D[4,1]. For k = 1, D[2,1] = {A, C} and D[3,2] = {S, C}. The RHS possibilities are AS, AC, CS, CC. The only rule that matches is B → CC. For k = 2, D[2,2] = {B} and D[4,1] = {B}. The only possibility is BB. No rules match. So D[2,3] = {B}. (G's productions: 1. S → AB | BC; 2. A → BA | a; 3. B → CC | b; 4. C → AB | a)
    • D[s, l] So Far: l = 1 row: {B}, {A, C}, {A, C}, {B}, {A, C}; l = 2 row: {S, A}, {B}, {S, C}, {S, A}; l = 3 row (s = 1, 2): { }, {B}.
    • Rest of D[s, l]:
      l = 1: D[1,1] = {B}, D[2,1] = {A, C}, D[3,1] = {A, C}, D[4,1] = {B}, D[5,1] = {A, C}
      l = 2: D[1,2] = {S, A}, D[2,2] = {B}, D[3,2] = {S, C}, D[4,2] = {S, A}
      l = 3: D[1,3] = { }, D[2,3] = {B}, D[3,3] = {B}
      l = 4: D[1,4] = { }, D[2,4] = {S, A, C}
      l = 5: D[1,5] = {S, A, C}
    • Is x = baaba Accepted? Yes, because D[1,5] contains S. It means that S ⇒* x15. In other words, the substring of x that starts at 1 and has a length of 5, i.e., all of x, is derivable from S.
    • CYK Algorithm: Pseudocode
      // Inputs are a string x such that |x| ≥ 1 and a CNF grammar G with no ε-productions
      CYK(String x, CNFGrammar G) {
        create an n x n table D, where n = |x|;
        for s from 1 upto n {
          D[s, 1] = {A | A → a is in G and a = x[s], i.e., a is the s-th symbol of x};
        }
        for l from 2 upto n {              // l ranges over all possible substring lengths
          for s from 1 upto n – l + 1 {    // s iterates over all possible substring starts
            D[s, l] = { };
            for k from 1 upto l – 1 {      // k iterates over all possible partition positions
              D[s, l] = D[s, l] U {A | A → BC is a production in G and B is in D[s, k] and C is in D[s+k, l-k]};
            }
          }
        }
        if ( S is in D[1, n] ) return true; else return false;
      }
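      A small runnable Python version of this pseudocode, using the CNF grammar and input from the walkthrough above. The dictionary-of-sets representation of the grammar is an illustrative choice, not something the slides prescribe.

        from collections import defaultdict

        def cyk(x, unary_rules, binary_rules, start="S"):
            """CYK recognizer for a CNF grammar with no epsilon-productions.

            unary_rules:  dict terminal -> set of variables A with A -> terminal
            binary_rules: dict (B, C)   -> set of variables A with A -> B C
            """
            n = len(x)
            # D[s, l] = set of variables deriving the length-l substring of x
            # starting at 1-based position s.
            D = defaultdict(set)
            for s in range(1, n + 1):
                D[s, 1] = set(unary_rules.get(x[s - 1], set()))
            for l in range(2, n + 1):                  # substring lengths
                for s in range(1, n - l + 2):          # substring start positions
                    for k in range(1, l):              # partition positions
                        for B in D[s, k]:
                            for C in D[s + k, l - k]:
                                D[s, l] |= binary_rules.get((B, C), set())
            return start in D[1, n]

        # Grammar from the walkthrough: S -> AB | BC, A -> BA | a, B -> CC | b, C -> AB | a
        unary = {"a": {"A", "C"}, "b": {"B"}}
        binary = {("A", "B"): {"S", "C"}, ("B", "C"): {"S"},
                  ("B", "A"): {"A"}, ("C", "C"): {"B"}}
        print(cyk("baaba", unary, binary))   # True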
    • How & Why CYK Works. CYK runs in O(n^3) time, where |x| = n > 0. Both k and l – k are strictly less than l. If we know whether each of the two smaller derivations exists (i.e., B ⇒* xsk and C ⇒* x(s+k)(l-k)), we can determine whether A ⇒* xsl via a rule A → BC. When we reach l = n, we can determine whether S ⇒* x1n.
    • Earley Algorithm
    • Bird's Eye Overview. The Earley Algorithm makes a single left-to-right pass through the input. As the algorithm passes through an input with N wordforms, it fills a 2D data structure called a chart. A chart can be thought of as a list of N+1 variable-length arrays. Each entry in the chart, e.g., in chart[i], is a state (called a dotted rule) that encodes: a grammar production; the position in the input where the production was fired; and how much of the input has been processed by the production.
    • Input & Input Positions MAKE A LEFT 0 1 2 3 INPUT: “MAKE A LEFT”
    • Relationship between Input & Dotted Rule. A dotted rule such as (B → E F * G, i, j) records the current state of the production, the input start i, and the input end j, with 0 ≤ i ≤ j ≤ N. For example, (A → B C D *, 0, N) spans the entire input from 0 to N.
    • Input & Dotted Rules: Examples. For the input "MAKE A LEFT" with positions 0 through 3: (S → * VP, 0, 0); (NP → DET * NOMINAL, 1, 2); (VP → Verb NP *, 0, 3).
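      A minimal sketch of how such dotted-rule states and the chart might be represented in Python; the field names and class layout are illustrative assumptions, since the slides do not prescribe a data structure.

        from dataclasses import dataclass
        from typing import Tuple

        @dataclass(frozen=True)
        class State:
            lhs: str                 # left-hand side of the production
            rhs: Tuple[str, ...]     # right-hand-side symbols
            dot: int                 # how many rhs symbols have been processed
            start: int               # input position where this production started
            end: int                 # input position processed so far

            def __str__(self):
                rhs = list(self.rhs)
                rhs.insert(self.dot, "*")
                return f"({self.lhs} -> {' '.join(rhs)}, {self.start}, {self.end})"

        # A chart for an N-word input is a list of N + 1 state lists.
        N = 3                                                    # "MAKE A LEFT"
        chart = [[] for _ in range(N + 1)]
        chart[0].append(State("S", ("VP",), 0, 0, 0))            # (S -> * VP, 0, 0)
        chart[3].append(State("VP", ("Verb", "NP"), 2, 0, 3))    # (VP -> Verb NP *, 0, 3)
        print(chart[0][0], chart[3][0])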
    • References & Reading Suggestions. Hopcroft and Ullman. Introduction to Automata Theory, Languages, and Computation. Narosa Publishing House. Moll, Arbib, and Kfoury. An Introduction to Formal Language Theory. Jurafsky & Martin. Speech & Language Processing. Prentice Hall. www.youtube.com/vkedco