Parsing using graphs

Transcript

  • 1. Parsing Needs New Abstractions (11/23/2011)
  • 2. Problem • Parsing of context-free languages – an active research topic from the 60's to the 80's – a rich variety of parsing techniques is known • general CFL parsing: Earley's algorithm, Cocke-Younger-Kasami (CYK) • deterministic parsing: SLL(k), LL(k), SLR(k), LR(k), LALR(k), LA(l)LR(k), ... • Problem: most of these techniques were invented by automata-theory people – the terminology is fairly obscure: leftmost derivations, rightmost derivations, handles, viable prefixes, ... – string rewriting is very clean but not intuitive for most PL people – descriptions in compiler textbooks are obscure or erroneous – connections between different parsing techniques are lost • Question: is there an easier way of thinking about parsing than in terms of strings and string rewriting?
  • 3. New abstraction • For any context-free grammar, construct a Grammar Flow Graph (GFG) – syntax: representation of the grammar as a control-flow graph – semantics: executable representation, a special kind of non-deterministic pushdown automaton • Parsing problems become path problems in the GFG • The alphabet soup of grammar classes like LL(k), SLL(k), LR(k), LALR(k), SLR(k), etc. can be viewed as choices along three dimensions – non-determinism: how many paths can we explore at a time? all (Earley), only one (LL), some (LR) – look-ahead: how much do we know about the future? solve fixpoint equations over sets – context: how much do we remember about the past? procedure cloning
  • 4. GFG example for the grammar S → Aa | bAc | Bc | bBa, A → d, B → d. [GFG diagram: START-S fans out through ε edges to the entry nodes of the four S-productions (S→.Aa, S→.bAc, S→.Bc, S→.bBa); terminal edges are labeled a, b, c, d; calls to A and B pass through START-A/END-A and START-B/END-B; all production exits rejoin END-S through ε edges.]
  • 5. GFG construction. For each non-terminal A, create nodes labeled START-A and END-A. For each production in the grammar, create a "procedure" and connect it to the START and END nodes of the LHS non-terminal as shown below. [Diagram: for A → ε, the path is START-A →ε A→. →ε END-A; for A → bXY, the path is START-A →ε A→.bXY →b A→b.XY → A→bX.Y → A→bXY. →ε END-A, with ε call/return edges into START-X ... END-X and START-Y ... END-Y at the non-terminal positions.] Edges labeled ε occur only at the entry/exit of START-A and END-A nodes. Fan-out occurs only at the exit of START-A and END-A nodes. Terminal transition node: a node whose outgoing edge is labeled with a terminal. (A sketch of this construction as code follows this slide.)
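
The following is a minimal sketch of the construction on slide 5, under my own representation choices (the node encodings and helper names are not from the slides): the GFG is stored as an edge map, with one chain of dotted-item nodes per production, ε edges at procedure entry/exit, and call/return ε edges around each non-terminal occurrence.

    from collections import defaultdict

    EPS = "eps"  # label used for epsilon edges

    def build_gfg(grammar):
        """grammar: dict mapping each non-terminal to a list of right-hand sides (tuples)."""
        edges = defaultdict(list)          # node -> list of (edge label, successor node)
        for A, rhss in grammar.items():
            start, end = ("START", A), ("END", A)
            for rhs in rhss:
                # one "procedure": the dotted items A -> X1 ... . ... Xn are the internal nodes
                items = [("ITEM", A, rhs, i) for i in range(len(rhs) + 1)]
                edges[start].append((EPS, items[0]))      # entry edge out of START-A
                edges[items[-1]].append((EPS, end))       # exit edge into END-A
                for i, X in enumerate(rhs):
                    if X in grammar:                      # non-terminal: call/return edges
                        edges[items[i]].append((EPS, ("START", X)))
                        edges[("END", X)].append((EPS, items[i + 1]))
                    else:                                 # terminal: scan edge
                        edges[items[i]].append((X, items[i + 1]))
        return edges

    # The running example S -> Aa | bAc | Bc | bBa, A -> d, B -> d
    gfg = build_gfg({"S": [("A", "a"), ("b", "A", "c"), ("B", "c"), ("b", "B", "a")],
                     "A": [("d",)], "B": [("d",)]})
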
  • 6. Terminology. [Diagram repeating the A → ε and A → bXY procedures, naming the node kinds: start node (START-A), entry node (A→.bXY), call node (the node whose outgoing ε edge enters START-X), return node (the node reached by the ε edge out of END-X), exit node (A→bXY.), end node (END-A).]
  • 7. Non-deterministic GFG automaton • Interpretation of the GFG: NGA, similar to an NFA • Rules: – begin at START-S – at START nodes, make a non-deterministic choice – at END nodes, must follow a CFL path: "return to the same procedure from which you made the call" • A CFL path from START-S to END-S corresponds to a leftmost derivation • Label(path): the sequence of terminal symbols labeling the edges in the path – the label of a CFL path from START-S to END-S is a word in the language generated by the CFG. [GFG for the running example S → Aa | bAc | Bc | bBa, A → d, B → d.]
  • 8. Parsing problem • Paths(l): the set of paths with label l – the inverse relation of Label • Parsing problem: given a grammar G and a string s, find all paths in GFG(G) that generate s, or demonstrate that there is no such path • Parallel paths: two paths originating at START-S with the same label (an equivalence relation on paths originating at START-S); in the running example, P1 goes through the A→d procedure and P2 through the B→d procedure, and Label(P1) = Label(P2) • Ambiguous grammar: two or more parallel paths START-S →+ END-S. [GFG for S → Aa | bAc | Bc | bBa, A → d, B → d.]
  • 9. Compressed paths
  • 10. Addition to GFG • We need to be able to talk about sentential forms, not just sentences • Small modification to the GFG: add transitions labeled with non-terminals at procedure calls • Some paths will then have edges labeled with non-terminals – non-terminals that have not been "expanded out". [Extended GFG for S → Aa | bAc | Bc | bBa, A → d, B → d, with A- and B-labeled edges from each call node to its return node.]
  • 11. Compressed GFG paths • A more compact representation of a GFG path • Idea: collapse the portion of the path between the start and end of a given procedure and replace it with the non-terminal • Point: completed calls cannot affect the further evolution of the path, so we need not store the full path • Edges going out of END nodes of procedures will never appear in the compressed representation. [Diagram: a path START ... START-P ... END-P ... is compressed by replacing the START-P ... END-P segment with a single P-labeled edge.]
  • 12. NFA for compressed paths • Start from the extended GFG • Remove the edges out of END nodes, since these will never appear in a compressed path • Each path in this NFA corresponds to a compressed GFG path. [Extended GFG for S → Aa | bAc | Bc | bBa, A → d, B → d, with the END-node out-edges removed.]
  • 13. Following all paths: Earley's algorithm
  • 14. Recall: NFA simulation • The input string is processed left to right, one symbol at a time • A deterministic simulator keeps track of all the states the NFA could be in as the input is processed • Simulation – simulated state = subset of NFA states – if the current simulated state is C and the next input symbol is t, compute the next simulated state N as follows: • scanning: if state si ∈ C and the NFA has a transition si →t sj, add sj to N • prediction: if state sj ∈ N and the NFA has an ε-transition sj →ε sk, add sk to N – the initial simulated state is the set of initial states of the NFA, closed under the prediction rule – example: {s0,s1,s4} →a {s2} →a {s2,s3,s7} ... (A sketch of this simulation appears after this slide.)
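
A small sketch of that subset simulation (the transition-table encoding and function names are mine, not from the slides): prediction is just ε-closure, and scanning follows the edges labeled with the current input symbol.

    def eps_closure(states, delta):
        """delta: dict mapping (state, symbol) -> set of successor states; 'eps' marks epsilon edges."""
        stack, closed = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in delta.get((s, "eps"), ()):
                if t not in closed:
                    closed.add(t)
                    stack.append(t)
        return closed

    def simulate_nfa(delta, start_states, word):
        current = eps_closure(set(start_states), delta)    # initial state, closed by prediction
        for symbol in word:
            scanned = set()
            for s in current:                              # scanning step
                scanned |= delta.get((s, symbol), set())
            current = eps_closure(scanned, delta)          # prediction step
        return current
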
  • 15. Analog in GFG • First cut: use exactly the same idea – current state C, next state N, next input symbol t – scanning: if state si ∈ C and the GFG has a transition si →t sj, add sj to N – prediction: if state sj ∈ N and the GFG has an ε-transition sj →ε sk, add sk to N • Problem: it is not clear how to make ε-transitions at return states • Solution: keep "return addresses" as in Earley's algorithm. [Diagram: numbered GFG states s0..s19 for the running example; the simulation step {s0,s1,s4,s8,s13,s17,s11} →d {s12,s18,?????} shows the problem at the return states.] (A compact Earley recognizer sketch follows this slide.)
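
For concreteness, here is a compact Earley recognizer in the textbook item formulation, where "keeping return addresses" amounts to recording, in every item, the input position at which its production was predicted. This is a sketch of the standard algorithm, not the slides' exact GFG-based formulation.

    def earley_recognize(grammar, start, word):
        """grammar: dict non-terminal -> list of right-hand sides (tuples of symbols)."""
        n = len(word)
        chart = [set() for _ in range(n + 1)]
        for rhs in grammar[start]:
            chart[0].add((start, rhs, 0, 0))                 # item = (lhs, rhs, dot, origin)
        for i in range(n + 1):
            changed = True
            while changed:
                changed = False
                for (A, rhs, dot, origin) in list(chart[i]):
                    if dot < len(rhs) and rhs[dot] in grammar:       # prediction
                        for r in grammar[rhs[dot]]:
                            item = (rhs[dot], r, 0, i)
                            if item not in chart[i]:
                                chart[i].add(item); changed = True
                    elif dot == len(rhs):                            # completion: use the return address
                        for (B, r2, d2, o2) in list(chart[origin]):
                            if d2 < len(r2) and r2[d2] == A:
                                item = (B, r2, d2 + 1, o2)
                                if item not in chart[i]:
                                    chart[i].add(item); changed = True
            if i < n:                                                # scanning
                for (A, rhs, dot, origin) in chart[i]:
                    if dot < len(rhs) and rhs[dot] == word[i]:
                        chart[i + 1].add((A, rhs, dot + 1, origin))
        return any(A == start and dot == len(rhs) and origin == 0
                   for (A, rhs, dot, origin) in chart[n])

    g = {"E": [("int",), ("(", "E", "+", "E", ")"), ("(", "E", "-", "E", ")")]}
    print(earley_recognize(g, "E", ["(", "int", "+", "int", ")"]))   # True
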
  • 16. [Worked example: Earley sets for the grammar E → int | (E+E) | (E-E) on the input string (9+6). Each Earley set Σ0..Σ5 holds tagged GFG states such as ⟨E→.(E+E), 0⟩ and ⟨E→(E.+E), 0⟩, where the tag is the return address, i.e. the index of the set in which the production was predicted.]
  • 17. Earley parser and GFG states • A given Σ set can contain multiple instances of the same GFG state • Example: S → aS | a • Earley set Σi may contain ⟨S→a.S, i-1⟩, ⟨S→a., i-1⟩, ⟨S→.aS, i⟩, ⟨S→.a, i⟩, ⟨S→aS., i-2⟩, ⟨S→aS., i-3⟩, ..., ⟨S→aS., 0⟩ – the same GFG state S→aS. appears with many different return addresses
  • 18. Earley’s parser and ambiguous grammars• If an Earley configuration §t can be added to a given § <X ® . , p1> set by two or more <Y ¯. , p2> configurations, grammar is ambiguous <Z ° A. ±, p>• Example: substring between positions p and t can be derived from A in two different ways11/23/2011 18
  • 19. Look-ahead computation
  • 20. Look-ahead computation • Look-ahead at a point p in the GFG: the first k symbols you might encounter on a path starting at p, where k is a small integer fixed for the entire grammar • Subtle point: the look-ahead may depend on the path from START that you took to get to p – e.g. the 2-look-ahead at the entry of N is different for the two calls to N • Two approaches: – context-independent look-ahead: the first k symbols on any path starting at p – context-dependent look-ahead: given a path C from START to p, the first k symbols on any path starting at p that extends C. [Example grammar S → xNab | yNbc, N → a | ε; the context-dependent 2-look-aheads at the entry of N are {aa,ab} for the x-call and {ab,bc} for the y-call.]
  • 21. FIRSTk sets • FIRSTk(A): a set of strings of length k or less – if A ⇒* s where s is a terminal string of length k or less, then s ∈ FIRSTk(A) – if A ⇒* s where s is a terminal string longer than k symbols, then the k-prefix of s ∈ FIRSTk(A) • Intuition: a non-terminal A represents a set, namely the set of strings we can derive from it; FIRSTk(A) is the set of k-prefixes of these strings • Easy to extend FIRSTk to sequences of grammar symbols – e.g. for S → xNab | yNbc, N → a | ε: FIRST2(N) = {a, ε} and FIRST2(Nab) = {aa, ab}
  • 22. Useful string functions • Concatenation: s + t – e.g. xy + abc = xyabc • k-prefix of a string s, written s_k – e.g. (xyz)_2 = xy, (x)_2 = x, (ε)_2 = ε • Composition of concatenation and k-prefix: s +k t, defined as (s+t)_k – e.g. x +2 yz = xy – the operation is associative • Easy result: (s+t)_k = (s_k + t_k)_k = s_k +k t_k • The operations can be extended to sets in the obvious way – e.g. {a,bcd} +2 {ε,x,yz} = {a,ax,ay,bc}. (A sketch of these operations follows this slide.)
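
A direct transcription of these operations as a sketch (ε is represented by the empty string; the function names are mine):

    def k_prefix(s, k):
        """The k-prefix s_k of a string s."""
        return s[:k]

    def cat_k(s, t, k):
        """s +_k t, defined as the k-prefix of s + t."""
        return k_prefix(s + t, k)

    def set_cat_k(S, T, k):
        """Element-wise extension of +_k to sets of strings."""
        return {cat_k(s, t, k) for s in S for t in T}

    print(set_cat_k({"a", "bcd"}, {"", "x", "yz"}, 2))   # {'a', 'ax', 'ay', 'bc'}
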
  • 23. FIRSTk equations: FIRSTk(ε) = {ε}; FIRSTk(t) = {t} for a terminal t; FIRSTk(A) = FIRSTk(X1X2...Xn) ∪ FIRSTk(Y1Y2...Ym) ∪ ... over the right-hand sides of A's productions; FIRSTk(X1X2...Xn) = FIRSTk(X1) +k FIRSTk(X2) +k ... +k FIRSTk(Xn). (A fixpoint computation over these equations is sketched after this slide.)
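
A minimal fixpoint computation over these equations, given as a sketch (plain Python strings, with "" standing for ε; this is not the slides' own code):

    def first_k(grammar, terminals, k):
        """grammar: dict non-terminal -> list of right-hand sides (tuples of symbols)."""
        F = {A: set() for A in grammar}

        def first_seq(symbols):
            # FIRST_k(X1 X2 ... Xn) = FIRST_k(X1) +k ... +k FIRST_k(Xn)
            result = {""}
            for X in symbols:
                fx = {X} if X in terminals else F[X]
                result = {(s + t)[:k] for s in result for t in fx}
            return result

        changed = True
        while changed:                      # iterate to a fixpoint
            changed = False
            for A, rhss in grammar.items():
                new = set()
                for rhs in rhss:
                    new |= first_seq(rhs)
                if not new <= F[A]:
                    F[A] |= new
                    changed = True
        return F

    # The example from the next slide: S -> aAab | bAb, A -> cAB | eps | a, B -> eps
    g = {"S": [("a", "A", "a", "b"), ("b", "A", "b")],
         "A": [("c", "A", "B"), (), ("a",)],
         "B": [()]}
    print(first_k(g, {"a", "b", "c"}, 2))
    # expected: FIRST2(A) = {"", "a", "c", "ca", "cc"}, FIRST2(S) = {"aa", "ac", "ba", "bb", "bc"}
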
  • 24. FIRSTk example. Grammar: S → aAab | bAb, A → cAB | ε | a, B → ε. FIRST2(S) = FIRST2(aAab) ∪ FIRST2(bAb) = ({a} +2 FIRST2(A) +2 {ab}) ∪ ({b} +2 FIRST2(A) +2 {b}); FIRST2(A) = FIRST2(cAB) ∪ {ε} ∪ {a} = ({c} +2 FIRST2(A) +2 FIRST2(B)) ∪ {ε} ∪ {a}; FIRST2(B) = {ε}. Solution: FIRST2(A) = {ε, a, c, ca, cc}, FIRST2(B) = {ε}, FIRST2(S) = {aa, ac, bb, ba, bc}.
  • 25. Context-independent look-aheads. Compute FOLLOWk(A) sets: the strings of length k that can be encountered after you return from non-terminal A. For the example grammar in the figure (with end-marker $$): Se = {$$}; Ae = (FIRST2({ab}) +2 Se) ∪ (FIRST2({b}) +2 Se) ∪ (FIRST2(B) +2 Ae); Be = Ae. Solution: Se = {$$}, Ae = {ab, b$}, Be = {ab, b$}. From these FOLLOW sets, we can now compute the look-ahead at any GFG point. [Figure: the example grammar's GFG with the look-ahead sets {ab} and {b$} attached to the return edges of A.]
  • 26. Computing context-independent look-ahead sets • Algorithm: – for each non-terminal A, compute FIRSTk(A): the first k terminals you encounter on a path START-A →+ END-A – for each non-terminal A, compute FOLLOWk(A): the first k terminals you encounter on a path that extends a GFG path START →+ END-A – use the FIRSTk and FOLLOWk sets to compute the look-ahead at any point of interest in the GFG • You can even compute the FIRSTk and FOLLOWk sets in one big iteration if you want • This computation is independent of the particular parsing method used. (A FOLLOWk sketch follows this slide.)
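
A sketch of the context-independent FOLLOWk computation, reusing the first_k helper from the earlier sketch (again my own illustrative code, not the slides'):

    def follow_k(grammar, terminals, start, k):
        F = first_k(grammar, terminals, k)           # from the earlier sketch

        def first_seq(symbols):
            result = {""}
            for X in symbols:
                fx = {X} if X in terminals else F[X]
                result = {(s + t)[:k] for s in result for t in fx}
            return result

        FOLLOW = {A: set() for A in grammar}
        FOLLOW[start] = {"$" * k}                    # end-marker look-ahead at END-S
        changed = True
        while changed:
            changed = False
            for A, rhss in grammar.items():
                for rhs in rhss:
                    for i, X in enumerate(rhs):
                        if X in grammar:             # X is a non-terminal occurrence
                            # what can follow X here: FIRST_k of the rest of the
                            # right-hand side, extended (+k) by whatever can follow A
                            new = {(s + t)[:k]
                                   for s in first_seq(rhs[i + 1:])
                                   for t in FOLLOW[A]}
                            if not new <= FOLLOW[X]:
                                FOLLOW[X] |= new
                                changed = True
        return FOLLOW
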
  • 27. Production cloning: a way of implementing context-dependence
  • 28. Context-dependent look-ahead • In the running example S → xNab | yNbc, N → a | ε: – the look-aheads for the two choices of N are disjoint for the x-call ({aa} vs {ab}) – the look-aheads for the two choices of N are disjoint for the y-call ({ab} vs {bc}) – the context-independent look-ahead computation combines the look-aheads from all the call sites of N at the bottom of N and propagates them to the top, so the combined sets are no longer disjoint • Idea: compute look-aheads separately for each context and keep track of the context while parsing ⇒ we can get a more capable parser. [Figure: GFG with the look-ahead sets attached; input string xab$$.]
  • 29. Tracking context by cloning • Grammar: S → xNab | yNbc, N → a | ε is cloned into S → xN1ab | yN2bc, N1 → a | ε, N2 → a | ε, where N1 stands for [N,{ab}] and N2 for [N,{bc}] • The entries of N1 and N2 now have disjoint 2-look-aheads: {aa} vs {ab} for N1, and {ab} vs {bc} for N2. [Figure: GFG of the cloned grammar.]
  • 30. General idea of cloning • Cloning creates copies of productions • Intuitively, we would like to create a clone of a production for each of its contexts and write down its look-ahead – but the set of contexts for a production is usually infinite • Solution: – create a finite number of equivalence classes of contexts for a given production – create a clone for each equivalence class – compute context-independent look-ahead on the cloned grammar • Two cloning rules are important in practice – k-look-ahead cloning: two contexts are in the same equivalence class if their k-look-aheads are identical (used in LL(k)) – reachability cloning: two contexts C1 and C2 are in the same equivalence class if the set of GFG nodes reachable by paths with label(C1) is equal to the set of GFG nodes reachable by paths with label(C2) (used in LR(0)) – LR(k) uses a combination of the two
  • 31. k-look-ahead cloning (intuitive idea). [Figure: a grammar fragment with two call sites for A whose follow-strings are {ab} and {b$}; cloning produces copies such as [A,{ab}], [A,{b$}], [A,{da}], [A,{db}], [B,{ab}], [B,{b$}] (other clones not shown).] If there are |T| terminal symbols, you may end up with 2^(|T|^k) clones of a given production.
  • 32. k-look-ahead cloning • G = (V,T,P,S): grammar, k: positive integer • Tk(G) is the following grammar – non-terminals: {[A,R] | A in V−T and R ⊆ T^k} – terminals: T – start symbol: [S,{$^k}] – rules: all rules of the form [A,R] → X'1X'2X'3...X'm such that for some rule A → X1X2X3...Xm in P: • X'i = Xi if Xi is a terminal • X'i = [Xi, FIRSTk(Xi+1...Xm) +k R] when Xi is a non-terminal • Intuition: after this kind of cloning, the k-look-aheads at the end of a procedure are identical for all return edges, so doing a context-independent look-ahead computation on the transformed grammar does not tell you anything you did not already know about k-look-aheads. (A demand-driven sketch of Tk(G) follows this slide.)
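
A demand-driven sketch of the Tk(G) transformation: instead of enumerating all subsets R ⊆ T^k, it creates only the [A,R] clones actually reachable from the start symbol. It reuses first_k from the earlier sketch, and the helper names are mine rather than the slides'.

    def clone_k(grammar, terminals, start, k):
        F = first_k(grammar, terminals, k)           # FIRST_k sets of the original grammar

        def first_seq(symbols):
            result = {""}
            for X in symbols:
                fx = {X} if X in terminals else F[X]
                result = {(s + t)[:k] for s in result for t in fx}
            return result

        cloned = {}                                   # (A, R) -> list of cloned right-hand sides
        work = [(start, frozenset({"$" * k}))]        # start symbol [S, {$^k}]
        while work:
            A, R = work.pop()
            if (A, R) in cloned:
                continue
            cloned[(A, R)] = []
            for rhs in grammar[A]:
                new_rhs = []
                for i, X in enumerate(rhs):
                    if X in terminals:
                        new_rhs.append(X)             # X'_i = X_i for terminals
                    else:
                        # X'_i = [X_i, FIRST_k(X_{i+1}..X_m) +k R]
                        Rx = frozenset((s + t)[:k]
                                       for s in first_seq(rhs[i + 1:]) for t in R)
                        new_rhs.append((X, Rx))
                        work.append((X, Rx))
                cloned[(A, R)].append(tuple(new_rhs))
        return cloned
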
  • 33. LL(k) and SLL(k)
  • 34. Intuition • This class of grammars has the following property: if s is a string in the language, then for any prefix p of s there is a unique path P from START such that label(P) = p (modulo look-ahead) • So we need to follow only one path through the GFG for a given input string, using look-ahead to eliminate alternatives • Roughly analogous to DFAs in the CFL world
  • 35. LL(k) parsing • Only one path can be followed by the parser – so at the procedure call for a non-terminal N, we must know exactly which procedure (rule) to call • Simple LL(k), i.e. SLL(k), parsing: make the decision based on the context-independent look-ahead of k symbols at the entry point for N • LL(k) parsing: use a context-dependent look-ahead of k symbols – the procedure-cloning technique converts an LL(k) grammar into an SLL(k) grammar • Example: S → xNab | yNbc, N → a | ε is LL(2) but not SLL(2). (An SLL(1) recognizer sketch follows this slide.)
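
A minimal SLL(1) recognizer sketch, built on the first_k/follow_k sketches above (assumed to be in scope): the production to expand is chosen from a single symbol of context-independent look-ahead at the entry point of each non-terminal. This is illustrative code, not the slides' parser.

    def sll1_table(grammar, terminals, start):
        F = first_k(grammar, terminals, 1)
        FOLLOW = follow_k(grammar, terminals, start, 1)
        table = {}
        for A, rhss in grammar.items():
            for rhs in rhss:
                firsts = {""}
                for X in rhs:                          # FIRST_1 of the right-hand side
                    fx = {X} if X in terminals else F[X]
                    firsts = {(s + t)[:1] for s in firsts for t in fx}
                for t in firsts:
                    for la in (FOLLOW[A] if t == "" else {t}):
                        assert (A, la) not in table, "grammar is not SLL(1)"
                        table[(A, la)] = rhs
        return table

    def sll1_recognize(table, grammar, start, word):
        stack, pos = [start], 0
        word = list(word) + ["$"]
        while stack:
            X = stack.pop()
            if X in grammar:                           # expand using 1-symbol look-ahead
                rhs = table.get((X, word[pos]))
                if rhs is None:
                    return False
                stack.extend(reversed(rhs))
            elif X == word[pos]:                       # match a terminal
                pos += 1
            else:
                return False
        return pos == len(word) - 1                    # all input consumed up to the end-marker

    g = {"S": [("a", "S"), ("b",)]}
    t = sll1_table(g, {"a", "b"}, "S")
    print(sll1_recognize(t, g, "S", ["a", "a", "b"]))  # True
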
  • 36. Parser • Modify the Earley parser to – track compressed paths instead of full paths (transitions labeled by non-terminals and terminals) – eliminate return addresses: at the end of a production A → X1X2...Xn, pop n states off and make an A-transition from the exposed state; for A → ε, make an A-transition from the current state – use look-ahead to eliminate alternatives
  • 37. [Worked example: the LL parser's single path through the GFG for the prefix-expression grammar E → int | (+ E E) | (- E E), on the input string ( - ( + 8 9 ) 7 ).]
  • 38. Many grammars are not LL(k) • Grammar: E → int | (E+E) | (E-E) • It is not clear which rule to apply until you see "+" or "-" – this needs unbounded look-ahead, so the grammar is not LL(k) for any k • One solution: follow multiple paths until only one survives
  • 39. LR(k), SLR(k), LALR(k)
  • 40. LR grammars (informal) • LR parsers permit limited non-determinism – they can follow more than one path, but not all paths as Earley does • LR(0) condition: for any prefix of the input, the corresponding fully extended compressed paths must have the same label • This condition does not hold for general grammars: consider the string "da" in the running example S → Aa | bAc | Bc | bBa, A → d, B → d – for the prefix "d" there are two paths (one through A→d, one through B→d), and the labels of their compressed paths differ: one is "A", the other is "B" • We can use a modified Earley parser for grammars that satisfy the condition
  • 41. [Worked example: side-by-side run of the Earley parser (items with return addresses) and the LR-style parser (items without return addresses, following compressed paths) for E → int | (E+E) | (E-E) on the input string (3+4).]
  • 42. Parser for LR languages • Use the modified Earley parser we used for LL grammars – each Σ-state will have multiple items, as in the original Earley parser, since LR parsers follow multiple paths too • The Σ-states must follow a stack discipline for the modified Earley parser to work • Since we are following multiple paths, this might break down – shift-reduce conflict: parallel compressed paths P1 to a scan node and P2 to an EXIT node (push/pop conflict) – reduce-reduce conflict: parallel compressed paths P1 and P2 to different EXIT nodes (pop/pop conflict) • If the grammar has no shift-reduce or reduce-reduce conflicts, we can use the modified Earley parser and follow compressed paths while maintaining a stack discipline for Σ-states • How do we determine whether a grammar has shift-reduce or reduce-reduce conflicts?
  • 43. Finding LR(0) conflicts • Compute the DFA corresponding to the compressed-path NFA • If conflicting states are in the same DFA state, the grammar has an LR(0) conflict • Example: for S → Aa | bAc | Bc | bBa, A → d, B → d, the DFA state reached on d contains both A→d. and B→d., a reduce-reduce conflict. [Figure: the compressed-path DFA for the running example.] (A conflict-detection sketch follows this slide.)
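
A sketch of LR(0) conflict detection using the classic item-set (subset) construction, which plays the role of determinizing the compressed-path NFA; this is standard textbook code offered as an illustration, not the slides' own algorithm.

    def closure(items, grammar):
        """Epsilon-close a set of LR(0) items (lhs, rhs, dot) under prediction."""
        items = set(items)
        work = list(items)
        while work:
            A, rhs, dot = work.pop()
            if dot < len(rhs) and rhs[dot] in grammar:
                for r in grammar[rhs[dot]]:
                    item = (rhs[dot], r, 0)
                    if item not in items:
                        items.add(item)
                        work.append(item)
        return frozenset(items)

    def lr0_conflicts(grammar, start):
        symbols = {X for rhss in grammar.values() for rhs in rhss for X in rhs}
        start_state = closure({(start, rhs, 0) for rhs in grammar[start]}, grammar)
        states, work, conflicts = {start_state}, [start_state], []
        while work:
            state = work.pop()
            complete = [it for it in state if it[2] == len(it[1])]            # reductions
            shifts = [it for it in state
                      if it[2] < len(it[1]) and it[1][it[2]] not in grammar]  # terminal after dot
            if len(complete) > 1:
                conflicts.append(("reduce-reduce", state))
            if complete and shifts:
                conflicts.append(("shift-reduce", state))
            for X in symbols:                                                 # DFA transitions
                moved = {(A, rhs, d + 1) for (A, rhs, d) in state
                         if d < len(rhs) and rhs[d] == X}
                if moved:
                    nxt = closure(moved, grammar)
                    if nxt not in states:
                        states.add(nxt)
                        work.append(nxt)
        return conflicts

    g = {"S": [("A", "a"), ("b", "A", "c"), ("B", "c"), ("b", "B", "a")],
         "A": [("d",)], "B": [("d",)]}
    print(lr0_conflicts(g, "S"))   # reports the reduce-reduce conflicts between A->d. and B->d.
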
  • 44. [Figure: the LR(0) automaton (compressed-path DFA) for the expression grammar E → int | (E+E) | (E-E); its states are sets of items such as {E→.(E+E), E→.(E-E), E→.int} and {E→(E.+E), E→(E.-E)}.]
  • 45. Parser for LR(0) languages • Use the modified Earley parser we used for LL grammars – each Σ-state will have multiple items, as in the original Earley parser, since LR parsers follow multiple paths too • No need to keep track of GFG nodes within each Σ-state – the states of the compressed-path DFA correspond to the possible Σ-states – so the modified Earley parser just pushes and pops DFA states. (A sketch of such a driver follows this slide.)
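
A sketch of the push/pop driver this describes, written in the usual table-driven form: a shift pushes the next DFA state, and a reduction by A → X1...Xn pops n states and takes the A-transition (goto) from the exposed state. The action and goto_table arguments are hypothetical inputs, e.g. derived from an item-set construction like the one sketched earlier; for a pure LR(0) grammar the action would not even need to consult the next input symbol.

    def lr_drive(action, goto_table, start_state, word):
        """action: dict (state, symbol) -> ("shift", state) | ("reduce", (A, rhs_len)) | ("accept", None)
           goto_table: dict (state, non-terminal) -> state"""
        stack = [start_state]
        word = list(word) + ["$"]
        pos = 0
        while True:
            act = action.get((stack[-1], word[pos]))
            if act is None:
                return False                       # no move: reject
            kind, arg = act
            if kind == "shift":
                stack.append(arg)                  # push the DFA state reached on the terminal
                pos += 1
            elif kind == "reduce":
                A, rhs_len = arg
                if rhs_len:
                    del stack[-rhs_len:]           # pop one state per right-hand-side symbol
                stack.append(goto_table[(stack[-1], A)])   # A-transition from the exposed state
            else:                                  # "accept"
                return True
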
  • 46. GFG path interpretation • Let P1 and P2 be two GFG paths with identical labels • A sufficient condition for the labels of their compressed paths to be equal: the sequences of completed calls in P1 and P2 are identical • Most of the action in LR parsers happens at the EXIT nodes of productions. [Figure: two paths from START whose START-P ... END-P segments coincide.]
  • 47. LR(0) conflicts: GFG • LR(0) conflicts (GFG definition): – shift-reduce conflict: there are parallel paths P1: START →+ A-exit and P2: START →+ scan-node – reduce-reduce conflict: there are parallel paths P1: START →+ A-exit and P2: START →+ B-exit • Claim: let G be an LR(0) grammar according to the GFG definition; if P1 and P2 are two GFG paths that end at SCAN or END nodes, and C(P1) and C(P2) are their compressed equivalents, then P1 and P2 have the same label iff C(P1) and C(P2) have the same label. [Figure: schematic GFG paths illustrating the two kinds of conflict.]
  • 48. LR(0) conflicts: GFG (continued) • Claim (restated): let G be an LR(0) grammar according to the GFG definition; if P1 and P2 are two GFG paths that end at SCAN or END nodes, and C(P1) and C(P2) are their compressed equivalents, then P1 and P2 have the same label iff C(P1) and C(P2) have the same label • This claim is not true if the paths do not end at SCAN or END nodes – counterexample: in the LR(0) grammar S → Aa | Uc, U → Ab, A → ε, consider the paths from START to the nodes S→A.a and S→.Uc
  • 49. Example • Grammar: S → Aa | bAc | Bc | bBa, A → d, B → d • States with LR(0) conflicts: (A→d., B→d.) • Conflicting context pairs: (i) path label d – C1: START, S→.Aa, A→.d, A→d. – C2: START, S→.Bc, B→.d, B→d.; (ii) path label bd – C3: START, S→.bAc, S→b.Ac, A→.d, A→d. – C4: START, S→.bBa, S→b.Ba, B→.d, B→d. • So the grammar is not LR(0)
  • 50. LR(0) (Hopcroft & Ullman definition) • A grammar G is LR(0) if – its start symbol does not appear on the right side of any production, and – for every viable prefix γ, whenever A → α. is a complete valid item for γ, then no other complete item nor any item with a terminal to the right of the dot is valid for γ • Comment: – by this definition, the only other valid items that can occur together with A → α. are incomplete items with a non-terminal to the right of the dot, of the form B → β.Cδ – if FIRST(C) contains a terminal t, it can be shown that an item of the form Y → .tλ is valid for γ, violating the LR(0) condition; therefore FIRST(C) = {ε}, and it can be shown that this means α = ε – Example: this grammar is LR(0) (A → . and B → .Cd are valid items for the viable prefix ε): S → B, B → Cd, C → A, A → ε
  • 51. Look-ahead in LR grammars • LR(k): for each pair of parallel paths to LR(0)-conflicting states, the k-look-ahead sets are disjoint • SLR(k): if there is an LR(0) conflict at nodes A and B, the context-insensitive look-ahead sets of A and B are disjoint • LALR(k): the grammar is SLR(k) after reachability cloning. [Figure: the schematic reduce-reduce and shift-reduce conflict paths from slide 47.]
  • 52. Example • Grammar: S → Aa | bAc | Bc | bBa, A → d, B → d • States with LR(0) conflicts: (A→d., B→d.) • Conflicting context pairs: (i) path label d – C1: START, S→.Aa, A→.d, A→d. – C2: START, S→.Bc, B→.d, B→d.; (ii) path label bd – C3: START, S→.bAc, S→b.Ac, A→.d, A→d. – C4: START, S→.bBa, S→b.Ba, B→.d, B→d. • The grammar is LR(1) – look-ahead for C1: {a}, for C2: {c}; look-ahead for C3: {c}, for C4: {a}
  • 53. LR(1) automaton for S → Aa | bAc | Bc | bBa, A → d, B → d. [Figure: the LR(1) item-set automaton; the conflicting reductions are now distinguished by their look-aheads, e.g. the state reached on d from the start contains A→d., a and B→d., c.]
  • 54. LALR look-ahead computation • Key observation: – each path START →* s in the deterministic LR(0) automaton represents a set of contexts in the non-deterministic LR(0) automaton; each context in this set ends at one of the items in s – in general, there will be multiple paths to a state s in the deterministic LR(0) automaton – so each state in the LR(0) automaton represents a set of sets of contexts – in LALR, we merge the look-aheads for those contexts • LALR = reachability cloning + SLR (Bermudez and Logothetis), with unions of look-aheads at some nodes (see the R→L. state in the diagram on the next page)
  • 55. LALR(1) but not SLR(1). Grammar: S' → S$, S → L=R | R, L → *R | id, R → L. FOLLOW(S) = {$}, FOLLOW(R) = {=, $}, FOLLOW(L) = {=, $}. [Figure: LR(0) automaton for this grammar; the state containing S → L.=R and R → L. has a shift-reduce conflict on =, which SLR(1) cannot resolve because = ∈ FOLLOW(R).]
  • 56. LALR ⇒ SLR grammar. The grammar S' → S$, S → L=R | R, L → *R | id, R → L is transformed by cloning into S' → S$, S → L1=R2 | R1, L1, L2, L3 → *R3 | id, R1 → L1, R2 → L2, R3 → L3. [Figure: LR(0) automaton of the cloned grammar.]
  • 57. LR(0): Reachability cloning • Motivation: the NFA→DFA conversion for LR grammars • Driven by compressed paths • Need to verify that this cloning satisfies the sanity condition even if the grammar is not LR(0) • Compressed contexts C1 and C2 of a node A are in the same equivalence class if the set of GFG nodes reachable by paths with label(C1) equals the set of GFG nodes reachable by paths with label(C2). [Figure: contexts C1 and C2 reach the same set of nodes and are in the same equivalence class; C3 is in a different class.]
  • 58. Algorithm (need to write) • G = (V,T,P,S): grammar • R(G) is the following grammar – non-terminals: {[Ai] | A in V−T, 1 <= i <= n, where there are n edges labeled A in the compressed-path DFA} – terminals: T – start symbol: [S] – rules: all rules of the form [Ai] → X'1X'2X'3...X'm such that for some rule A → X1X2X3...Xm in P: • X'i = Xi if Xi is a terminal • X'i = [Xi] when Xi is a non-terminal
  • 59. Cloning for LALR(1) • Same condition as LR(0): reachability cloning • Extension to LA(k)LR(l): – cloning is governed by LR(l), i.e. clone as in LR(l) – then compute SLR(k) look-aheads – LALR(k) is LA(k)LR(0) – LR(k) is LA(k)LR(k)
  • 60. Summary • A new abstraction for CFL parsing: the Grammar Flow Graph (GFG) • Parsing problems become path problems in the GFG • The Earley parser emerges as a simple extension of NFA simulation • Mechanisms that control the number of paths followed during parsing – look-ahead: the algorithm is solving set constraints – context-dependent look-ahead: the algorithm is cloning • SLL(k), LL(k), SLR(k), LR(k), LALR(k) grammars arise from different choices of these mechanisms • LL and LR parsers emerge as specializations of the Earley parser
  • 61. LR(0) DFA. [Figure: the LR(0) DFA with states M0..M9 for E → int | (E+E) | (E-E), and the stack trace of the modified Earley parser on the input ((2+3)-4), where each stack entry pairs a DFA state with a position, e.g. <M0,0>, <M2,0>, <M2,1>, ...]
  • 62. LALR(1) example from G&J. Grammar: S' → S#, S → ABc, A → a, B → b | ε. [Figure: the corresponding GFG.]
  • 63. Grammar: S → L=R | R, L → *R | id, R → L. [Figure: GFG with nodes S, L, Lend, C, R, Rend, T.] A shift-reduce conflict occurs at the states C and Rend (the conflicting paths are S→L→Lend→C and S→R→L→Lend→Rend). The 1-look-ahead at C is =. The context-independent 1-look-ahead at Rend is {=, $}, so the grammar is not SLR(1). LALR(1) figures out that for the conflicting state, the calling context must be S → .R. The look-ahead at Rend is = for the context S→L→T→R→L→Lend→Rend, but there is no context S →* C parallel to this one.
  • 64. LR(1). For S → L=R | R, L → *R | id, R → L: FIRST(L) = FIRST(R) = {*, id}, and there is a shift-reduce conflict on =. [Figure: after procedure cloning for LR(1), the clones [L,{=}], [L,{$}], [R,{=}], [R,{$}] separate the contexts and the conflict disappears.]
  • 65. LALR(1) look-aheads. [Figure: LR states T0..T5 for the grammar S' → S$, S → (S) | ε.] • After a reduction by S → (S), parsing can resume either in state T0 or T1; the LR parser stack tells you which one to resume from • The LALR(1) look-aheads in state T1 are interesting: the item S → (.S) gets look-ahead from the item S → .(S) in state T0 as well as from the item S → (.S) in state T1 itself
  • 66. Parsing techniques • Our focus: techniques that perform a breadth-first traversal of the GFG – similar to online simulation of an NFA – the input is read left to right, one symbol at a time – current GFG paths are extended, if possible, using the symbol • Three dimensions: – non-determinism: how many paths can I follow at a given time? – look-ahead: how many symbols of look-ahead are known at each point? – context: how much context do we keep? (implemented by procedure cloning, independent of look-ahead)
  • 67. What we would like to show • Obvious algorithm: follow all CFL-paths in the GFG – essentially a fancy transitive closure in the GFG – leads to Earley's algorithm – O(n^3) complexity • O(n) algorithms: LL/LR/LALR, ... – preprocessing to compute look-ahead sets – maintain compressed paths – ensure that the Earley sets can be manipulated as a stack
  • 68. What we would like to show (contd.) • SLL(k) = no cloning + decision at procedure start • LL(k) = k-look-ahead cloning + decision at procedure start • LA(l)LL(k) = l-look-ahead cloning + context-independent k-look-ahead + decision at procedure start • SLR(k) = no cloning + decision at procedure end • LR(k) = k-look-ahead cloning + decision at procedure end • LALR(k) = reachability cloning + decision at procedure end
  • 69. Computing context-independent look-ahead • Intuition: – a simple inter-procedural backward dataflow analysis on the GFG – assume the look-ahead at the exit of the GFG is {$^k} – propagate the look-ahead back through the GFG to determine the look-aheads at other points • How do we propagate look-aheads through non-terminal calls? – we would like to avoid repeatedly analyzing a procedure for each look-ahead set we want to propagate through it – we need to handle recursive calls – ideally, we would have a function that tells us how a look-ahead set at the exit of a procedure gets propagated to its input. [Figure: 2-symbol look-aheads for S → xNab | yNbc, N → a | ε.]
  • 70. Every LL(1) grammar is an SLL(1) grammar. [Figure: a non-terminal N with two productions P and Q, contexts C1, C2 from START, and continuations C1', C2' to END.] Let the strings generated by paths P and Q be SP and SQ. Cases: – SP and SQ both begin with the same terminal a: the grammar is neither LL(1) nor SLL(1) – SP begins with a and SQ with a different terminal b: the grammar is LL(1) and SLL(1) – SP = ε and SQ = ε: the grammar is neither LL(1) nor SLL(1) – SP begins with a and SQ = ε: we show that there cannot be a context Ci for which the string generated by the complementary context Ci' begins with a; otherwise, for context Ci, the 1-look-ahead for choice P is a and the 1-look-ahead for choice Q is also a, so the grammar is not LL(1); therefore there is no context Ci for which the 1-look-ahead for choice Q is a; but this means the context-independent 1-look-ahead for choice Q cannot contain a; therefore the grammar is SLL(1).
  • 71. LL(2) grammar that is not SLL(2). Grammar: S → xNab | yNbc, N → a | ε. Consider the context-sensitive look-aheads at N: for context C1 (the x-call), the 2-look-ahead for choice P (N → a) is {aa} and for choice Q (N → ε) is {ab}; for context C2 (the y-call), the 2-look-ahead for choice P is {ab} and for choice Q is {bc}. Therefore the grammar is LL(2). The context-independent look-aheads are {aa,ab} for choice P and {ab,bc} for choice Q; since these two sets are not disjoint, the grammar is not SLL(2). [Figure: the GFG with contexts C1, C2 and continuations C1', C2'.]
  • 72. Cloning for LR(k) • From Sippu & Soisalon-Soininen: replace each non-terminal A in the original grammar G with the set of all pairs of the form ([γ]k, A), where γ is a viable prefix of the $-augmented grammar G • [page 16] A string γ1 is LR(k)-equivalent to a string γ2 if VALIDk(γ1) = VALIDk(γ2), i.e. exactly those items valid for γ2 are valid for γ1 and vice versa • An item [A → β1.β2, y] is LR(k)-valid for γ = αβ1 if S ⇒rm* αAz ⇒rm αβ1β2z and k:z = y • Question: is this a finer equivalence relation than LL(k)?
  • 73. Sanity condition on equivalence classes • If C1 and C2 are two contexts for some node N such that – C1 = B1 followed by P – C2 = B2 followed by P – B1 and B2 are in the same equivalence class, then C1 and C2 must be in the same equivalence class • Can we come up with a general construction procedure for cloning, given a specification of the equivalence classes? [Figure: contexts B1 and B2 from START sharing a common suffix path P down to N.]
