1
SYNTAX ANALYSIS - 1
UNIT - 2
2
 The Role of Parser
 Context Free Grammars
 Writing a Grammar
 Parsing – Top Down and Bottom Up
 Simple LR Parser
 Powerful LR Parser
 Using Ambiguous Grammars
 Parser Generators
OBJECTIVES
3
 Syntax
◦ The way in which words are strung together to form phrases,
clauses, and sentences
 Syntax Analysis
◦ The task of fitting a sequence of tokens into the grammatical
structure specified for a language
 Parsing
◦ To break a sentence down into its component parts with an
explanation of the form, function, and syntactical relationship
of each part
Syntax Analysis and Parsing
4
 Every PL has rules that describe the syntactic structure of
well-formed programs
◦ In C, a program is made up of functions, declarations, statements etc
◦ Syntax can be specified using context-free grammars or BNF
 Grammars offer benefits to both designers & compiler writers
◦ Gives precise yet easy-to-understand syntactic specification of PL’s
◦ Helps construct efficient parsers automatically
◦ Provides a good structure to the language which in turn helps generate
correct object code
◦ Allows a language to be evolved iteratively by adding new constructs
to perform new tasks
Syntax Analysis
5
 Parser obtains a string of tokens from the lexical analyzer
 Verifies that the string of token names can be generated by the
grammar
 Constructs a parse tree and passes it to the rest of the compiler
 Reports errors when the program does not match the syntactic structure
of the source language
The Role of a Parser
6
 Universal Parsing Method
◦ Can parse any grammar, but too inefficient to use in practical
compilers
◦ Example: Earley’s Algorithm
 Top-Down Parsing
◦ Build parse trees from the root to the leaves
◦ Can be generated automatically or written manually
◦ Example : LL Parser
 Bottom-Up Parsing
◦ Starts from the leaves and works up to the root
◦ Can only be generated automatically
◦ Example : LR Parser
Parsing Methods
7
 Some of the grammars used in the discussion
◦ LR grammar for bottom-up parsing
E → E + T | T
T → T * F | F
F → ( E ) | id
◦ Non-left-recursive variant of the grammar
E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
Representative Grammars
8
 A compiler must detect and report errors, but how errors are
handled is left to the compiler designer
 Common PL errors at many different levels
◦ Lexical Errors
 Misspellings of identifiers, keywords or operators
◦ Syntactic Errors
 Misplaced semicolons or extra or missing braces, that is “{“ or “}”
◦ Semantic Errors
 Type mismatches between operators and operands
◦ Logical Errors
 Incorrect reasoning by the programmer or use of assignment operator (=)
instead of comparison operator (==)
Syntax Error Handling
9
 Parsing methods detect errors as soon as possible, i.e., when
stream of tokens cannot be parsed further
 Have the viable-prefix property
◦ Detect that an error has occurred as soon as they see a prefix of the input
that cannot be completed to form a string in the language
 Errors appear syntactic and are exposed when parsing cannot
continue
 Goals of error handler in a parser
◦ Report the presence of errors clearly and accurately
◦ Recover from each error quickly enough to detect subsequent errors
◦ Add minimal overhead to the processing of correct programs
Syntax Error Handling
10
 How should the parser recover once an error is detected?
◦ Quit with an informative error message when it detects the first error
◦ Additional errors are uncovered if the parser can restore itself to a state
where processing of the input can continue
◦ If errors pile up, it's better to stop after exceeding some error limit
 Four error-recovery strategies
◦ Panic-Mode Recovery
◦ Phrase-Level Recovery
◦ Error Productions
◦ Global Correction
Error-Recovery Strategies
11
 Panic-Mode Recovery
◦ Parser discards input symbols one at a time until one of a designated set of
synchronizing tokens is found
 Synchronizing tokens – delimiters like semicolon or “}”
◦ Compiler designer must select the synchronizing tokens appropriate for the
source language
◦ Has the advantage of simplicity and is guaranteed not to go into an infinite loop
 Phrase-Level Recovery
◦ Perform local correction on remaining input, that is, may replace a prefix
of the remaining input by some string that allows the parser to continue
◦ Choose replacements that do not lead to infinite loops
◦ Drawback is the difficulty of coping with situations in which the actual error
has occurred before the point of detection
Error-Recovery Strategies
12
 Error Productions
◦ Anticipate the common errors that might be encountered
◦ Augment grammar for language with productions that generate erroneous
constructs
◦ Such a parser detects anticipated errors when error production is used
during parsing
 Global Correction
◦ Make as few changes as possible in processing an incorrect input string
◦ Use algorithms for choosing a minimal sequence of changes to obtain a
globally least-cost correction
◦ Given an incorrect input string x and grammar G, these algorithms find a
parse tree for a related string y, such that the number of changes required to
transform x to y is small
◦ Too costly to implement in terms of time and space
Error-Recovery Strategies
13
 A context-free grammar G = (T, N, S, P) consists of:
◦ T, a set of terminals (scanner tokens – symbols that may not appear on the
left side of a rule).
◦ N, a set of nonterminals (syntactic variables generated by productions –
symbols on left or right of rule).
◦ S, a designated start symbol nonterminal.
◦ P, a set of productions (Rules). Each production consists of
 A nonterminal called the left side of the production
 The symbol →
 A right side consisting of zero or more terminals and non-terminals
Context-Free Grammars- Formal Definition
14
 Example grammar for simple arithmetic expressions
expression → expression + term
expression → expression – term
expression → term
term → term * factor
term → term / factor
term → factor
factor → ( expression )
factor → id
Context-Free Grammars- Formal Definition
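The four components of G = (T, N, S, P) can be written out directly as data. A minimal sketch for the expression grammar above (the variable names and tuple encoding are illustrative choices, not part of the formal definition):

```python
# The grammar G = (T, N, S, P) for simple arithmetic expressions, encoded
# as plain Python data. P maps each nonterminal (head) to a list of bodies;
# each body is a tuple of terminals and nonterminals.
T = {"+", "-", "*", "/", "(", ")", "id"}          # terminals
N = {"expression", "term", "factor"}              # nonterminals
S = "expression"                                  # start symbol
P = {
    "expression": [("expression", "+", "term"),
                   ("expression", "-", "term"),
                   ("term",)],
    "term":       [("term", "*", "factor"),
                   ("term", "/", "factor"),
                   ("factor",)],
    "factor":     [("(", "expression", ")"),
                   ("id",)],
}

# Sanity checks: every head is a nonterminal, every body symbol is in T or N.
assert set(P) == N and S in N
assert all(sym in T | N for bodies in P.values() for body in bodies for sym in body)
```

The checks at the end mirror the definition: heads come from N, and right sides consist only of terminals and nonterminals.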
15
 Terminal symbols
◦ Lowercase letters early in the alphabet like a,b,c
◦ Operator symbols such as +, *
◦ Punctuation symbols such as parentheses, and comma
◦ Digits 0,1,….,9
 Nonterminal symbols
◦ Uppercase letters early in the alphabet like A,B,C
◦ Letter S, which, when used stands for the start symbol
 Lowercase letters late in the alphabet, like u, v, . . . , z, represent
strings of terminals
 Uppercase letters late in the alphabet, such as X, Y, Z, represent
grammar symbols, that is, either terminals or nonterminals
Notational Conventions
16
 Lower case Greek letters α, β, γ represent strings of grammar
symbols
 If A → α1, A → α2, . . . , A → αn are all productions with A
on the LHS (known as A-productions), then we write
A → α1 | α2 | . . . | αn (α1, α2, . . . , αn are the alternatives of A)
 Unless otherwise stated, the LHS of the first production is the
start symbol
 Example: previous grammar written using these conventions
E  E + T | E – T | T
T  T * F | T / F | F
F  ( E ) | id`
Notational Conventions
17
 A way of showing how an input sentence is recognized with a
grammar
 Beginning with the start symbol, each rewriting step replaces a
nonterminal by the body of one of its productions
 Consider the following grammar
E  E + E | E * E | – E | ( E ) | id
◦ “E derives –E” can be denoted by E ==> – E
◦ A derivation of – ( id ) from E
E ==> – E ==> – ( E ) ==> – ( id )
Derivations
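A derivation is just repeated replacement of a nonterminal by one of its production bodies. A small sketch of the derivation of – ( id ) shown above (the list-of-symbols representation is an illustrative assumption):

```python
# Sentential forms are lists of grammar symbols; a derivation step replaces
# one occurrence of a nonterminal with the body of one of its productions.
def derive_step(form, index, body):
    """Replace the nonterminal at position `index` with `body`."""
    return form[:index] + list(body) + form[index + 1:]

# E ==> - E ==> - ( E ) ==> - ( id )
form = ["E"]
form = derive_step(form, 0, ["-", "E"])        # apply E -> - E
form = derive_step(form, 1, ["(", "E", ")"])   # apply E -> ( E )
form = derive_step(form, 2, ["id"])            # apply E -> id
assert form == ["-", "(", "id", ")"]
```

Each intermediate value of `form` is a sentential form; the final one, containing only terminals, is a sentence of the grammar.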
18
 Symbols to indicate derivation
◦ ==> denotes “derives in one step”
◦ ==>* denotes “derives in zero or more steps”
◦ ==>+ denotes “derives in one or more steps”
 Given the grammar G with start symbol S, a string ω is part of
the language L(G) if and only if S ==>* ω
 If S ==>* α and α contains
◦ Only terminals, then it is a sentence of G
◦ Both terminals and nonterminals, then it is a sentential form of G
 Two choices to be made at each step in the derivation
◦ Choose which nonterminal to replace
◦ Pick a production with that nonterminal as head
Derivations
19
 Leftmost Derivation
◦ The leftmost nonterminal in each sentential form is always chosen to be
replaced
◦ If α ==> β is a step in which the leftmost nonterminal in α is replaced, we
write α ==>lm β
 Rightmost Derivation
◦ Replace the rightmost nonterminal in each sentential form
◦ Written as α ==>rm β
 If S ==>*lm α, then we can say that α is a left-sentential form of the
grammar
Derivations
20
 A graphical representation of a derivation, showing how a string of
the language is derived from the start symbol of the grammar
◦ Each interior node is labeled with the nonterminal in the head of the
production
◦ Children are labeled by the symbols in the RHS of the production
 Yield of the tree
◦ The string derived or generated from the nonterminal at the root of the
tree
◦ Obtained by reading the leaves of the parse tree from left to right
Parse Trees and Derivations
21
Parse Trees and Derivations
Fig: Parse tree for – (id +id)
22
 A grammar that produces more than one parse tree for some
sentence is said to be ambiguous
◦ Equivalently, there is more than one leftmost derivation, or more than one
rightmost derivation, for the same sentence
Ambiguity
Fig: Two Parse trees for id+id*id
23
 Every construct that can be described by a regular expression
can be described by a grammar, but not vice-versa.
 Grammar construction from the NFA
◦ For each state i of the NFA, create a nonterminal Ai
◦ If state i has a transition to state j on input a, add the production Ai → aAj
If state i goes to state j on input ε, add the production Ai → Aj
◦ If i is an accepting state, add Ai → ε
◦ If i is the start state, make Ai the start symbol of the grammar
CFG’s Versus Regular Expressions
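The NFA-to-grammar construction above can be sketched directly. The NFA encoding here is an assumption: transitions map (state, symbol) to a set of target states, with None standing for an ε-move.

```python
# Build a right-linear grammar from an NFA, one nonterminal Ai per state i:
#   Ai -> a Aj  for each transition i --a--> j
#   Ai -> Aj    for each epsilon transition i --eps--> j
#   Ai -> eps   for each accepting state i (epsilon body = empty tuple)
def nfa_to_grammar(transitions, start, accepting):
    productions = []
    for (i, a), targets in transitions.items():
        for j in targets:
            if a is None:                       # epsilon transition
                productions.append((f"A{i}", (f"A{j}",)))
            else:                               # transition on input symbol a
                productions.append((f"A{i}", (a, f"A{j}")))
    for i in accepting:                         # accepting state: Ai -> eps
        productions.append((f"A{i}", ()))
    return f"A{start}", productions

# A two-state NFA: state 0 loops on 'a' and can move to state 1 on 'a';
# state 1 (accepting) loops on 'b'.
start, prods = nfa_to_grammar(
    transitions={(0, "a"): {0, 1}, (1, "b"): {1}},
    start=0, accepting={1})
assert start == "A0"
assert ("A1", ()) in prods          # accepting state gets an epsilon production
```

This direction of the construction shows that every regular language has a context-free grammar; the converse fails, as the nested-parenthesis examples later illustrate.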
24
Exercises
25
26
 Lexical versus Syntactic Analysis
◦ Why use RE’s to define lexical syntax of a language?
 Separating syntactic structure into lexical and non-lexical parts
modularizes a compiler into two components
 Lexical rules are simple and easy to describe
 RE’s provide a concise and easier-to-understand notation for tokens than
grammars
 Lexical analyzers can be constructed automatically and more efficiently
from RE’s than from arbitrary grammars
◦ RE’s are most useful for describing the structure of constructs such
as identifiers, constants, keywords and whitespace
◦ Grammars are useful for describing nested structures such as
balanced parentheses, matching begin-end’s, corresponding if-then-
else’s
Writing a Grammar
27
 Ambiguous grammars can be rewritten to eliminate ambiguity
◦ Example : “Dangling-else” grammar
stmt  if expr then stmt
| if expr then stmt else stmt
| other
◦ Parse tree for the statement: if E1 then S1 else if E2 then S2 else S3
Eliminating Ambiguity
28
 Grammar is ambiguous since the following string has two parse
trees
if E1 then if E2 then S1 else S2
Eliminating Ambiguity
29
 Rewriting the “dangling-else” grammar
◦ General rule is, “Match each else with the closest unmatched then”
◦ Idea is that a statement appearing between a then and an else must be
“matched”
◦ That is, each interior statement must not end with an unmatched, or open,
then
◦ A matched statement is either an if-then-else statement containing no open
statements or any other kind of unconditional statement
◦ The rewritten, unambiguous grammar:
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
open_stmt → if expr then stmt
| if expr then matched_stmt else open_stmt
Eliminating Ambiguity
30
 Left-recursive grammar
◦ A grammar that has a nonterminal A such that there is a derivation A ==>+ Aα
for some string α
◦ Top-down parsing methods cannot handle left-recursive grammars
◦ Immediate left recursion: a production of the form A → Aα
◦ A left-recursive pair of productions A → Aα | β can be replaced by:
A → β A′
A′ → α A′ | ε
◦ Eliminating immediate left recursion
 First group the A-productions as
A → Aα1 | Aα2 | . . . | Aαm | β1 | β2 | . . . | βn
where no βi begins with an A
 Replace the A-productions by
A → β1 A′ | β2 A′ | . . . | βn A′
A′ → α1 A′ | α2 A′ | . . . | αm A′ | ε
Eliminating Left Recursion
31
 Algorithm to eliminate left recursion from a grammar
Eliminating Left Recursion
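The transformation for immediate left recursion can be sketched directly; primes are written as an appended apostrophe and the helper name is illustrative:

```python
# Eliminate immediate left recursion for one nonterminal A:
#   A -> A a1 | ... | A am | b1 | ... | bn     becomes
#   A  -> b1 A' | ... | bn A'
#   A' -> a1 A' | ... | am A' | eps
def eliminate_immediate_left_recursion(head, bodies):
    A_prime = head + "'"
    alphas = [b[1:] for b in bodies if b and b[0] == head]   # left-recursive tails
    betas  = [b for b in bodies if not b or b[0] != head]    # non-recursive bodies
    if not alphas:                                           # nothing to do
        return {head: bodies}
    return {
        head:    [beta + (A_prime,) for beta in betas],
        A_prime: [alpha + (A_prime,) for alpha in alphas] + [()],  # () is epsilon
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | eps
new = eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)])
assert new["E"] == [("T", "E'")]
assert new["E'"] == [("+", "T", "E'"), ()]
```

This handles only the immediate case; the general algorithm referred to on this slide additionally substitutes earlier nonterminals into later productions before applying this step.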
32
 When the choice between two alternative A-productions is not clear,
◦ We rewrite the grammar to defer the decision until enough of the input has
been seen
 In general, if A → αβ1 | αβ2 are two A-productions
◦ We do not know whether to expand A to αβ1 or αβ2
◦ Defer the decision by expanding A to αA′
◦ After seeing the input derived from α, we expand A′ to β1 or to β2
◦ Left-factored, the original productions become
A → α A′
A′ → β1 | β2
Left Factoring
33
 Algorithm for left-factoring a grammar
◦ For each nonterminal A, find the longest prefix α common to two or more
alternatives
◦ If α ≠ ε – replace all of the A-productions A → αβ1 | αβ2 | . . . | αβn | γ,
where γ represents alternatives that do not begin with α, by
A → αA′ | γ
A′ → β1 | β2 | . . . | βn
◦ Repeatedly apply this transformation until no two alternatives for a
nonterminal have a common prefix
Left Factoring
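One round of this algorithm, for a single nonterminal, can be sketched as below (the helper name and tuple encoding are illustrative):

```python
# One round of left factoring: pull out the longest prefix shared by two or
# more alternatives of `head`, introducing a new nonterminal head'.
def left_factor_once(head, bodies):
    # Find the longest prefix common to at least two alternatives.
    best = ()
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            k = 0
            while (k < len(bodies[i]) and k < len(bodies[j])
                   and bodies[i][k] == bodies[j][k]):
                k += 1
            if k > len(best):
                best = bodies[i][:k]
    if not best:
        return {head: bodies}                      # no common prefix: done
    A_prime = head + "'"
    factored = [b[len(best):] for b in bodies if b[:len(best)] == best]
    rest     = [b for b in bodies if b[:len(best)] != best]
    return {head: [best + (A_prime,)] + rest, A_prime: factored}

# S -> i E t S | i E t S e S | a   becomes   S -> i E t S S' | a,  S' -> eps | e S
new = left_factor_once("S", [("i","E","t","S"), ("i","E","t","S","e","S"), ("a",)])
assert new["S"] == [("i","E","t","S","S'"), ("a",)]
assert new["S'"] == [(), ("e","S")]
```

As the slide says, the transformation is applied repeatedly until no two alternatives of any nonterminal share a common prefix.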
34
 Some syntactic constructs found in PL cannot be specified using
grammars alone
◦ Example 1: Problem of checking that identifiers are declared before
they are used in the program
The abstract language is L1 = {wcw | w is in (a|b)*}
◦ Example 2: Problem of checking that number of formal parameters in
the declaration of a function agrees with the number of actual
parameters in a use of the function
The abstract language is L2 = {aⁿbᵐcⁿdᵐ | n ≥ 1 and m ≥ 1}
Non-Context Free Language Constructs
35
 Constructing a parse tree for the input string, starting from the
root and creating nodes in preorder
 Can be viewed as finding a leftmost derivation for an input
string
 Consider the grammar below
E  T E′
E′  + T E′ | ε
T  F T′
T′  * F T′ | ε
F  ( E ) | id
Sequence of parse trees for the input id+id*id
Top-Down Parsing
36
Top-Down Parsing
37
 Determining the production to be applied for a nonterminal A is
the key problem at each step of top-down parse
 Once an A-production is chosen, the parsing process consists of
“matching” the terminal symbols in production with input string
 Two types
◦ Recursive-descent parsing
 May require backtracking to find the correct A-production to be applied
◦ Predictive Parsing
 No backtracking is required
 Chooses the correct A-production by looking ahead at the input a fixed
number of symbols
 LL(k) Grammars : Construct predictive parsers that look k
symbols ahead in the input
Top-Down Parsing
38
 The parsing program consists of a set of procedures, one for
each nonterminal
 Execution begins with the procedure for the start symbol, which
halts and announces success if its procedure body scans the entire input string
Recursive-Descent Parsing
39
 To allow backtracking, the code needs to be modified
◦ Cannot choose a unique A-production at line (1), so must try each of
the several productions in some order
◦ Failure at line (7) is not ultimate failure, but tells that we need to return
to line (1) and try another A-production
◦ Only if there are no more A-productions to try, we declare that an input
error has been found
◦ To try another A-production, we need to be able to reset the input
pointer to where it was when we first reached line (1)
◦ A local variable is needed to store this input pointer
Recursive-Descent Parsing
40
 Consider the grammar:
S → c A d
A → a b | a
and the input string w = cad
Recursive-Descent Parsing
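A sketch of recursive-descent parsing with backtracking for this grammar. The procedure names and the convention of returning the new input position (or None on failure) are my own choices, not part of the slides:

```python
# Recursive-descent with backtracking for:  S -> c A d,  A -> a b | a
def parse_A(w, pos):
    """Try each A-production in order; return new position or None."""
    for body in (("a", "b"), ("a",)):      # try A -> a b first, then A -> a
        p = pos                            # saved input pointer for backtracking
        ok = True
        for sym in body:
            if p < len(w) and w[p] == sym:
                p += 1
            else:
                ok = False                 # this alternative failed
                break
        if ok:
            return p
    return None                            # no alternative worked

def parse_S(w):
    """Return True iff w is derivable from S and fully consumed."""
    if not w or w[0] != "c":
        return False
    pos = parse_A(w, 1)
    if pos is None or pos >= len(w) or w[pos] != "d":
        return False
    return pos + 1 == len(w)               # success iff all input consumed

assert parse_S("cad")    # A -> a b fails on 'd'; backtracking to A -> a succeeds
```

On input cad, the first alternative A → a b matches a but then fails on d, so the input pointer is reset and A → a is tried, exactly the backtracking behavior described above.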
41
 During parsing, FIRST and FOLLOW allow us to choose which
production to apply, based on the next input symbol
 FIRST(α)
◦ Set of terminals that begin strings derived from α. If α derives ε, then ε
is also in FIRST(α)
◦ Example: A ==>* cγ, so c is in FIRST(A)
◦ Consider two A-productions A → α | β, where FIRST(α) and FIRST(β)
are disjoint sets
◦ Choose between these by looking at the next input symbol a, since a
can be in at most one of FIRST(α) and FIRST(β), not both
FIRST and FOLLOW
42
 FOLLOW(A), for a nonterminal A
◦ Set of terminals a that can appear immediately to the right of A in some
sentential form
◦ That is, the set of terminals a such that there exists a derivation of the form
S ==>* αAaβ
◦ If A can be the rightmost symbol in some sentential form, then $ is in
FOLLOW(A)
FIRST and FOLLOW
43
 To compute FIRST(X), apply the following rules until no more
terminals or ε can be added to any FIRST set
1. If X is a terminal, then FIRST(X) = { X }
2. If X is a nonterminal and X → Y1Y2 . . . Yk is a production for some k ≥ 1
 Place a in FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of
FIRST(Y1), . . . , FIRST(Yi-1); i.e., Y1 . . . Yi-1 ==>* ε
 If ε is in FIRST(Yj) for all j = 1, 2, . . ., k, then add ε to FIRST(X)
 If Y1 does not derive ε, then we add nothing more to FIRST(X), but if Y1 ==>* ε,
then we add FIRST(Y2), and so on
3. If X → ε is a production, then add ε to FIRST(X)
FIRST and FOLLOW
44
 To compute FOLLOW(A) for all nonterminals A, apply the
following rules until nothing can be added to any FOLLOW set
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input
right endmarker
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is
in FOLLOW(B)
3. If there is a production A → αB, or a production A → αBβ where
FIRST(β) contains ε, then everything in FOLLOW(A) is in
FOLLOW(B)
FIRST and FOLLOW
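The two fixed-point computations above can be sketched for the non-left-recursive expression grammar used earlier. Epsilon bodies are empty tuples, and the strings "eps" and "$" are my markers for ε and the end marker:

```python
EPS, END = "eps", "$"
G = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],     # () is the epsilon body
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}
START = "E"

def first_of_string(symbols, FIRST):
    """FIRST of a string of grammar symbols (rule 2, applied symbol by symbol)."""
    out = set()
    for sym in symbols:
        f = FIRST[sym] if sym in G else {sym}   # terminal: FIRST(a) = {a}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                                # every symbol can derive epsilon
    return out

def compute_first():
    FIRST = {A: set() for A in G}
    changed = True
    while changed:                              # iterate to a fixed point
        changed = False
        for A, bodies in G.items():
            for body in bodies:
                f = first_of_string(body, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    return FIRST

def compute_follow(FIRST):
    FOLLOW = {A: set() for A in G}
    FOLLOW[START].add(END)                      # rule 1: $ in FOLLOW(start)
    changed = True
    while changed:
        changed = False
        for A, bodies in G.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in G:              # only nonterminals get FOLLOW
                        continue
                    f = first_of_string(body[i + 1:], FIRST)
                    new = (f - {EPS}) | (FOLLOW[A] if EPS in f else set())
                    if not new <= FOLLOW[B]:    # rules 2 and 3
                        FOLLOW[B] |= new
                        changed = True
    return FOLLOW

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
assert FIRST["E"] == {"(", "id"}
assert FOLLOW["E"] == {")", "$"}
```

The loops terminate because each pass can only add elements to finite sets; when a full pass adds nothing, the rules are satisfied.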
45
 Predictive parsers needing no backtracking are constructed for a
class of grammars called LL(1) grammars
◦ First “L” stands for scanning the input from left to right
◦ Second “L” stands for producing a leftmost derivation
◦ “1” for using one input symbol of look ahead at each step to make parsing
decisions
 A grammar G is LL(1) if and only if whenever A → α | β are
two distinct productions, the following hold:
◦ For no terminal a do both α and β derive strings beginning with a
◦ At most one of α and β can derive the empty string
◦ If β ==>* ε, then α does not derive any string beginning with a terminal in
FOLLOW(A). Likewise, if α ==>* ε, then β does not derive any string
beginning with a terminal in FOLLOW(A)
LL(1) Grammars
46
 Predictive parsers can be constructed for LL(1) grammars since
the proper production to apply can be selected by looking only
at the current input symbol
 Next algorithm collects information from FIRST and FOLLOW
sets
◦ Into a predictive parsing table M[A,a], where A is a nonterminal and a is a
terminal
◦ IDEA
 Production A → α is chosen if the next input symbol a is in FIRST(α)
 If α derives ε, we again choose A → α if the current input symbol is in FOLLOW(A),
or if the $ on the input has been reached and $ is in FOLLOW(A)
LL(1) Grammars
47
Algorithm: Construction of Predictive Parsing Table
Input : Grammar G
Output: Parsing Table M
Method: For each production A → α of the grammar, do the following:
1. For each terminal a in FIRST(α), add A → α to M[A,a].
2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add
A → α to M[A,b]. If ε is in FIRST(α) and $ is in FOLLOW(A), add
A → α to M[A,$] as well
If, after performing the above, there is no production at all in M[A,a],
then set M[A,a] to error
LL(1) Grammars
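The table-construction algorithm can be sketched as below for the expression grammar. To keep the sketch self-contained, the FIRST and FOLLOW sets are written out by hand rather than computed; epsilon bodies are empty tuples and "$" is the end marker:

```python
G = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", "eps"}, "T": {"(", "id"},
         "T'": {"*", "eps"}, "F": {"(", "id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def first_of_body(body):
    """FIRST of a production body; terminals stand for themselves."""
    out = set()
    for sym in body:
        f = FIRST[sym] if sym in G else {sym}
        out |= f - {"eps"}
        if "eps" not in f:
            return out
    out.add("eps")
    return out

M = {}
for A, bodies in G.items():
    for body in bodies:
        f = first_of_body(body)
        for a in f - {"eps"}:                 # rule 1: a in FIRST(body)
            M[A, a] = body
        if "eps" in f:
            for b in FOLLOW[A]:               # rule 2: nullable body, use FOLLOW(A)
                M[A, b] = body

assert M["E", "id"] == ("T", "E'")
assert M["E'", ")"] == ()          # E' -> eps on ')'
assert ("F", "+") not in M         # absent entry = error
```

Absent entries play the role of the error entries in the algorithm; a conflict (two productions landing in the same cell) would mean the grammar is not LL(1).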
48
 Built by maintaining a stack explicitly, rather than implicitly via
recursive calls
 Parser mimics a leftmost derivation
◦ If w is the input that has been matched so far, then the stack holds a
sequence of grammar symbols α such that S ==>*lm w α
Nonrecursive Predictive Parsing
49
Nonrecursive Predictive Parsing
50
Nonrecursive Predictive Parsing
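The table-driven parsing loop can be sketched as follows, with the LL(1) table for the expression grammar hard-coded (a sketch under those assumptions, not a general parser generator; epsilon bodies are empty tuples):

```python
NONTERMINALS = {"E", "E'", "T", "T'", "F"}
M = {
    ("E", "id"): ("T", "E'"),  ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): (), ("E'", "$"): (),
    ("T", "id"): ("F", "T'"),  ("T", "("): ("F", "T'"),
    ("T'", "*"): ("*", "F", "T'"), ("T'", "+"): (), ("T'", ")"): (), ("T'", "$"): (),
    ("F", "id"): ("id",),      ("F", "("): ("(", "E", ")"),
}

def parse(tokens):
    """Return the list of productions applied, or None on a syntax error."""
    input_ = tokens + ["$"]
    stack = ["$", "E"]                     # start symbol on top of $
    output, i = [], 0
    while stack[-1] != "$":
        top, a = stack[-1], input_[i]
        if top == a:                       # terminal on top matches input
            stack.pop(); i += 1
        elif top in NONTERMINALS and (top, a) in M:
            body = M[top, a]
            output.append((top, body))     # record the production used
            stack.pop()
            stack.extend(reversed(body))   # push body right-to-left
        else:
            return None                    # mismatch or empty table entry: error
    return output if input_[i] == "$" else None

assert parse(["id", "+", "id", "*", "id"]) is not None
assert parse(["id", "+", "+"]) is None
```

The productions recorded in `output` come out in exactly the order of a leftmost derivation, which is the sense in which the slide says the parser "mimics" one.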
51
 This discussion of error recovery refers to the stack of a table-driven
predictive parser
◦ It makes explicit the terminals and nonterminals that the parser hopes to
match with the remainder of the input
 An error is detected during predictive parsing when
◦ The terminal on top of the stack does not match the next input symbol
◦ Nonterminal A is on top of the stack, a is the next input symbol, and
M[A,a] is error, i.e., parsing table entry is empty
Error Recovery in Predictive Parsing
52
 Panic Mode
◦ Based on the idea of skipping symbols on the input until a token in a
selected set of synchronizing tokens appears
◦ Some heuristics are :
 Place all symbols in FOLLOW(A) into the synchronizing set for nonterminal
A. If we skip tokens until an element of FOLLOW (A) is seen and pop A
from the stack, parsing can continue
 If we add symbols in FIRST(A) to the synchronizing set, then it may be
possible to resume parsing according to A if a symbol in FIRST(A) appears
in the input
 If a nonterminal can generate the empty string, then the production deriving
ε can be used as default
 If a terminal on top of the stack cannot be matched, then pop the terminal,
issue a message saying that the terminal was inserted and continue parsing
Error Recovery in Predictive Parsing
53
 Synchronizing tokens added to the parsing table
Error Recovery in Predictive Parsing
54
 Parsing and Error Recovery moves
Error Recovery in Predictive Parsing
55
 Phrase-level Recovery
◦ Implemented by filling the blank entries in the table with pointers to error
routines
◦ These routines may change, insert, or delete symbols on the input and
issue appropriate error messages. They may also pop from the stack
◦ Alteration of stack symbols or pushing of new symbols onto the stack is
questionable for several reasons
 Steps carried out by the parser might not correspond to the derivation of any
word in the language at all
 Must ensure that there is no possibility of an infinite loop
◦ Checking that any recovery action results in an input symbol being
consumed is a good way to protect against such loops
Error Recovery in Predictive Parsing

unit2_cdunit2_cdunit2_cdunit2_cdunit2_cd.pptx

  • 1.
  • 2.
    2  The Roleof Parser  Context Free Grammars  Writing a Grammar  Parsing – Top Down and Bottom Up  Simple LR Parser  Powerful LR Parser  Using Ambiguous Grammars  Parser Generators OBJECTIVES
  • 3.
    3  Syntax ◦ Theway in which words are stringed together to form phrases, clauses and sentences  Syntax Analysis ◦ The task concerned with fitting a sequence of tokens into a specified sequence  Parsing ◦ To break a sentence down into its component parts with an explanation of the form, function, and syntactical relationship of each part Syntax Analysis and Parsing
  • 4.
    4  Every PLhas rules that describe the syntactic structure of well-formed programs ◦ In C, a program is made up of functions, declarations, statements etc ◦ Syntax can be specified using context-free grammars or BNF  Grammars offer benefits to both designers & compiler writers ◦ Gives precise yet easy-to-understand syntactic specification of PL’s ◦ Helps construct efficient parsers automatically ◦ Provides a good structure to the language which in turn helps generate correct object code ◦ Allows a language to be evolved iteratively by adding new constructs to perform new tasks Syntax Analysis
  • 5.
    5  Parser obtainsa string of tokens from the lexical analyzer  Verifies that the string of token names can be generated by the grammar  Constructs a parse tree and passes it to the rest of the compiler  Reports errors when the program does not match the syntactic structure of the source language The Role of a Parser
  • 6.
    6  Universal ParsingMethod ◦ Can parse any grammar, but too inefficient to use in any practical compilers ◦ Example: Earley’s Algorithm  Top-Down Parsing ◦ Build parse trees from the root to the leaves ◦ Can be generated automatically or written manually ◦ Example : LL Parser  Bottom-Up Parsing ◦ Starts from the leaves and work their way up to the root ◦ Can only be generated automatically ◦ Example : LR Parser Parsing Methods
  • 7.
    7  Some ofthe grammars used in the discussion ◦ LR grammar for bottom-up parsing E  E + T | T T  T * F | F F  ( E ) | id ◦ Non-recursive variant of the grammar E  T E` E`  + T E` | ε T  F T` T  * F T` | ε F  ( E ) | id Representative Grammars
  • 8.
    8  A compilermust locate and track down errors but the error handling is left to the compiler designer  Common PL errors at many different levels ◦ Lexical Errors  Misspellings of identifiers, keywords or operators ◦ Syntactic Errors  Misplaced semicolons or extra or missing braces, that is “{“ or “}” ◦ Semantic Errors  Type mismatches between operators and operands ◦ Logical Errors  Incorrect reasoning by the programmer or use of assignment operator (=) instead of comparison operator (==) Syntax Error Handling
  • 9.
    9  Parsing methodsdetect errors as soon as possible, i.e., when stream of tokens cannot be parsed further  Have the viable-prefix property ◦ Detect an error has occurred as soon as a prefix of the input that cannot be completed to form a string is seen  Errors appear syntactic and are exposed when parsing cannot continue  Goals of error handler in a parser ◦ Report the presence of errors clearly and accurately ◦ Recover from each error quickly enough to detect subsequent errors ◦ Add minimal overhead to the processing of correct programs Syntax Error Handling
  • 10.
    10  How shouldthe parser recover once an error is detected? ◦ Quit with an informative error message when it detects the first error ◦ Additional errors are uncovered if parser can restore to a state where processing of the input can continue ◦ If errors pile up, its better to stop after exceeding some error limit  Four error-recovery strategies ◦ Panic-Mode Recovery ◦ Phrase-Level Recovery ◦ Error Productions ◦ Global Correction Error-Recovery Strategies
  • 11.
    11  Panic-Mode Recovery ◦Parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found  Synchronizing tokens – delimiters like semicolon or “}” ◦ Compiler designer must select the synchronizing tokens appropriate for the source language ◦ Advantage of simplicity and is guaranteed not to go into an infinite loop  Phrase-Level Recovery ◦ Perform local correction on remaining input, that is, may replace a prefix of the remaining input by some string that allows the parser to continue ◦ Choose replacements that do not lead to infinite loops ◦ Drawback is the difficulty to cope up with situations in which actual error has occurred before the point of detection Error-Recovery Strategies
  • 12.
    12  Error Productions ◦Anticipate the common errors that might be encountered ◦ Augment grammar for language with productions that generate erroneous constructs ◦ Such a parser detects anticipated errors when error production is used during parsing  Global Correction ◦ Make as few changes in processing an incorrect input string ◦ Use algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction ◦ Given an incorrect input string x and grammar G, these algorithms find a parse tree for a related string y, such that the number of changes required to transform x to y is small ◦ Too costly to implement in terms of time and space Error-Recovery Strategies
  • 13.
    13  A context-freegrammar G = (T, N, S, P) consists of: ◦ T, a set of terminals (scanner tokens - symbols that may not appear on the left side of rule,). ◦ N, a set of nonterminals (syntactic variables generated by productions – symbols on left or right of rule). ◦ S, a designated start symbol nonterminal. ◦ P, a set of productions (Rules). Each production consists of  A nonterminal called the left side of the production  The symbol   A right side consisting of zero or more terminals and non-terminals Context-Free Grammars- Formal Definition
  • 14.
    14  Example grammarfor simple arithmetic expressions expression  expression + term expression  expression – term expression  term term  term * factor term  term / factor term  factor factor  ( expression ) factor  id Context-Free Grammars- Formal Definition
  • 15.
    15  Terminal symbols ◦Lowercase letters early in the alphabet like a,b,c ◦ Operator symbols such as +, * ◦ Punctuation symbols such as parentheses, and comma ◦ Digits 0,1,….,9  Nonterminal symbols ◦ Uppercase letters early in the alphabet like A,B,C ◦ Letter S, which, when used stands for the start symbol  Lowercase letters late in the alphabet like u,v,..z represent string of terminals  Uppercase letters late in the alphabet such as X, Y, Z represent grammar symbols that is either terminal or nonterminal Notational Conventions
  • 16.
    16  Lower caseGreek letters α, β, γ represent strings of grammar symbols  If A → α1 ,, A → α2 …….. A → αn are all Productions with A on the LHS ( known as A-productions), then, A → α1 | α2 …| αn (α1 , α2 … αn are alternatives of A)  Unless otherwise stated , the LHS of the first production is the start nonterminals  Example: previous grammar written using these conventions E  E + T | E – T | T T  T * F | T / F | F F  ( E ) | id` Notational Conventions
  • 17.
    17  A wayof showing how an input sentence is recognized with a grammar  Beginning with the start symbol, each rewriting step replaces a nonterminal by the body of one of its productions  Consider the following grammar E  E + E | E * E | – E | ( E ) | id ◦ “E derives –E” can be denoted by E ==> – E ◦ A derivation of – ( id ) from E E ==> – E ==> – ( E ) ==> – ( id ) Derivations
  • 18.
    18  Symbols toindicate derivation ◦ ==> denotes “derives in one step” ◦ ==> denotes “derives in zero or more steps” ◦ ==> denotes “derives in one or more steps”  Given the grammar G with start symbol S, a string ω is part of the language L (G), if and only if S ==> ω  If S ==> α and α contains ◦ Only terminals then it is a sentence of G ◦ Both terminals and nonterminals then it is a Sentential form of G  Two choices to be made at each step in the derivation ◦ Choose which nonterminal to replace ◦ Pick a production with that nonterminal as head Derivations * + * *
  • 19.
    19  Leftmost Derivation ◦The leftmost nonterminal in each sentential form is always chosen to be replaced ◦ If α ==> β, is a step in which the leftmost nonterminal in α is replaced, we write α ==> β  Rightmost Derivation ◦ Replace the rightmost nonterminal in each sentential form ◦ Written as α ==> β  If S ==> α, then we can say that α is left-sentential form of the grammar Derivations lm rm lm *
  • 20.
    20  A graphicalrepresentation for a derivation showing how to derive the string of a language from grammar starting from Start symbol ◦ The interior node is labeled with the nonterminal in the head of the production ◦ Children are labeled by the symbols in the RHS of the production  Yield of the tree ◦ The string derived or generated from the nonterminal at the root of the tree ◦ Obtained by reading the leaves of the parse tree from left to right Parse Trees and Derivations
  • 21.
    21 Parse Trees andDerivations Fig: Parse tree for – (id +id)
  • 22.
    22  A grammarthat produces more than one parse tree for some sentence is said to be ambiguous ◦ Can be a leftmost derivation or a rightmost derivation for the same sentence Ambiguity Fig: Two Parse trees for id+id*id
  • 23.
    23  Every constructthat can be described by a regular expression can be described by a grammar, but not vice-versa.  Grammar construction from the NFA ◦ For each state i of the NFA, create a nonterminal Ai ◦ If state i has a transition to state j on input a, add the production Ai  aAj If state i goes to state j on input ε, add the production Ai  Aj ◦ If i is an accepting state, add Ai  ε ◦ If i is the start state, make Ai the start symbol of the grammar CFG’s Versus Regular Expressions
  • 24.
  • 25.
  • 26.
    26  Lexical versusSyntactic Analysis ◦ Why use RE’s to define lexical syntax of a language?  Separating syntactic structure into lexical and non-lexical parts modularizes a compiler into two components  Lexical rules are simple and easy to describe  RE’s provide a concise and easier-to-understand notation for tokens than grammars  Efficient lexical analyzers can be constructed automatically from RE’s than from arbitrary grammars ◦ RE’s are most useful for describing the structure of constructs such as identifiers, constants, keywords and whitespace ◦ Grammars are useful for describing nested structures such as balanced parentheses, matching begin-end’s, corresponding if-then- else’s Writing a Grammar
  • 27.
27
 Ambiguous grammars can sometimes be rewritten to eliminate the ambiguity
◦ Example: the “dangling-else” grammar
stmt → if expr then stmt
     | if expr then stmt else stmt
     | other
◦ Parse tree for the statement: if E1 then S1 else if E2 then S2 else S3
Eliminating Ambiguity
28
 The grammar is ambiguous, since the following string has two parse trees:
if E1 then if E2 then S1 else S2
Eliminating Ambiguity
29
 Rewriting the “dangling-else” grammar
◦ The general rule is, “Match each else with the closest unmatched then”
◦ The idea is that a statement appearing between a then and an else must be “matched”
◦ That is, each interior statement must not end with an unmatched or open then
◦ A matched statement is either an if-then-else statement containing no open statements, or any other kind of unconditional statement
Eliminating Ambiguity
30
 Left-recursive grammar
◦ A grammar that has a nonterminal A such that there is a derivation A ⇒+ Aα for some string α
◦ Top-down parsing methods cannot handle left-recursive grammars
◦ Immediate left recursion: a production of the form A → Aα
◦ A left-recursive pair of productions A → Aα | β can be replaced by:
A → β A′
A′ → α A′ | ε
◦ Eliminating immediate left recursion
 First group the A-productions as
A → Aα1 | Aα2 | . . . | Aαm | β1 | β2 | . . . | βn
where no βi begins with an A
 Then replace the A-productions by
A → β1 A′ | β2 A′ | . . . | βn A′
A′ → α1 A′ | α2 A′ | . . . | αm A′ | ε
Eliminating Left Recursion
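The grouping-and-replacing transformation above can be sketched in Python. Alternatives are represented as lists of symbol names (an illustrative encoding, not from the slides), with the empty list standing for ε:

```python
# A sketch of immediate left-recursion elimination, per the rule above.
# grammar maps a nonterminal to a list of alternatives (lists of symbols).
def eliminate_immediate_left_recursion(grammar, A):
    """Replace A -> A a1 | ... | A am | b1 | ... | bn with
       A  -> b1 A' | ... | bn A'
       A' -> a1 A' | ... | am A' | eps"""
    alphas = [alt[1:] for alt in grammar[A] if alt and alt[0] == A]
    betas = [alt for alt in grammar[A] if not alt or alt[0] != A]
    if not alphas:                      # no immediate left recursion
        return grammar
    A_prime = A + "'"
    grammar[A] = [beta + [A_prime] for beta in betas]
    grammar[A_prime] = [alpha + [A_prime] for alpha in alphas] + [[]]  # [] is eps
    return grammar

# Classic example: E -> E + T | T becomes E -> T E', E' -> + T E' | eps
g = {"E": [["E", "+", "T"], ["T"]]}
eliminate_immediate_left_recursion(g, "E")
print(g)   # {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}
```

Note this handles only immediate left recursion; the general algorithm on the next slide also removes recursion through cycles of nonterminals.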
31
 Algorithm to eliminate left recursion from a grammar
Eliminating Left Recursion
32
 When the choice between two alternative A-productions is not clear,
◦ We rewrite the grammar to defer the decision until enough of the input has been seen
 In general, if A → αβ1 | αβ2 are two A-productions
◦ We do not know whether to expand A to αβ1 or to αβ2
◦ Defer the decision by expanding A to αA′
◦ After seeing the input derived from α, we expand A′ to β1 or to β2
◦ Left-factored, the original productions become
A → α A′
A′ → β1 | β2
Left Factoring
33
 Algorithm for left-factoring a grammar
◦ For each nonterminal A, find the longest prefix α common to two or more of its alternatives
◦ If α ≠ ε, replace all of the A-productions
A → αβ1 | αβ2 | . . . | αβn | γ, where γ represents all alternatives that do not begin with α, by
A → αA′ | γ
A′ → β1 | β2 | . . . | βn
◦ Repeatedly apply this transformation until no two alternatives for a nonterminal have a common prefix
Left Factoring
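One pass of this transformation can be sketched as follows. The encoding (alternatives as lists of symbols, "eps" for ε) and the dangling-else-style example grammar are illustrative:

```python
from itertools import combinations

def common_prefix(a, b):
    """Longest common prefix of two alternatives."""
    p = []
    for x, y in zip(a, b):
        if x != y:
            break
        p.append(x)
    return p

def left_factor(grammar, A):
    """One pass: find the longest prefix alpha shared by two or more
       alternatives of A and rewrite  A -> alpha b1 | ... | alpha bn | gamma
       as  A -> gamma | alpha A',  A' -> b1 | ... | bn."""
    alts = grammar[A]
    alpha = max((common_prefix(a, b) for a, b in combinations(alts, 2)),
                key=len, default=[])
    if not alpha:
        return grammar                  # no common prefix: nothing to factor
    A_prime = A + "'"
    factored = [alt for alt in alts if alt[:len(alpha)] == alpha]
    gamma = [alt for alt in alts if alt[:len(alpha)] != alpha]
    grammar[A] = gamma + [alpha + [A_prime]]
    grammar[A_prime] = [alt[len(alpha):] or ["eps"] for alt in factored]
    return grammar

# Dangling-else style example:
# S -> i E t S | i E t S e S | a  becomes  S -> a | i E t S S', S' -> eps | e S
g = {"S": [["i", "E", "t", "S"], ["i", "E", "t", "S", "e", "S"], ["a"]]}
left_factor(g, "S")
print(g)
```

A full implementation would repeat this pass until no nonterminal has two alternatives with a common prefix, as the slide specifies.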
34
 Some syntactic constructs found in PL’s cannot be specified using grammars alone
◦ Example 1: The problem of checking that identifiers are declared before they are used in the program
The abstract language is L1 = {wcw | w is in (a|b)*}
◦ Example 2: The problem of checking that the number of formal parameters in the declaration of a function agrees with the number of actual parameters in a use of the function
The abstract language is L2 = {an bm cn dm | n ≥ 1 and m ≥ 1}
Non-Context-Free Language Constructs
35
 Constructing a parse tree for the input string, starting from the root and creating the nodes in preorder
 Can be viewed as finding a leftmost derivation for an input string
 Consider the grammar below
E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
Fig: Sequence of parse trees for the input id + id * id
Top-Down Parsing
37
 Determining the production to be applied for a nonterminal A is the key problem at each step of a top-down parse
 Once an A-production is chosen, the rest of the parsing process consists of “matching” the terminal symbols in the production body with the input string
 Two types
◦ Recursive-descent parsing
 May require backtracking to find the correct A-production to be applied
◦ Predictive parsing
 No backtracking is required
 Chooses the correct A-production by looking ahead at the input a fixed number of symbols
 LL(k) grammars: those for which we can construct predictive parsers that look k symbols ahead in the input
Top-Down Parsing
38
 The parsing program consists of a set of procedures, one for each nonterminal
 Execution begins with the procedure for the start symbol, which halts and announces success if its procedure body scans the entire input string
Recursive-Descent Parsing
39
 To allow backtracking, the code needs to be modified
◦ We cannot choose a unique A-production at line (1), so we must try each of the several productions in some order
◦ Failure at line (7) is not ultimate failure, but tells us that we need to return to line (1) and try another A-production
◦ Only if there are no more A-productions to try do we declare that an input error has been found
◦ To try another A-production, we need to be able to reset the input pointer to where it was when we first reached line (1)
◦ A local variable is needed to store this input pointer
Recursive-Descent Parsing
40
 Consider the grammar:
S → c A d
A → a b | a
and the input string w = cad
Recursive-Descent Parsing
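A minimal sketch of recursive-descent parsing with backtracking for this grammar: each procedure returns the new input position on success or None on failure, and parse_A saves the input pointer so it can back up after trying A → a b:

```python
# Recursive descent with backtracking for S -> c A d, A -> a b | a.
def parse_S(w, i):
    """Try S -> c A d starting at position i; return new position or None."""
    if i < len(w) and w[i] == 'c':
        j = parse_A(w, i + 1)
        if j is not None and j < len(w) and w[j] == 'd':
            return j + 1
    return None

def parse_A(w, i):
    """Try A -> a b first; on failure, backtrack and try A -> a."""
    saved = i                              # remember input pointer
    if i < len(w) and w[i] == 'a':
        if i + 1 < len(w) and w[i + 1] == 'b':
            return i + 2                   # A -> a b succeeded
        return saved + 1                   # backtrack, use A -> a
    return None

w = "cad"
print(parse_S(w, 0) == len(w))   # True: w is in the language
```

On w = cad, A → a b fails at the b, so the parser resets to the saved pointer and succeeds with A → a, exactly the scenario the slide describes.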
41
 During parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input symbol
 FIRST(α)
◦ The set of terminals that begin strings derived from α. If α ⇒* ε, then ε is also in FIRST(α)
◦ Example: if A ⇒* cγ, then c is in FIRST(A)
◦ Consider two A-productions A → α | β, where FIRST(α) and FIRST(β) are disjoint sets
◦ We can choose between them by looking at the next input symbol a, since a can be in at most one of FIRST(α) and FIRST(β), not both
FIRST and FOLLOW
42
 FOLLOW(A), for a nonterminal A
◦ The set of terminals a that can appear immediately to the right of A in some sentential form
◦ That is, the set of terminals a such that there exists a derivation of the form S ⇒* αAaβ
◦ If A can be the rightmost symbol in some sentential form, then $ is in FOLLOW(A)
FIRST and FOLLOW
43
 To compute FIRST(X), apply the following rules until no more terminals or ε can be added to any FIRST set
1. If X is a terminal, then FIRST(X) = { X }
2. If X is a nonterminal and X → Y1 Y2 . . . Yk is a production for some k ≥ 1
 Place a in FIRST(X) if for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), . . . , FIRST(Yi-1); i.e., Y1 . . . Yi-1 ⇒* ε
 If ε is in FIRST(Yj) for all j = 1, 2, . . . , k, then add ε to FIRST(X)
 If Y1 does not derive ε, then we add nothing more to FIRST(X); but if Y1 ⇒* ε, then we add FIRST(Y2), and so on
3. If X → ε is a production, then add ε to FIRST(X)
FIRST and FOLLOW
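These rules can be sketched as a fixed-point computation. The encoding below (productions as lists of symbols, "eps" for ε) is illustrative, applied to the expression grammar from the Top-Down Parsing slide:

```python
# Fixed-point FIRST computation for E -> T E', E' -> + T E' | eps,
# T -> F T', T' -> * F T' | eps, F -> ( E ) | id.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["eps"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["eps"]],
    "F":  [["(", "E", ")"], ["id"]],
}

def compute_first(grammar):
    first = {A: set() for A in grammar}

    def first_of(symbol):
        if symbol == "eps":
            return {"eps"}
        if symbol not in grammar:           # rule 1: terminal
            return {symbol}
        return first[symbol]

    changed = True
    while changed:                          # iterate until nothing is added
        changed = False
        for A, alts in grammar.items():
            for alt in alts:                # rule 2: scan Y1 Y2 ... Yk
                result, all_eps = set(), True
                for Y in alt:
                    f = first_of(Y)
                    result |= f - {"eps"}
                    if "eps" not in f:
                        all_eps = False
                        break               # Yi cannot derive eps; stop here
                if all_eps:                 # every Yi derives eps
                    result.add("eps")
                if not result <= first[A]:
                    first[A] |= result
                    changed = True
    return first

first = compute_first(GRAMMAR)
print(first["E"])    # {'(', 'id'}
```

The loop keeps re-scanning productions until no FIRST set grows, mirroring the "until no more terminals or ε can be added" condition above.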
44
 To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can be added to any FOLLOW set
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B)
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B)
FIRST and FOLLOW
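The FOLLOW rules can be sketched the same way. To keep the block self-contained, the FIRST sets for the expression grammar are taken as given (they match the standard results for this grammar); the encoding is again illustrative:

```python
# Fixed-point FOLLOW computation for the same expression grammar.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["eps"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["eps"]],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", "eps"}, "T": {"(", "id"},
         "T'": {"*", "eps"}, "F": {"(", "id"}}

def first_of_string(beta):
    """FIRST of a string of grammar symbols."""
    result = set()
    for Y in beta:
        f = {"eps"} if Y == "eps" else FIRST.get(Y, {Y})
        result |= f - {"eps"}
        if "eps" not in f:
            return result
    return result | {"eps"}

def compute_follow(grammar, start):
    follow = {A: set() for A in grammar}
    follow[start].add("$")                       # rule 1
    changed = True
    while changed:
        changed = False
        for A, alts in grammar.items():
            for alt in alts:
                for i, B in enumerate(alt):
                    if B not in grammar:
                        continue                 # only nonterminals get FOLLOW
                    beta = alt[i + 1:]
                    f = first_of_string(beta) if beta else {"eps"}
                    # rule 2: FIRST(beta) - {eps}; rule 3: FOLLOW(A) if eps in it
                    new = (f - {"eps"}) | (follow[A] if "eps" in f else set())
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

follow = compute_follow(GRAMMAR, "E")
print(follow["E"])    # {')', '$'}
```

Treating an empty β as deriving ε folds rules 2 and 3 into one update per occurrence of B.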
45
 Predictive parsers needing no backtracking can be constructed for a class of grammars called LL(1) grammars
◦ The first “L” stands for scanning the input from left to right
◦ The second “L” stands for producing a leftmost derivation
◦ The “1” stands for using one input symbol of lookahead at each step to make parsing decisions
 A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions, the following conditions hold:
◦ For no terminal a do both α and β derive strings beginning with a
◦ At most one of α and β can derive the empty string
◦ If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A)
LL(1) Grammars
46
 Predictive parsers can be constructed for LL(1) grammars, since the proper production to apply can be selected by looking only at the current input symbol
 The next algorithm collects information from the FIRST and FOLLOW sets
◦ Into a predictive parsing table M[A, a], where A is a nonterminal and a is a terminal
◦ IDEA
 Production A → α is chosen if the next input symbol a is in FIRST(α)
 If α = ε, we again choose A → α if the current input symbol is in FOLLOW(A), or if the $ on the input has been reached and $ is in FOLLOW(A)
LL(1) Grammars
47
Algorithm: Construction of a Predictive Parsing Table
Input: Grammar G
Output: Parsing table M
Method: For each production A → α of the grammar, do the following
1. For each terminal a in FIRST(α), add A → α to M[A, a]
2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to M[A, b]. If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $] as well
If, after performing the above, there is no production at all in M[A, a], then set M[A, a] to error
LL(1) Grammars
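The table construction can be sketched directly from the two rules above, with the FIRST and FOLLOW sets for the expression grammar taken as given (the encoding and names are illustrative):

```python
# Predictive parsing table construction for the expression grammar.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["eps"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["eps"]],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", "eps"}, "T": {"(", "id"},
         "T'": {"*", "eps"}, "F": {"(", "id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def first_of_string(alpha):
    """FIRST of a production body."""
    result = set()
    for Y in alpha:
        f = {"eps"} if Y == "eps" else FIRST.get(Y, {Y})
        result |= f - {"eps"}
        if "eps" not in f:
            return result
    return result | {"eps"}

def build_table(grammar):
    M = {}                                   # missing entries mean "error"
    for A, alts in grammar.items():
        for alpha in alts:
            f = first_of_string(alpha)
            for a in f - {"eps"}:            # rule 1: a in FIRST(alpha)
                M[A, a] = alpha
            if "eps" in f:                   # rule 2: b in FOLLOW(A), incl. $
                for b in FOLLOW[A]:
                    M[A, b] = alpha
    return M

M = build_table(GRAMMAR)
print(M["E", "id"])    # ['T', "E'"]
print(M["E'", "$"])    # ['eps']
```

For an LL(1) grammar each (A, a) entry receives at most one production; a collision here would be evidence that the grammar is not LL(1).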
48
 Built by maintaining a stack explicitly, rather than via recursive calls
 The parser mimics a leftmost derivation
◦ If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that S ⇒*lm w α
Nonrecursive Predictive Parsing
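A sketch of the table-driven parser: the stack starts with $ and the start symbol, a terminal on top is matched against the input, and a nonterminal on top is replaced by the body that the table selects. The table entries below are the standard LL(1) table for the expression grammar, taken as given:

```python
# Nonrecursive predictive parsing with an explicit stack.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): ["eps"], ("E'", "$"): ["eps"],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): ["eps"], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): ["eps"], ("T'", "$"): ["eps"],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    """Return the sequence of productions used (a leftmost derivation),
       or raise SyntaxError on a parsing error."""
    tokens = tokens + ["$"]
    stack = ["$", "E"]                       # start symbol on top
    i, used = 0, []
    while stack[-1] != "$":
        X, a = stack[-1], tokens[i]
        if X == a:                           # terminal on top: match it
            stack.pop(); i += 1
        elif X not in NONTERMINALS:
            raise SyntaxError(f"expected {X!r}, saw {a!r}")
        elif (X, a) not in TABLE:            # empty table entry: error
            raise SyntaxError(f"no production for ({X}, {a})")
        else:                                # expand X by the table entry
            body = TABLE[X, a]
            used.append((X, body))
            stack.pop()
            if body != ["eps"]:
                stack.extend(reversed(body))  # leftmost symbol ends up on top
    if tokens[i] != "$":
        raise SyntaxError("trailing input")
    return used

print(len(parse(["id", "+", "id", "*", "id"])))   # 11 productions applied
```

Pushing the body reversed keeps the leftmost symbol of the body on top of the stack, which is what makes the sequence of expansions a leftmost derivation.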
51
 Error recovery can refer to the stack of a table-driven predictive parser
◦ The stack makes explicit the terminals and nonterminals that the parser hopes to match with the remainder of the input
 An error is detected during predictive parsing when
◦ The terminal on top of the stack does not match the next input symbol
◦ A nonterminal A is on top of the stack, a is the next input symbol, and M[A, a] is error, i.e., the parsing table entry is empty
Error Recovery in Predictive Parsing
52
 Panic Mode
◦ Based on the idea of skipping symbols on the input until a token in a selected set of synchronizing tokens appears
◦ Some heuristics are:
 Place all symbols in FOLLOW(A) into the synchronizing set for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and then pop A from the stack, parsing can likely continue
 If we add symbols in FIRST(A) to the synchronizing set, then it may be possible to resume parsing according to A if a symbol in FIRST(A) appears in the input
 If a nonterminal can generate the empty string, then the production deriving ε can be used as a default
 If a terminal on top of the stack cannot be matched, then pop the terminal, issue a message saying that the terminal was inserted, and continue parsing
Error Recovery in Predictive Parsing
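The first heuristic (synchronize on FOLLOW(A)) can be sketched in a few lines. The FOLLOW sets and the error scenario below are illustrative, using the expression grammar's nonterminals:

```python
# Panic-mode sketch: on an empty entry M[A, a], skip input tokens until one
# in the synchronizing set of A (here, FOLLOW(A)) appears, then the caller
# pops A from the stack and resumes parsing.
FOLLOW = {"E": {")", "$"}, "T": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def panic_mode_sync(A, tokens, i):
    """Skip tokens until a synchronizing token for A; return the new index."""
    while tokens[i] not in FOLLOW[A] and tokens[i] != "$":
        print(f"error: skipping unexpected token {tokens[i]!r}")
        i += 1
    return i          # caller pops A and continues with tokens[i]

# Suppose the parser hits an empty entry with E on top at position 2:
tokens = ["(", "id", "#", "id", ")", "$"]
print(panic_mode_sync("E", tokens, 2))   # 4: stopped at ')', in FOLLOW(E)
```

Because skipping always consumes input (or reaches $), this recovery cannot loop forever, which is the safety property the phrase-level discussion below also insists on.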
53
 Synchronizing tokens added to the parsing table
Error Recovery in Predictive Parsing
54
 Parsing and error recovery moves
Error Recovery in Predictive Parsing
55
 Phrase-level Recovery
◦ Implemented by filling the blank entries in the parsing table with pointers to error routines
◦ These routines may change, insert, or delete symbols on the input and issue appropriate error messages. They may also pop from the stack
◦ Alteration of stack symbols or pushing of new symbols onto the stack is questionable for several reasons
 The steps carried out by the parser might then not correspond to the derivation of any word in the language at all
 We must ensure that there is no possibility of an infinite loop
◦ Checking that any recovery action eventually results in an input symbol being consumed is a good way to protect against such loops
Error Recovery in Predictive Parsing