The Role of a Parser
Context Free Grammars
Writing a Grammar
Parsing – Top Down and Bottom Up
Simple LR Parser
Powerful LR Parser
Using Ambiguous Grammars
Parser Generators
OBJECTIVES
Syntax
◦ The way in which words are strung together to form phrases,
clauses and sentences
Syntax Analysis
◦ The task concerned with checking whether a sequence of tokens conforms
to the specified grammar
Parsing
◦ To break a sentence down into its component parts with an
explanation of the form, function, and syntactical relationship
of each part
Syntax Analysis and Parsing
Every PL has rules that describe the syntactic structure of
well-formed programs
◦ In C, a program is made up of functions, declarations, statements, etc.
◦ Syntax can be specified using context-free grammars or BNF
Grammars offer benefits to both designers & compiler writers
◦ Gives a precise yet easy-to-understand syntactic specification of PLs
◦ Helps construct efficient parsers automatically
◦ Provides a good structure to the language which in turn helps generate
correct object code
◦ Allows a language to be evolved iteratively by adding new constructs
to perform new tasks
Syntax Analysis
The parser obtains a string of tokens from the lexical analyzer
Verifies that the string of token names can be generated by the
grammar
Constructs a parse tree and passes it to the rest of the compiler
Reports errors when the program does not match the syntactic structure
of the source language
The Role of a Parser
Universal Parsing Methods
◦ Can parse any grammar, but too inefficient to use in practical
compilers
◦ Example: Earley’s Algorithm
Top-Down Parsing
◦ Build parse trees from the root to the leaves
◦ Can be generated automatically or written manually
◦ Example : LL Parser
Bottom-Up Parsing
◦ Start from the leaves and work their way up to the root
◦ Can only be generated automatically
◦ Example : LR Parser
Parsing Methods
Some of the grammars used in the discussion
◦ LR grammar for bottom-up parsing
E → E + T | T
T → T * F | F
F → ( E ) | id
◦ Non-recursive variant of the grammar
E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
Representative Grammars
A compiler must locate and track down errors, but the error-handling
strategy is left to the compiler designer
Common PL errors occur at many different levels
◦ Lexical Errors
Misspellings of identifiers, keywords or operators
◦ Syntactic Errors
Misplaced semicolons or extra or missing braces, that is “{“ or “}”
◦ Semantic Errors
Type mismatches between operators and operands
◦ Logical Errors
Incorrect reasoning by the programmer or use of assignment operator (=)
instead of comparison operator (==)
Syntax Error Handling
Parsing methods detect errors as soon as possible, i.e., as soon as the
stream of tokens cannot be parsed further
Have the viable-prefix property
◦ Detect that an error has occurred as soon as they see a prefix of the input
that cannot be completed to form a string in the language
Errors appear syntactic and are exposed when parsing cannot
continue
Goals of error handler in a parser
◦ Report the presence of errors clearly and accurately
◦ Recover from each error quickly enough to detect subsequent errors
◦ Add minimal overhead to the processing of correct programs
Syntax Error Handling
How should the parser recover once an error is detected?
◦ Quit with an informative error message when it detects the first error
◦ Additional errors are uncovered if parser can restore to a state where
processing of the input can continue
◦ If errors pile up, it's better to stop after exceeding some error limit
Four error-recovery strategies
◦ Panic-Mode Recovery
◦ Phrase-Level Recovery
◦ Error Productions
◦ Global Correction
Error-Recovery Strategies
Panic-Mode Recovery
◦ The parser discards input symbols one at a time until one of a designated set of
synchronizing tokens is found
Synchronizing tokens – delimiters like semicolon or “}”
◦ Compiler designer must select the synchronizing tokens appropriate for the
source language
◦ Has the advantage of simplicity and is guaranteed not to go into an infinite loop
Phrase-Level Recovery
◦ Perform local correction on remaining input, that is, may replace a prefix
of the remaining input by some string that allows the parser to continue
◦ Choose replacements that do not lead to infinite loops
◦ Drawback is the difficulty of coping with situations in which the actual error
has occurred before the point of detection
Error-Recovery Strategies
Error Productions
◦ Anticipate the common errors that might be encountered
◦ Augment grammar for language with productions that generate erroneous
constructs
◦ Such a parser detects anticipated errors when an error production is used
during parsing
Global Correction
◦ Make as few changes as possible in processing an incorrect input string
◦ Use algorithms for choosing a minimal sequence of changes to obtain a
globally least-cost correction
◦ Given an incorrect input string x and grammar G, these algorithms find a
parse tree for a related string y, such that the number of changes required to
transform x to y is small
◦ Too costly to implement in terms of time and space
Error-Recovery Strategies
A context-free grammar G = (T, N, S, P) consists of:
◦ T, a set of terminals (scanner tokens, i.e., symbols that may not appear on the
left side of a rule).
◦ N, a set of nonterminals (syntactic variables generated by productions;
symbols that may appear on the left or right of a rule).
◦ S, a designated nonterminal, the start symbol.
◦ P, a set of productions (Rules). Each production consists of
A nonterminal called the left side of the production
The symbol →
A right side consisting of zero or more terminals and non-terminals
Context-Free Grammars – Formal Definition
Example grammar for simple arithmetic expressions
expression → expression + term
expression → expression – term
expression → term
term → term * factor
term → term / factor
term → factor
factor → ( expression )
factor → id
Context-Free Grammars – Formal Definition
Terminal symbols
◦ Lowercase letters early in the alphabet, like a, b, c
◦ Operator symbols such as +, *
◦ Punctuation symbols such as parentheses, and comma
◦ Digits 0, 1, . . . , 9
Nonterminal symbols
◦ Uppercase letters early in the alphabet like A,B,C
◦ The letter S, which, when it appears, stands for the start symbol
Lowercase letters late in the alphabet, like u, v, . . . , z, represent
strings of terminals
Uppercase letters late in the alphabet, such as X, Y, Z, represent
grammar symbols, that is, either terminals or nonterminals
Notational Conventions
Lowercase Greek letters α, β, γ represent strings of grammar
symbols
If A → α1, A → α2, . . . , A → αn are all productions with A
on the LHS (known as A-productions), then we may write
A → α1 | α2 | . . . | αn (α1, α2, . . . , αn are the alternatives of A)
Unless otherwise stated, the LHS of the first production is the
start nonterminal
Example: previous grammar written using these conventions
E → E + T | E – T | T
T → T * F | T / F | F
F → ( E ) | id
Notational Conventions
A way of showing how an input sentence is recognized with a
grammar
Beginning with the start symbol, each rewriting step replaces a
nonterminal by the body of one of its productions
Consider the following grammar
E → E + E | E * E | – E | ( E ) | id
◦ “E derives –E” can be denoted by E ==> – E
◦ A derivation of – ( id ) from E
E ==> – E ==> – ( E ) ==> – ( id )
Derivations
Symbols to indicate derivation
◦ ==> denotes “derives in one step”
◦ ==>* denotes “derives in zero or more steps”
◦ ==>+ denotes “derives in one or more steps”
Given the grammar G with start symbol S, a string ω is in the
language L(G) if and only if S ==>* ω
If S ==>* α and α contains
◦ Only terminals, then it is a sentence of G
◦ Both terminals and nonterminals, then it is a sentential form of G
Two choices to be made at each step in the derivation
◦ Choose which nonterminal to replace
◦ Pick a production with that nonterminal as head
Derivations
Leftmost Derivation
◦ The leftmost nonterminal in each sentential form is always chosen to be
replaced
◦ If α ==> β is a step in which the leftmost nonterminal in α is replaced, we
write α ==>lm β
Rightmost Derivation
◦ Replace the rightmost nonterminal in each sentential form
◦ Written as α ==>rm β
If S ==>*lm α, then we can say that α is a left-sentential form of the
grammar
Derivations
A graphical representation for a derivation, showing how to
derive a string of the language from the grammar, starting from
the start symbol
◦ Each interior node is labeled with the nonterminal in the head of the
production
◦ Children are labeled by the symbols in the RHS of the production
Yield of the tree
◦ The string derived or generated from the nonterminal at the root of the
tree
◦ Obtained by reading the leaves of the parse tree from left to right
Parse Trees and Derivations
A grammar that produces more than one parse tree for some
sentence is said to be ambiguous
◦ Equivalently, there is more than one leftmost derivation or more than one
rightmost derivation for the same sentence
Ambiguity
Fig: Two Parse trees for id+id*id
Every construct that can be described by a regular expression
can be described by a grammar, but not vice versa.
Constructing a grammar from an NFA
◦ For each state i of the NFA, create a nonterminal Ai
◦ If state i has a transition to state j on input a, add the production Ai → aAj
If state i goes to state j on input ε, add the production Ai → Aj
◦ If i is an accepting state, add Ai → ε
◦ If i is the start state, make Ai the start symbol of the grammar
CFG’s Versus Regular Expressions
Lexical versus Syntactic Analysis
◦ Why use REs to define the lexical syntax of a language?
Separating syntactic structure into lexical and non-lexical parts
modularizes a compiler into two components
Lexical rules are simple and easy to describe
RE’s provide a concise and easier-to-understand notation for tokens than
grammars
More efficient lexical analyzers can be constructed automatically from REs
than from arbitrary grammars
◦ RE’s are most useful for describing the structure of constructs such
as identifiers, constants, keywords and whitespace
◦ Grammars are useful for describing nested structures such as
balanced parentheses, matching begin-end’s, corresponding if-then-
else’s
Writing a Grammar
Ambiguous grammars can be rewritten to eliminate ambiguity
◦ Example : “Dangling-else” grammar
stmt → if expr then stmt
| if expr then stmt else stmt
| other
◦ Parse tree for the statement: if E1 then S1 else if E2 then S2 else S3
Eliminating Ambiguity
The grammar is ambiguous since the following string has two parse
trees
if E1 then if E2 then S1 else S2
Eliminating Ambiguity
Rewriting the “dangling-else” grammar
◦ The general rule is, “Match each else with the closest unmatched then”
◦ The idea is that a statement appearing between a then and an else must be
“matched”
◦ That is, each interior statement must not end with an unmatched or open
then
◦ A matched statement is either an if-then-else statement containing no open
statements or any other kind of unconditional statement
Eliminating Ambiguity
Left-recursive grammar
◦ A grammar that has a nonterminal A such that there is a derivation
A ==>+ Aα for some string α
◦ Top-down parsing methods cannot handle left-recursive grammars
◦ Immediate left recursion: a production of the form A → Aα
◦ The left-recursive pair of productions A → Aα | β can be replaced by:
A → β A′
A′ → α A′ | ε
◦ Eliminating immediate left recursion
First group the A-productions as
A → Aα1 | Aα2 | . . . | Aαm | β1 | β2 | . . . | βn
where no βi begins with an A
Replace the A-productions by
A → β1 A′ | β2 A′ | . . . | βn A′
A′ → α1 A′ | α2 A′ | . . . | αm A′ | ε
Eliminating Left Recursion
Algorithm to eliminate left recursion from a grammar (see the sketch below)
Eliminating Left Recursion
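Since the full algorithm is shown as a figure in the original slides, here is a minimal Python sketch of the immediate-left-recursion step it is built on; the encoding of alternatives as lists of symbols is an assumption for illustration.

# Sketch: A -> A a1 | ... | A am | b1 | ... | bn   becomes
#         A -> b1 A' | ... | bn A'   and   A' -> a1 A' | ... | am A' | epsilon
def eliminate_immediate_left_recursion(head, alternatives):
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == head]
    others = [alt for alt in alternatives if not (alt and alt[0] == head)]
    if not recursive:
        return {head: alternatives}          # no immediate left recursion
    new_head = head + "'"
    return {
        head: [alt + [new_head] for alt in others],
        new_head: [alt + [new_head] for alt in recursive] + [[]],  # [] is epsilon
    }

# E -> E + T | T   becomes   E -> T E'   and   E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))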
When the choice between two alternative A-productions is not clear,
◦ We rewrite the grammar to defer the decision until enough of the input has
been seen
◦ Consider two productions as below:
In general, if A → αβ1 | αβ2 are two A-productions
◦ We do not know whether to expand A to αβ1 or to αβ2
◦ Defer the decision by expanding A to αA′
◦ After seeing the input derived from α, we expand A′ to β1 or to β2
◦ Left-factored, the original productions become
A → α A′
A′ → β1 | β2
Left Factoring
Algorithm for left-factoring a grammar
◦ For each nonterminal A, find the longest prefix α common to two or more
alternatives
◦ If α ≠ ε, replace all of the A-productions A → αβ1 | αβ2 | . . . | αβn | γ,
where γ represents alternatives that do not begin with α, by
A → αA′ | γ
A′ → β1 | β2 | . . . | βn
◦ Repeatedly apply this transformation until no two alternatives for a
nonterminal have a common prefix
Left Factoring
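A small Python sketch of one round of this transformation is given below; the list-of-symbols encoding of alternatives is an assumption, and the "repeatedly apply" step above would simply call it until nothing changes.

# Sketch: factor A -> a b1 | a b2 | ... | g into A -> a A' | g, A' -> b1 | b2 | ...
from collections import defaultdict

def left_factor_once(head, alternatives):
    groups = defaultdict(list)
    for alt in alternatives:                 # group alternatives by first symbol
        groups[alt[0] if alt else None].append(alt)
    for key, group in groups.items():
        if key is None or len(group) < 2:
            continue                         # nothing to factor in this group
        prefix = group[0]                    # longest common prefix of the group
        for alt in group[1:]:
            n = 0
            while n < min(len(prefix), len(alt)) and prefix[n] == alt[n]:
                n += 1
            prefix = prefix[:n]
        new_head = head + "'"
        gamma = [alt for alt in alternatives if alt not in group]
        return {
            head: gamma + [prefix + [new_head]],             # A -> gamma | alpha A'
            new_head: [alt[len(prefix):] for alt in group],  # A' -> b1 | b2 | ... (maybe epsilon)
        }
    return {head: alternatives}              # already left-factored

# stmt -> if E then S else S | if E then S | other   becomes
# stmt -> other | if E then S stmt'   and   stmt' -> else S | epsilon
print(left_factor_once("stmt", [["if", "E", "then", "S", "else", "S"],
                                ["if", "E", "then", "S"], ["other"]]))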
Some syntactic constructs found in PLs cannot be specified using
grammars alone
◦ Example 1: Problem of checking that identifiers are declared before
they are used in the program
The abstract language is L1 = {wcw | w is in (a|b)*}
◦ Example 2: Problem of checking that number of formal parameters in
the declaration of a function agrees with the number of actual
parameters in a use of the function
The abstract language is L2 = {aⁿbᵐcⁿdᵐ | n ≥ 1 and m ≥ 1}
Non-Context Free Language Constructs
Constructing a parse tree for the input string, starting from the
root and creating nodes in preorder
Can be viewed as finding a leftmost derivation for an input
string
Consider the grammar below
E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
Sequence of parse trees for the input id+id*id
Top-Down Parsing
Determining the production to be applied for a nonterminal A is
the key problem at each step of a top-down parse
Once an A-production is chosen, the parsing process consists of
“matching” the terminal symbols in production with input string
Two types
◦ Recursive-descent parsing
May require backtracking to find the correct A-production to be applied
◦ Predictive Parsing
No backtracking is required
Chooses the correct A-production by looking ahead at the input a fixed
number of symbols
LL(k) Grammars: grammars for which we can construct predictive
parsers that look k symbols ahead in the input
Top-Down Parsing
The parsing program consists of a set of procedures, one for
each nonterminal
Execution begins with the procedure for the start symbol, which
halts and announces success if its procedure body scans the entire input
Recursive-Descent Parsing
To allow backtracking, the code needs to be modified
◦ Cannot choose a unique A-production at line (1), so must try each of
the several productions in some order
◦ Failure at line (7) is not ultimate failure, but tells that we need to return
to line (1) and try another A-production
◦ Only if there are no more A-productions to try, we declare that an input
error has been found
◦ To try another A-production, we need to be able to reset the input
pointer to where it was when we first reached line (1)
◦ A local variable is needed to store this input pointer
Recursive-Descent Parsing
Consider the grammar:
S → c A d
A → a b | a
And input string w = cad
Recursive-Descent Parsing
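A minimal backtracking recursive-descent recognizer for this grammar might look as follows; the sketch makes the saved input position explicit, which is exactly the reset point the previous slide calls for.

# Sketch: each procedure takes the input cursor and returns the new cursor
# on success or None on failure; retrying from 'saved' is the backtracking.
def parse(w):
    def match(pos, terminal):
        return pos + 1 if pos < len(w) and w[pos] == terminal else None

    def A(pos):
        saved = pos                          # remember where A started
        p = match(saved, "a")                # first try A -> a b
        if p is not None:
            q = match(p, "b")
            if q is not None:
                return q
        return match(saved, "a")             # backtrack: retry A -> a

    def S(pos):                              # S -> c A d
        p = match(pos, "c")
        if p is None:
            return None
        p = A(p)
        if p is None:
            return None
        return match(p, "d")

    end = S(0)
    return end is not None and end == len(w)

print(parse("cad"))    # True: A -> a b fails on 'd', so the parser retries A -> a
print(parse("cabd"))   # True: A -> a b succeeds with no backtracking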
During parsing, FIRST and FOLLOW allow us to choose which
production to apply, based on the next input symbol
FIRST(α)
◦ Set of terminals that begin strings derived from α. If α derives ε, then ε
is also in FIRST(α)
◦ Example: if A ==>* cγ, then c is in FIRST(A)
◦ Consider two A-productions A → α | β, where FIRST(α) and FIRST(β)
are disjoint sets
◦ Choose between these by looking at the next input symbol a, since a
can be in at most one of FIRST(α) and FIRST(β), not both
FIRST and FOLLOW
FOLLOW(A), for a nonterminal A
◦ Set of terminals a that can appear immediately to the right of A in some
sentential form
◦ That is, the set of terminals a such that there exists a derivation of the
form S ==>* αAaβ
◦ If A can be the rightmost symbol in some sentential form, then $ is in
FOLLOW(A)
FIRST and FOLLOW
To compute FIRST(X), apply the following rules until no more
terminals or ε can be added to any FIRST set
1. If X is a terminal, then FIRST(X) = { X }
2. If X is a nonterminal and X → Y1Y2 . . . Yk is a production for some k≥1
Place a in FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of
FIRST(Y1), . . . , FIRST(Yi-1); i.e., Y1 . . . Yi-1 ==>* ε
If ε is in FIRST(Yj) for all j = 1, 2, . . ., k, then add ε to FIRST(X)
If Y1 does not derive ε, then we add nothing more to FIRST(X), but if Y1 ==>* ε,
then we add FIRST(Y2), and so on
3. If X → ε is a production, then add ε to FIRST(X)
FIRST and FOLLOW
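These rules translate directly into a fixed-point computation. Below is a Python sketch over the nonrecursive expression grammar from earlier; the GRAMMAR/EPS encoding (bodies as tuples, () for an ε-body) is an assumption for illustration.

EPS = "eps"
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],            # () is an epsilon body
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}

def compute_first(grammar):
    first = {nt: set() for nt in grammar}

    def first_of(sym):                       # rule 1: FIRST of a terminal is itself
        return first[sym] if sym in grammar else {sym}

    changed = True
    while changed:                           # iterate until a fixed point
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                nullable = True
                for sym in body:             # rule 2: scan Y1 Y2 ... Yk
                    first[head] |= first_of(sym) - {EPS}
                    if EPS not in first_of(sym):
                        nullable = False
                        break
                if nullable:                 # rule 3, or every Yi derives epsilon
                    first[head].add(EPS)
                changed |= len(first[head]) != before
    return first

print(compute_first(GRAMMAR))                # e.g. FIRST(E) = {'(', 'id'}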
To compute FOLLOW(A) for all nonterminals A, apply the
following rules until nothing can be added to any FOLLOW set
1. Place $ in FOLLOW(S), where S is the start symbol, and $ is the input
right endmarker
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is
in FOLLOW(B)
3. If there is a production A → αB, or a production A → αBβ where
FIRST(β) contains ε, then everything in FOLLOW(A) is in
FOLLOW(B)
FIRST and FOLLOW
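A matching sketch for the FOLLOW computation, reusing compute_first, GRAMMAR, and EPS from the FIRST sketch above (the helper names are assumptions).

def compute_follow(grammar, start):
    first = compute_first(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                   # rule 1: $ follows the start symbol

    def first_of_string(symbols):            # FIRST of a string of grammar symbols
        out = set()
        for sym in symbols:
            f = first[sym] if sym in grammar else {sym}
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)                         # the whole string can derive epsilon
        return out

    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:   # only nonterminals have FOLLOW sets
                        continue
                    tail = first_of_string(body[i + 1:])
                    before = len(follow[sym])
                    follow[sym] |= tail - {EPS}       # rule 2
                    if EPS in tail:                   # rule 3
                        follow[sym] |= follow[head]
                    changed |= len(follow[sym]) != before
    return follow

print(compute_follow(GRAMMAR, "E"))          # e.g. FOLLOW(E) = {')', '$'}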
Predictive parsers needing no backtracking can be constructed for a
class of grammars called LL(1) grammars
◦ First “L” stands for scanning the input from left to right
◦ Second “L” stands for producing a leftmost derivation
◦ “1” for using one input symbol of lookahead at each step to make parsing
decisions
A grammar G is LL(1) if and only if whenever A → α | β are
two distinct productions, the following hold:
◦ For no terminal a do both α and β derive strings beginning with a
◦ At most one of α and β can derive the empty string
◦ If β ==>* ε, then α does not derive any string beginning with a terminal in
FOLLOW(A). Likewise, if α ==>* ε, then β does not derive any string
beginning with a terminal in FOLLOW(A)
LL(1) Grammars
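The three conditions can be checked pairwise with the FIRST and FOLLOW machinery sketched earlier; the checker below is an illustrative assumption, not an algorithm from the slides.

from itertools import combinations

def is_ll1(grammar, start):
    first = compute_first(grammar)
    follow = compute_follow(grammar, start)

    def first_of_string(symbols):
        out = set()
        for sym in symbols:
            f = first[sym] if sym in grammar else {sym}
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)
        return out

    for head, bodies in grammar.items():
        for alpha, beta in combinations(bodies, 2):
            fa, fb = first_of_string(alpha), first_of_string(beta)
            if (fa - {EPS}) & (fb - {EPS}):          # condition 1
                return False
            if EPS in fa and EPS in fb:              # condition 2
                return False
            if EPS in fb and fa & follow[head]:      # condition 3
                return False
            if EPS in fa and fb & follow[head]:      # condition 3, symmetric case
                return False
    return True

print(is_ll1(GRAMMAR, "E"))                  # True for the expression grammar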
Predictive parsers can be constructed for LL(1) grammars since
the proper production to apply can be selected by looking only
at the current input symbol
The next algorithm collects information from the FIRST and FOLLOW
sets
◦ Into a predictive parsing table M[A,a], where A is a nonterminal and a is
a terminal
◦ IDEA
Production A → α is chosen if the next input symbol a is in FIRST(α)
If α derives ε, we again choose A → α if the current input symbol is in FOLLOW(A),
or if the $ on the input has been reached and $ is in FOLLOW(A)
LL(1) Grammars
Algorithm: Construction of a Predictive Parsing Table
Input: Grammar G
Output: Parsing table M
Method: For each production A → α of the grammar, do the following:
1. For each terminal a in FIRST(α), add A → α to M[A,a]
2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add
A → α to M[A,b]. If ε is in FIRST(α) and $ is in FOLLOW(A), add
A → α to M[A,$] as well
If, after performing the above, there is no production at all in M[A,a],
then set M[A,a] to error
LL(1) Grammars
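A sketch of this algorithm, reusing the helpers from the earlier sketches; a duplicate table entry would mean the grammar is not LL(1), and missing entries stand for error.

def build_parsing_table(grammar, start):
    first = compute_first(grammar)
    follow = compute_follow(grammar, start)

    def first_of_string(symbols):            # FIRST of a production body
        out = set()
        for sym in symbols:
            f = first[sym] if sym in grammar else {sym}
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)
        return out

    table = {}
    for head, bodies in grammar.items():
        for body in bodies:
            fa = first_of_string(body)
            # step 1 plus step 2 ($ is already in FOLLOW when it applies)
            targets = (fa - {EPS}) | (follow[head] if EPS in fa else set())
            for a in targets:
                assert (head, a) not in table, "grammar is not LL(1)"
                table[(head, a)] = body
    return table

table = build_parsing_table(GRAMMAR, "E")
print(table[("E", "id")])                    # ('T', "E'")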
Built by maintaining a stack explicitly, rather than implicitly via
recursive calls
Parser mimics a leftmost derivation
◦ If w is the input that has been matched so far, then the stack holds a
sequence of grammar symbols α such that S ==>*lm wα
Nonrecursive Predictive Parsing
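The parser loop itself is then a short, explicit stack machine; the sketch below reuses build_parsing_table and GRAMMAR from the earlier sketches.

def predictive_parse(grammar, start, tokens):
    table = build_parsing_table(grammar, start)
    stack = ["$", start]                     # bottom marker, then start symbol
    tokens = list(tokens) + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top == tokens[i]:                 # terminal (or $) matches the input
            i += 1
        elif top in grammar:                 # nonterminal: consult the table
            body = table.get((top, tokens[i]))
            if body is None:
                raise SyntaxError(f"no production for ({top}, {tokens[i]})")
            stack.extend(reversed(body))     # push body, leftmost symbol on top
        else:
            raise SyntaxError(f"expected {top}, saw {tokens[i]}")
    return True

print(predictive_parse(GRAMMAR, "E", ["id", "+", "id", "*", "id"]))   # True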
Error recovery refers to the stack of a table-driven predictive
parser
◦ It makes explicit the terminals and nonterminals that the parser hopes to
match with the remainder of the input
An error is detected during predictive parsing when
◦ The terminal on top of the stack does not match the next input symbol
◦ Nonterminal A is on top of the stack, a is the next input symbol, and
M[A,a] is error, i.e., parsing table entry is empty
Error Recovery in Predictive Parsing
Panic Mode
◦ Based on the idea of skipping symbols on the input until a token in a
selected set of synchronizing tokens appears
◦ Some heuristics are:
Place all symbols in FOLLOW(A) into the synchronizing set for nonterminal
A. If we skip tokens until an element of FOLLOW(A) is seen and pop A
from the stack, parsing can continue
If we add symbols in FIRST(A) to the synchronizing set, then it may be
possible to resume parsing according to A if a symbol in FIRST(A) appears
in the input
If a nonterminal can generate the empty string, then the production deriving
ε can be used as default
If a terminal on top of the stack cannot be matched, then pop the terminal,
issue a message saying that the terminal was inserted and continue parsing
Error Recovery in Predictive Parsing
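A sketch of how these heuristics plug into the table-driven loop: on an empty table entry the parser reports an error, skips input until a token in FOLLOW(A) appears, and pops A; an unmatched terminal is reported as inserted. Helper names reuse the earlier sketches.

def parse_with_panic_mode(grammar, start, tokens):
    table = build_parsing_table(grammar, start)
    follow = compute_follow(grammar, start)
    stack, i, errors = ["$", start], 0, []
    tokens = list(tokens) + ["$"]
    while stack:
        top = stack.pop()
        if top == tokens[i]:
            i += 1
        elif top not in grammar:             # unmatched terminal: "insert" it
            errors.append(f"inserted missing '{top}'")
        elif (top, tokens[i]) in table:
            stack.extend(reversed(table[(top, tokens[i])]))
        else:                                # skip to a synchronizing token
            errors.append(f"error at '{tokens[i]}': skipping to FOLLOW({top})")
            while tokens[i] not in follow[top] and tokens[i] != "$":
                i += 1                       # 'top' stays popped, parsing resumes
    return errors

print(parse_with_panic_mode(GRAMMAR, "E", ["id", "+", "+", "id"]))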
Phrase-level Recovery
◦ Implemented by filling in the blank entries in the predictive parsing table
with pointers to error routines
routines
◦ These routines may change, insert, or delete symbols on the input and
issue appropriate error messages. They may also pop from the stack
◦ Alteration of stack symbols or pushing of new symbols onto the stack is
questionable for several reasons
Steps carried out by the parser might not correspond to the derivation of any
word in the language at all
Must ensure that there is no possibility of an infinite loop
◦ Checking that any recovery action results in an input symbol being
consumed is a good way to protect against such loops
Error Recovery in Predictive Parsing