The Role of a Parser
Context Free Grammars
Writing a Grammar
Parsing – Top Down and Bottom Up
Simple LR Parser
Powerful LR Parser
Using Ambiguous Grammars
Parser Generators
OBJECTIVES
Syntax
◦ The way in which words are strung together to form phrases,
clauses and sentences
Syntax Analysis
◦ The task concerned with checking whether a sequence of tokens conforms
to the specified grammar
Parsing
◦ To break a sentence down into its component parts with an
explanation of the form, function, and syntactical relationship
of each part
Syntax Analysis and Parsing
Every PL has rules that describe the syntactic structure of
well-formed programs
◦ In C, a program is made up of functions, declarations, statements, etc.
◦ Syntax can be specified using context-free grammars or BNF
Grammars offer benefits to both designers & compiler writers
◦ Gives a precise yet easy-to-understand syntactic specification of PLs
◦ Helps construct efficient parsers automatically
◦ Provides a good structure to the language which in turn helps generate
correct object code
◦ Allows a language to be evolved iteratively by adding new constructs
to perform new tasks
Syntax Analysis
The parser obtains a string of tokens from the lexical analyzer
Verifies that the string of token names can be generated by the
grammar
Constructs a parse tree and passes it to the rest of the compiler
Reports errors when the program does not match the syntactic structure
of the source language
The Role of a Parser
Universal Parsing Methods
◦ Can parse any grammar, but too inefficient to use in practical
compilers
◦ Example: Earley’s Algorithm
Top-Down Parsing
◦ Build parse trees from the root to the leaves
◦ Can be generated automatically or written manually
◦ Example : LL Parser
Bottom-Up Parsing
◦ Start from the leaves and work their way up to the root
◦ Can only be generated automatically
◦ Example : LR Parser
Parsing Methods
Some of the grammars used in the discussion
◦ LR grammar for bottom-up parsing
E → E + T | T
T → T * F | F
F → ( E ) | id
◦ Non-recursive variant of the grammar
E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
Representative Grammars
A compiler must locate and track down errors, but the error-handling
strategy is left to the compiler designer
Common PL errors occur at many different levels
◦ Lexical Errors
Misspellings of identifiers, keywords or operators
◦ Syntactic Errors
Misplaced semicolons or extra or missing braces, that is “{“ or “}”
◦ Semantic Errors
Type mismatches between operators and operands
◦ Logical Errors
Incorrect reasoning by the programmer or use of assignment operator (=)
instead of comparison operator (==)
Syntax Error Handling
Parsing methods detect errors as soon as possible, i.e., as soon as the
stream of tokens cannot be parsed further
Have the viable-prefix property
◦ Detect that an error has occurred as soon as they see a prefix of the input
that cannot be completed to form a string in the language
Errors appear syntactic and are exposed when parsing cannot
continue
Goals of error handler in a parser
◦ Report the presence of errors clearly and accurately
◦ Recover from each error quickly enough to detect subsequent errors
◦ Add minimal overhead to the processing of correct programs
Syntax Error Handling
How should the parser recover once an error is detected?
◦ Quit with an informative error message when it detects the first error
◦ Additional errors are uncovered if parser can restore to a state where
processing of the input can continue
◦ If errors pile up, it's better to stop after exceeding some error limit
Four error-recovery strategies
◦ Panic-Mode Recovery
◦ Phrase-Level Recovery
◦ Error Productions
◦ Global Correction
Error-Recovery Strategies
Panic-Mode Recovery
◦ The parser discards input symbols one at a time until one of a designated set of
synchronizing tokens is found
Synchronizing tokens – delimiters like semicolon or “}”
◦ Compiler designer must select the synchronizing tokens appropriate for the
source language
◦ Has the advantage of simplicity and is guaranteed not to go into an infinite loop
Phrase-Level Recovery
◦ Perform local correction on remaining input, that is, may replace a prefix
of the remaining input by some string that allows the parser to continue
◦ Choose replacements that do not lead to infinite loops
◦ Drawback is the difficulty of coping with situations in which the actual error
has occurred before the point of detection
Error-Recovery Strategies
Error Productions
◦ Anticipate the common errors that might be encountered
◦ Augment grammar for language with productions that generate erroneous
constructs
◦ Such a parser detects anticipated errors when an error production is used
during parsing
Global Correction
◦ Make as few changes as possible in processing an incorrect input string
◦ Use algorithms for choosing a minimal sequence of changes to obtain a
globally least-cost correction
◦ Given an incorrect input string x and grammar G, these algorithms find a
parse tree for a related string y, such that the number of changes required to
transform x to y is small
◦ Too costly to implement in terms of time and space
Error-Recovery Strategies
A context-free grammar G = (T, N, S, P) consists of:
◦ T, a set of terminals (scanner tokens, i.e., symbols that may not appear on the
left side of a rule).
◦ N, a set of nonterminals (syntactic variables generated by productions;
symbols that may appear on the left or right of a rule).
◦ S, a designated nonterminal, the start symbol.
◦ P, a set of productions (Rules). Each production consists of
A nonterminal called the left side of the production
The symbol →
A right side consisting of zero or more terminals and non-terminals
Context-Free Grammars – Formal Definition
Example grammar for simple arithmetic expressions
expression → expression + term
expression → expression – term
expression → term
term → term * factor
term → term / factor
term → factor
factor → ( expression )
factor → id
Context-Free Grammars – Formal Definition
Terminal symbols
◦ Lowercase letters early in the alphabet, like a, b, c
◦ Operator symbols such as +, *
◦ Punctuation symbols such as parentheses, and comma
◦ Digits 0, 1, . . . , 9
Nonterminal symbols
◦ Uppercase letters early in the alphabet like A,B,C
◦ The letter S, which, when it appears, stands for the start symbol
Lowercase letters late in the alphabet, like u, v, . . . , z, represent
strings of terminals
Uppercase letters late in the alphabet, such as X, Y, Z, represent
grammar symbols, that is, either terminals or nonterminals
Notational Conventions
Lowercase Greek letters α, β, γ represent strings of grammar
symbols
If A → α1, A → α2, . . . , A → αn are all productions with A
on the LHS (known as A-productions), then we may write
A → α1 | α2 | . . . | αn (α1, α2, . . . , αn are the alternatives of A)
Unless otherwise stated, the LHS of the first production is the
start nonterminal
Example: previous grammar written using these conventions
E → E + T | E – T | T
T → T * F | T / F | F
F → ( E ) | id
Notational Conventions
A way of showing how an input sentence is recognized with a
grammar
Beginning with the start symbol, each rewriting step replaces a
nonterminal by the body of one of its productions
Consider the following grammar
E → E + E | E * E | – E | ( E ) | id
◦ “E derives –E” can be denoted by E ==> – E
◦ A derivation of – ( id ) from E
E ==> – E ==> – ( E ) ==> – ( id )
Derivations
Symbols to indicate derivation
◦ ==> denotes “derives in one step”
◦ ==>* denotes “derives in zero or more steps”
◦ ==>+ denotes “derives in one or more steps”
Given the grammar G with start symbol S, a string ω is in the
language L(G) if and only if S ==>* ω
If S ==>* α and α contains
◦ Only terminals, then it is a sentence of G
◦ Both terminals and nonterminals, then it is a sentential form of G
Two choices to be made at each step in the derivation
◦ Choose which nonterminal to replace
◦ Pick a production with that nonterminal as head
Derivations
Leftmost Derivation
◦ The leftmost nonterminal in each sentential form is always chosen to be
replaced
◦ If α ==> β is a step in which the leftmost nonterminal in α is replaced, we
write α ==>lm β
Rightmost Derivation
◦ Replace the rightmost nonterminal in each sentential form
◦ Written as α ==>rm β
If S ==>*lm α, then we can say that α is a left-sentential form of the
grammar
Derivations
A graphical representation for a derivation, showing how to
derive a string of the language from the grammar, starting from
the start symbol
◦ Each interior node is labeled with the nonterminal in the head of the
production
◦ Children are labeled by the symbols in the RHS of the production
Yield of the tree
◦ The string derived or generated from the nonterminal at the root of the
tree
◦ Obtained by reading the leaves of the parse tree from left to right
Parse Trees and Derivations
A grammar that produces more than one parse tree for some
sentence is said to be ambiguous
◦ Equivalently, there is more than one leftmost derivation or more than one
rightmost derivation for the same sentence
Ambiguity
Fig: Two Parse trees for id+id*id
Every construct that can be described by a regular expression
can be described by a grammar, but not vice versa.
Constructing a grammar from an NFA
◦ For each state i of the NFA, create a nonterminal Ai
◦ If state i has a transition to state j on input a, add the production Ai → aAj
If state i goes to state j on input ε, add the production Ai → Aj
◦ If i is an accepting state, add Ai → ε
◦ If i is the start state, make Ai the start symbol of the grammar
CFG’s Versus Regular Expressions
Lexical versus Syntactic Analysis
◦ Why use REs to define the lexical syntax of a language?
Separating syntactic structure into lexical and non-lexical parts
modularizes a compiler into two components
Lexical rules are simple and easy to describe
RE’s provide a concise and easier-to-understand notation for tokens than
grammars
More efficient lexical analyzers can be constructed automatically from REs
than from arbitrary grammars
◦ RE’s are most useful for describing the structure of constructs such
as identifiers, constants, keywords and whitespace
◦ Grammars are useful for describing nested structures such as
balanced parentheses, matching begin-end’s, corresponding if-then-
else’s
Writing a Grammar
Ambiguous grammars can be rewritten to eliminate ambiguity
◦ Example : “Dangling-else” grammar
stmt → if expr then stmt
| if expr then stmt else stmt
| other
◦ Parse tree for the statement: if E1 then S1 else if E2 then S2 else S3
Eliminating Ambiguity
The grammar is ambiguous since the following string has two parse
trees
if E1 then if E2 then S1 else S2
Eliminating Ambiguity
Rewriting the “dangling-else” grammar
◦ The general rule is, “Match each else with the closest unmatched then”
◦ The idea is that a statement appearing between a then and an else must be
“matched”
◦ That is, each interior statement must not end with an unmatched or open
then
◦ A matched statement is either an if-then-else statement containing no open
statements or any other kind of unconditional statement
Eliminating Ambiguity
Left-recursive grammar
◦ A grammar that has a nonterminal A such that there is a derivation
A ==>+ Aα for some string α
◦ Top-down parsing methods cannot handle left-recursive grammars
◦ Immediate left recursion: a production of the form A → Aα
◦ The left-recursive pair of productions A → Aα | β can be replaced by:
A → β A′
A′ → α A′ | ε
◦ Eliminating immediate left recursion
First group the A-productions as
A → Aα1 | Aα2 | . . . | Aαm | β1 | β2 | . . . | βn
where no βi begins with an A
Replace the A-productions by
A → β1 A′ | β2 A′ | . . . | βn A′
A′ → α1 A′ | α2 A′ | . . . | αm A′ | ε
Eliminating Left Recursion
Algorithm to eliminate left recursion from a grammar (see the sketch below)
Eliminating Left Recursion
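Since the full algorithm is shown as a figure in the original slides, here is a minimal Python sketch of the immediate-left-recursion step it is built on; the encoding of alternatives as lists of symbols is an assumption for illustration.

# Sketch: A -> A a1 | ... | A am | b1 | ... | bn   becomes
#         A -> b1 A' | ... | bn A'   and   A' -> a1 A' | ... | am A' | epsilon
def eliminate_immediate_left_recursion(head, alternatives):
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == head]
    others = [alt for alt in alternatives if not (alt and alt[0] == head)]
    if not recursive:
        return {head: alternatives}          # no immediate left recursion
    new_head = head + "'"
    return {
        head: [alt + [new_head] for alt in others],
        new_head: [alt + [new_head] for alt in recursive] + [[]],  # [] is epsilon
    }

# E -> E + T | T   becomes   E -> T E'   and   E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))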
When the choice between two alternative A-productions is not clear,
◦ We rewrite the grammar to defer the decision until enough of the input has
been seen
◦ Consider two productions as below:
In general, if A → αβ1 | αβ2 are two A-productions
◦ We do not know whether to expand A to αβ1 or to αβ2
◦ Defer the decision by expanding A to αA′
◦ After seeing the input derived from α, we expand A′ to β1 or to β2
◦ Left-factored, the original productions become
A → α A′
A′ → β1 | β2
Left Factoring
Algorithm for left-factoring a grammar
◦ For each nonterminal A, find the longest prefix α common to two or more
alternatives
◦ If α ≠ ε, replace all of the A-productions A → αβ1 | αβ2 | . . . | αβn | γ,
where γ represents alternatives that do not begin with α, by
A → αA′ | γ
A′ → β1 | β2 | . . . | βn
◦ Repeatedly apply this transformation until no two alternatives for a
nonterminal have a common prefix
Left Factoring
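A small Python sketch of one round of this transformation is given below; the list-of-symbols encoding of alternatives is an assumption, and the "repeatedly apply" step above would simply call it until nothing changes.

# Sketch: factor A -> a b1 | a b2 | ... | g into A -> a A' | g, A' -> b1 | b2 | ...
from collections import defaultdict

def left_factor_once(head, alternatives):
    groups = defaultdict(list)
    for alt in alternatives:                 # group alternatives by first symbol
        groups[alt[0] if alt else None].append(alt)
    for key, group in groups.items():
        if key is None or len(group) < 2:
            continue                         # nothing to factor in this group
        prefix = group[0]                    # longest common prefix of the group
        for alt in group[1:]:
            n = 0
            while n < min(len(prefix), len(alt)) and prefix[n] == alt[n]:
                n += 1
            prefix = prefix[:n]
        new_head = head + "'"
        gamma = [alt for alt in alternatives if alt not in group]
        return {
            head: gamma + [prefix + [new_head]],             # A -> gamma | alpha A'
            new_head: [alt[len(prefix):] for alt in group],  # A' -> b1 | b2 | ... (maybe epsilon)
        }
    return {head: alternatives}              # already left-factored

# stmt -> if E then S else S | if E then S | other   becomes
# stmt -> other | if E then S stmt'   and   stmt' -> else S | epsilon
print(left_factor_once("stmt", [["if", "E", "then", "S", "else", "S"],
                                ["if", "E", "then", "S"], ["other"]]))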
Some syntactic constructs found in PLs cannot be specified using
grammars alone
◦ Example 1: Problem of checking that identifiers are declared before
they are used in the program
The abstract language is L1 = {wcw | w is in (a|b)*}
◦ Example 2: Problem of checking that number of formal parameters in
the declaration of a function agrees with the number of actual
parameters in a use of the function
The abstract language is L2 = {aⁿbᵐcⁿdᵐ | n ≥ 1 and m ≥ 1}
Non-Context Free Language Constructs
Constructing a parse tree for the input string, starting from the
root and creating nodes in preorder
Can be viewed as finding a leftmost derivation for an input
string
Consider the grammar below
E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
Sequence of parse trees for the input id+id*id
Top-Down Parsing
Determining the production to be applied for a nonterminal A is
the key problem at each step of a top-down parse
Once an A-production is chosen, the parsing process consists of
“matching” the terminal symbols in production with input string
Two types
◦ Recursive-descent parsing
May require backtracking to find the correct A-production to be applied
◦ Predictive Parsing
No backtracking is required
Chooses the correct A-production by looking ahead at the input a fixed
number of symbols
LL(k) Grammars: grammars for which we can construct predictive
parsers that look k symbols ahead in the input
Top-Down Parsing
The parsing program consists of a set of procedures, one for
each nonterminal
Execution begins with the procedure for the start symbol, which
halts and announces success if its procedure body scans the entire input
Recursive-Descent Parsing
To allow backtracking, the code needs to be modified
◦ Cannot choose a unique A-production at line (1), so must try each of
the several productions in some order
◦ Failure at line (7) is not ultimate failure, but tells that we need to return
to line (1) and try another A-production
◦ Only if there are no more A-productions to try, we declare that an input
error has been found
◦ To try another A-production, we need to be able to reset the input
pointer to where it was when we first reached line (1)
◦ A local variable is needed to store this input pointer
Recursive-Descent Parsing
Consider the grammar:
S → c A d
A → a b | a
And input string w = cad
Recursive-Descent Parsing
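A minimal backtracking recursive-descent recognizer for this grammar might look as follows; the sketch makes the saved input position explicit, which is exactly the reset point the previous slide calls for.

# Sketch: each procedure takes the input cursor and returns the new cursor
# on success or None on failure; retrying from 'saved' is the backtracking.
def parse(w):
    def match(pos, terminal):
        return pos + 1 if pos < len(w) and w[pos] == terminal else None

    def A(pos):
        saved = pos                          # remember where A started
        p = match(saved, "a")                # first try A -> a b
        if p is not None:
            q = match(p, "b")
            if q is not None:
                return q
        return match(saved, "a")             # backtrack: retry A -> a

    def S(pos):                              # S -> c A d
        p = match(pos, "c")
        if p is None:
            return None
        p = A(p)
        if p is None:
            return None
        return match(p, "d")

    end = S(0)
    return end is not None and end == len(w)

print(parse("cad"))    # True: A -> a b fails on 'd', so the parser retries A -> a
print(parse("cabd"))   # True: A -> a b succeeds with no backtracking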
During parsing, FIRST and FOLLOW allow us to choose which
production to apply, based on the next input symbol
FIRST(α)
◦ Set of terminals that begin strings derived from α. If α derives ε, then ε
is also in FIRST(α)
◦ Example: if A ==>* cγ, then c is in FIRST(A)
◦ Consider two A-productions A → α | β, where FIRST(α) and FIRST(β)
are disjoint sets
◦ Choose between these by looking at the next input symbol a, since a
can be in at most one of FIRST(α) and FIRST(β), not both
FIRST and FOLLOW
FOLLOW(A), for a nonterminal A
◦ Set of terminals a that can appear immediately to the right of A in some
sentential form
◦ That is, the set of terminals a such that there exists a derivation of the
form S ==>* αAaβ
◦ If A can be the rightmost symbol in some sentential form, then $ is in
FOLLOW(A)
FIRST and FOLLOW
To compute FIRST(X), apply the following rules until no more
terminals or ε can be added to any FIRST set
1. If X is a terminal, then FIRST(X) = { X }
2. If X is a nonterminal and X → Y1Y2 . . . Yk is a production for some k≥1
Place a in FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of
FIRST(Y1), . . . , FIRST(Yi-1); i.e., Y1 . . . Yi-1 ==>* ε
If ε is in FIRST(Yj) for all j = 1, 2, . . ., k, then add ε to FIRST(X)
If Y1 does not derive ε, then we add nothing more to FIRST(X), but if Y1 ==>* ε,
then we add FIRST(Y2), and so on
3. If X → ε is a production, then add ε to FIRST(X)
FIRST and FOLLOW
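These rules translate directly into a fixed-point computation. Below is a Python sketch over the nonrecursive expression grammar from earlier; the GRAMMAR/EPS encoding (bodies as tuples, () for an ε-body) is an assumption for illustration.

EPS = "eps"
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],            # () is an epsilon body
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("id",)],
}

def compute_first(grammar):
    first = {nt: set() for nt in grammar}

    def first_of(sym):                       # rule 1: FIRST of a terminal is itself
        return first[sym] if sym in grammar else {sym}

    changed = True
    while changed:                           # iterate until a fixed point
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                nullable = True
                for sym in body:             # rule 2: scan Y1 Y2 ... Yk
                    first[head] |= first_of(sym) - {EPS}
                    if EPS not in first_of(sym):
                        nullable = False
                        break
                if nullable:                 # rule 3, or every Yi derives epsilon
                    first[head].add(EPS)
                changed |= len(first[head]) != before
    return first

print(compute_first(GRAMMAR))                # e.g. FIRST(E) = {'(', 'id'}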
To compute FOLLOW(A) for all nonterminals A, apply the
following rules until nothing can be added to any FOLLOW set
1. Place $ in FOLLOW(S), where S is the start symbol, and $ is the input
right endmarker
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is
in FOLLOW(B)
3. If there is a production A → αB, or a production A → αBβ where
FIRST(β) contains ε, then everything in FOLLOW(A) is in
FOLLOW(B)
FIRST and FOLLOW
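A matching sketch for the FOLLOW computation, reusing compute_first, GRAMMAR, and EPS from the FIRST sketch above (the helper names are assumptions).

def compute_follow(grammar, start):
    first = compute_first(grammar)
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                   # rule 1: $ follows the start symbol

    def first_of_string(symbols):            # FIRST of a string of grammar symbols
        out = set()
        for sym in symbols:
            f = first[sym] if sym in grammar else {sym}
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)                         # the whole string can derive epsilon
        return out

    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:   # only nonterminals have FOLLOW sets
                        continue
                    tail = first_of_string(body[i + 1:])
                    before = len(follow[sym])
                    follow[sym] |= tail - {EPS}       # rule 2
                    if EPS in tail:                   # rule 3
                        follow[sym] |= follow[head]
                    changed |= len(follow[sym]) != before
    return follow

print(compute_follow(GRAMMAR, "E"))          # e.g. FOLLOW(E) = {')', '$'}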
Predictive parsers needing no backtracking can be constructed for a
class of grammars called LL(1) grammars
◦ First “L” stands for scanning the input from left to right
◦ Second “L” stands for producing a leftmost derivation
◦ “1” for using one input symbol of lookahead at each step to make parsing
decisions
A grammar G is LL(1) if and only if whenever A → α | β are
two distinct productions, the following hold:
◦ For no terminal a do both α and β derive strings beginning with a
◦ At most one of α and β can derive the empty string
◦ If β ==>* ε, then α does not derive any string beginning with a terminal in
FOLLOW(A). Likewise, if α ==>* ε, then β does not derive any string
beginning with a terminal in FOLLOW(A)
LL(1) Grammars
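The three conditions can be checked pairwise with the FIRST and FOLLOW machinery sketched earlier; the checker below is an illustrative assumption, not an algorithm from the slides.

from itertools import combinations

def is_ll1(grammar, start):
    first = compute_first(grammar)
    follow = compute_follow(grammar, start)

    def first_of_string(symbols):
        out = set()
        for sym in symbols:
            f = first[sym] if sym in grammar else {sym}
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)
        return out

    for head, bodies in grammar.items():
        for alpha, beta in combinations(bodies, 2):
            fa, fb = first_of_string(alpha), first_of_string(beta)
            if (fa - {EPS}) & (fb - {EPS}):          # condition 1
                return False
            if EPS in fa and EPS in fb:              # condition 2
                return False
            if EPS in fb and fa & follow[head]:      # condition 3
                return False
            if EPS in fa and fb & follow[head]:      # condition 3, symmetric case
                return False
    return True

print(is_ll1(GRAMMAR, "E"))                  # True for the expression grammar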
Predictive parsers can be constructed for LL(1) grammars since
the proper production to apply can be selected by looking only
at the current input symbol
The next algorithm collects information from the FIRST and FOLLOW
sets
◦ Into a predictive parsing table M[A,a], where A is a nonterminal and a is
a terminal
◦ IDEA
Production A → α is chosen if the next input symbol a is in FIRST(α)
If α derives ε, we again choose A → α if the current input symbol is in FOLLOW(A),
or if the $ on the input has been reached and $ is in FOLLOW(A)
LL(1) Grammars
Algorithm: Construction of a Predictive Parsing Table
Input: Grammar G
Output: Parsing table M
Method: For each production A → α of the grammar, do the following:
1. For each terminal a in FIRST(α), add A → α to M[A,a]
2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add
A → α to M[A,b]. If ε is in FIRST(α) and $ is in FOLLOW(A), add
A → α to M[A,$] as well
If, after performing the above, there is no production at all in M[A,a],
then set M[A,a] to error
LL(1) Grammars
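A sketch of this algorithm, reusing the helpers from the earlier sketches; a duplicate table entry would mean the grammar is not LL(1), and missing entries stand for error.

def build_parsing_table(grammar, start):
    first = compute_first(grammar)
    follow = compute_follow(grammar, start)

    def first_of_string(symbols):            # FIRST of a production body
        out = set()
        for sym in symbols:
            f = first[sym] if sym in grammar else {sym}
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)
        return out

    table = {}
    for head, bodies in grammar.items():
        for body in bodies:
            fa = first_of_string(body)
            # step 1 plus step 2 ($ is already in FOLLOW when it applies)
            targets = (fa - {EPS}) | (follow[head] if EPS in fa else set())
            for a in targets:
                assert (head, a) not in table, "grammar is not LL(1)"
                table[(head, a)] = body
    return table

table = build_parsing_table(GRAMMAR, "E")
print(table[("E", "id")])                    # ('T', "E'")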
Built by maintaining a stack explicitly, rather than implicitly via
recursive calls
Parser mimics a leftmost derivation
◦ If w is the input that has been matched so far, then the stack holds a
sequence of grammar symbols α such that S ==>*lm wα
Nonrecursive Predictive Parsing
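The parser loop itself is then a short, explicit stack machine; the sketch below reuses build_parsing_table and GRAMMAR from the earlier sketches.

def predictive_parse(grammar, start, tokens):
    table = build_parsing_table(grammar, start)
    stack = ["$", start]                     # bottom marker, then start symbol
    tokens = list(tokens) + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top == tokens[i]:                 # terminal (or $) matches the input
            i += 1
        elif top in grammar:                 # nonterminal: consult the table
            body = table.get((top, tokens[i]))
            if body is None:
                raise SyntaxError(f"no production for ({top}, {tokens[i]})")
            stack.extend(reversed(body))     # push body, leftmost symbol on top
        else:
            raise SyntaxError(f"expected {top}, saw {tokens[i]}")
    return True

print(predictive_parse(GRAMMAR, "E", ["id", "+", "id", "*", "id"]))   # True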
Error recovery refers to the stack of a table-driven predictive
parser
◦ It makes explicit the terminals and nonterminals that the parser hopes to
match with the remainder of the input
An error is detected during predictive parsing when
◦ The terminal on top of the stack does not match the next input symbol
◦ Nonterminal A is on top of the stack, a is the next input symbol, and
M[A,a] is error, i.e., parsing table entry is empty
Error Recovery in Predictive Parsing
Panic Mode
◦ Based on the idea of skipping symbols on the input until a token in a
selected set of synchronizing tokens appears
◦ Some heuristics are:
Place all symbols in FOLLOW(A) into the synchronizing set for nonterminal
A. If we skip tokens until an element of FOLLOW(A) is seen and pop A
from the stack, parsing can continue
If we add symbols in FIRST(A) to the synchronizing set, then it may be
possible to resume parsing according to A if a symbol in FIRST(A) appears
in the input
If a nonterminal can generate the empty string, then the production deriving
ε can be used as default
If a terminal on top of the stack cannot be matched, then pop the terminal,
issue a message saying that the terminal was inserted and continue parsing
Error Recovery in Predictive Parsing
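A sketch of how these heuristics plug into the table-driven loop: on an empty table entry the parser reports an error, skips input until a token in FOLLOW(A) appears, and pops A; an unmatched terminal is reported as inserted. Helper names reuse the earlier sketches.

def parse_with_panic_mode(grammar, start, tokens):
    table = build_parsing_table(grammar, start)
    follow = compute_follow(grammar, start)
    stack, i, errors = ["$", start], 0, []
    tokens = list(tokens) + ["$"]
    while stack:
        top = stack.pop()
        if top == tokens[i]:
            i += 1
        elif top not in grammar:             # unmatched terminal: "insert" it
            errors.append(f"inserted missing '{top}'")
        elif (top, tokens[i]) in table:
            stack.extend(reversed(table[(top, tokens[i])]))
        else:                                # skip to a synchronizing token
            errors.append(f"error at '{tokens[i]}': skipping to FOLLOW({top})")
            while tokens[i] not in follow[top] and tokens[i] != "$":
                i += 1                       # 'top' stays popped, parsing resumes
    return errors

print(parse_with_panic_mode(GRAMMAR, "E", ["id", "+", "+", "id"]))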
Phrase-level Recovery
◦ Implemented by filling in the blank entries in the predictive parsing table
with pointers to error routines
routines
◦ These routines may change, insert, or delete symbols on the input and
issue appropriate error messages. They may also pop from the stack
◦ Alteration of stack symbols or pushing of new symbols onto the stack is
questionable for several reasons
Steps carried out by the parser might not correspond to the derivation of any
word in the language at all
Must ensure that there is no possibility of an infinite loop
◦ Checking that any recovery action results in an input symbol being
consumed is a good way to protect against such loops
Error Recovery in Predictive Parsing