Compiler design.ppt
1. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 1
Introduction to Parsing
Lecture 8
Adapted from slides by G. Necula and R. Bodik
2. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 2
Outline
• Limitations of regular languages
• Parser overview
• Context-free grammars (CFGs)
• Derivations
• Syntax-Directed Translation
3. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 3
Languages and Automata
• Formal languages are very important in CS
– Especially in programming languages
• Regular languages
– The weakest formal languages widely used
– Many applications
• We will also study context-free languages
4. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 4
Limitations of Regular Languages
• Intuition: A finite automaton that runs long
enough must repeat states
• Finite automaton can’t remember # of times it
has visited a particular state
• Finite automaton has finite memory
– Only enough to store in which state it is
– Cannot count, except up to a finite limit
• E.g., language of balanced parentheses is not
regular: { (^i )^i | i ≥ 0 }
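The counting argument can be made concrete with a small sketch. The function name is_nested_parens is hypothetical; the point is only that recognizing { (^i )^i | i ≥ 0 } needs a counter that can grow without bound, which a finite automaton does not have.

```python
def is_nested_parens(s: str) -> bool:
    """Return True if s has the form '(' * i + ')' * i for some i >= 0."""
    i = 0
    n = len(s)
    # Count the leading '(' characters; this counter has no fixed upper limit.
    while i < n and s[i] == "(":
        i += 1
    # The remainder must consist of exactly i ')' characters.
    return s[i:] == ")" * i


print(is_nested_parens("((()))"))  # True
print(is_nested_parens("(()"))     # False
```

A finite automaton could play the role of the counter i only up to some fixed bound, so it would fail on inputs nested more deeply than that bound.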
5. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 5
The Structure of a Compiler
Source → Lexical analysis → Tokens → Parsing → Interm. Language → Code Gen. → Machine Code
Optimization works on the intermediate language.
Today we start: Parsing
6. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 6
The Functionality of the Parser
• Input: sequence of tokens from lexer
• Output: abstract syntax tree of the program
7. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 7
Example
• Pyth: if x == y: z = 1
else: z = 2
• Parser input: IF ID == ID : ID = INT ELSE : ID = INT
• Parser output (abstract syntax tree):
IF-THEN-ELSE
  ==
    ID
    ID
  =
    ID
    INT
  =
    ID
    INT
8. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 8
Why A Tree?
• Each stage of the compiler has two purposes:
– Detect and filter out some class of errors
– Compute some new information or translate the
representation of the program to make things
easier for later stages
• Recursive structure of tree suits recursive
structure of language definition
• With tree, later stages can easily find “the
else clause”, e.g., rather than having to scan
through tokens to find it.
9. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 9
Comparison with Lexical Analysis
Phase    Input                     Output
Lexer    Sequence of characters    Sequence of tokens
Parser   Sequence of tokens        Syntax tree
10. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 10
The Role of the Parser
• Not all sequences of tokens are programs . . .
• . . . Parser must distinguish between valid and
invalid sequences of tokens
• We need
– A language for describing valid sequences of tokens
– A method for distinguishing valid from invalid
sequences of tokens
11. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 11
Programming Language Structure
• Programming languages have recursive structure
• Consider the language of arithmetic expressions
with integers, +, *, and ( )
• An expression is either:
– an integer
– an expression followed by “+” followed by an expression
– an expression followed by “*” followed by an expression
– a ‘(‘ followed by an expression followed by ‘)’
• int , int + int , ( int + int) * int are expressions
12. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 12
Notation for Programming Languages
• An alternative notation:
E → int
E → E + E
E → E * E
E → ( E )
• We can view these rules as rewrite rules
– We start with E and replace occurrences of E with
some right-hand side
• E ⇒ E * E ⇒ ( E ) * E ⇒ ( E + E ) * E ⇒ … ⇒
(int + int) * int
13. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 13
Observation
• All arithmetic expressions can be obtained by
a sequence of replacements
• Any sequence of replacements that ends with only
terminals forms a valid arithmetic expression
• This means that we cannot obtain
( int ) )
by any sequence of replacements. Why?
• This set of rules is a context-free grammar
14. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 14
Context-Free Grammars
• A CFG consists of
– A set of non-terminals N
• By convention, written with capital letters in these notes
– A set of terminals T
• By convention, either lower case names or punctuation
– A start symbol S (a non-terminal)
– A set of productions
• Assuming E ∈ N
E → ε , or
E → Y1 Y2 ... Yn where Yi ∈ N ∪ T
15. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 15
Examples of CFGs
Simple arithmetic expressions:
E → int
E → E + E
E → E * E
E → ( E )
– One non-terminal: E
– Several terminals: int, +, *, (, )
• Called terminals because they are never replaced
– By convention the non-terminal for the first
production is the start one
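As an illustration, the four components of this grammar can be written down as plain Python data. The names GRAMMAR and START and the dict-of-tuples encoding are assumptions made for this sketch, not part of the lecture.

```python
# Productions of the arithmetic grammar: each non-terminal maps to a list of
# right-hand sides, and each right-hand side is a tuple of symbols.
GRAMMAR = {
    "E": [("int",), ("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")")],
}
START = "E"  # by convention, the non-terminal of the first production

# Non-terminals are the symbols that have productions.
nonterminals = set(GRAMMAR)
# Terminals are all other symbols that occur on a right-hand side.
terminals = {
    sym for rhss in GRAMMAR.values() for rhs in rhss for sym in rhs
} - nonterminals

print(sorted(nonterminals))  # ['E']
print(sorted(terminals))     # ['(', ')', '*', '+', 'int']
```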
16. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 16
The Language of a CFG
Read productions as replacement rules:
X → Y1 ... Yn
Means X can be replaced by Y1 ... Yn
X → ε
Means X can be erased (replaced with empty string)
17. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 17
Key Idea
1. Begin with a string consisting of the start
symbol “S”
2. Replace any non-terminal X in the string by a
right-hand side of some production
X → Y1 … Yn
3. Repeat (2) until there are only terminals in
the string
4. The successive strings created in this way
are called sentential forms.
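The steps above can be sketched in a few lines of Python. The helper step and the list choices are hypothetical names; the sketch replays one particular sequence of replacements for the arithmetic grammar and records every sentential form along the way.

```python
GRAMMAR = {"E": [("int",), ("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")")]}


def step(form, index, rhs):
    """Replace the non-terminal at position index by the right-hand side rhs."""
    symbol = form[index]
    if rhs not in GRAMMAR.get(symbol, []):
        raise ValueError(f"no production {symbol} -> {' '.join(rhs)}")
    return form[:index] + rhs + form[index + 1:]


form = ("E",)  # step 1: begin with the start symbol
# Steps 2 and 3: (position to rewrite, right-hand side to use), chosen by hand.
choices = [
    (0, ("E", "*", "E")),
    (0, ("(", "E", ")")),
    (1, ("E", "+", "E")),
    (1, ("int",)),
    (3, ("int",)),
    (6, ("int",)),
]
forms = [form]
for index, rhs in choices:
    form = step(form, index, rhs)
    forms.append(form)

# Step 4: print the successive sentential forms.
for f in forms:
    print(" ".join(f))
```

The last line printed is ( int + int ) * int, which contains only terminals, so the process stops there.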
18. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 18
The Language of a CFG (Cont.)
More formally, may write
X1 … Xi-1 Xi Xi+1 … Xn ⇒ X1 … Xi-1 Y1 … Ym Xi+1 … Xn
if there is a production
Xi → Y1 … Ym
19. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 19
The Language of a CFG (Cont.)
Write
X1 … Xn ⇒* Y1 … Ym
if
X1 … Xn ⇒ … ⇒ … ⇒ Y1 … Ym
in 0 or more steps
20. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 20
The Language of a CFG
Let G be a context-free grammar with start
symbol S. Then the language of G is:
L(G) = { a1 … an | S ⇒* a1 … an and every ai
is a terminal }
21. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 21
Examples:
• S → 0 also written as S → 0 | 1
S → 1
Generates the language { “0”, “1” }
• What about S → 1 A
A → 0 | 1
• What about S → 1 A
A → 0 | 1 A
• What about S → ε | ( S )
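One way to explore these example grammars is to enumerate the short strings they generate. The sketch below is a breadth-first search over sentential forms, always expanding the leftmost non-terminal. The function name language, the grammar names G1 to G3 and the length bound are all assumptions of this sketch, and it relies on every non-empty right-hand side containing at least one terminal (true for the three grammars here) so that the search terminates.

```python
from collections import deque


def language(grammar, start, max_len):
    """Return the strings of L(G) with at most max_len terminals, shortest first."""
    results = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Find the leftmost non-terminal, if any.
        pos = next((i for i, s in enumerate(form) if s in grammar), None)
        if pos is None:
            results.add("".join(form))  # only terminals left: a string of L(G)
            continue
        for rhs in grammar[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            n_terminals = sum(1 for s in new if s not in grammar)
            # Prune forms that already contain too many terminals.
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(results, key=lambda w: (len(w), w))


G1 = {"S": [("1", "A")], "A": [("0",), ("1",)]}
G2 = {"S": [("1", "A")], "A": [("0",), ("1", "A")]}
G3 = {"S": [(), ("(", "S", ")")]}  # () stands for the empty right-hand side

print(language(G1, "S", 4))  # ['10', '11']
print(language(G2, "S", 4))  # ['10', '110', '1110']
print(language(G3, "S", 4))  # ['', '()', '(())']
```

The output suggests the general pattern: the second grammar generates one or more 1s followed by a single 0, and the third generates i opening parentheses followed by i closing ones.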
22. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 22
Pyth Example
A fragment of Pyth:
Compound → while Expr: Block
         | if Expr: Block Elses
Elses → ε | else: Block | elif Expr: Block Elses
Block → Stmt_List | Suite
(Formal language papers use one-character non-terminals,
but we don’t have to!)
23. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 23
Derivations and Parse Trees
• A derivation is a sequence of sentential forms
resulting from the application of a sequence of
productions
S ⇒ … ⇒ …
• A derivation can be represented as a parse tree
– Start symbol is the tree’s root
– For a production X → Y1 … Yn add children
Y1, …, Yn to node X
24. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 24
Derivation Example
• Grammar
E → E + E | E * E | (E) | int
• String
int * int + int
25. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 25
Derivation Example (Cont.)
E
⇒ E + E
⇒ E * E + E
⇒ int * E + E
⇒ int * int + E
⇒ int * int + int
28. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 28
Derivation in Detail (3)
Parse tree so far:
E
  E
    E
    *
    E
  +
  E
Derivation so far:
E
⇒ E + E
⇒ E * E + E
29. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 29
Derivation in Detail (4)
Parse tree so far:
E
  E
    E
      int
    *
    E
  +
  E
Derivation so far:
E
⇒ E + E
⇒ E * E + E
⇒ int * E + E
30. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 30
Derivation in Detail (5)
Parse tree so far:
E
  E
    E
      int
    *
    E
      int
  +
  E
Derivation so far:
E
⇒ E + E
⇒ E * E + E
⇒ int * E + E
⇒ int * int + E
31. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 31
Derivation in Detail (6)
Parse tree:
E
  E
    E
      int
    *
    E
      int
  +
  E
    int
Derivation:
E
⇒ E + E
⇒ E * E + E
⇒ int * E + E
⇒ int * int + E
⇒ int * int + int
32. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 32
Notes on Derivations
• A parse tree has
– Terminals at the leaves
– Non-terminals at the interior nodes
• A left-right traversal of the leaves is the
original input
• The parse tree shows the association of
operations, the input string does not !
– There may be multiple ways to match the input
– Derivations (and parse trees) choose one
33. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 33
The Payoff: parser as a translator
syntax-directed translation
stream of tokens → parser → ASTs, or assembly code
The parser is driven by syntax + translation rules
(typically hardcoded in the parser)
34. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 34
Mechanism of syntax-directed translation
• syntax-directed translation is done by
extending the CFG
– a translation rule is defined for each production
given
X → d A B c
the translation of X is defined recursively using
• translation of nonterminals A, B
• values of attributes of terminals d, c
• constants
35. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 35
To translate an input string:
1. Build the parse tree.
2. Working bottom-up
• Use the translation rules to compute the translation of each
nonterminal in the tree
Result: the translation of the string is the translation of the parse
tree's root nonterminal.
Why bottom up?
• a nonterminal's value may depend on the value of the symbols
on the right-hand side,
• so translate a non-terminal node only after children
translations are available.
36. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 36
Example 1: Arithmetic expression to value
Syntax-directed translation rules:
E → E + T      E1.trans = E2.trans + T.trans
E → T          E.trans = T.trans
T → T * F      T1.trans = T2.trans * F.trans
T → F          T.trans = F.trans
F → int        F.trans = int.value
F → ( E )      F.trans = E.trans
37. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 37
Example 1: Bison/Yacc Notation
E : E '+' T { $$ = $1 + $3; }
T : T '*' F { $$ = $1 * $3; }
F : int { $$ = $1; }
F : '(' E ')' { $$ = $2; }
• KEY: $$ : Semantic value of left-hand side
$n : Semantic value of nth symbol on
right-hand side
38. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 38
Example 1 (cont): Annotated Parse Tree
Input: 2 * (4 + 5)
E (18)
  T (18)
    T (2)
      F (2)
        int (2)
    *
    F (9)
      (
      E (9)
        E (4)
          T (4)
            F (4)
              int (4)
        +
        T (5)
          F (5)
            int (5)
      )
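The annotated values can be reproduced with a short bottom-up pass. In this sketch a parse tree node is a tuple (symbol, child, ...), and the terminal int carries its value as ("int", value); this encoding and the function name trans are assumptions of the sketch. The parse tree is written out by hand rather than produced by a parser.

```python
def trans(node):
    """Compute the translation (here: the value) of a parse tree node."""
    sym, kids = node[0], node[1:]
    if sym == "int":                 # terminal attribute: int.value
        return kids[0]
    if len(kids) == 1:               # E -> T, T -> F, F -> int
        return trans(kids[0])
    if sym == "E":                   # E -> E + T
        return trans(kids[0]) + trans(kids[2])
    if sym == "T":                   # T -> T * F
        return trans(kids[0]) * trans(kids[2])
    if sym == "F":                   # F -> ( E )
        return trans(kids[1])
    raise ValueError(f"unexpected node {sym}")


# Hand-built parse tree for 2 * (4 + 5)
two = ("T", ("F", ("int", 2)))
four = ("E", ("T", ("F", ("int", 4))))
five = ("T", ("F", ("int", 5)))
inner = ("E", four, ("+",), five)            # E -> E + T
paren = ("F", ("(",), inner, (")",))         # F -> ( E )
tree = ("E", ("T", two, ("*",), paren))      # E -> T, T -> T * F

print(trans(inner))  # 9
print(trans(tree))   # 18
```

Because trans calls itself on the children before combining their results, each non-terminal is translated only after its children's translations are available.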
39. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 39
Example 2: Compute the type of an expression
E -> E + E      if $1 == INT and $3 == INT:
                    $$ = INT
                else: $$ = ERROR
E -> E and E    if $1 == BOOL and $3 == BOOL:
                    $$ = BOOL
                else: $$ = ERROR
E -> E == E     if $1 == $3 and $1 != ERROR:
                    $$ = BOOL
                else: $$ = ERROR
E -> true       $$ = BOOL
E -> false      $$ = BOOL
E -> int        $$ = INT
E -> ( E )      $$ = $2
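A minimal Python sketch of these typing rules follows. Expressions are encoded as tuples whose first element names the production (for example ("+", left, right) or ("paren", e)); that encoding and the function name type_of are assumptions of this sketch. For E == E the sketch requires the operand types to be equal and not ERROR.

```python
INT, BOOL, ERROR = "INT", "BOOL", "ERROR"


def type_of(e):
    """Compute the type of an expression bottom-up."""
    op = e[0]
    if op == "int":                      # E -> int
        return INT
    if op in ("true", "false"):          # E -> true, E -> false
        return BOOL
    if op == "paren":                    # E -> ( E )
        return type_of(e[1])
    left, right = type_of(e[1]), type_of(e[2])
    if op == "+":                        # E -> E + E
        return INT if left == INT and right == INT else ERROR
    if op == "and":                      # E -> E and E
        return BOOL if left == BOOL and right == BOOL else ERROR
    if op == "==":                       # E -> E == E
        return BOOL if left == right and left != ERROR else ERROR
    raise ValueError(f"unknown operator {op}")


# (2 + 2) == 4
expr = ("==", ("paren", ("+", ("int", 2), ("int", 2))), ("int", 4))
print(type_of(expr))                           # BOOL
print(type_of(("+", ("true",), ("int", 1))))   # ERROR
```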
40. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 40
Example 2 (cont)
• Input: (2 + 2) == 4
E (BOOL)
  E (INT)
    (
    E (INT)
      E (INT)
        int (INT)
      +
      E (INT)
        int (INT)
    )
  ==
  E (INT)
    int (INT)
41. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 41
Building Abstract Syntax Trees
• Examples so far, streams of tokens translated
into
– integer values, or
– types
• Translating into ASTs is not very different
42. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 42
AST vs. Parse Tree
• AST is condensed form of a parse tree
– operators appear at internal nodes, not at leaves.
– "Chains" of single productions are collapsed.
– Lists are "flattened".
– Syntactic details are omitted
• e.g., parentheses, commas, semi-colons
• AST is a better structure for later compiler
stages
– omits details having to do with the source language,
– only contains information about the essential
structure of the program.
43. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 43
Example: 2 * (4 + 5) Parse tree vs. AST
Parse tree:
E
  T
    T
      F
        int (2)
    *
    F
      (
      E
        E
          T
            F
              int (4)
        +
        T
          F
            int (5)
      )
AST:
*
  2
  +
    4
    5
44. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 44
AST-building translation rules
E → E + T      $$ = new PlusNode($1, $3)
E → T          $$ = $1
T → T * F      $$ = new TimesNode($1, $3)
T → F          $$ = $1
F → int        $$ = new IntLitNode($1)
F → ( E )      $$ = $2
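The same rules can be sketched in Python. The node classes reuse the names PlusNode, TimesNode and IntLitNode from the rules above; their fields, the tuple encoding of the parse tree and the function build are assumptions of this sketch. The parse tree for 2 * (4 + 5) is written out by hand.

```python
from dataclasses import dataclass


@dataclass
class IntLitNode:
    value: int


@dataclass
class PlusNode:
    left: object
    right: object


@dataclass
class TimesNode:
    left: object
    right: object


def build(node):
    """Return the AST for a parse tree node, one case per production."""
    sym, kids = node[0], node[1:]
    if sym == "F" and kids[0][0] == "int":   # F -> int
        return IntLitNode(kids[0][1])
    if sym == "F":                           # F -> ( E )
        return build(kids[1])
    if len(kids) == 1:                       # E -> T, T -> F
        return build(kids[0])
    if sym == "E":                           # E -> E + T
        return PlusNode(build(kids[0]), build(kids[2]))
    if sym == "T":                           # T -> T * F
        return TimesNode(build(kids[0]), build(kids[2]))
    raise ValueError(f"unexpected node {sym}")


# Hand-built parse tree for 2 * (4 + 5)
two = ("T", ("F", ("int", 2)))
four = ("E", ("T", ("F", ("int", 4))))
five = ("T", ("F", ("int", 5)))
inner = ("E", four, ("+",), five)
paren = ("F", ("(",), inner, (")",))
tree = ("E", ("T", two, ("*",), paren))

ast = build(tree)
print(ast)
```

The chains E → T → F and the parentheses leave no trace in the result: the AST has only one TimesNode, one PlusNode and three IntLitNode objects.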
45. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 45
Example: 2 * (4 + 5): Steps in Creating AST
[Figure: the parse tree for 2 * (4 + 5), with the AST built so far
attached to some nodes as their semantic value: 2, then + (4, 5),
then * (2, + (4, 5)) at the root.]
(Only some of the semantic values are shown)
46. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 46
Leftmost and Rightmost Derivations
Leftmost derivation:
always act on leftmost
non-terminal
E
⇒ E + E
⇒ E * E + E
⇒ int * E + E
⇒ int * int + E
⇒ int * int + int
Rightmost derivation:
always act on rightmost
non-terminal
E
⇒ E + E
⇒ E + int
⇒ E * E + int
⇒ E * int + int
⇒ int * int + int
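Both derivations can be read off a single parse tree by choosing which non-terminal to expand next. In this sketch a node is a (symbol, children) pair, and the names node, int_expr and derivation are hypothetical.

```python
def node(sym, *children):
    return (sym, list(children))


def int_expr():
    """Parse tree for the production E -> int."""
    return node("E", node("int"))


def derivation(tree, leftmost=True):
    """List the sentential forms obtained by expanding the tree's nodes."""
    form = [tree]                        # a sentential form, as a list of nodes
    steps = [" ".join(n[0] for n in form)]
    while True:
        # Positions of nodes that still have children to expand.
        pending = [i for i, n in enumerate(form) if n[1]]
        if not pending:
            return steps
        i = pending[0] if leftmost else pending[-1]
        form[i:i + 1] = form[i][1]       # replace the node by its children
        steps.append(" ".join(n[0] for n in form))


# Parse tree for int * int + int, grouped as (int * int) + int
tree = node("E",
            node("E", int_expr(), node("*"), int_expr()),
            node("+"),
            int_expr())

print(" => ".join(derivation(tree, leftmost=True)))
print(" => ".join(derivation(tree, leftmost=False)))
```

The two runs print different sequences of sentential forms but start from the same tree and end in the same string.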
48. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 48
Rightmost Derivation in Detail (2)
Parse tree so far:
E
  E
  +
  E
Derivation so far:
E
⇒ E + E
49. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 49
Rightmost Derivation in Detail (3)
Parse tree so far:
E
  E
  +
  E
    int
Derivation so far:
E
⇒ E + E
⇒ E + int
50. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 50
Rightmost Derivation in Detail (4)
Parse tree so far:
E
  E
    E
    *
    E
  +
  E
    int
Derivation so far:
E
⇒ E + E
⇒ E + int
⇒ E * E + int
51. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 51
Rightmost Derivation in Detail (5)
Parse tree so far:
E
  E
    E
    *
    E
      int
  +
  E
    int
Derivation so far:
E
⇒ E + E
⇒ E + int
⇒ E * E + int
⇒ E * int + int
52. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 52
Rightmost Derivation in Detail (6)
Parse tree:
E
  E
    E
      int
    *
    E
      int
  +
  E
    int
Derivation:
E
⇒ E + E
⇒ E + int
⇒ E * E + int
⇒ E * int + int
⇒ int * int + int
53. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 53
Aside: Canonical Derivations
• Take a look at that last derivation in reverse.
• The active part (red) tends to move left to
right.
• We call this a reverse rightmost or canonical
derivation.
• Comes up in bottom-up parsing. We’ll return
to it in a couple of lectures.
54. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 54
Derivations and Parse Trees
• For each parse tree there is exactly one
leftmost and one rightmost derivation
• The difference is the order in which branches
are added, not the structure of the tree.
55. 2/8/2008 Prof. Hilfinger CS164 Lecture 8 55
Summary of Derivations
• We are not just interested in whether
s ∈ L(G)
• Also need derivation (or parse tree) and AST.
• Parse trees slavishly reflect the grammar.
• Abstract syntax trees abstract from the grammar,
cutting out detail that interferes with later stages.
• A derivation defines a parse tree
– But one parse tree may have many derivations
• Derivations drive translation (to ASTs, etc.)
• Leftmost and rightmost derivations most important in
parser implementation