4 lexical and syntax analysis


Published on

Published in: Technology
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

4 lexical and syntax analysis

  1. 1. ICS 313 - Fundamentals of Programming Languages 14. Lexical and Syntax Analysis4.1 IntroductionLanguage implementation systems must analyze source code,regardless of the specific implementation approachNearly all syntax analysis is based on a formal description of thesyntax of the source language (BNF)The syntax analysis portion of a language processor nearly alwaysconsists of two parts:A low-level part called a lexical analyzer (mathematically, a finiteautomaton based on a regular grammar)A high-level part called a syntax analyzer, or parser (mathematically, apush-down automaton based on a context-free grammar, or BNF)Reasons to use BNF to describe syntax:Provides a clear and concise syntax descriptionThe parser can be based directly on the BNFParsers based on BNF are easy to maintain
  2. 2. ICS 313 - Fundamentals of Programming Languages 24.1 Introduction (continued)Reasons to separate lexical and syntax analysis:Simplicity - less complex approaches can be used forlexical analysis; separating them simplifies the parserEfficiency - separation allows optimization of the lexicalanalyzerPortability - parts of the lexical analyzer may not beportable, but the parser always is portable4.2 Lexical AnalysisA lexical analyzer is a pattern matcher for characterstringsA lexical analyzer is a “front-end” for the parserIdentifies substrings of the source program thatbelong together - lexemesLexemes match a character pattern, which isassociated with a lexical category called a tokensum is a lexeme; its token may be IDENT
  3. 3. ICS 313 - Fundamentals of Programming Languages 34.2 Lexical Analysis (continued)The lexical analyzer is usually a function that is called by theparser when it needs the next tokenThree approaches to building a lexical analyzer:1. Write a formal description of the tokens and use a software toolthat constructs table-driven lexical analyzers given such adescription2. Design a state diagram that describes the tokens and write aprogram that implements the state diagram3. Design a state diagram that describes the tokens and hand-construct a table-driven implementation of the state diagramWe only discuss approach 24.2 Lexical Analysis (continued)State diagram design:A naive state diagram would have a transition from every stateon every character in the source language - such a diagramwould be very large!In many cases, transitions can be combined to simplify thestate diagramWhen recognizing an identifier, all uppercase and lowercaseletters are equivalent - Use a character class that includes alllettersWhen recognizing an integer literal, all digits are equivalent -use a digit classReserved words and identifiers can be recognized together(rather than having a part of the diagram for each reservedword)Use a table lookup to determine whether a possible identifier is in fact a reservedword
  4. 4. ICS 313 - Fundamentals of Programming Languages 44.2 Lexical Analysis (continued)Convenient utility subprograms:getChar - gets the next character of input, puts it innextChar, determines its class and puts the class incharClassaddChar - puts the character from nextChar into the placethe lexeme is being accumulated, lexemelookup - determines whether the string in lexeme is areserved word (returns a code)4.2 Lexical Analysis (continued)Implementation (assume initialization):int lex() {switch (charClass) {case LETTER:addChar();getChar();while (charClass == LETTER ||charClass == DIGIT) {addChar();getChar();}return lookup(lexeme);break;case DIGIT:addChar();getChar();while (charClass == DIGIT) {addChar();getChar();}return INT_LIT;break;} /* End of switch */} /* End of function lex */
  5. 5. ICS 313 - Fundamentals of Programming Languages 54.3 The Parsing ProblemGoals of the parser, given an input program:Find all syntax errors; For each, produce an appropriate diagnosticmessage, and recover quicklyProduce the parse tree, or at least a trace of the parse tree, for theprogramTwo categories of parsersTop down - produce the parse tree, beginning at the rootOrder is that of a leftmost derivationBottom up - produce the parse tree, beginning at the leavesOrder is the that of the reverse of a rightmost derivationParsers look only one token ahead in the inputTop-down ParsersGiven a sentential form, xAα , the parser must choose the correct A-rule toget the next sentential form in the leftmost derivation, using only the firsttoken produced by AThe most common top-down parsing algorithms:Recursive descent - a coded implementationLL parsers - table driven implementation4.3 The Parsing Problem (continued)Bottom-up parsersGiven a right sentential form, α, what substring of α is the right-hand side of the rule in the grammar that must be reduced toproduce the previous sentential form in the right derivationThe most common bottom-up parsing algorithms are in the LRfamily (LALR, SLR, canonical LR)The Complexity of ParsingParsers that works for any unambiguous grammar are complexand inefficient (O(n3), where n is the length of the input)Compilers use parsers that only work for a subset of allunambiguous grammars, but do it in linear time (O(n), where nis the length of the input)
  6. 6. ICS 313 - Fundamentals of Programming Languages 64.4 Recursive-Descent ParsingRecursive Descent ProcessThere is a subprogram for each nonterminal in the grammar, which canparse sentences that can be generated by that nonterminalEBNF is ideally suited for being the basis for a recursive-descentparser, because EBNF minimizes the number of nonterminalsA grammar for simple expressions:<expr> → <term> {(+ | -) <term>}<term> → <factor> {(* | /) <factor>}<factor> → id | ( <expr> )Assume we have a lexical analyzer named lex, which puts the nexttoken code in nextTokenThe coding process when there is only one RHS:For each terminal symbol in the RHS, compare it with the next input token;if they match, continue, else there is an errorFor each nonterminal symbol in the RHS, call its associated parsingsubprogram4.4 Recursive-Descent Parsing (continued)/* Function exprParses strings in the language generated by the rule:<expr> → <term> {(+ | -) <term>} */void expr() {/* Parse the first term */term();/* As long as the next token is + or -, calllex to get the next token, and parse the next term */while (nextToken == PLUS_CODE ||nextToken == MINUS_CODE){lex();term();}}This particular routine does not detect errorsConvention: Every parsing routine leaves the next token in nextToken
  7. 7. ICS 313 - Fundamentals of Programming Languages 74.4 Recursive-Descent Parsing (continued)A nonterminal that has more than one RHS requiresan initial process to determine which RHS it is toparseThe correct RHS is chosen on the basis of the next tokenof input (the lookahead)The next token is compared with the first token that canbe generated by each RHS until a match is foundIf no match is found, it is a syntax error4.4 Recursive-Descent Parsing (continued)/* Function factorParses strings in the language generated bythe rule: <factor> -> id | (<expr>) */void factor() {/* Determine which RHS */if (nextToke == ID_CODE)/* For the RHS id, just call lex */lex();/* If the RHS is (<expr>) – call lex to passover the left parenthesis, call expr, andcheck for the right parenthesis */else if (nextToken == LEFT_PAREN_CODE) {lex();expr();if (nextToken == RIGHT_PAREN_CODE)lex();elseerror();} /* End of else if (nextToken == ... */else error(); /* Neither RHS matches */}
  8. 8. ICS 313 - Fundamentals of Programming Languages 84.4 Recursive-Descent Parsing (continued)The LL Grammar ClassThe Left Recursion ProblemIf a grammar has left recursion, either direct or indirect, it cannot be thebasis for a top-down parserA grammar can be modified to remove left recursionThe other characteristic of grammars that disallows top-down parsing isthe lack of pairwise disjointnessThe inability to determine the correct RHS on the basis of one token oflookaheadDef: FIRST(α) = {a | α =>* aβ } (If α =>* ε, ε is in FIRST(α))Pairwise Disjointness Test:For each nonterminal, A, in the grammar that has more than one RHS, foreach pair of rules, A → αi and A → αj, it must be true that FIRST(αi) ∩FIRST(αj) = φExamples:A → a | bB | cAbA → a | aB4.4 Recursive-Descent Parsing (continued)Left factoring can resolve the problemReplace<variable> → identifier | identifier [<expression>]with<variable> → identifier <new><new> → ε | [<expression>]or<variable> → identifier [[<expression>]](the outer brackets are metasymbols of EBNF)
  9. 9. ICS 313 - Fundamentals of Programming Languages 94.5 Bottom-up ParsingThe parsing problem is finding the correct RHS in a right-sentential form to reduce to get the previous right-sententialform in the derivationIntuition about handles:Def: β is the handle of the right sentential formγ = αβw if and only if S =>*rm αAw => αβwDef: β is a phrase of the right sentential formγ if and only if S =>* γ = α1Aα2 =>+ α1βα2Def: β is a simple phrase of the right sentential form γ if andonly if S =>* γ = α1Aα2 => α1βα2The handle of a right sentential form is its leftmost simplephraseGiven a parse tree, it is now easy to find the handleParsing can be thought of as handle pruning4.5 Bottom-up Parsing (continued)Shift-Reduce AlgorithmsReduce is the action of replacing the handle on the top of theparse stack with its corresponding LHSShift is the action of moving the next token to the top of theparse stackAdvantages of LR parsers:They will work for nearly all grammars that describeprogramming languagesThey work on a larger class of grammars than other bottom-upalgorithms, but are as efficient as any other bottom-up parserThey can detect syntax errors as soon as it is possibleThe LR class of grammars is a superset of the class parsableby LL parsers
  10. 10. ICS 313 - Fundamentals of Programming Languages 104.5 Bottom-up Parsing (continued)LR parsers must be constructed with a toolKnuth’s insight: A bottom-up parser could use theentire history of the parse, up to the current point, tomake parsing decisionsThere were only a finite and relatively small number ofdifferent parse situations that could have occurred, so thehistory could be stored in a parser state, on the parsestackAn LR configuration stores the state of an LR parser(S0X1S1X2S2…XmSm, aiai+1…an$)4.5 Bottom-up Parsing (continued)LR parsers are table driven, where the table has twocomponents, an ACTION table and a GOTO tableThe ACTION table specifies the action of the parser,given the parser state and the next tokenRows are state names; columns are terminalsThe GOTO table specifies which state to put on top of theparse stack after a reduction action is doneRows are state names; columns are nonterminals
  11. 11. ICS 313 - Fundamentals of Programming Languages 114.5 Bottom-up Parsing (continued)Initial configuration: (S0, a1…an$)Parser actions:If ACTION[Sm, ai] = Shift S, the next configuration is:(S0X1S1X2S2…XmSmaiS, ai+1…an$)If ACTION[Sm, ai] = Reduce A → β and S = GOTO[Sm-r, A],where r = the length of β, the next configuration is(S0X1S1X2S2…Xm-rSm-rAS, aiai+1…an$)If ACTION[Sm, ai] = Accept, the parse is complete and no errorswere foundIf ACTION[Sm, ai] = Error, the parser calls an error-handlingroutineA parser table can be generated from a given grammarwith a tool, e.g., yacc4.5 Bottom-up Parsing (continued)Reduce 4 (use GOTO[6, T])…* id $…0E1+6F3…Reduce 6 (use GOTO[6, F])* id $0E1+6id5Shift 5id * id $0E1+6Shift 6+ id * id $0E1Reduce 2 (use GOTO[0, E])+ id * id $0T2Reduce 4 (use GOTO[0, T])+ id * id $0F3Reduce 6 (use GOTO[0, F])+ id * id $0id5Shift 5id + id * id $0ActionInputStack1. E → E + T2. E → T3. T → T * F4. T → F5. F → (E)6. F → id