Lexical analyzer
  • lec00-outline August 9, 2011

Transcript

  • 1. CS40106 Compiler Design Compiler Design 40106
  • 2.
    • Prerequisites:
    • Automata Theory: DFAs, NFAs, Regular Expressions
    • Textbook:
    • Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, “Compilers: Principles, Techniques, and Tools”, Addison-Wesley, 1986.
    • K. Cooper, L. Torczon, “Engineering a Compiler”, Morgan Kaufmann, 1st Edition, 2003.
  • 3. Course Outline
    • Introduction to Compiling
    • Lexical Analysis
    • Syntax Analysis
      • Context Free Grammars
      • Top-Down Parsing, LL Parsing
      • Bottom-Up Parsing, LR Parsing
    • Syntax-Directed Translation
      • Attribute Definitions
      • Evaluation of Attribute Definitions
    • Semantic Analysis, Type Checking
    • Run-Time Organization
    • Intermediate Code Generation
    • Code Optimization
  • 4. COMPILERS
    • A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language.
    • source program → COMPILER → target program
      • The source program is normally written in a high-level programming language.
      • The target program is normally the equivalent program in machine code (a relocatable object file).
    • The compiler also reports error messages.
  • 5. Other Applications
    • Beyond building compilers, the techniques used in compiler design apply to many other problems in computer science.
      • Techniques used in a lexical analyzer can be used in text editors, information retrieval systems, and pattern recognition programs.
      • Techniques used in a parser can be used in a query processing system for a language such as SQL.
      • A symbolic equation solver takes an equation as input, so it must parse that equation.
      • Most of the techniques used in compiler design can be used in Natural Language Processing (NLP) systems.
  • 6. Major Parts of Compilers
    • There are two major parts of a compiler: Analysis and Synthesis.
    • In the analysis phase, an intermediate representation is created from the given source program.
      • The Lexical Analyzer, Syntax Analyzer and Semantic Analyzer are the parts of this phase.
    • In the synthesis phase, the equivalent target program is created from this intermediate representation.
      • The Intermediate Code Generator, Code Generator, and Code Optimizer are the parts of this phase.
  • 7. Phases of A Compiler
    • Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program
    • Each phase transforms the source program from one representation into another representation.
    • The phases communicate with error handlers.
    • The phases communicate with the symbol table.
  • 8. 1. Lexical Analyzer
    • The lexical analyzer reads the source program character by character and returns the tokens of the source program.
    • A token describes a pattern of characters that have the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on).
    • Ex: newval := oldval + 12 => tokens:
      • newval    identifier
      • :=        assignment operator
      • oldval    identifier
      • +         add operator
      • 12        a number
    • It puts information about identifiers into the symbol table.
    • Regular expressions are used to describe tokens (lexical constructs).
    • A (deterministic) finite state automaton can be used in the implementation of a lexical analyzer.
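The finite-automaton idea above can be sketched as a small table-driven scanner. This is only an illustration in Python (the slides do not prescribe a language); the state names, the `TRANSITIONS` table, and the `scan` function are assumptions made for this sketch, and it covers just the characters needed for the example statement.

```python
# Hypothetical table-driven DFA scanner for the example "newval := oldval + 12".
# State names and token names are illustrative assumptions, not from the slides.

def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return ch  # punctuation characters stand for themselves

# TRANSITIONS[state][input class] -> next state
TRANSITIONS = {
    "start":  {"letter": "in_id", "digit": "in_num", ":": "colon", "+": "plus"},
    "in_id":  {"letter": "in_id", "digit": "in_id"},
    "in_num": {"digit": "in_num"},
    "colon":  {"=": "assign"},
    "assign": {},
    "plus":   {},
}

# Accepting states and the token each one reports.
ACCEPTING = {
    "in_id": "identifier",
    "in_num": "number",
    "assign": "assignment operator",
    "plus": "add operator",
}

def scan(source):
    """Return a list of (lexeme, token) pairs, taking the longest match each time."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():          # whitespace only separates tokens
            pos += 1
            continue
        state, start = "start", pos
        last_accept = None                 # (end position, token) of the longest match
        while pos < len(source):
            nxt = TRANSITIONS[state].get(char_class(source[pos]))
            if nxt is None:                # no transition: the DFA run stops here
                break
            state = nxt
            pos += 1
            if state in ACCEPTING:
                last_accept = (pos, ACCEPTING[state])
        if last_accept is None:
            raise ValueError(f"lexical error at position {start}")
        pos, token = last_accept
        tokens.append((source[start:pos], token))
    return tokens

if __name__ == "__main__":
    for lexeme, token in scan("newval := oldval + 12"):
        print(lexeme, "->", token)
```

Run on the example statement, the sketch reports the same five lexeme/token pairs listed above. A production scanner would normally generate the table from regular expressions rather than write it by hand.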
  • 9. 2. Syntax Analyzer
    • A syntax analyzer creates the syntactic structure (generally a parse tree) of the given program.
    • A syntax analyzer is also called a parser.
    • A parse tree describes a syntactic structure. For newval := oldval + 12:
      • assgstmt
        • identifier (newval)
        • :=
        • expression
          • expression
            • identifier (oldval)
          • +
          • expression
            • number (12)
    • In a parse tree, all terminals are at the leaves.
    • All inner nodes are non-terminals of a context free grammar.
  • 10. Syntax Analyzer (CFG)
    • The syntax of a language is specified by a context free grammar (CFG).
    • The rules in a CFG are mostly recursive.
    • A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not.
      • If it satisfies, the syntax analyzer creates a parse tree for the given program.
    • Ex: We use BNF (Backus-Naur Form) to specify a CFG:
      • assgstmt -> identifier := expression
      • expression -> identifier
      • expression -> number
      • expression -> expression + expression
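As a rough illustration of how a parser can check these rules and build a parse tree, here is a small hand-written recursive-descent sketch in Python. It is not the course's implementation: the function name `parse_assgstmt`, the `(kind, lexeme)` token format, and the nested-tuple tree are assumptions. Because the rule `expression -> expression + expression` is left-recursive and ambiguous, the sketch parses the equivalent form `expression -> operand { + operand }` and builds a left-leaning tree.

```python
# Hypothetical recursive-descent parser for the assignment grammar above.
# Parses: assgstmt -> identifier := operand { + operand }
# where operand is an identifier or a number (an assumption replacing the
# left-recursive rule expression -> expression + expression).

def parse_assgstmt(tokens):
    """tokens: list of (kind, lexeme) pairs; returns a nested-tuple parse tree."""
    pos = 0

    def expect(kind):
        """Consume one token of the given kind and return its lexeme."""
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos][0] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        lexeme = tokens[pos][1]
        pos += 1
        return lexeme

    def operand():
        # expression -> identifier | number
        if pos < len(tokens) and tokens[pos][0] == "number":
            return ("expression", ("number", expect("number")))
        return ("expression", ("identifier", expect("identifier")))

    def expression():
        # Fold each "+ operand" into a left-leaning expression node.
        tree = operand()
        while pos < len(tokens) and tokens[pos][0] == "+":
            expect("+")
            tree = ("expression", tree, "+", operand())
        return tree

    target = expect("identifier")
    expect(":=")
    tree = ("assgstmt", ("identifier", target), ":=", expression())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return tree

if __name__ == "__main__":
    example = [("identifier", "newval"), (":=", ":="),
               ("identifier", "oldval"), ("+", "+"), ("number", "12")]
    print(parse_assgstmt(example))
```

For `newval := oldval + 12` the resulting tuple mirrors the parse tree for that statement: an `assgstmt` node whose `expression` child has two `expression` children joined by `+`.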
  • 11. Syntax Analyzer versus Lexical Analyzer
    • Which constructs of a program should be recognized by the lexical analyzer, and which ones by the syntax analyzer?
      • Both of them do similar things, but the lexical analyzer deals with the simple, non-recursive constructs of the language.
      • The syntax analyzer deals with recursive constructs of the language.
      • The lexical analyzer simplifies the job of the syntax analyzer.
      • The lexical analyzer recognizes the smallest meaningful units (tokens) in a source program.
      • The syntax analyzer works on the smallest meaningful units (tokens) in a source program to recognize meaningful structures in our programming language.
  • 12. Parsing Techniques
    • Depending on how the parse tree is created, there are different parsing techniques.
    • These parsing techniques are categorized into two groups:
      • Top-Down Parsing,
      • Bottom-Up Parsing
    • Top-Down Parsing:
      • Construction of the parse tree starts at the root, and proceeds towards the leaves.
      • Efficient top-down parsers can be easily constructed by hand.
      • Techniques: recursive predictive parsing and non-recursive predictive parsing (LL parsing).
    • Bottom-Up Parsing:
      • Construction of the parse tree starts at the leaves, and proceeds towards the root.
      • Normally efficient bottom-up parsers are created with the help of some software tools.
      • Bottom-up parsing is also known as shift-reduce parsing.
      • Operator-Precedence Parsing: simple, restrictive, easy to implement
      • LR Parsing: a much more general form of shift-reduce parsing; variants include LR, SLR, LALR
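To make the shift-reduce idea concrete, here is a deliberately naive Python sketch for the toy grammar `E -> E + E | id`. It is an assumption-laden illustration, not an LR parser: handles are found by inspecting the top of the stack instead of consulting a parsing table, and reducing as early as possible is what resolves the grammar's ambiguity (making `+` left-associative). The function name `shift_reduce` is hypothetical.

```python
# Hypothetical shift-reduce recognizer for the toy grammar
#   E -> E + E | id
# No LR table is used; the top of the stack is matched against rule bodies.

def shift_reduce(tokens):
    """Return the list of actions taken; raise SyntaxError if the input is rejected."""
    stack, actions = [], []
    remaining = list(tokens)
    while True:
        if stack[-1:] == ["id"]:
            stack[-1:] = ["E"]                      # handle "id" on top of the stack
            actions.append("reduce E -> id")
        elif stack[-3:] == ["E", "+", "E"]:
            stack[-3:] = ["E"]                      # handle "E + E" on top of the stack
            actions.append("reduce E -> E + E")
        elif remaining:
            stack.append(remaining.pop(0))          # no handle: shift the next token
            actions.append(f"shift {stack[-1]}")
        else:
            break                                   # nothing to reduce, nothing to shift
    if stack != ["E"]:
        raise SyntaxError(f"cannot reduce stack {stack} to E")
    actions.append("accept")
    return actions

if __name__ == "__main__":
    for action in shift_reduce(["id", "+", "id"]):
        print(action)
```

For `id + id` the actions are: shift id, reduce E -> id, shift +, shift id, reduce E -> id, reduce E -> E + E, accept. The tree is built from the leaves towards the root, which is the defining property of bottom-up parsing.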
  • 13. 3. Semantic Analyzer
    • A semantic analyzer checks the source program for semantic errors and collects type information for code generation.
    • Type checking is an important part of the semantic analyzer.
    • The context-free grammars used in syntax analysis are augmented with attributes (semantic rules):
      • the result is a syntax-directed translation,
      • Attribute grammars
    • Ex:
      • newval := oldval + 12
        • The type of the identifier newval must match the type of the expression (oldval+12)
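The type rule in this example can be sketched as a tiny checker. Everything here is an assumption for illustration: the type names `integer` and `real`, the dictionary used as a symbol table, the tuple form of expressions, and the strict rule that both sides must have exactly the same type (real languages often allow conversions).

```python
# Hypothetical type checker for "identifier := expression" statements.
# Assumed rule: operand types must agree, and the target's declared type
# must equal the type of the expression.

def expression_type(expr, symbols):
    """expr is an int/float literal, an identifier name, or ("+", left, right)."""
    if isinstance(expr, int):
        return "integer"
    if isinstance(expr, float):
        return "real"
    if isinstance(expr, str):
        if expr not in symbols:
            raise NameError(f"undeclared identifier {expr}")
        return symbols[expr]
    op, left, right = expr
    left_type = expression_type(left, symbols)
    right_type = expression_type(right, symbols)
    if left_type != right_type:
        raise TypeError(f"operands of {op} have types {left_type} and {right_type}")
    return left_type

def check_assignment(target, expr, symbols):
    """Raise TypeError unless the target's declared type matches the expression."""
    if target not in symbols:
        raise NameError(f"undeclared identifier {target}")
    expr_type = expression_type(expr, symbols)
    if symbols[target] != expr_type:
        raise TypeError(f"cannot assign {expr_type} to {target} ({symbols[target]})")
    return expr_type

if __name__ == "__main__":
    table = {"newval": "integer", "oldval": "integer"}   # assumed declarations
    print(check_assignment("newval", ("+", "oldval", 12), table))
```

With both identifiers declared as `integer`, the check succeeds; declaring `newval` as `real` in the assumed table would make the same statement fail with a `TypeError`.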
  • 14. 4. Intermediate Code Generation
    • A compiler may produce explicit intermediate code representing the source program.
    • Intermediate code is generally machine (architecture) independent, but its level is close to the level of machine code.
    • Ex:
      • Source statement: newval := oldval * fact + 1
      • With identifiers in token form: id1 := id2 * id3 + 1
      • Intermediate code (three-address code):
        • temp1 := id2 * id3
        • temp2 := temp1 + 1
        • id1 := temp2
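One way to picture how such three-address code is produced is a post-order walk over the expression tree that allocates a fresh temporary for every operator. The Python sketch below is only an illustration; the function name `generate_tac` and the tuple representation `(op, left, right)` are assumptions.

```python
# Hypothetical three-address code generator for nested binary expressions.
# Expressions are tuples (op, left, right); leaves are names or constants.

def generate_tac(target, expr):
    """Return a list of three-address statements that assign expr to target."""
    code = []
    temp_count = 0

    def emit(node):
        """Emit code for node and return the name that holds its value."""
        nonlocal temp_count
        if not isinstance(node, tuple):
            return str(node)               # leaves are used directly as operands
        op, left, right = node
        left_operand = emit(left)          # operands are evaluated first (post-order)
        right_operand = emit(right)
        temp_count += 1
        temp = f"temp{temp_count}"         # fresh temporary for this operator
        code.append(f"{temp} := {left_operand} {op} {right_operand}")
        return temp

    result = emit(expr)
    code.append(f"{target} := {result}")
    return code

if __name__ == "__main__":
    for line in generate_tac("id1", ("+", ("*", "id2", "id3"), 1)):
        print(line)
```

For `id1 := id2 * id3 + 1` this prints the same three statements as the example: `temp1 := id2 * id3`, `temp2 := temp1 + 1`, `id1 := temp2`.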
  • 15. Intermediate Code Generation
    • Properties of IR:
      • Easy to produce
      • Easy to translate into the target program
    • IR can be in the following forms:
      • Syntax trees
      • Postfix notation
      • Three address statements
    • Properties of three address statements:
      • At most one operator in addition to an assignment operator
      • Must generate temporary names to hold the value computed at each instruction
      • Some instructions have fewer than three operands
  • 16. 5. Code Optimizer (for Intermediate Code Generator)
    • The code optimizer optimizes the code produced by the intermediate code generator in terms of time and space.
    • Ex:
      • temp1 := id2 * id3
      • id1 := temp1 + 1
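The optimized code in this example can be obtained from the three-statement intermediate code by folding a temporary that is only copied into its destination. The Python sketch below shows that single transformation; it is an assumed, simplified pass (the name `eliminate_copies` is hypothetical) and it relies on each temporary being used exactly once, which holds for the example but is not checked.

```python
# Hypothetical copy-elimination pass over three-address statements.
# Assumption: a temporary that is copied by the next statement is not used
# anywhere else, so the two statements can be merged.

def eliminate_copies(code):
    """Fold 'tempN := a op b' followed by 'x := tempN' into 'x := a op b'."""
    optimized = []
    i = 0
    while i < len(code):
        dest, rhs = code[i].split(" := ")
        if i + 1 < len(code):
            next_dest, next_rhs = code[i + 1].split(" := ")
            if dest.startswith("temp") and next_rhs == dest:
                optimized.append(f"{next_dest} := {rhs}")   # merge the pair
                i += 2
                continue
        optimized.append(code[i])                            # keep statement unchanged
        i += 1
    return optimized

if __name__ == "__main__":
    unoptimized = ["temp1 := id2 * id3", "temp2 := temp1 + 1", "id1 := temp2"]
    for line in eliminate_copies(unoptimized):
        print(line)
```

Applied to the unoptimized statements, the pass leaves `temp1 := id2 * id3` alone and merges the last two into `id1 := temp1 + 1`, saving one instruction and one temporary.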
  • 17. 6. Code Generator
    • Produces the target language for a specific architecture.
    • The target program is normally a relocatable object file containing the machine code or assembly code.
    • Intermediate instructions are each translated into a sequence of machine instructions that perform the same task.
    • Ex (assuming an architecture in which at least one operand of each instruction is a machine register):
      • MOVE id2,R1
      • MULT id3,R1
      • ADD #1,R1
      • MOVE R1,id1
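A very small code generator that reproduces this instruction sequence can be sketched as follows. This Python illustration makes strong assumptions: a single register `R1`, only the `*` and `+` operators, the mnemonics from the example, and input in which every temporary is consumed as the left operand of the very next statement. The names `generate_code`, `operand`, and `OPCODES` are hypothetical, and anything outside those assumptions is rejected rather than handled.

```python
# Hypothetical single-register code generator for three-address statements
# of the form "dest := left op right".

OPCODES = {"*": "MULT", "+": "ADD"}

def operand(text):
    """Constants use immediate syntax (#1); names are used as-is."""
    return f"#{text}" if text.isdigit() else text

def generate_code(tac):
    """Translate the statements into MOVE/MULT/ADD instructions using only R1."""
    asm = []
    in_r1 = None                       # name whose value currently lives in R1
    for line in tac:
        dest, rhs = line.split(" := ")
        left, op, right = rhs.split()
        # Temporaries live only in R1 in this sketch, so other uses are unsupported.
        if right.startswith("temp") or (left.startswith("temp") and left != in_r1):
            raise NotImplementedError("temporary is not available in R1")
        if left != in_r1:
            asm.append(f"MOVE {operand(left)},R1")    # load the left operand
        asm.append(f"{OPCODES[op]} {operand(right)},R1")
        in_r1 = dest
        if not dest.startswith("temp"):
            asm.append(f"MOVE R1,{dest}")             # store results bound to variables
    return asm

if __name__ == "__main__":
    for instruction in generate_code(["temp1 := id2 * id3", "id1 := temp1 + 1"]):
        print(instruction)
```

Given the two optimized statements `temp1 := id2 * id3` and `id1 := temp1 + 1`, the sketch emits the four instructions shown in the example. A real code generator would also handle register allocation and spilling, which this sketch omits.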
  • 18. Phases of compiler
  • 20. The Structure of a Compiler (8)
    • Scanner [Lexical Analyzer] → Tokens
    • Parser [Syntax Analyzer] → Parse tree
    • Semantic Process [Semantic analyzer] → Abstract Syntax Tree w/ Attributes
    • Code Generator [Intermediate Code Generator] → Non-optimized Intermediate Code
    • Code Optimizer → Optimized Intermediate Code
    • Code Generator → Target machine code
  • 21. The input program as you see it.
        main ()
        {
          int i,sum;
          sum = 0;
          for (i=1; i<=10; i++);
          sum = sum + i;
          printf("%d\n",sum);
        }
  • 34. Introducing Basic Terminology
    • What are Major Terms for Lexical Analysis?
      • TOKEN
        • A classification for a common set of strings
        • Examples Include <Identifier>, <number>, etc.
      • PATTERN
        • The rules which characterize the set of strings for a token
        • Recall File and OS Wildcards ([A-Z]*.*)
      • LEXEME
        • Actual sequence of characters that matches pattern and is classified by a token
        • Identifiers: x, count, name, etc…
  • 35. Introducing Basic Terminology
        Token     | Sample Lexemes        | Informal Description of Pattern
        const     | const                 | const
        if        | if                    | if
        relation  | <, <=, =, <>, >, >=   | < or <= or = or <> or >= or >
        id        | pi, count, D2         | letter followed by letters and digits
        num       | 3.1416, 0, 6.02E23    | any numeric constant
        literal   | "core dumped"         | any characters between " and " except "
    • A token classifies a pattern.
    • Actual values are critical.
    • Info is:
      • 1. Stored in symbol table
      • 2. Returned to parser
  • 36. Attributes for Tokens
    • Tokens influence parsing decisions; the attributes influence the translation of tokens.
    • Example: E = M * C ** 2
      • <id, pointer to symbol-table entry for E>
      • <assign_op, >
      • <id, pointer to symbol-table entry for M>
      • <mult_op, >
      • <id, pointer to symbol-table entry for C>
      • <exp_op, >
      • <num, integer value 2>
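The token/attribute pairs for `E = M * C ** 2` can be reproduced with a short regular-expression tokenizer. The Python sketch below is an illustration under assumptions: the pattern set covers only what this example needs, the symbol table is a plain list, and the "pointer to symbol-table entry" is modeled as an index into that list. The names `TOKEN_PATTERNS`, `MASTER`, and `tokenize` are hypothetical.

```python
import re

# Hypothetical token patterns; "**" is listed before "*" so the longer
# operator is tried first.
TOKEN_PATTERNS = [
    ("num",       r"\d+"),
    ("id",        r"[A-Za-z][A-Za-z0-9]*"),
    ("exp_op",    r"\*\*"),
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("skip",      r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_PATTERNS))

def tokenize(source):
    """Return (tokens, symbol_table); each token is a (token name, attribute) pair."""
    symbol_table = []                  # identifier lexemes, in order of first appearance
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"lexical error at position {pos}")
        pos = match.end()
        kind, lexeme = match.lastgroup, match.group()
        if kind == "skip":
            continue                   # whitespace produces no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))   # index acts as the "pointer"
        elif kind == "num":
            tokens.append((kind, int(lexeme)))                  # attribute is the integer value
        else:
            tokens.append((kind, None))                         # operators carry no attribute
    return tokens, symbol_table

if __name__ == "__main__":
    toks, table = tokenize("E = M * C ** 2")
    print(toks)
    print(table)
```

The token names decide what the parser does next, while the attributes (symbol-table index or integer value) are what later phases use for translation.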
  • 37. Role of the Lexical Analyzer
  • 38.
    • Tasks of the Lexical Analyzer
      • Reads the source text and detects the tokens
      • Strips out comments and whitespace (blanks, tabs, newline characters)
      • Correlates error messages from the compiler with the source program
    • Approaches to implementation
      • Use assembly language: most efficient but most difficult to implement
      • Use a high-level language like C: efficient but difficult to implement
      • Use tools like lex or flex: easy to implement but not as efficient as the first two approaches
  • 40.
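      • Handling Lexical Errors
      • In what situations do errors occur?
        • The lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of the remaining input.
        • By contrast, in fi ( a == f(x) )… the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or a function identifier; since fi is a valid identifier lexeme, it returns an identifier token and leaves the error to a later phase.
      • Possible error-recovery actions:
        • 1) Panic mode recovery: deleting successive characters from the remaining input until a well-formed token can be found
        • 2) Inserting a missing character
        • 3) Replacing an incorrect character by a correct character
        • 4) Transposing two adjacent characters
        • 5) Deleting an extraneous character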
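Panic mode recovery can be sketched in a few lines. The Python example below is an illustration with an assumed, minimal token set (identifiers, integers, a handful of operators); the names `TOKEN` and `scan_with_recovery` are hypothetical. When no pattern matches at the current position, the scanner records the offending character, deletes it, and tries again.

```python
import re

# Hypothetical token patterns: identifiers, integers, a few operators, whitespace.
TOKEN = re.compile(r"(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>==|[()=+])|(?P<skip>\s+)")

def scan_with_recovery(source):
    """Return (tokens, errors); unmatched characters are deleted one at a time."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            # Panic mode: report the offending character, drop it, and retry.
            errors.append((pos, source[pos]))
            pos += 1
            continue
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens, errors

if __name__ == "__main__":
    print(scan_with_recovery("a = 1 @# + b"))
    print(scan_with_recovery("fi ( a == f(x) )"))
```

On `a = 1 @# + b` the sketch skips `@` and `#`, reports them with their positions, and still returns the five valid tokens. On `fi ( a == f(x) )` it reports no lexical error at all, because `fi` matches the identifier pattern.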