Course Goals
•  To provide students with an understanding of the major phases of a compiler.
•  To introduce students to the theory behind the various phases, including regular expressions, context-free grammars, and finite state automata.
•  To provide students with an understanding of the design and implementation of a compiler.
•  To have the students build a compiler, through type checking and intermediate code generation, for a small language.
•  To provide students with an opportunity to work in a group on a large project.
Course Outcomes
•  Students will have experience using current compiler generation tools.
•  Students will be familiar with the different phases of compilation.
•  Students will have experience defining and specifying the semantic rules of a programming language.
Prerequisites
•  In-depth knowledge of at least one structured programming language.
•  Strong background in algorithms, data structures, and abstract data types, including stacks, binary trees, and graphs.
•  Understanding of grammar theories.
•  Understanding of data types and control structures, their design and implementation.
•  Understanding of the design and implementation of subprograms, parameter passing mechanisms, and scope.
Major Topics Covered in the Course
•  Overview & Lexical Analysis (Scanning)
•  Grammars & Syntax Analysis: Top-Down Parsing
•  Syntax Analysis: Bottom-Up Parsing
•  Semantic Analysis
•  Symbol Tables and Run-time Systems
•  Code Generation
•  Introduction to Optimization and Control Flow Analysis
Textbook
“Compilers: Principles, Techniques, and Tools” by Aho, Lam, Sethi, and Ullman, 2nd edition.
GRADING
Assignments & project: 40
Midterm Exam: 20
Final Exam: 40
Compilers and Interpreters
“Compilation”: translation of a program written in a source language into a semantically equivalent program written in a target language.
[Diagram: Source Program → Compiler → Target Program; the compiler also emits error messages, and the target program maps Input to Output.]
Compilers and Interpreters (cont’d)
“Interpretation”: performing the operations implied by the source program.
[Diagram: Source Program and Input → Interpreter → Output; the interpreter also emits error messages.]
The Analysis-Synthesis Model of Compilation
There are two parts to compilation:
•  Analysis determines the operations implied by the source program, which are recorded in a tree structure.
•  Synthesis takes the tree structure and translates the operations therein into the target program.
Preprocessors, Compilers, Assemblers, and Linkers
Skeletal Source Program → Preprocessor → Source Program → Compiler → Target Assembly Program → Assembler → Relocatable Object Code → Linker → Absolute Machine Code
The linker also takes libraries and relocatable object files as input.
Try for example: gcc -v myprog.c
The Phases of a Compiler
Each phase and its output, with samples for the source string A=B+C;
•  Programmer (source code producer) → source string:  A=B+C;
•  Scanner (performs lexical analysis) → token string:  ‘A’, ‘=’, ‘B’, ‘+’, ‘C’, ‘;’, plus a symbol table with names
•  Parser (performs syntax analysis based on the grammar of the programming language) → parse tree or abstract syntax tree:
        ;
        |
        =
       / \
      A   +
         / \
        B   C
•  Semantic analyzer (type checking, etc.) → annotated parse tree or abstract syntax tree
•  Intermediate code generator → three-address code, quads, or RTL:
        int2fp B  t1
        +      t1 C  t2
        :=     t2 A
•  Optimizer → three-address code, quads, or RTL:
        int2fp B  t1
        +      t1 #2.3 A
•  Code generator → assembly code:
        MOVF   #2.3,r1
        ADDF2  r1,r2
        MOVF   r2,A
•  Peephole optimizer → assembly code:
        ADDF2  #2.3,r2
        MOVF   r2,A
The Grouping of Phases
Compiler front and back ends:
•  Front end: analysis (machine independent)
•  Back end: synthesis (machine dependent)
Compiler passes: a collection of phases is done only once (single pass) or multiple times (multi pass)
•  Single pass: usually requires everything to be defined before being used in the source program
•  Multi pass: the compiler may have to keep the entire program representation in memory
Compiler-Construction Tools
Software development tools are available to implement one or more compiler phases:
•  Scanner generators
•  Parser generators
•  Syntax-directed translation engines
•  Automatic code generators
•  Data-flow engines
What qualities do you want in a compiler that you buy?
1. Correct code
2. Output runs fast
3. Compiler runs fast
4. Compile time proportional to program size
5. Support for separate compilation
6. Good diagnostics for syntax errors
7. Works well with the debugger
8. Good diagnostics for flow anomalies
9. Good diagnostics for storage leaks
10. Consistent, predictable optimization
High-level View of a Compiler
[Diagram: Source code → Compiler → Machine code; the compiler reports errors.]
Implications:
•  Must recognize legal (and illegal) programs
•  Must generate correct code
•  Must manage storage of all variables (and code)
•  Must agree with OS & linker on format for object code
Traditional Two-pass Compiler
[Diagram: Source code → Front End → IR → Back End → Machine code; both ends report errors.]
•  Use an intermediate representation (IR)
•  Front end maps legal source code into IR
•  Back end maps IR into target machine code
•  Admits multiple front ends & multiple passes (better code)
The Front End
[Diagram: Source code → Scanner → tokens → Parser → IR; both report errors.]
Responsibilities:
•  Recognize legal (& illegal) programs
•  Report errors in a useful way
•  Produce IR & preliminary storage map
•  Shape the code for the back end
Much of front end construction can be automated.
The Front End: Scanner
•  Maps the character stream into words, the basic unit of syntax
•  Produces words & their parts of speech:
        x = x + y ;   becomes   <id,x> <op,=> <id,x> <op,+> <id,y> ;
   word ≈ lexeme; part of speech ≈ token type. In casual speech, we call the pair a token.
•  Typical tokens include number, identifier, +, -, while, if
•  Scanner eliminates white space
•  Speed is important ⇒ use a specialized recognizer
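As a concrete illustration, here is a minimal hand-written scanner sketch in C that maps the stream x = x + y ; into <part of speech, lexeme> pairs. The token categories and layout are choices made for this sketch, not the course's reference implementation:

    #include <ctype.h>
    #include <stdio.h>

    /* Minimal hand-written scanner sketch: reads a fixed input string,
       skips white space, and prints <part of speech, lexeme> pairs.
       The categories (id, number, op) are illustrative choices. */
    int main(void) {
        const char *p = "x = x + y ;";
        char lexeme[64];

        while (*p) {
            while (isspace((unsigned char)*p)) p++;   /* scanner eliminates white space */
            if (*p == '\0') break;
            int n = 0;
            if (isalpha((unsigned char)*p)) {         /* identifier: letter then alphanumerics */
                while (isalnum((unsigned char)*p)) lexeme[n++] = *p++;
                lexeme[n] = '\0';
                printf("<id,%s> ", lexeme);
            } else if (isdigit((unsigned char)*p)) {  /* number: a run of digits */
                while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
                lexeme[n] = '\0';
                printf("<number,%s> ", lexeme);
            } else {                                  /* operator/punctuation: one character */
                printf("<op,%c> ", *p++);
            }
        }
        printf("\n");   /* prints: <id,x> <op,=> <id,x> <op,+> <id,y> <op,;> */
        return 0;
    }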
The Front End: Parser
•  Recognizes context-free syntax & reports errors
•  Guides context-sensitive analysis (type checking)
•  Builds IR for the source program
Hand-coded parsers are fairly easy to build; most books advocate using automatic parser generators.
The Front End: Abstract Syntax Trees
Compilers often use an abstract syntax tree; it is much more concise than a parse tree. ASTs are one form of intermediate representation (IR). The AST summarizes grammatical structure, without including detail about the derivation. For example, for (x - 2) + y:
          +
         / \
        -   <id,y>
       / \
  <id,x>  <number,2>
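One way such an AST might be realized in code; a minimal sketch, with node kinds invented for this example:

    /* Minimal AST node: interior nodes hold an operator and two children;
       leaves hold an identifier name or a number. */
    typedef enum { AST_OP, AST_ID, AST_NUM } NodeKind;

    typedef struct Node {
        NodeKind kind;
        char op;                     /* used when kind == AST_OP, e.g. '+' or '-' */
        const char *name;            /* used when kind == AST_ID */
        int value;                   /* used when kind == AST_NUM */
        struct Node *left, *right;   /* children of an AST_OP node */
    } Node;

    /* (x - 2) + y is then the tree:
       op('+', op('-', id("x"), num(2)), id("y")) */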
The Back End
[Diagram: IR → Instruction Selection → IR → Instruction Scheduling → IR → Register Allocation → Machine code; errors reported throughout.]
Responsibilities:
•  Translate IR into target machine code
•  Choose instructions to implement each IR operation
•  Decide which values to keep in registers
•  Ensure conformance with system interfaces
Automation has been much less successful in the back end.
The Back End: Instruction Selection
•  Produce fast, compact code
•  Take advantage of target features such as addressing modes
•  Usually viewed as a pattern matching problem: ad hoc methods, pattern matching, dynamic programming
This was “the problem of the future” in 1978, spurred by the transition from the PDP-11 to the VAX-11; the orthogonality of RISC simplified this problem.
The Back End: Instruction Scheduling
•  Avoid hardware stalls and interlocks
•  Use all functional units productively
•  Can increase lifetimes of variables (changing the allocation)
Optimal scheduling is NP-Complete in nearly all cases, but good heuristic techniques are well understood.
The Back End: Register Allocation
•  Have each value in a register when it is used
•  Manage a limited set of resources
•  Can change instruction choices & insert LOADs & STOREs
Optimal allocation is NP-Complete (for 1 or k registers); compilers approximate solutions to NP-Complete problems.
Traditional Three-pass Compiler
[Diagram: Source code → Front End → IR → Middle End → IR → Back End → Machine code; all stages report errors.]
Code improvement (or optimization):
•  Analyzes IR and rewrites (or transforms) IR
•  Primary goal is to reduce the running time of the compiled code
•  May also improve space, power consumption, …
•  Must preserve the “meaning” of the code, as measured by the values of named variables
The Optimizer (or Middle End)
Modern optimizers are structured as a series of passes.
[Diagram: IR → Opt 1 → Opt 2 → Opt 3 → … → Opt n → IR; each pass reports errors.]
Typical transformations:
•  Discover & propagate some constant value
•  Move a computation to a less frequently executed place
•  Discover a redundant computation & remove it
•  Remove useless or unreachable code
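To make these transformations concrete, here is a small source-level analogue (a fragment invented for illustration; real optimizers perform these rewrites on the IR):

    int f(int y) {
        int x = 4;            /* constant discovered ...               */
        int dead = y * y;     /* useless: the result is never used     */
        if (x > 5)            /* ... and propagated: always false here */
            return y / x;     /* unreachable                           */
        return y + x;         /* becomes y + 4                         */
    }
    /* After constant propagation plus useless- and unreachable-code
       removal, the body reduces to: return y + 4; */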
The Big Picture
Why study lexical analysis? We want to avoid writing scanners by hand.
Goals:
•  To simplify specification & implementation of scanners
•  To understand the underlying techniques and technologies
[Diagram: specifications → Scanner Generator → tables or code; source code → Scanner → parts of speech.]
Lexical Analysis
The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form (token-name, attribute-value): the first component, token-name, is an abstract symbol that is used during syntax analysis, and the second component, attribute-value, points to an entry in the symbol table for this token.
Example
Suppose a source program contains the assignment statement:
    position = initial + rate * 60
The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax analyzer:
•  position is a lexeme that would be mapped into the token (id, 1).
•  The assignment symbol = is a lexeme that is mapped into the token (=).
•  initial is a lexeme that is mapped into the token (id, 2).
•  + is a lexeme that is mapped into the token (+).
•  rate is a lexeme that is mapped into the token (id, 3).
•  * is a lexeme that is mapped into the token (*).
•  60 is a lexeme that is mapped into the token (60).
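A sketch of how those attribute values might index a symbol table; the interning scheme here is an assumption for illustration:

    #include <stdio.h>
    #include <string.h>

    /* Toy symbol table: the attribute value k in a token (id, k) is the
       1-based index of the lexeme's entry, as in the example above. */
    static const char *symtab[16];
    static int nsyms = 0;

    static int intern(const char *lexeme) {
        for (int i = 0; i < nsyms; i++)
            if (strcmp(symtab[i], lexeme) == 0)
                return i + 1;          /* already present: reuse the entry */
        symtab[nsyms++] = lexeme;      /* otherwise add a new entry */
        return nsyms;
    }

    int main(void) {
        /* position = initial + rate * 60 */
        int p = intern("position");
        int i = intern("initial");
        int r = intern("rate");
        printf("(id,%d) (=) (id,%d) (+) (id,%d) (*) (60)\n", p, i, r);
        return 0;   /* prints: (id,1) (=) (id,2) (+) (id,3) (*) (60) */
    }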
Specifying Lexical Patterns (micro-syntax)
A scanner recognizes the language’s parts of speech. Some parts are easy:
•  White space
        WhiteSpace → blank | tab | WhiteSpace blank | WhiteSpace tab
•  Keywords and operators: specified as literal patterns: if, then, else, while, =, +, …
•  Comments: opening and (perhaps) closing delimiters
        /* followed by */ in C
        // in C++
        % in LaTeX
Specifying Lexical Patterns (micro-syntax)
A scanner recognizes the language’s parts of speech. Some parts are more complex:
•  Identifiers: alphabetic followed by alphanumerics + _, &, $, …; may have limited length
•  Numbers
        Integers: 0, or a digit from 1-9 followed by digits from 0-9
        Decimals: integer . digits from 0-9, or . digits from 0-9
        Reals: (integer or decimal) E (+ or -) digits from 0-9
        Complex: ( real , real )
We need a notation for specifying these patterns, and we would like the notation to lead to an implementation.
Regular Expressions
Patterns form a regular language  ***  any finite language is regular  ***
Regular expressions (REs) describe regular languages.
Regular expression (over alphabet Σ):
•  ε is a RE denoting the set { ε }
•  If a is in Σ, then a is a RE denoting { a }
•  If x and y are REs denoting L(x) and L(y), then
        (x) is a RE denoting L(x)
        x | y is a RE denoting L(x) ∪ L(y)
        xy is a RE denoting L(x)L(y)
        x* is a RE denoting L(x)*
Precedence is closure, then concatenation, then alternation; so a | bc* reads as a | (b(c*)).
Ever type “rm *.o a.out”?
Set Operations (refresher)
You need to know these definitions:
•  Union: L ∪ M = { s | s is in L or s is in M }
•  Concatenation: LM = { st | s is in L and t is in M }
•  Kleene closure: L* = L⁰ ∪ L¹ ∪ L² ∪ … (zero or more concatenations of L)
•  Positive closure: L⁺ = L¹ ∪ L² ∪ … (one or more concatenations of L)
Examples of Regular Expressions
Identifiers:
        Letter → ( a | b | c | … | z | A | B | C | … | Z )
        Digit → ( 0 | 1 | 2 | … | 9 )
        Identifier → Letter ( Letter | Digit )*
Numbers:
        Integer → ( + | - | ε ) ( 0 | ( 1 | 2 | 3 | … | 9 ) Digit* )
        Decimal → Integer . Digit*
        Real → ( Integer | Decimal ) E ( + | - | ε ) Digit*
        Complex → ( Real , Real )
Numbers can get much more complicated!
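Assuming a POSIX system, these patterns translate almost directly into extended regular expressions; a small sketch using <regex.h>, where the pattern spellings are this sketch's choices:

    #include <regex.h>
    #include <stdio.h>

    /* Match a string against a POSIX extended RE, anchored at both ends. */
    static int matches(const char *pattern, const char *s) {
        regex_t re;
        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
            return 0;
        int ok = (regexec(&re, s, 0, NULL, 0) == 0);
        regfree(&re);
        return ok;
    }

    int main(void) {
        const char *identifier = "^[A-Za-z][A-Za-z0-9]*$";  /* Letter ( Letter | Digit )* */
        const char *integer    = "^[+-]?(0|[1-9][0-9]*)$";  /* ( + | - | eps ) ( 0 | 1-9 Digit* ) */

        printf("%d\n", matches(identifier, "rate"));        /* 1 */
        printf("%d\n", matches(identifier, "9lives"));      /* 0 */
        printf("%d\n", matches(integer, "-60"));            /* 1 */
        return 0;
    }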
Regular Expressions (the point)
To make scanning tractable, programming languages differentiate between parts of speech by controlling their spelling (as opposed to dictionary lookup). The difference between Identifier and Keyword is entirely lexical:
•  While is a Keyword
•  Whilst is an Identifier
The lexical patterns used in programming languages are regular. Using results from automata theory, we can automatically build recognizers from regular expressions.
⇒ We study REs to automate scanner construction!
Example: Recognizer for Register
Consider the problem of recognizing register names:
        Register → r ( 0 | 1 | 2 | … | 9 ) ( 0 | 1 | 2 | … | 9 )*
•  Allows registers of arbitrary number
•  Requires at least one digit
The RE corresponds to a recognizer (or DFA):
        s0 —r→ s1 —(0|1|2|…|9)→ s2 (accepting state), with s2 —(0|1|2|…|9)→ s2
With implicit transitions on other inputs to an error state, se.
Example (continued)
DFA operation:
•  Start in state s0 and take transitions on each input character
•  The DFA accepts a word x iff x leaves it in a final state (s2)
So, for the Register recognizer above:
•  r17 takes it through s0, s1, s2 and accepts
•  r takes it through s0, s1 and fails
•  a takes it straight to se
Example (continued)
The recognizer translates directly into code; to change DFAs, just change the tables.

        char ← next character
        state ← s0
        call action(state, char)
        while (char ≠ eof)
            state ← δ(state, char)
            call action(state, char)
            char ← next character
        if type(state) = final
            then report acceptance
            else report failure

        action(state, char)
            switch (type(state))
                case start:  word ← char
                case normal: word ← word + char
                case final:  word ← char
                case error:  report error
            end
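A concrete C rendering of this table-driven skeleton for the Register DFA above, with the action routine reduced to tracking acceptance; the state numbering and input classification are choices made for this sketch:

    #include <stdio.h>

    /* Table-driven recognizer for Register -> r (0|1|...|9)(0|1|...|9)*.
       States: 0 = s0, 1 = s1 (saw 'r'), 2 = s2 (accepting), 3 = se (error).
       Input classes: 0 = 'r', 1 = digit, 2 = anything else. */
    static int classify(char c) {
        if (c == 'r') return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    static const int delta[4][3] = {
        /* 'r' digit other */
        {   1,   3,    3 },   /* s0 */
        {   3,   2,    3 },   /* s1 */
        {   3,   2,    3 },   /* s2: further digits stay in s2 */
        {   3,   3,    3 },   /* se: the error state is a sink */
    };

    static int accepts(const char *word) {
        int state = 0;                       /* start in s0 */
        for (; *word; word++)
            state = delta[state][classify(*word)];
        return state == 2;                   /* accept iff we end in s2 */
    }

    int main(void) {
        printf("%d %d %d\n", accepts("r17"), accepts("r"), accepts("a")); /* 1 0 0 */
        return 0;
    }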
What if we need a tighter specification?
r Digit Digit* allows arbitrary numbers:
•  Accepts r00000
•  Accepts r99999
What if we want to limit it to r0 through r31? Write a tighter regular expression:
        Register → r ( ( 0 | 1 | 2 ) ( Digit | ε ) | ( 4 | 5 | 6 | 7 | 8 | 9 ) | ( 3 | 30 | 31 ) )
        Register → r0 | r1 | r2 | … | r31 | r00 | r01 | r02 | … | r09
This produces a more complex DFA:
•  Has more states
•  Same cost per transition
•  Same basic implementation
Tighter register specification (continued)
The DFA for
        Register → r ( ( 0 | 1 | 2 ) ( Digit | ε ) | ( 4 | 5 | 6 | 7 | 8 | 9 ) | ( 3 | 30 | 31 ) )
accepts a more constrained set of registers; same set of actions, more states:
        s0 —r→ s1
        s1 —0,1,2→ s2;  s2 —(0|1|2|…|9)→ s3
        s1 —3→ s5;  s5 —0,1→ s6
        s1 —4,5,6,7,8,9→ s4
(accepting states: s2, s3, s4, s5, s6; all other transitions go to the error state se)
Tighter register specification (continued)
To implement the recognizer:
•  Use the same code skeleton
•  Use transition and action tables for the new RE
•  Bigger tables, more space, same asymptotic costs
•  Better (micro-)syntax checking at the same cost
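Following that recipe, only the tables change for the tighter RE; a sketch reusing the skeleton from the earlier sketch, where the state numbering follows the DFA above and state 7 is the error sink:

    #include <stdio.h>

    /* Tables for Register -> r ( (0|1|2)(Digit|eps) | (4|...|9) | 3 | 30 | 31 ).
       States: 0..6 = s0..s6, 7 = se. Accepting states: s2..s6.
       Input classes: 0='r', 1='0'|'1', 2='2', 3='3', 4='4'..'9', 5=other. */
    static int classify(char c) {
        if (c == 'r') return 0;
        if (c == '0' || c == '1') return 1;
        if (c == '2') return 2;
        if (c == '3') return 3;
        if (c >= '4' && c <= '9') return 4;
        return 5;
    }

    static const int delta[8][6] = {
        /*  r  0-1   2    3  4-9 other */
        {   1,  7,   7,   7,  7,  7 },  /* s0: only 'r' advances            */
        {   7,  2,   2,   5,  4,  7 },  /* s1: first digit picks the branch */
        {   7,  3,   3,   3,  3,  7 },  /* s2: r0|r1|r2; one more digit ok  */
        {   7,  7,   7,   7,  7,  7 },  /* s3: r00..r29, done               */
        {   7,  7,   7,   7,  7,  7 },  /* s4: r4..r9, done                 */
        {   7,  6,   7,   7,  7,  7 },  /* s5: r3; only 0 or 1 may follow   */
        {   7,  7,   7,   7,  7,  7 },  /* s6: r30 or r31, done             */
        {   7,  7,   7,   7,  7,  7 },  /* se: error sink                   */
    };

    static int accepts(const char *word) {
        int state = 0;
        for (; *word; word++)
            state = delta[state][classify(*word)];
        return state >= 2 && state <= 6;     /* accept iff in s2..s6 */
    }

    int main(void) {
        printf("%d %d %d\n", accepts("r31"), accepts("r17"), accepts("r99")); /* 1 1 0 */
        return 0;
    }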
