CHAPTER 2:
LEXICAL ANALYSIS
For 3rd Year Computer Science
WHAT IS LEXICAL ANALYSIS
• The first phase of a compiler.
• The input is a high-level language program, such as a C program, in the form of a sequence of characters.
• The output is a sequence of tokens that is sent to the parser for syntax analysis.
• Strips blanks, tabs, newlines, and comments from the source program.
• Keeps track of line numbers.
TOKENS, PATTERNS, AND LEXEMES
• Token (also called word)
  - A string of characters that logically belong together.
  - A class of similar lexemes, e.g. identifiers, keywords, constants.
• Pattern
  - A rule that describes the set of strings belonging to a token.
  - The pattern is said to match each string in the set.
  - E.g. the pattern for an identifier is a letter followed by letters or digits.
• Lexeme
  - The sequence of characters matched by a pattern to form the corresponding token.
• Classes of tokens
  - Identifiers: names chosen by the programmer
  - Keywords: names already in the programming language
  - Separators: punctuation characters
  - Operators: symbols that operate on arguments and produce results
  - Literals: numeric, logical, and textual literals
• Examples of tokens, lexemes, and patterns

  Token     Lexeme       Pattern
  ID        x  y  n0     a letter followed by letters and digits
  NUM       -123  4.5    any numeric constant
  IF        if           if
  LPAREN    (            (
  LITERAL   "Hello"      any string of characters between " and "
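The table above can be turned into a small scanner. The following is a minimal sketch, assuming Python's `re` module stands in for a real lexer generator; the names `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical, and the `SKIP` entry for whitespace is an addition that is not part of the table.

```python
import re

# Hypothetical patterns mirroring the table. IF is listed before ID because
# the lexeme "if" would otherwise also match the identifier pattern.
TOKEN_SPEC = [
    ("IF",      r"if\b"),
    ("ID",      r"[A-Za-z][A-Za-z0-9]*"),
    ("NUM",     r"-?[0-9]+(?:\.[0-9]+)?"),
    ("LPAREN",  r"\("),
    ("LITERAL", r'"[^"\n]*"'),
    ("SKIP",    r"[ \t\n]+"),   # assumption: whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            # m.lastgroup is the name of the pattern that matched (the token),
            # m.group() is the matched text (the lexeme).
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, `tokenize('if (x')` yields the pairs `("IF", "if")`, `("LPAREN", "(")`, and `("ID", "x")`: several different lexemes can map to the same token, which is the relationship the table illustrates.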
RELATION BETWEEN TOKEN AND LEXEME
[Figure not recovered from the slide]
TOKENS IN PROGRAMMING LANGUAGE
• Keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as parentheses, brackets, commas, semicolons, and colons.
• The lexical analyzer passes a unique integer representing the token to the parser.
• Attributes for tokens (apart from the integer representing the token):
  - identifier: the lexeme of the token, or a pointer into the symbol table where the lexical analyzer stored the lexeme.
  - intnum: the value of the integer (similarly for floatnum, etc.).
  - string: the string itself.
• The exact set of attributes depends on the compiler designer.
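One possible way to represent a token together with its attribute is sketched below. It assumes, purely for illustration, three integer token codes (`ID`, `INTNUM`, `STRING`) and a hypothetical `SymbolTable` class whose index plays the role of the "pointer into the symbol table"; a real compiler would choose its own codes and attribute layout.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical integer codes passed to the parser.
ID, INTNUM, STRING = 1, 2, 3


@dataclass
class Token:
    code: int                     # the integer representing the token
    attribute: Union[int, str]    # symbol-table index, integer value, or string


class SymbolTable:
    """Stores each identifier lexeme once and hands out its index."""

    def __init__(self):
        self.lexemes = []
        self.index = {}

    def intern(self, lexeme):
        if lexeme not in self.index:
            self.index[lexeme] = len(self.lexemes)
            self.lexemes.append(lexeme)
        return self.index[lexeme]


def make_token(lexeme, symtab):
    """Attach the attribute appropriate to the token class of the lexeme."""
    if lexeme.isdigit():
        return Token(INTNUM, int(lexeme))            # attribute: the value
    if len(lexeme) >= 2 and lexeme[0] == '"' and lexeme[-1] == '"':
        return Token(STRING, lexeme[1:-1])           # attribute: the string itself
    return Token(ID, symtab.intern(lexeme))          # attribute: symbol-table index
```

Two occurrences of the same identifier receive the same symbol-table index, so later phases can attach type and scope information to a single entry.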
SPECIFICATION AND RECOGNITION OF TOKENS
• Regular definitions, a mechanism based on regular expressions, are very popular for specifying tokens.
  - They have been implemented in the lexical analyzer generator tool LEX.
  - We study regular expressions first, and then token specification using LEX.
• Transition diagrams, a variant of finite state automata (FSA), are used to implement regular definitions and to recognize tokens.
  - Transition diagrams are usually used to model a lexical analyzer before translating it into a program by hand.
  - LEX automatically generates optimized FSA from regular definitions.
  - We study FSA and their generation from regular expressions (covered) in order to understand transition diagrams and LEX.
SPECIFYING AND RECOGNIZING TOKENS
• Alphabet: any finite set of symbols.
  - {0,1} is the binary alphabet
  - {a-z, A-Z} is the alphabet of English letters
  - {0-9, A-F} is the hexadecimal alphabet
• String: any finite sequence of symbols from an alphabet.
  - The length of a string is the total number of symbol occurrences in it
  - The string of length zero is known as the empty string and is denoted by ε (epsilon)
• Special symbols:
  Arithmetic:   + , - , % , * , /         Preprocessor:        #
  Punctuation:  , ; . ->                  Location specifier:  &
  Assignment:   = , += , /= , *= , -=     Logical:             & , && , | , || , !
LANGUAGE
• A set of strings over some finite alphabet; the set itself may be finite or infinite.
• Mathematically, set operations can be performed on computer languages.
• Every finite language, and many infinite ones, can be described by regular expressions.
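Because a language is a set of strings, ordinary set operations apply to it directly. The sketch below uses Python sets for small finite languages; the helper names `strings_of_length` and `concatenate` are hypothetical and chosen only for this illustration.

```python
from itertools import product


def strings_of_length(alphabet, n):
    """All strings of exactly n symbols over the given alphabet."""
    return {"".join(p) for p in product(sorted(alphabet), repeat=n)}


def concatenate(l1, l2):
    """Language concatenation: every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}


binary = {"0", "1"}
length_two = strings_of_length(binary, 2)                   # 00, 01, 10, 11
ends_with_zero = {s for s in length_two if s.endswith("0")}
either = length_two | {""}                                   # union with {ε}
both = length_two & ends_with_zero                           # intersection
```

Note that `strings_of_length(binary, 0)` is the set containing only the empty string ε, which is different from the empty language `set()`.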
REGULAR EXPRESSION
• An expression that describes a set of strings.
• An important notation for specifying tokens.
• The lexical analyzer scans and identifies only a finite set of valid strings/tokens/lexemes that belong to the language at hand.
• Regular grammar: the grammar defined by regular expressions.
• Regular language: the language defined by a regular grammar.
• Regular expressions are used to specify the patterns of tokens.
• Most formalisms provide the following operations to construct regular expressions:
  - Alternation: a vertical bar separates alternatives.
    E.g. gray|grey matches "gray" or "grey".
  - Grouping: parentheses define the scope and precedence of the operators.
    E.g. gray|grey and gr(a|e)y are equivalent.
  - Quantification: specifies how often an element is allowed to occur.
SYNTAX OF REGULAR EXPRESSION
Metasequence   Description
.              Matches any single character except newline
[ ]            A single character contained within the brackets
               [abc] = {a, b, c}
[^ ]           A single character not contained within the brackets
               [^abc] = { x : x is a character other than a, b, or c }
*              Zero or more times
               ab*c = ac, abc, abbc, ...
+              One or more times
               [0-9]+ = 1, 10, 116, ...
?              Zero or one time
               [0-9]? = "" (the empty string), 8, ...
|              Choice (also known as alternation or set union)
               abc|def = "abc" or "def"
( )            Groups a subexpression into a new expression
               (01) denotes the string "01"
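The metasequences in the table can be tried out interactively. This sketch assumes Python's `re` module, whose syntax for these particular operators agrees with the table (other parts of Python's regex syntax differ from Lex); `matches` is a hypothetical helper that tests whether the whole string belongs to the language of the pattern.

```python
import re


def matches(pattern, text):
    """True if the entire text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None


# Alternation and grouping describe the same language here.
assert matches(r"gray|grey", "grey")
assert matches(r"gr(a|e)y", "gray")

# Quantifiers.
assert matches(r"ab*c", "ac") and matches(r"ab*c", "abbc")
assert matches(r"[0-9]+", "116") and not matches(r"[0-9]+", "")
assert matches(r"[0-9]?", "") and matches(r"[0-9]?", "8")
assert not matches(r"[0-9]?", "88")

# Character classes and the dot.
assert matches(r"[^abc]", "x") and not matches(r"[^abc]", "a")
assert matches(r".", "x") and not matches(r".", "\n")

# A group can itself be quantified.
assert matches(r"(01)*", "0101")
```

Using `re.fullmatch` rather than `re.search` matters: the question being asked is whether a string is in the language, not whether it merely contains a matching substring.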
AUTOMATA
• An automaton is a machine that accepts a language.
• Finite state automata accept regular languages (RLs), which correspond to regular expressions (REs).
• Applications of automata:
  - Switching circuit design
  - Lexical analyzer in a compiler
  - String processing (grep, awk), etc.
  - State charts used in object-oriented design
  - Modeling control applications, e.g. elevator operation
  - Parsers of all types
  - Compilers
FINITE STATE AUTOMATON
• An acceptor or recognizer of regular languages.
• A 5-tuple (Q, Σ, δ, q0, F):
  - Q : finite set of states
  - Σ : input alphabet
  - δ : transition function, δ : Q × Σ → Q
  - q0 : the start state
  - F : the set of final (accepting) states
• In one move from some state q, an FSA reads an input symbol, changes state according to δ, and gets ready to read the next input symbol.
• If the state reached after the last input symbol is a final state, the input string is accepted; otherwise it is rejected.
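The 5-tuple definition translates almost directly into code. The sketch below is one possible encoding, assuming δ is stored as a Python dictionary keyed by (state, symbol); the example machine, which accepts binary strings ending in 0, and the names `run_dfa`, `DELTA`, `START`, and `FINALS` are illustrative choices rather than something taken from the slides.

```python
def run_dfa(delta, start, finals, text):
    """Simulate a deterministic FSA: one move per input symbol."""
    state = start
    for symbol in text:
        # delta : Q x Sigma -> Q; a symbol outside Sigma raises KeyError.
        state = delta[(state, symbol)]
    return state in finals


# Example machine over Sigma = {0, 1} accepting strings that end with 0.
# Q = {"q0", "q1"}; "q1" means "the last symbol read was 0".
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q0",
}
START = "q0"
FINALS = {"q1"}

print(run_dfa(DELTA, START, FINALS, "110"))   # accepted
print(run_dfa(DELTA, START, FINALS, "01"))    # rejected
```

The loop body is the "one move" described above, and the final membership test is the accept/reject decision.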
TYPES OF FSA
• Non-deterministic Finite Automata (NFA)
  - There may be multiple possible transitions on the same input, or transitions that do not require an input (ε-transitions).
• Deterministic Finite Automata (DFA)
  - The transition from each state is uniquely determined by the current input character.
  - For each state and each symbol a, at most one edge labeled "a" leaves the state.
  - No ε-transitions.
NON-DETERMINISTIC FINITE AUTOMATA
• A five-tuple (Q, Σ, δ, q0, F)
  - δ : Q × Σ → 2^Q
• Given the current state, there could be multiple next states.
  - The next state may be chosen at random.
  - All the next states may be chosen in parallel.
• Examples:
  - L = { all strings that end with 0 }
  - L = { all strings that start with 0 }
  - L = { all strings over {0,1} of length 2 }
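The "all next states in parallel" view can be simulated by tracking the set of states the NFA could currently be in. The following sketch assumes an NFA without ε-transitions and encodes δ as a dictionary from (state, symbol) to a set of states; the machine shown is one possible NFA for the first example language (strings that end with 0), and all names are hypothetical.

```python
def run_nfa(delta, start, finals, text):
    """Simulate an NFA (no epsilon-transitions) by following all choices at once."""
    current = {start}
    for symbol in text:
        # delta : Q x Sigma -> 2^Q; a missing entry means "no move" (dead configuration).
        current = {t for s in current for t in delta.get((s, symbol), set())}
    # Accept if at least one of the possible runs ends in a final state.
    return bool(current & finals)


# NFA for L = { strings over {0,1} that end with 0 }.
# In q0 on input 0 the machine may stay in q0 or guess that this 0 is the last symbol.
NFA_DELTA = {
    ("q0", "0"): {"q0", "q1"},
    ("q0", "1"): {"q0"},
}
NFA_START = "q0"
NFA_FINALS = {"q1"}

print(run_nfa(NFA_DELTA, NFA_START, NFA_FINALS, "1100"))   # accepted
print(run_nfa(NFA_DELTA, NFA_START, NFA_FINALS, "01"))     # rejected
```

State q1 has no outgoing transitions, so any run that guessed too early simply disappears from the current set on the next symbol.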
DETERMINISTIC FINITE AUTOMATA
• Given the current state and the input symbol, we know what the next state will be.
  - It has exactly one next state.
  - It has no choices or randomness.
  - It is simple and easy to design.
NFA TO DFA
• Every DFA is an NFA, but not vice versa.
• There is an equivalent DFA for every NFA.
• A dead configuration in an NFA is equivalent to a dead (trap) state in a DFA.
• Find the equivalent DFA for the NFA given by M = [ {A,B,C}, {a,b}, δ, A, {C} ], where δ is given by:

        a      b
   A    A,B    C
   B    A      B
   C    -      A,B

• Example
  - L = { all strings over {0,1} that end with "01" }
• Design an NFA that accepts all strings over {0,1} in which the second-to-last symbol is 1. Then convert it to its equivalent DFA.
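The standard way to obtain the equivalent DFA is the subset construction, in which each DFA state is a set of NFA states. The sketch below applies it to the NFA M from the transition table above; it assumes there are no ε-transitions (M has none), and the names `subset_construction` and `M_DELTA` are hypothetical.

```python
from collections import deque


def subset_construction(nfa_delta, start, finals, alphabet):
    """Build a DFA whose states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa_delta = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of the NFA moves from every state in the current subset.
            target = frozenset(
                t for s in current for t in nfa_delta.get((s, symbol), ())
            )
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A subset is final if it contains at least one final NFA state.
    dfa_finals = {s for s in seen if s & finals}
    return dfa_delta, start_set, dfa_finals, seen


# The NFA M = [ {A,B,C}, {a,b}, delta, A, {C} ] from the table.
M_DELTA = {
    ("A", "a"): {"A", "B"}, ("A", "b"): {"C"},
    ("B", "a"): {"A"},      ("B", "b"): {"B"},
    ("C", "b"): {"A", "B"},                      # no move from C on a
}
dfa_delta, dfa_start, dfa_finals, dfa_states = subset_construction(
    M_DELTA, "A", {"C"}, ["a", "b"]
)
```

For M this reaches five DFA states: {A}, {A,B}, {C}, {B,C}, and the empty set. The empty set is the dead (trap) state that corresponds to the dead configuration of C on input a, and the final states are {C} and {B,C}.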
ERROR RECOVERY
• Certain languages do not have any reserved words; e.g. while, do, if, and else are reserved in C and C++, but not in PL/I.
• In FORTRAN, some keywords are context-dependent.
  - In the statement DO 10 I = 10.86, DO10I is an identifier and DO is not a keyword.
  - But in the statement DO 10 I = 10, 86, DO is a keyword.
  - Such features require substantial lookahead for resolution.
• When an error occurs, the lexical analyzer recovers by:
  - skipping (deleting) successive characters from the remaining input until it can find a well-formed token (panic mode recovery)
  - deleting extraneous characters
  - inserting missing characters
  - replacing an incorrect character with a correct character
  - transposing two adjacent characters
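Panic mode recovery is the simplest of these strategies to implement. The following is a minimal sketch, assuming a toy token set (identifiers, integers, whitespace) defined with Python's `re` module; the names `TOKEN` and `scan_with_panic_mode` are hypothetical.

```python
import re

# Toy token set for illustration only.
TOKEN = re.compile(r"(?P<ID>[A-Za-z][A-Za-z0-9]*)|(?P<NUM>[0-9]+)|(?P<WS>\s+)")


def scan_with_panic_mode(text):
    """Return (tokens, errors); errors are (offset, skipped_text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m:
            if m.lastgroup != "WS":
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        else:
            # Panic mode: delete characters until a well-formed token can start.
            start = pos
            while pos < len(text) and not TOKEN.match(text, pos):
                pos += 1
            errors.append((start, text[start:pos]))
    return tokens, errors
```

Scanning `"x @# 42 $y"` with this sketch skips `@#` and `$`, reports both with their offsets, and still delivers the tokens for `x`, `42`, and `y`, so the parser can keep working after the error.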
LEXICAL ANALYZER GENERATOR (LEX)
• Lexer or scanner
  - The algorithm that divides the program into units (tokens).
• Lex
  - A program that takes a set of descriptions of possible tokens and produces a C routine that implements a scanner.
LEX STRUCTURE
%{
  <C global variables, prototypes, comments>
%}
[ Definition Section ]
%%
[ Rules Section ]  - defines how to scan and what action to take for each token
%%
[ C auxiliary subroutines ]  - any user code
RULES SECTION
• Format
    pattern   { corresponding actions }
    ...
    pattern   { corresponding actions }
  Each pattern is a regular expression; each action is a block of C statements.
• Example
    [0-9][0-9]*   { printf("number"); }
TWO NOTES ON LEX
1. Lex prefers the longest match
   Input: abc
   Rule:  [a-z]+
   -> Token: abc (not "a" or "ab")
2. When several rules match the same longest string, Lex uses the first applicable rule
   Input: post
   Rule1: "post"      { printf("hello"); }
   Rule2: [a-zA-Z]+   { printf("world"); }
   -> It will print "hello" (not "world")
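Both notes can be reproduced with a small rule-selection loop. This is only a sketch of the selection policy in Python, not of how Lex is implemented (Lex compiles the rules into a single automaton); `RULES` and `next_action` are hypothetical names, and the action strings stand in for the printf calls of the two rules.

```python
import re

# Rules in the order they appear in the specification.
RULES = [
    ("hello", re.compile(r"post")),        # Rule1: "post"
    ("world", re.compile(r"[a-zA-Z]+")),   # Rule2: [a-zA-Z]+
]


def next_action(text, pos=0):
    """Pick the rule with the longest match; ties go to the earliest rule."""
    best_len, best_action = 0, None
    for action, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly greater: an equally long match from a later rule never wins.
        if m and m.end() - pos > best_len:
            best_len, best_action = m.end() - pos, action
    return best_action, text[pos:pos + best_len]
```

On the input `post` both rules match four characters, so the first rule wins and the action is "hello". On the input `poster` the second rule matches six characters against four, so the longest-match rule applies first and the action is "world".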
VARIABLES OF A LEX PROGRAM
• yytext
  - Whenever the scanner matches a token, the text of the token is stored in the null-terminated string yytext.
  - It points to the first character of the lexeme.
• yyleng
  - The length of the string yytext.
• yylex()
  - The entry point of the scanner created by Lex.
