2. WHAT IS LEXICAL ANALYSIS
Lexical analysis is the first phase of a compiler.
The input is a high-level language program, such as a 'C' program, in the form of a sequence of characters.
The output is a sequence of tokens that is sent to the parser for syntax analysis.
The lexical analyzer strips blanks, tabs, newlines, and comments from the source program.
It also keeps track of line numbers.
3. TOKENS, PATTERNS, AND LEXEMES
Token (also called a word)
A class of similar lexemes that logically belong together,
e.g., identifiers, keywords, constants.
Pattern
A rule that describes the set of strings making up a token;
the pattern is said to match each string in the set.
E.g., the pattern for an identifier: a letter followed by
letters or digits.
Lexeme
The actual sequence of characters matched by a pattern to
form an instance of the corresponding token.
4. EXAMPLES OF TOKENS, LEXEMES, AND PATTERNS
Classes of Tokens
Identifiers: names chosen by the programmer
Keywords: names already in the programming
language
Separators: punctuation characters
Operators: symbols that operate on arguments
and produce results
Literals: numeric, logical, textual literals
Token     Lexeme      Pattern
ID        x, y, n0    letter followed by letters and digits
NUM       -123, 4.5   any numeric constant
IF        if          if
LPAREN    (           (
LITERAL   "Hello"     any string of characters between " and "
6. TOKENS IN PROGRAMMING LANGUAGE
Keywords, operators, identifiers, constants, literal
strings, punctuation symbols such as parentheses,
brackets, commas, semicolons, and colons, etc.
A unique integer representing the token is passed by
the lexical analyzer to the parser.
Attributes for tokens (apart from the integer
representing the token)
identifier: the lexeme of the token, or a pointer into the
symbol table where the lexeme is stored by the lexical
analyzer.
intnum: the value of the integer (similarly for
floatnum, etc.)
string: the string itself.
The exact set of attributes depends on the
compiler designer.
7. SPECIFICATION AND RECOGNITION OF TOKENS
Regular definitions, a mechanism based on regular
expressions, are a very popular way of specifying tokens.
They are implemented in the lexical analyzer generator
tool, LEX.
We study regular expressions first and then token
specification using LEX.
Transition diagrams, a variant of finite state automata,
are used to implement regular definitions and to
recognize tokens.
Transition diagrams are usually used to model a lexical
analyzer before translating it to a program by hand.
LEX automatically generates an optimized FSA from regular
definitions.
We study FSA and their generation from regular
expressions in order to understand transition
diagrams and LEX.
8. SPECIFYING AND RECOGNIZING TOKENS
Alphabet: any finite set of symbols.
{0,1} is the binary alphabet
{a-z, A-Z} is the alphabet of English letters
{0-9, A-F} is the hexadecimal alphabet
Strings: any finite sequence of symbols from an alphabet.
The length of a string is the total number of symbols in it
The string of length zero is known as the empty string and is denoted by ε
(epsilon)
Special Symbols (in a C-like language):
Arithmetic: +, -, %, *, /
Assignment: =, +=, /=, *=, -=
Punctuation: , ; . ->
Logical: &, &&, |, ||, !
Preprocessor: #
Location specifier (address-of): &
9. LANGUAGE
A language is a set of strings over some finite
alphabet; the set itself may be finite or infinite.
Mathematical set operations can be performed on
computer languages.
Regular expressions describe the regular languages,
which include every finite language.
10. REGULAR EXPRESSION
A regular expression is an expression that describes a set of strings.
It is an important notation for specifying tokens.
The lexical analyzer scans and identifies only those valid
strings/tokens/lexemes that belong to the language at hand.
Regular grammar: a grammar that defines the same
languages as regular expressions.
Regular language: a language defined by a regular
grammar.
Regular expressions are used to specify the patterns of
tokens.
11. Most formalisms provide the following operations to
construct regular expressions:
Alternation:
A vertical bar separates alternatives
E.g., gray|grey can match "gray" or "grey"
Grouping:
Use parentheses to define the scope and precedence of
the operators.
E.g., gray|grey and gr(a|e)y are equivalent
Quantification:
Specifies how often an element is allowed to occur.
12. SYNTAX OF REGULAR EXPRESSION
metasequence   description
.       Matches any single character except newline
[ ]     Any single character contained within the brackets
        [abc] = {a, b, c}
[^ ]    Any single character not contained within the brackets
        [^abc] = { x : x is a character other than a, b, or c }
*       Zero or more times
        ab*c matches ac, abc, abbc, ...
+       One or more times
        [0-9]+ matches 1, 10, 116, ...
?       Zero or one time
        [0-9]? matches "" (empty) or a single digit such as 8
|       Choice (aka alternation or set union)
        abc|def matches "abc" or "def"
( )     Groups a subexpression
        (01) denotes the string "01"
13. AUTOMATA
An automaton is a machine that accepts a language.
Finite state automata accept the regular languages,
which correspond to regular expressions.
Applications of automata:
Switching circuit design
Lexical analyzers in compilers
String processing (grep, awk), etc.
State charts used in object-oriented design
Modeling control applications, e.g., elevator operation
Parsers of all types
Compilers
14. FINITE STATE AUTOMATON
An FSA is an acceptor, or recognizer, of regular languages.
It is a 5-tuple (Q, Σ, δ, q0, F):
Q : finite set of states
Σ : input alphabet
δ : transition function, δ : Q × Σ → Q
q0 : the start state
F : the set of final (accepting) states
In one move from some state q, an FSA reads an input
symbol, changes the state based on δ, and gets ready to
read the next input symbol.
If the last state reached is not a final state, then the input
string is rejected.
16. TYPES OF FSA
Non-deterministic Finite Automata (NFA)
There may be multiple possible transitions on the same
input, or transitions that require no input (ε-transitions).
Deterministic Finite Automata (DFA)
The transition from each state is uniquely determined by
the current input character:
for each state, at most one edge labeled "a" leaves that state
there are no ε-transitions
17. NON-DETERMINISTIC FINITE AUTOMATA
An NFA is a 5-tuple (Q, Σ, δ, q0, F) where
δ : Q × Σ → 2^Q (each move leads to a set of states)
Given the current state, there can be multiple next
states:
the next state may be chosen at random, or
all the next states may be explored in parallel.
Examples:
L = { all strings that end with 0 }
L = { all strings that start with 0 }
L = { all strings over {0,1} of length 2 }
18. DETERMINISTIC FINITE AUTOMATA
Given the current state and input symbol, we know what
the next state will be:
there is exactly one next state
there are no choices and no randomness
DFAs are simple and easy to design
19. NFA TO DFA
Every DFA is an NFA, but not vice versa.
There is an equivalent DFA for every NFA.
A dead configuration in an NFA is equivalent to a
dead/trap state in the DFA.
Exercise: find the equivalent DFA for the NFA given by
M = ({A,B,C}, {a,b}, δ, A, {C}), where δ is given by:
δ      a      b
A     A,B     C
B      A      B
C      -     A,B
20. EXAMPLES
L = { all strings over {0,1} that end with '01' }
Design an NFA for a language that accepts all
strings over {0,1} in which the second-to-last symbol is
always '1'. Then convert it to its equivalent DFA.
21. ERROR RECOVERY
Certain languages do not have any reserved words;
e.g., while, do, if, else, etc., are reserved in 'C' and
'C++', but not in PL/1.
In FORTRAN, some keywords are context-dependent.
In the statement DO 10 I = 10.86, DO10I is an
identifier, and DO is not a keyword.
But in the statement DO 10 I = 10, 86, DO is a
keyword.
Such features require substantial lookahead for
resolution.
The lexical analyzer skips characters in the input until a
well-formed token is found.
22. When an error occurs, the lexical analyzer recovers by:
skipping (deleting) successive characters from the
remaining input until the lexical analyzer can find a
well-formed token (panic-mode recovery),
deleting extraneous characters,
inserting missing characters,
replacing an incorrect character with a correct
character, or
transposing two adjacent characters.
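Panic-mode recovery, the first strategy above, can be sketched as a loop that discards characters until one that could begin a token appears (the helper name and the notion of "can start a token" are illustrative simplifications):

```c
#include <ctype.h>
#include <string.h>

/* Return a pointer to the first character in p that could begin a
   well-formed token; everything before it is skipped (panic mode).
   "Can start a token" is simplified here to: a letter, a digit, or
   one of a few operator/punctuation characters. */
const char *panic_skip(const char *p) {
    while (*p && !isalnum((unsigned char)*p)
              && strchr("+-*/=();,", *p) == NULL)
        ++p;                     /* delete one offending character */
    return p;
}
```

For input like `@@x = 1`, the stray `@` characters are consumed and scanning resumes at `x`.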
23. LEXICAL ANALYZER GENERATOR (LEX)
Lexer or scanner
The algorithm that divides the program into lexical units.
Lex
A program that takes a set of descriptions of possible
tokens and produces a C routine that implements a
scanner.
24. LEX STRUCTURE
%{
<c global variables, prototypes, comments>
%}
[ Definition Section ]
%%
[ Rules Section ] – define how to scan and what
action to take for each token
%%
[ C auxiliary subroutines ] – any user code
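A minimal complete specification following this structure might look as follows (built with `flex`; the token names and actions are illustrative, not prescribed):

```lex
%{
/* C globals visible to the generated scanner */
#include <stdio.h>
int num_count = 0;
%}
DIGIT   [0-9]

%%
{DIGIT}+        { num_count++; printf("NUM(%s)\n", yytext); }
[ \t\n]+        { /* skip whitespace */ }
.               { printf("OTHER(%c)\n", yytext[0]); }
%%

int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }
```

The definition section names the pattern `DIGIT`, the rules section pairs patterns with C actions, and the auxiliary section supplies `main` and `yywrap`.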
25. RULES SECTION
Format
pattern { corresponding actions }
---
pattern { corresponding actions }
pattern:  a regular expression;  action:  C code
Example
[0-9][0-9]* { printf("number"); }
26. TWO NOTES ON LEX
1. Lex matches the longest possible token.
Input: abc
Rule: [a-z]+
-> Token: abc (not "a" or "ab")
2. Lex uses the first applicable rule.
Input: post
Rule 1: "post" { printf("hello"); }
Rule 2: [a-zA-Z]+ { printf("world"); }
-> It will print "hello" (not "world")
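Both disambiguation rules can be mimicked in plain C: match every pattern maximally at the current position, keep the longest match, and break ties by rule order. A sketch with two hard-coded "rules" mirroring the example above (the function names are ours):

```c
#include <string.h>
#include <ctype.h>

/* Rule 1: the literal keyword "post". Returns the match length (0 or 4). */
static int match_kw(const char *s) {
    return strncmp(s, "post", 4) == 0 ? 4 : 0;
}

/* Rule 2: identifier [a-zA-Z]+. Returns the length of the longest match. */
static int match_id(const char *s) {
    int n = 0;
    while (isalpha((unsigned char)s[n])) n++;
    return n;
}

/* Lex-style disambiguation at position s: the longest match wins;
   on a tie, the earlier rule wins. Returns 1 for the keyword rule,
   2 for the identifier rule, 0 if no rule matches. */
int which_rule(const char *s) {
    int k = match_kw(s), i = match_id(s);
    if (k == 0 && i == 0) return 0;
    return (k >= i) ? 1 : 2;    /* >= : rule 1, listed first, wins ties */
}
```

On "post" both rules match 4 characters and rule 1 wins the tie; on "poster" the identifier rule matches 6 characters and wins outright.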
27. VARIABLES OF A LEX PROGRAM
yytext
Whenever the scanner matches a token, the text of the
token is stored in the null-terminated string yytext.
It is a variable that points to the first character of the
lexeme.
yyleng
The length of the string in yytext.
yylex( )
The scanner created by Lex has the entry point yylex( ).