5/11/2021 Saeed Parsa 1
Compiler Design
Lexical Analysis
Saeed Parsa
Room 332,
School of Computer Engineering,
Iran University of Science & Technology
parsa@iust.ac.ir
Winter 2021
What is Lexical Analyzer?
 The lexical analyzer is usually a function that is called by the parser when it needs the next
token.
 The main task of the lexical analyzer is to read the input characters of the source
program, group them into lexemes, and produce as output a token for each lexeme in
the source program.
What is Lexical Analyzer?
Lexeme: sequence of characters in source program matching a pattern.
Keywords; examples: for, while, if, etc.
Identifiers; examples: variable names, function names, etc.
Operators; examples: '+', '++', '-', etc.
Separators; examples: ',', ';', etc.
Read Section 3.1.2 on page 111 of the Aho book.
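These lexeme classes can be illustrated with a small regular-expression-based tokenizer sketch in Python. The token names and patterns below are illustrative, not the course's actual scanner:

```python
import re

# Hypothetical patterns for the four lexeme classes above.
# Order matters: keywords are tried before identifiers.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:for|while|if|else|int)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"\+\+|--|[+\-*/=]"),
    ("SEPARATOR",  r"[,;(){}]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (lexeme class, lexeme) pairs for each lexeme in source."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()
```

For example, `tokenize("while (i) i++;")` classifies `while` as a keyword, `i` as an identifier, and `++` as an operator.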
What is Lexical Analyzer?
Examples
What is Lexical Analyzer?
Output structure:
typedef struct Token {
int row; // 1- Row number of the lexeme
int col; // 2- Column number of the lexeme
int BlkNo ; // 3- Nested block no.
enum Symbols type; // 4- The lexeme code
char Name[30]; // 5- The lexeme itself
} TokenType ;
Input:
FILE *source;
source = fopen(input_file, "r+");
Input & Output structure:
What is Lexical Analyzer?
Output structure: Nesting block no.
A variable's name alone is not enough to distinguish it from other variables with
the same name; to identify a name uniquely, the nesting number of its enclosing
block is also needed.
Example:
{ int I ;
  I = 5 ;
  { int I ;
    I = 6 ;
    printf("2nd blk %d", I);
  }
  printf("\n1st blk %d", I);
}
Token example:
For the first "int" above, the token is:
Row : 1
Col: 3
BlkNo: 1
Type: S_int
Lexeme: "int"
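The nesting number can be maintained with a counter and a stack that the scanner updates on every '{' and '}'. A minimal Python sketch (the helper name and the token-list representation are hypothetical):

```python
def block_numbers(tokens):
    """Attach a nesting block number to each token.
    '{' opens a new block, numbered in order of appearance;
    '}' returns to the enclosing block."""
    next_no, stack, out = 0, [0], []
    for tok in tokens:
        if tok == "{":
            next_no += 1
            stack.append(next_no)       # enter a new block
        out.append((tok, stack[-1]))    # tag token with current block no.
        if tok == "}":
            stack.pop()                 # leave the block
    return out
```

On the slide's example, the two declarations of I receive block numbers 1 and 2, so the lexer can tell them apart.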
What is Lexical Analyzer?
Output structure:
To each kind of lexeme a different code is assigned. Enumerated data types make
it possible to name these constants.
For instance, in the following enum definition, the first constant, S_Program = 0,
represents the program keyword, and S_Eq = 2 represents the equal sign.
What is Lexical Analyzer?
Output structure:
typedef enum Symbols
{ S_Program, S_Const, S_Eq, S_Semi, S_Id, S_No, S_Type,
S_Record, S_End, S_Int, S_Real, S_Char, S_Array, S_String,
S_Begin, S_Var, S_Colon, S_ParBaz, S_ParBast, S_Set,
S_of, S_BrackBaz, S_BrackBast, S_Case, S_ConstString,
S_Function, S_Procedure, S_Dot, S_Comma,
S_If, S_Then, S_Else, S_While, S_Do, S_Repeat, S_Until,
S_For, S_Add, S_Sub, S_Div, S_Mul, S_Mod, S_Lt, S_Le,
S_Gt, S_Ge, S_Ne, S_Not, S_And, S_Or
} Symbols;
Implementation
 Lexical Analysis can be implemented with the Finite State Automata (FSA).
 A Finite State Automaton has
 A set of states
• One marked initial
• Some marked final
 A set of transitions from state to state
• Each labeled with an alphabet symbol or ε
 Operate by beginning at the start state, reading symbols and making
indicated transitions
 When input ends, state must be final or else reject
 Note: This FSA recognizes comments in C++.
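The comment automaton can be simulated directly. Below is a minimal Python sketch; the state numbering follows the traces on the next slide, and the transitions are my reconstruction of the diagram:

```python
def accepts_comment(s):
    """Simulate the comment DFA: 0 start, 1 after '/', 2 inside the
    comment body, 3 after a '*' inside the body, 4 accept (comment closed)."""
    state = 0
    for ch in s:
        if state == 0:
            state = 1 if ch == '/' else -1
        elif state == 1:
            state = 2 if ch == '*' else -1
        elif state == 2:
            state = 3 if ch == '*' else 2
        elif state == 3:
            state = 4 if ch == '/' else (3 if ch == '*' else 2)
        else:                       # state 4: extra input after */
            state = -1
        if state == -1:             # dead (error) state
            return False
    return state == 4
```

Running it on the slide's inputs: `/* ab1-***+*/` is accepted, `/- aaaaa */` fails at the second character, and `/* ab***-` ends in a non-final state.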
Implementation
• Example:
Source:
1. Input: /* ab1-***+*/
State: 000122222333234 Accept
2. Input: /- aaaaa */
State: 0001 Error
3. Input: /* ab***-
State: 00012223332 Infinite loop
Blank | \t | \n
Finite State Automata (FSA)
 A Finite State Automaton is a recognizer or acceptor of regular Language.
 The word finite is used because the number of possible states and the number of symbols in
alphabet sets are finite.
 In Greek, automaton means self-acting.
 Formally, a finite automaton is a 5-tuple, denoted by M, where:
 M = (Q, Σ, δ, q0, F).
 Q is a finite set of states.
 Σ is a finite input alphabet.
 δ is the transition function, δ: Q × Σ → Q.
 q0 is the start state, q0 ∈ Q.
 F is the set of final (accepting) states, F ⊆ Q.
https://www.researchgate.net/publication/311251505_An_exploration_on_lexical_analysis
An example of a DFA
 This figure depicts an example of a deterministic finite automaton.
 The formal definition is
 M = (Q, Σ, δ, q0, F).
 Q = {q0, q1, q2}, the set of all states.
 Σ = {0, 1}; δ is given in the transition table.
 q0 is the start (initial) state.
 F = {q2}.
 The language is L(M) = {w | w ends with 00}: w can be any combination of 0s and 1s
that ends with 00.
https://www.researchgate.net/publication/311251505_An_exploration_on_lexical_analysis
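The transition table can be written out explicitly and driven by a generic loop. A Python sketch (the table is my reconstruction of δ for "ends with 00"; q1 means one trailing 0, q2 means two or more):

```python
# Reconstructed transition table: q0 = no trailing 0s,
# q1 = one trailing 0, q2 = at least two trailing 0s (accepting).
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q2", ("q1", "1"): "q0",
    ("q2", "0"): "q2", ("q2", "1"): "q0",
}

def run_dfa(w, delta=DELTA, start="q0", final=frozenset({"q2"})):
    """Drive any DFA from its transition table; accept iff the run ends in F."""
    state = start
    for symbol in w:
        state = delta[(state, symbol)]
    return state in final
```

Because the driver only reads the table, the same function runs any DFA in this chapter once its δ is written as a dictionary.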
An example of a DFA
 The language is L(M) = {w | w does not include 010}: w can be any combination of 0s
and 1s that does not contain the substring 010.
 The formal definition is
 M = (Q, Σ, δ, q0, F).
 Q = {q0, q1, q2}, the set of all states.
 Σ = {0, 1}; δ is given in the transition table.
 q0 is the start (initial) state.
 F = {q0, q1, q2}.
An example of a DFA
 A DFA that accepts exactly one a. Σ = {a}.
 A DFA that accepts at least one a. Σ = {a}.
 A DFA that accepts an even number of a's. Σ = {a}.
https://swaminathanj.github.io/fsm/dfa.html
Transition diagrams
 A DFA starts consuming the input string at q0, trying to reach a final state.
 In a single transition from some state q, the DFA reads an input symbol, changes state
according to δ, and gets ready to read the next input symbol.
 If the last state is final, the string is accepted; otherwise it is rejected.
 A finite automaton (FA) consists of a finite set of states and a set of transitions from
state to state that occur on input symbols chosen from an alphabet Σ.
 If, for each input symbol, there is exactly one transition out of each state, then M is
said to be a deterministic finite automaton.
 A directed graph, called a transition diagram, is associated with a finite automaton as
follows.
 The vertices of the graph correspond to the states of the FA.
https://shodhganga.inflibnet.ac.in/bitstream/10603/77125/8/08_chapter%201.pdf
Transition diagrams
 The following Figure is a transition diagram that recognizes the lexemes matching the
token relop (relational operators).
 Note, however, that state 4 has a * to indicate that we must retract the input one position.
Page 131 of AHO book
Recognizing Identifiers
 Recognizing keywords and identifiers presents a problem.
 Usually, keywords like if or then are reserved (as they are in our running example),
so they are not identifiers even though they look like identifiers.
 After an identifier is detected, isKeyword() is invoked to check whether the detected
lexeme is a keyword.
Page 132 of AHO book
Recognizing Numbers
Page 133 of AHO book
Different types of numbers:
• 23.32
• 12
• 12.
• .13
• 12.5E-8
Recognizing all lexicons
 Lexical rules of a language can be defined in
terms of a DFA.
 In state zero whitespaces are ignored.
 State 1 recognizes identifiers.
 After an identifier is detected, isKeyword() is
invoked to check whether the detected lexeme is
a keyword.
Architecture of a Transition-Diagram-Based Lexical Analyzer
 There are several ways that a collection of transition diagrams can be used to build a
lexical analyzer.
 Regardless of the overall strategy, each state is represented by a piece of code.
 We may imagine a variable state holding the number of the current state for a
transition diagram.
 A switch based on the value of state takes us to code for each of the possible states,
where we find the action of that state.
 Often, the code for a state is itself a switch statement or multiway branch that
determines the next state by reading and examining the next input character.
 You may write a program to convert a state transition diagram to a program.
Page 134 of AHO book
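The "state variable plus switch" architecture described above can be sketched as a loop in Python; the states and character classes below are illustrative, not the exact diagram from the book:

```python
def next_token(chars):
    """Skeleton of a switch-based scanner: 'state' selects the action;
    each action examines one character and picks the next state.
    Recognizes identifiers ([A-Za-z][A-Za-z0-9]*) and integers."""
    state, lexeme = 0, ""
    for ch in chars:
        if state == 0:                    # start state: skip blanks, classify ch
            if ch.isspace():
                continue
            state = 1 if ch.isalpha() else 2 if ch.isdigit() else -1
            if state == -1:
                raise ValueError(f"unexpected character {ch!r}")
            lexeme = ch
        elif state == 1:                  # inside an identifier
            if ch.isalnum():
                lexeme += ch
            else:                         # ch starts the next token: stop here
                return ("ID", lexeme)
        elif state == 2:                  # inside a number
            if ch.isdigit():
                lexeme += ch
            else:
                return ("NUM", lexeme)
    if state == 1:
        return ("ID", lexeme)
    return ("NUM", lexeme) if state == 2 else None
```

A full scanner would also push the terminating character back (the retraction marked with * on the diagrams) so the next call can reuse it.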
Converting the state diagram to a lexical analysis program
struct TokenType lexicalAnalyser ( FILE *Source )
{
enum Symbols LexiconType ; // Type of lexeme
char NextChar, NextWord[80] ; // Next char & next word
int State, Length; // State no. in the automaton
static char LastChar = '\0'; // Extra char read in the last call
static int RowNo = 0, ColNo = 0; // Row and column no.
State = 0 ; // Start state no. is zero
Length = 0 ; // Length of the detected lexeme
while( !feof( Source )) // While EOF not encountered
{
if ( LastChar ) { // If an extra char was read in the last call
NextChar = LastChar; LastChar = '\0'; } // Retract the character
else
NextChar = fgetc( Source ) ; // Read next character
NextWord[Length++] = NextChar; // Begin to build the next lexeme
State Length NextChar LastChar NextWord
0 1 v '\0' v
1 2 o '\0' vo
1 3 i '\0' voi
1 4 d '\0' void
1 5 (blank) '\0' void(blank)
Converting the state diagram to a lexical analysis program
switch (State) // Operate depending on the state
{
case 0: // Start state
if (NextChar == '\n')
{RowNo++ ; ColNo = 0; }
else ColNo++;
if (NextChar == ' ' || NextChar == '\t' || NextChar == '\n')
Length = 0;
else if (( NextChar <= 'z' && NextChar >= 'a' ) ||
( NextChar <= 'Z' && NextChar >= 'A' ))
State = 1;
else if ( NextChar <= '9' && NextChar >= '0' )
State = 2 ;
else if ( NextChar == '(' ) State = 3;
else if ( NextChar == '<' ) State = 4;
else if ( NextChar == '>' ) State = 5;
else LexerError(NextWord, Length);
break ; // End of state 0
Converting the state diagram to a lexical analysis program
//switch (State) // Operate depending on the state
//{
case 1 : // Recognizing identifiers
if (isalpha(NextChar) || isdigit(NextChar) || NextChar == '_')
State = 1;
else {
LastChar = NextChar ; NextWord[--Length] = '\0';
return MakeToken(isKeyWord(NextWord));
}
break ;
case 3 : …
} // End of switch
} // End of while
State Length NextChar LastChar NextWord
0 1 v '\0' v
1 2 o '\0' vo
1 3 i '\0' voi
1 4 d '\0' void
1 5 (blank) (blank) void'\0'
The returned lexeme is "void".
Example
Converting the state diagram to a lexical analysis program
 The isKeyword method determines
whether a given identifier, key, is a
keyword.
enum Symbols isKeyWord(char *key) {
int I ;
struct KeyType {
char *key;
enum Symbols Type;
}
KeyTab[] = { "if", S_If,
"while", S_While,
"then", S_Then,
"else", S_Else,
"integer", S_Integer,
"type", S_Type,
"function", S_Function,
0, 0};
for(I=0; KeyTab[I].key && strcmp(KeyTab[I].key, key); I++);
if(KeyTab[I].key) return KeyTab[I].Type;
return S_Identifier;
} // End of isKeyWord
Example 1
 Design a DFA which, with Σ = {0, 1}, accepts strings with either an even number of 0s or an
odd number of 1s. Write the code for a lexical analyzer based on the designed DFA.
 The four states correspond to the four parity cases:
(even 0s, even 1s), (even 0s, odd 1s), (odd 0s, even 1s), (odd 0s, odd 1s).
[Transition diagram: each 0 flips the 0-parity and each 1 flips the 1-parity.]
00010101010 not acceptable
1100 acceptable
110100 acceptable
0010101 acceptable
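The design can be checked quickly in Python by tracking the two parities the four states encode (a behavioral check of the DFA, not the C lexer code from the next slides):

```python
def accepts(w):
    """Accept w over {0,1} iff it contains an even number of 0s or an odd
    number of 1s. The DFA's four states are the pairs (zeros%2, ones%2)."""
    zeros = ones = 0          # current parity state
    for ch in w:
        if ch == "0":
            zeros ^= 1        # a 0 flips the 0-parity
        else:
            ones ^= 1         # a 1 flips the 1-parity
    return zeros == 0 or ones == 1
```

This reproduces the verdicts listed above: 00010101010 has seven 0s and four 1s, so it is rejected; the other three strings are accepted.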
Example 1-2
switch (State) // Operate depending on the state
{
case 0: // Even no. of 0s & even no. of 1s
if (NextChar == '\n')
{RowNo++ ; ColNo = 0; }
else ColNo++;
if (NextChar == ' ' || NextChar == '\t' || NextChar == '\n')
Length = 0;
else if ( NextChar == '1' ) State = 2;
else if ( NextChar == '0' ) State = 1;
else return "accepted";
break ;
case 3: // Even no. of 0s & odd no. of 1s
if ( NextChar == '1' ) State = 1;
else if ( NextChar == '0' ) State = 2;
else return "accepted";
break ;
[Transition diagram: states 0-3; ' ' | \t | \n loops at the start state, and each 0 or 1 moves between the four parity states.]
Example 1-3
switch (State) // Operate depending on the state
{
case 1: // Odd no. of 0s & even no. of 1s
if ( NextChar == '1' ) State = 3;
else if ( NextChar == '0' ) State = 0;
else LexerError();
break ;
case 2: // Odd no. of 0s & odd no. of 1s
if ( NextChar == '1' ) State = 0;
else if ( NextChar == '0' ) State = 3;
else return "accepted";
break ;
} // End of switch
Example 2
 Design a DFA which, with Σ = {a, b}, accepts strings with at least two a's and one b.
[Transition diagram: states q0-q5 track how many a's have been seen (0, 1, ≥2) and whether a b has been seen; the state with at least two a's and one b is accepting.]
Exercise
1. Design a DFA with ∑ = {0, 1} that accepts those strings which do not start with "10" and
end with 0. Write a lexical analyzer program to implement the automaton.
2. Design a deterministic finite automaton (DFA) for accepting the language:
L = {a^n b^m | m+n is even}.
3. Design a DFA for accepting numbers in base 3 whose sum of digits is 5.
Regular expressions
• By definition, a regular expression is a pattern that defines a set of character sequences.
• Lexical rules may be defined in terms of regular expressions and in terms of
deterministic finite automata.
• Examples:
Page 128 of AHO book
identifier : letter (letter | digit | '_')* ;
comment : '(' '*' ( ( r | '*'+ s )* r )* '*'+ ')' ;
r : all characters apart from '*'
s : all characters apart from '*' and ')'
ANTLR Lexical Rules
(Part 2)
Actions on token attributes
• All tokens have a collection of predefined, read-only attributes.
• The attributes include useful token properties such as the token type and text
matched for a token.
• Actions can access these attributes via $label.attribute where label labels a
particular instance of a token reference.
• To access the tokens matched for literals, you must use a label:
https://github.com/antlr/antlr4/blob/master/doc/actions.md
stat: r='return' expr {System.out.println("line="+$r.line);} ;
Token actions
• Most of the time you access the attributes of the token, but sometimes it is
useful to access the Token object itself because it aggregates all the attributes.
• Further, you can use it to test whether an optional subrule matched a token:
https://github.com/antlr/antlr4/blob/master/doc/actions.md
stat: 'if' expr 'then' stat (el='else' stat)?
{if ( $el!=null ) System.out.println("found an else");}
| ...
;
Token attributes
https://github.com/antlr/antlr4/blob/master/doc/actions.md
Attribute | Type | Description
text | String | The text matched for the token; translates to a call to getText. Example: $ID.text.
type | int | The token type (nonzero positive integer) of the token, such as INT; translates to a call to getType. Example: $ID.type.
line | int | The line number on which the token occurs, counting from 1; translates to a call to getLine. Example: $ID.line.
pos | int | The character position within the line at which the token's first character occurs, counting from zero; translates to a call to getCharPositionInLine. Example: $ID.pos.
Token attributes
https://github.com/antlr/antlr4/blob/master/doc/actions.md
Attribute | Type | Description
index | int | The overall index of this token in the token stream, counting from zero; translates to a call to getTokenIndex. Example: $ID.index.
channel | int | The token's channel number. The parser tunes to only one channel, effectively ignoring off-channel tokens. The default channel is 0 (Token.DEFAULT_CHANNEL), and the default hidden channel is Token.HIDDEN_CHANNEL. Translates to a call to getChannel. Example: $ID.channel.
int | int | The integer value of the text held by this token; it assumes that the text is a valid numeric string. Handy for building calculators and so on. Translates to Integer.valueOf(text-of-token). Example: $INT.int.
Lexical rules in ANTLR
• Identifiers
A basic identifier is a nonempty sequence of uppercase and lowercase letters.
ID : ('a'..'z'|'A'..'Z')+ ; // match 1-or-more upper or lowercase letters
As a shorthand for character sets, ANTLR supports the more familiar regular
expression set notation:
ID : [a-zA-Z]+ ; // match 1-or-more upper or lowercase letters
• Keywords
Rule ID could also match keywords such as enum, if, while, then and for, which means
there’s more than one rule that could match the same string.
ANTLR lexers resolve ambiguities between lexical rules by favoring the rule specified first.
That means your ID rule should be defined after all of your keyword rules.
Lexical rules in ANTLR
Identifiers:
/*Fragments*/
Identifier
: Identifiernondigit
(Identifiernondigit | DIGIT)*
;
fragment Identifiernondigit
: NONDIGIT
| Universalcharactername
;
fragment NONDIGIT
: [a-zA-Z]
;
fragment DIGIT
: [0-9]
;
fragment Universalcharactername
: '\\u' Hexquad
| '\\U' Hexquad Hexquad
;
Lexical rules in ANTLR
Keywords:
Void
: 'void'
;
Volatile
: 'volatile'
;
While
: 'while'
;
Switch
: 'switch'
;
Struct
: 'struct'
;
Goto
: 'goto'
;
If
: 'if'
;
Inline
: 'inline'
;
Int
: 'int'
;
Long
: 'long'
;
False
: 'false'
;
Final
: 'final'
;
Float
: 'float'
;
For
: 'for'
;
Else
: 'else'
;
Enum
: 'enum'
;
Lexical rules in ANTLR
Numbers:
• Describing integer numbers such as 10 is easy because it’s just a
sequence of digits.
INT : '0'..'9'+ ; // match 1 or more digits
or
INT : [0-9]+ ; // match 1 or more digits
Lexical rules in ANTLR
Floating point numbers:
 A floating-point number is a sequence of digits followed by a period and then
optionally a fractional part, or it starts with a period and continues with a sequence of
digits.
FLOAT: DIGIT+ '.' DIGIT* // match 1. , 39. , 3.14159 , etc...
| '.' DIGIT+ // match .1 .14159
;
fragment
DIGIT : [0-9] ; // match single digit
 By prefixing the rule with fragment, we let ANTLR know that the rule will be used
only by other lexical rules.
Lexical rules in ANTLR
Strings:
 A string is a sequence of any characters between double quotes.
STRING : '"' .*? '"' ; // match anything in "..."
 The dot wildcard operator matches any single character.
 Therefore, .* would be a loop that matches any sequence of zero or more characters
 ANTLR provides support for nongreedy subrules using standard regular expression notation
(the ? suffix).
 Nongreedy subrules match the fewest number of characters while still allowing the entire
surrounding rule to match.
 To support the common escape characters, we need something like the following:
STRING: '"' (ESC|.)*? '"' ;
fragment ESC : '\\"' | '\\\\' ; // 2-char sequences \" and \\
Lexical rules in ANTLR
Comments:
 When a lexer matches the tokens we’ve defined so far, it emits them via the token
stream to the parser.
 But when the lexer matches comment and whitespace tokens, we’d like it to toss them
out.
 Here is how to match both single-line and multiline comments for C-derived
languages:
LINE_COMMENT : '//' .*? '\r'? '\n' -> skip; // Match "//" stuff '\n'
COMMENT : '/*' .*? '*/' -> skip ; // Match "/*" stuff "*/"
 In LINE_COMMENT, .*? consumes everything after // until it sees a newline
(optionally preceded by a carriage return to match Windows-style newlines).
 In COMMENT, .*? consumes everything after /* and before the terminating */.
Lexical rules in ANTLR
Whitespaces:
 Most programming languages treat whitespace characters as token separators but
otherwise ignore them.
 Python is an exception because it uses whitespace for particular syntax purposes:
newlines to terminate commands and indent level, with initial tabs or spaces to
indicate nesting level.
 Here is how to tell ANTLR to throw out whitespace:
WS : (' '|'\t'|'\r'|'\n')+ -> skip; // match 1-or-more whitespace but discard
or
WS : [ \t\r\n]+ -> skip; // match 1-or-more whitespace but discard
Lexical rules in ANTLR
Whitespaces:
Whitespace
: [ \t]+ -> channel(HIDDEN)
;
Newline
: ('\r' '\n'? | '\n') -> channel(HIDDEN)
;
BlockComment
: '/*' .*? '*/' -> channel(HIDDEN)
;
LineComment
: '//' ~[\r\n]* -> channel(HIDDEN)
;
 By using the channel(HIDDEN) action you tell ANTLR to keep the token, but on a hidden channel that the parser ignores.
Lexical rules in ANTLR
Assignment operators:
assignmentoperator
: '=' #assignmentoperator1
| '*=' #assignmentoperator2
| '/=' #assignmentoperator3
| '%=' #assignmentoperator4
| '+=' #assignmentoperator5
| '-=' #assignmentoperator6
| RightShiftAssign #assignmentoperator7
| LeftShiftAssign #assignmentoperator8
| '&=' #assignmentoperator9
| '^=' #assignmentoperator10
| '|=' #assignmentoperator11
;
Lexical rules in ANTLR
Operators:
theoperator
: New #theoperator1
| Delete #theoperator2
| New '[' ']' #theoperator3
| Delete '[' ']' #theoperator4
| '+' #theoperator5
| '-' #theoperator6
| '*' #theoperator7
| '/' #theoperator8
| '%' #theoperator9
| '^' #theoperator10
| '&' #theoperator11
| '|' #theoperator12
| '~' #theoperator13
| '!' #theoperator14
| 'not' #theoperator15
| '=' #theoperator16
| '<' #theoperator17
| '>' #theoperator18
| '+=' #theoperator19
| '-=' #theoperator20
| '*=' #theoperator21
| '/=' #theoperator22
| '%=' #theoperator23
| '^=' #theoperator24
| '&=' #theoperator25
| '|=' #theoperator26
| LeftShift #theoperator27
| RightShift #theoperator28
| RightShiftAssign #theoperator29
| LeftShiftAssign #theoperator30
| '==' #theoperator31
Lexical rules in ANTLR
Nested curly brackets:
 Consider that matching nested curly braces with a DFA must be done using a counter
whereas nested curlies are trivially matched with a context-free grammar:
ACTION
: '{' ( ACTION | ~'}' )* '}'
;
 The recursion, of course, is the dead giveaway that this is not an ordinary lexer rule.
• Lexer rules may use more than a single symbol of lookahead, can use semantic predicates,
and can specify syntactic predicates to look arbitrarily ahead.
ESCAPE_CHAR
: '\\' 't' // two chars of lookahead needed,
| '\\' 'n' // due to the common left-prefix
;
Lexer Starter Kit
 Punctuations
call : ID '(' exprList ')' ;
 Some programmers prefer to define token labels such as LP (left parenthesis) instead.
call : ID LP exprList RP ;
LP : '(' ;
RP : ')' ;
 Keywords
 Keywords are reserved identifiers, and we can either reference them directly or define
token types for them:
returnStat : 'return' expr ';' ;
Lexer Starter Kit
 Identifiers
ID : ID_LETTER (ID_LETTER | DIGIT)* ; // From C language
fragment ID_LETTER : 'a'..'z'|'A'..'Z'|'_' ;
fragment DIGIT : '0'..'9' ;
 Numbers
INT : DIGIT+ ;
FLOAT
: DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
 Strings
STRING : '"' ( ESC | . )*? '"' ;
fragment ESC : '\\' [btnr"] ; // \b, \t, \n etc...
Lexer Starter Kit
 Comments
LINE_COMMENT : '//' .*? '\n' -> skip ;
COMMENT : '/*' .*? '*/' -> skip ;
 Whitespace
WS : [ \t\n\r]+ -> skip ;
Tokenizing Sentences
• Humans unconsciously combine letters into words before recognizing grammatical
structure while reading.
• Recognizers that feed off character streams are called tokenizers or lexers.
• Just as an overall sentence has structure, the individual tokens have structure.
• At the character level, we refer to syntax as the lexical structure.
• We want to recognize lists of names such as [a,b,c] and nested lists such as [a,[b,c],d]:
grammar NestedNameList;
list : '[' elements ']' ; // match bracketed list
elements : element (',' element)* ; // match comma-separated list
element : NAME | list ; // element is name or nested list
NAME : ('a'..'z' |'A'..'Z' )+ ; // NAME is sequence of >=1 letter.
https://theswissbay.ch/pdf/Gentoomen%20Library/Programming/Pragmatic%20Programmers/Language%20I
mplementation%20Patterns.pdf
Parse Trees
grammar NestedNameList;
list : '[' elements ']' ; // match bracketed list
elements : element (',' element)* ; // match comma-separated list
element : NAME | list ; // element is name or nested list
NAME : ('a'..'z' |'A'..'Z' )+ ; // NAME is sequence of >=1 letter.
1. Parse tree for [a,b,c]
2. Parse tree for [a,[b,c],d]
Implementation
• Here is a loop that pulls tokens out, until it returns a token with type EOF_TYPE:
ListLexer lexer = new ListLexer(args[0]);
Token t = lexer.nextToken();
while ( t.type != Lexer.EOF_TYPE ) {
System.out.println(t);
t = lexer.nextToken();
}
System.out.println(t); // EOF
https://theswissbay.ch/pdf/Gentoomen%20Library/Programming/Pragmatic%20Programmers/Language%20Imp
lementation%20Patterns.pdf
Page: 51
Example
Write a program that accepts a C++ program as input and generates the parse tree for the program.
1. Run ANTLR to generate a lexical analyzer (lexer) and a parser for C++.
2. Give a C++ program to the generated parser to build and depict the parse tree for the
program.
You may also generate lexers and parsers for other languages such as C#, Java, and Python.
Your code could be in either Python or C#.
Example
1. Run ANTLR to generate a lexical analyzer (lexer) and a parser for C++.
First, we have to move the grammar of the C++ language (CPP14.g4) to the C:\javalibs folder.
Now we generate the lexer & parser for C++, targeting Python, with this command in cmd:
java -jar ./antlr-4.8-complete.jar -Dlanguage=Python3 CPP14.g4
Example
As shown below:
Files generated
Example
2. Give a C++ program to the generated parser to build and depict the parse tree for
the program.
We write Python code that runs the lexer & parser on the file 'test.cpp', which is in
the main folder of our Python code.
Example
from antlr4 import CommonTokenStream, FileStream, ParseTreeWalker
from CPP14Lexer import CPP14Lexer
from CPP14Listener import CPP14Listener
from CPP14Parser import CPP14Parser

if __name__ == '__main__':
    input_stream = FileStream('./test.cpp')  # Use FileStream to read the program file
    lexer = CPP14Lexer(input_stream)         # Build the lexer from the FileStream object
    stream = CommonTokenStream(lexer)        # CommonTokenStream buffers the tokens read from the lexer
    parser = CPP14Parser(stream)             # Build the parser, which creates the parse tree from the tokens
    tree = parser.translationunit()
    listener = CPP14Listener()
    walker = ParseTreeWalker()               # Use listener & walker to navigate the parse tree;
    walker.walk(listener, tree)              # the listener is notified when the walker enters or exits each rule
    print(tree.getRuleIndex())
Regular expressions grammar
• grammar MyGrammar;
• /* Parser Rules */
• expr: expr op=(MUL | DIV) expr #mulDiv
• | expr op=(ADD | SUB) expr #addSub
• | number | '(' expr ')' #num
• ;
• /* Lexer Rules */
• fragment DIGIT: [0-9];
• fragment LETTER: [a-zA-Z];
• INT: DIGIT+;
• FLOAT: DIGIT+ '.' DIGIT+;
• STRING_LITERAL: '"' .*? '"';
• NAME: LETTER (LETTER | DIGIT)*;
• IDENTIFIER: [a-zA-Z0-9]+;
• MUL: '*';
• DIV: '/';
• ADD: '+';
• SUB: '-';
• WS: [ \t\r\n]+ -> skip;
Example 1: Generating a Lexer
In our grammar file, say ScriptLexer.g4, we have:
// Name our lexer (the name must match the filename)
lexer grammar ScriptLexer;
// Define string values - either unquoted or quoted
STRING : ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'@')+ |
         ('"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"') ;
// Skip all spaces, tabs, newlines
WS : [ \t\r\n]+ -> skip ;
// Skip comments
LINE_COMMENT : '//' ~[\r\n]* '\r'? '\n' -> skip ;
// Define punctuations
LPAREN : '<' ;
RPAREN : '>' ;
EQUALS : '=' ;
SEMICO : ';' ;
ASSIGN : ':=' ;
https://imjching.com/writings/2017/02/16/lexical-analysis-with-antlr-v4/
Example 1: Generating a Lexer
Now that we have our grammar file, we can run the ANTLR tool on it to generate our lexer
program.
antlr4 ScriptLexer.g4
This will generate two files:
1. ScriptLexer.java (the code which contains the implementation of the FSM together
with our token constants) and
2. ScriptLexer.tokens.
Now we will create a Java program to test our lexer: TestLexer.java
https://imjching.com/writings/2017/02/16/lexical-analysis-with-antlr-v4/
Example 1: Generating a Lexer
import java.io.File;
import java.io.FileInputStream;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
public class TestLexer {
public static void main(String[] args) throws Exception {
System.out.println("Parsing: " + args[0]);
FileInputStream fis = new FileInputStream(new File(args[0]));
ANTLRInputStream input = new ANTLRInputStream(fis);
ScriptLexer lexer = new ScriptLexer(input);
Token token = lexer.nextToken();
while (token.getType() != ScriptLexer.EOF) {
System.out.println("\t" + getTokenType(token.getType()) +
"\t\t" + token.getText());
token = lexer.nextToken();
}
}
private static String getTokenType(int tokenType) {
switch (tokenType) {
case ScriptLexer.STRING:
return "STRING";
case ScriptLexer.LPAREN:
return "LPAREN";
case ScriptLexer.RPAREN:
return "RPAREN";
case ScriptLexer.EQUALS:
return "EQUALS";
case ScriptLexer.SEMICO:
return "SEMICO";
case ScriptLexer.ASSIGN:
return "ASSIGN";
default:
return "OTHER";
}
}
}
https://imjching.com/writings/2017/02/16/lexical-analysis-with-antlr-v4/
Example 1: Generating a Lexer
We then compile our test program.
javac TestLexer.java
and if we run TestLexer, giving sample.script as an argument:
// Sample.Script : What to do in the morning
func morning <
name := "Jay";
greet morning=true input=@name;
eat cereals;
attend class="CS101";
>
// What to do at night
func night <
brush_teeth;
sleep hours=8;
>
Example 1: Generating a Lexer
We get the following:
java TestLexer sample.script
Parsing: sample.script
STRING func
STRING morning
LPAREN <
STRING name
ASSIGN :=
STRING "Jay"
SEMICO ;
STRING greet
STRING morning
EQUALS =
STRING true
STRING input
EQUALS =
STRING @name
SEMICO ;
STRING eat
STRING cereals
SEMICO ;
STRING attend
STRING class
EQUALS =
STRING "CS101"
SEMICO ;
RPAREN >
STRING func
STRING night
LPAREN <
STRING brush_teeth
SEMICO ;
STRING sleep
STRING hours
EQUALS =
STRING 8
SEMICO ;
RPAREN >
Regular Expression in C#
• In C#, a regular expression is a pattern used to parse the input and check whether
the given text matches the pattern.
• In C#, regular expressions are generally termed C# Regex.
• The .NET Framework provides a regular expression engine that performs the pattern
matching.
• Patterns may consist of character literals, operators, or constructs.
• C# provides a class termed Regex, which can be found in the
System.Text.RegularExpressions namespace.
• This class performs two things:
- Parsing the input text for the regular expression pattern.
- Identifying the regular expression pattern in the given text.
POSIX Standard
• The POSIX standard is a widely used and accepted API for regular expressions.
• POSIX is a standard specified by the IEEE.
• Traditional Unix regular expression syntax followed common conventions that often differed
from tool to tool.
• The POSIX Basic Regular Expressions syntax was developed by the IEEE, together with an
extended variant called Extended Regular Expression syntax.
• These standards were designed mostly to provide backward compatibility with the
traditional Simple Regular Expressions syntax, providing a common standard which has since
been adopted as the default syntax of many Unix regular expression tools.
POSIX Standard
Examples:
 .at matches any three-character string ending with "at", including "hat",
"cat", and "bat".
 [hc]at matches "hat" and "cat".
 [^b]at matches all strings matched by .at except "bat".
 ^[hc]at matches "hat" and "cat", but only at the beginning of the string or
line.
 [hc]at$ matches "hat" and "cat", but only at the end of the string or line.
 \[.\] matches any single character surrounded by "[" and "]", since the
brackets are escaped; for example: "[a]" and "[b]".
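These behaviors can be sanity-checked with Python's re module, whose syntax agrees with POSIX extended regular expressions for these particular constructs (this is a quick check, not a POSIX implementation):

```python
import re

# .at: any three-character string ending in "at"
assert re.fullmatch(r".at", "hat") and re.fullmatch(r".at", "bat")

# [hc]at: only "hat" and "cat"
assert re.fullmatch(r"[hc]at", "cat") and not re.fullmatch(r"[hc]at", "bat")

# [^b]at: everything .at matches except "bat"
assert re.fullmatch(r"[^b]at", "hat") and not re.fullmatch(r"[^b]at", "bat")

# \[.\]: a single character surrounded by literal brackets
assert re.fullmatch(r"\[.\]", "[a]") and re.fullmatch(r"\[.\]", "[b]")

print("all examples behave as described")
```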
Example
5/11/2021 Saeed Parsa 88
Write a regular expression to describe inputs over the alphabet {a, b, c} that are in sorted order.
Because of sorting, we assume that inputs are like the following examples:
aaabbbcc abcc bcccc abb aaabcc aacc
So, the regular expression is:
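The expression itself appears only as an image in the original slide; under the assumption that a sorted string is any run of a's followed by b's followed by c's, one expression that accepts all of the sample inputs is a*b*c* (anchored to the whole string when matching). A quick check in Python:

```python
import re

# a*b*c*: zero or more a's, then b's, then c's, in that order
sorted_re = re.compile(r"a*b*c*")

samples = ["aaabbbcc", "abcc", "bcccc", "abb", "aaabcc", "aacc"]
for s in samples:
    assert sorted_re.fullmatch(s)   # every sample is in sorted order

assert not sorted_re.fullmatch("acb")  # out of order: rejected
print("all sorted samples accepted")
```

Note that a*b*c* also accepts the empty string; if at least one character is required, a variant such as a+b*c*|b+c*|c+ can be used instead.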
Example
5/11/2021 Saeed Parsa 89
Write a regular expression to check whether a string starts and ends with the same character.
Implement the regular expression in Python.
We have two cases:
• If the string has just a single character, it already satisfies the condition, so the regExp is:
^[a-z]$
• If the string has multiple characters, we have:
^([a-z]).*\1$
• \1: a backreference matching the same character that appears in the first position of the string
So, the final regExp we want is the combination of the two cases:
^[a-z]$|^([a-z]).*\1$
Example
5/11/2021 Saeed Parsa 90
The python program to implement and test the regExp is shown below:
import re
regExp = r'^[a-z]$|^([a-z]).*\1$'
result = re.match(regExp, 'abba')  # matches: 'abba' starts and ends with 'a'
print(result)
Example
5/11/2021 Saeed Parsa 91
Write a regular expression to determine whether a string is an IP address. Rule: an IP address
consists of 4 numbers, separated by three dots. The value of each number is 0-255. For example:
255.189.10.37 is correct and 256.189.89.9 is an error. Write a C# or Python program to validate IP
addresses.
Because we have 4 numbers separated by . (dot), the regular expression that accepts just the
valid IP addresses is:
((\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}(\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])
Example
5/11/2021 Saeed Parsa 92
The python program to implement and test the regExp is shown below:
import re
regExp = r'((\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}(\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])'
result = re.fullmatch(regExp, '192.168.1.12')
print(result)
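As a complementary check, the slide's own two sample addresses can be verified against the pattern (using fullmatch so the whole string must match):

```python
import re

ip_re = re.compile(r"((\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}"
                   r"(\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])")

assert ip_re.fullmatch("255.189.10.37")     # valid: every octet is 0-255
assert not ip_re.fullmatch("256.189.89.9")  # invalid: 256 is out of range
assert ip_re.fullmatch("0.0.0.0")           # single-digit octets are allowed
print("IP validation behaves as expected")
```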
Example
5/11/2021 Saeed Parsa 93
Depict a DFA that accepts all the binary strings that do not include the substring "011". Write a
lexer function to recognize these strings.
The DFA should be like this:
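The transition diagram itself is an image in the original slides; a textual reconstruction can be simulated directly in Python. The state numbering below is chosen to match the lexer function on the following slides, with state 3 reinterpreted here as the trap reached once "011" has been seen (a sketch, not the slide's exact diagram):

```python
# States: 0 = no progress toward "011", 1 = input ends in '0',
# 2 = input ends in "01", 3 = "011" has been seen (trap state).
TRANSITIONS = {
    (0, '0'): 1, (0, '1'): 0,
    (1, '0'): 1, (1, '1'): 2,
    (2, '0'): 1, (2, '1'): 3,
    (3, '0'): 3, (3, '1'): 3,
}
ACCEPTING = {0, 1, 2}

def accepts(s: str) -> bool:
    """Return True if the binary string s does NOT contain '011'."""
    state = 0
    for ch in s:
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPTING

assert accepts("0101010")     # no occurrence of "011"
assert not accepts("100110")  # contains "011"
print("DFA simulation agrees with the language definition")
```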
Example
5/11/2021 Saeed Parsa 94
Lexer function:
struct TokenType LexicalAnalyser(FILE * Source) {
enum Symbols LexiconType;
char NextChar, NextWord[80];
int State, Length;
static char LastChar = '\0';
static int RowNo = 0, ColNo = 0;
State = 0;
Length = 0;
while(!feof(Source)) {
if(LastChar) {              /* an extra character was read in the previous call */
NextChar = LastChar;
LastChar = '\0';
} else {
NextChar = fgetc(Source);
}
NextWord[Length++] = NextChar;
switch(State) {
Example
5/11/2021 Saeed Parsa 95
Lexer function:
case 0:
if (NextChar == '\n') {
RowNo++ ; ColNo = 0;
} else
ColNo++;
if (NextChar == ' ' || NextChar == '\t' || NextChar == '\n')
Length = 0;
else if (NextChar == '1') State = 1;
else if (NextChar == '0') State = 2;
else LexerError(NextWord, Length);
break;
case 1:
if (NextChar == '1') State = 1;
else if (NextChar == '0') State = 2;
else return "accepted"; /* placeholder in the slides; a real lexer would build and return a Token here */
break;
Example
5/11/2021 Saeed Parsa 96
Lexer function:
case 2:
if (NextChar == '1') State = 3;
else if (NextChar == '0') State = 2;
else return "accepted";
break;
case 3:                               /* the suffix "01" has been seen */
if (NextChar == '0') State = 2;       /* corrected: the suffix now ends in '0' */
else if (NextChar == '1') LexerError(NextWord, Length); /* "011" detected: reject */
else return "accepted";
break;
Example
5/11/2021 Saeed Parsa 97
Write a function that removes all comments from a piece of CPP code.
Steps to do:
1. Create a FileStream object for the input file.
2. Create a lexer for the FileStream object.
3. Get the first token via lexer.nextToken().
4. In a loop, check every token; if it is neither a line comment nor a block
comment, write it out, then get the next token.
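The steps above rely on ANTLR's generated lexer (shown on the next slide). Where ANTLR is not available, a simplified regex-based sketch can remove both comment styles; note this naive version does not handle comment-like text inside string literals, which a real lexer handles correctly:

```python
import re

def strip_comments(code: str) -> str:
    """Remove /* ... */ block comments and // line comments (naive)."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments first
    code = re.sub(r"//[^\n]*", "", code)                    # then line comments
    return code

src = "int x = 1; // counter\n/* block\n comment */int y = 2;\n"
print(strip_comments(src))
```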
Example
5/11/2021 Saeed Parsa 98
The Python code to implement comment removal is shown below:
from antlr4 import *
from gen.CPP14Lexer import CPP14Lexer

def remove_comments(filename='test.cpp'):
    input_stream = FileStream(filename)
    lexer = CPP14Lexer(input_stream)
    stream = CommonTokenStream(lexer)
    token = lexer.nextToken()
    new_file = open('test2.cpp', 'w')
    while token.type != Token.EOF:
        if token.type != lexer.BlockComment and token.type != lexer.LineComment:
            new_file.write(token.text.replace('\r', ''))
        token = lexer.nextToken()
    new_file.close()
Example
5/11/2021 Saeed Parsa 99
The result:
Before calling remove_comments() After calling remove_comments()
Example
5/11/2021 Saeed Parsa 100
Write a Python program, using ANTLR, to add your student number to all the
comments within CPP programs.
We find both line comments & block comments, add the student ID to them, and save the
result in a new file named “test2.cpp”.
Example
5/11/2021 Saeed Parsa 101
The Python code to add the student number to the comments is shown below:
input_stream = FileStream('test.cpp')
lexer = CPP14Lexer(input_stream)
stream = CommonTokenStream(lexer)
token = lexer.nextToken()
new_file = open('test2.cpp', 'w')
while token.type != Token.EOF:
    if token.type == lexer.BlockComment:
        text_to_write = token.text.replace('\r', '').replace("*/", "97521135\n*/")
        new_file.write(text_to_write)
    elif token.type == lexer.LineComment:
        new_file.write(token.text.replace('\r', '') + " 97521135")
    else:
        new_file.write(token.text.replace('\r', ''))
    token = lexer.nextToken()
Example
5/11/2021 Saeed Parsa 102
Write a Python program, using ANTLR, to detect email addresses.
First, we have to write a grammar for emails in Email.g4:
grammar Email;
email: LITERAL ATSIGN LITERAL (DOT LITERAL)+ ;
WS: [ \t\r\n] -> skip ;
LITERAL : [a-zA-Z]+ [0-9]*;
ATSIGN: '@' ;
DOT: '.' ;
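The token structure of this grammar (LITERAL = letters optionally followed by digits) can be approximated with a single Python regular expression, which is a convenient way to sanity-check the pattern outside ANTLR:

```python
import re

# Mirror of the grammar's LITERAL lexer rule: [a-zA-Z]+ [0-9]*
LITERAL = r"[a-zA-Z]+[0-9]*"

# email: LITERAL '@' LITERAL ('.' LITERAL)+
email_re = re.compile(rf"{LITERAL}@{LITERAL}(\.{LITERAL})+")

assert email_re.fullmatch("danibazi9@gmail.com")
assert not email_re.fullmatch("no-at-sign.example")
print("email pattern agrees with the grammar")
```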
Example
5/11/2021 Saeed Parsa 103
Write a Python program, using ANTLR, to detect email
addresses.
Now, we have to generate the lexer & parser from the grammar
we’ve written, by right-clicking on the grammar file & selecting
“Generate ANTLR Recognizer”.
Example
5/11/2021 Saeed Parsa 104
And we use Python code that runs the lexer to check whether the email is valid.
from antlr4 import *
from gen.EmailLexer import EmailLexer

input_stream = InputStream('danibazi9@gmail.com')
try:
    lexer = EmailLexer(input_stream)
    stream = CommonTokenStream(lexer)
    print("The input email is correct")
    print("The tokens are:")
    token = lexer.nextToken()
    while token.type != Token.EOF:
        print(token.text)
        token = lexer.nextToken()
    # Note: by default ANTLR reports lexical errors instead of raising;
    # attach a bail-out error listener for the except branch to trigger.
except:
    print("The input email is not in the proper format")
Assignment 2
5/11/2021 Saeed Parsa 105
Subject : Lexical Analysis
Deadline: 1399/7/28
Mark: 5 out of 100.
Assignment 2
5/11/2021 Saeed Parsa 106
1. Write a regular expression to describe inputs over the alphabet {a, b, c} that
are in sorted order.
2. Write a regular expression to check whether a string starts and ends with the
same character. Use ANTLR4 to implement the regular expression in Python.
3. Write a regular expression to determine whether a string is an IP address.
Rule: an IP address consists of 4 numbers, separated by three dots. The value
of each number is 0-255. For example: 255.189.10.37 is correct and
256.189.89.9 is an error. Write a C# or Python program to validate IP addresses.
4. Depict a DFA that accepts all the binary strings that do not include the substring
“011”. Write a lexer function to recognize these strings.
5. Write a function that removes all comments from a piece of CPP code.
The place of IUST in the world
5/11/2021 Saeed Parsa 107
https://www.researchgate.net/publication/328099969_Software_Fault_Localisation_A_Systematic_Mapping_Study
5/11/2021 Saeed Parsa 108

More Related Content

What's hot

Php server variables
Php server variablesPhp server variables
Php server variablesJIGAR MAKHIJA
 
Regular expressions in Python
Regular expressions in PythonRegular expressions in Python
Regular expressions in PythonSujith Kumar
 
Lesson 02 python keywords and identifiers
Lesson 02   python keywords and identifiersLesson 02   python keywords and identifiers
Lesson 02 python keywords and identifiersNilimesh Halder
 
Python Loops Tutorial | Python For Loop | While Loop Python | Python Training...
Python Loops Tutorial | Python For Loop | While Loop Python | Python Training...Python Loops Tutorial | Python For Loop | While Loop Python | Python Training...
Python Loops Tutorial | Python For Loop | While Loop Python | Python Training...Edureka!
 
Exception Handling in JAVA
Exception Handling in JAVAException Handling in JAVA
Exception Handling in JAVASURIT DATTA
 
Finite automata(For college Seminars)
Finite automata(For college Seminars)Finite automata(For college Seminars)
Finite automata(For college Seminars)Naman Joshi
 
Data Types & Variables in JAVA
Data Types & Variables in JAVAData Types & Variables in JAVA
Data Types & Variables in JAVAAnkita Totala
 
Expression and Operartor In C Programming
Expression and Operartor In C Programming Expression and Operartor In C Programming
Expression and Operartor In C Programming Kamal Acharya
 
Looping Statements and Control Statements in Python
Looping Statements and Control Statements in PythonLooping Statements and Control Statements in Python
Looping Statements and Control Statements in PythonPriyankaC44
 
Command line-arguments-in-java-tutorial
Command line-arguments-in-java-tutorialCommand line-arguments-in-java-tutorial
Command line-arguments-in-java-tutorialKuntal Bhowmick
 
Data structure tries
Data structure triesData structure tries
Data structure triesMd. Naim khan
 
C programming language tutorial
C programming language tutorial C programming language tutorial
C programming language tutorial javaTpoint s
 
L14 exception handling
L14 exception handlingL14 exception handling
L14 exception handlingteach4uin
 

What's hot (20)

Php server variables
Php server variablesPhp server variables
Php server variables
 
Regular expressions in Python
Regular expressions in PythonRegular expressions in Python
Regular expressions in Python
 
Python for loop
Python for loopPython for loop
Python for loop
 
Lesson 02 python keywords and identifiers
Lesson 02   python keywords and identifiersLesson 02   python keywords and identifiers
Lesson 02 python keywords and identifiers
 
Python Loops Tutorial | Python For Loop | While Loop Python | Python Training...
Python Loops Tutorial | Python For Loop | While Loop Python | Python Training...Python Loops Tutorial | Python For Loop | While Loop Python | Python Training...
Python Loops Tutorial | Python For Loop | While Loop Python | Python Training...
 
C functions
C functionsC functions
C functions
 
PHP variables
PHP  variablesPHP  variables
PHP variables
 
Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
 
Exception Handling in JAVA
Exception Handling in JAVAException Handling in JAVA
Exception Handling in JAVA
 
C# String
C# StringC# String
C# String
 
Finite automata(For college Seminars)
Finite automata(For college Seminars)Finite automata(For college Seminars)
Finite automata(For college Seminars)
 
Data Types & Variables in JAVA
Data Types & Variables in JAVAData Types & Variables in JAVA
Data Types & Variables in JAVA
 
Expression and Operartor In C Programming
Expression and Operartor In C Programming Expression and Operartor In C Programming
Expression and Operartor In C Programming
 
Pda
PdaPda
Pda
 
Finite automata
Finite automataFinite automata
Finite automata
 
Looping Statements and Control Statements in Python
Looping Statements and Control Statements in PythonLooping Statements and Control Statements in Python
Looping Statements and Control Statements in Python
 
Command line-arguments-in-java-tutorial
Command line-arguments-in-java-tutorialCommand line-arguments-in-java-tutorial
Command line-arguments-in-java-tutorial
 
Data structure tries
Data structure triesData structure tries
Data structure tries
 
C programming language tutorial
C programming language tutorial C programming language tutorial
C programming language tutorial
 
L14 exception handling
L14 exception handlingL14 exception handling
L14 exception handling
 

Similar to 3. Lexical analysis

Ch 2.pptx
Ch 2.pptxCh 2.pptx
Ch 2.pptxwoldu2
 
String Matching with Finite Automata,Aho corasick,
String Matching with Finite Automata,Aho corasick,String Matching with Finite Automata,Aho corasick,
String Matching with Finite Automata,Aho corasick,8neutron8
 
Lexical analyzer generator lex
Lexical analyzer generator lexLexical analyzer generator lex
Lexical analyzer generator lexAnusuya123
 
02. chapter 3 lexical analysis
02. chapter 3   lexical analysis02. chapter 3   lexical analysis
02. chapter 3 lexical analysisraosir123
 
Compiler Design File
Compiler Design FileCompiler Design File
Compiler Design FileArchita Misra
 
Lecture 1 - Lexical Analysis.ppt
Lecture 1 - Lexical Analysis.pptLecture 1 - Lexical Analysis.ppt
Lecture 1 - Lexical Analysis.pptNderituGichuki1
 
Saumya Debray The University of Arizona Tucson
Saumya Debray The University of Arizona TucsonSaumya Debray The University of Arizona Tucson
Saumya Debray The University of Arizona Tucsonjeronimored
 
1.3.2 non deterministic finite automaton
1.3.2 non deterministic finite automaton1.3.2 non deterministic finite automaton
1.3.2 non deterministic finite automatonSampath Kumar S
 
@vtucode.in-module-1-21CS51-5th-semester (1).pdf
@vtucode.in-module-1-21CS51-5th-semester (1).pdf@vtucode.in-module-1-21CS51-5th-semester (1).pdf
@vtucode.in-module-1-21CS51-5th-semester (1).pdfFariyaTasneem1
 
Lex and Yacc Tool M1.ppt
Lex and Yacc Tool M1.pptLex and Yacc Tool M1.ppt
Lex and Yacc Tool M1.pptMohitJain296729
 
Chapter 5 -Syntax Directed Translation - Copy.ppt
Chapter 5 -Syntax Directed Translation - Copy.pptChapter 5 -Syntax Directed Translation - Copy.ppt
Chapter 5 -Syntax Directed Translation - Copy.pptFamiDan
 
Pattern Matching using Computational and Automata Theory
Pattern Matching using Computational and Automata TheoryPattern Matching using Computational and Automata Theory
Pattern Matching using Computational and Automata TheoryIRJET Journal
 
JCConf 2021 - Java17: The Next LTS
JCConf 2021 - Java17: The Next LTSJCConf 2021 - Java17: The Next LTS
JCConf 2021 - Java17: The Next LTSJoseph Kuo
 
Regular expression
Regular expressionRegular expression
Regular expressionRajon
 

Similar to 3. Lexical analysis (20)

Assignment5
Assignment5Assignment5
Assignment5
 
Ch 2.pptx
Ch 2.pptxCh 2.pptx
Ch 2.pptx
 
String Matching with Finite Automata,Aho corasick,
String Matching with Finite Automata,Aho corasick,String Matching with Finite Automata,Aho corasick,
String Matching with Finite Automata,Aho corasick,
 
Lexical analyzer generator lex
Lexical analyzer generator lexLexical analyzer generator lex
Lexical analyzer generator lex
 
02. chapter 3 lexical analysis
02. chapter 3   lexical analysis02. chapter 3   lexical analysis
02. chapter 3 lexical analysis
 
Lec1.pptx
Lec1.pptxLec1.pptx
Lec1.pptx
 
Ch3
Ch3Ch3
Ch3
 
RegexCat
RegexCatRegexCat
RegexCat
 
Compiler Design File
Compiler Design FileCompiler Design File
Compiler Design File
 
Lecture 1 - Lexical Analysis.ppt
Lecture 1 - Lexical Analysis.pptLecture 1 - Lexical Analysis.ppt
Lecture 1 - Lexical Analysis.ppt
 
Saumya Debray The University of Arizona Tucson
Saumya Debray The University of Arizona TucsonSaumya Debray The University of Arizona Tucson
Saumya Debray The University of Arizona Tucson
 
1.3.2 non deterministic finite automaton
1.3.2 non deterministic finite automaton1.3.2 non deterministic finite automaton
1.3.2 non deterministic finite automaton
 
@vtucode.in-module-1-21CS51-5th-semester (1).pdf
@vtucode.in-module-1-21CS51-5th-semester (1).pdf@vtucode.in-module-1-21CS51-5th-semester (1).pdf
@vtucode.in-module-1-21CS51-5th-semester (1).pdf
 
Lex and Yacc Tool M1.ppt
Lex and Yacc Tool M1.pptLex and Yacc Tool M1.ppt
Lex and Yacc Tool M1.ppt
 
Chapter 5 -Syntax Directed Translation - Copy.ppt
Chapter 5 -Syntax Directed Translation - Copy.pptChapter 5 -Syntax Directed Translation - Copy.ppt
Chapter 5 -Syntax Directed Translation - Copy.ppt
 
Pattern Matching using Computational and Automata Theory
Pattern Matching using Computational and Automata TheoryPattern Matching using Computational and Automata Theory
Pattern Matching using Computational and Automata Theory
 
JCConf 2021 - Java17: The Next LTS
JCConf 2021 - Java17: The Next LTSJCConf 2021 - Java17: The Next LTS
JCConf 2021 - Java17: The Next LTS
 
module 4.pptx
module 4.pptxmodule 4.pptx
module 4.pptx
 
Regular expression
Regular expressionRegular expression
Regular expression
 
Handout#02
Handout#02Handout#02
Handout#02
 

More from Saeed Parsa

6 attributed grammars
6  attributed grammars6  attributed grammars
6 attributed grammarsSaeed Parsa
 
5 top-down-parsers
5  top-down-parsers 5  top-down-parsers
5 top-down-parsers Saeed Parsa
 
2. introduction to compiler
2. introduction to compiler2. introduction to compiler
2. introduction to compilerSaeed Parsa
 
4. languages and grammars
4. languages and grammars4. languages and grammars
4. languages and grammarsSaeed Parsa
 
1. course introduction
1. course introduction1. course introduction
1. course introductionSaeed Parsa
 

More from Saeed Parsa (6)

6 attributed grammars
6  attributed grammars6  attributed grammars
6 attributed grammars
 
5 top-down-parsers
5  top-down-parsers 5  top-down-parsers
5 top-down-parsers
 
2. introduction to compiler
2. introduction to compiler2. introduction to compiler
2. introduction to compiler
 
4. languages and grammars
4. languages and grammars4. languages and grammars
4. languages and grammars
 
1. course introduction
1. course introduction1. course introduction
1. course introduction
 
2. introduction
2. introduction2. introduction
2. introduction
 

Recently uploaded

Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 

Recently uploaded (20)

Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 

3. Lexical analysis

  • 1. 5/11/2021 Saeed Parsa 1 Compiler Design Lexical Analysis Saeed Parsa Room 332, School of Computer Engineering, Iran University of Science & Technology parsa@iust.ac.ir Winter 2021
  • 2. What is Lexical Analyzer?  The lexical analyzer is usually a function that is called by the parser when it needs the next token. 5/11/2021 Saeed Parsa 2  The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a token for each lexeme in the source program.
  • 3. What is Lexical Analyzer? Lexeme: sequence of characters in source program matching a pattern. 5/11/2021 Saeed Parsa 3 Keywords; Examples-for, while, if etc. Identifier; Examples-Variable name, function name, etc. Operators; Examples '+', '++', '-' etc. Separators; Examples ',' ';' etc. Read, Section 3.1.2 in page 111 of Aho book. Read, Section 3.1.2 in page 111 of Aho book.
  • 4. What is Lexical Analyzer? 5/11/2021 Saeed Parsa 4
  • 6. What is Lexical Analyzer? 5/11/2021 Saeed Parsa 6 Output structure: typedef struct Token{ int row; // 1- Row number of the lexicon int col; // 2- Column number of the lexicon int BlkNo ; // 3- Nested block no. enum Symbols type; // 4- the lexicon code char Name [30]; // 5- Lexicon } Token-type ; Input: Text source; source = fopen(input_file , ”R+”); Input & Output structure:
  • 7. What is Lexical Analyzer? 5/11/2021 Saeed Parsa 7 Output structure: Nesting block no. The names of the variables alone are not enough to determine and distinguish them from each other, but to determine a word, its enclosing block nesting number is needed. Example: { int I ; I=5 ; { int I ; I=6 ; printf(“2nd blk %d“, I); } printf(“ n 1st blk %d“, I); } Token example: the word “int { Token: Row : 1 Col: 3 BlkNo: 1 Type: S_int lexeme: “int“
  • 8. What is Lexical Analyzer? 5/11/2021 Saeed Parsa 8 Output structure: To each kind of lexeme a different code is assigned. Enumerated data types male it possible to name constants. For instance on the following enum definition, the first constant S_Program = 0 represents the program keyword, and the S_Eq = 2 represents the equal sign.
  • 9. What is Lexical Analyzer? 5/11/2021 Saeed Parsa 9 Output structure: typedefe enum Symbols { S_Program, S_Const, S_Eq, S_Semi, S_Id, S_No, S_Type, S_Record, S_End, S_Int, S_Real, S_Char, S_Array, S_String, S_Begin, S_Var, S_Colon, S_ParBaz, S_ParBast, S_Set, S_of,S_BrackBaz, S_BrackBast, S_Case, S_ConstString, S_Function, S_Procedure, S_Begin, S_Dot, S_Comma, S_If, S_Then, S_Else, S_While, S_Do, S_Repeat, S_Until, S_For, S_Add, S_Sub, S_Div, S_Mul, S_Mod, S_Lt, S_Le, S_Gt, S_Ge, S_Gt, S_Ne, S_Not, S_And, S_Or };
  • 10. Implementation 5/11/2021 Saeed Parsa 10  Lexical Analysis can be implemented with the Finite State Automata (FSA).  A Finite State Automaton has  A set of states • One marked initial • Some marked final  A set of transitions from state to state • Each labeled with an alphabet symbol or ε  Operate by beginning at the start state, reading symbols and making indicated transitions  When input ends, state must be final or else reject  Note: This FSA represents “Comments” in CPP.
  • 11. Implementation 5/11/2021 Saeed Parsa 11 • Example: Source: 1. Input: /* ab1-***+*/ State: 000122222333234 Accept 2. Input: /- aaaaa */ State: 0001 Error 2. Input: /* ab***- State: 00012223332 Infinite loop Blank | t | n
  • 12. Finite State Automata (FSA) 5/11/2021 Saeed Parsa 12  A Finite State Automaton is a recognizer or acceptor of regular Language.  The word finite is used because the number of possible states and the number of symbols in alphabet sets are finite.  In Greek, automaton means self-acting.  Formally, finite automaton is a 5 tuple machine, denoted by M, where:  M=(Q,Σ,δ,q0,F).  Q is a finite set of states.  Σ is the finite input alphabets.  Δ is the transition function.  q0 indicates start state. q0⊆Q  F is the set of final or accepting states. F⊆Q. https://www.researchgate.net/publication/311251505_An_exploration_on_lexical_analysis
  • 13. An example of a DFA 5/11/2021 Saeed Parsa 13  This Figure depicts an example of Deterministic Finite Automata.  Formal definition is  M=(Q,Σ,δ,q0,F).  Q={q0,q1,q2}set of all states.  Σ={0,1},δ is given in transition table,  q0 is start or initial state,  F={q2}.  Language is defined as L(M) ={w/w ends with 00}, w can be combination of 0’s and 1’s which ends with 00. (2) (PDF) https://www.researchgate.net/publication/311251505_An_exploration_on_lexical_analysis
  • 14. An example of a DFA 5/11/2021 Saeed Parsa 14  Language is defined as L(M) ={w/does not include 010}, w can be combination of 0’s and 1’s which does not include the 010 substring.  Formal definition is  M=(Q,Σ,δ,q0, F).  Q={q0,q1,q2}set of all states.  Σ={0,1}, δ is given in transition table,  q0 is start or initial state,  F={q0, q1, q2}.
  • 15. An example of a DFA 5/11/2021 Saeed Parsa 15  DFA that accepts exactly one a. Σ = {a}.  DFA that accepts at least one a. Σ = {a}.  DFA that accepts even number of a's. Σ = {a}. https://swaminathanj.github.io/fsm/dfa.html
  • 16. Transition diagrams 5/11/2021 Saeed Parsa 16  DFA starts consuming input string from q0 to reach the final state.  In a single transition from some state q, DFA reads an input symbol, changes the state based on δ and gets ready to read the next input symbol.  If last state is final then string is accepted otherwise it is rejected.  A finite automaton, (FA) consists of a finite set of states and set of transitions from state to state that occur on input symbols chosen from an alphabet S.  For each input symbol if there is exactly one transition out of each state then M is said to deterministic finite automaton.  A directed graph, called a transition diagram, is associated with a finite automaton as follows.  The vertices of the graph correspond to the states of the FA. https://shodhganga.inflibnet.ac.in/bitstream/10603/77125/8/08_chapter%201.pdf
  • 17. Transition diagrams 5/11/2021 Saeed Parsa 17  The following Figure is a transition diagram that recognizes the lexemes matching the token relop (relational operators).  Note, however, that state 4 has a * to indicate that we must retract the input one position Page 131 of AHO book
  • 18. Recognizing Identifiers 5/11/2021 Saeed Parsa 18  Recognizing keywords and identifiers presents a problem.  Usually, keywords like if or then are reserved (as they are in our running example), s  So they are not identifiers even though they look like identifiers.  After an identifier, is detected, isKeyword() is invoked to check whether the detected lexeme is a keyword. Page 132 of AHO book
  • 19. Recognizing Numbers 5/11/2021 Saeed Parsa 19 Page 133 of AHO book Different types of numbers: • 23.32 • 12 • 12. • .13 • 12.5E-8
  • 20. Recognizing all lexicons 5/11/2021 Saeed Parsa 20  Lexical rules of a language can be defined in terms of a DFA.  In state zero, whitespaces are ignored.  State 1 recognizes identifiers.  After an identifier is detected, isKeyword() is invoked to check whether the detected lexeme is a keyword.
  • 21. Architecture of a Transition-Diagram-Based Lexical Analyzer 5/11/2021 Saeed Parsa 21  There are several ways that a collection of transition diagrams can be used to build a lexical analyzer.  Regardless of the overall strategy, each state is represented by a piece of code.  We may imagine a variable state holding the number of the current state for a transition diagram.  A switch based on the value of state takes us to code for each of the possible states, where we find the action of that state.  Often, the code for a state is itself a switch statement or multiway branch that determines the next state by reading and examining the next input character.  You may write a program to convert a state transition diagram to a program. Page 134 of AHO book
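As a concrete illustration of the switch-on-state strategy described above, here is a minimal, hedged Python sketch (not the book's code): a loop reads one character per iteration, branches on the current state, grows the current lexeme, and "retracts" a character when a token ends by re-examining it in the start state. The token names ID and INT are assumptions for this toy example.

```python
# Minimal sketch of a switch-on-state lexer: each iteration reads one
# character and branches on `state`; `continue` retracts the character.
def tokenize_ints_and_ids(text):
    tokens, state, lexeme, i = [], 0, "", 0
    while i <= len(text):
        ch = text[i] if i < len(text) else " "   # sentinel to flush last token
        if state == 0:                            # start state
            if ch.isalpha():   state, lexeme = 1, ch
            elif ch.isdigit(): state, lexeme = 2, ch
            # whitespace and other chars are skipped in state 0
        elif state == 1:                          # inside an identifier
            if ch.isalnum() or ch == "_":
                lexeme += ch
            else:
                tokens.append(("ID", lexeme)); state = 0
                continue  # retract: re-examine ch in state 0
        elif state == 2:                          # inside a number
            if ch.isdigit():
                lexeme += ch
            else:
                tokens.append(("INT", lexeme)); state = 0
                continue  # retract
        i += 1
    return tokens

print(tokenize_ints_and_ids("x1 42 foo"))
# [('ID', 'x1'), ('INT', '42'), ('ID', 'foo')]
```

The `continue` statements play the role of the * retraction marker on the transition diagrams: the character that ended the token is not consumed.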
  • 22. Converting the state diagram to a lexical analysis program 5/11/2021 Saeed Parsa 22 struct TokenType lexicalAnalyser ( FILE *Source ) { enum Symbols LexiconType ; // Type of lexeme char NextChar, NextWord[80] ; // Next char & next word int State, Length; // State no. in automaton static char LastChar = '\0'; // Extra char read in the last call static int RowNo = 0, ColNo = 0; // Row and column no. State = 0 ; // Start state no. is zero Length = 0 ; // Length of the detected lexeme while( ! feof ( Source )) // While EOF source not encountered { if ( LastChar ) // if extra char was read in last call { NextChar = LastChar; LastChar = '\0'; } // Retreat the character else NextChar = fgetc ( Source ) ; // Read next character NextWord[Length++] = NextChar; // Begin to make the next lexeme state Length NextChar LastChar NextWord 0 1 v '\0' v 1 2 o '\0' vo 1 3 i '\0' voi 1 4 d '\0' void 1 5 ␣ '\0' void␣
  • 23. Converting the state diagram to a lexical analysis program 5/11/2021 Saeed Parsa 23 switch (State) // Operate dependent on the State { case 0: // Start state if (NextChar == '\n') { RowNo++ ; ColNo = 0; } else ColNo++; if (NextChar == ' ' || NextChar == '\t' || NextChar == '\n') Length = 0; else if (( NextChar <= 'z' && NextChar >= 'a' ) || ( NextChar <= 'Z' && NextChar >= 'A' )) State = 1; else if ( NextChar <= '9' && NextChar >= '0' ) State = 2 ; else if ( NextChar == '(' ) State = 3; else if ( NextChar == '<' ) State = 4; else if ( NextChar == '>' ) State = 5; else LexerError(NextWord, Length); break ; // End of start state state Length NextChar LastChar NextWord 0 1 v '\0' v 1 2 o '\0' vo
  • 24. Converting the state diagram to a lexical analysis program 5/11/2021 Saeed Parsa 24 //switch (State) // Operate dependent on the State //{ case 1 : // Recognizing identifiers if (isalpha(NextChar) || isdigit(NextChar) || NextChar == '_') State = 1; else { LastChar = NextChar ; NextWord[--Length] = '\0'; return MakeToken(IsKeyWord(NextWord)); } break ; case 3 : … } // End of switch } // End of while state Length NextChar LastChar NextWord 0 1 v '\0' v 1 2 o '\0' vo 1 3 i '\0' voi 1 4 d '\0' void 1 5 ␣ ␣ "void\0"
  • 26. Converting the state diagram to a lexical analysis program 5/11/2021 Saeed Parsa 26  The isKeyWord method determines whether a given identifier, key, is a keyword. enum Symbols isKeyWord(char *key) { int I ; struct KeyType { char *key; enum Symbols Type; } KeyTab[] = { "if", S_If, "while", S_While, "then", S_Then, "else", S_Else, "integer", S_Integer, "type", S_Type, "function", S_Function, 0, 0 }; for(I = 0; KeyTab[I].key && strcmp(KeyTab[I].key, key); I++); if (KeyTab[I].key) return KeyTab[I].Type; return S_Identifier; } // EOF IsKeyWord
  • 27. Example 1 5/11/2021 Saeed Parsa 27  Design a DFA with Σ = {0, 1} that accepts strings with either an even number of 0s or an odd number of 1s. Write the code for a lexical analyzer based on the designed DFA.  The four DFA states correspond to the four parity combinations of (number of 0s, number of 1s): (even, even), (odd, even), (even, odd), (odd, odd). 00010101010 not acceptable 1100 acceptable 110100 acceptable 0010101 acceptable
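Before coding the DFA, the acceptance condition can be checked by tracking both parities directly. A hedged Python sketch (the four DFA states on the slide correspond to the four (zeros, ones) parity combinations; this version just counts):

```python
# Sketch: accept a binary string with an even number of 0s OR an odd
# number of 1s, by tracking the parity of each symbol count.
def accepts(w: str) -> bool:
    zeros = ones = 0
    for ch in w:
        if ch == '0':
            zeros += 1
        elif ch == '1':
            ones += 1
        else:
            raise ValueError("alphabet is {0, 1}")
    return zeros % 2 == 0 or ones % 2 == 1

print(accepts("1100"))         # True: even number of 0s
print(accepts("00010101010"))  # False: 7 zeros (odd), 4 ones (even)
```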
  • 28. Example 1-2 5/11/2021 Saeed Parsa 28 switch (State) // Operate dependent on the State { case 0: // even no. 0s & even no. 1s if (NextChar == '\n') { RowNo++ ; ColNo = 0; } else ColNo++; if (NextChar == ' ' || NextChar == '\t' || NextChar == '\n') Length = 0; else if ( NextChar == '1' ) State = 2; else if ( NextChar == '0' ) State = 1; else return "accepted"; break ; // End of state 0 case 3: // odd no. 0s & odd no. 1s if ( NextChar == '1' ) State = 1; else if ( NextChar == '0' ) State = 2; else return "accepted"; break ; (Transition diagram with states 0–3; the loop on state 0 skips ' ', '\t' and '\n'.)
  • 29. Example 1-3 5/11/2021 Saeed Parsa 29 (Transition diagram with states 0–3.) switch (State) // Operate dependent on the State { case 1: // odd no. 0s & even no. 1s if ( NextChar == '1' ) State = 3; else if ( NextChar == '0' ) State = 0; else LexerError(); break ; case 2: // even no. 0s & odd no. 1s if ( NextChar == '1' ) State = 0; else if ( NextChar == '0' ) State = 3; else return "accepted"; break ; } // End of switch
  • 30. Example 2 5/11/2021 Saeed Parsa 30  Design a DFA with Σ = {a, b} that accepts strings with at least two a's and one b. (Transition diagram with states q0–q5.)
  • 31. Exercise 5/11/2021 Saeed Parsa 31 1. Design a DFA with ∑ = {0, 1} that accepts those strings which do not start with “10” and end with 0. Write a lexical analyzer program to implement the automaton. 2. Design a deterministic finite automaton (DFA) accepting the language L = {(an bm) | m+n is even}. 3. Design a DFA accepting numbers in base 3 whose sum of digits is 5.
  • 32. Regular expressions 5/11/2021 Saeed Parsa 32 • By definition, a regular expression is a pattern that defines a set of character sequences. • Lexical rules may be defined in terms of regular expressions as well as in terms of deterministic finite automata. • Examples: Page 128 of AHO book identifier : letter (letter | digit | ‘_’)* ; comment : ‘(*’ ( r | ‘*’+ s )* ‘*’+ ‘)’ ; r : all characters apart from * s : all characters apart from * and ‘)’
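The identifier rule above can be tried directly with Python's re module. A minimal sketch, assuming ASCII letters and digits for the letter and digit classes:

```python
import re

# The rule  identifier : letter (letter | digit | '_')*  as a Python regex.
# fullmatch() requires the whole string to fit the pattern.
identifier = re.compile(r'[A-Za-z][A-Za-z0-9_]*')

for s in ["count", "x_1", "1abc", "while"]:
    print(s, bool(identifier.fullmatch(s)))
```

Note that "while" matches the identifier pattern too; as slide 18 explains, filtering out keywords is a separate step after the lexeme is recognized.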
  • 33. 5/11/2021 Saeed Parsa 33 ANTLR Lexical Rules (Part 2)
  • 34. Actions on token attributes 5/11/2021 Saeed Parsa 34 • All tokens have a collection of predefined, read-only attributes. • The attributes include useful token properties such as the token type and text matched for a token. • Actions can access these attributes via $label.attribute where label labels a particular instance of a token reference. • To access the tokens matched for literals, you must use a label: https://github.com/antlr/antlr4/blob/master/doc/actions.md stat: r='return' expr {System.out.println("line="+$r.line);} ;
  • 35. Token actions 5/11/2021 Saeed Parsa 35 • Most of the time you access the attributes of the token, but sometimes it is useful to access the Token object itself because it aggregates all the attributes. • Further, you can use it to test whether an optional subrule matched a token: https://github.com/antlr/antlr4/blob/master/doc/actions.md stat: 'if' expr 'then' stat (el='else' stat)? {if ( $el!=null ) System.out.println("found an else");} | ... ;
  • 36. Token attributes 5/11/2021 Saeed Parsa 36 https://github.com/antlr/antlr4/blob/master/doc/actions.md Attribute Type Description text String The text matched for the token; translates to a call to getText. Example: $ID.text. type int The token type (nonzero positive integer) of the token such as INT; translates to a call to getType. Example: $ID.type. line int The line number on which the token occurs, counting from 1; translates to a call to getLine. Example: $ID.line. pos int The character position within the line at which the token’s first character occurs, counting from zero; translates to a call to getCharPositionInLine. Example: $ID.pos.
  • 37. Token attributes 5/11/2021 Saeed Parsa 37 https://github.com/antlr/antlr4/blob/master/doc/actions.md Attribute Type Description index int The overall index of this token in the token stream, counting from zero; translates to a call to getTokenIndex. Example: $ID.index. channel int The token’s channel number. The parser tunes to only one channel, effectively ignoring off-channel tokens. The default channel is 0 (Token.DEFAULT_CHANNEL), and the default hidden channel is Token.HIDDEN_CHANNEL. Translates to a call to getChannel. Example: $ID.channel. int int The integer value of the text held by this token; it assumes that the text is a valid numeric string. Handy for building calculators and so on. Translates to Integer.valueOf(text-of-token). Example: $INT.int.
  • 38. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 38 • Identifiers A basic identifier is a nonempty sequence of uppercase and lowercase letters. ID : ('a'..'z'|'A'..'Z')+ ; // match 1-or-more upper or lowercase letters As a shorthand for character sets, ANTLR supports the more familiar regular expression set notation: ID : [a-zA-Z]+ ; // match 1-or-more upper or lowercase letters • Keywords Rule ID could also match keywords such as enum, if, while, then and for, which means there’s more than one rule that could match the same string. ANTLR lexers resolve ambiguities between lexical rules by favoring the rule specified first. That means your ID rule should be defined after all of your keyword rules.
  • 39. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 39 Identifiers: /* Fragments */ Identifier : Identifiernondigit (Identifiernondigit | DIGIT)* ; fragment Identifiernondigit : NONDIGIT | Universalcharactername ; fragment NONDIGIT : [a-zA-Z] ; fragment DIGIT : [0-9] ; fragment Universalcharactername : '\\u' Hexquad | '\\U' Hexquad Hexquad ;
  • 40. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 40 Keywords: Void : 'void' ; Volatile : 'volatile' ; While : 'while' ; Switch : 'switch' ; Struct : 'struct' ; Goto : 'goto' ; If : 'if' ; Inline : 'inline' ; Int : 'int' ; Long : 'long' ; False : 'false' ; Final : 'final' ; Float : 'float' ; For : 'for' ; Else : 'else' ; Enum : 'enum' ;
  • 41. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 41 Numbers: • Describing integer numbers such as 10 is easy because it’s just a sequence of digits. INT : '0'..'9'+ ; // match 1 or more digits or INT : [0-9]+ ; // match 1 or more digits
  • 42. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 42 Floating point numbers:  A floating-point number is a sequence of digits followed by a period and then optionally a fractional part, or it starts with a period and continues with a sequence of digits. FLOAT: DIGIT+ '.' DIGIT* // match 1. , 39. , 3.14159 , etc... | '.' DIGIT+ // match .1 .14159 ; fragment DIGIT : [0-9] ; // match single digit  By prefixing the rule with fragment, we let ANTLR know that the rule will be used only by other lexical rules.
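The FLOAT rule above has a direct Python regex counterpart. A small sketch for experimentation:

```python
import re

# The rule  FLOAT : DIGIT+ '.' DIGIT* | '.' DIGIT+  as a Python regex:
# digits then '.' then optional digits, or '.' followed by digits.
float_re = re.compile(r'\d+\.\d*|\.\d+')

for s in ["1.", "39.", "3.14159", ".14159", "12"]:
    print(s, bool(float_re.fullmatch(s)))
```

As in the grammar, a plain integer such as "12" does not match FLOAT; it would be handled by the INT rule.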
  • 43. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 43 Strings:  A string is a sequence of any characters between double quotes. STRING : '"' .*? '"' ; // match anything in "..."  The dot wildcard operator matches any single character.  Therefore, .* would be a loop that matches any sequence of zero or more characters.  ANTLR provides support for nongreedy subrules using standard regular expression notation (the ? suffix).  Nongreedy subrules match the fewest number of characters while still allowing the entire surrounding rule to match.  To support the common escape characters, we need something like the following: STRING : '"' (ESC|.)*? '"' ; fragment ESC : '\\"' | '\\\\' ; // 2-char sequences \" and \\
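The same nongreedy-with-escapes idea can be tried in Python's regex dialect. A hedged sketch (this mirrors the ANTLR rule's behavior, not its implementation): the alternation tries the escape pair first, so an escaped quote inside the string does not terminate the match.

```python
import re

# '"' (ESC | .)*? '"'  in Python form: \\. consumes any escape pair
# before the lazy wildcard can stop at an escaped quote.
string_re = re.compile(r'"(?:\\.|.)*?"', re.S)

src = r'x = "he said \"hi\""; y = "ok"'
print(string_re.findall(src))
```

Without the `\\.` alternative, the lazy `.*?` would stop at the escaped quote inside the first string and produce a wrong, shorter match.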
  • 44. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 44 Comments:  When a lexer matches the tokens we’ve defined so far, it emits them via the token stream to the parser.  But when the lexer matches comment and whitespace tokens, we’d like it to toss them out.  Here is how to match both single-line and multiline comments for C-derived languages: LINE_COMMENT : '//' .*? '\r'? '\n' -> skip ; // Match "//" stuff '\n' COMMENT : '/*' .*? '*/' -> skip ; // Match "/*" stuff "*/"  In LINE_COMMENT, .*? consumes everything after // until it sees a newline (optionally preceded by a carriage return to match Windows-style newlines).  In COMMENT, .*? consumes everything after /* and before the terminating */.
  • 45. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 45 Whitespaces:  Most programming languages treat whitespace characters as token separators but otherwise ignore them.  Python is an exception because it uses whitespace for syntax purposes: newlines terminate statements, and initial tabs or spaces indicate nesting level.  Here is how to tell ANTLR to throw out whitespace: WS : (' '|'\t'|'\r'|'\n')+ -> skip; // match 1-or-more whitespace but discard or WS : [ \t\r\n]+ -> skip; // match 1-or-more whitespace but discard
  • 46. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 46 Whitespaces: Whitespace : [ \t]+ -> channel(HIDDEN) ; Newline : ('\r' '\n'? | '\n') -> channel(HIDDEN) ; BlockComment : '/*' .*? '*/' -> channel(HIDDEN) ; LineComment : '//' ~[\r\n]* -> channel(HIDDEN) ;  By using the channel(HIDDEN) directive you tell ANTLR to keep the token, but on a hidden channel that the parser ignores.
  • 47. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 47 Assignment operators: assignmentoperator : '=' #assignmentoperator1 | '*=' #assignmentoperator2 | '/=' #assignmentoperator3 | '%=' #assignmentoperator4 | '+=' #assignmentoperator5 | '-=' #assignmentoperator6 | RightShiftAssign #assignmentoperator7 | LeftShiftAssign #assignmentoperator8 | '&=' #assignmentoperator9 | '^=' #assignmentoperator10 | '|=' #assignmentoperator11 ;
  • 48. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 48 Operators: theoperator : New #theoperator1 | Delete #theoperator2 | New '[' ']' #theoperator3 | Delete '[' ']' #theoperator4 | '+' #theoperator5 | '-' #theoperator6 | '*' #theoperator7 | '/' #theoperator8 | '%' #theoperator9 | '^' #theoperator10 | '&' #theoperator11 | '|' #theoperator12 | '~' #theoperator13 | '!' #theoperator14 | 'not' #theoperator15 | '=' #theoperator16 | '<' #theoperator17 | '>' #theoperator18 | '+=' #theoperator19 | '-=' #theoperator20 | '*=' #theoperator21 | '/=' #theoperator22 | '%=' #theoperator23 | '^=' #theoperator24 | '&=' #theoperator25 | '|=' #theoperator26 | LeftShift #theoperator27 | RightShift #theoperator28 | RightShiftAssign #theoperator29 | LeftShiftAssign #theoperator30 | '==' #theoperator31
  • 49. Lexical rules in ANTLR 5/11/2021 Saeed Parsa 49 Nested curly brackets:  Consider that matching nested curly braces with a DFA must be done using a counter, whereas nested curlies are trivially matched with a context-free grammar: ACTION : '{' ( ACTION | ~'}' )* '}' ;  The recursion, of course, is the dead giveaway that this is not an ordinary lexer rule. • Lexer rules may use more than a single symbol of lookahead, can use semantic predicates, and can specify syntactic predicates to look arbitrarily far ahead. ESCAPE_CHAR : '\\' 't' // two chars of lookahead needed, | '\\' 'n' // due to common left-prefix ;
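The counter the slide mentions is easy to make concrete. A minimal sketch of counting-based brace matching (which a plain DFA cannot do, since the nesting depth is unbounded):

```python
# Sketch: match nested curly braces with an explicit depth counter,
# the non-DFA ingredient the slide refers to.
def match_action(text: str, start: int) -> int:
    """Return the index just past the '}' matching text[start] == '{'."""
    assert text[start] == '{'
    depth = 0
    for i in range(start, len(text)):
        if text[i] == '{':
            depth += 1
        elif text[i] == '}':
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unbalanced braces")

src = "{ a { b } c } tail"
print(src[:match_action(src, 0)])  # { a { b } c }
```

The recursive ACTION rule above does the same job declaratively: each recursive invocation corresponds to one increment of the counter.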
  • 50. Lexer Starter Kit 5/11/2021 Saeed Parsa 50  Punctuations call : ID '(' exprList ')' ;  Some programmers prefer to define token labels such as LP (left parenthesis) instead. call : ID LP exprList RP ; LP : '(' ; RP : ')' ;  Keywords  Keywords are reserved identifiers, and we can either reference them directly or define token types for them: returnStat : 'return' expr ';'
  • 51. Lexer Starter Kit 5/11/2021 Saeed Parsa 51  Identifiers ID : ID_LETTER (ID_LETTER | DIGIT)* ; // From C language fragment ID_LETTER : 'a'..'z'|'A'..'Z'|'_' ; fragment DIGIT : '0'..'9' ;  Numbers INT : DIGIT+ ; FLOAT : DIGIT+ '.' DIGIT* | '.' DIGIT+ ;  Strings STRING : '"' ( ESC | . )*? '"' ; fragment ESC : '\\' [btnr"\\] ; // \b, \t, \n etc...
  • 52. Lexer Starter Kit 5/11/2021 Saeed Parsa 52  Comments LINE_COMMENT : '//' .*? '\n' -> skip ; COMMENT : '/*' .*? '*/' -> skip ;  Whitespace WS : [ \t\n\r]+ -> skip ;
  • 53. Tokenizing Sentences 5/11/2021 Saeed Parsa 53 • Humans unconsciously combine letters into words before recognizing grammatical structure while reading. • Recognizers that feed off character streams are called tokenizers or lexers. • Just as an overall sentence has structure, the individual tokens have structure. • At the character level, we refer to syntax as the lexical structure. • We want to recognize lists of names such as [a,b,c] and nested lists such as [a,[b,c],d]: grammar NestedNameList; list : '[' elements ']' ; // match bracketed list elements : element (',' element)* ; // match comma-separated list element : NAME | list ; // element is name or nested list NAME : ('a'..'z' |'A'..'Z' )+ ; // NAME is sequence of >=1 letter. https://theswissbay.ch/pdf/Gentoomen%20Library/Programming/Pragmatic%20Programmers/Language%20Implementation%20Patterns.pdf
  • 54. Parse Trees 5/11/2021 Saeed Parsa 54 grammar NestedNameList; list : '[' elements ']' ; // match bracketed list elements : element (',' element)* ; // match comma-separated list element : NAME | list ; // element is name or nested list NAME : ('a'..'z' |'A'..'Z' )+ ; // NAME is sequence of >=1 letter. 1. Parse tree for [a,b,c] 2. Parse tree for [a,[b,c],d]
  • 55. Implementation 5/11/2021 Saeed Parsa 55 • Here is a loop that pulls tokens out, until it returns a token with type EOF_TYPE: ListLexer lexer = new ListLexer(args[0]); Token t = lexer.nextToken(); while ( t.type != Lexer.EOF_TYPE ) { System.out.println(t); t = lexer.nextToken(); } System.out.println(t); // EOF https://theswissbay.ch/pdf/Gentoomen%20Library/Programming/Pragmatic%20Programmers/Language%20Imp lementation%20Patterns.pdf Page: 51
  • 56. Example 5/11/2021 Saeed Parsa 56 Write a program to accept a C++ program as input and generate the parse tree for the program. 1. Run ANTLR to generate a lexical analyzer (lexer) and a parser for C++. 2. Give a C++ program to the generated parser to build and depict the parse tree for the program. You may generate lexers and parsers for other languages such as C#, Java, and Python. Your code could be in either Python or C#.
  • 57. Example 5/11/2021 Saeed Parsa 57 1. Run ANTLR to generate a lexical analyzer (lexer) and a parser for C++. First, we have to move the grammar of the C++ language (CPP14.g4) to the C:\javalibs folder. Now, we generate the lexer & parser for C++ targeting the Python language with this command in cmd: java -jar ./antlr-4.8-complete.jar -Dlanguage=Python3 CPP14.g4
  • 58. Example 5/11/2021 Saeed Parsa 58 As you can see below: Files generated
  • 59. Example 5/11/2021 Saeed Parsa 59 2. Give a C++ program to the generated parser to build and depict the parse tree for the program. We write Python code to run the lexer & parser on the file ‘test.cpp’ that is in the main folder of our Python code.
  • 60. Example 5/11/2021 Saeed Parsa 60 from antlr4 import CommonTokenStream, FileStream, ParseTreeWalker from CPP14Lexer import CPP14Lexer from CPP14Listener import CPP14Listener from CPP14Parser import CPP14Parser if __name__ == '__main__': input_stream = FileStream('./test.cpp') # Use FileStream to read the program file. lexer = CPP14Lexer(input_stream) # Generate the lexer from the FileStream object. stream = CommonTokenStream(lexer) # Use CommonTokenStream to get tokens from the lexer. parser = CPP14Parser(stream) # Generate the parser to create the parse tree from tokens. tree = parser.translationunit() listener = CPP14Listener() walker = ParseTreeWalker() walker.walk(listener, tree) # Use listener & walker to navigate the parse tree; the listener print(tree.getRuleIndex()) # is notified when the walker enters or exits each nonterminal rule.
  • 61. Regular expressions grammar 5/11/2021 Saeed Parsa 61 • grammar MyGrammar; • /* Parser Rules */ • expr: expr op=(MUL | DIV) expr #mulDiv • | expr op=(ADD | SUB) expr #addSub • | number | '(' expr ')' #num • ; • /* Lexer Rules */ • fragment DIGIT: [0-9]; • fragment LETTER: [a-zA-Z]; • INT: DIGIT+; • FLOAT: DIGIT+ '.' DIGIT+; • STRING_LITERAL: '"' .*? '"'; • NAME: LETTER (LETTER | DIGIT)*; • IDENTIFIER: [a-zA-Z0-9]+; • MUL: '*'; • DIV: '/'; • ADD: '+'; • SUB: '-'; • WS: [ \t\r\n]+ -> skip;
  • 62. Example 1: Generating a Lexer 5/11/2021 Saeed Parsa 62 In our grammar file, say ScriptLexer.g4, we have: // Name our lexer (the name must match the filename) lexer grammar ScriptLexer; // Define string values - either unquoted or quoted STRING : ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'@')+ | ('"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"') ; // Skip all spaces, tabs, newlines WS : [ \t\r\n]+ -> skip ; // Skip comments LINE_COMMENT : '//' ~[\r\n]* '\r'? '\n' -> skip ; // Define punctuations LPAREN : '<' ; RPAREN : '>' ; EQUALS : '=' ; SEMICO : ';' ; ASSIGN : ':=' ; https://imjching.com/writings/2017/02/16/lexical-analysis-with-antlr-v4/
  • 63. Example 1: Generating a Lexer 5/11/2021 Saeed Parsa 63 Now that we have our grammar file, we can run the ANTLR tool on it to generate our lexer program. antlr4 ScriptLexer.g4 This will generate two files: 1. ScriptLexer.java (the code which contains the implementation of the FSM together with our token constants) and 2. ScriptLexer.tokens. Now we will create a Java program to test our lexer: TestLexer.java https://imjching.com/writings/2017/02/16/lexical-analysis-with-antlr-v4/
  • 64. Example 1: Generating a Lexer 5/11/2021 Saeed Parsa 64 import java.io.File; import java.io.FileInputStream; import org.antlr.v4.runtime.ANTLRInputStream; import org.antlr.v4.runtime.Token; public class TestLexer { public static void main(String[] args) throws Exception { System.out.println("Parsing: " + args[0]); FileInputStream fis = new FileInputStream(new File(args[0])); ANTLRInputStream input = new ANTLRInputStream(fis); ScriptLexer lexer = new ScriptLexer(input); Token token = lexer.nextToken(); while (token.getType() != ScriptLexer.EOF) { System.out.println("\t" + getTokenType(token.getType()) + "\t\t" + token.getText()); token = lexer.nextToken(); } } private static String getTokenType(int tokenType) { switch (tokenType) { case ScriptLexer.STRING: return "STRING"; case ScriptLexer.LPAREN: return "LPAREN"; case ScriptLexer.RPAREN: return "RPAREN"; case ScriptLexer.EQUALS: return "EQUALS"; case ScriptLexer.SEMICO: return "SEMICO"; case ScriptLexer.ASSIGN: return "ASSIGN"; default: return "OTHER"; } } } https://imjching.com/writings/2017/02/16/lexical-analysis-with-antlr-v4/
  • 65. Example 1: Generating a Lexer 5/11/2021 Saeed Parsa 65 We then compile our test program. javac TestLexer.java and if we try to run TestLexer and giving sample.script as an argument: // Sample.Script : What to do in the morning func morning < name := "Jay"; greet morning=true input=@name; eat cereals; attend class="CS101"; > // What to do at night func night < brush_teeth; sleep hours=8; >
  • 66. Example 1: Generating a Lexer 5/11/2021 Saeed Parsa 66 We get the following: java TestLexer sample.script Parsing: sample.script STRING func STRING morning LPAREN < STRING name ASSIGN := STRING "Jay" SEMICO ; STRING greet STRING morning EQUALS = STRING true STRING input EQUALS = STRING @name SEMICO ; STRING eat STRING cereals SEMICO ; STRING attend STRING class EQUALS = STRING "CS101" SEMICO ; RPAREN > STRING func STRING night LPAREN < STRING brush_teeth SEMICO ; STRING sleep STRING hours EQUALS = STRING 8 SEMICO ; RPAREN >
  • 67. Regular Expression in C# 5/11/2021 Saeed Parsa 67 • In C#, a regular expression is a pattern used to check whether a given input text matches that pattern. • In C#, regular expressions are generally referred to as C# Regex. • The .NET Framework provides a regular expression engine that performs the pattern matching. • Patterns may consist of character literals, operators, or constructs. • C# provides a class named Regex, found in the System.Text.RegularExpressions namespace. • This class performs two things: - Parsing the input text for the regular expression pattern. - Identifying the regular expression pattern in the given text.
  • 68–83. Regular Expression in C# 5/11/2021 Saeed Parsa (Slides 68–83 consist of screenshots of C# Regex examples; no text content is recoverable.)
  • 84. POSIX Standard 5/11/2021 Saeed Parsa 84 • The POSIX standard is a widely used and accepted API for regular expressions, specified by the IEEE. • Traditional Unix regular expression syntax followed common conventions that often differed from tool to tool. • The POSIX Basic Regular Expressions syntax was developed by the IEEE, together with an extended variant called the Extended Regular Expressions syntax. • These standards were designed mostly to provide backward compatibility with the traditional Simple Regular Expressions syntax, providing a common standard which has since been adopted as the default syntax of many Unix regular expression tools.
  • 86. POSIX Standard 5/11/2021 Saeed Parsa 86 Examples:  .at matches any three-character string ending with "at", including "hat", "cat", and "bat".  [hc]at matches "hat" and "cat".  [^b]at matches all strings matched by .at except "bat".  ^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.  [hc]at$ matches "hat" and "cat", but only at the end of the string or line.  \[.\] matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
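The examples above can be verified with Python's re module, whose syntax covers these particular constructs the same way as POSIX ERE. A quick sketch:

```python
import re

# The POSIX examples from the slide, checked one by one.
print(bool(re.fullmatch(r'.at', 'hat')))                            # True
print([w for w in ['hat', 'cat', 'bat'] if re.fullmatch(r'[hc]at', w)])
print([w for w in ['hat', 'cat', 'bat'] if re.fullmatch(r'[^b]at', w)])
print(bool(re.search(r'^[hc]at', 'hat trick')))  # True: anchored at start
print(bool(re.search(r'[hc]at$', 'the cat')))    # True: anchored at end
print(bool(re.fullmatch(r'\[.\]', '[a]')))       # True: escaped brackets
```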
  • 88. Example 5/11/2021 Saeed Parsa 88 Write a regular expression to describe inputs over the alphabet {a, b, c} that are in sorted order. Because of the sorting, the inputs look like the following examples: aaabbbcc abcc bcccc abb aaabcc aacc So, the regular expression is: a*b*c*
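Since a sorted string over {a, b, c} is a run of a's, then b's, then c's (each possibly empty), the regex a*b*c* checks the condition. A sketch verifying it on the slide's examples:

```python
import re

# Sorted-order language over {a, b, c}: a-run, then b-run, then c-run.
sorted_re = re.compile(r'a*b*c*')

for s in ["aaabbbcc", "abcc", "bcccc", "aacc", "cba", "abca"]:
    print(s, bool(sorted_re.fullmatch(s)))
```

`fullmatch` is essential here: with `search` or `match`, an unsorted string like "abca" would still match a prefix of the pattern.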
  • 89. Example 5/11/2021 Saeed Parsa 89 Write a regular expression to check whether a string starts and ends with the same character. Implement the regular expression in Python. We have two cases: • If the string has just a single character, it satisfies the condition, so the regExp is: ^[a-z]$ • If the string has multiple characters, we have: ^([a-z]).*\1$ • \1: match the same character that appears in the first position of the string So, the final regExp we want is the combination of the two cases: ^[a-z]$|^([a-z]).*\1$
  • 90. Example 5/11/2021 Saeed Parsa 90 The Python program to implement and test the regExp is shown below: import re regExp = r'^[a-z]$|^([a-z]).*\1$' result = re.match(regExp, 'abba') print(result)
  • 91. Example 5/11/2021 Saeed Parsa 91 Write a regular expression to determine if a string is an IP address. Rule: An IP address consists of 4 numbers, separated by three dots. The value of each number is 0-255. For example: 255.189.10.37 is correct and 256.189.89.9 is an error. Write a C# or Python program to validate IP addresses. Because we have 4 numbers separated by . (dot), the regular expression that accepts only valid IP addresses is: ((\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}(\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])
  • 92. Example 5/11/2021 Saeed Parsa 92 The Python program to implement and test the regExp is shown below: import re regExp = r'((\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}(\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])' result = re.fullmatch(regExp, '192.168.1.12') print(result)
  • 93. Example 5/11/2021 Saeed Parsa 93 Depict a DFA to accept all the binary strings that do not include the substring “011”. Write a lexer function to determine these strings. The DFA should be like this:
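The DFA can be simulated with a transition table before writing the C lexer function. A hedged Python sketch (the state numbering here is an assumption, since the slide's diagram is an image; the states simply track progress toward the forbidden "011"):

```python
# Sketch: DFA rejecting binary strings that contain "011".
#   0: no progress   1: ends in '0'   2: ends in "01"   3: dead ("011" seen)
DELTA = {
    (0, '0'): 1, (0, '1'): 0,
    (1, '0'): 1, (1, '1'): 2,
    (2, '0'): 1, (2, '1'): 3,
    (3, '0'): 3, (3, '1'): 3,
}

def accepts(w: str) -> bool:
    state = 0
    for ch in w:
        state = DELTA[(state, ch)]
    return state != 3  # all non-dead states are accepting

print(accepts("0101"))  # True: "011" never occurs
print(accepts("0110"))  # False
```

Each `case` of the C switch on the following slides corresponds to one row of this transition table.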
  • 94. Example 5/11/2021 Saeed Parsa 94 Lexer function: struct TokenType LexicalAnalyser(FILE * Source) { enum Symbols LexiconType; char NextChar, NextWord[80]; int State, Length; static char LastChar = '\0'; static int RowNo = 0, ColNo = 0; State = 0; Length = 0; while(!feof(Source)) { if(LastChar) { NextChar = LastChar; LastChar = '\0'; } else { NextChar = fgetc(Source); } NextWord[Length++] = NextChar;
  • 95. Example 5/11/2021 Saeed Parsa 95 Lexer function: case 0: if (NextChar == '\n') { RowNo++ ; ColNo = 0; } else ColNo++; if (NextChar == ' ' || NextChar == '\t' || NextChar == '\n') Length = 0; else if ( NextChar == '1' ) State = 1; else if ( NextChar == '0' ) State = 2; else LexerError(NextWord, Length); break; case 1: if ( NextChar == '1' ) State = 1; else if ( NextChar == '0' ) State = 2; else return "accepted"; break;
  • 96. Example 5/11/2021 Saeed Parsa 96 Lexer function: case 2: if ( NextChar == '1' ) State = 3; else if ( NextChar == '0' ) State = 2; else return "accepted"; break; case 3: if ( NextChar == '1' ) LexerError(NextWord, Length); // "011" detected: reject else if ( NextChar == '0' ) State = 2; else return "accepted"; break;
  • 97. Example 5/11/2021 Saeed Parsa 97 Write a function that removes all comments from a piece of CPP code. Steps to do: 1. Create FileStream object for the input file. 2. Create lexer for the FileStream object 3. Get the first token by lexer.nextToken(). 4. Check in the loop all of the tokens one by one not to be line comment or block comment & get next token.
  • 98. Example 5/11/2021 Saeed Parsa 98 The Python code to implement removing comments is shown below: from antlr4 import * from gen.CPP14Lexer import CPP14Lexer def remove_comments(filename='test.cpp'): input_stream = FileStream(filename) lexer = CPP14Lexer(input_stream) stream = CommonTokenStream(lexer) token = lexer.nextToken() new_file = open('test2.cpp', 'w') while token.type != Token.EOF: if token.type != lexer.BlockComment and token.type != lexer.LineComment: new_file.write(token.text.replace('\r', '')) token = lexer.nextToken() new_file.close()
  • 99. Example 5/11/2021 Saeed Parsa 99 That’s the result: Before call remove_comment() After call remove_comment()
  • 100. Example 5/11/2021 Saeed Parsa 100 Write a Python program, using ANTLR, to add your student number to all the comments within CPP programs. We find both line comments & block comments and we add student id to it and save in new file named “test2.cpp”.
  • 101. Example 5/11/2021 Saeed Parsa 101 The Python code to implement adding the student number to comments is shown below: input_stream = FileStream('test.cpp') lexer = CPP14Lexer(input_stream) stream = CommonTokenStream(lexer) token = lexer.nextToken() new_file = open('test2.cpp', 'w') while token.type != Token.EOF: if token.type == lexer.BlockComment: text_to_write = token.text.replace('\r', '').replace("*/", "97521135\n*/") new_file.write(text_to_write) elif token.type == lexer.LineComment: new_file.write(token.text.replace('\r', '') + " 97521135") else: new_file.write(token.text.replace('\r', '')) token = lexer.nextToken()
  • 102. Example 5/11/2021 Saeed Parsa 102 Write a Python program, using ANTLR, to detect email addresses. First, we have to write a grammar for emails in Email.g4: grammar Email; email: LITERAL ATSIGN LITERAL (DOT LITERAL)+ ; WS: [ \t\r\n]+ -> skip ; LITERAL : [a-zA-Z]+ [0-9]* ; ATSIGN: '@' ; DOT: '.' ;
  • 103. Example 5/11/2021 Saeed Parsa 103 Write a Python program, using ANTLR, to detect email addresses. Now, we have to generate lexer & parser from the grammar we’ve wrote, by right-click on the grammar file & select “Generate ANTLR Recognizer”
  • 104. Example 5/11/2021 Saeed Parsa 104 And we use Python code to run the lexer and evaluate whether the email is right or not. from antlr4 import * from gen.EmailLexer import EmailLexer input_stream = InputStream('danibazi9@gmail.com') try: lexer = EmailLexer(input_stream) stream = CommonTokenStream(lexer) print("The input email is correct") print("The tokens are:") token = lexer.nextToken() while token.type != Token.EOF: print(token.text) token = lexer.nextToken() except: print("The input email is not in the proper format")
  • 105. Assignment 2 5/11/2021 Saeed Parsa 105 Subject : Lexical Analysis Deadline: 1399/7/28 Mark: 5 out of 100.
  • 106. Assignment 2 5/11/2021 Saeed Parsa 106 1. Write a regular expression to describe inputs over the alphabet {a, b, c} that are in sorted order. 2. Write a regular expression to check whether a string starts and ends with the same character. Use ANTLR4 to implement the regular expression in Python. 3. Write a regular expression to determine whether a string is an IP address. Rule: An IP address consists of 4 numbers, separated by three dots. The value of each number is 0-255. For example: 255.189.10.37 is correct and 256.189.89.9 is an error. Write a C# or Python program to validate IP addresses. 4. Depict a DFA to accept all the binary strings that do not include the substring “011”. Write a lexer function to determine these strings. 5. Write a function that removes all comments from a piece of CPP code.
  • 107. The place of IUST in the world 5/11/2021 Saeed Parsa 107 https://www.researchgate.net/publication/328099969_Software_Fault_Localisation_A_Systematic_Mapping_Study