Lexical analysis

2,542 views

Lexical analysis is the first phase of the compiler; it identifies lexemes, produces tokens, and matches patterns. The material is helpful for solving GATE/UGC NET problems. For more insight, refer to http://tutorialfocus.net/

Published in: Engineering

  1. Compiler Design: Lexical Analysis
     Ms. Richa Sharma, Assistant Professor (richa.18364@lpu.co.in), Lovely Professional University
  2. Lexical analysis
     • It is the first phase of a compiler; the component that performs it is also known as the scanner.
     • The lexical analyser breaks the input into the smallest meaningful sequences, called tokens, which the parser uses for syntax analysis.
     • Tokens fall into classes such as identifiers, keywords, operators, and punctuation marks.
     • It removes white space (tabs, blanks, newlines) and comments.
     • The part of the input stream that qualifies as a token is called a lexeme.
     • E.g., "if" is a keyword in the C language, hence "if" is the lexeme in this case.
     • The lexical analyser keeps track of newline characters so that it can report the line number of any error in the source program.
  3. Lexical analysis
     The first step in understanding a program, both for a compiler and for a human, is to understand the words.
     Example: "How pleasant the weather is?"
     We immediately recognize five words: how, pleasant, the, weather, is. This is so automatic that we only have to recognize the separators, namely the blanks and the punctuation.
     Example: "Th eweat her isple asant."
     We can read this too, but it takes longer because the separators are in odd places. We have to do some work to see where the divisions lie, because they are not given in the way we are used to.
  4. Lexical analysis
     • The goal of lexical analysis, then, is to divide the program text into its words, or, in compiler terms, its tokens.
       E.g.: if (x == y) then printf("hello %d", a); else printf("%d", b);
       Some tokens here are obviously keywords, such as if, then, else.
     • A token class corresponds to a set of strings.
     • Identifier: a string of letters and digits starting with a letter.
     • Integer: a non-empty string of digits.
     • Keyword: "else" or "if" or "begin" or …
     • Whitespace: a non-empty sequence of blanks, newlines, and tabs.
  5. Lexical analysis
     [Figure: the source program below is fed to the lexical analyzer, which emits a stream of tokens such as int, main, (, ), {, int, count, ;, for, each labelled with its class: keyword, identifier, or punctuation.]
       int main()
       {
           int count;

           /* This is a comment */

           for(count=0;count<10;count++){
               printf("Hello World\n");
           }
       }
  6. Lexical analysis
     It reads the character stream from the source code, checks for legal tokens, and passes them to the syntax analyzer on demand.
  7. Lexical analysis
     • The lexical analyser classifies program substrings according to their role (token class) and communicates the tokens to the parser.
       Source: if (i == j) z = 0; else z = 1;
       The lexical analyser reads it as: \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
       \t: tab (white space), \n: newline
     • The lexical analyser removes all white space (blanks, tabs, newlines) from the source code.
  8. Lexical analysis
     • Choose the correct number of tokens in each class that appear in the code fragment
       x = 0;\n\twhile (x < 10) {\n\tx++;\n}
       A. W = 9; K = 1; I = 3; N = 2; O = 9
       B. W = 11; K = 4; I = 0; N = 2; O = 9
       C. W = 9; K = 4; I = 0; N = 3; O = 9
       D. W = 11; K = 1; I = 3; N = 3; O = 9
       W: whitespace, K: keyword, I: identifier, N: number, O: other tokens: { } ( ) < ++ ; =
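The counting in this exercise can be checked mechanically. The following is a minimal sketch, assuming the fragment is x = 0;\n\twhile (x < 10) {\n\tx++;\n} with \n and \t standing for newline and tab, and assuming "while" is the only keyword that matters here. The names TokenCounts, is_keyword, and count_tokens are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Token counts per class, using the legend from the exercise. */
typedef struct {
    int whitespace, keyword, identifier, number, other;
} TokenCounts;

/* Hypothetical helper: the only keyword this sketch knows is "while". */
static int is_keyword(const char *s, size_t len) {
    return len == 5 && strncmp(s, "while", 5) == 0;
}

/* Scan src left to right, always taking the longest lexeme of a class. */
TokenCounts count_tokens(const char *src) {
    TokenCounts c = {0, 0, 0, 0, 0};
    const char *p = src;
    assert(src != NULL);
    while (*p != '\0') {
        const char *start = p;
        if (isspace((unsigned char)*p)) {
            /* A run of blanks, tabs, and newlines is one whitespace token. */
            while (isspace((unsigned char)*p)) p++;
            c.whitespace++;
        } else if (isalpha((unsigned char)*p)) {
            /* Letter followed by letters or digits: keyword or identifier. */
            while (isalnum((unsigned char)*p)) p++;
            if (is_keyword(start, (size_t)(p - start))) c.keyword++;
            else c.identifier++;
        } else if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p)) p++;
            c.number++;
        } else {
            /* "++" is a single token, so peek at the next character. */
            if (p[0] == '+' && p[1] == '+') p += 2;
            else p++;
            c.other++;
        }
    }
    return c;
}
```

Calling count_tokens("x = 0;\n\twhile (x < 10) {\n\tx++;\n}") yields W = 9, K = 1, I = 3, N = 2, O = 9 under these assumptions, which corresponds to option A.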
  9. Lexical analysis
     • Lookahead: "lookahead" may be required to decide where one token ends and the next token begins.
     • E.g., recognizing "==": when we read a single equals sign, how do we decide whether it is a single "=" (an assignment) or the start of a double equals "=="? With the scan position on the first "=", we have to look ahead at the next character. If another "=" is coming up, the two are combined into a single symbol instead of treating the first "=" by itself.
     • Lookahead complicates the implementation of lexical analysis, so one of the goals in the design of lexical systems is to minimize, or at least bound, the amount of lookahead required.
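The "==" case can be sketched with a single character of lookahead. This is an illustrative hand-written scanner step, not code from the slides; the names TokKind and scan_equals are assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOK_ASSIGN, TOK_EQ, TOK_UNKNOWN } TokKind;

/* Classify the operator starting at src[*pos] and advance *pos past it.
 * One character of lookahead separates "=" from "==". */
TokKind scan_equals(const char *src, size_t *pos) {
    assert(src != NULL && pos != NULL);
    if (src[*pos] != '=') {
        return TOK_UNKNOWN;          /* not an equals sign: do not advance */
    }
    if (src[*pos + 1] == '=') {      /* peek: is another '=' coming up? */
        *pos += 2;                   /* consume both characters as one token */
        return TOK_EQ;
    }
    *pos += 1;                       /* a lone '=' is an assignment */
    return TOK_ASSIGN;
}
```

The peek at src[*pos + 1] is safe because src[*pos] is '=', so the next position holds at worst the string terminator. The lookahead here is bounded to exactly one character.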
  10. Regular language
     • To define regular languages we generally use regular expressions; each regular expression denotes a set of strings.
     • There are two basic regular expressions.
       1) Single character: 'c' = {"c"}. A single character c is an expression, and what it denotes is a language containing one string.
       2) Epsilon: ε = {""}. This language again contains just a single string, this time the empty string. It is important to keep in mind that ε is not the empty language.
  11. Regular language
     • Three compound regular expressions
       1) Union: A+B, also written in Lex as A|B. E.g., (A+B) or (A|B) means either A or B. Note: in compilers, use A|B instead of A+B to avoid confusion between union and iteration.
       2) Concatenation: AB. E.g., (ab|int|k1408) means either "ab" or "int" or "k1408", each of which is a concatenation of single characters.
       3) Iteration: A*. This is pronounced "A star" and is called the Kleene iteration or Kleene closure. A* is the union of all A^i for i ≥ 0, where A^0 = ε; AA* is equivalent to A+.
          E.g., (a)* means "a" can occur 0 or more times; (a)+ means "a" can occur 1 or more times (i.e., at least once).
  12. Regular language
     • There are 9 symbols that we use to form a regular expression/pattern (rules in Lex):
       • *  : iteration, meaning zero or more occurrences, i.e., a* = {ε, a, aa, aaa, …}
       • +  : iteration, meaning one or more occurrences, i.e., a+ = {a, aa, aaa, …}
       • ?  : either zero occurrences or one, i.e., a? = ε or a
       • |  : either one of two, i.e., ab|cd = {ab, cd}
       • () : grouping, to follow exactly the same sequence, i.e., (int) = {int}
       • [] : any one character out of those given within it, i.e., [int] = {i, n, t}
       • ^  : as the first character inside brackets it means "not", i.e., [^ab] is any single character other than a or b; at the start of a pattern it anchors the match to the beginning of the line, i.e., ^ab means the line should start with ab
       • $  : end of line (string)
       • .  : accepts any single character except newline
  13. Regular expression examples
       [-az]   -> {-, a, z}
       [a-z]   -> {a, b, c, d, …, z}
       [a\-z]  -> {a, -, z}
       [^a-d]  -> any character other than a to d (among the lowercase letters: e, f, g, …, z)
       ^[a-z]  -> the beginning should be a character from a to z
       a$      -> should end with a
       .       -> anything except newline
  14. Regular language
     • Choose the regular languages that are correct specifications of the English-language description given below:
       Twelve-hour times of the form "04:13PM". Minutes should always be a two-digit number, but hours may be a single digit.
       • (0 + 1)?[0-9]:[0-5][0-9](AM + PM)
       • ((0 + ε)[0-9] + 1[0-2]):[0-5][0-9](AM + PM)
       • (0*[0-9] + 1[0-2]):[0-5][0-9](AM + PM)
       • (0?[0-9] + 1(0 + 1 + 2)):[0-5][0-9](A + P)M
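One way to probe the four candidates is to run them against a few test strings. The sketch below uses the POSIX regex.h interface (regcomp, regexec, regfree). It assumes the options are rewritten by hand with | for union and ^...$ anchors, that (0 + ε) can be written as 0?, and that the fourth option was meant to have its first group closed after (0 + 1 + 2). The names full_match and OPTION are illustrative.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the whole of text matches the POSIX extended regex pattern. */
int full_match(const char *pattern, const char *text) {
    regex_t re;
    int ok;
    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return 0;                    /* treat a bad pattern as "no match" */
    }
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

/* The four options in POSIX extended syntax (hand translation, see above). */
const char *const OPTION[4] = {
    "^(0|1)?[0-9]:[0-5][0-9](AM|PM)$",
    "^(0?[0-9]|1[0-2]):[0-5][0-9](AM|PM)$",
    "^(0*[0-9]|1[0-2]):[0-5][0-9](AM|PM)$",
    "^(0?[0-9]|1(0|1|2)):[0-5][0-9](A|P)M$",
};
```

All four patterns accept "04:13PM". Under this translation the first also accepts "17:30PM" and the third also accepts "0004:13PM", neither of which is a twelve-hour time of the required form; the second and fourth reject both strings.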
  15. Representing occurrence of symbols using regular expressions
       letter = [a-z] or [A-Z]
       digit  = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9, or [0-9]
       sign   = + | -, or [+-]
      Representing language tokens using regular expressions:
       decimal    = (sign)?(digit)+
       identifier = letter(letter | digit)*
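The decimal and identifier definitions can be tried out directly. This is a minimal sketch using the POSIX regex.h interface; it assumes the definitions are expanded by hand into single anchored patterns and that letters cover both cases. The names DECIMAL_RE, IDENTIFIER_RE, and is_lexeme are illustrative.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* decimal = (sign)?(digit)+  and  identifier = letter(letter | digit)*,
 * expanded by hand and anchored so the whole string must match. */
#define DECIMAL_RE    "^[+-]?[0-9]+$"
#define IDENTIFIER_RE "^[A-Za-z][A-Za-z0-9]*$"

/* Return 1 if text as a whole is a lexeme of the given token pattern. */
int is_lexeme(const char *pattern, const char *text) {
    regex_t re;
    int ok;
    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return 0;                    /* treat a bad pattern as "no match" */
    }
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

For example, is_lexeme(DECIMAL_RE, "-42") and is_lexeme(IDENTIFIER_RE, "k1408") both return 1, while is_lexeme(IDENTIFIER_RE, "1abc") returns 0 because an identifier must start with a letter.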
  16. Lexical specification file
      The Lex file is divided into three sections:
     • Declarations
     • Translation rules
     • Auxiliary functions
      The end of each section is marked by %%:
       Declarations
       %%
       Translation rules
       %%
       Auxiliary functions
  17. Declarations
     • Regular definitions that can be used in the translation rules
     • C code enclosed within %{ and %}:
       • #defines and C prototype declarations of the functions used in the translation rules
       • #include statements for the library functions used in the translation rules
  18. Translation rules
     • Pattern-action pairs
     • The pattern is a regular expression and the action is a C language program segment
     • The action is typically a return statement indicating the type of token that has been matched
       Pattern1 { Action 1 }
       Pattern2 { Action 2 }
       Pattern3 { Action 3 }
  19. Translation rules
     • Lex generates global variables that can be used in the action statements.
     • yytext contains the matched lexeme as a C string. E.g.: printf("%s ", yytext)
     • yyleng gives the length of the matched lexeme as an integer. E.g.: printf("%d ", yyleng)
     • For tokens that have no significance for the parser (white space, newlines, etc.), the action statement does not have a return statement.
  20. Auxiliary functions
     • Definitions of the C functions used in the action statements.
     • The whole section is copied "as is" into lex.yy.c.
     • The yylex routine is called repeatedly to get the next token until the end of the input.
      To build and run:
       1. Save the file with a .l extension, e.g. "abc.l"
       2. Compile it with the Lex compiler: "lex abc.l"
       3. This creates the file lex.yy.c
       4. Compile that with the GCC compiler: "gcc lex.yy.c -ll"
       5. Run the result with ./a.out
  21. Lexical analyzer generator - Lex
     [Figure: Lex source program lex.l -> Lex compiler -> lex.yy.c; lex.yy.c -> C compiler -> a.out; input stream -> a.out -> sequence of tokens.]