Your SlideShare is downloading. ×
4 lexical
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

4 lexical

339
views

Published on

Published in: Education

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
339
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
7
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • Makes parsing much easier (the parsers looks for relations between tokens) Other tasks: remove comments
  • Automatic generation of scanners
  • Transcript

    • 1. Chapter 4 Lexical analysisCMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 1
    • 2. Scanner • Main task: identify tokens – Basic building blocks of programs – E.g. keywords, identifiers, numbers, punctuation marks • Desk calculator language example: read A sum := A + 3.45e-3 write sum write sum / 2CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 2
    • 3. Formal definition of tokens • A set of tokens is a set of strings over an alphabet – {read, write, +, -, *, /, :=, 1, 2, …, 10, …, 3.45e-3, …} • A set of tokens is a regular set that can be defined by comprehension using a regular expression • For every regular set, there is a deterministic finite automaton (DFA) that can recognize it – (Aka deterministic Finite State Machine (FSM)) – i.e. determine whether a string belongs to the set or not – Scanners extract tokens from source code in the same way DFAs determine membershipCMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 3
    • 4. Regular Expressions A regular expression (RE) is: • A single character • The empty string, ε • The concatenation of two regular expressions – Notation: RE1 RE2 (i.e. RE1 followed by RE2) • The union of two regular expressions – Notation: RE1 | RE2 • The closure of a regular expression – Notation: RE* – * is known as the Kleene star – * represents the concatenation of 0 or more strings • Caution: notations for regular expressions vary – Learn the basic concepts and the rest is just syntactic sugarCMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 4
    • 5. Token Definition Example DIG • Numeric literals in Pascal, e.g. 1, 123, 3.1415, 10e-3, 3.14e4 * DIG • Definition of token unsignedNum . DIG → 0|1|2|3|4|5|6|7|8|9 DIG * unsignedInt → DIG DIG* e e unsignedNum → + unsignedInt - DIG DIG DIG (( . unsignedInt) | ε) ((e ( + | – | ε) unsignedInt) | ε) * DIG • Notes: – Recursion is not allowed! • FAs with epsilons are nondeterministic. – Parentheses used to avoid • NFAs are much harder to ambiguity implement (use backtracking) – It’s always possible to rewrite • Every NFA can be rewriten as a DFA (gets larger, tho) removing epsilonsCMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 5
    • 6. Simple Problem • Write a C program which reads in a character string, consisting of a’s and b’s, one character at a time. If the string contains a double aa, then print string accepted else print string rejected. • An abstract solution to this can be expressed as a DFA b a 1- 2 3+ a, b a Start state b An accepting state input a b The state transitions of a DFA can be encoded as a table current 1 2 1 which specifies the new state for a given current state and state 2 3 1 input 3 3 3CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 6
    • 7. #include <stdio.h> main() { enum State {S1, S2, S3}; an approach in C enum State currentState = S1; int c = getchar(); while (c != EOF) { switch(currentState) { case S1: if (c == ‘a’) currentState = S2; if (c == ‘b’) currentState = S1; break; case S2: if (c == ‘a’) currentState = S3; if (c == ‘b’) currentState = S1; break; case S3: break; } c = getchar(); } if (currentState == S3) printf(“string acceptedn”); else printf(“string rejectedn”); }CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 7
    • 8. #include <stdio.h> Using a table main() { enum State {S1, S2, S3}; simplifies the enum Label {A, B}; enum State currentState = S1; program enum State table[3][2] = {{S2, S1}, {S3, S1}, {S3, S3}}; int label; int c = getchar(); while (c != EOF) { if (c == ‘a’) label = A; if (c == ‘b’) label = B; currentState = table[currentState][label]; c = getchar(); } if (currentState == S3) printf(“string acceptedn”); else printf(“string rejectedn”); }CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 8
    • 9. Lex • Lexical analyzer generator – It writes a lexical analyzer • Assumption – each token matches a regular expression • Needs – set of regular expressions – for each expression an action • Produces – A C program • Automatically handles many tricky problems • flex is the gnu version of the venerable unix tool lex. – Produces highly optimized codeCMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 9
    • 10. Scanner Generators • E.g. lex, flex • These programs take a table as their input and return a program (i.e. a scanner) that can extract tokens from a stream of characters • A very useful programming utility, especially when coupled with a parser generator (e.g., yacc) • standard in UnixCMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 10
    • 11. Lex example input lex cc foolex foo.l foolex.c foolex tokens > foolex < input > flex -ofoolex.c foo.l Keyword: begin > cc -ofoolex foolex.c -lfl Keyword: if Identifier: size Operator: > >more input Integer: 10 (10) begin Keyword: then Identifier: size if size>10 Operator: * then size * -3.1415 Operator: - end Float: 3.1415 (3.1415) Keyword: endCMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 11
    • 12. A Lex Program DIG [0-9] … definitions … ID [a-z][a-z0-9]* %% %% {DIG}+ printf("Integern”); {DIG}+"."{DIG}* printf("Floatn”); … rules … {ID} [ tn]+ printf("Identifiern”); /* skip whitespace */ . printf(“Huh?n"); %% %% … subroutines … main(){yylex();}CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 12
    • 13. Simplest Example %% .|n ECHO; %% main() { yylex(); }CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 13
    • 14. Strings containing aa %% (a|b)*aa(a|b)* {printf(“Accept %sn”, yytext);} [ab]+ {printf(“Reject %sn”, yytext);} .|n ECHO; %% main() {yylex();}CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 14
    • 15. Rules • Each has a rule has a pattern and an action. • Patterns are regular expression • Only one action is performed – The action corresponding to the pattern matched is performed. – If several patterns match the input, the one corresponding to the longest sequence is chosen. – Among the rules whose patterns match the same number of characters, the rule given first is preferred.CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 15
    • 16. x character x Flex’s RE syntax . any character except newline [xyz] character class, in this case, matches either an x, a y, or a z [abj-oZ] character class with a range in it; matches a, b, any letter from j through o, or Z [^A-Z] negated character class, i.e., any character but those in the class, e.g. any character except an uppercase letter. [^A-Zn] any character EXCEPT an uppercase letter or a newline r* zero or more rs, where r is any regular expression r+ one or more rs r? zero or one rs (i.e., an optional r) {name} expansion of the "name" definition (see above) "[xy]"foo" the literal string: [xy]"foo (note escaped “) x if x is an a, b, f, n, r, t, or v, then the ANSI-C interpretation of x. Otherwise, a literal x (e.g., escape) rs RE r followed by RE s (e.g., concatenation) r|s either an r or an s <<EOF>> end-of-fileCMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. 16
    • 17. /* scanner for a toy Pascal-like language */ %{ #include <math.h> /* needed for call to atof() */ %} DIG [0-9] ID [a-z][a-z0-9]* %% {DIG}+ printf("Integer: %s (%d)n", yytext, atoi(yytext)); {DIG}+"."{DIG}* printf("Float: %s (%g)n", yytext, atof(yytext)); if|then|begin|end printf("Keyword: %sn",yytext); {ID} printf("Identifier: %sn",yytext); "+"|"-"|"*"|"/" printf("Operator: %sn",yytext); "{"[^}n]*"}" /* skip one-line comments */ [ tn]+ /* skip whitespace */ 17CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.