2. Lexical analyzer
• Lexical analysis, also called scanning, is the phase of the compilation
process which deals with the actual program being compiled, character by
character. The higher-level parts of the compiler call the lexical
analyzer with the command "get the next word from the input", and it is
the scanner's job to sort through the input characters and find this word.
• The types of "words" commonly found in a program are:
• programming language keywords, such as if, while, struct, int etc.
• operator symbols like =, +, -, &&, !, <= etc.
• other special symbols like: ( ), { }, [ ], ;, & etc.
• constants like 1, 2, 3, 'a', 'b', 'c', "any quoted string" etc.
• variable and function names (called identifiers) such as x, i, t1 etc.
• Some languages (such as C) are case sensitive, in that they differentiate
between eg. if and IF; thus the former would be a keyword, the latter a
variable name.
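The "get the next word" interface above can be sketched as a C function that skips blanks and then groups the characters of one word. This is only a minimal sketch: the token names, the tiny token set, and the fixed-size lexeme buffer are all invented for illustration.

```c
#include <ctype.h>

/* Illustrative token set: identifiers, integer constants, and a few
   single-character symbols. A real scanner would have many more. */
enum token { TOK_IDENT, TOK_NUMBER, TOK_ASSIGN, TOK_PLUS, TOK_SEMI,
             TOK_EOF, TOK_ERROR };

/* Scan one word from the string at *src, advancing *src past it.
   The characters of an identifier or constant are copied into lexeme
   (assumed large enough for this sketch). */
enum token get_next_token(const char **src, char *lexeme)
{
    const char *p = *src;
    while (*p == ' ' || *p == '\t' || *p == '\n')   /* skip blanks */
        p++;
    if (*p == '\0') { *src = p; return TOK_EOF; }
    if (isalpha((unsigned char)*p)) {               /* identifier */
        int n = 0;
        while (isalnum((unsigned char)*p)) lexeme[n++] = *p++;
        lexeme[n] = '\0';
        *src = p;
        return TOK_IDENT;
    }
    if (isdigit((unsigned char)*p)) {               /* integer constant */
        int n = 0;
        while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
        lexeme[n] = '\0';
        *src = p;
        return TOK_NUMBER;
    }
    lexeme[0] = *p; lexeme[1] = '\0';               /* special symbol */
    *src = p + 1;
    switch (lexeme[0]) {
    case '=': return TOK_ASSIGN;
    case '+': return TOK_PLUS;
    case ';': return TOK_SEMI;
    default:  return TOK_ERROR;                     /* invalid character */
    }
}
```

Called repeatedly on the input "x1 = 42;", this returns TOK_IDENT, TOK_ASSIGN, TOK_NUMBER, TOK_SEMI and finally TOK_EOF.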
3. Tokens
• Most languages also insist that identifiers cannot be any of the keywords, or
contain operator symbols (some versions of Fortran don't, making lexical analysis quite
difficult).
• In addition to the basic grouping process, lexical analysis usually performs the
following tasks:
• Since there are only a finite number of types of words, instead of passing the actual
word to the next phase we can save space by passing a suitable representation. This
representation is known as a token.
• If the language isn't case sensitive, we can eliminate differences between case at this
point by using just one token per keyword, irrespective of case; eg. given
  #define IF_TOKEN    1
  #define WHILE_TOKEN 2
  .....
if we meet "IF", "If", "iF", "if" then return IF_TOKEN; if we meet "WHILE", "While",
"WHile", ... then return WHILE_TOKEN.
• We can pick out mistakes in the lexical syntax of the program, such as using a
character which is not valid in the language. (Note that we do not worry about the
combination of patterns; eg. the pattern of characters "+*" would be returned
as PLUS_TOKEN, MULT_TOKEN, and it would be up to the next phase to see that
these should not follow in sequence.)
• We can eliminate pieces of the program that are no longer relevant, such as spaces,
tabs, carriage-returns (in most languages), and comments.
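The first two tasks above (returning a token rather than the word itself, and folding keyword case) can be sketched together in C. The token codes and the two-keyword table here are invented for illustration:

```c
#include <ctype.h>

/* Illustrative token codes, in the style of the #defines above. */
#define IF_TOKEN    1
#define WHILE_TOKEN 2
#define ID_TOKEN    3   /* any word that is not a keyword */

/* Compare two strings ignoring case; returns nonzero if equal. */
static int eq_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++; b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Map a scanned word to its token: "IF", "If", "iF" and "if" all
   become IF_TOKEN, and similarly for WHILE; anything else is
   treated as an identifier. */
int keyword_token(const char *word)
{
    if (eq_nocase(word, "if"))    return IF_TOKEN;
    if (eq_nocase(word, "while")) return WHILE_TOKEN;
    return ID_TOKEN;
}
```

A real scanner would keep the keywords in a table (or hash them) rather than testing each one in turn, but the principle is the same.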
• In order to specify the lexical analysis process, what we need is some method of
describing which patterns of characters correspond to which words.
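The "eliminate irrelevant pieces" task can be sketched as a loop that discards blanks and comments before each word is scanned. This sketch assumes C-style block comments that are properly closed:

```c
/* Advance past spaces, tabs, carriage-returns, newlines and C-style
   block comments, returning a pointer to the next relevant character. */
const char *skip_irrelevant(const char *p)
{
    for (;;) {
        while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
            p++;
        if (p[0] == '/' && p[1] == '*') {            /* comment start */
            p += 2;
            while (*p && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (*p) p += 2;                          /* skip the closing "*" "/" */
        } else {
            return p;                                /* next real character */
        }
    }
}
```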
4. Regular Expressions
• Regular expressions are used to define patterns of characters; they are used in UNIX tools
such as awk, grep, vi and, of course, lex.
• A regular expression is just a form of notation, used for describing sets of words. For any
given set of characters Σ (the alphabet), a regular expression over Σ is defined by:
• The empty string, ε, which denotes a string of length zero, and means ``take nothing from
the input''. It is most commonly used in conjunction with other regular expressions, eg. to
denote optionality.
• Any character in Σ may be used in a regular expression. For instance, if we write a as a
regular expression, this means ``take the letter a from the input''; ie. it denotes the
(singleton) set of words {``a''}.
• The union operator, ``|'', which denotes the union of two sets of words. Thus the regular
expression a|b denotes the set {``a'', ``b''}, and means ``take either the letter a or the
letter b from the input''
• Writing two regular expressions side-by-side is known as concatenation; thus the regular
expression ab denotes the set {``ab''} and means ``take the character a followed by the
character b from the input''.
• The Kleene closure of a regular expression, denoted by ``*'', indicates zero or more
occurrences of that expression. Thus a* is the (infinite) set {ε, ``a'', ``aa'', ``aaa'', ...} and
means ``take zero or more a's from the input''.
• Brackets may be used in a regular expression to enforce precedence or increase clarity.
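These operators combine to describe whole token classes. An identifier, for instance, is "a letter, followed by zero or more letters or digits", which uses concatenation, union and Kleene closure together. POSIX regular expressions (the notation used by grep and lex) write such patterns with character classes; as a sketch, a C program can test a word against that pattern using the POSIX regex library:

```c
#include <regex.h>
#include <stddef.h>

/* Return nonzero if word is an identifier: a letter followed by zero
   or more letters or digits. The ^...$ anchors force the WHOLE word
   to match the pattern, not just part of it. */
int is_identifier(const char *word)
{
    regex_t re;
    if (regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 0;                       /* pattern failed to compile */
    int ok = (regexec(&re, word, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Here `[A-Za-z]` plays the role of the union (a|b|...|z|A|...|Z), and the trailing `*` is the Kleene closure; so "x", "t1" and "count" match, while "1t" and "a+b" do not.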