The document discusses lexical analysis, which is the first phase of a compiler. It scans the source code as input and produces a sequence of tokens as output. A lexical analyzer reads character input and divides it into tokens like keywords, identifiers, and symbols. Regular expressions are used to represent patterns for matching tokens. Operations like concatenation, choice, and repetition are used to define patterns in regular expressions for lexical analysis.
2. Ā» The scanning/lexical analysis phase of a
compiler performs the task of reading the
source program as a file of characters and
dividing up into tokens.
Jeena Thomas, Asst Professor, CSE,
Ā» Usually implemented as subroutine or co-
routine of parser.
Ā» Front end ofcompiler.
2
3. Ā» Each token is a sequence of characters that
represents a unit of information in the source
program.
Jeena Thomas, Asst Professor, CSE,
3
4. Ā» Keywords which are fixed string of letters .eg:
āifā, āwhileā.
Jeena Thomas, Asst Professor, CSE,
Ā» Identifiers which are user-defined
strings composed of letters and
numbers.
Ā» Special symbols like arithmetic
symbols.
4
5. Ā» Scanners perform pattern matching process.
Ā» The techniques used to implement lexical
analyzers can also be applied to other areas
such as query languages and information
retrieval systems.
Jeena Thomas, Asst Professor, CSE,
Ā» Since pattern directed programming is
widely useful, pattern action language
called Lex for specifying lexical
analyzers.
Ā» In lex , patterns are specified by regular
expressions, and a compiler for lex can
generate an efficient finite-automaton
recognizer for the regular expression.
5
6. Ā» A software tool that automates the
construction of lexical analyzers allows
people with different backgrounds to
use pattern matching in their own
areas.
Jeena Thomas, Asst Professor, CSE,
Ā» Jarvis[1976] Lexical analyzer generator to
create a program that recognizes
imperfections in
printed circuit boards.
Ā» The circuits are digitally scanned and
converted into āstringsā of line
segments at different angles.
Ā» The ālexical analyzerā looked for patterns
corresponding to imperfections in the
string of line segments.
6
7. Ā» It can utilize the best-known pattern-
matching algorithms and thereby create
efficient lexical analyzers for people who
are not experts in pattern-matching
techniques.
Jeena Thomas, Asst Professor, CSE,
7
10. Type Examples
ID foo n_14 last
NUM 73 00 517 082
REAL 66.1 .5 10. 1e67 5.5e-10
IF if
COMMA ,
NOTEQ !=
LPAREN (
RPAREN )
11. Type Examples
comment /* ignored */
preprocessor directive #include <foo.h>
#define NUMS 5, 6
macro NUMS
whitespace t n b
12. Ā» Separation of the input source code into
tokens.
Jeena Thomas, Asst Professor, CSE,
Ā» Stripping out the unnecessary white
spaces from the source code.
Ā» Removing the comments from the source
text.
Ā» Keeping track of line numbers while
scanning the new line characters. These line
numbers are used by the error handler to
print the error messages.
Ā» Preprocessing ofmacros.
12
13. Ā» There are several reasons for separating
the analysis phase of compiling into
lexical analysis and parsing:
Jeena Thomas, Asst Professor, CSE,
Ā» It leads to simpler design of the parser as
the unnecessary tokens can be eliminated by
scanne
r.
Ā» Efficiency of the process of compilation is
improved. The lexical analysis phase is most
time consuming phase in compilation. Using
specialized buffering to improve the speed
of compilation.
Ā» Portability of the compiler is enhanced as
the specialized symbols and
13
14. Ā» Connected with lexical analysis are three
important terms with similar meaning.
Jeena Thomas, Asst Professor, CSE,
Ā»
Lexeme
Ā» Token
Ā»
Patterns
14
15. Ā» A token is a pair consisting of a token name
and an optional attribute value. Token
name: Keywords, operators, identifiers, constants,
literal strings, punctuation symbols(such as
commas,semicolons)
Ā» A lexeme is a sequence of characters in
the source program that matches the
pattern for a token and is identified by the
lexical analyzer as an instance of that
token. E.g.Relation
Jeena Thomas, Asst Professor, CSE,
{<.<=,>,>=,==,
<>}
15
16. Ā» A pattern is a description of the form
that the lexemes of token may take.
Jeena Thomas, Asst Professor, CSE,
Ā» It gives an informal or formal
description of a token.
Ā» Eg:identifier
Ā» 2purposes
Ā» Gives a precise description/
specification of tokens.
Ā» Used to automatically generate a
lexical analyzer
16
17. 17
Ā» const pi = 3.1416;
Ā» The substring piis a lexeme for the
token
āidentifier.ā
Jeena Thomas, Asst Professor, CSE,
20. Ā» 1.) let us consider a statement āfi(a==f)ā. Here
āfiā is a misspelled keyword. This error is
not detected in lexical analysis as āfiā is
taken as an identifier. This error is then
detected in other phases of compilation.
Jeena Thomas, Asst Professor, CSE,
Ā» 2.) in case the lexical analyzer is not
able to continue with the process of
compilation, it resorts to panic mode
of error recovery.
ā¢ Deleting the successive characters
from the remaining input until a token is
detected.
ā¢ Deleting extraneous characters.
20
21. ā¢ Inserting missing characters
ā¢ Replacing an incorrect character by a
correct character.
ā¢ Transposing two adjacent characters
Jeena Thomas, Asst Professor, CSE,
21
22. Ā» Is the strategy generally followed by the
lexical analyzer to correct the errors
in the lexemes.
Jeena Thomas, Asst Professor, CSE,
Ā» It is nothing but the minimum number of
the corrections to be made to convert
an invalid lexeme to a valid lexeme.
Ā» But it is not generally used in practice
because it is too costly to implement.
22
24. Ā» Scanners are special pattern
matching processors.
Jeena Thomas, Asst Professor, CSE,
Ā» For representing patterns of strings of
characters, Regular Expressions(RE)
are used.
Ā» A regular expression (r) is defined by
set of strings that matches it.
Ā» This set is called as the language
generated by the regular expression and
is represented as L(r).
Ā» The set of symbols in the language is
called the
alphabet of the language is represented as
24
25. Ā» An alphabet is a finite set of
symbols.
Jeena Thomas, Asst Professor, CSE,
Ā» Example
Ā» A set of alphabetic characters is represented
as L={A,ā¦,Z,a,ā¦,z} and set of digits is
represented as D={0,1,ā¦,9}.
Ā» LUD is alanguage.
Ā» Strings over LUD- Begin,Max1, max1, 123,
ā¬ā¦
25
29. Ā» Indicated by metacharacter ā|ā(vertical
bar)
Jeena Thomas, Asst Professor, CSE,
Ā» r|s
Ā» R.E that matches any string that is matched
either by r or s.
Ā» L(r|s)=L(r) U L(s)
29
30. Ā» Consider
L(r)={a},
Jeena Thomas, Asst Professor, CSE,
Ā» L(s)={b},
Ā» L(t)= {ā¬},
Ā» L(u)= {}.
Ā» What do the following R.E
represent?
Ā» (i)r|s
Ā» (ii)r|t
Ā» (iii)r|u
30
31. Ā»
rs
Jeena Thomas, Asst Professor, CSE,
Ā» It matches any string that is a concatenation
of 2 strings, the first of which matches r
and second of which matches s.
Ā» L(rs) = L(r)
L(s)
31
32. Ā» Consider L(r) ={a}, L(s)={b}, L(t) ={ā¬},
L(u)={ }, L(v)={c}. What do following R.E
represent?
Jeena Thomas, Asst Professor, CSE,
Ā» (i)rs
Ā» (ii)rt
Ā» (iii)ru
Ā» (iv)
(r|s)v
32
33. Ā» Also called Kleene
closureĀ» Represents any finite concatenation of
strings each matches strings from L(r).
Ā» r*
Ā» Let S={a}, then L(a*)={ā¬, a, aa,aaa,ā¦}
Ā» S*={ā¬}USUSSUSSSUā¦.=
Ā»
33
Jeena Thomas, Asst Professor, CSE,
34. Ā» Consider L(r)=
{a},
Jeena Thomas, Asst Professor, CSE,
Ā» L(s)={b}.
Ā» What do the following R.E
represent?
Ā» (i)r*
Ā» (ii) (rs)*
Ā» (iii)(r|s)*
Ā» (iv)(r|ss)*
34
39. Ā» Regular Expression(RE): represents
pattern of string of characters.
Jeena Thomas, Asst Professor, CSE,
Ā» Language (L(r)): set ofstrings
Ā» Alphabet(ā): set of symbols
Ā» Meta character: is a special character (not a
part of ā) used in R.E eg: *, | etc
Ā» Basic R.E: R.E consisting of only one
character
Ā» Regular language: language defined by RE
39