A hearty welcome to III CSE students
Compiler Design
III-I Sem.
By
Padmaja Grandhe
Associate professor
CSE Department
PSCMRCET
Objective
 Understand the basic concepts of compiler design and its different phases, which will be helpful in constructing new tools like LEX, YACC, etc.
TEXT BOOKS:
 Compilers: Principles, Techniques, and Tools, Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, 2nd edition, Pearson, 2007.
 Compiler Design, K. Muneeswaran, Oxford.
 Principles of Compiler Design, 2nd edition, Nandhini Prasad, Elsevier.
REFERENCE BOOKS:
 Compiler Construction: Principles and Practice, Kenneth C. Louden, Cengage.
 Implementations of Compiler, A New Approach to Compilers Including the Algebraic Methods, Yunlin Su, Springer.
UNIT – I
 Introduction: Language processing
 Structure of a compiler
 The evolution of programming languages
 The science of building a compiler
 Applications of compiler technology
 Lexical Analysis:
 The role of the lexical analyzer, Input buffering
 Specification of tokens
 Recognition of tokens
 The lexical-analyzer generator Lex
UNIT –II Syntax Analysis
 The Role of a parser
 Context-free grammars, Writing a grammar
 Top-down parsing, Bottom-up parsing
 Introduction to LR parsers
UNIT –III More Powerful LR parser
 (LR(1), LALR), Using ambiguous grammars,
Error recovery in LR parsers
 Syntax-directed translation: definitions,
 Evaluation order of SDTs, Applications of SDTs.
 Syntax-directed translation schemes.
UNIT – IV
Intermediate Code Generation
 Variants of syntax trees, Three-address code
 Types and declarations
 Translation of expressions
 Type checking.
 Control flow, Backpatching
UNIT – V Code generation
 Stack allocation of space
 Access to nonlocal data on the stack
 Heap management; Code generation:
Issues in the design of a code generator
 The target language, Addresses in the target code
 Basic blocks and flow graphs.
 A simple code generator.
UNIT –VI Code Optimization
 Machine-independent optimization.
 The principal sources of optimization
 Peephole optimization
 Introduction to data-flow analysis.
1.Language Translator
 It is a program that translates a program written in one language into another language.
 Ex:-compiler, assembler, interpreter, linker,
loader and preprocessor.
Language processing System
1.1Preprocessor
 A preprocessor produces input to compilers. Its tasks include:
 Macro processing
 File inclusion: include header files into the program text.
 Language extensions: add capabilities to the language through built-in macros.
 #define, #include
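A small C illustration (the file and macro names are mine, not from the slides) of both kinds of directive; the preprocessor expands the macro and splices the header text in before the compiler proper ever sees the code:

/* area.c -- hypothetical example of preprocessor directives */
#include <stdio.h>              /* file inclusion: the header text is copied into the program */
#define SQUARE(x) ((x) * (x))   /* macro processing: a parameterized shorthand */

int main(void)
{
    /* after preprocessing, the next line becomes printf("%d\n", ((5) * (5))); */
    printf("%d\n", SQUARE(5));
    return 0;
}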
1.2Compiler
 A compiler is software which converts a program written in a high-level language (source language) into a low-level language (object/target/machine language).
 It reports errors detected during translation.
 Examples of compiled languages: Prolog, C, ML, LISP.
1.3Assembler
 An assembler is a program that translates programs written in assembly language into machine language.
1.4. LINK-EDITOR (Linker) &LOADER :
 Usually a large program is divided into smaller subprograms called modules, and these modules must be combined to execute the program. The process of combining the modules is done by the linker.
 High-level languages provide built-in header files or libraries. These libraries are predefined and contain basic functions which are essential for executing a program. References to these functions are resolved by a program called the linker.
 Loader is responsible for loading programs and
libraries. It is one of the essential stages in the
process of executing a program, as it places
programs into memory and prepares them for
execution.
 The final output is absolute machine code, which is loaded into a fixed storage location in memory and may not be relocated during execution of the program.
1.4LINK-EDITOR (Linker) &LOADER Contd..
1.5Examples of Compilers:
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. Fortran compilers
10. Pascal compilers
11. PL/I compilers
12. Java compilers
1.6Interpreter & Compiler
Short answer questions
 What is the purpose of Loader/Linker in
language processing?
 List any 4 compilers and 2 interpreters you
know.
 What is a preprocessor? Mention its
objectives.
 List the Differences between interpreter
and compiler?
1.6 Interpreter and compiler differences
 Compiler: a program which converts the entire source code of a programming language into executable machine code for a CPU. Interpreter: takes a source program and runs it line by line, translating each line as it comes to it.
 Compiler: takes a large amount of time to analyze the entire source code, but the overall execution time of the program is comparatively faster. Interpreter: takes less time to analyze the source code, but the overall execution time of the program is slower.
 Compiler: generates error messages only after scanning the whole program, so debugging is comparatively hard, as the error can be present anywhere in the program. Interpreter: debugging is easier, as it continues translating the program until the error is met.
 Compiler: generates intermediate object code. Interpreter: no intermediate object code is generated.
Questions given in university End Exams
 Describe the functionality of compilers in a
typical language processing system
 What are program translators? Explain.
2.STRUCTURE OF THE COMPILER
DESIGN / PHASES OF A COMPILER
 A compiler operates in phases.
 A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation.
 2 phases of compilation.
 Analysis Phase or Front-end (Machine
Independent / Language Dependent)
 Synthesis Phase or Back-end (Machine
Dependent/ Language independent)
2.1Lexical Analysis Or scanning
 Lexical analysis performed by Lexical analyzer
or Scanner
 It is the first phase of the compiler.
 It gets input from the source program and
produces tokens as output.
 It reads the characters one by one, starting from
left to right and forms the tokens.
 Token : It represents a logically cohesive
sequence of characters ex:-keywords,
operators, identifiers, special symbols etc
 Example: a + b = 20
 a, b,+,=,20 are all separate tokens.
2.1Lexical Analysis contd..
 Group of characters forming a token is
called the Lexeme.
 The lexical analyser not only generates a token but also enters the lexeme and its information into the symbol table, if it is not already there.
 Example: Num1=900 — the lexeme Num1 is entered into the symbol table.
2.2Syntax Analysis:-
 Is performed by syntax analyzer or parser.
 It is the second phase of the compiler. It is also known as
parser.
 It gets the token stream as input from the lexical analyser
of the compiler , analyses the syntactical structure of the
given input using grammar rules specified.
 Reports syntax errors, if any; otherwise generates a parse tree or syntax tree as the output.
 Syntax tree : It is a tree in which interior nodes are
operators and exterior nodes are operands.
2.2Syntax Analysis contd..
 Example: For a = b + c * 2 (tokenized as id1 = id2 + id3 * 2), the syntax tree has = at the root with children <id,1> and +; the + node has children <id,2> and *; the * node has children <id,3> and the constant 2.
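A minimal C sketch (the data-structure and function names are illustrative, not from the text) of how such a syntax tree could be represented and built for id1 = id2 + id3 * 2:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One node of the syntax tree: operators in the interior nodes,
   operands (identifiers, constants) in the leaves. */
struct Node {
    char label[8];              /* "=", "+", "*", "id1", "2", ... */
    struct Node *left, *right;
};

static struct Node *mk(const char *label, struct Node *l, struct Node *r)
{
    struct Node *n = calloc(1, sizeof *n);
    strncpy(n->label, label, sizeof n->label - 1);
    n->left = l;
    n->right = r;
    return n;
}

int main(void)
{
    /* tree for id1 = id2 + id3 * 2 ( * binds tighter than +, + tighter than = ) */
    struct Node *root =
        mk("=", mk("id1", NULL, NULL),
                mk("+", mk("id2", NULL, NULL),
                        mk("*", mk("id3", NULL, NULL), mk("2", NULL, NULL))));
    printf("root operator: %s\n", root->label);   /* prints "=" */
    return 0;
}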
2.3Semantic Analysis
 Semantic Analysis is the third phase of Compiler,
performed by semantic analyzer.
 Semantic Analysis makes sure that declarations and
statements of program are semantically correct.
 Semantic errors are reported.
 Example: c = 0 followed by a = b / c.
 Semantic Analyzer:
It uses syntax tree and symbol table to check whether
the given program is semantically consistent with
language definition. It gathers type information and
stores it in either syntax tree or symbol table.
2.3Semantic Analysis CONTD..
 Semantic Errors: examples
Type mismatch, Undeclared variables,
Reserved identifier misuse
 Functions of Semantic Analysis:
 Type Checking – Ensures that data types are used
in a way consistent with their definition.
 Label Checking – verifies that labels referenced in the program are defined.
 Flow Control Check – Keeps a check that control
structures are used in a proper manner.(example:
no break statement outside a loop)
 Implicit type conversion or coercion:
In A = b + c * 60, if a, b, and c are float, the integer 60 is automatically converted to 60.0.
 The semantic analyzer records this by inserting an inttofloat node into the syntax tree: = is the root with children <id,1> and +; the + node has children <id,2> and *; the * node has children <id,3> and inttofloat(60).
2.4.Intermediate code generation
 Performed by intermediate code generator.
 In the analysis-synthesis model of a compiler, the front end of a compiler translates a source program into a machine-independent intermediate code; the back end of the compiler then uses this intermediate code to generate the target code.
 Commonly used intermediate codes are
 Three address code, postfix notation
2.4.Intermediate code generation contd..
 A statement involving no more than three
references(two for operands and one for result) is
known as three address statement.
 A sequence of three address statements is known as
three address code.
 Ex:- three-address code for A = b + c * 60:
T1 = inttofloat(60)
T2 = <id3> * T1
T3 = <id2> + T2
<id1> = T3
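As a sketch (the struct and field names are mine), three-address statements are often stored as quadruples, one record per statement:

#include <stdio.h>

struct Quad {
    const char *op;      /* operator: "*", "+", "inttofloat", "=" */
    const char *arg1;    /* first operand */
    const char *arg2;    /* second operand, or "" if unused */
    const char *result;  /* temporary or identifier receiving the value */
};

int main(void)
{
    /* the four three-address statements for A = b + c * 60 shown above */
    struct Quad code[] = {
        { "inttofloat", "60",  "",   "T1"  },
        { "*",          "id3", "T1", "T2"  },
        { "+",          "id2", "T2", "T3"  },
        { "=",          "T3",  "",   "id1" },
    };
    for (int i = 0; i < 4; i++)
        printf("%-10s %-4s %-4s %-4s\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}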
2.5. Code Optimization:-
 It is the fifth phase of the compiler.
 It gets the intermediate code as input and produces
optimized intermediate code as output.
 This phase reduces the redundant code and attempts to
improve the intermediate code so that faster-running
machine code will result.
 During the code optimization, the result of the program is not
affected.
 To improve the code generation, the optimization involves
- detection and removal of dead code (unreachable code)
- calculation of constant expressions and terms (constant folding)
- collapsing of repeated expressions into a temporary variable (common subexpression elimination)
- loop unrolling
- moving loop-invariant code outside the loop
- removal of unwanted temporary variables.
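A tiny hypothetical C example of two of the transformations listed above (constant folding and dead-code removal); the observable result of the function is unchanged:

/* before optimization */
int scale(int x)
{
    int t = 2 * 60;          /* constant expression */
    int unused = x + 1;      /* dead code: the value is never used */
    return x * t;
}

/* what a compiler could produce after constant folding and
   dead-code elimination */
int scale_optimized(int x)
{
    return x * 120;
}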
2.6. Code Generation:-
 It is the final phase of the compiler.
 It gets input from code optimization phase and
produces the target code or object code as result.
 Intermediate instructions are translated into a
sequence of machine instructions that perform the
same task.
 The code generation involves
- allocation of registers and memory
- generation of correct references
- generation of correct data types
- generation of missing code.
2.7. Symbol Table Management (or) Book-
keeping:-
 Symbol table is used to store all the information
about identifiers used in the program.
 It is a data structure containing a record for
each identifier, with fields for the attributes of
the identifier.
 It allows the compiler to find the record for each identifier quickly and to store or retrieve data from that record.
 Whenever an identifier is detected in any of the
phases, it is stored in the symbol table.
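A minimal C sketch (all names illustrative) of such a data structure: one record per identifier with fields for its attributes, plus lookup and insert operations; a real compiler would normally hash on the lexeme rather than search linearly:

#include <string.h>

struct SymEntry {
    char lexeme[32];      /* the identifier's spelling */
    char type[16];        /* e.g. "int", "float" */
    int  scope_level;     /* block nesting depth of the declaration */
    int  line_declared;   /* source line of the declaration */
};

#define SYMTAB_SIZE 256
static struct SymEntry symtab[SYMTAB_SIZE];
static int symcount = 0;

/* return the record index for a lexeme, or -1 if it is not in the table */
static int lookup(const char *lexeme)
{
    for (int i = 0; i < symcount; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return i;
    return -1;
}

/* insert the identifier only if it is not already there */
static int insert(const char *lexeme, const char *type, int scope, int line)
{
    int i = lookup(lexeme);
    if (i >= 0 || symcount == SYMTAB_SIZE)
        return i;
    strncpy(symtab[symcount].lexeme, lexeme, sizeof symtab[symcount].lexeme - 1);
    strncpy(symtab[symcount].type, type, sizeof symtab[symcount].type - 1);
    symtab[symcount].scope_level   = scope;
    symtab[symcount].line_declared = line;
    return symcount++;
}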
2.8.Error Handlers:-
 Each phase can encounter errors. After detecting an
error, a phase must handle the error so that
compilation can proceed.
 In lexical analysis, errors occur in separation of
tokens.
 In syntax analysis, errors occur during construction
of syntax tree.
 In semantic analysis, errors occur when the
compiler detects constructs with right syntactic
structure but no meaning and during type
conversion.
 In code optimization, errors occur when the result is
affected by the optimization.
 In code generation, errors occur when the target code cannot be generated correctly, for example when a value is too large to be represented on the target machine.
 Position=initial+rate*60
 Output of lexical analyzer is
<id1> = <id2> + <id3> * 60
 Output of syntax analyzer is a syntax tree: = at the root with children <id1> and +; the + node has children <id2> and *; the * node has children <id3> and 60.
 Output of semantic analyzer is the same tree with an inttofloat node inserted above 60.
 Output of intermediate code generation is
T1 = inttofloat(60)
T2 = id3 * T1
T3 = id2 + T2
id1 = T3
 Output of code optimizer: the four intermediate-code statements are reduced to two; inttofloat(60) is folded to the constant 60.0 and the redundant temporaries are eliminated:
T1 = <id3> * 60.0
<id1> = <id2> + T1
 Output of Code Generator is
 LDF destination, source --- loads the given floating-point source value into the destination register
 STF destination, source --- stores the given source value into the destination memory location
 ADDF result, op1, op2 --- result = op1 + op2
 For operations other than load and store, operands must be in registers (or be immediate values).
 LDF r1,id2
 LDF r2,id3
 MULF r2,r2,60.0
 ADDF r1,r1,r2
 STF id1,r1
Assignment
 Conversion of infix to postfix notation
process
 Differentiate between token, lexeme and
pattern with examples.
 What happens in Analysis and Synthesis
phases of compilation?
 What is the key difference between lexical
analysis and parsing?
 Pattern: a rule associated with a token that describes the set of strings the token can match.
Example: for the input Pi —
 Lexeme: Pi
 Token: <id,1>
 Pattern: (letter)(letter|digit)*
3.Applications of Compiler Technology
1. Implementation of High-Level
Programming Languages
2. Optimizations for Computer Architectures
3. Design of New Computer Architectures
4. Program Translations
5. Software Productivity Tools
3.1.Implementation of High-Level Programming
Languages
 A high-level programming language defines a programming
abstraction: the programmer expresses an algorithm using the
language, and the compiler must translate that program to the
target language.
 Higher-level programming languages are easier to program in, but are less efficient; that is, the target programs run more slowly. Programs in low-level languages give more control over a computation and can, in principle, be more efficient.
 lower-level programs are harder to write and — worse still —
less portable, more prone to errors, and harder to maintain.
 Optimizing compilers include techniques to improve the
performance of generated code, thus offsetting the inefficiency
introduced by high-level abstractions.
 C — 1970, C++ — 1985, Java — 1995: as these languages appeared, data-flow optimizations were introduced in compilers to make their programs run efficiently.
 Object orientation was first introduced in Simula in 1967, and has been incorporated in languages such as Smalltalk, C++, C#, and Java. The key ideas behind object orientation are
 Data abstraction and
 Inheritance of properties,
which make programs easier to write through reusability and modularity, and which rely on compiler optimization for good performance.
3.2.Optimizations for Computer
Architectures
 The rapid evolution of computer architectures has
also led to demand for new compiler technology.
Almost all high-performance systems take advantage
of the same two basic techniques:
 parallelism and memory hierarchies.
 Parallelism can be found at several levels:
 At the instruction level, where multiple
operations are executed simultaneously
 at the processor level, where different threads
of the same application are run on different
processors.
 Memory hierarchies are a response to the basic limitation that we can build very fast storage or very large storage, but not storage that is both fast and large.
3.3Design of New Computer Architectures
 In the early days of computer architecture design,
compilers were developed after the machines were
built. That has changed.
 Since programming in high-level languages is the norm, the performance of a computer system is determined not only by its raw speed but also by how well compilers can exploit its features.
 Thus, in modern computer architecture development,
compilers are developed in the processor-design stage,
and compiled code, running on simulators, is used to
evaluate the proposed architectural features.
 RISC: Reduced Instruction Set Computer; CISC: Complex Instruction Set Computer (e.g., x86)
 Most general-purpose processor architectures,
including PowerPC, SPARC, MIPS, Alpha, and PA-RISC,
are based on the RISC concept.
3.4. Program Translations
 While we normally think of compiling as a
translation from a high-level language to the
machine level, the same technology can be applied
to translate between different kinds of languages.
The following are some of the important
applications of program-translation techniques.
 Binary Translation
 Hardware Synthesis
 Database Query Interpreters
 Compiled Simulation
Binary translations
 Compiler technology can be used to translate the
binary code for one machine to that of another,
allowing a machine to run programs originally
compiled for another instruction set. Binary
translation technology has been used by various
computer companies to increase the availability of
software for their machines.
 Transmeta Crusoe processor is a VLIW
processor that relies on binary
translation to convert x86 code into
native VLIW code.
 Binary translation can also be used to provide backward compatibility. When the processor in the Apple Macintosh was changed from the Motorola MC 68040 to the PowerPC, binary translation was used to allow PowerPC processors to run legacy MC 68040 code.
Hardware Synthesis
 Hardware designs are mostly described in
high-level hardware description languages like
Verilog and VHDL (Very high-speed integrated
circuit Hardware Description Language).
 Hardware designs are typically described at the
register transfer level (RTL), where variables
represent registers and expressions represent
combinational logic. Hardware-synthesis tools
translate RTL descriptions automatically into
gates, which are then mapped to transistors
and eventually to a physical layout.
Database Query Interpreters
 Database query languages
 For example, query languages, especially
SQL (Structured Query Language), are used
to search databases. Database queries
consist of predicates containing relational
and boolean operators.
 They can be interpreted or compiled into
commands to search a database for records
satisfying that predicate.
Compiled Simulation
 Simulation is a general technique used in many
scientific and engineering disciplines to understand a
phenomenon or to validate a design. Inputs to a
simulator usually include the description of the
design and specific input parameters for that
particular simulation run. Simulations can be very expensive; each experiment may take days to complete on a high-performance machine.
 Instead of writing a simulator that interprets the
design, it is faster to compile the design to produce
machine code that simulates that particular design
natively. Compiled simulation can run orders of
magnitude faster than an interpreter-based
approach.
 Compiled simulation is used in many state-of-the-art EDA (electronic design automation) tools.
3.5. Software Productivity Tools
 Errors are rampant in programs; errors may crash a
system, produce wrong results, render a system
vulnerable to security attacks, or even lead to
catastrophic failures in critical systems. Testing is the
primary technique for locating errors in programs.
 An interesting and promising complementary
approach is to use data-flow analysis to locate errors
statically
 The problem of finding all program errors is
undecidable. A data flow analysis may be designed to
warn the programmers of all possible statements
violating a particular category of errors.
Compiler-based program-analysis tools include
 Type Checking (e.g., checking the operand types in A = b + c)
 Bounds Checking (array index out of bounds, e.g. A[50]; buffer overflow)
 Memory-Management Tools (garbage detection, memory leaks)
4.The Evolution of Programming Languages
 The first electronic computers appeared in the 1940's
and were programmed in machine language by
sequences of 0's and 1's that explicitly told the
computer what operations to execute and in what
order.
 The operations themselves were very low level: move
data from one location to another, add the contents of
two registers, compare two values, and so on.
 This kind of programming was slow, tedious, and
error prone. And once written, the programs were
hard to understand and modify.
4.1.The Move to Higher-level Languages
 mnemonic assembly languages in the early
1950's. Initially, the instructions in an
assembly language were just mnemonic
representations of machine instructions.
 Later, macro instructions were added to
assembly languages so that a programmer
could define parameterized shorthands for
frequently used sequences of machine
instructions.
 A major step towards higher-level
languages was made in the latter half of
the 1950's with the development of
 Fortran for scientific computation,
 Cobol for business data processing,
 Lisp for symbolic computation.
 Programmers could more easily write
numerical computations, business
applications, and symbolic programs.
These languages were so successful that
they are still in use today.
 Today, there are thousands of programming
languages. They can be classified in a variety of
ways. One classification is by generation.
 First-generation languages are the machine
languages
 second-generation the assembly languages
 third-generation the higher-level languages like Fortran, Cobol, Lisp, C, C++, C#, and Java.
 Fourth-generation languages are languages designed
for specific applications like NOMAD for report
generation, SQL for database queries, and Postscript
for text formatting.
 The term fifth-generation language has been applied
to logic- and constraint-based languages like Prolog
and OPS5.
 Another classification of languages uses the term
 imperative for languages in which a program
specifies how a computation is to be done
 declarative for languages in which a program
specifies what computation is to be done.
 Languages such as C, C++, C#, and Java are imperative languages. In imperative languages
there is a notion of program state and statements
that change the state.
 Functional languages such as ML and Haskell
and constraint logic languages such as Prolog are
often considered to be declarative languages.
 An object-oriented language is one that supports object-
oriented programming, a programming style in which a
program consists of a collection of objects that interact
with one another.
 Simula 67 and Smalltalk are the earliest major object-
oriented Languages.
 Languages such as C++, C#, Java, and Ruby are more recent object-oriented languages.
 Scripting languages are interpreted languages with high-
level operators designed for "gluing together"
computations.
 These computations were originally called "scripts." Awk, JavaScript, Perl, PHP, Python, Ruby, and Tcl are popular examples of scripting languages. Programs written in
scripting languages are often much shorter than
equivalent programs written in languages like C.
5. The Science of Building a Compiler
 A compiler must accept all source programs
that conform to the specification of the
language
 the set of source programs is infinite and any
program can be very large, consisting of
possibly millions of lines of code.
 Any transformation performed by the
compiler while translating a source program
must preserve the meaning of the program
being compiled.
 These requirements make compiler development challenging.
5.1. Modelling in Compiler Design and Implementation
 The study of compilers is mainly a study of how we
design the right mathematical models and choose the
right algorithms, while balancing the need for generality
and power against simplicity and efficiency.
 The fundamental models are finite-state machines and
regular expressions. These models are useful for
describing the lexical units of programs (keywords,
identifiers, and such) and for describing the algorithms
used by the compiler to recognize those units.
 Other models, such as context-free grammars, are used to describe the syntactic structure of programming languages,
ex:- the nesting of parentheses or control constructs.
 Similarly, trees are an important model for representing the structure of programs and their translation into object code.
5.2. The Science of Code Optimization
 The term "optimization" in compiler design refers to
the attempts that a compiler makes to produce code
that is more efficient than the obvious code.
Compiler optimizations must meet the following
design objectives:
 The optimization must be correct, that is, preserve the meaning of the compiled program.
 The optimization must improve the performance of many programs.
 The compilation time must be kept reasonable.
 The engineering effort required must be manageable.
 The second goal is that the compiler must be
effective in improving the performance of many
input programs. Normally, performance means
the speed of the program execution. Other factors, such as the size of the generated code and its power consumption, also matter in some settings.
 Third, we need to keep the compilation time
short to support a rapid development and
debugging cycle.
 Finally, a compiler is a complex system. we must
keep the system simple to assure that the
engineering and maintenance costs of the
compiler are manageable.
 There is an infinite number of program
optimizations that we could implement, and it
takes a nontrivial amount of effort to create a
correct and effective optimization. We must
prioritize the optimizations, implementing only
those that lead to the greatest benefits on source
programs encountered in practice.
Lexical Analysis-:
 The role of the lexical analyzer, Input buffering
 Specification of tokens
 Recognition of tokens
 The lexical-analyzer generator Lex
6. The Role of the Lexical Analyzer and Input Buffering
 The main task of the lexical analyzer is to read the
input characters of the source program, group them
into lexemes, and produce as output a sequence of
tokens for each lexeme in the source program.
 The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well.
 When the lexical analyzer discovers a lexeme
constituting an identifier, it needs to enter that
lexeme into the symbol table.
 In some cases, information regarding the kind of
identifier may be read from the symbol table by the
lexical analyzer to assist it in determining the proper
token it must pass to the parser.
 The parser calls the lexical analyzer with a getNextToken command; this causes the lexical analyzer to read characters from its input until it can identify the next lexeme, produce a token for it, and return the token to the parser.
 Scanning & generation of tokens
 The lexical analyzer performs certain other tasks besides identification of lexemes:
 stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input);
 correlating error messages generated by the compiler with
the source program. For instance, the lexical analyzer may
keep track of the number of newline characters seen, so it
can associate a line number with each error message. In some compilers, the lexical analyzer makes a copy of the source program with the error messages inserted at the appropriate positions.
 Example source lines containing errors:
1. #inclue<
2. int abc
3. A=b+c
The role of lexical analysis buffering contd..
1. Lexical Analysis Versus Parsing
2. Tokens, Patterns, and Lexemes
3. Attributes for Tokens
4. Lexical Errors
1. Lexical Analysis Versus Parsing
There are a number of reasons why lexical analysis and parsing are separated:
 1. Simplicity of design is the most important
consideration. The separation of lexical and syntactic
analysis often allows us to simplify at least one of
these tasks. If we are designing a new language,
separating lexical and syntactic concerns can lead to
a cleaner overall language design.
 2. Compiler efficiency is improved. A separate lexical
analyzer allows us to apply specialized techniques
that serve only the lexical task, not the job of
parsing. In addition, specialized buffering
techniques for reading input characters can speed
up the compiler significantly.
6.2.Tokens, Patterns, and Lexemes
 Token : It represents a logically cohesive
sequence of characters ex:-keywords, operators,
identifiers, special symbols etc
 Group of characters forming a token is called the
Lexeme.
 Pattern: is a rule associated with token that
describes the set of strings
 Example: in num1 = b + c, the lexemes num1, b, and c match the identifier pattern, and the statement is tokenized as <id,1> = <id,2> + <id,3>. (A string such as 1awe does not match the identifier pattern, since it begins with a digit.)
6.3. Attributes for Tokens
 When more than one lexeme can match a pattern, the lexical
analyzer must provide the subsequent compiler phases
additional information about the particular lexeme that
matched.
 For example, the pattern for token number matches both 0 and
1, but it is extremely important for the code generator to know
which lexeme was found in the source program. Thus, in many
cases the lexical analyzer returns to the parser not only a token
name, but an attribute value that describes the lexeme
represented by the token; the token name influences parsing
decisions, while the attribute value influences translation of
tokens after the parse.
 Normally, information about an identifier — e.g., its lexeme, its
type, and the location at which it is first found (in case an error
message about that identifier must be issued) — is kept in the
symbol table.
 Thus, the appropriate attribute value for an identifier is a
pointer to the symbol-table entry for that identifier.
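A possible C representation (the names are illustrative) of a token whose attribute is a pointer to the symbol-table entry, as described above:

/* the symbol-table record type is defined elsewhere */
struct SymEntry;

enum TokenName { TOK_ID, TOK_NUMBER, TOK_RELOP, TOK_ASSIGN };

struct Token {
    enum TokenName name;             /* influences parsing decisions */
    union {                          /* influences translation after the parse */
        struct SymEntry *entry;      /* for TOK_ID: pointer to its symbol-table entry */
        long             num_value;  /* for TOK_NUMBER: the numeric value */
        int              relop_kind; /* for TOK_RELOP: which relational operator */
    } attribute;
};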
6.4. Lexical Errors
 fi ( a == f ( x ) ) . . .
 a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler — probably the parser in this case — handle an error due to transposition of the letters.
 However, suppose a situation arises in which the lexical
analyzer is unable to proceed because none of the patterns
for tokens matches any prefix of the remaining input.
 The simplest recovery strategy is "panic mode" recovery. We
delete successive characters from the remaining input, until
the lexical analyzer can find a well-formed token at the
beginning of what input is left. This recovery technique may
confuse the parser, but in an interactive computing
environment it may be quite adequate.
 Other possible error-recovery actions are:
 Delete one character from the remaining input.
 Insert a missing character into the remaining input.
 Replace a character by another character.
 Transpose two adjacent characters.
 Transformations like these may be tried in an attempt to
repair the input.
 The simplest such strategy is to see whether a prefix of the
remaining input can be transformed into a valid lexeme by
a single transformation. This strategy makes sense, since in
practice most lexical errors involve a single character. A
more general correction strategy is to find the smallest
number of transformations needed to convert the source
program into one that consists only of valid lexemes, but
this approach is considered too expensive in practice to be
worth the effort.
7.Input Buffering
7.1. Buffer Pairs
 An important scheme involves two buffers (or one buffer divided into two halves) that are alternately reloaded, as suggested in Fig. 3.3.
 Each buffer is of the same size N, and N is usually the size of a
disk block, e.g., 4096 bytes.
 Using one system read command we can read N characters
into a buffer, rather than using one system call per character.
 If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file.
 Two pointers to the input are maintained:
 Pointer lexemeBegin, marks the beginning of the
current lexeme, whose extent we are attempting to
determine. Pointer forward scans ahead until a pattern
match is found.
 Once the next lexeme is determined, forward is set to
the character at its right end. Then, after the lexeme is
recorded as an attribute value of a token returned to
the parser, lexemeBegin is set to the character
immediately after the lexeme just found.
 In Fig. 3.3, we see forward has passed the end of the
next lexeme, ** (the Fortran exponentiation operator),
and must be retracted one position to its left.
 Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
Reference: https://www.slideshare.net/dattatraygandhmal/input-buffering
Example: with the input abc=pqr*xyz spread across the two buffer halves (each of N bytes), lexemeBegin points at 'a' while forward scans ahead; when '=' is seen, abc is returned as an id token, then '=' as an operator token, and scanning continues into the second half (q r * x y ...).
2. Sentinels
 We can combine the buffer-end test with the
test for the current character if we extend
each buffer to hold a sentinel character at the
end. The sentinel is a special character that
cannot be part of the source program, and a
natural choice is the character eof.
 Figure 3.4 shows the same arrangement as
Fig. 3.3, but with the sentinels added. Note
that eof retains its use as a marker for the
end of the entire input. Any eof that appears
other than at the end of a buffer means that
the input is at an end.
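A C sketch of the sentinel scheme under stated assumptions: the sentinel is '\0' (assumed never to occur in the source text), lexemeBegin handling and retraction are omitted, and the buffer layout follows the description above:

#include <stdio.h>

#define N 4096                 /* size of each buffer half (e.g. one disk block) */
#define SENTINEL '\0'          /* assumed not to appear in the source program */

static char buf[2 * N + 2];    /* two halves, each followed by one sentinel slot */
static char *forward = buf;
static FILE *src;

/* read up to N characters into one half and plant the sentinel after them */
static void reload(char *half)
{
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

/* return the next source character, swapping halves at a half-end sentinel;
   a sentinel anywhere else marks the real end of the input */
static int next_char(void)
{
    char c = *forward++;
    if (c != SENTINEL)
        return (unsigned char)c;
    if (forward == buf + N + 1) {            /* end of first half reached  */
        reload(buf + N + 1);
        forward = buf + N + 1;
    } else if (forward == buf + 2 * N + 2) { /* end of second half reached */
        reload(buf);
        forward = buf;
    } else {
        return EOF;                          /* end of the entire input    */
    }
    return next_char();
}

int main(void)
{
    src = stdin;               /* assumption: the source program arrives on stdin */
    reload(buf);               /* prime the first half */
    long count = 0;
    while (next_char() != EOF)
        count++;
    printf("%ld characters read\n", count);
    return 0;
}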
8.Specification of Tokens
 1. Strings and Languages
 2. Operations on Languages
 3. Regular Expressions
 4. Regular Definitions
1. Strings and Languages
 An alphabet is any finite set of symbols. Typical examples
of symbols are letters, digits, and punctuations. The set
{0,1} is the binary alphabet.
 The following string-related terms are commonly used:
 A prefix of string s is any string obtained by
removing zero or more symbols from the end
of s. For example, ban, banana, and e are prefixes
of banana.
 A suffix of string s is any string obtained by
removing zero or more symbols from the
beginning of s. For example, nana, banana, and e
are suffixes of banana.
 A substring of s is obtained by deleting any prefix and any suffix from s.
 The proper prefixes, suffixes, and substrings of a
string s are those, prefixes, suffixes, and substrings,
respectively, of s that are not e or not equal to s itself.
 A subsequence of s is any string formed by deleting zero or more
not necessarily consecutive positions of s. For example, baan is a
subsequence of banana.
 If x and y are strings, then the concatenation of x and y,
denoted xy, is the string formed by appending y to x.
 For example, if
 x = dog and
 y = house, then
 xy = doghouse.
 The empty string is the identity under concatenation; that is, for any string s, es = se = s.
2. Operations on Languages
 operations on languages are union, concatenation, and
closure,
 Union is the familiar operation on sets.
 The concatenation of languages is all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them.
 The (Kleene)closure of a language L, denoted L*, is the
set of strings you get by concatenating L zero or more
times.
 Finally, the positive closure, denoted L+, is the same as the Kleene closure, but without the term L^0. That is, e will not be in L+ unless it is in L itself.
Example 3.3: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. We may think of L and D in two, essentially equivalent, ways.
 L U D is the language with 62 strings of length one, each of which is either one letter or one digit.
 LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
 L^4 is the set of all 4-letter strings.
 L* is the set of all strings of letters, including e, the empty string.
 L(L U D)* is the set of all strings of letters and digits beginning with a letter.
 D+ is the set of all strings of one or more digits.
Regular Expression
 The language accepted by finite automata can
be easily described by simple expressions called
Regular Expressions.
 The languages accepted by some regular
expression are referred to as Regular languages.
 A regular expression can also be described as a sequence of patterns that defines a set of strings.
 Regular expressions are used to match character combinations in strings; string-searching algorithms use these patterns to locate matching substrings.
INDUCTION:
 There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
 1. (r)|(s) is a regular expression denoting the language L(r) U L(s).
 2. (r)(s) is a regular expression denoting the language L(r)L(s).
 3. (r)* is a regular expression denoting (L(r))*.
 4. (r) is a regular expression denoting L(r); extra pairs of parentheses may be placed around an expression without changing the language it denotes.
Regular Definitions
 For notational convenience, we may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
 d1 -> r1, d2 -> r2, ..., dn -> rn (for example d1 -> a, d2 -> jk, d3 -> a*)
 Each di is a new symbol, not in Σ and not the same as any other of the d's, and each ri is a regular expression over the symbols in Σ U {d1, ..., di-1}.
 C identifiers are strings of letters, digits,
and underscores. Here is a regular
definition for the language of C identifiers.
We shall conventionally use italics for the
symbols defined in regular definitions.
 letter -> A | B | ... | Z | a | b | ... | z | _
 digit -> 0 | 1 | 2 | ... | 9
 id -> letter ( letter | digit )*
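For experimentation, essentially the same pattern can be tested from C with POSIX regular expressions (a sketch; the sample strings are mine):

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    /* anchored form of letter ( letter | digit )*, with letter including '_' */
    regcomp(&re, "^[A-Za-z_][A-Za-z0-9_]*$", REG_EXTENDED | REG_NOSUB);

    const char *samples[] = { "num1", "_tmp", "1awe", "rate60" };
    for (int i = 0; i < 4; i++)
        printf("%-8s %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "identifier" : "no match");

    regfree(&re);
    return 0;
}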
Write the regular definition for representing an (unsigned integer) number:
 digit -> [0-9]
 number -> digit+
Write the regular definition for representing a floating-point number such as 95.89 or 0.987:
 fnumber -> number . number
(A form such as .987 must be written as 0.987 to match this definition.)
 Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular definition:
 digit -> 0 | 1 | 2 | 3 | ... | 9
 number -> digit digit*
 optfrac -> . number | e
 optexp -> E ( + | - | e ) number | e
 unum -> number optfrac optexp
Recognition of Tokens
 Transition diagrams
 have a collection of nodes or circles, called states.
 Each state represents a condition that could occur
during the process of scanning the input looking
for a lexeme that matches one of several patterns
 Initial state, Final and non final states
 Final state-accept state
 Edges are directed from one state of the transition
diagram to another. Each edge is labelled by a
symbol or set of symbols.
[Transition diagram for relational operators: from the initial state, '<' leads to final states accepting <, <=, and <>; '=' leads to a final state accepting =; '>' leads to final states accepting > and >=. Example input: a<b.]
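The diagram translates directly into code; a C sketch (token names and the function are mine) that recognizes a relational operator at the start of the input and reports how many characters it consumed (retracting one position corresponds to consuming only one character):

#include <stdio.h>

enum relop { LT, LE, NE, EQ, GT, GE, NOT_RELOP };

static enum relop relop_token(const char *in, int *consumed)
{
    switch (in[0]) {
    case '<':
        if (in[1] == '=') { *consumed = 2; return LE; }
        if (in[1] == '>') { *consumed = 2; return NE; }
        *consumed = 1; return LT;      /* accept '<' and retract one character */
    case '=':
        *consumed = 1; return EQ;
    case '>':
        if (in[1] == '=') { *consumed = 2; return GE; }
        *consumed = 1; return GT;      /* accept '>' and retract one character */
    default:
        *consumed = 0; return NOT_RELOP;
    }
}

int main(void)
{
    int n;
    enum relop t = relop_token("<=b", &n);    /* for a<b, the '<' alone gives LT */
    printf("token %d, %d character(s) consumed\n", (int)t, n);
    return 0;
}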
The Lexical-Analyzer Generator Lex Tool
 Use of Lex
 Structure of Lex Programs
 Conflict Resolution in Lex
 The Lookahead Operator
1. Use of Lex
A Lex source program (e.g. lex.l) is run through the Lex compiler to produce a C program lex.yy.c, which is compiled by the C compiler into an executable lexical analyzer; that executable reads a stream of input characters and produces a stream of tokens.
Structure of Lex Programs
 Lex program is separated into three
sections by %% delimiters. The formal of
Lex source is as follows:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
 Definitions include declarations of constant, variable
and regular definitions.
 Rules are statements of the form
p1 {action1} p2 {action2} .... pn {actionn}
where each pi is a regular expression (pattern) and each actioni describes what the lexical analyzer should do when pattern pi matches a lexeme.
 User subroutines are auxiliary procedures needed by the actions. These subroutines can be compiled separately and loaded with the lexical analyzer.
Conflict Resolution in Lex
The two rules that Lex uses to decide on the proper
lexeme to select, when several prefixes of the input
match one or more patterns:
 Rule 1: Always prefer a longer prefix to a shorter prefix.
Ex:- for the input >=b+c, Lex matches the two-character prefix >= (relational operator GE) rather than > alone.
 Rule 2: If the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex program.
Ex:- the lexeme if matches both the keyword pattern and the identifier pattern; the keyword rule, being listed first, wins.
4. The Lookahead Operator
 Lex automatically reads one character ahead
of the last character that forms the selected
lexeme, and then retracts the input so only the
lexeme itself is consumed from the input.
 However, sometimes, we want a certain
pattern to be matched to the input only when
it is followed by a certain other characters. If
so, we may use the slash in a pattern to
indicate the end of the part of the pattern that
matches the lexeme.
 Example: a rule such as [A-Za-z0-9]+ {i++;} (used in the word-count program below) increments i once per alphanumeric run; for the input Rthjj 6878jjnkjk it counts two words.
/* Lex program to count the number of words */
%{
#include <stdio.h>
#include <string.h>
int i = 0;
%}
/* Rules section: count each alphanumeric run as one word and
   print the running count at every newline */
%%
([a-zA-Z0-9])+   { i++; }
\n               { printf("%d\n", i); i = 0; }
%%
int yywrap(void) { return 1; }   /* 1 means: no more input to process */
int main()
{
    yylex();        /* the function that starts the analysis */
    return 0;
}
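 Assuming the usual Lex/Flex toolchain, the program above is translated with lex (or flex) into lex.yy.c, which is then compiled with a C compiler (linking the Lex library, -ll or -lfl, if its support routines are needed); the resulting executable reads its input from stdin and prints a word count at each newline.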
 Function yywrap is called by lex when input
is exhausted. Return 1 if you are done or 0 if
more processing is required.
 Every C program requires a main function. In this case we simply call yylex, which is the main entry point for Lex.
/* Lex program to identify and count positive and negative numbers */
%{
int positive_no = 0, negative_no = 0;
%}
%%
^[-][0-9]+  { negative_no++;
              printf("negative number = %s\n", yytext); }
[0-9]+      { positive_no++;
              printf("positive number = %s\n", yytext); }
%%
/*** user code section ***/
int yywrap() { return 1; }
int main()
{
    yylex();
    printf("number of positive numbers = %d, "
           "number of negative numbers = %d\n",
           positive_no, negative_no);
    return 0;
}
yytext holds the text matched by the current
token.
 Regular expression to denote any string starting with a, where Σ = {a, b}:
 a(a|b)*
 Regular expression to denote any string ending with a, where Σ = {a, b}:
 (a|b)*a
 Regular expression to denote any string beginning with a and ending with b, where Σ = {a, b}:
 a(a|b)*b
 All strings of lowercase letters
[a-z]*
 All strings of lowercase letters in which the letters are in ascending lexicographic order:
 a*b*c*d* ... z*
 All strings of a's and b's with an even number of a's:
b*(ab*ab*)*
 All strings of lowercase letters that contain the five vowels in order:
cons -> [b-d f-h j-n p-t v-z]
string -> cons* a (cons|a)* e (cons|e)* i (cons|i)* o (cons|o)* u (cons|u)*
Describe the languages denoted by the following regular
expressions:
(i) (a|b)*a(a|b)(a|b)
(ii) a*ba*ba*ba*
 With a suitable transition diagram, explain
recognition of keywords and identifiers.

ppt_cd.pptx ppt on phases of compiler of jntuk syllabus

  • 1.
  • 2.
    Compiler Design III-I Sem. By PadmajaGrandhe Associate professor CSE Department PSCMRCET
  • 3.
    Objective  Understand thebasic concept of compiler design, and its different phases which will be helpful to construct new tools like LEX, YACC, etc
  • 4.
    TEXT BOOKS:  Compilers,Principles Techniques and Tools. Alfred V Aho, Monical S. Lam, Ravi Sethi Jeffery D. Ullman,2nd edition,pearson,2007  Compiler Design K.Muneeswaran, OXFORD  Principles of compiler design,2nd edition, Nandhini Prasad, Elsebier.
  • 5.
    REFERENCE BOOKS:  CompilerConstruction, Principles and practice, Kenneth C Louden, CENGAGE  Implementations of Compiler, A New approach to Compilers including the algebraic methods, Yunlinsu ,SPRINGER
  • 6.
    UNIT – I Introduction Language Processing,  Structure of a compiler  the evaluation of Programming language  The Science of building a Compiler  application of Compiler Technology.  Lexical Analysis-:  The role of lexical analysis buffering,  specification of tokens.  Recognitions of tokens  the lexical analyzer generator lex tool.
  • 7.
    UNIT –II SyntaxAnalysis  The Role of a parser  Context free Grammars Writing A grammar  Top down passing bottom up parsing  Introduction Lr Parser
  • 8.
    UNIT –III MorePowerful LR parser  (LR1, LALR) Using Ambiguous Grammars Equal Recovery in Lr parser  Syntax Directed Transactions Definition,  Evolution order of SDTS Application of SDTS.  Syntax Directed Translation Schemes.
  • 9.
    UNIT – IV IntermediatedCode Generation  Variants of Syntax trees 3 Address code  Types and Deceleration  Translation of Expressions  Type Checking.  Controlled Flow Back patching
  • 10.
    UNIT – VCode generation  Stack allocation of space  access to Non Local data on the stack  Heap Management code generation – Issues in design of code generation  the target Language Address in the target code  Basic blocks and Flow graphs.  A Simple Code generation.
  • 11.
    UNIT –VI CodeOptimization  Machine Independent Optimization.  The principle sources of Optimization  peep hole Optimization  Introduction to Data flow Analysis.
  • 13.
    1.Language Translator  Itis a program that translates one language to another Language.  Ex:-compiler, assembler, interpreter, linker, loader and preprocessor.
  • 14.
  • 15.
    1.1Preprocessor  A preprocessorproduce input to compilers.  Macro processing  File inclusion: include header files into the program text.  Language Extensions:- add capabilities to the language by certain amounts to build-in macro  #define, #include
  • 16.
    1.2Compiler  is asoftware which converts a program written in high level language (Source Language) to low level language (Object/Target/Machine Language).  shows errors  Prolog, C,ML, LISP
  • 17.
    1.3Assembler  An assembleris a type of computer program that interprets software programs written in assembly language into machine language
  • 18.
    1.4. LINK-EDITOR (Linker)&LOADER :  Usually a longer program is divided into smaller subprograms called modules. And these modules must be combined to execute the program. The process of combining the modules is done by the linker.  In high level languages, some built in header files or libraries are stored. These libraries are predefined and these contain basic functions which are essential for executing the program. These functions are linked to the libraries by a program called Linker.
  • 19.
     Loader isresponsible for loading programs and libraries. It is one of the essential stages in the process of executing a program, as it places programs into memory and prepares them for execution.  The final output is Absolute machine code that is loaded into a computer fixed storage location and may not be relocated during execution of a computer program. 1.4LINK-EDITOR (Linker) &LOADER Contd..
  • 20.
    1.5Examples of Compilers: 1.Adacompilers 2 .ALGOL compilers 3 .BASIC compilers 4 .C# compilers 5 .C compilers 6 .C++ compilers 7 .COBOL compilers 8 .Common Lisp compilers 9. Fortran compilers 10Pascal compilers 11. PL/I compilers 12.Java compilers
  • 21.
  • 22.
    Short answer questions What is the purpose of Loader/Linker in language processing?  List any 4 compilers and 2 interpreters you know.  What is a preprocessor? Mention its objectives.  List the Differences between interpreter and compiler?
  • 23.
    1.6 Interpreter andcompiler differences Compiler Interpreter A compiler is a program which coverts the entire source code of a programming language into executable machine code for a CPU. interpreter takes a source program and runs it line by line, translating each line as it comes to it. Compiler takes large amount of time to analyze the entire source code but the overall execution time of the program is comparatively faster. Interpreter takes less amount of time to analyze the source code but the overall execution time of the program is slower. Compiler generates the error message only after scanning the whole program, so debugging is comparatively hard as the error can be present any where in the program. Its Debugging is easier as it continues translating the program until the error is met Generates intermediate object code. No intermediate object code is generated.
  • 24.
    Questions given inuniversity End Exams  Describe the functionality of compilers in a typical language processing system  What are program translators? Explain.
  • 25.
    2.STRUCTURE OF THECOMPILER DESIGN / PHASES OF A COMPILER  A compiler operates in phases.  A phase is a logically interrelated operation that takes source program in one representation and produces output in another representation.  2 phases of compilation.  Analysis Phase or Front-end (Machine Independent / Language Dependent)  Synthesis Phase or Back-end (Machine Dependent/ Language independent)
  • 26.
  • 27.
    2.1Lexical Analysis Orscanning  Lexical analysis performed by Lexical analyzer or Scanner  It is the first phase of the compiler.  It gets input from the source program and produces tokens as output.  It reads the characters one by one, starting from left to right and forms the tokens.  Token : It represents a logically cohesive sequence of characters ex:-keywords, operators, identifiers, special symbols etc  Example: a + b = 20  a, b,+,=,20 are all separate tokens.
  • 28.
    2.1Lexical Analysis contd.. Group of characters forming a token is called the Lexeme.  The lexical analyser not only generates a token but also enters the lexeme & its information into the symbol table it is not already there.  Num1=900
  • 29.
    2.2Syntax Analysis:-  Isperformed by syntax analyzer or parser.  It is the second phase of the compiler. It is also known as parser.  It gets the token stream as input from the lexical analyser of the compiler , analyses the syntactical structure of the given input using grammar rules specified.  Reports the syntax errors if any..otherwise Generates Parse Tree or syntax tree as the output if no errors are presented.  Syntax tree : It is a tree in which interior nodes are operators and exterior nodes are operands.
  • 30.
    2.2Syntax Analysis contd.. Example:  For a=b+c*2, syntax tree is <id,2> = <id,1> <id,3> + 2 * Id1=id2+id3*2
  • 31.
    2.3Semantic Analysis  SemanticAnalysis is the third phase of Compiler, performed by semantic analyzer.  Semantic Analysis makes sure that declarations and statements of program are semantically correct.  Semantic errors are reported  C=0 A=b/c  Semantic Analyzer: It uses syntax tree and symbol table to check whether the given program is semantically consistent with language definition. It gathers type information and stores it in either syntax tree or symbol table.
  • 32.
    2.3Semantic Analysis CONTD.. Semantic Errors: examples Type mismatch, Undeclared variables, Reserved identifier misuse  Functions of Semantic Analysis:  Type Checking – Ensures that data types are used in a way consistent with their definition.  Label Checking – A program should contain labels references.  Flow Control Check – Keeps a check that control structures are used in a proper manner.(example: no break statement outside a loop)  Implicit Type conversion or coercion A=b+c*60 if a,b,c are float automatically 60 is converted as 60.0
  • 33.
  • 34.
    2.4.Intermediate code generation Performed by intermediate code generator.  In the analysis-synthesis model of a compiler, the front end of a compiler translates a source program into an independent intermediate code, then the back end of the compiler uses this intermediate code to generate the target code.  Commonly used intermediate codes are  Three address code, postfix notation
  • 35.
    2.4.Intermediate code generationcontd..  A statement involving no more than three references(two for operands and one for result) is known as three address statement.  A sequence of three address statements is known as three address code.  Ex:- three address code for A=b+c*60 T1=intoreal(60) T2=<id3>*T1 T3=<id2> + T2 <id1>=T3
  • 36.
    2.5. Code Optimization:- It is the fifth phase of the compiler.  It gets the intermediate code as input and produces optimized intermediate code as output.  This phase reduces the redundant code and attempts to improve the intermediate code so that faster-running machine code will result.  During the code optimization, the result of the program is not affected.  To improve the code generation, the optimization involves - deduction and removal of dead code (unreachable code). - calculation of constants in expressions and terms. - collapsing of repeated expression into temporary string. - - loop unrolling. - - moving code outside the loop. - removal of unwanted temporary variables.
  • 37.
    2.6. Code Generation:- It is the final phase of the compiler.  It gets input from code optimization phase and produces the target code or object code as result.  Intermediate instructions are translated into a sequence of machine instructions that perform the same task.  The code generation involves - allocation of register and memory - generation of correct references - generation of correct data types -generation of missing code.
  • 38.
    2.7. Symbol TableManagement (or) Book- keeping:-  Symbol table is used to store all the information about identifiers used in the program.  It is a data structure containing a record for each identifier, with fields for the attributes of the identifier.  It allows to find the record for each identifier quickly and to store or retrieve data from that record.  Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
  • 39.
    2.8.Error Handlers:-  Eachphase can encounter errors. After detecting an error, a phase must handle the error so that compilation can proceed.  In lexical analysis, errors occur in separation of tokens.  In syntax analysis, errors occur during construction of syntax tree.  In semantic analysis, errors occur when the compiler detects constructs with right syntactic structure but no meaning and during type conversion.  In code optimization, errors occur when the result is affected by the optimization.  In code generation, it shows error when code is
  • 40.
     Position=initial+rate*60  Outputof lexical analyzer is <id1>=<id2>+<id3>60  Out put of syntax analyzer is
  • 41.
     Out putof semantic analyzer is  Out put of intermediate code generation is  T1=Intofloat(60)  T2=id3*T1  T3=Id2+t2  Id1=t3
  • 42.
     Out putof code optimizer is T1=intoreal(60) T2=<id3>*T1 T3=<id2> + T2 <id1>=T3 T1=<id3>*60.0 T2=<id2>+T1 <id1>=T2 <id1>=<id2>+T1
  • 43.
     Output ofCode optimizer is  T1=id3*60.0  Id1=id2+t1  Output of Code Generator is  LDF destination, source---loads given floating source value in to destination  STF destination, source --- stores given source in to destination  ADDF result,op1,op2---result=op1+op2  Operands should be in registers or value directly for operations except load and store  LDF r1,id2  LDF r2,id3  MULF r2,r2,60.0  ADDF r1,r1,r2  STF id1,r1
  • 45.
    Assignment  Conversion ofinfix to postfix notation process  Differentiate between token, lexeme and pattern with examples.  What happens in Analysis and Synthesis phases of compilation?  What is the key difference between lexical analysis and parsing?
  • 46.
     Pattern: isa rule associated with token that describes the set of strings ex:-Pi  Lexeme is Pi  Token <id1>  Pattern is (letter)(letter/digit)*
  • 47.
    3.Applications of CompilerTechnology 1. Implementation of High-Level Programming Languages 2. Optimizations for Computer Architectures Design 3. Design of New Computer Architectures 4. Program Translations of New Computer Architectures 5. Software Productivity Tools
  • 48.
    3.1.Implementation of High-LevelProgramming Languages  A high-level programming language defines a programming abstraction: the programmer expresses an algorithm using the language, and the compiler must translate that program to the target language.  Higher-level programming languages are easier to program in, but are less efficient, that is, the target programs run more slowly. low-level language have more control over a computation and can, in principle, produce more efficient code  lower-level programs are harder to write and — worse still — less portable, more prone to errors, and harder to maintain.  Optimizing compilers include techniques to improve the performance of generated code, thus offsetting the inefficiency introduced by high-level abstractions.
  • 49.
     C—1970  C++-1985 Java-1995  Data flow optimizations introduced  Object orientation was first introduced in Simula in 1967, and has been incorporated in languages such as Smalltalk, C + + , C # , and Java. The key ideas behind object orientation are  Data abstraction and  Inheritance of properties, compiler optimization Easy to program because of reusability and modularity concepts
  • 50.
    3.2.Optimizations for Computer Architectures The rapid evolution of computer architectures has also led to demand for new compiler technology. Almost all high-performance systems take advantage of the same two basic techniques:  parallelism and memory hierarchies.  Parallelism can be found at several levels:  At the instruction level, where multiple operations are executed simultaneously  at the processor level, where different threads of the same application are run on different processors.  Memory hierarchies are a response to the basic limitation that we can build very fast storage or very large storage, both fast and large.
  • 51.
    3.3Design of NewComputer Architectures  In the early days of computer architecture design, compilers were developed after the machines were built. That has changed.  Since programming in high-level languages is the norm, the performance of a computer system is determined not by its raw speed but also by how well compilers can exploit its features.  Thus, in modern computer architecture development, compilers are developed in the processor-design stage, and compiled code, running on simulators, is used to evaluate the proposed architectural features.  RISC Reduced instruction set computer, CISC(X86)  Most general-purpose processor architectures, including PowerPC, SPARC, MIPS, Alpha, and PA-RISC, are based on the RISC concept.
  • 52.
    3.4. Program Translations While we normally think of compiling as a translation from a high-level language to the machine level, the same technology can be applied to translate between different kinds of languages. The following are some of the important applications of program-translation techniques.  Binary Translation  Hardware Synthesis  Database Query Interpreters  Compiled Simulation
  • 53.
Binary Translation
 Compiler technology can be used to translate the binary code for one machine to that of another, allowing a machine to run programs originally compiled for another instruction set. Binary translation has been used by various computer companies to increase the availability of software for their machines.
 The Transmeta Crusoe processor is a VLIW processor that relies on binary translation to convert x86 code into native VLIW code.
 Binary translation can also be used to provide backward compatibility: when the processor in a machine is replaced by one with a different instruction set, binary translation lets existing software continue to run without recompilation.
Hardware Synthesis
 Hardware designs are mostly described in high-level hardware description languages like Verilog and VHDL (Very high speed integrated circuit Hardware Description Language).
 Hardware designs are typically described at the register transfer level (RTL), where variables represent registers and expressions represent combinational logic.
 Hardware-synthesis tools translate RTL descriptions automatically into gates, which are then mapped to transistors and eventually to a physical layout.
Database Query Interpreters
 Query languages, especially SQL (Structured Query Language), are used to search databases. Database queries consist of predicates containing relational and boolean operators.
 They can be interpreted or compiled into commands to search a database for records satisfying that predicate.
Compiled Simulation
 Simulation is a general technique used in many scientific and engineering disciplines to understand a phenomenon or to validate a design. Inputs to a simulator usually include the description of the design and specific input parameters for that particular simulation run.
 Simulations can be very expensive; each experiment may take days to complete on a high-performance machine.
 Instead of writing a simulator that interprets the design, it is faster to compile the design into machine code that simulates that particular design natively. Compiled simulation can run orders of magnitude faster than an interpreter-based approach.
 Compiled simulation is used in many state-of-the-art tools that simulate designs written in Verilog or VHDL.
3.5. Software Productivity Tools
 Errors are rampant in programs; errors may crash a system, produce wrong results, render a system vulnerable to security attacks, or even lead to catastrophic failures in critical systems. Testing is the primary technique for locating errors in programs.
 An interesting and promising complementary approach is to use data-flow analysis to locate errors statically, before the program is run.
 The problem of finding all program errors is undecidable, but a data-flow analysis can be designed to warn programmers of all possible statements violating a particular category of errors.
Productivity tools based on such analyses include:
 Type checking (e.g., verifying that a = b + c and a[50] use values of compatible types)
 Bounds checking (detecting array accesses out of bounds and buffer overflows)
 Memory-management tools (detecting memory leaks and other garbage-related errors)
4. The Evolution of Programming Languages
 The first electronic computers appeared in the 1940's and were programmed in machine language by sequences of 0's and 1's that explicitly told the computer what operations to execute and in what order.
 The operations themselves were very low level: move data from one location to another, add the contents of two registers, compare two values, and so on.
 This kind of programming was slow, tedious, and error prone. And once written, the programs were hard to understand and modify.
4.1. The Move to Higher-level Languages
 The first step was the development of mnemonic assembly languages in the early 1950's. Initially, the instructions in an assembly language were just mnemonic representations of machine instructions.
 Later, macro instructions were added to assembly languages so that a programmer could define parameterized shorthands for frequently used sequences of machine instructions.
 A major step towards higher-level languages was made in the latter half of the 1950's with the development of
 Fortran for scientific computation,
 Cobol for business data processing, and
 Lisp for symbolic computation.
 Programmers could more easily write numerical computations, business applications, and symbolic programs. These languages were so successful that they are still in use today.
 Today, there are thousands of programming languages. They can be classified in a variety of ways. One classification is by generation:
 First-generation languages are the machine languages.
 Second-generation languages are the assembly languages.
 Third-generation languages are the higher-level languages like Fortran, Cobol, Lisp, C, C++, C#, and Java.
 Fourth-generation languages are languages designed for specific applications, like NOMAD for report generation, SQL for database queries, and Postscript for text formatting.
 The term fifth-generation language has been applied to logic- and constraint-based languages like Prolog and OPS5.
 Another classification of languages uses the terms
 imperative for languages in which a program specifies how a computation is to be done, and
 declarative for languages in which a program specifies what computation is to be done.
 Languages such as C, C++, C#, and Java are imperative languages. In imperative languages there is a notion of program state and of statements that change the state.
 Functional languages such as ML and Haskell and constraint logic languages such as Prolog are often considered to be declarative languages.
 An object-oriented language is one that supports object-oriented programming, a programming style in which a program consists of a collection of objects that interact with one another.
 Simula 67 and Smalltalk are the earliest major object-oriented languages.
 Languages such as C++, C#, Java, and Ruby are more recent object-oriented languages.
 Scripting languages are interpreted languages with high-level operators designed for "gluing together" computations. These computations were originally called "scripts."
 Awk, JavaScript, Perl, PHP, Python, Ruby, and Tcl are popular examples of scripting languages. Programs written in scripting languages are often much shorter than equivalent programs written in languages like C.
5. The Science of Building a Compiler
 A compiler must accept all source programs that conform to the specification of the language; the set of source programs is infinite, and any program can be very large, consisting of possibly millions of lines of code.
 Any transformation performed by the compiler while translating a source program must preserve the meaning of the program being compiled.
 These requirements make compiler development challenging.
5.1. Modelling in Compiler Design and Implementation
 The study of compilers is mainly a study of how we design the right mathematical models and choose the right algorithms, while balancing the need for generality and power against simplicity and efficiency.
 The fundamental models are finite-state machines and regular expressions. These models are useful for describing the lexical units of programs (keywords, identifiers, and such) and for describing the algorithms used by the compiler to recognize those units.
 Other models, such as context-free grammars, are used to describe the syntactic structure of programming languages, e.g., the nesting of parentheses or control constructs.
 Similarly, trees are an important model for representing the structure of programs and their translation into target code.
5.2. The Science of Code Optimization
 The term "optimization" in compiler design refers to the attempts that a compiler makes to produce code that is more efficient than the obvious code. Compiler optimizations must meet the following design objectives:
 The optimization must be correct, that is, it must preserve the meaning of the compiled program.
 The optimization must improve the performance of many programs.
 The compilation time must be kept reasonable.
 The engineering effort required must be manageable.
 The second goal means the compiler must be effective in improving the performance of many input programs. Normally, performance means the speed of program execution; other factors, such as code size or power consumption, may also matter (for example, in embedded systems).
 Third, we need to keep the compilation time short to support a rapid development and debugging cycle.
 Finally, a compiler is a complex system; we must keep the system simple to assure that the engineering and maintenance costs of the compiler are manageable.
 There is an infinite number of program optimizations that we could implement, and it takes a nontrivial amount of effort to create a correct and effective optimization. We must prioritize the optimizations, implementing only those that lead to the greatest benefits on source programs encountered in practice.
Lexical Analysis:
 The role of the lexical analyzer; input buffering
 Specification of tokens
 Recognition of tokens
 The lexical-analyzer generator Lex
6. The Role of the Lexical Analyzer
 The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.
 The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well.
 When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
 In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.
 The parser calls the lexical analyzer with a getNextToken command; this causes the lexical analyzer to read characters from its input until it can identify the next lexeme, produce a token for it, and return that token to the parser.
 In short: scanning and generation of tokens.
 The lexical analyzer performs certain other tasks besides identification of lexemes, such as stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).
 Another task is correlating error messages generated by the compiler with the source program. For instance, the lexical analyzer may keep track of the number of newline characters seen, so it can associate a line number with each error message. In some compilers, the lexical analyzer makes a copy of the source program with the error messages inserted at the appropriate positions.
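To make the hand-off concrete, here is a minimal C sketch of a getNextToken-style interface. The type names, token kinds, and the 64-character lexeme buffer are illustrative assumptions for this sketch, not the textbook's or any particular compiler's API.

#include <ctype.h>
#include <stdio.h>

typedef enum { TOK_ID, TOK_NUM, TOK_OP, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;    /* token name used by the parser                 */
    char lexeme[64];   /* the matched lexeme (a simplified attribute)   */
} Token;

/* Group input characters into the next lexeme and return its token. */
static Token getNextToken(void) {
    Token t = { TOK_EOF, "" };
    int c = getchar();
    while (c == ' ' || c == '\t' || c == '\n') c = getchar();   /* strip whitespace */
    if (c == EOF) return t;
    int i = 0;
    if (isalpha(c)) {                        /* identifier: letter (letter|digit)* */
        t.kind = TOK_ID;
        while (isalnum(c)) { if (i < 63) t.lexeme[i++] = (char)c; c = getchar(); }
        ungetc(c, stdin);                    /* retract the lookahead character    */
    } else if (isdigit(c)) {                 /* number: digit+                     */
        t.kind = TOK_NUM;
        while (isdigit(c)) { if (i < 63) t.lexeme[i++] = (char)c; c = getchar(); }
        ungetc(c, stdin);
    } else {                                 /* any other single character: operator */
        t.kind = TOK_OP;
        t.lexeme[i++] = (char)c;
    }
    t.lexeme[i] = '\0';
    return t;
}

int main(void) {
    /* Stand-in for the parser: demand tokens one at a time until end of input. */
    for (Token t = getNextToken(); t.kind != TOK_EOF; t = getNextToken())
        printf("token %d, lexeme \"%s\"\n", t.kind, t.lexeme);
    return 0;
}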
Examples:
1. #inclue<
2. intabc
3. A=b+c
The Role of the Lexical Analyzer (contd.)
1. Lexical Analysis Versus Parsing
2. Tokens, Patterns, and Lexemes
3. Attributes for Tokens
4. Lexical Errors
1. Lexical Analysis Versus Parsing
There are a number of reasons why lexical analysis and parsing are separated:
 1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
 2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
6.2. Tokens, Patterns, and Lexemes
 Token: a logically cohesive sequence of characters, e.g., keywords, operators, identifiers, special symbols, etc.
 Lexeme: the group of characters in the source program forming an instance of a token.
 Pattern: a rule associated with a token that describes the set of strings that can form its lexemes.
 Example: in num1 = b + c, the lexemes num1, b, and c match the pattern for token id, so the statement is passed to the parser as <id,1> = <id,2> + <id,3>, where the numbers refer to symbol-table entries. (By contrast, 1awe = a + b is not well formed, since an identifier cannot begin with a digit.)
6.3. Attributes for Tokens
 When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.
 For example, the pattern for token number matches both 0 and 1, but it is extremely important for the code generator to know which lexeme was found in the source program. Thus, in many cases the lexical analyzer returns to the parser not only a token name, but also an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences translation of tokens after the parse.
 Normally, information about an identifier (e.g., its lexeme, its type, and the location at which it is first found, in case an error message about that identifier must be issued) is kept in the symbol table.
 Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
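For instance, for the assignment statement E = M * C ** 2, the lexical analyzer might pass the parser the following sequence of token names and attribute values (the symbol-table pointers are shown schematically):
 <id, pointer to symbol-table entry for E>
 <assign_op>
 <id, pointer to symbol-table entry for M>
 <mult_op>
 <id, pointer to symbol-table entry for C>
 <exp_op>
 <number, integer value 2>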
6.4. Lexical Errors
 Consider the fragment fi ( a == f(x) ) ...
 A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle an error due to transposition of the letters.
 However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
 The simplest recovery strategy is "panic mode" recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left. This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate.
 Other possible error-recovery actions are:
 Delete one character from the remaining input.
 Insert a missing character into the remaining input.
 Replace a character by another character.
 Transpose two adjacent characters.
 Transformations like these may be tried in an attempt to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation. This strategy makes sense, since in practice most lexical errors involve a single character.
 A more general correction strategy is to find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes, but this approach is considered too expensive in practice to be worth the effort.
7. Input Buffering
7.1. Buffer Pairs
 An important scheme involves two buffers (or one buffer divided into two halves) that are alternately reloaded, as suggested in Fig. 3.3.
 Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.
 Using one system read command we can read N characters into a buffer, rather than using one system call per character.
 If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file.
 Two pointers to the input are maintained:
 Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
 Pointer forward scans ahead until a pattern match is found.
 Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
 In Fig. 3.3, forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left.
 Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
[Figure: a buffer holding Abc=pqr*xyz, with lexemeBegin at A and forward scanning ahead; Abc is returned as token id and = as an operator.]
7.2. Sentinels
 We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
 Figure 3.4 shows the same arrangement as Fig. 3.3, but with the sentinels added. Note that eof retains its use as a marker for the end of the entire input: any eof that appears other than at the end of a buffer means that the input is at an end.
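The sketch below shows one way this scheme could look in C. The buffer names, the reload helper, and the use of '\0' as the sentinel are assumptions made for illustration; a real scanner would also manage the lexemeBegin pointer and retraction.

#include <stdio.h>

#define N 4096                     /* buffer size, typically one disk block        */
#define SENTINEL '\0'              /* stand-in for the eof sentinel character       */

static char buf0[N + 1], buf1[N + 1];   /* two buffers, each with room for a sentinel */
static char *forward;                   /* scanning pointer                           */
static FILE *src;

/* Fill one buffer from the source and terminate it with the sentinel. */
static char *reload(char *b) {
    size_t n = fread(b, 1, N, src);
    b[n] = SENTINEL;               /* a sentinel before position N marks true end of input */
    return b;
}

/* Advance forward by one character; reload a buffer when its sentinel is
   reached. Returns the next input character, or EOF at end of input.      */
static int next_char(void) {
    int c = (unsigned char)*forward++;
    if (c == SENTINEL) {
        if (forward == buf0 + N + 1)      forward = reload(buf1);  /* end of buffer 0 */
        else if (forward == buf1 + N + 1) forward = reload(buf0);  /* end of buffer 1 */
        else                              return EOF;              /* eof inside a buffer */
        c = (unsigned char)*forward++;
        if (c == SENTINEL) return EOF;    /* reload produced no characters */
    }
    return c;
}

int main(void) {
    src = stdin;
    forward = reload(buf0);
    long count = 0;
    while (next_char() != EOF)
        count++;                          /* a real scanner would group characters into lexemes here */
    printf("read %ld characters\n", count);
    return 0;
}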
8. Specification of Tokens
1. Strings and Languages
2. Operations on Languages
3. Regular Expressions
4. Regular Definitions
1. Strings and Languages
 An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation characters. The set {0,1} is the binary alphabet.
 The following string-related terms are commonly used:
 A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban, banana, and ε are prefixes of banana.
 A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana, banana, and ε are suffixes of banana.
 A substring of s is obtained by deleting any prefix and any suffix from s. For example, nan is a substring of banana.
 The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and are not equal to s itself.
 A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.
 If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed by appending y to x. For example, if x = dog and y = house, then xy = doghouse.
 The empty string ε is the identity under concatenation; that is, for any string s, εs = sε = s.
2. Operations on Languages
 The most important operations on languages are union, concatenation, and closure.
 Union is the familiar operation on sets.
 The concatenation of languages is the set of all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them.
 The (Kleene) closure of a language L, denoted L*, is the set of strings obtained by concatenating L zero or more times.
 Finally, the positive closure, denoted L+, is the same as the Kleene closure but without the term L^0; that is, ε is not in L+ unless it is in L itself.
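In symbols, writing L^i for L concatenated with itself i times (with L^0 = {ε}), these operations are:
 L ∪ M = { s | s is in L or s is in M }
 LM = { st | s is in L and t is in M }
 L* = L^0 ∪ L^1 ∪ L^2 ∪ ...
 L+ = L^1 ∪ L^2 ∪ L^3 ∪ ...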
Example 3.3: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. We may think of L and D in two essentially equivalent ways: as alphabets, or as languages all of whose strings have length one.
 L ∪ D is the language with 62 strings of length one, each of which is either one letter or one digit.
 LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
 L^4 is the set of all 4-letter strings.
 L* is the set of all strings of letters, including ε, the empty string.
 L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
 D+ is the set of all strings of one or more digits.
3. Regular Expressions
 The languages accepted by finite automata can be described by simple expressions called regular expressions.
 The languages denoted by regular expressions are referred to as regular languages.
 A regular expression can also be described as a pattern that defines a set of strings.
 Regular expressions are used to match character combinations in strings; string-searching algorithms use such patterns to locate matches within a string.
BASIS: ε is a regular expression denoting {ε}, and for each symbol a in the alphabet, a is a regular expression denoting {a}.
INDUCTION: There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively. Then:
 1. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
 2. (r)(s) is a regular expression denoting the language L(r)L(s).
 3. (r)* is a regular expression denoting (L(r))*.
 4. (r) is a regular expression denoting L(r); adding parentheses around an expression does not change its language.
4. Regular Definitions
 For notational convenience, we may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols.
 If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
 d1 -> r1
 d2 -> r2
 ...
 dn -> rn
 where each di is a new symbol, not in Σ and not the same as any other of the d's, and each ri is a regular expression over Σ ∪ {d1, d2, ..., di-1}.
 C identifiers are strings of letters, digits, and underscores. Here is a regular definition for the language of C identifiers (italics are conventionally used for the names defined in regular definitions):
 letter_ -> A | B | ... | Z | a | b | ... | z | _
 digit -> 0 | 1 | 2 | ... | 9
 id -> letter_ ( letter_ | digit )*
Write the regular definition for representing a number:
 number -> [0-9]+
Write the regular definition for representing a floating-point number (e.g., 95.89, .987):
 digit -> [0-9]
 number -> digit+
 fnumber -> number . number
 (With this definition, .987 must be written as 0.987, since digits are required on both sides of the point.)
 Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. A regular definition:
 digit -> 0 | 1 | 2 | ... | 9
 number -> digit digit*
 opfrac -> . number | ε
 opexp -> E ( + | - | ε ) number | ε
 unum -> number opfrac opexp
 For example, 6.336E4 is matched with number = 6, opfrac = .336, and opexp = E4.
Recognition of Tokens: Transition Diagrams
 A transition diagram has a collection of nodes or circles, called states.
 Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
 There is one initial (start) state, and there are final and non-final states; a final state is an accepting state, indicating that a lexeme has been found.
 Edges are directed from one state of the transition diagram to another. Each edge is labelled by a symbol or set of symbols.
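As an illustration (not taken from the textbook), the identifier diagram can be simulated directly in C: a start state that requires a letter, a state that loops on letters and digits, and an accepting action that retracts the one extra character read. The helper names and the small keyword table below are assumptions for this sketch; keywords are recognized by a table lookup on the accepted lexeme, as is commonly done.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *keywords[] = { "if", "then", "else", "while", 0 };

static int is_keyword(const char *s) {
    for (int i = 0; keywords[i]; i++)
        if (strcmp(s, keywords[i]) == 0) return 1;
    return 0;
}

/* Returns 1 and fills lexeme if an identifier or keyword starts at the
   current input position; the final "other" character is pushed back.
   Stops (returns 0) at the first character that cannot start an identifier. */
static int scan_id(FILE *in, char *lexeme, size_t cap) {
    int c = fgetc(in);
    while (isspace(c)) c = fgetc(in);      /* skip separators (start state) */
    if (c == EOF || !isalpha(c)) { ungetc(c, in); return 0; }
    size_t n = 0;
    while (isalnum(c)) {                   /* loop state: letters and digits */
        if (n + 1 < cap) lexeme[n++] = (char)c;
        c = fgetc(in);
    }
    ungetc(c, in);                         /* accepting state: retract lookahead */
    lexeme[n] = '\0';
    return 1;
}

int main(void) {
    char lex[64];
    while (scan_id(stdin, lex, sizeof lex))
        printf("%s: %s\n", is_keyword(lex) ? "keyword" : "id", lex);
    return 0;
}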
The Lexical-Analyzer Generator Lex Tool
 Use of Lex
 Structure of Lex Programs
 Conflict Resolution in Lex
 The Lookahead Operator
Structure of Lex Programs
 A Lex program is separated into three sections by %% delimiters. The format of a Lex source file is as follows:
 { definitions }
 %%
 { rules }
 %%
 { user subroutines }
 The definitions section includes declarations of constants and variables, and regular definitions.
 The rules section consists of statements of the form p1 {action1} p2 {action2} ... pn {actionn}, where each pi is a regular expression and actioni describes what the lexical analyzer should do when pattern pi matches a lexeme.
 The user subroutines section contains auxiliary procedures needed by the actions. These subroutines can be compiled separately and loaded with the lexical analyzer.
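For example, a minimal Lex specification with all three sections might look like the following sketch (the patterns and messages are illustrative, not part of any required program):

%{
/* definitions section: C declarations copied into the generated scanner */
#include <stdio.h>
%}
DIGIT   [0-9]
ID      [a-zA-Z_][a-zA-Z0-9_]*
%%
{DIGIT}+    { printf("number: %s\n", yytext); }
{ID}        { printf("identifier: %s\n", yytext); }
[ \t\n]+    { /* skip whitespace */ }
%%
/* user subroutines section */
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }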
Conflict Resolution in Lex
 Lex uses two rules to decide on the proper lexeme to select when several prefixes of the input match one or more patterns:
 Rule 1: Always prefer a longer prefix to a shorter prefix. Example: for the input >=b+c, Lex matches the two-character prefix >= (relational operator GE) rather than just >.
 Rule 2: If the longest possible prefix matches two or more patterns, prefer the pattern listed first in the Lex program. Example: the lexeme if matches both the keyword pattern and the identifier pattern; it is treated as the keyword because that pattern is listed first.
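Both rules can be observed with a small Lex specification such as the following sketch (the printed messages are illustrative; in a real compiler the actions would return token codes to the parser):

%%
"if"      { printf("keyword if\n");   /* listed before the id pattern, so "if" is a keyword (rule 2) */ }
[a-z]+    { printf("id %s\n", yytext); /* also matches "if", but appears later in the program */ }
">="      { printf("relop GE\n");     /* on input >=b+c the longer prefix >= is chosen (rule 1) */ }
">"       { printf("relop GT\n"); }
[ \t\n]+  { }
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }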
4. The Lookahead Operator
 Lex automatically reads one character ahead of the last character that forms the selected lexeme, and then retracts the input so that only the lexeme itself is consumed from the input.
 However, sometimes we want a certain pattern to be matched only when it is followed by certain other characters. In that case, we may use the slash (/) in a pattern to indicate the end of the part of the pattern that matches the lexeme; what follows the slash is lookahead that must be present in the input but is not part of the lexeme.
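The classic use is Fortran's IF, which should be treated as a keyword only when it is followed by a parenthesized condition; a pattern along the lines of IF/\( expresses that. The small sketch below shows the operator itself (the pattern and message are illustrative): the characters after the slash must follow in the input, but they remain unconsumed and are not part of yytext.

%%
ab/cd     { printf("matched ab, followed by cd\n"); /* cd is lookahead only; it stays in the input */ }
.|\n      { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }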
/* Lex program to count the number of words on each input line */
%{
#include <stdio.h>
#include <string.h>
int i = 0;
%}
/* Rules section */
%%
[a-zA-Z0-9]+   { i++; /* count one word */ }
"\n"           { printf("%d\n", i); i = 0; /* report the count at end of line */ }
%%
/* user subroutines section */
int yywrap(void) { return 1; }

int main()
{
    /* yylex() starts the analysis */
    yylex();
    return 0;
}
 Function yywrap is called by Lex when the input is exhausted. It should return 1 if you are done, or 0 if more processing is required.
 Every C program requires a main function; here we simply call yylex, which is the main entry point for the Lex-generated scanner.
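 Assuming the Flex implementation of Lex, such a specification (say, words.l, a hypothetical file name) is typically processed with flex words.l, which generates lex.yy.c; compiling that file with a C compiler (e.g., cc lex.yy.c -o words) produces a scanner that reads from standard input.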
/* Lex program to identify and count positive and negative numbers */
%{
#include <stdio.h>
int positive_no = 0, negative_no = 0;
%}
%%
^[-][0-9]+   { negative_no++; printf("negative number = %s\n", yytext); /* '-' and digits at the start of a line */ }
[0-9]+       { positive_no++; printf("positive number = %s\n", yytext); }
%%
/*** user code section ***/
int yywrap(void) { return 1; }

int main()
{
    yylex();
    printf("number of positive numbers = %d, "
           "number of negative numbers = %d\n",
           positive_no, negative_no);
    return 0;
}
 yytext holds the text (lexeme) matched by the current pattern.
 Regular expression to denote any string starting with a, where Σ = {a, b}:
 a(a|b)*
 Regular expression to denote any string ending with a, where Σ = {a, b}:
 (a|b)*a
 Regular expression to denote any string beginning with a and ending with b, where Σ = {a, b}:
 a(a|b)*b
 All strings of lowercase letters:
 [a-z]*
 All strings of lowercase letters in which the letters are in ascending lexicographic order:
 a*b*c*d* ... z*
 All strings of a's and b's with an even number of a's:
 b*(ab*ab*)*
 All strings of lowercase letters that contain the five vowels in order:
 letter -> [b-df-hj-np-tv-z]   (the consonants)
 string -> letter* a (letter|a)* e (letter|a|e)* i (letter|a|e|i)* o (letter|a|e|i|o)* u (letter|a|e|i|o|u)*
 Describe the languages denoted by the following regular expressions:
 (i) (a|b)*a(a|b)(a|b)
 (ii) a*ba*ba*ba*
 With a suitable transition diagram, explain recognition of keywords and identifiers.