2. What is a Compiler?
A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in a machine-level language (MLL), the target program. An important part of a compiler is reporting errors to the programmer.
4. The Analysis-Synthesis Model of Compilation
There are two parts to compilation:
⚫Analysis determines the operations implied by the source program, which are recorded in a tree structure.
⚫Synthesis takes the tree structure and translates the operations therein into the target program.
6. There are two parts to compilation: analysis and
synthesis.
⚫The analysis part breaks up the source program into
pieces and creates an intermediate representation of
the source program.
⚫The synthesis part constructs the desired target
program from the intermediate representation.
8. The Grouping of Phases
Compiler front and back ends:
Front end: analysis (machine independent)
Back end: synthesis (machine dependent)
Compiler passes: a collection of phases is executed only once (single pass) or multiple times (multi pass)
Single pass: usually requires everything to be defined before being used in the source program
Multi pass: the compiler may have to keep the entire program representation in memory
10. Phases of Compiler
⚫ Compiler operates in phases.
⚫ Each phase transforms the source program from one representation
to another.
Six phases:
– Lexical Analysis
– Syntax Analysis
– Semantic Analysis
– Intermediate Code Generation
– Code Optimization
– Code Generation
• Symbol table and error handling interact with the six phases.
• Some of the phases may be grouped together.
These phases are illustrated by considering the following statement:
position := initial + rate * 60
11. Lexical Analysis (Scanning)
Reads the characters in the source program and groups them into lexemes (the basic units of syntax).
Produces words and recognises what sort they are. The output is called a token and is a pair of the form
<token_name, attribute_value>
E.g.: position = initial + rate * 60
It also maintains a symbol table.
Lexical analysis eliminates white space.
Speed is important, so a specialised tool is used: e.g., flex, a tool for generating scanners (programs which recognise lexical patterns in text).
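As an illustrative sketch of this phase (not any real compiler's scanner; the token names and patterns here are assumptions for the example), the following Python snippet emits <token_name, attribute_value> pairs for the running statement:

```python
import re

# Token patterns tried in order; white space is simply skipped.
TOKEN_SPEC = [
    ("id",     r"[a-zA-Z_][a-zA-Z0-9_]*"),   # identifiers
    ("number", r"[0-9]+"),                    # integer constants
    ("op",     r"[+\-*/=]|:="),               # operators incl. assignment
]

def scan(source):
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():             # lexical analysis eliminates white space
            pos += 1
            continue
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                tokens.append((name, m.group()))   # <token_name, attribute_value>
                pos += m.end()
                break
        else:
            raise ValueError(f"lexical error at {source[pos:]!r}")
    return tokens

print(scan("position := initial + rate * 60"))
```

In a real compiler the attribute for an identifier would be a pointer into the symbol table rather than the lexeme itself.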
12. Syntax Analysis
⚫Syntax (or syntactic) Analysis (Parsing)
⚫Imposes a hierarchical structure on the token stream.
⚫Use output of lexical analyzer to create a tree like
structure of the token stream – syntax tree.
⚫The tree shows the order in which operations are
performed
⚫Ordering based on the precedence of mathematical
operations
13. Semantic Analysis:
•Collects context (semantic) information, checks for semantic errors, and annotates nodes of the tree with the results.
•Ensures that the components of a program fit together meaningfully.
•Also gathers type information and saves it in either the syntax tree or the symbol table.
14. Intermediate Code Generation:
An intermediate representation of the final machine language code is produced. This phase bridges the analysis and synthesis phases of translation.
The intermediate representation should have two important properties:
⚫It should be easy to produce.
⚫It should be easy to translate into target machine code.
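For the running statement position := initial + rate * 60, a common intermediate representation is three-address code. Assuming id1, id2 and id3 are the symbol-table entries for position, initial and rate, the analysis phases might produce something like:

```
t1 := inttofloat(60)
t2 := id3 * t1
t3 := id2 + t2
id1 := t3
```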
15. Code Optimization:
•The goal is to improve the intermediate code, thus improving the effectiveness of code generation and the performance of the target code.
•The result is faster, shorter code, possibly code that even consumes less power.
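For the same running statement, position := initial + rate * 60, an optimizer might fold the integer-to-float conversion into the constant and eliminate a temporary, shortening the three-address code to something like:

```
t1 := id3 * 60.0
id1 := id2 + t1
```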
16. Code Generation:-
⚫The last phase of translation is code generation.
⚫A number of optimizations to reduce the length of the machine language program are carried out during this phase. The output of the code generator is the machine language program for the specified computer.
⚫Target, machine-specific properties may be used to
optimize the code.
⚫ Finally, machine code and associated information
required by the Operating System are generated.
17. Symbol Table Management
The symbol table records each variable name used in the source program and stores the attributes of each name, e.g.:
the name, its type, its scope
the method of passing each argument (by value or by reference)
the return type
The symbol table can be implemented as either a linear list or a hash table.
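A minimal Python sketch of the hash-table variant (the attribute set here is an assumption for illustration; a list of (name, record) pairs would give the linear-list variant):

```python
# Symbol table as a hash table: Python's dict hashes the name for us.
symbol_table = {}

def insert(name, type_, scope):
    # Store the attributes recorded for each name.
    symbol_table[name] = {"type": type_, "scope": scope}

def lookup(name):
    # Returns the attribute record, or None if the name is undeclared.
    return symbol_table.get(name)

insert("position", "float", "global")
insert("rate", "float", "global")
print(lookup("rate"))        # {'type': 'float', 'scope': 'global'}
print(lookup("undeclared"))  # None
```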
Error Handling Routine: In the compiler design process, errors may occur in any of the phases given below:
Lexical analyzer: Wrongly spelled tokens
Syntax analyzer: Missing parenthesis
Intermediate code generator: Mismatched operands for an operator
Code Optimizer: When the statement is not reachable
Code Generator: Unreachable statements
Symbol tables: Error of multiple declared identifiers
19. Front end
Front end of a compiler consists of the phases
• Lexical analysis.
• Syntax analysis.
• Semantic analysis.
• Intermediate code generation.
Back end
Back end of a compiler contains
• Code optimization.
• Code generation.
Front End
• The front end comprises the phases which depend on the input (source language) and are independent of the target machine (target language).
• It includes lexical and syntactic analysis, symbol table management, semantic analysis and the generation of intermediate code.
• Some code optimization can also be done by the front end.
• It also includes error handling for the phases concerned.
20. Back End
• The back end comprises those phases of the compiler that depend on the target machine and are independent of the source language.
• This includes code optimization and code generation.
• In addition, it encompasses the associated error handling and symbol table operations.
21. Passes
A compiler pass refers to a traversal of the compiler through the entire program. Compiler passes are of two types:
•Single Pass Compiler
•Two Pass Compiler or Multi Pass Compiler.
1. Single Pass Compiler:
If we combine or group all the phases of compiler design into a single module, it is known as a single pass compiler.
22. ⚫In the above diagram, all six phases are grouped into a single module. Some points about the single pass compiler:
⚫A one pass/single pass compiler is a compiler that passes through each part of a compilation unit exactly once.
⚫A single pass compiler is faster and smaller than a multi pass compiler.
⚫A disadvantage of the single pass compiler is that it is less efficient in comparison with a multi pass compiler.
⚫A single pass compiler processes the input exactly once, going directly from lexical analysis to code generation, then going back to read the next input.
23. ⚫The first pass is referred to as:
(a) Front end
(b) Analysis part
(c) Platform independent
⚫The second pass is referred to as:
(a) Back end
(b) Synthesis part
(c) Platform dependent
24. Compiler vs Interpreter
Compiler | Interpreter
Scans the entire program first and translates it into machine code | Scans the program line by line and converts it into machine code
Debugging is slower | Debugging is faster
Errors are reported after scanning the whole program | Errors are reported after scanning each line
Shows all errors and warnings at the same time | Shows one error at a time
Execution time is less | Execution time is more
Used by languages such as C, C++ | Used by languages such as Java, Python
25. The Role of the Lexical Analyzer
⚫ 1) Lexical analyzer functions (role of the lexical analyzer)
⚫ 2) Interaction between LA and SA
⚫ 3) Tokens, lexemes and patterns
⚫ 1) Lexical analyzer functions (role of the lexical analyzer)
a) It provides a stream of tokens
b) It removes comments from the source program
c) It removes white space characters such as blank spaces, tab spaces and newline characters
d) It keeps track of the line numbers of the source program
e) It generates a symbol table
f) It provides error messages
26. The role of Lexical Analyzer
2)Interaction between LA and SA
27. ⚫ 3) Tokens, lexemes and patterns
⚫ Token: a pair of components of the form <token_name, attribute_value>; token classes include identifiers/variables, operators, keywords and constants
⚫ The attribute value is typically a pointer into the symbol table
⚫ Lexeme: a sequence of characters in the source program matched by a token's pattern
⚫ Pattern: a rule, usually given as a regular expression, defining the set of lexemes for a token
⚫ E.g., an identifier starts with a letter or underscore: (l/_)(l/d/_)*, i.e.
[a-zA-Z_][a-zA-Z0-9_]*
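The identifier pattern above can be tried out directly; a small Python sketch:

```python
import re

# (l/_)(l/d/_)* : a letter or underscore, then letters, digits or underscores.
IDENT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def is_identifier(lexeme):
    # fullmatch: the whole lexeme must fit the pattern, not just a prefix.
    return IDENT.fullmatch(lexeme) is not None

print(is_identifier("_rate1"))  # True
print(is_identifier("1xab"))    # False: starts with a digit
```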
29. Regular Expressions
⚫Regular Expressions are used to denote regular
languages. An expression is regular if:
⚫ɸ is a regular expression for regular language ɸ.
⚫ɛ is a regular expression for regular language {ɛ}.
⚫If a ∈ Σ, a is a regular expression with language {a}.
⚫If a and b are regular expressions, a + b is also a regular expression, with language {a,b}.
⚫If a and b are regular expressions, ab (the concatenation of a and b) is also regular.
⚫If a is a regular expression, a* (zero or more occurrences of a) is also regular.
30. Operations
The various operations on languages are:
⚫Union of two languages L and M is written as
⚫L U M = {s | s is in L or s is in M}
⚫Concatenation of two languages L and M is written as
⚫LM = {st | s is in L and t is in M}
⚫The Kleene Closure of a language L is written as
⚫L* = Zero or more occurrence of language L.
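These operations can be tried out on small finite languages; a Python sketch (Kleene closure is an infinite set, so this only enumerates strings built from at most n repetitions):

```python
def union(L, M):
    # L U M = {s | s in L or s in M}
    return L | M

def concat(L, M):
    # LM = {st | s in L and t in M}
    return {s + t for s in L for t in M}

def kleene(L, n):
    # Approximate L*: epsilon plus concatenations of up to n strings from L.
    result = {""}               # epsilon: zero occurrences
    layer = {""}
    for _ in range(n):
        layer = concat(layer, L)
        result |= layer
    return result

L, M = {"a", "b"}, {"c"}
print(sorted(union(L, M)))       # ['a', 'b', 'c']
print(sorted(concat(L, M)))      # ['ac', 'bc']
print(sorted(kleene({"a"}, 2)))  # ['', 'a', 'aa']
```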
32. Finite automata
⚫A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. A finite automaton is a recognizer for regular expressions.
⚫The mathematical model of finite automata consists of:
⚫Finite set of states (Q)
⚫Finite set of input symbols (Σ)
⚫One Start state (q0)
⚫Set of final states (qf)
⚫Transition function (δ)
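The five-tuple model can be run directly; a Python sketch of a hypothetical DFA (the particular machine, accepting binary strings that end in 1, is an assumption for this example):

```python
# The mathematical model (Q, Σ, q0, F, δ) as plain data.
Q = {"q0", "q1"}
SIGMA = {"0", "1"}
START = "q0"                     # q0: start state
FINAL = {"q1"}                   # qf: final states
DELTA = {("q0", "0"): "q0", ("q0", "1"): "q1",   # δ: transition function
         ("q1", "0"): "q0", ("q1", "1"): "q1"}

def accepts(s):
    state = START
    for ch in s:
        state = DELTA[(state, ch)]   # exactly one next state: deterministic
    return state in FINAL

print(accepts("1011"))  # True
print(accepts("10"))    # False
```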
33. DFA stands for deterministic finite automaton. In a DFA, there is only one path for a given input from the current state to the next state.
A finite automaton is called an NFA (nondeterministic finite automaton) when there can be many paths for a given input from the current state to the next state.
34. Steps for converting NFA with ε to DFA:
⚫Step 1: We will take the ε-closure for the starting state
of NFA as a starting state of DFA.
⚫Step 2: Find the states for each input symbol that can
be traversed from the present. That means the union
of transition value and their closures for each state of
NFA present in the current state of DFA.
⚫Step 3: If we found a new state, take it as current state
and repeat step 2.
⚫Step 4: Repeat Step 2 and Step 3 until there is no new
state present in the transition table of DFA.
⚫Step 5: Mark the states of DFA as a final state which
contains the final state of NFA.
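The steps above can be sketched in Python for a small hypothetical ε-NFA (the automaton, its alphabet and its states are assumptions for this example):

```python
# ε-NFA transitions: (state, symbol) -> set of next states; "" denotes ε.
NFA = {
    (0, ""): {1},
    (1, "a"): {1, 2},
    (2, "b"): {2},
}
START, FINALS, SIGMA = 0, {2}, ["a", "b"]

def eclose(states):
    # Steps 1-2: ε-closure = all states reachable by ε-moves alone.
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, ""), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa():
    start = eclose({START})              # Step 1: DFA start state
    dfa, todo = {}, [start]
    while todo:                          # Steps 2-4: process states until no new ones
        S = todo.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in SIGMA:
            moved = set().union(*(NFA.get((s, a), set()) for s in S))
            T = eclose(moved)            # union of transitions, then closure
            dfa[S][a] = T
            if T not in dfa:
                todo.append(T)
    final = {S for S in dfa if S & FINALS}   # Step 5: mark finals
    return start, dfa, final

start, dfa, final = nfa_to_dfa()
print(len(dfa), sorted(map(sorted, final)))
```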
42. Input Buffering in Compiler Design
⚫The lexical analyzer scans the input from left to right one character at a time. It uses two pointers, the begin pointer (bp) and the forward pointer (fp), to keep track of the portion of the input scanned.
⚫Initially both pointers point to the first character of the input string, as shown below.
43. [Figure: bp and fp both pointing at the first character of the input buffer]
44. ⚫The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is encountered, it indicates the end of the lexeme. In the above example, as soon as the forward pointer (fp) encounters a blank space, the lexeme "int" is identified.
⚫When fp encounters white space, it ignores it and moves ahead; then both the begin pointer (bp) and the forward pointer (fp) are set to the start of the next token.
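A simplified Python sketch of the two-pointer scheme (white space is treated as the only lexeme delimiter here, which is an oversimplification of a real scanner):

```python
def next_lexeme(buf, bp):
    # bp: begin pointer; fp: forward pointer, starting at the same character.
    fp = bp
    while fp < len(buf) and buf[fp].isspace():       # fp ignores white space
        fp += 1
    bp = fp                                          # bp catches up to fp
    while fp < len(buf) and not buf[fp].isspace():   # fp searches for end of lexeme
        fp += 1
    return buf[bp:fp], fp    # the lexeme, and where the next scan begins

buf = "int i = 10;"
lexeme, pos = next_lexeme(buf, 0)
print(lexeme)                      # "int"
lexeme, pos = next_lexeme(buf, pos)
print(lexeme)                      # "i"
```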
46. Regular expressions
⚫Regular expressions are a notation to
represent lexeme patterns for a token.
⚫They are used to represent the language
for lexical analyzer.
⚫They assist in finding the type of token
that accounts for a particular lexeme.
50. Lex - Lexical Analyzer Generator
⚫ Lex is a program that generates lexical analyzers. It is used with the YACC parser generator.
⚫ A lexical analyzer is a program that transforms an input stream into a sequence of tokens.
⚫ Lex reads a specification and produces C source code implementing the lexical analyzer.
The workflow of Lex is as follows:
⚫ First, the programmer writes a Lex source program lex.l in the Lex language. The Lex compiler then processes lex.l and produces a C program lex.yy.c.
⚫ Finally, the C compiler compiles lex.yy.c and produces an object program a.out.
⚫ a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
51. Lex File Format
A Lex program is separated into three sections by %% delimiters. The format of a Lex source file is as follows:
{ definitions } //declaration part
%%
{ rules } //rule part: pattern {action}
%%
{ user subroutines } //auxiliary part
⚫Definitions include declarations of constants, variables and regular definitions.
⚫Rules are statements of the form p1 {action1} p2 {action2} ... pn {actionn},
⚫where each pi is a regular expression and each actioni describes what action the lexical analyzer should take when pattern pi matches a lexeme.
⚫User subroutines are auxiliary procedures needed by the actions. The subroutines can be compiled separately and loaded with the lexical analyzer.
52. % {
#include <stdio.h>
%}
% %
^[a - z A - Z _][a - z A - Z 0 - 9 _] * {printf("Valid Identifier");}
% %
main()
{
yylex();
}
54. //1. program to identify numbers, letters, operators and special symbols//
%option noyywrap
digit [0-9]
letter [a-zA-Z]
ops [+\-*/=%]
sp_sym [^a-zA-Z0-9+\-*/=% \t\n]
%%
{letter}+ {printf("%s is a string\n", yytext);}
{digit}+ {printf("%s is a digit\n", yytext);}
{ops}+ {printf("%s is an operator\n", yytext);}
{sp_sym}+ {printf("%s is a special symbol\n", yytext);}
%%
int main()
{
yylex();
return 0;
}
55. Lexical Error
⚫This type of error is detected during the lexical analysis phase.
⚫A lexical error is a sequence of characters that does not match the pattern of any token.
Lexical phase errors can be:
⚫Spelling errors.
⚫Unmatched strings.
⚫Appearance of illegal characters.
⚫Exceeding the length limit of an identifier.
57. For example, the lexeme 1xab is neither a number nor an identifier, so code containing it will trigger a lexical error.
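This check can be sketched in Python, with number and identifier patterns assumed to match the earlier examples:

```python
import re

NUMBER = re.compile(r"[0-9]+")                   # integer constant pattern
IDENT  = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")   # identifier pattern

def classify(lexeme):
    # A lexeme that matches no token pattern is a lexical error.
    if NUMBER.fullmatch(lexeme):
        return "number"
    if IDENT.fullmatch(lexeme):
        return "identifier"
    return "lexical error"

print(classify("60"))    # number
print(classify("rate"))  # identifier
print(classify("1xab"))  # lexical error: a digit followed by letters
```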