PREPARED BY:
Zinal Gohil
ASST. PROF.
(CE/IT)
(PPSU,SOE)
2. OVERVIEW OF
LANGUAGE
PROCESSOR
 Programming Languages and Language
Processors,
 Language Processing Activities,
 Program Execution,
 Fundamental of Language Processing,
 Symbol Tables;
 Data Structures for Language Processing: Search
Data structures, Allocation Data Structures
OUTLINE
 Language processors comprise compilers, assemblers
and interpreters.
 Programmers write software in a variety of languages,
and compilers and interpreters translate it into
instructions that are understood by the computer at
machine level.
 Language processing activities occur due to the difference
between how software is specified by the programmer and
how it is implemented by the computer.
 The software programmer works in two domains:
 1. APPLICATION DOMAIN : to present the idea
 2. EXECUTION DOMAIN : for carrying out these ideas
OVERVIEW OF PROGRAMMING
LANGUAGE AND LANGUAGE
PROCESSOR
 Semantics pertains to the meaning of words. The
semantics of a language is a description of what the
sentences mean. It is much more difficult to express the
semantics of a language than it is to express the syntax.
 In order to implement a programming language we must
know what each sentence means (declaration,
expression, etc).
 E.g., does the sentence
 produce an output,
 take any inputs,
 change the value stored in a variable,
 produce an error.
TERMINOLOGIES
1. SEMANTICS (MEANING)
 Domain: It refers to the scope or sphere of any activity.
 Application Domain: The scope of an application is its
application domain.
 E.g., the application domain of an inventory program is
warehouse and its associated tangibles (goods,
machinery, etc), transactions (e.g., receiving goods,
purchase orders, locating goods, shipping of goods,
receiving payments, etc), people (e.g., workers,
managers, customers).
 All the above are objects in the application domain. The
application domain can best be described by a person in
that domain. E.g., the warehouse manager in the above
example.
TERMINOLOGIES
2. APPLICATION DOMAIN
 Execution Domain (also called the solution domain).
The execution domain is the work of programmers, e.g.,
program code, documentation, test results, files, computers,
etc.
 The solution domain is partitioned into two levels:
 Abstract, high-level documents, such as flow charts,
diagrams
 Low-level – data structures, function definitions, etc.
TERMINOLOGIES
3. EXECUTION DOMAIN
 The difference between the semantics of the application
domain and the execution domain is called the semantic
gap.
TERMINOLOGIES
4. SEMANTIC GAP
 Consequences of semantic gap:
 Large development times – interaction between
designers in application domain and programmers.
 Large development efforts.
 Poor quality of software.
CONSEQUENCES OF SEMANTIC GAP
 The semantic gap is reduced by programming languages
(PL). The use of a PL introduces a new domain called the
programming language domain (or PL domain).
 The PL domain bridges the gap between the application
domain and the execution domain.
HOW IS THE SEMANTIC GAP
REDUCED?
 Specification gap: It is the semantic gap between the
application domain and the PL domain.
 It can also be defined as the semantic gap between the
two specifications of the same task.
 The specification gap is bridged by the software
development team.
 Execution gap: It is the gap between the semantics of
programs written in different programming languages.
 The execution gap is bridged by the translator or
interpreter.
SPECIFICATION GAP AND EXECUTION
GAP
Advantages of introducing the PL domain:
(a) Large development times are reduced.
(b) Better quality of software.
(c) The language processor provides diagnostic
capabilities which detect errors.
 Language Processor: software which bridges the
specification or execution gap.
 Language Processing: any activity performed by a
language processor.
 Diagnostic capability is a feature of a language
processor. The input of a language processor is the
source program and its output is the target program.
The target program is not produced if the language
processor finds any errors in the source program.
TERMS
 (a) Language Translator: bridges the execution gap to
the machine language of a computer system. Examples are
the compiler and the assembler.
 (b) De-translator: similar to a translator, but works in the
opposite direction.
 (c) Preprocessor: a language processor whose source
and target languages are both high level, i.e., no translation
to machine language takes place.
 (d) Language migrator: bridges the specification gap between
two PLs (used to convert a program written in one programming
language into another programming language). It may be used
to provide portability of a program by migrating it to a more
modern programming language.
 The quality of the target program depends on the semantics of the
two programming languages.
TYPES OF LANGUAGE PROCESSOR
 In the case of problem-oriented languages, the PL
domain is very close to the application domain, so the
specification gap is reduced. Such PLs can
be used only for specific applications; hence they are
called problem-oriented languages.
 They have a large execution gap, but the execution
gap is bridged by the translator or interpreter. Using
these languages, we only have to specify "what to
do".
 Software development takes less time using problem-
oriented languages, but the resultant code may not be
optimized. Examples: fourth-generation languages
(4GL) like SQL.
PROBLEM-ORIENTED LANGUAGES:
 These provide general facilities and features which are
required in most applications. These languages are
independent of application domains.
 Hence there is a large specification gap, which
must be bridged by the application designer. Using
these languages, we have to specify both "what to do"
and "how to do it".
 Examples: C, C++, FORTRAN, etc.
PROCEDURE-ORIENTED LANGUAGES:
 A compiler is a language translator. It translates a
source code (programs in a high-level language) into
the target code (machine code, or object code).
 To do this translation, a compiler steps through a
number of phases. The simplest is a 2-phase compiler:
the first phase is called the front end and the second
phase is called the back end.
COMPILERS
 Front End: The front end translates from the high-level
language to a common intermediate language. The front
end is source language dependent but it is machine-
independent. Thus, the front end consists of the following
phases:
 lexical analysis, syntactic analysis, creation of symbol table,
semantic analysis and generation of intermediate code. The
front end also includes error-handling routines for each of
these phases.
 Back End: The back end translates from this common
intermediate language to the machine code.
 The back end is machine dependent. This includes code
optimization, code generation, error-handling and symbol
table operations. Thus, a compiler bridges the execution gap.
COMPILERS
 It is a language processor. It also bridges the execution
gap but does not generate the machine code. An
interpreter executes a program written in a high level
language.
 The essential difference between a compiler and an
interpreter is that while a compiler generates the
machine code and is then no longer needed, an
interpreter is always required.
INTERPRETER
 Language processing activities are related to the
specification gap and the execution gap.
 They are divided into two types:
 1. Program generation
 2. Program execution
 The aim of the program generation activity is to generate
a program automatically. In this activity the specification
language of the application domain is the source language,
 and a procedure-oriented language is the target language.
 1. PROGRAM GENERATION ACTIVITY
 A program generator is system software. The program
specification is input to this software, which
generates output in the target language.
LANGUAGE PROCESSING ACTIVITIES
 Here the specification gap is the gap between the application
domain and the program generator domain.
LANGUAGE PROCESSING ACTIVITIES
[Diagram: User → Application Domain —(Specification Gap)→ Program Generator Domain → Target Programming Language Domain → Program Execution Domain]
 It reduces the specification gap, and the reliability of the
generated program increases. It also helps the programmer
to write the specification of the program easily.
 A compiler is used to bridge the gap between the target PL
and the execution domain.
 2. PROGRAM EXECUTION
 Methods of program execution
 (a) Program translation
 (b) Program interpretation
LANGUAGE PROCESSING ACTIVITIES
 (a) Program Translation
 It bridges the execution gap by translating the
source program into a target program.
 The source program is written in a programming
language; the target program is in the machine or
assembly language of the computer system.
LANGUAGE PROCESSING ACTIVITIES
 Characteristics:
 1. Before execution, the program must be translated.
 2. The translated program may be saved in files.
 3. The program must be retranslated after modifications.
 (b) PROGRAM INTERPRETATION
 The interpreter reads the source program and stores it in
main memory. During interpretation it takes a source
statement, determines its meaning, and then performs the
actions that the statement specifies.
 The actions may be computational or input-output.
LANGUAGE PROCESSING ACTIVITIES
The program counter holds the memory address of the next
instruction; the CPU uses the program counter to select the
next instruction.
 The instruction execution cycle consists of three steps:
 1. Fetch 2. Decode 3. Execute
LANGUAGE PROCESSING ACTIVITIES
(b) Execution
 This cycle is repeated for all instructions. The instruction
address in the program counter is updated at the end of
the cycle, and the CPU selects the next instruction for execution.
 The analogous process in an interpreter is called the
interpretation cycle. The interpretation cycle consists of:
 1. Fetch the statement
 2. Analyze the statement
 3. Execute the statement
 CHARACTERISTICS OF INTERPRETATION
 1. The source program is retained in source form itself.
 2. Each statement is analyzed during interpretation.
LANGUAGE PROCESSING ACTIVITIES
FUNDAMENTAL OF LANGUAGE
PROCESSING
 Language processing is the combination of analysis of the
source program (SP) and synthesis of the target program (TP).
The specification of a source program consists of three components:
 1. Lexical rules
 2. Syntax rules
 3. Semantic rules
• The source program can be analyzed in three phases:
• 1. Linear (lexical) analysis : in this type of analysis the
source string is read from left to right and grouped into
tokens.
• EX : Tokens for a language can be identifiers, constants,
relational operators, keywords.
• 2. Hierarchical (syntax) analysis : in this analysis,
characters or tokens are grouped hierarchically into
nested collections for checking them syntactically.
• 3. Semantic analysis : this kind of analysis ensures the
correctness of the meaning of the program.
ANALYSIS OF SOURCE PROGRAM
FUNDAMENTAL OF LANGUAGE
PROCESSING
 The synthesis phase is concerned with the construction of
target language statements which have the same
meaning as a source statement. It consists of two main
activities:
 Memory allocation : creation of the various data structures
of the target program.
 Code generation : generation of the target code.
PHASES OF LANGUAGE PROCESSOR
• ANALYSIS PART
• 1. LEXICAL ANALYSIS :
• Lexical analysis is also called scanning. It is the phase
of compilation in which the complete source code is
scanned and the source program is broken up into
groups of characters called tokens.
• A token is a sequence of characters having a collective
meaning.
• For example, suppose an assignment statement in the
source program is as follows:
• total = count + rate * 10
PHASES OF COMPILER
• total = count + rate * 10
• In the lexical analysis phase this statement is broken up into
the following series of tokens:
• 1. The identifier total
• 2. The assignment symbol =
• 3. The identifier count
• 4. The plus symbol +
• 5. The identifier rate
• 6. The multiplication symbol *
• 7. The constant number 10
The blank characters used in the programming
statements are eliminated during lexical analysis.
LEXICAL ANALYSIS
Parse tree for total =count + rate * 10
• Syntax analysis is also called
parsing. In this phase the tokens
generated by lexical analysis are
grouped together to form a
hierarchical structure.
• Syntax analysis determines the
structure of the source string by
grouping the tokens together.
• The hierarchical structure
generated in this phase is called a
parse tree or syntax tree.
• For the expression total = count + rate
* 10 the parse tree will look as shown above.
2. SYNTAX ANALYSIS
• In that statement, rate * 10 is considered first,
because in an arithmetic expression the multiplication
operator is performed before the addition; the addition
operation is then considered.
2. SYNTAX ANALYSIS
• Once the syntax is checked in the syntax analysis phase,
the next phase (i.e. semantic analysis) determines the
meaning of the source string: for example, matching of
parentheses in an expression, matching of
if…else statements, performing arithmetic operations
that are type compatible, or checking the scope of
variables.
 Thus the three phases perform the task of
analysis.
 After these phases, an intermediate code is generated.
3. SEMANTIC ANALYSIS
• The intermediate code is a kind of code which is easy
to generate and can be easily converted to
target code.
• This code can take a variety of forms such as three-address
code, quadruples, triples, and postfix notation.
• Intermediate code in three-address form, which resembles
an assembly language, is given below.
• The three-address code consists of instructions, each of
which has at most three operands.
• EX : t1 = inttofloat(10)
     t2 = rate * t1
     t3 = count + t2
     total = t3
4. INTERMEDIATE CODE GENERATION
• There are certain properties which should be possessed
by the three-address code:
• 1. Each three-address instruction has at most one operator
in addition to the assignment. Thus the compiler has to
decide the order of the operations devised by the three-
address code.
• 2. The compiler must generate a temporary name to hold the
value computed by each instruction.
• 3. Some three-address instructions may have fewer than
three operands, for example the first and last instructions of
the above three-address code:
• EX : t1 = inttofloat(10)
• total = t3
4. INTERMEDIATE CODE GENERATION
• The code optimization phase attempts to improve the
intermediate code.
• This is necessary to have a faster executing code or less
consumption of memory.
• Thus by optimizing the code overall running time of the
target program can be improved.
5. CODE OPTIMIZATION
• In the code generation phase the target code gets generated.
• The intermediate code instructions are translated into a
sequence of machine instructions.
• MOV rate, R1
• MUL #10.0, R1
• MOV count, R2
• ADD R2, R1
• MOV R1, total
6. CODE GENERATION
• To support the phases of the compiler, a symbol table is
maintained. The task of the symbol table is to store the
identifiers (variables) used in the program.
• The symbol table also stores information about the
attributes of each identifier. The attributes of an identifier
are usually its type, its scope, and information about the
storage allocated for it.
• The symbol table also stores information about
subroutines used in the program (in the case of a subroutine, the
symbol table stores the name of the subroutine, the number of
arguments passed to it, the types of these arguments, the
method of passing these arguments (either call by value
or call by reference) and the return type, if any).
• The symbol table allows us to find the record for each identifier
quickly and to store or retrieve data from the record
efficiently.
SYMBOL TABLE MANAGEMENT
• During compilation the lexical analyzer detects identifiers
and makes their entries in the symbol table.
• However, the lexical analyzer cannot determine all the
attributes of an identifier, so the attributes
are entered by the remaining phases of the compiler.
• Various phases use the symbol table in various ways.
EX – during semantic analysis and intermediate code
generation, we need to know what the types of identifiers are.
Then, during code generation, information about
how much storage is allocated to each identifier is used.
SYMBOL TABLE MANAGEMENT
• As programs are written by human beings, they
cannot be free from errors.
• In compilation, each phase detects errors. These errors
must be reported to the error handler, whose task is to
handle them so that compilation can proceed.
• Normally the errors are reported in the form of messages.
When input characters do not form a token,
the lexical analyzer detects this as an error.
• A large number of errors can be detected in the syntax
analysis phase. Such errors are popularly known as
syntax errors.
• During semantic analysis, type-mismatch errors are
usually detected.
ERROR DETECTION AND HANDLING
• Input a = b + c * 60
EXAMPLE ON PROCESS OF
COMPILATION
SYMBOL TABLE ENTRIES
• The compiler/interpreter uses the symbol table to achieve compile-time
efficiency.
• It associates lexical names with their attributes.
• The items to be stored in the symbol table are:
1) variable names
2) constants
3) procedure names
4) literal constants and strings
5) compiler generated temporaries
6) labels in source language
• Compiler uses following types of information from symbol table.
1) data type
2) Name
3) declaring procedures
4) offset in storage
5) if structure or record then pointer to structure
6) for parameter, whether parameter passing is by value or by
reference?
7) Number and type of arguments passed to function
8) base address
SYMBOL TABLE ENTRIES
1) Variable names : when a variable is identified, it is stored in the symbol table
by its name. The name must be unique.
2) Constants : the constants are stored in the symbol table. These constants can be
accessed by the compiler with the help of pointers.
3) Data types : the data type of the associated variable is stored in the symbol table.
4) Compiler-generated temporaries : the intermediate code is generated by the
compiler. During this process many temporaries may be generated, which are
stored in the symbol table.
5) Function names : the names of functions can be stored in the symbol table.
6) Parameter names : the parameters that are passed to a function are stored
in the symbol table. Information such as call by value or call by reference is
also stored in the symbol table.
7) Scope information : the scope of a variable, i.e., where it can be used. Scope levels:
(-1) is used to store permanent symbols such as keywords
(0) is used to store global symbols
(1) is used to store symbols defined in the main program
ATTRIBUTES OF SYMBOL TABLE
 The symbol table has the following attributes to store the
information of data:
 1. Symbol name : symbol names are the names given to
variables. They are of two types:
 (i) Fixed length
 (ii) Variable length
ATTRIBUTES OF SYMBOL TABLE
HOW TO STORE NAMES IN SYMBOL TABLE
• There are two types of name representation.
• 1. Fixed-length names
• A fixed space for each name is allocated in the symbol table. In this
type of storage, if a name is too small there is a wastage of
space.
• The name can be referred to by a pointer to its symbol table entry.
CONT…
• 2. Variable-length names
• Only the amount of space required by the string is used to store each name.
• The names can be stored with the help of the starting index and
length of each name.
• EXAMPLE
 1. Initialize the symbol table and make all its entries
empty
 2. Store a symbol and its attributes
 3. Find a symbol
 4. Insert a new symbol
 5. Delete a symbol
 6. Enter a scope level
OPERATIONS ON SYMBOL TABLE
FUNDAMENTAL OF LANGUAGE
PROCESSING
TERMS COMMONLY USED IN
STRINGS
TERM MEANING
Prefix of string A string obtained by removing zero or more tail
symbols.
For example, for the string Hindustan a prefix could be
'Hindu'
Suffix of string A string obtained by removing zero or more leading
symbols. For example, for the string Hindustan a suffix
could be 'dustan'
Substring A string obtained by removing a prefix and a suffix of a
given string is called a substring. For example, for the string
Hindustan the string 'indu' can be a substring.
Subsequence of
string
Any string formed by removing zero or more, not
necessarily contiguous, symbols. For example, 'Hisan' can
be a subsequence of Hindustan.
OPERATIONS ON LANGUAGE
OPERATION DESCRIPTION
Union of two
languages L1 and
L2
L1 U L2 = { set of strings in L1 and strings in L2 }
Concatenation of
two languages L1
and L2
L1 . L2 = { set of strings in L1 followed by set of strings
in L2 }
Kleene closure of
L
L* = L⁰ U L¹ U L² U … (i.e., the union of Lⁱ for i ≥ 0);
L* denotes zero or more concatenations of L
Positive closure of
L
L+ = L¹ U L² U … (i.e., the union of Lⁱ for i ≥ 1);
L+ denotes one or more concatenations of L
 A set which denotes a regular language, i.e., a set
which can be described by a regular expression, is
called a regular set.
 EXAMPLE : The set of identifiers is a regular set because
it can be represented using a regular expression.
REGULAR SET
Definition of Regular language
and regular expression over ∑
 The set R of regular languages over ∑ and the
corresponding regular expressions are defined
as follows:
 1. ϕ is an element of R and the corresponding regular
expression is ϕ
 2. { ^ } is an element of R and the corresponding regular
expression is ^
 3. For each a є ∑, {a} is an element of R and the
corresponding R.E. is a
Definition of Regular language and
regular expression over ∑
 4. If L1 and L2 are any elements of R and r1 and r2 are
their corresponding regular expressions, then
 (a) L1 U L2 is an element of R and the corresponding R.E. is (r1
+ r2)
 (b) L1L2 is an element of R and the corresponding R.E. is (r1 r2)
 (c) L1* is an element of R and the corresponding R.E. is (r1)*
Only those languages that can be obtained by statements 1-4 are
regular over ∑.
 EXAMPLE 1 : Write a R.E. for the language containing
the strings of length two over Σ = { 0,1 }
 R.E. = (0+1) (0+1)
 EXAMPLE 2 : Write a regular expression for the language
containing strings which end with "abb" over Σ =
{ a,b }
 R.E. = (a+b)* abb
 EXAMPLE 3 : Write a regular expression to identify an
identifier
 To denote an identifier we consider a set of letters and
digits, because an identifier is a combination of letters
and digits but always has a letter as its first
character.
 R.E. = letter (letter + digit)*
 Various tools have been built for constructing lexical
analyzers using a special-purpose notation called
regular expressions.
 The regular expressions are used in the recognition of
tokens.
 A tool called LEX provides a special language that specifies
the tokens using regular expressions.
 A LEX file has the .l extension. Suppose we create one file
x.l.
 This x.l is then given to the LEX compiler to produce lex.yy.c.
 This lex.yy.c is a C program which is actually the lexical
analyzer program.
 As the specification file stores the regular
expressions for tokens, the lex.yy.c file consists of a tabular
representation of the transition diagrams constructed for
those expressions.
A language for specifying lexical
analysis
 The lexemes can be recognized with the help of the
tabular transition diagrams and standard routines.
 In the LEX specification file, actions are associated
with each regular expression.
 These actions are simply C code.
 This C code is carried over directly into the lex.yy.c file.
 Finally, a C compiler compiles the generated lex.yy.c and
produces an object program a.out. When some
input stream is given to a.out, a sequence of
tokens is generated.
A language for specifying lexical
analysis
A language for specifying lexical
analysis
 The LEX program consists of three parts
 1. Declaration section
 2. Rule section
 3. Procedure section
A language for specifying lexical
analysis
%{
DECLARATION SECTION
%}
%%
RULE SECTION
%%
AUXILIARY PROCEDURE SECTION
In the declaration section, declarations of variables and
constants can be made.
Some regular definitions can also be written in this
section; the regular definitions are basically named
components of regular expressions.
 The rule section consists of regular expressions
associated with actions. These translation rules can be
given in the form shown below.
 The third section is the auxiliary procedure section, in which
all the required procedures are defined. Sometimes
these procedures are required by the actions in the rule
section.
 The lexical analyzer (scanner) works in coordination
with the parser.
 When activated by the parser, the lexical analyzer begins
reading its remaining input, one character at a time.
A language for specifying lexical
analysis
R1 { action1 }
R2 { action2 }
.
.
.
Rn { actionn }
where each Ri is a regular expression and
each actioni is a program fragment
describing what action is to be taken for the
corresponding regular expression
 When a string matches one of the regular
expressions Ri, the corresponding actioni is
executed, and this actioni returns control to the
parser.
 The search for lexemes is repeated in
order to return all the tokens in the source string.
 The lexical analyzer ignores white space and
comments in this process.
A language for specifying lexical
analysis
%{
#include <stdio.h>
%}
%%
Rama|Seeta|Geeta|Neeta {
printf("\nNoun");
}
sings|dances|eats {
printf("\nVerb");
}
%%
int main()
{
yylex();
return 0; }
int yywrap()
{
return 1;
}
 The program on the previous slide recognizes
nouns and verbs from the input string.
 There are three sections in that program.
 The section starting and ending with %{ and %}
respectively is the definition section.
 The section starting with %% is called the rule section;
this section is closed by %%.
 The part within %% consists of regular expressions and
actions. Rule 1 gives the definition of a noun and the second
rule gives the definition of a verb.
 The third section consists of two functions: the
main function and the yywrap function.
 Here the main function calls the yylex() function. The yylex()
function is defined in the lex.yy.c file.
A language for specifying lexical
analysis
 First we compile our program x.l using the
LEX compiler, and the LEX compiler
automatically generates a C program named lex.yy.c.
This lex.yy.c makes use of the regular expressions and
corresponding actions defined in x.l.
 Hence our program x.l is called the LEX specification
file.
 When we compile lex.yy.c using the gcc compiler as gcc
lex.yy.c, we get an output file a.out (the default
output file on the LINUX platform), and on execution of
a.out we can give an input string.
A language for specifying lexical
analysis
$ lex x.l        (generates lex.yy.c)
$ gcc lex.yy.c   (compiles lex.yy.c; cc can also be used in place of gcc)
$ ./a.out        (runs the executable file)
A language for specifying lexical
analysis
$ lex x.l
$ gcc lex.yy.c
$ ./a.out
Rama eats
Noun
Verb
Seeta sings
Noun
Verb
After entering these commands, a blank line for entering
input becomes available. We can then give some valid input,
and press CTRL+C or CTRL+D to come out of the output.
LEX specification and features
REGULAR
EXPRESSION
MEANING
* Matches zero or more occurrences of the
preceding expression. For example, 1* matches
any number of occurrences of 1
. Matches any single character other than new
line
[ ] A character class which matches with any
character within the bracket.
For example: [a-z] matches with any alphabet
in lower case.
( ) Group of regular expressions together put in
to a new regular expression
r{m,n} m to n occurrences of r. Example: a{3,5}
LEX specification and features
REGULAR
EXPRESSION
MEANING
$ Matches with the end of line as last character.
+ Matches with one or more occurrence of
preceding expression.
Example: [0-9]+ any number but not empty
string
? Matches zero or one occurrence of preceding
regular expression. For example [+-]? [0-9]+ a
number with unary operator
^ Matching the beginning of a line as first
character.
[^S] Used for negation: matches any character except
those in S. For example, [^verb] matches any
character other than v, e, r or b
\ Used as the escape metacharacter
 1. BEGIN :- It indicates the start state. The lexical
analyzer starts in state 0.
 2. ECHO :- It emits the input as it is.
 3. yytext :- when the lexer matches or recognizes a
token from the input, the lexeme is stored in a null-
terminated string called yytext.
 As soon as a new token is found, the content of yytext
is replaced by the new token.
 4. yylex() :- as soon as a call to yylex() is encountered,
the scanner starts scanning the source program.
 5. yywrap() :- the function yywrap() is called when the
scanner encounters the end of a file. If yywrap() returns
0, the scanner continues scanning. When
yywrap() returns 1, it means the end of the file has been
reached and scanning stops.
LEX Actions
 6. yyin :- It is the standard input file that stores the input
source program.
 7. yyleng :- when the lexer recognizes a token, the
lexeme is stored in the null-terminated string
yytext, and yyleng stores the length of that string; so we
can say that yyleng is the same as strlen(yytext).
 8. HOW TO WRITE main() in LEX
int main()
{
yylex();
}
LEX Actions
 9. Where to write C code?
 We can write valid 'C' code between %{ and %}.
 We can write any C function in the subroutine
section.
 C code appears in the action part for the corresponding
regular expression.
 10. THE RECOGNIZER WORKS IN THE FOLLOWING WAYS:
 i. If more than one pattern matches, the
recognizer chooses the longest lexeme
matched.
 ii. If there are two or more patterns that match the
longest lexeme, the first listed matching pattern is
chosen.
LEX Actions
 Data structures are classified on the basis of the following criteria:
 1. Nature of the data structure : linear or non-linear data
structure
 2. Purpose of the data structure : search or allocation data
structure
 3. Lifetime of the data structure : used during language
processing or during target program execution
 A linear data structure consists of a linear arrangement of
elements in memory. It requires a contiguous area of
memory for its elements, which can lead to wastage of memory.
 The elements of a non-linear data structure are accessed using
pointers, so the elements need not occupy a contiguous area of
memory and there is no wastage of memory. However, this leads
to lower search efficiency.
SYMBOL TABLE DATA STRUCTURE FOR
LANGUAGE PROCESSING
 Search data structures are used during language
processing to maintain attribute information concerning
different entities in the source program.
 This type of data structure is characterized by the fact
that the entry for an entity is created only once but may be
searched for a large number of times; the important
point here is search efficiency.
 An allocation data structure is characterized by the fact that
the address of the memory area allocated to an entity is known
to the user of that entity.
 In this method, search operations are not conducted. The
important points are allocation/de-allocation speed
and efficiency of memory utilization for this type of data
structure.
SYMBOL TABLE DATA STRUCTURE FOR
LANGUAGE PROCESSING
 A search data structure is a set of entries, each entry
accommodating the information concerning one entity. Each entry
contains a key field, and this field is used for searching.
 ENTRY FORMATS
 A set of fields is used in a search structure for each entry.
An entry consists of two parts:
 1. Fixed part
 2. Variant part
 A compiler's symbol table has the following entries:
 a. Fixed part : fields symbol and class
 b. Variant parts :
SEARCH DATA STRUCTURES
SEARCH DATA STRUCTURES
Sr. No | Tag Value      | Variant Part Fields
1      | Variable       | Type, length, dimension information
2      | Label          | Statement number
3      | Procedure name | Address of parameter list, number of parameters, type of return value, length of returned value
 Entry format
 a. fixed length
 b. Variable length
HOW TO STORE NAMES IN SYMBOL TABLE
• There are two types of name representation.
• 1. Fixed-length names
• A fixed space for each name is allocated in the symbol table. In this type
of storage, if a name is too small there is a wastage of space.
• The name can be referred to by a pointer to its symbol table entry.
• A benefit of this linear organization is that it enables the use of efficient
search procedures.
CONT…
• 2. Variable-length names
• Only the amount of space required by the string is used to store each name.
• The names can be stored with the help of the starting index and
length of each name.
• There is no memory wastage in this organization.
• EXAMPLE
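A minimal sketch of this scheme, assuming a hypothetical character pool in which names are stored back to back and referenced by (starting index, length) pairs:

```c
#include <string.h>

/* Illustrative name pool: all names share one character array; each
   symbol records only its starting index and length, so no space is
   wasted on padding. (No overflow check in this sketch.) */
#define POOL_SIZE 256

static char pool[POOL_SIZE];   /* names stored back to back   */
static int  pool_used = 0;     /* next free index in the pool */

struct name_ref { int start; int len; };

/* Append a name to the pool and return its (start, length) reference. */
static struct name_ref store_name(const char *name) {
    struct name_ref r = { pool_used, (int)strlen(name) };
    memcpy(pool + pool_used, name, (size_t)r.len);
    pool_used += r.len;
    return r;
}
```

Storing "count" and then "rate" yields references (0, 5) and (5, 4) into the pool "countrate".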
HYBRID ENTRY FORMAT
 The hybrid entry format combines the access efficiency of the fixed entry format with the memory efficiency of the variable entry format.
 In this format, each entry is split into two halves: a fixed part and a variant part.
 A pointer field in the fixed part points to the variant part of the entry.

[Figure: hybrid entry format — | Fixed part | Pointer | --> | Length | Variant part |]
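The hybrid format can be sketched as a fixed-size record whose pointer field leads to a separately allocated variant half that records its own length; the names here are illustrative:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative hybrid entry: constant-size fixed half, plus a pointer
   to a variable-size variant half allocated elsewhere. */
struct variant_part {
    int  length;              /* size of the data that follows      */
    char data[];              /* flexible array member (C99)        */
};

struct hybrid_entry {
    char symbol[16];          /* fixed half: key field              */
    struct variant_part *vp;  /* pointer to the variant half        */
};

/* Attach a variant part holding len bytes copied from src.
   (No allocation-failure handling in this sketch.) */
static void attach_variant(struct hybrid_entry *e, const void *src, int len) {
    e->vp = malloc(sizeof(struct variant_part) + (size_t)len);
    e->vp->length = len;
    memcpy(e->vp->data, src, (size_t)len);
}
```

Every fixed half has the same size, so the table itself can use efficient fixed-length search, while the variant halves consume only the memory each entry actually needs.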
OPERATIONS ON SEARCH DATA STRUCTURES
 1. Add : add the entry of a symbol to the symbol table.
 2. Search : search for and locate the entry of a symbol.
 3. Delete : delete the entry of a symbol.

TABLE ORGANIZATION
 A table is a linear data structure. The entries of a table occupy adjoining areas of memory.
 Fixed-length entries are used in linear data structures.
[Figure: a table of n entries; entries #1 to #f are occupied, the remaining entries are free]
 Symbols used:
 n = number of entries in the table
 f = number of occupied entries
 Operations
 1. Add a symbol : the symbol is added to the first free entry in the table, and the value of f is updated accordingly.
 2. Delete a symbol : deletion can be done in two ways:
 a. Physical deletion : the entry is deleted by erasing it or by overwriting it.
 b. Logical deletion : the entry is marked as deleted by adding some information to it indicating its deletion.
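The add operation and logical deletion can be sketched as follows, assuming a hypothetical `active` flag as the extra information that records a logical deletion:

```c
#include <string.h>

/* Illustrative linear table: f occupied entries out of n slots.
   Deletion is logical: an entry is marked inactive, not erased. */
#define N 16                       /* n = number of entries in the table */

struct slot { char symbol[32]; int active; };

static struct slot table[N];
static int f = 0;                  /* f = number of occupied entries */

/* Add a symbol to the first free entry and update f. */
static int add_symbol(const char *name) {
    if (f == N) return -1;         /* table full */
    strncpy(table[f].symbol, name, sizeof table[f].symbol - 1);
    table[f].active = 1;
    return f++;
}

/* Logical deletion: record the deletion in the entry itself. */
static void delete_symbol(int idx) {
    table[idx].active = 0;
}
```

A search routine over such a table would skip entries whose `active` flag is clear, so logically deleted symbols become invisible without any data movement.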
ALLOCATION DATA STRUCTURES
 1. Stack
 PROPERTIES
 1. A stack is an unbounded array of entries that is handled in a last-in, first-out (LIFO) manner: the last entry stored is the first one removed.
 2. Only the last entry is accessible at any time.
 a. Stack pointer (SP) : indicates the position of the entry at the top of the stack.
 b. Stack base (SB) : points to the first word of the stack area.
 c. Top of stack (TOS) : points to the last entry allocated in the stack.
 When an entry is pushed on the stack, TOS is incremented by 1; when an entry is popped, TOS is decremented by 1.
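The TOS discipline described above, where a push increments TOS before storing and a pop reads the top entry before decrementing, can be sketched as:

```c
/* Illustrative stack model: TOS indexes the last allocated entry, so
   an empty stack has TOS == -1 (just below the stack base). */
#define STACK_SIZE 8

static int stack_area[STACK_SIZE];  /* stack_area[0] is the stack base (SB) */
static int tos = -1;                /* TOS: last entry allocated            */

/* Push: increment TOS by 1, then store the entry there. */
static int push(int value) {
    if (tos + 1 == STACK_SIZE) return -1;   /* overflow */
    stack_area[++tos] = value;
    return 0;
}

/* Pop: read the top entry, then decrement TOS by 1. */
static int pop(void) {
    return stack_area[tos--];   /* caller must ensure the stack is non-empty */
}
```

Because only the top entry is accessible, allocation and de-allocation are just pointer arithmetic on TOS, which is what makes stack allocation so fast.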
EXTENDED STACK MODEL
 Apart from SB and TOS, a record base pointer (RB) and reserved pointers are used in the extended stack model.
 The record base pointer points to the first word of the last record in the stack.
 The reserved pointer is the first word of each record.

[Figure: the extended stack model — (b) allocation, (c) de-allocation]
HEAP
 A heap is a non-linear data structure. It permits the allocation and de-allocation of entries in a random order.
 There is no implicit way to access an allocated memory area, so pointers are used for allocation and de-allocation.
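Random-order allocation and de-allocation through pointers is exactly the behavior that C's `malloc`/`free` provide; a small illustrative check:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of heap behavior: entries are allocated and freed in an
   arbitrary order, and each is reachable only through its pointer. */
int demo_heap(void) {
    char *a = malloc(8);
    char *b = malloc(8);
    strcpy(a, "first");
    strcpy(b, "second");
    free(a);               /* freed in a different order than allocated */
    int ok = (strcmp(b, "second") == 0);  /* b remains valid after freeing a */
    free(b);
    return ok;
}
```

Unlike the stack, where de-allocation must undo the most recent allocation, the heap allocator must track free areas internally so that any entry can be released at any time.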
END OF CHAPTER!!!
Overview of language processor course d&a

  • 1.
    PREARED BY: Zinal Gohil ASST.PROF. (CE/IT) (PPSU,SOE) 2. OVERVIEW OF LANGUAGE PROCESSOR
  • 2.
     Programming Languagesand Language Processors,  Language Processing Activities,  Program Execution,  Fundamental of Language Processing,  Symbol Tables;  Data Structures for Language Processing: Search Data structures, Allocation Data Structures OUTLINE
  • 3.
     Language processorcomprises of compilers, assemblers and interpreters  Programmers write software’s in variety of languages and compilers and interpreters translates it in to instructions that are understood by the computer at machine level.  Language processing activities occur due to difference between how software is made by the programmer and how it is implemented by computer.  The software programmers mentions two domains  1. APPLICATION DOMAIN : To present the idea  2. EXECUTION DOMAIN : for carrying of these ideas OVERVIEW OF PROGRAMMING LANGUAGE AND LANGUAGE PROCESSOR
  • 4.
     Semantics pertainsto the meaning of words. The semantics of a language is a description of what the sentences mean. It is much more difficult to express the semantics of a language than it is to express the syntax.  In order to implement a programming language we must know what each sentence means (declaration, expression, etc).  E.g., does the sentence  produce an output,  take any inputs,  change the value stored in a variable,  produce an error. TERMINOLOGIES 1. SEMANTICS (MEANING)
  • 5.
     Domain: Itrefers to the scope or sphere of any activity.  Application Domain: The scope of an application is its application domain.  E.g., the application domain of an inventory program is warehouse and its associated tangibles (goods, machinery, etc), transactions (e.g., receiving goods, purchase orders, locating goods, shipping of goods, receiving payments, etc), people (e.g., workers, managers, customers).  All the above are objects in the application domain. The application domain can best be described by a person in that domain. E.g., the warehouse manager in the above example. TERMINOLOGIES 2. APPLICATION DOMAIN
  • 6.
     Execution Domain:(also called as the solution domain). The execution domain is the work of programmers, e.g., program code, documentation, test results, files, computers, etc.  The solution domain is partitioned into two levels:  Abstract, high-level documents, such as flow charts, diagrams  Low-level – data structures, function definitions, etc. TERMINOLOGIES 3. EXECUTION DOMAIN
  • 7.
     The differencebetween the semantics of the application domain and the execution domain is called the semantic gap. TERMINOLOGIES 4. SEMANTIC GAP
  • 8.
     Consequences ofsemantic gap:  Large development times – interaction between designers in application domain and programmers.  Large development efforts.  Poor quality of software. CONSEQUENCES OF SEMANTIC GAP
  • 9.
     The semanticgap is reduced by programming languages (PL). The use of a PL introduces a new domain called the programming language domain (or PL domain).  The PL domain bridges the gap between the application domain and the execution domain. HOW IS THE SEMANTIC GAP REDUCED?
  • 10.
     Specification gap:It is the semantic gap between the application domain and the PL domain.  It can also be defined as the semantic gap between the two specifications of the same task.  The specification gap is bridged by the software development team.  Execution gap: It is the gap between the semantics of programs written in different programming languages.  The execution gap is bridged by the translator or interpreter. SPECIFICATION GAP AND EXECUTION GAP Advantages of introducing the PL domain: (a) Large development times are reduced. (b) Better quality of software. (c) Language processor provides diagnostic capabilities which detects errors
  • 11.
     Language Processor:It is a software which bridges the specification or execution gap.  Language Processing: It is any activity performed by a language processor.  Diagnostic capability is a feature of a language processor. The input of a language processor is the source program. The output of a language processor is the target program. The target program is not produced if the language processor finds any errors in the source program. TERMS
  • 12.
     (a) LanguageTranslator: This bridges the execution gap to the machine language of a computer system. Examples are compiler and assembler.  (b) De-translator: Similar to translator, but in the opposite direction.  (c) Preprocessor: This is a language processor whose source and target languages are both high level, i.e., no translation takes place.  (d) Language migrator : It fills the specification gap between two PL’s.(Used to convert program written in one programming language in to another programming language) It may be used to provide portability of program by migrating it to more modern programming language  The quality of target program is depends on semantics of two programming languages TYPES OF LANGUAGE PROCESSOR
  • 13.
     In caseof problem-oriented languages. The PL domain is very close to the application domain. The specification gap is reduced in this case. Such PLs can be used only for specific applications, hence they are called problem-oriented languages.  They have a large execution gap, but the execution gap is bridged by the translator or interpreter. Using these languages, we only have to do specify “what to do”.  Software development takes less time using problem- oriented languages, but the resultant code may not be optimized. Examples : Fourth generation languages (4GL) like SQL. PROBLEM-ORIENTED LANGUAGES:
  • 14.
     These providegeneral facilities and features which are required in most applications. These languages are independent of application domains.  Hence, there is a large specification gap. The gap must be bridged by the application designer. Using these languages, we have to specify “what to do” and “how to do”.  Examples. C, C++, FORTRAN, etc. PROCEDURE-ORIENTED LANGUAGES:
  • 15.
     A compileris a language translator. It translates a source code (programs in a high-level language) into the target code (machine code, or object code).  To do this translation, a compiler steps through a number of phases. The simplest is  a 2-phase compiler. The first phase is called the front end and the second phase is called the back end. COMPILERS
  • 16.
     Front End:The front end translates from the high-level language to a common intermediate language. The front end is source language dependent but it is machine- independent. Thus, the front end consists of the following phases:  lexical analysis, syntactic analysis, creation of symbol table, semantic analysis and generation of intermediate code. The front end also includes error-handling routines for each of these phases.  Back End: The back end translates from this common intermediate language to the machine code.  The back end is machine dependent. This includes code optimization, code generation, error-handling and symbol table operations. Thus, a compiler bridges the execution gap. COMPILERS
  • 17.
     It isa language processor. It also bridges the execution gap but does not generate the machine code. An interpreter executes a program written in a high level language.  The essential difference between a compiler and an interpreter is that while a compiler generates the machine code and is then no longer needed, an interpreter is always required. INTERPRETER
  • 18.
     Language processingactivities are related to specification gap and execution gap.  It is divided in to two types.  1. Program generation  2. Program Execution  Aim of program generation activity is to generate automatic program. In this activity the specification language of application domain is the source language  A procedure oriented language is the target language. Source language is specification language  1. PROGRAM GENERATION ACTIVITY  Program generator is a system software. Program specification is input to this system software. It generates output in target language. LANGUAGE PROCESSING ACTIVITIES
  • 19.
     Here thespecification gap is gap between application domain and program generator domain. LANGUAGE PROCESSING ACTIVITIES User Application Domain Program Generator Domain Target Programming Language Domain Program Execution Domain Specification Gap
  • 20.
     It reducesspecification gap . and reliability of generated program is increases. It also helps programmer for easily writing specification of program.  Compiler is used to bridge the gap between target PL and the execution domain.  2. PROGRAM EXECUTION  Methods of program execution  (a) Program translation  (b) Program interpretation LANGUAGE PROCESSING ACTIVITIES
  • 21.
     (a) Program Translation It bridges the execution gap by translating source program in to target program.  Source program is written in to programming language and a target program is an assembly language. LANGUAGE PROCESSING ACTIVITIES
  • 22.
     Characteristics:  1.Before execution of program, it must be translated  2. Translated program may be saved in to files  3. Program must be retranslated with modifications.  (b) PROGRAM INTERPRETATION  It reads source program and store it in main memory. During program interpretation it takes a source statement and determines its meaning then it perform the actions which to be implement on that statements.  The action may be computational and input –output. LANGUAGE PROCESSING ACTIVITIES
  • 23.
    Program counter increments the memoryaddress for next instruction. CPU uses program counter for next instruction.  Instruction execution cycle consists of three steps  1. Fetch 2. Decode 3. Execution LANGUAGE PROCESSING ACTIVITIES (b) Execution
  • 24.
     This cycleis repeated for all instructions. The instruction address in the program counter is updated at the end of the cycle CPU select the next instruction for execution.  The above process is called interpretation cycle. Interpretation cycle consists of :  1. Fetch the statement  2. Analyze the statement  3. Execute the statement  CHARACTERISTICS OF INTERPRETATION 1. Source program is retained in source form itself  2. Statement is analyzed during interpretation LANGUAGE PROCESSING ACTIVITIES
  • 25.
    FUNDAMENTAL OF LANGUAGE PROCESSING Language processing is the combination of analysis of SP and synthesis of TP. specification of source program consists of three components.  1. Lexical rule  2. Syntax Rule  3. Semantic Rule
  • 26.
    • The sourceprogram can be analyzed in three phases- • 1. Linear-lexical Analysis : In this type of analysis the source string is read from left to right and grouped in to tokens. • EX : Tokens for a language can be identifiers, constants, relational operations, keywords. • 2. Hierarchical(Syntax) Analysis : In this analysis, characters or tokens are grouped hierarchically in to nested collections for checking them syntactically. • 3. Semantic Analysis : This kind of analysis ensures the correctness of meaning of program. ANALYSIS OF SOURCE PROGRAM
  • 27.
    FUNDAMENTAL OF LANGUAGE PROCESSING Synthesis phase is concerned with the construction of target language statements which have the same meaning as a source statement . It consists of two main activities  Code optimization : generation of various data structures of target program.  Code generation : It generates the target code.
  • 28.
  • 29.
    • ANALYSIS PART •1. LEXICAL ANALYSIS : • The lexical analysis is also called scanning. It is the phase of compilation in which the complete source code is scanned and your source program is broken up in to group of strings called token. • A token is a sequence of characters having a collective meaning. • For example if some assignment statement in your source program is as follow: • total =count + rate * 10 PHASES OF COMPILER
  • 30.
    • total =count+ rate * 10 • In lexical Analysis phase this statement is broken up in to series of tokens as follow: • 1. the identifier total • 2. The assignment symbol • 3. the identifier count • 4. The plus symbol • 5. The identifier rate • 6. The multiplication symbol • 7. the constant number 10 The blank characters which are used in the programming statements are eliminated during lexical analysis. LEXICAL ANALYSIS Parse tree for total =count + rate * 10
  • 31.
    • The syntaxanalysis is also called parsing. In this phase the tokens generated by the lexical analysis are grouped together to form hierarchical structure. • The syntax analysis determines the structure of source string by grouping the tokens together. • The hierarchical structure generated in this phase is called parse tree or syntax tree. • For expression total= count + rate *10 the parse tree will like below 2. SYNTAX ANALYSIS
  • 32.
    • In thatstatement first rate *10 will be considered because in arithmetic expression the multiplication operator should be performed before the addition. And then addition operation will considered. 2. SYNTAX ANALYSIS
  • 33.
    • Once thesyntax is checked in the syntax analyzer phase the next phase (i.e. semantic analysis) determines the meaning of source string. For example meaning of matching parenthesis in the expression or matching of if…else statements or performing arithmetic operations that are type compatible or checking the scope of variable.  Thus the three phases are performing the task of analysis.  After these phases an intermediate code gets generated 3. SEMANTIC ANALYSIS
  • 34.
    • The intermediatecode is the kind of code which is very easy to generate and this code can be easily converted to target code. • This code is in variety of form such as three address code, quadruple, triple, posix. • Intermediate code in three address form is given below which is like an assembly language. • The three address code consists of instructions each of which has at the most three operands • EX : t1 = int to float(10) t2 = rate * t1 t3 = count + t2 total = t3 • There are certain properties which should be processed by the three address code 4. INTERMEDIATE CODE GENERATION
  • 35.
    • There arecertain properties which should be processed by the three address code • 1. Each three address instruction as at most one operator in addition to the assignment. Thus the compiler has to decide the order of the operations devised by the three address code. • 2. Compiler must generate a temporary name to hold the value computed by each instruction. • 3. Some three address instructions may have fewer then three operands for example first and last instruction of above three address code. • EX : t1 = int to real (10) • Total = t3 4. INTERMEDIATE CODE GENERATION
  • 36.
    • The codeoptimization phase attempts to improve the intermediate code. • This is necessary to have a faster executing code or less consumption of memory. • Thus by optimizing the code overall running time of the target program can be improved. 5. CODE OPTIMIZATION
  • 37.
    • In thisgeneration phase the target code gets generated. • The intermediate code instructions are translated in to sequence of machine instructions. • MOV rate, R1 • MUL #10.0, R1 • MOV count, R2 • ADD R1, R1 • MOV R1, total 6. CODE GENERATION
  • 38.
    • To supportphases of compiler symbol table is maintained. The task of symbol table is to store identifiers (variables) used in the programs. • The symbol table also stores the information about attributes of each identifier. The attributes of identifier are usually it’s type, it’s scope, information about the storage allocated for it. • The symbol table also stores information about subroutines used in program (In case subroutine, the symbol table stores the name of subroutine, number of arguments passed to it, type of these arguments, the method of passing these arguments –either call by value or call by reference and return type if any ) • The symbol table allows to find records for each identifier quickly and to store or retrieve data from the record efficiently. SYMBOL TABLE MANAGEMENT
  • 39.
    • During compilationlexical analyzer detects the identifier and makes its entry in the symbol table. • How ever lexical analyzer can not determine all the attributes of an identifier and therefore the attributes are entered by remaining phases of compiler. • Various phases can use the symbol table in various ways. EX – while doing semantic analysis the intermediate code generation, we need to know what type of identifier are. Then during code generation typically information about how much storage is allocated to identifier is seen. SYMBOL TABLE MANAGEMENT
  • 40.
    • As programsare written by human beings therefore they can not be free from errors. • In compilation, each phase detects errors. These errors must be reported to error handler whose task is to handle the errors so that the compilation can proceed. • Normally the errors are reported in form of messages. When input character from the input do not form token, the lexical analyzer detects it as error. • Large number of errors can be detected in syntax analysis phase. Such errors are popularly known as syntax errors. • During semantic analysis type mismatch kind of errors is usually detected. ERROR DETECTION AND HANDLING
  • 41.
    • Input a= b + c * 60 EXAMPLE ON PROCESS OF COMPILATION
  • 43.
    SYMBOL TABLE ENTRIES •Compiler/interpreter uses symbol table to achieve compile time efficiency. • It associates lexical names with their attributes. • the items to be stored in symbol table are: 1) variable names 2) constants 3) procedure names 4) literal constants and strings 5) compiler generated temporaries 6) labels in source language
  • 44.
    • Compiler usesfollowing types of information from symbol table. 1) data type 2) Name 3) declaring procedures 4) offset in storage 5) if structure or record then pointer to structure 6) for parameter, whether parameter passing is by value or by reference? 7) Number and type of arguments passed to function 8) base address SYMBOL TABLE ENTRIES
  • 45.
    1) variable names: when variable is identified, it is stored in symbol table by it’s name. The name must be unique. 2) Constants : The constants are stored in symbol table. These constants can be accessed by compiler with the help of pointers. 3) Data types: The data type of associated variable is stored in symbol table. 4) compiler generated temporaries : The intermediate code is generated by compiler. During this process many temporaries may generated which are stored in symbol table. 5) Function names : The names of functions can be stored in symbol table. 6) parameter names : The parameter that are passed to the function are stored in symbol table. The information such as call by value or call by reference is also stored in symbol table. 7) scope information : The scope of variable, where it can be used, (-1) is used to store permanent symbols such as keywords (0) is used to store global symbols (1) is used to store symbols defined in main program ATTRIBUTES OF SYMBOL TABLE
  • 46.
     Symbol tablehave following attributes to store the information of data  1. Symbol name :Symbol names are the name given to the variable. They are of two types:  (i) Fixed length  (ii) Variable length ATTRIBUTES OF SYMBOL TABLE
  • 47.
    HOW TO STORENAMES IN SYMBOL TABLE • There are two types of name representation. • 1. Fixed length name • A fixed space for each name is allocated in symbol table. In this type of storage if name is too small then there is a wastage of space. • The name can be referred by pointer to symbol table entry
  • 48.
    CONT… • 2. Variablelength record • Amount of space required by string is used to store names. • The names can be stored with the help of starting index and length of each name. • EXAMPLE
  • 49.
     1. Initializethe symbol table and make all it’s entries empty  2. Store the symbol and it’s attribute  3. Find a symbol  4. Insert the new symbol  5. delete a symbol  6. enter scope level OPERATIONS ON SYMBOL TABLE
  • 50.
  • 51.
    TERMS COMMONLY USEDIN STRINGS TERM MEANING Prefix of string A string obtained by removing zero or more tail symbols. For example for string Hindustan the prefix could be ‘Hindu’ Suffix of string A string obtained by removing zero or more leading symbols ,For example , for string Hindustan the suffix could be ‘dustan’ Substring A string obtained by removing prefix and suffix of a given string is called substring. For example For string Hindustan the srting ‘indu’ can be substring. Sequence of string Any string formed by removing zero or more not necessarily the contiguous symbols is called sequence of string. For example Hisan can be sequence of string
  • 52.
    OPERATIONS ON LANGUAGE OPERATIONDESCRIPTION Union of two languages L1 and L2 L1 U L2 = { set of strings in L1 and strings i L2 } Concatenation of two languages L1 and L2 L1 . L2 = { set of strings in L1 followed by set of strings in L2 } Kleene closure of L Positive closure of L     0 i L * L i L of ions concatenat more or one denotes L , L 1 i      L L* denotes zero or more Concatenations of L
  • 53.
     The finiteset which denotes a regular language and the set which can be described by regular expression is called regular set.  EXAMPLE : A set of identifier is regular set because it can be represented using regular expression. REGULAR SET
  • 54.
    Definition of Regularlanguage and regular expression over ∑  The set R of regular language over and ∑ corresponding regular expressions are defined as follow :  1. ϕ is an element of R and corresponding regular expression is ϕ  2. { ^ } is an element of R and corresponding regular expression is ^  3. for each a є A, {a} is an element of R and corresponding R.E. is a
  • 55.
    Definition of Regularlanguage and regular expression over ∑  4. if L1 and L2 are any elements of R and r1 and r2 are it’s corresponding regular expressions then  (a) L1 U L2 is an element of R and corresponding R.E. is (r1 + r2)  (b) L1L2 is an element of R and corresponding R.E. is (r1 r2)  (c) L1 * is an element of R and corresponding R.E. is ( r1 ) * only those language that can be obtained by statement 1-4 are regular over ∑
  • 56.
     R.E. =(0+1) (0+1)  EXAMPLE 2: regular expression for language containing string which ends with “abb” over Σ= { a,b}  R.E. = (a+b) * abb  Example 3:Write regular expression to identify identifier  To denote identifier we consider a set of latters and digits because identifier is a combination of letter and digit but having first character as letter always.  R. E. = letter (letter + digit )* EXAMPLE 1 : write a R.E. for language containing the strings of length two over Σ= { 0,1}
  • 57.
     Various toolhas been built for constructing lexical analyzers using the special purpose notations called regular expressions.  The regular expressions are used in recognition of tokens.  A tool called LEX gives a special language that specifies the tokens using regular expressions.  The LEX file has .l extension. suppose we create one file x.l .  This x.l is then given to LEX compiler to produce lex.yy.c .  This lex.yy.c is a C program which is actually a lexical analyzer program.  As we know that specification file stores the regular expression for tokens, the lex.yy.c file consists of tabular representation of transition diagrams constructed for A language for specifying lexical analysis
  • 58.
     The lexemescan be recognized by with help of tabular transition diagrams and standard routines.  In specification file of LEX actions are associated with each regular expression.  This actions are simply C code.  This C code is directly carried out over lex.yy.c file.  Finally C compiler compiles generated lex.yy.c and produces an object program a.out. when some input stream is given to a.out then sequence of token is generated. A language for specifying lexical analysis
  • 59.
    A language forspecifying lexical analysis
  • 60.
     The LEXprogram consists of three parts  1. Declaration section  2. Rule section  3. Procedure section A language for specifying lexical analysis % { DECLARATION SECTION %} %% RULE SECTION %% AUXILARY PROCEDURE SECTION In declaration section declaration of variables, constants, can be done. Some regular definitions can also be written in this section. the regular definitions are basically components of regular
  • 61.
     The Rulesection consists of regular expressions associated with actions. These transition rules can be given in form as-  And third section auxiliary procedure section in which all the required procedures are defined. Some times these procedures are required by actions in the rule section.  the lexical analyzer and scanner works in co-ordination of parser.  When activated by parser , lexical analyzer begin reading its remaining input, character by character at a A language for specifying lexical analysis R1 { action1 } R2 { action 2 } . . . Rn { action n} Where each Ri is a regular expression and each actioni is a program fragment describing what action is to be taken for corresponding regular expression
  • 62.
     When stringis matched with one of the regular expression Ri then corresponding actioni will get executed and this actioni returns the controller to parser.  The repeated search for the lexeme can be made in order to return all the tokens in source string.  The lexical analyzer ignores the white spaces and comments in this process. A language for specifying lexical analysis % { #include<stdio.h> %} %% Rama | Seeta|Geeta| Neeta { printf (“n Noun “); } Sings| Dances | eats { printf(“n verb”); } int main() { yylex(); return 0; } int yywrap() { return 1; }
  • 63.
     The programmentioned in previous slide recognizes noun and verb from the string clearly.  There are three section in that program  The section starting and ending with % { and %} respectively is a definition section.  The section starting with %% is called rule section. this section is closed by %%  within %% consists of regular expression and actions. Rule 1 gives definition of noun and second rule gives definition of verb.  The third section consists of two functions the main function and yywrap function.  here main function calls yylex() function. yylex() function is defined in lex.yy.c file. A language for specifying lexical analysis
  • 64.
     first wewill compile our above program x.l using lex compiler and then LEX compiler will automatically generates C program named lex.yy.c. This lex.yy.c makes use of regular expression and corresponding actions defined in x.l.  Hence our program x.l is called lex specification file.  When we compile lex.yy.c using gcc compiler as cc lex.yy.c , we get an output file a.out – default output file of LINUX platform and on execution of a.out we can give input string A language for specifying lexical analysis $ lex x.l $ gcc lex.yy.c $ ./a.out This command generates lex.yy.c This command compiles lex.yy.c (we can also use gcc in place of cc This command runs executable file
  • 65.
    A language forspecifying lexical analysis Rama eats Noun Verb Seeta sings Noun Verb After entering these commands a blank space for entering input gets available. Then we can give some valid input. Then press CTRL +C or CTRL + D to come out of output. $ lex x.l $gcc lex.yy.c $ ./a.out
  • 66.
    LEX specification andfeatures REGULAR EXPRESSION MEANING * Matching with zero or more occurrences of preceding expression. For example, 1* occurrence of 1 for any number of times . Matches any single character other than new line [ ] A character class which matches with any character within the bracket. For example: [a-z] matches with any alphabet in lower case. ( ) Group of regular expressions together put in to a new regular expression r[m,n] m to n occurrence of r example : a[3,5]
  • 67.
    LEX specification andfeatures REGULAR EXPRESSION MEANING $ Matches with the end of line as last character. + Matches with one or more occurrence of preceding expression. Example: [0-9]+ any number but not empty string ? Matches zero or one occurrence of preceding regular expression. For example [+-]? [0-9]+ a number with unary operator ^ Matching the beginning of a line as first character. [ ^S ] Used as for negation. Any character except S. For example [^verb] means except verb match with anything else Used as escape meta character
  • 68.
     1. BEGIN: - It indicates the start state. The lexical analyzer starts at state 0  2. ECHO :- It emits the input as it is  3. yytext :- when lexer matches or recognizes the token from input then the lexeme is stored in null terminated string called yytext.  as soon as new token is found the content of yytext is replaced by new token.  4. yylex() :- as soon as call to yylex() is encountered scanner starts scanning the source program.  5. yywrap() :- The function yywrap() is called when scanner encounters end of file.The yywrap returns 0 then scanner continuous scanning. When yywrap() returns 1 that means end of file is LEX Actions
  • 69.
     6. yyin:- It is standard input file that stores input source program.  7. yyleng :- when lexer recognizes token then the lexeme is stored in NULL terminated string called yytext and yyleng stores the length of string ,so we can say that this function is same as strlen()  8. HOW TO WRITE main() in LEX int main() { yylex(); } LEX Actions
 9. Where to write C code?
 Valid C code can be written between %{ and %}.
 Any C function can be written in the subroutines section.
 C code also appears as the action part for each regular expression.
 10. THE RECOGNIZER WORKS IN THE FOLLOWING WAY:
 i. If more than one pattern matches, the recognizer chooses the longest lexeme matched.
 ii. If two or more patterns match the same longest lexeme, the first listed matching pattern is chosen. LEX Actions
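The two disambiguation rules above can be sketched with a toy recognizer in Python (the rule names and patterns are invented for illustration; this is not LEX output):

```python
import re

# Ordered list of (token_name, pattern) -- listing order matters for rule ii.
RULES = [
    ("IF",     r"if"),          # keyword rule listed before the identifier rule
    ("IDENT",  r"[a-z]+"),
    ("NUMBER", r"[0-9]+"),
]

def next_token(text, pos=0):
    """Pick the longest match; on a tie, the first listed rule wins."""
    best = None                 # (length, -rule_index, name, lexeme)
    for index, (name, pattern) in enumerate(RULES):
        m = re.match(pattern, text[pos:])
        if m:
            cand = (len(m.group()), -index, name, m.group())
            if best is None or cand[:2] > best[:2]:
                best = cand
    if best is None:
        raise ValueError("no rule matches at position %d" % pos)
    return best[2], best[3]

# Rule i: "ifx" matches IF (length 2) and IDENT (length 3); the longer wins.
assert next_token("ifx = 5") == ("IDENT", "ifx")
# Rule ii: "if" matches IF and IDENT with equal length; the first listed wins.
assert next_token("if x") == ("IF", "if")
```

Keyword rules are therefore listed before the identifier rule in real LEX specifications, so that rule ii resolves the tie in the keyword's favour.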
 Data structures are classified on the basis of the following criteria:
 1. Nature of the data structure: linear or non-linear.
 2. Purpose of the data structure: search or allocation data structure.
 3. Lifetime of the data structure: used during language processing or during target program execution.
 A linear data structure is a linear arrangement of elements in memory. It requires a contiguous area of memory for its elements, which can lead to wastage of memory.
 The elements of a non-linear data structure are accessed using pointers and need not occupy a contiguous area of memory, so no memory is wasted; however, search efficiency is lower. SYMBOL TABLE: DATA STRUCTURES FOR LANGUAGE PROCESSING
 Search data structures are used during language processing to maintain attribute information concerning the different entities in the source program.
 They are characterized by the fact that an entry for an entity is created only once but may be searched for a large number of times; the important point is search efficiency.
 An allocation data structure is characterized by the address of the memory area allocated to an entity being known to the user of that entity.
 No search operations are performed on it; the important points are allocation and de-allocation speed and efficiency of memory utilization. SYMBOL TABLE: DATA STRUCTURES FOR LANGUAGE PROCESSING
 A search data structure is a set of entries, each entry accommodating the information concerning one entity. Each entry contains a key field, which is used for searching.
 ENTRY FORMATS
 A set of fields is used for each entry in a search structure. An entry consists of two parts:
 1. Fixed part
 2. Variant part
 A compiler's symbol table has the following:
 a. Fixed part: the fields symbol and class
 b. Variant part: fields that depend on the class of the entity SEARCH DATA STRUCTURES
SEARCH DATA STRUCTURES
 Variant part fields by tag value:
 1. Variable: type, length, dimension information
 2. Label: statement number
 3. Procedure name: address of parameter list, number of parameters, type of return value, length of returned value
 Entry formats:
 a. Fixed length
 b. Variable length
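The fixed/variant split above can be sketched in Python. The field names are illustrative, taken from the table; a real compiler would use packed records:

```python
# Sketch: a symbol table entry with a fixed part (symbol, class) and a
# variant part whose fields depend on the class (tag value).
ALLOWED_VARIANT_FIELDS = {
    "variable":  {"type", "length", "dimensions"},
    "label":     {"statement_number"},
    "procedure": {"param_list_addr", "num_params",
                  "return_type", "return_length"},
}

def make_entry(symbol, cls, **variant):
    """Build one entry; reject variant fields invalid for this class."""
    assert set(variant) <= ALLOWED_VARIANT_FIELDS[cls], \
        "field not valid for class %r" % cls
    return {"symbol": symbol, "class": cls, **variant}

v = make_entry("total", "variable", type="int", length=4)
l = make_entry("loop1", "label", statement_number=10)
assert v["class"] == "variable" and v["length"] == 4
assert l["statement_number"] == 10
```

The key field used for searching is `symbol`; the variant part carries only the attributes meaningful for that class of entity.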
HOW TO STORE NAMES IN THE SYMBOL TABLE
• There are two types of name representation.
• 1. Fixed-length names
• A fixed amount of space for each name is allocated in the symbol table. In this type of storage, if a name is short, part of the space is wasted.
• A name can be referred to by a pointer to its symbol table entry.
• The benefit of this linear organization is that it enables the use of efficient search procedures.
CONT…
• 2. Variable-length records
• Only the amount of space actually required by the string is used to store each name.
• Names are stored with the help of the starting index and length of each name.
• There is no memory wastage in this organization.
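A minimal sketch of variable-length name storage: all names live back to back in one string pool, and each entry records only a (start, length) pair. The class name `NamePool` is an invented illustration:

```python
# Sketch: variable-length name storage via (start, length) into one pool.
class NamePool:
    def __init__(self):
        self.pool = ""            # all names stored back to back, no padding
        self.entries = []         # one (start, length) pair per name

    def add(self, name):
        start = len(self.pool)
        self.pool += name
        self.entries.append((start, len(name)))
        return len(self.entries) - 1   # index serves as pointer to the entry

    def get(self, index):
        start, length = self.entries[index]
        return self.pool[start:start + length]

p = NamePool()
i = p.add("counter")
j = p.add("x")
assert p.get(i) == "counter" and p.get(j) == "x"
assert p.pool == "counterx"       # no wasted space between names
```

Compare with the fixed-length scheme: there, a one-character name like `x` would still occupy a full fixed-width slot.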
 The hybrid entry format is used to combine the access efficiency of the fixed entry format with the memory efficiency of the variable entry format.
 In this format each entry is split into two halves: a fixed part and a variant part.
 A pointer field in the fixed part, together with a length field, points to the variable part of the entry. HYBRID ENTRY FORMAT
 Layout: [ fixed part | pointer, length ] → [ variable part ]
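A sketch of the hybrid layout in Python (the variable `variant_area` and both function names are illustrative): the fixed part keeps a (pointer, length) pair locating the variant part in a separate area.

```python
# Sketch: hybrid entry format -- fixed parts are uniform in size, while
# variant parts of differing sizes are packed into a shared area.
variant_area = []                 # variant parts stored one after another

def add_hybrid_entry(symbol, cls, variant_fields):
    pointer = len(variant_area)
    variant_area.extend(variant_fields)
    # Fixed part: symbol, class, plus pointer to and length of variant part.
    return {"symbol": symbol, "class": cls,
            "ptr": pointer, "len": len(variant_fields)}

def variant_part(entry):
    return variant_area[entry["ptr"]:entry["ptr"] + entry["len"]]

e1 = add_hybrid_entry("total", "variable", ["int", 4, "no dims"])
e2 = add_hybrid_entry("loop1", "label", [10])
assert variant_part(e1) == ["int", 4, "no dims"]
assert variant_part(e2) == [10]
```

Because every fixed part is the same size, the table of fixed parts can still be searched efficiently, while the variant area wastes no space on short entries.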
 1. Add: add an entry to the symbol table.
 2. Search: search for and locate the entry of a symbol.
 3. Delete: delete the entry of a symbol.
 TABLE ORGANIZATION
 A table is a linear data structure; the entries of a table occupy adjoining areas of memory.
 Fixed-length entries are used in linear data structures. OPERATIONS ON SEARCH DATA STRUCTURES
 Layout: entries #1 … #f are occupied; entries #f+1 … #n are free.
 Symbols used: n = number of entries in the table, f = number of occupied entries.
 Operations:
 1. Add a symbol: the symbol is added to the first free entry in the table, and the value of f is updated accordingly.
 2. Delete a symbol: deletion can be done in two ways:
 a. Physical deletion: the entry is deleted by erasing or overwriting it.
 b. Logical deletion: information is added to the entry to mark it as deleted. TABLE ORGANIZATION
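The add and delete operations above can be sketched as follows (a Python model of the table; entry contents and the shift-based physical delete are illustrative):

```python
# Sketch: a table of n fixed-length entries, of which the first f are occupied.
n = 8
table = [None] * n                # None marks a free entry
f = 0                             # number of occupied entries

def add_symbol(data):
    """Add to the first free entry (index f) and update f."""
    global f
    assert f < n, "table full"
    table[f] = {"data": data, "deleted": False}
    f += 1

def delete_logical(i):
    table[i]["deleted"] = True    # entry kept, merely marked as deleted

def delete_physical(i):
    """Overwrite the entry by shifting the later entries up."""
    global f
    table[i:f] = table[i + 1:f] + [None]
    f -= 1

add_symbol("alpha"); add_symbol("beta"); add_symbol("gamma")
delete_logical(1)
assert table[1]["data"] == "beta" and table[1]["deleted"]  # still present
delete_physical(1)
assert f == 2 and table[1]["data"] == "gamma"              # physically gone
```

Logical deletion is cheap but leaves dead entries that searches must skip; physical deletion reclaims the slot at the cost of moving entries.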
     1. Stack PROPERTIES  1. stack is unbounded array that is treated in last in first out(LIFO) manner. The last element stored is first one removed.  2. Only the last entry is accessible at any time.  a. A stack pointer(SP) indicates the position or frame at the top of the stack.  b. Stack base(SB) – It points to the first word of the stack area.  c. Top of Stack(TOS) – It points to last entry allocated in the stack.  When entry is pushed on the stack, TOS is incremented by 1. when an entry is popped , it is decremented by 1. ALLOCATION DATA STRUCTURE
 Apart from SB and TOS, a record base pointer (RB) and reserved pointers are used in the extended stack model.
 The record base pointer points to the first word of the last record in the stack.
 The reserved pointer is the first word of each record; it saves the previous value of RB so that RB can be restored when the record is popped. EXTENDED STACK MODEL
 A heap is a non-linear data structure. The heap allows allocation and de-allocation of entries in a random order.
 There is no implicit order of access to an allocated memory area, so pointers are used to keep track of allocated and de-allocated entries. HEAP
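Random-order allocation and de-allocation can be sketched with a free list (a simplified Python model; a real heap allocator would also merge adjacent free blocks):

```python
# Sketch: heap allocation in random order via a first-fit free list.
heap_size = 16
free_list = [(0, heap_size)]          # (start, size) of each free block

def allocate(size):
    """First fit: take the first free block that is large enough."""
    for i, (start, block_size) in enumerate(free_list):
        if block_size >= size:
            remainder = (start + size, block_size - size)
            free_list[i:i + 1] = [remainder] if remainder[1] else []
            return start              # pointer to the allocated area
    raise MemoryError("no free block large enough")

def deallocate(start, size):
    free_list.append((start, size))   # real heaps also merge neighbours

a = allocate(4)                       # block at address 0
b = allocate(4)                       # block at address 4
deallocate(a, 4)                      # freed in a different order
c = allocate(8)                       # taken from the remaining tail block
d = allocate(4)                       # reuses the block freed earlier
assert (a, b, c, d) == (0, 4, 8, 0)
```

Unlike the stack, where only the last entry can be freed, any block here can be released at any time, which is why the allocator must track free areas with pointers.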