Introduction to Lex and Yacc
Prepared by,
Prof. Aruna M.G
Computer Science & Engineering Department
M.S.E.C
Bangalore
Lex and Yacc
• Two compiler-writing tools that are used to easily
specify:
– Lexical tokens and their order of processing (Lex)
– A context-free grammar for LALR(1) parsing (Yacc)
• Both Lex and Yacc have Long History in Computing
– Lex and Yacc – Earliest Days of Unix Minicomputers
– Flex and Bison – From GNU
– JFlex - Fast Scanner Generator for Java
– BYacc/J – Berkeley
– CUP, ANTLR, PCYACC, …
– PCLEX and PCYACC from Abacus
Overview
take a glance at Lex!
Compilation Sequence
General Compiler Infrastructure
Program source (stream of characters)
→ Scanner (tokenizer) → Tokens
→ Parser → Syntactic Structure
→ Semantic Routines → IR: Intermediate Representation (1)
→ Analysis / Transformations / Optimizations → IR: Intermediate Representation (2)
→ Code Generator → Assembly code
All phases consult the Symbol and Attribute Tables.
Flex/Lex
flex - fast lexical analyzer generator
• Flex is a tool for generating scanners.
• Flex source is a table of regular expressions
and corresponding program fragments.
• Generates lex.yy.c which defines a routine
yylex()
Lex
 Written by Eric Schmidt and Mike Lesk.
 lex is a program (generator) that generates lexical analyzers; it is widely
used on Unix.
 It is mostly used with the Yacc parser generator.
 It reads an input file (specifying the lexical analyzer) and outputs
source code implementing the lexical analyzer in the C programming
language.
 Lex reads patterns (regular expressions) and produces C code for a
lexical analyzer that scans the input for matching tokens.
What is Lex?
• The main job of a lexical analyzer (scanner) is
to break up an input stream into more usable
elements (tokens)
a = b + c * d;
ID ASSIGN ID PLUS ID MULT ID SEMI
Lex – Lexical Analyzer
• Lex takes a set of descriptions of possible tokens
and produces a C routine called a lexical analyzer,
scanner, or lexer.
• Lexical analyzers tokenize input streams
• Tokens are the terminals of a language
– English
• words, punctuation marks, …
– Programming language
• Identifiers, operators, keywords, …
• Regular expressions define terminals/tokens
LEXER
• Lexical analysis is the process of converting a
sequence of characters into a sequence of tokens.
• A program or function which performs lexical
analysis is called a lexical analyzer, lexer or scanner.
• A lexer often exists as a single function which is
called by a parser or another function
Token
• A token is a string of characters, categorized
according to the rules as a symbol (e.g.
IDENTIFIER, NUMBER, COMMA, etc.).
Tokens
• Tokens in Lex are declared like variable names
in C. Every token has an associated expression.
Token    Associated expression                 Meaning
number   ([0-9])+                              1 or more occurrences of a digit
chars    [A-Za-z]                              Any single letter
blank    " "                                   A blank space
word     (chars)+                              1 or more occurrences of chars
variable (chars)+(number)*(chars)*(number)*    A letter sequence optionally followed by digits and more letters
Consider this expression in the C programming language:
sum=3+2;
Tokenized in the following table:
lexeme token type
sum Identifier
= Assignment operator
3 Number
+ Addition operator
2 Number
; End of statement
Lex – A Lexical Analyzer Generator
• A Unix Utility from early 1970s
• A Compiler that Takes as Source a Specification for:
– Tokens/Patterns of a Language
– Generates a “C” Lexical Analyzer Program
• Pictorially: creating a lexical analyzer with Lex
Lex source program (lex.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → sequence of tokens
Step for executing lex program
• First, a specification of a lexical analyzer is prepared
by creating a program (filename.l) in the Lex
language.
• filename.l is run through the Lex compiler to produce a C
program lex.yy.c.
• The program lex.yy.c consists of a tabular
representation of a transition diagram constructed
from the regular expressions of filename.l, together with
a standard routine that uses the table to recognize
lexemes.
Continued..
• The lexical analysis phase reads the characters in the
source program and groups them into a stream of
tokens in which each token represents a logically
cohesive sequence of characters, such as an identifier, a
keyword (if, while, etc.), a punctuation character, or a
multi-character operator like :=.
• The character sequence forming a token is called the
lexeme for the token.
• The actions associated with the regular expressions in
filename.l are pieces of C code and are carried over directly to
lex.yy.c.
• Finally, lex.yy.c is run through the C compiler to
produce an object program a.out.
LEX SPECIFICATION
• The set of descriptions we give to lex is
called a lex specification.
Lex Source
• Lex source is separated into three sections by %%
delimiters
• The general format of Lex source is:
{definitions}
%%
{transition rules}
%%
{user subroutines}
• The absolute minimum Lex program is thus:
%%
Format of the Input File
• The flex /lex input file consists of three
sections, separated by a line with just %% in it:
definitions
%%
rules
%%
user code
Note
• where the definitions and the user subroutines are
often omitted.
• The second %% is optional, but the first is required
to mark the beginning of the rules.
• The absolute minimum Lex program is thus
%%
(no definitions, no rules) which translates into a
program which copies the input to the output
unchanged.
Format of a Lexical Specification – 3 Parts
• Declarations:
– The literal block contains defs, constants, types, #includes,
etc. that can occur in a C program.
– Regular definitions (expressions), internal table
declarations, start conditions, and translations.
• Translation Rules:
– Pairs of (Regular Expression, Action)
– Informs Lexical Analyzer of Action when Pattern is
Recognized
• Auxiliary Procedures:
– Designer Defined C Code
– Can Replace System Calls
Lex .l File Format:
DECLARATIONS
%%
TRANSLATION RULES
%%
AUXILIARY PROCEDURES
Skeleton of a lex specification (.l file)
x.l
%{
< C global variables, prototypes,
comments >
%}
[DEFINITION SECTION]
%%
[RULES SECTION]
%%
< C auxiliary subroutines>
lex.yy.c is generated after running
> lex x.l
Literal block: this part will be embedded into lex.yy.c.
Definition section: substitutions, internal tables, character
translation code and start states; will be copied into lex.yy.c.
Rules section: defines how to scan and what action to
take for each token.
User subroutines: any user code, for example a main
function to call the scanning function yylex().
Literal Block
• Any initial C code we want copied into the final
program should be written in the definitions section.
• Lex copies the contents between "%{" and
"%}" directly to the generated C file.
• In the definitions and rules sections, any
indented text or text enclosed in %{ and %} is
copied verbatim to the generated C source file
(i.e. the output) near the beginning, before the
beginning of yylex() (with the %{ %}'s
removed).
Example
%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
/*
* This simple demo: comments in side def section
* Example
*/
%}
Definitions Section(substitutions)
• Definitions intended for Lex are given before the first
%% delimiter. Any line in this section not contained
between %{ and %}, and beginning in column 1, is
assumed to define Lex substitution strings.
• The definitions section contains declarations of simple
name definitions to simplify the scanner specification.
• Name definitions have the form:
name definition or NAME expression
• Example:
DIGIT [0-9]
ID [a-z][a-z0-9]*
Continued..
• The format of such lines is "name translation";
it causes the string given as the translation to be
associated with the name.
• The name and translation must be separated by
at least one blank or tab, and the name must
begin with a letter.
• The name can contain letters, digits, and
underscores, and must not start with a digit.
• The translation can then be called out by the
{name} syntax in a rule.
Example
Using {D} for the digits and {E} for an
exponent field, for example, might abbreviate
rules to recognize numbers:
D [0-9]
E [DEde][-+]?{D}+
%%
{D}+ printf("integer");
{D}+"."{D}*({E})? |
{D}*"."{D}+({E})? |
{D}+{E} printf("real");
Internal Tables (%N Declarations)
• Lex uses internal tables of a fixed size which may not be
big enough for large scanners; these declarations allow the
programmer to increase the size of the tables explicitly.
• Increase the size of the tables with "%a", "%e",
"%k", "%n", "%o", and "%p" lines in the definition
section.
• The old lex accepts "%r" to make lex generate a lexer in
Ratfor and "%c" for a lexer in C.
• Ex: %p 6000
%e 3000
Run lex with the -v flag to see the current table statistics.
Character Translations
• A lexer uses the native character code that the C
compiler uses.
• Ex: the code for the character "A" is the C value 'A'.
• It is sometimes convenient to use some other character code,
either because the input stream uses a different code
(EBCDIC or Baudot), or because lex looks for patterns in an
input stream not consisting of text at all.
• Lex character translations allow us to define an explicit
mapping between bytes that are read by input() and
characters used in lex patterns.
Syntax: %T
• Ex:
%T
1 aA
2 bB
3 cC
%T
An input byte with value 1 will match anywhere there is
an "A" or "a" in a pattern, and so on.
Note: if a translation is used, every literal character
used in the lex program must appear on the RHS of a
translation line.
BEGIN
• The BEGIN macro switches among start states.
• It is invoked, usually in the action code for a pattern, as:
Syntax: BEGIN statename;
The scanner starts in state 0 (zero), also known as
INITIAL.
All other states must be named in %s or %x lines in the
definition section.
Although BEGIN is a macro, it doesn't take any arguments
itself, and the statename need not be enclosed in
parentheses.
Start States
• Start states, also called start conditions or start rules,
are declared in the definition section.
• They are used to apply a set of rules only at certain times,
which makes it possible to limit the scope of certain rules, or
to change the way the lexer treats part of the file.
Syntax: %s CMNT or %x CMNT
• In the rules section, a start state is added in angle
brackets < > (ex: <CMNT>).
• Rules that do not have start states can apply in
any state.
• The standard/default state in which lex starts is
state zero, also known as INITIAL.
Example:
%x CMNT /* create a new start state in the lexer */
%%
"/*" BEGIN CMNT; /* switch to comment mode */
<CMNT>. | /* these rules are recognized only when the
lexer is in state CMNT */
<CMNT>\n ; /* throw away comment text */
<CMNT>"*/" BEGIN INITIAL; /* once it matches the end
pattern it returns to the regular state */
%%
Difference B/W Regular And Exclusive Start
States
• A rule with no start state is not matched when an exclusive state
is active.
Example:
%s NORMAL CMNT /* NORMAL is a regular start state;
this is the current start state */
%%
%{
BEGIN NORMAL; /* start in NORMAL state */
%}
<NORMAL>"/*" BEGIN CMNT; /* switch to comment mode */
<CMNT>. |
<CMNT>\n ; /* throw away comment text */
<CMNT>"*/" BEGIN NORMAL; /* return to the regular state */
%%
Rules Section
• Each rule is made of two parts: pattern and
action, separated by whitespace.
• The rules section of the lex input contains a
series of rules of the form:
PATTERN ACTION
• Example:
{ID} printf( "An identifier: %s\n", yytext );
• The yytext and yyleng variables.
Example
[ \t]+ /* ignore whitespace */ ;
If the action is empty, the matched token is discarded.
The pattern [ \t]+ matches 1 or more copies of the
subpattern (tab or space); the lone semicolon is a
do-nothing C statement, so its effect is to ignore the input.
ACTION
• If the action contains a '{', the action spans until the
balancing '}' is found, as in C.
• An action consisting only of a vertical bar ('|') means
"same as the action for the next rule."
• The return statement works as in C.
• In case no rule matches, the default rule applies: simply
copy the input to the standard output.
Example1 : Single statements in Action Part
%%
{letter}({letter}|{digit})* printf("id: %s\n", yytext);
\n printf("new line\n");
%%
• If a single statement is present in the action part, then there is
no need for { } braces.
Otherwise,
if the action has more than one statement, or is more than one
line long, write it within { } braces.
Note: lex takes everything after the pattern as the action, while
some other versions read only the first statement on the line and
ignore anything else.
Example2
%%
.|\n ECHO; /* prints the matched pattern on the output,
copying any punctuation or other characters. */
%%
Example3
%%
colour printf("color");
mechanise printf("mechanize");
petrol printf("gas");
%%
Multiple statements in Action Part
%%
" " ;
[a-z] { c = yytext[0]; yylval = c - 'a'; return(LETTER);
}
[0-9]* { yylval = atoi(yytext); return(NUMBER); }
[^a-z0-9\b] { c = yytext[0]; return(c); }
%%
User Code Section
• It can consist of any legal C code.
• Lex copies it to the C file after the end of
the generated code.
• The user code section is simply copied to
lex.yy.c verbatim.
• The presence of this section is optional; if it is
missing, the second %% in the input file may
be skipped.
Example
main()
{
yylex(); /* runs the generated scanner
over the entire input */
}
Comment Statements
• Outside "%{" and "%}", comments must be indented
with whitespace for lex to recognize them correctly.
Example:
%%
[ \t]+ /* ignore whitespace */ ;
%%
main()
{
/* user code */
yylex(); /* runs the lexer */
}
Disambiguating rules
• Lex has a set of disambiguating rules. The two
that make lexer work are:
1. Lex patterns only match a given i/p character or
string once.
2. Lex executes the action for the longest possible
string match for the current i/p.
Ex: "island" is matched only once (verb/not-verb program).
Ex: "well-being" is matched as a single word (word-count
program).
Precedence Problem
• For example: a “<“ can be matched by
“<“ and “<=“.
• The one matching most text has higher
precedence.
• If two or more have the same length, the rule
listed first in the lex input has higher
precedence.
Lex program structure
… definitions …
%%
… rules …
%%
… subroutines …
%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}
%%
" " ;
[a-z] { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]* { yylval = atoi(yytext); return(NUMBER); }
[^a-z0-9\b] { c = yytext[0]; return(c); }
Lex Source Program example1
• Lex source is a table of
– regular expressions and
– corresponding program fragments
digit [0-9]
letter [a-zA-Z]
%%
{letter}({letter}|{digit})* printf("id: %s\n", yytext);
\n printf("new line\n");
%%
main()
{
yylex();
}
A Simple Example2
%{
int num_lines = 0, num_chars = 0;
%}
%%
\n ++num_lines; ++num_chars;
. ++num_chars;
%%
main()
{
yylex();
printf( "# of lines = %d, # of chars = %d\n",
num_lines, num_chars );
}
Lex Source to C Program
• The table is translated to a C program (lex.yy.c)
which
– reads an input stream
– partitioning the input into strings which match the
given expressions and
– copying it to an output stream if necessary
Step to create & run/execute the lex
program
1. Create a file: vi filename.l
2. Type the source code in vi/gedit and save it by
pressing Esc, then :wq, then Enter.
3. Compile the lex file (filename.l) to generate the C routine
(lex.yy.c), i.e. lex filename.l
4. Compile with cc lex.yy.c to generate the output file ./a.out
(object file).
5. % cc lex.yy.c -o <execfilename> -ll
Ex: cc lex.yy.c -o first -ll (lex library).
1. lex translates the lex specification into a C
source file called lex.yy.c; this file is then compiled
and linked with the lex library -ll.
2. ./a.out
3. Type the input data, press the Enter key, then press Ctrl+D.
Continued..
The execution is as follows:
1. If -o <execfilename> is not given, run the
program using ./a.out
2. If -o <execfilename> is given, run the program
using the execfilename.
3. Enter the data after execution.
4. Use ^D to terminate the program and print the
result.
Regular Expression
Lex Regular Expressions (Extended Regular
Expressions)
• A regular expression matches a set of strings. It contains
text characters and operator characters.
• A regular expression is a pattern description using a “meta”
language.
• Regular expression
– Operators
– Character classes
– Arbitrary character
– Optional expressions
– Alternation and grouping
– Context sensitivity
– Repetitions and definitions
Operators
" " : quotation marks
\ : backslash (escape character)
[ ] : square brackets (character class)
^ : caret (negation)
- : minus
? : question mark
. : period (dot)
* : asterisk (star)
+ : plus
| : vertical bar (pipe)
( ) : parentheses
$ : dollar
/ : forward slash
{ } : curly braces
% : percent
< > : angle brackets
Metacharacter Matches
. Matches any character except newline
\ Used to escape metacharacters
* Matches zero or more copies of the preceding expression
+ Matches one or more copies of the preceding expression
? Matches zero or one copy of the preceding expression
^ Matches beginning of line as first character / complement (negation) inside [ ]
$ Matches end of line as last character
| Matches either the preceding or the following expression
( ) Groups a series of REs
[ ] Matches any one character from the class
{ } Indicates how many times the previous pattern is allowed to match, or expands a definition
" " Takes everything within it literally
/ Trailing context: matches the preceding RE only if followed by the following RE; only one slash is permitted
- Used to denote a range. Example: A-Z implies all characters from A to Z
Pattern Matching Primitives
Quotation Mark Operator “ “
• The quotation mark operator "…" indicates that
whatever is contained between a pair of quotes is to be
taken as text characters.
Ex: xyz"++" or "xyz++"
• If operator characters are to be used as text characters, an escape \
should be used.
Ex: xyz\+\+ = "xyz++"
\$ = "$"
\\ = "\\"
• Every character except blank, tab, newline,
and the operator characters listed above is always a text character.
• Any blank character not contained within [ ] must be
quoted.
Character Classes []
• Classes of characters can be specified using the operator
pair [ ].
• A class matches any single character which is in the [ ].
• Every operator's meaning is ignored inside the brackets except \, -, and ^.
• The - character indicates ranges.
• If it is desired to include the character - in a character
class, it should be first or last.
• If the first character is a circumflex ("^") it changes the
meaning to match any character except the ones within
the brackets.
• C escape sequences starting with "\" are recognized.
examples
[ab] => a or b
[a-z] => a or b or c or … or z
[-+0-9] => all the digits and the two signs
[^a-zA-Z] => any character which is not a letter
[a-z0-9<>_] => all the lower case letters, the digits,
the angle brackets, and underscore
joke[rs] => matches either jokes or joker
Arbitrary Character .
• To match almost any character, the operator
character . is the class of all characters
except newline
• [\40-\176] matches all printable
characters in the ASCII character set, from
octal 40 (blank) to octal 176 (tilde ~)
Optional & Repeated Expressions
• The operator ? indicates an optional element of an
expression. Thus ab?c matches either ac or abc.
• Repetitions of classes are indicated by the operators * and +.
• a? => zero or one instance of a
• a* => zero or more instances of a
• a+ => one or more instances of a
• E.g.
ab?c => ac or abc
[a-z]+ => all strings of lower case letters
[a-zA-Z][a-zA-Z0-9]* => all alphanumeric strings
with a leading alphabetic character
Examples
• an integer: 12345
[1-9][0-9]*
• a word: cat
[a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
[-+]?[1-9][0-9]*
• a floating point number: 1.2345
[0-9]*”.”[0-9]+
Examples
• a delimiter for an English sentence
"." | "?" | "!" or ["."?!]
• C++ comment: // call foo() here!!
"//".*
• white space
[ \t]+
• English sentence: Look at this!
([ \t]+|[a-zA-Z]+)+("."|"?"|"!")
Alternation and Grouping
The operator | indicates alternation:
(ab|cd)
matches either ab or cd. Note that parentheses are used for
grouping, although they are not necessary on the outside
level;
ab|cd
would have sufficed. Parentheses can be used for more
complex expressions:
(ab|cd+)?(ef)*
matches such strings as abefef, efefef, cdef, or cddd; but not
abc, abcd, or abcdef.
Context Sensitivity
• Lex will recognize a small amount of surrounding
context. The two simplest operators for this are ^
and $.
• If the first character of an expression is ^, the
expression will only be matched at the beginning of
a line (after a newline character, or at the beginning
of the input stream). This can never conflict with
the other meaning of ^, complementation of
character classes, since that only applies within the
[] operators.
Continued..
• If the very last character is $, the expression will
only be matched at the end of a line (when
immediately followed by a newline).
• The latter operator is a special case of the /
operator character, which indicates trailing context.
• The expression ab/cd matches the string ab, but
only if followed by cd.
• Thus ab$ is the same as ab/\n
Continued..
Left context is handled in Lex by start conditions. If a
rule is only to be executed when the Lex
automaton interpreter is in start condition x, the
rule should be prefixed by
<x>
using the angle bracket operator characters. If we
considered “being at the beginning of a line'' to be
start condition ONE, then the ^ operator would be
equivalent to
<ONE>
Start conditions were explained more fully earlier.
Repetitions and Definitions
The operators {} specify either repetitions (if they enclose
numbers) or definition expansion (if they enclose a
name).
For example
{digit}
looks for a predefined string named digit and inserts it at
that point in the expression. The definitions are given in
the first part of the Lex input, before the rules. In
contrast,
a{1,5}
looks for 1 to 5 occurrences of a.
PLLab, NTHU,Cs2403 Programming Languages 71
Pattern Matching Primitives
Metacharacter Example
. a.b, we.78
\n [\t\n]
* [\t\n]*, a*, a.*z
+ [\t\n]+, a+, a.+r
? -?[0-9]+, ab?c
^ [^\t\n], ^AD, ^(.*)\n
$ end of line as last character
a|b a or b
(ab)+ one or more copies of ab (grouping)
[ab] a or b
a{3} exactly 3 instances of a
"a+b" literal "a+b" (C escapes still work)
A{1,2}shis+ Matches Ashis, AAshis, Ashiss, AAshiss, etc.
(A[b-e])+ Matches one or more occurrences of A followed by any character from b to e.
Finally, initial % is special, being the separator
for Lex source segments
• [a-z]+ printf("%s", yytext);
will print the string in yytext. The C function printf
accepts a format argument and data to be printed; in
this case, the format is "print string" (% indicating
data conversion, and %s indicating string type), and
the data are the characters in yytext. So this just
places the matched string on the output. This action is
so common that it may be written as ECHO:
• [a-z]+ ECHO;
Lex – Pattern Matching Examples
Regular Expression (1/3)
x match the character 'x'
. any character (byte) except newline
[xyz] a "character class"; in this case, the pattern matches either
an 'x', a 'y', or a 'z'
[abj-oZ] a "character class" with a range in it; matches an 'a', a 'b',
any letter from 'j' through 'o', or a 'Z'
[^A-Z] a "negated character class", i.e., any character but those in
the class. In this case, any character EXCEPT an uppercase
letter.
[^A-Z\n] any character EXCEPT an uppercase letter or a newline
Regular Expression (2/3)
r* zero or more r's, where r is any regular expression
r+ one or more r's
r? zero or one r's (that is, "an optional r")
r{2,5} anywhere from two to five r's
r{2,} two or more r's
r{4} exactly 4 r's
{name} the expansion of the "name" definition (see above)
"[xyz]\"foo" the literal string: [xyz]"foo
\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \X.
Otherwise, a literal 'X' (used to escape operators such as '*')
Regular Expression (3/3)
\0 a NUL character (ASCII code 0)
\123 the character with octal value 123
\x2a the character with hexadecimal value 2a
(r) match an r; parentheses are used to override precedence (see below)
rs the regular expression r followed by the regular expression s; called
"concatenation"
r|s either an r or an s
^r an r, but only at the beginning of a line (i.e., when just starting to scan,
or right after a newline has been scanned).
r$ an r, but only at the end of a line (i.e., just before a newline).
Equivalent to "r/\n".
Precedence of Operators
• Level of precedence
– Kleene closure (*), ?, +
– concatenation
– alternation (|)
• All operators are left associative.
• Ex: a*b|cd* = ((a*)b)|(c(d*))
Lex Predefined Variables
yytext
• Whenever the scanner matches a token, the text of the
token is stored in the null-terminated string yytext.
• The contents of yytext are replaced each time a new
token is matched.
extern char yytext[]; // array
extern char *yytext; // pointer
To increase the size of the buffer (AT&T and MKS lex format):
%{
#undef YYLMAX /* remove default definition */
#define YYLMAX 500 /* new size */
%}
Continued..
• If yytext is an array, any token which is longer
than yytext will overflow the end of the array
and cause the lexer to fail.
• yytext[] defaults to 200 or 100 characters in
different lex tools.
• Flex has a default I/O buffer of 16K, which can
handle tokens up to 8K.
yyleng
The length of the token is stored in it. It is similar
to strlen(yytext).
• Example :
[a-z]+ printf("%s", yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
A Lex action may decide that a rule has not recognized the
correct span of characters. Two routines are provided to aid
with this situation.
First, yymore () can be called to indicate that the next
input expression recognized is to be tacked on to the end
of this input. Normally, the next input string would
overwrite the current entry in yytext.
Second, yyless (n) may be called to indicate that not all the
characters matched by the currently successful expression
are wanted right now. The argument n indicates the
number of characters in yytext to be retained.
Further characters previously matched are
returned to the input.
Example
\"[^"]* {
if (yytext[yyleng-1] == '\\')
yymore();
else
... normal user processing
}
which will, when faced with a string such as "abc\"def",
first match the five characters "abc\ ; then the call to
yymore() will cause the next part of the string, "def, to be
tacked on the end.
Note that the final quote terminating the string should be
picked up in the code labeled "normal processing".
I/O routines
Lex also permits access to the I/O routines it uses.
They are:
1) input() which returns the next input character;
2) output(c) which writes the character c on the output; and
3) unput(c) pushes the character c back onto the input stream to
be read later by input().
By default these routines are provided as macro definitions, but
the user can override them and supply private versions. These
routines define the relationship between external files and
internal characters, and must all be retained or modified
consistently.
Ambiguous Source Rules
Lex can handle ambiguous specifications. When more than one
expression can match the current input, Lex chooses as follows:
1) The longest match is preferred.
2) Among rules which matched the same number of characters,
the rule given first is preferred. Thus, suppose the rules
integer keyword action ...;
[a-z]+ identifier action ...;
to be given in that order. If the input is integers, it is taken as an
identifier, because [a-z]+ matches 8 characters while integer
matches only 7. If the input is integer, both rules match 7
characters, and the keyword rule is selected because it was given
first. Anything shorter (e.g. int) will not match the expression
integer and so the identifier interpretation is used.
yywrap()
• yywrap is a built-in macro. When a lexer encounters
the end of file, it calls the routine yywrap() to find
out what to do next.
• If yywrap() returns 0, the lexer (scanner) continues the
scanning process. It indicates that there is more input
and hence the lexer has to continue working.
• It first needs to adjust yyin to point to a new file, by
using fopen().
• If yywrap() returns 1, the lexer (scanner) halts the
scanning process (i.e. the scanner returns a 0 token to
report the EOF).
The user can define his or her own version. To do so, put this at the beginning of
the rules section.
Format:
%{
#undef yywrap
%}
Note: disable the macro by undefining it before
defining your own.
yylex()
• The scanner created by lex has the entry point
yylex().
• You call yylex() to start or resume scanning.
• If a lex action does a return to pass a value to the
calling program, the next call to yylex() will
continue from the point where it left off.
• All code in the rules section is copied into yylex().
• Lines of code immediately after the “%%” line are
placed near the beginning of the scanner, before
the first executable statement.
Example
%{
int counter = 0;
%}
letter [a-zA-Z]
%%
{letter}+ {printf("a word\n"); counter++;}
%%
main()
{
yylex();
printf("There are total %d words\n", counter);
}
REJECT
• The action REJECT means ``go do the next alternative.'' It
causes whatever rule was second choice after the current
rule to be executed. The position of the input pointer is
adjusted accordingly.
• Lex separates the input into non-overlapping tokens.
• If tokens overlap and we still need all
occurrences of each token, the special action REJECT is
used.
When to use REJECT?
In general, REJECT is useful whenever the purpose of
Lex is not to partition the input stream but to detect all
examples of some items in the input, and the instances
of these items may overlap or include each other.
Some Lex rules to do this might be
she s++;
he h++;
\n |
. ;
where the last two rules ignore everything besides he and
she. Remember that . does not include newline. Since
she includes he, Lex will normally not recognize the
instances of he included in she, since once it has passed
a she those characters are gone.
Examples
she {s++; REJECT;}
he {h++; REJECT;}
\n |
. ;
a[bc]+ { ... ; REJECT;}
a[cd]+ { ... ; REJECT;}
If the input is ab, only the first
rule matches, and on ad only the
second matches.
The input string accb matches
the first rule for four characters
and then the second rule for
three characters.
In contrast, the input accd agrees
with the second rule for four
characters and then the first rule
for three.
Example
…
%%
pink {npink++; REJECT;}
ink {nink++; REJECT;}
pin {npin++; REJECT;}
. |
\n ;
%%
…
Input data: pink
All three patterns will match. Without the REJECT statements, only pink matches.
When a REJECT action executes, it puts back the text matched by the pattern
and finds the next best match for it.
Note: REJECT is necessary here to pick up a letter group beginning at every
character, rather than at every other character.
Lex Predefined Variables
• yytext -- a string containing the lexeme
• yyleng -- the length of the lexeme
• yyin -- the input stream pointer
– the default input of default main() is stdin
• yyout -- the output stream pointer
– the default output of default main() is stdout.
• % ./a.out < inputfile > outfile
• E.g.
[a-z]+ printf("%s", yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
Lex Library Routines
• yylex()
– The default main() contains a call of yylex()
• yymore()
– appends the next matched token to the current one
• yyless(n)
– retains the first n characters in yytext
• yywrap()
– is called whenever Lex reaches an end-of-file
– The default yywrap() always returns 1
Review of Lex Predefined Variables
Name Function
char *yytext pointer to matched string
int yyleng length of matched string
FILE *yyin input stream pointer
FILE *yyout output stream pointer
int yylex(void) call to invoke lexer, returns token
char* yymore(void) append the next match to the current token
int yyless(int n) retain the first n characters in yytext
int yywrap(void) wrapup, return 1 if done, 0 if not done
ECHO write matched string
REJECT go to the next alternative rule
INITIAL initial start condition
BEGIN condition switch start condition
Revisiting Internal Variables in Lex
• char *yytext;
– Pointer to current lexeme terminated by ‘0’
• int yylen;
– Number of chacters in yytex but not ‘0’
• yylval:
– Global variable through which the token value can be
returned to Yacc
– Parser (Yacc) can access yylval, yylen, and yytext
• How are these used?
– Consider Integer Tokens:
– yylval = ascii_to_integer (yytext);
– Conversion from String to actual Integer Value
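The name ascii_to_integer() comes from the slide and is not a library routine; a minimal stand-in using the C library's strtol might look like this:

```c
#include <stdlib.h>

/* A minimal stand-in for the slide's ascii_to_integer() (the name
 * is from the slide, not a standard routine): convert the matched
 * lexeme to its integer value, as a lexer would do with yytext. */
long ascii_to_integer(const char *lexeme)
{
    return strtol(lexeme, NULL, 10);   /* base-10 conversion */
}
```

In a lexer rule this would appear as `yylval = ascii_to_integer(yytext);` before returning the token.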
Symbol Tables
• The table of words is a simple symbol table, a
common structure in lex and yacc applications.
• The use of symbol table to build a table of words as
the lexer is running, so we can add new words
without modifying and recompiling the lex program.
• Ex: A C compiler stores the variable
and structure names, labels, enumeration tags, and
all other names used in the program in its symbol
table. Each name is stored along with information
describing the name. In a C compiler the
information is the type of symbol, declaration scope,
variable type, etc.
Continued..
• add_word(), which puts a new word into the symbol
table, and
• lookup_word(), which looks up a word which should
already be entered.
• a variable state that keeps track of whether we're
looking up words, state LOOKUP, or declaring them.
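A minimal sketch of such a symbol table, assuming a simple linked list (the add_word()/lookup_word() names follow the text; a real compiler entry would also carry the type, scope, and other per-name information described above):

```c
#include <stdlib.h>
#include <string.h>

/* A minimal linked-list symbol table sketch. */
struct symtab {
    char *name;
    struct symtab *next;
};

static struct symtab *symlist = NULL;

/* lookup_word: return the entry for word, or NULL if absent */
struct symtab *lookup_word(const char *word)
{
    for (struct symtab *p = symlist; p != NULL; p = p->next)
        if (strcmp(p->name, word) == 0)
            return p;
    return NULL;
}

/* add_word: enter word if new; either way return its entry */
struct symtab *add_word(const char *word)
{
    struct symtab *p = lookup_word(word);
    if (p != NULL)
        return p;                       /* already in the table */
    p = malloc(sizeof *p);
    p->name = malloc(strlen(word) + 1);
    strcpy(p->name, word);
    p->next = symlist;                  /* push on front of list */
    symlist = p;
    return p;
}
```

Because add_word() is called at run time, new words can be entered as the lexer runs, without recompiling the lex program.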
RECOMMENDED QUESTIONS:
1. Write the specification of lex with an example. (10)
2. What are regular expressions? Explain with examples. (8)
3. Write a lex program to count the number of words, lines,
spaces, and characters. (8)
4. Write a lex program to count the number of vowels and
consonants. (8)
5. What is lexer-parser communication? Explain. (5)
6. Write a program to count the number of words by the
method of substitution. (7)
Yacc - Yet Another Compiler-
Compiler
The parser is that phase of the compiler which takes a token
string as input and with the help of existing grammar, converts
it into the corresponding Intermediate Representation. The
parser is also known as Syntax Analyzer.
Yacc
Theory:
◦ Yacc reads the grammar and generates C code for a parser.
◦ Grammars are written in Backus-Naur Form (BNF).
◦ BNF grammars express context-free languages.
◦ e.g. to parse an expression, do the reverse operation (reducing the
expression).
◦ This is known as bottom-up or shift-reduce parsing.
◦ It uses a stack for storing symbols (LIFO).
What is YACC ?
– Tool which will produce a parser for a given
grammar.
– YACC (Yet Another Compiler Compiler) is a program
designed to compile a LALR(1) grammar and to
produce the source code of the syntactic analyzer of the
language produced by this grammar
Parsing
• Once the input is divided into tokens, a program needs
to establish the relationship among those tokens.
• A C compiler needs to find the expressions,
statements, declarations, blocks, and procedures
in the program.
• This task is known as parsing.
Parser
• Yacc takes a concise description of a grammar and
produces a C routine that can parse that grammar, a
parser.
• The yacc parser automatically detects whenever a
sequence of input tokens matches one of the rules in
the grammar, and also detects a syntax error
whenever its input tokens don't match any of the
rules.
Grammar
• The list/set of rules that define the relationships that program
understands is a grammar.
Or
It is a series of rules that the parser uses to recognize
syntactically valid i/p.
For example, one grammar rule might be
date : month_name day ',' year
• Here, date, month_name, day, and year represent structures
of interest in the input process; presumably, month_name,
day, and year are defined elsewhere. The comma ``,'' is
enclosed in single quotes; this implies that the comma is to
appear literally in the input. The colon and semicolon merely
serve as punctuation in the rule, and have no significance in
controlling the input. Thus, with proper definitions, the input
• July 4, 1776
• might be matched by the above rule.
Example
Ex: CFG
expr : expr '+' term | term;
term : term '*' factor | factor;
factor : '(' expr ')' | ID | NUM;
(a + b) *c
Example
1. E -> E + E
2. E -> E * E
3. E -> id
Three productions have been specified.
Terms that appear on the left-hand side (lhs) of a
production, such as E (expression) are nonterminals.
Terms such as id (identifier) are terminals (tokens
returned by lex) and only appear on the right-hand side
(rhs) of a production.
This grammar specifies that an expression may be the
sum of two expressions, the product of two expressions,
or an identifier.
Example :x + y * z
E -> E * E (r2)
-> E * z (r3)
-> E + E * z (r1)
-> E + y * z (r3)
-> x + y * z (r3)
Symbols
• A yacc grammar is constructed from symbols.
• Symbols are strings of letter, digits, periods and
underscores that do not start with a digit.
• The symbol error is reserved for error recovery.
• There are two types of symbols:
1. Terminal symbols or tokens
2. Non-terminal symbols or non-terminals
Terminal symbols v/s Non-Terminal
symbols
• Terminal Symbols
Symbols that actually appear
in the i/p & are returned by
the lexer are called terminal
symbols or Tokens.
•They are represented in lower case
letters.
•They appear only on the RHS of the
arrow or colon.
•They cannot be further derived.
•Ex: a, b & c
• Non-Terminal Symbols
Symbols that appear in the
LHS of some rule are called
Non-terminal symbols or
non-terminals.
•They are represented in upper case
letters.
•They may appear on both sides of the
arrow or colon (LHS & RHS).
•They can be further derived.
•Ex: E, T & F
Continued.
• The symbol to the left of the rule is known
as the left-hand side rule(LHS).
• The symbol to the right of the rule is known
as the right-hand side rule(RHS).
• Terminal & non-terminal symbols must be
different; it is an error to write a rule with a
token on the Left side.
• Every grammar includes a Start symbol, the one
that has to be at the root of the parse tree.
• Ex : E is the start symbol.
Example : ( a + b ) * c.
E → E + T | T
T → T * F | F
F → ( E ) | a | b | c
E is the start symbol. The vertical bar means two possibilities for
the same symbol.
Production Rules
Recursive Rules
• Rules can refer directly or indirectly to itself
(themselves); this important ability makes it
possible to parse arbitrarily long i/p sequences.
• Applying the expression rules repeatedly.
• Ex: fred = 14 + 23 – 11 + 7
Rule : E : NUM
| E + NUM
| E – NUM ;
Note : E is called again & again.
Left and Right Recursion
• A recursive rule can put the recursive reference at
the left end or right end of the RHS of the rule.
• Ex: Exp : Exp '+' E ; /* left recursion */
• Ex: Exp : E '+' Exp ; /* right recursion */
Examples
Continued..
• Note: Any recursive rule must have at least
one non-recursive alternative (one that does not refer to
itself). Otherwise there is no way to terminate
the string that it matches, which is an error.
• Yacc handles left recursion more efficiently
than right recursion.
• This is because its internal stack keeps track of
all symbols for all partially parsed rules.
Yacc
• Input to yacc is divided into three sections.
• Every specification file consists of three sections: the declarations,
(grammar) rules, and programs. The sections are separated by
double percent ``%%'' marks. (The percent ``%'' is generally used
in Yacc specifications as an escape character.)
Format :
... definitions ...
%%
... rules ...
%%
... subroutines ...
YACC File Format
%{
C declarations
%}
yacc declarations
%%
Grammar rules
%%
Additional C code
– Comments enclosed in /* ... */ may appear in any of the
sections.
Format of a Yacc Specification – 3 Parts
• Definition Section:
–Literal block contains Defs, Constants, Types, #includes, etc. that
can Occur in a C Program.
–Regular Definitions (expressions), declarations, start condition .
–They may be %union, %start, %token, %type, %left, %right, and
%nonassoc declarations.
–All of these are optional; the section may even be completely empty.
•Translation Rules:
–Pairs of (grammar rules, Action).
–Informs the parser/yacc of the Action to take when a grammar
rule is Recognized.
•Auxiliary Procedures:
–Designer Defined C Code
–Can Replace System Calls
yacc.y File Format:
DECLARATIONS
%%
TRANSLATION RULES
%%
AUXILIARY
PROCEDURES
Definitions Section
 The definitions section consists of:
◦ token declarations .
◦ Types of values used on the parser stack, and other odds
and ends.
◦ Literal block: C code bracketed by "%{" and "%}".
◦ The declaration section may be empty.
• There may be %union, %start, %token, %type, %left,
%right, and %nonassoc declarations. (See "%union
Declaration," "Start Declaration,“ "Tokens," "%type
Declarations," and "Precedence and Operator
Declarations.")
Continued..
• It can also contain comments in the usual C
format, surrounded by "/*" and "*/".
• All of these are optional, so in a very simple
parser the definition section may be
completely empty.
• You can use single quoted characters as tokens
without declaring them, so we don't need to
declare "=", "+", or "-".
Literal Block
• Any initial C program code want to copied
into the final program should be written in
definition section.
• Yacc copies the contents between “%{“ and
“%}” directly to the generated C file.
• In the definitions and rules sections, any
indented text or text enclosed in %{ and %} is
copied verbatim to the generated C source file
(i.e. the output) near the beginning, before
yyparse() (with the %{ and %} removed).
Example
%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
/*
* This simple demo: comments in side def section
* Example
*/
%}
Tokens Declarations
• Tokens may either be symbols defined by %token
or individual characters in single quotes.
• All symbols used as tokens must be defined
explicitly in the definitions section, e.g.:
Format: %token NAME1 NAME2 ...
• Tokens can also be declared by %left, %right, or
%nonassoc declarations, each of which has
exactly the same syntax options as has %token
Token Numbers
• Within the lexer and parser, tokens are
identified by small integers.
• The token number of a literal token is the
numeric value in the local character set, usually
ASCII, and is the same as the C value of the
quoted character.
• %token NAME integervalue
Continued.. With example
– To specify token AAA BBB
• %token AAA BBB
• %token UP DOWN LEFT RIGHT
– To assign a token number to a token (needed when using
lex), a nonnegative integer followed immediately to the
first appearance of the token
• %token EOFnumber 0
• %token SEMInumber 101
• %token UP 50 DOWN 60 LEFT 17 RIGHT 25
– Non-terminals do not need to be declared unless you
want to associate them with a type to store attributes
Token Values
• Each symbol in a yacc parser can have an associated
value. (See "Symbol Values.")
• Since tokens can have values, you need to set the
values as the lexer returns tokens to the parser.
• The token value is always stored in the variable
yylval.
• Example :
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
Symbol Values
• Every symbol in a yacc parser, both tokens and non-
terminals, can have a value associated with it.
• If the token were NUMBER, the value might be the
particular number, if it were STRING, the value
might be a pointer to a copy of the string, and if it
were SYMBOL, the value might be a pointer to an
entry in the symbol table that describes the symbol.
Continued..
• Ex: C type, int or double for the number,
char * for the string, and a pointer to a
structure for the symbol.
• Yacc makes it easy to assign types to symbols
so that it automatically uses the correct type
for each symbol.
Declaring Symbol Types
• Internally, yacc declares each value as a C union
that includes all of the types.
• You list all of the types in a %union declaration,
q.v.
• Yacc turns this into a typedef for a union type
called YYSTYPE.
• Then for each symbol whose value is set or used
in action code, you have to declare its type.
%type Declaration
• You declare the types of non-terminals using
%type. Each declaration has the form:
%type <type> name, name, ...
• The type name must have been defined by a
%union.
• Each name is the name of a non-terminal symbol.
• Use %type to declare non-terminals
%union Declaration
• The %union declaration identifies all of the possible C
types that a symbol value can have. The declaration
takes this form:
%union {
... field declarations ...
}
• The field declarations are copied verbatim into a C
union declaration of the type YYSTYPE in the output
file. Yacc does not check to see if the contents of the
%union are valid C.
• In the absence of a %union declaration, yacc defines
YYSTYPE to be int so all of the symbol values are
integers.
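Roughly, yacc pastes the %union body into a typedef named YYSTYPE in the generated files. The sketch below shows the shape of the result; the field names iValue and sValue are illustrative examples, not anything yacc itself defines:

```c
/* Approximately what yacc emits for a %union declaration;
 * the field names here are illustrative examples. */
typedef union {
    int   iValue;    /* value of an integer token     */
    char *sValue;    /* value of a string-typed token */
} YYSTYPE;

YYSTYPE yylval;      /* the lexer stores each token's value here */
```

Without any %union, this collapses to `typedef int YYSTYPE;`, matching the all-int default described above.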
Start Declaration/ Start Symbol
• Normally, the start rule, the one that the parser starts
trying to parse, is the one named in the first rule.
• If you want to start with some other rule, in the
declaration section you can write:
Format : %start somename
//to start with rule somename
EX:
• The default start symbol is the first non-terminal specified in
the grammar specification section.
• To override it, use the %start declaration:
%start non-terminal
Example Definitions Section
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token ID NUM
%start expr
ID and NUM are terminals; expr is the start symbol from which the
parser begins.
Operator Declarations
• Operator declarations appear in the definitions
section.
• The possible declarations are %left, %right, and
%nonassoc. (In very old grammars you may
find the obsolete equivalents %<, %>, and %2 or
%binary.)
• The %left and %right declarations make an operator
left or right associative, respectively.
• You declare non-associative operators with
%nonassoc.
Continued..
• Operators are declared in increasing order of
precedence.
• All operators declared on the same line are at the
same precedence level.
• Example :
%left PLUSnumber, MINUSnumber
%left TIMESnumber, DIVIDEnumber
Example
%union {
int iValue; /* integer value */
char sIndex; /* symbol table index */
nodeType *nPtr; /* node pointer */
};
%token <iValue> INTEGER
%token <sIndex> VARIABLE
%token WHILE IF PRINT
%nonassoc IFX
%nonassoc ELSE
%left GE LE EQ NE '>' '<'
%left '+' '-'
%left '*' '/'
%nonassoc UMINUS
%type <nPtr> stmt expr stmt_list
YACC Declaration Summary
`%start'
Specify the grammar's start symbol
`%union'
Declare the collection of data types that semantic values may have
`%token'
Declare a terminal symbol (token type name) with no precedence
or associativity specified
`%type'
Declare the type of semantic values for a nonterminal symbol
YACC Declaration Summary
`%right'
Declare a terminal symbol (token type name) that is
right-associative
`%left'
Declare a terminal symbol (token type name) that is left-associative
`%nonassoc'
Declare a terminal symbol (token type name) that is nonassociative
(using it in a way that would be associative is a syntax error,
e.g. x op y op z is a syntax error)
Yacc RULE SECTION
• The rules section contains grammar rules and actions
containing C code.
• A yacc rule consists of two parts: a grammar rule and an action.
◦ the rules section consists of:
 BNF grammar .
 ACTION
RULES
• A yacc grammar consists of a set of rules.
• Each rule starts with a nonterminal symbol and a
colon, and is followed by a possibly empty list of
symbols, literal tokens, and actions.
• Rules by convention end with a semicolon,
although in most versions of yacc the semicolon is
optional.
• The rules section is made up of one or more
grammar rules.
A grammar rule has the form:
A : BODY ;
A represents a nonterminal name, and BODY
represents a sequence of zero or more names and
literals. The colon and the semicolon are Yacc
punctuation.
If a nonterminal symbol matches the empty string, the rule is written:
empty : ;
Continued..
• If there are several grammar rules with the same left
hand side, the vertical bar ``|'' can be used to avoid
rewriting the left hand side. In addition, the semicolon
at the end of a rule can be dropped before a vertical bar.
Thus the grammar rules
A : B C D ;
A : E F ;
A : G ;
• can be given to Yacc as
A : B C D
| E F
| G
;
Example Rules Section
• This section defines grammar
• Example
expr : expr '+' term | term;
term : term '*' factor | factor;
factor : '(' expr ')' | ID | NUM;
%token NAME NUMBER
%%
statement: NAME '=' expression
| expression
;
expression: NUMBER '+' NUMBER
| NUMBER '-' NUMBER
;
Note :Unlike lex, yacc pays no attention to line boundaries in
the rules section, and you will find that a lot of whitespace
makes grammars easier to read.
The symbol on the left-hand side of the first rule in the
grammar is normally the start symbol.
EXAMPLE
Actions
• An action is C code executed when yacc matches a
rule in the grammar.
• The action must be a C compound statement,
Example:
date: month '/' day '/' year { printf("date found"); }
;
The action can refer to the values associated with the
symbols in the rule by using a dollar sign followed by
a number, with the first symbol after the colon being
number 1.
CONTINUED..
EXAMPLE :
date: month '/' day '/' year
{ printf("date %d-%d-%d found", $1, $3, $5); }
;
The name "$$" refers to the value for the symbol to
the left of the colon.
• For rules with no action, yacc uses a default of:
{ $$ = $1; }
THE RULE'S ACTION
• Whenever the parser reduces a rule, it executes
user C code associated with the rule, known as the
rule's action.
• The action appears in braces after the end of the
rule, before the semicolon or vertical bar.
• The action code can refer to the values of the right-
hand side symbols as $1, $2, . . . , and can set
• the value of the left-hand side by setting $$.
• In our parser, the value of an expression symbol is
the value of the expression it represents.
Rule Reduction and Action
stat: expr {printf("%d\n", $1);}
| LETTER '=' expr {regs[$1] = $3;} ;
expr:
expr '+' expr {$$ = $1 + $3;} |
LETTER {$$ = regs[$1];} ;
Grammar rule Action
“or” operator:
For multiple RHS
Rules Section
• Normally written like this
• Example:
expr : expr '+' term
| term
;
term : term '*' factor
| factor
;
factor : '(' expr ')'
| ID
| NUM
;
The Position of Rules
expr : expr '+' term { $$ = $1 + $3; }
| term { $$ = $1; }
;
term : term '*' factor { $$ = $1 * $3; }
| factor { $$ = $1; }
;
factor : '(' expr ')' { $$ = $2; }
| ID
| NUM
;
In the rule expr : expr '+' term, the first right-hand-side symbol
expr is $1, the '+' token is $2, and term is $3; the value of the
left-hand side is $$. When no action is given, the default is
$$ = $1;
SUBROUTINES SECTION
the subroutines section consists of:
◦ user subroutines .
User Code Section
• It can consist of any legal C code.
• Yacc copies it to the C file after the end
of the generated parser code. When using a lex-
generated lexer, this section typically supplies
main() and yyerror().
• The user code section is simply copied to
y.tab.c verbatim.
• The presence of this section is optional; if it is
missing, the second %% in the input file may
be skipped.
Format
void yyerror()
{
}
int main(void)
{
yyparse();
return 0;
}
Example
void yyerror(char *s)
{
fprintf(stderr, "%s\n", s);
}
int main(void)
{
yyparse();
return 0;
}
example
main()
{
return(yyparse());
}
yyerror(char *s)
{
fprintf(stderr, "%s\n", s);
}
yywrap()
{
return(1);
}
Example
yyerror(const char *str)
{
printf("yyerror: %s at line %d\n", str, yyline);
}
main()
{
if (!yyparse())
printf("accept\n");
else
printf("reject\n");
}
Yacc Program Structure
%{
#include <stdio.h>
int regs[26];
int base;
%}
%token NUMBER LETTER
%left '+' '-'
%left '*' '/'
%%
list : | list stat '\n' | list error '\n' { yyerrok; } ;
stat : expr { printf("%d\n", $1); }
| LETTER '=' expr { regs[$1] = $3; } ;
expr :
'(' expr ')' { $$ = $2; } |
expr '+' expr { $$ = $1 + $3; } |
LETTER { $$ = regs[$1]; }
%%
main() { return(yyparse()); }
yyerror(char *s) { fprintf(stderr, "%s\n", s); }
yywrap() { return(1); }
… definitions …
%%
… rules …
%%
… subroutines …
An YACC File Example
%{
#include <stdio.h>
%}
%token NAME NUMBER
%%
statement: NAME '=' expression
| expression { printf("= %d\n", $1); }
;
expression: expression '+' NUMBER { $$ = $1 + $3; }
| expression '-' NUMBER { $$ = $1 - $3; }
| NUMBER { $$ = $1; }
;
%%
int yyerror(char *s)
{
fprintf(stderr, "%s\n", s);
return 0;
}
int main(void)
{
yyparse();
return 0;
}
A YACC PARSER
• A literal consists of a character enclosed in single
quotes "'". As in C, the backslash "\" is an escape
character within literals, and all the C escapes are
recognized. Thus
• '\n' newline
• '\r' return
• '\'' single quote "'"
• '\\' backslash "\"
• '\t' tab
• '\b' backspace
• '\f' form feed
• '\xxx' "xxx" in octal
• For a number of technical reasons, the NUL character
('\0' or 0) should never be used in grammar rules.
How YACC Works
a.out
File containing desired
grammar in yacc format
yacc program
C source program created by yacc
C compiler
Executable program that will parse
grammar given in gram.y
gram.y
yacc
y.tab.c
cc
or gcc
yacc
How YACC Works
(1) Parser generation time
YACC source (*.y)
y.tab.h
y.tab.c
C compiler/linker
(2) Compile time
y.tab.c a.out
a.out
(3) Run time
Token stream
Abstract
Syntax
Tree
y.output
Creating, Compiling and Running a
Simple Parser
• Yacc environment
– Yacc processes a yacc specification file and produces
a y.tab.c file.
– An integer function yyparse() is produced by Yacc.
• Calls yylex() to get tokens.
• Return non-zero when an error is found.
• Return 0 if the program is accepted.
– Need main() and yyerror() functions.
Step to create & run/execute the yacc
program
1. Create a file: vi filename.y
2. Type the source code in vi/gedit and save it by
pressing Esc, then :wq, then Enter.
3. Compile the yacc file (filename.y) to generate the C routines
(y.tab.c and y.tab.h): yacc -d filename.y (the token
definitions are created by the -d flag)
4. Compile: cc y.tab.c -ly to generate the output file ./a.out
(object file)
5. % cc y.tab.c -o <execfilename>
Parser-Lexer Communication
• To try out our parser, we need a lexer to feed it
tokens.
• When you use a lex scanner and a yacc parser
together, the parser is the higher level routine.
• It calls the lexer yylex() whenever it needs a token
from the input.
• The lexer then scans through the input recognizing
tokens.
• As soon as it finds a token of interest to the parser, it
returns to the parser, returning the token's code as
the value of yylex().
Continued..
• Not all tokens are of interest to the parser-in
most programming languages the parser
doesn't want to hear about comments and
whitespace.
• Yacc defines the token names in the parser as
C preprocessor names in y.tab.h
Works with Lex
YACC
yyparse()
Input programs
12 + 26
LEX
yylex()
How does it work?
Works with Lex
YACC
yyparse()
Input programs
12 + 26
LEX
yylex()
call yylex()
[0-9]+
next token is NUM
NUM ‘+’ NUM
Communication between LEX and YACC
YACC
yyparse()
Input programs
12 + 26
LEX
yylex()
call yylex()
[0-9]+
next token is NUM
NUM ‘+’ NUM
tokens shared between LEX and YACC
Communication between LEX and
YACC
yacc -d gram.y
Will produce:
y.tab.h
• y.tab.h uses #define (or an enumeration) for each token
• YACC generates y.tab.h
• LEX includes y.tab.h
Communication between LEX and YACC
%{
#include <stdio.h>
#include "y.tab.h"
%}
id [_a-zA-Z][_a-zA-Z0-9]*
%%
int { return INT; }
char { return CHAR; }
float { return FLOAT; }
{id} { return ID;}
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token CHAR, FLOAT, ID, INT
%%
yacc -d xxx.y
Produced
y.tab.h:
# define CHAR 258
# define FLOAT 259
# define ID 260
# define INT 261
parser.y
scanner.l
Yacc Example
• Taken from Lex & Yacc
• Simple calculator
a = 4 + 6
a
a=10
b = 7
c = a + b
c
c = 17
$
Create ,Compiling and Running a
Simple lex & Parser(yacc)
• Lex part:
% vi ch1-n.l
% lex ch1-n.l
Yacc part:
% vi ch1-m.y
% yacc -d ch1-m.y
Compile both the lex & yacc output:
% cc -c lex.yy.c y.tab.c
% cc -o example-m.n lex.yy.o y.tab.o -ll
Example
% yacc -d ch3-01.y # makes y.tab.c and y.tab.h
% lex ch3-01.l # makes lex.yy.c
% cc -o ch3-01 y.tab.c lex.yy.c -ly -ll # compile and link C files
% ch3-01
99+12
= 111
% ch3-01
Example of lex and yacc program
Grammar
expression ::= expression '+' term |
expression '-' term |
term
term ::= term '*' factor |
term '/' factor |
factor
factor ::= '(' expression ')' |
'-' factor |
NUMBER |
NAME
Parser (cont’d)
statement_list: statement '\n'
| statement_list statement '\n'
;
statement: NAME '=' expression { $1->value = $3; }
| expression { printf("= %gn", $1); }
;
expression: expression '+' term { $$ = $1 + $3; }
| expression '-' term { $$ = $1 - $3; }
| term
;
parser.y
Parser (cont’d)
term: term '*' factor { $$ = $1 * $3; }
| term '/' factor { if ($3 == 0.0)
yyerror("divide by zero");
else
$$ = $1 / $3;
}
| factor
;
factor: '(' expression ')' { $$ = $2; }
| '-' factor { $$ = -$2; }
| NUMBER { $$ = $1; }
| NAME { $$ = $1->value; }
;
%%
parser.y
Scanner
%{
#include "y.tab.h"
#include "parser.h"
#include <math.h>
%}
%%
([0-9]+|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?) {
yylval.dval = atof(yytext);
return NUMBER;
}
[ \t] ; /* ignore white space */
scanner.l
Scanner (cont’d)
[A-Za-z][A-Za-z0-9]* { /* return symbol pointer */
yylval.symp = symlook(yytext);
return NAME;
}
"$" { return 0; /* end of input */ }
\n|"="|"+"|"-"|"*"|"/" return yytext[0];
%%
scanner.l
YACC
• Rules may be recursive
• Rules may be ambiguous*
• Rules may be conflicts
• Uses bottom up Shift/Reduce parsing
– Get a token
– Push onto stack
– Can it reduced (How do we know?)
• If yes: Reduce using a rule
• If no: Get another token
• Yacc cannot look ahead more than one token
Phrase -> cart_animal AND CART
| work_animal AND PLOW
…
Define an ambiguity and conflicts.
How it arises .
• Ambiguous means there are multiple possible
parses (outputs) for the same input.
• Conflicts mean that yacc can't properly parse a
grammar, probably because it's ambiguous.
• It arises due to precedence and associativity
operators is not specified.
• Conflicts may arise because of mistakes in input
or logic, or because the grammar rules, while
consistent, require a more complex parser than
Yacc can construct.
Example: "2+3*4" can be parsed as
• (2 + 3) * 4 or 2 + (3 * 4)
what Yacc Cannot Parse ?
If its Ambiguity, Unambiguity and Conflicts
• In some cases the grammar is truly ambiguous, that
is, there are two possible parses (outputs) for a single
input string, and yacc cannot handle that.
• In others, the grammar is unambiguous, but the
parsing technique that yacc uses is not powerful
enough to parse the grammar.
• The problem in an unambiguous grammar with
conflicts is that the parser would need to look more
than one token ahead to decide which of two
possible parses to use.
• Yacc takes a default action when there is a conflict.
Example : Input is HORSE AND CART
phrase : cart_animal AND CART
| work_animal AND PLOW
cart_animal : HORSE | GOAT
work_animal : HORSE | OX
• yacc can't handle this because it requires two
symbols of lookahead; it cannot look ahead more
than one token. If we changed the first rule to this:
phrase : cart_animal CART
| work_animal PLOW
Continued..
• A grammar given to yacc should be neither ambiguous
nor contain conflicts.
• Yacc may fail to translate a grammar specification because
the grammar is ambiguous or contains conflicts.
Example of ambiguity: A + B * C can be parsed two ways:
1. (A + B) * C
2. A + (B * C)
Example of a conflict, where an 'X' could be either a proga or a progb:
%%
prog : proga | progb ;
proga : 'X' ;
progb : 'X' ;
Types of conflicts
Shift/Reduce Conflicts/ bottom-up
•A shift/reduce conflict occurs
when there are two possible parses
for an input string, and one of the
parses completes a rule (the
reduce option) and one doesn't
(the shift option).
•Ex:
E : 'X' | E '+' E ;
• For the input string X + X + X there are
two possible parses, "(X+X)+X" or
"X+(X+X)".
Reduce/Reduce Conflicts
•A reduce/reduce conflict occurs
when the same token could
complete two different rules.
•Example:
S : E | T ;
E : id ;
T : id ;
An "id" could either be an E or a T.
Disambiguating Rule
• A rule describing which choice to make in a given
situation is called a disambiguating rule.
• Yacc invokes two disambiguating rules by default:
1. In a shift/reduce conflict, the default is to do the
shift.
2. In a reduce/reduce conflict, the default is to reduce
by the earlier grammar rule (in the input sequence).
Arithmetic Expressions
• To make the arithmetic expressions more general and realistic, we extend
the expression rules to handle multiplication and division, unary
negation, and parenthesized expressions:
expression: expression '+' expression { $$ = $1 + $3; }
| expression '-' expression { $$ = $1 - $3; }
| expression '*' expression { $$ = $1 * $3; }
| expression '/' expression
{ if ($3 == 0)
yyerror("divide by zero");
else
$$ = $1 / $3;
}
| '-' expression { $$ = -$2; }
| '(' expression ')' { $$ = $2; }
| NUMBER { $$ = $1; }
;
Precedence, Associativity, and Operator
Declarations
• All yacc grammars have to be unambiguous.
• Unambiguous means that is, there is only one possible way
to parse any legal input using the rules in the grammar.
• Ambiguous grammars cause conflicts, situations where
there are two possible parses and hence two different ways
that yacc can process a token.
• When yacc processes an ambiguous grammar, it uses default
rules to decide which way to parse an ambiguous sequence.
• Often these rules do not produce the desired result, so yacc
includes operator declarations that let you change the way it
handles shift/reduce conflicts that result from ambiguous
grammars.
Precedence and Associativity
• The rules for determining what operands group
with which operators are known as precedence
and associativity.
Ex :
a = b = c + d / e / f
a = (b = (c + ((d / e) / f)))
Precedence
• Precedence controls which operators to execute first
in an expression. Or:
• Precedence assigns each operator a precedence
"level."
• In any expression grammar, operators are grouped
into levels of precedence from lowest to highest.
• Operators at higher levels bind more tightly;
e.g., if "*" has higher precedence than "+",
"A+B*C" is treated as "A+(B*C)", while "D*E+F" is
"(D*E)+F".
Associativity
• Associativity controls the grouping of operators at
the same precedence level. Or:
• Associativity controls how the grammar groups
expressions using the same operator or different
operators with the same precedence, whether they
group from the left, from the right, or not at all.
• If "-" were left associative, the expression "A-B-C"
would mean "(A-B)-C”, while if it were right
associative it would mean "A-(B-C)".
How to specify Precedence and Associativity ?
• There are two ways to specify precedence and
associativity in a grammar, implicitly and
explicitly.
• To specify them implicitly, rewrite the
grammar using separate non-terminal symbols
for each precedence level.
Declarations
IMPLICITLY
Ex:
expression: expression '+' mulexp
| expression '-' mulexp
| mulexp
;
mulexp: mulexp '*' primary
| mulexp '/' primary
| primary
;
primary: '(' expression ')'
| '-' primary
| NUMBER
;
EXPLICITLY
Precedence / Association
1. 1-2-3 = (1-2)-3? or 1-(2-3)?
Define the '-' operator as left-associative.
2. 1-2*3 = 1-(2*3)
Define the '*' operator as having higher precedence than '-'.
expr: expr '-' expr
| expr '*' expr
| expr '<' expr
| '(' expr ')'
...
;
(1) 1 – 2 - 3
(2) 1 – 2 * 3
Precedence / Association
%right ‘=‘
%left '<' '>' NE LE GE
%left '+' '-‘
%left '*' '/'
highest precedence
Precedence / Association
expr : expr ‘+’ expr { $$ = $1 + $3; }
| expr ‘-’ expr { $$ = $1 - $3; }
| expr ‘*’ expr { $$ = $1 * $3; }
| expr ‘/’ expr
{
if($3==0)
yyerror(“divide 0”);
else
$$ = $1 / $3;
}
| ‘-’ expr %prec UMINUS {$$ = -$2; }
Shift/Reduce Conflicts
• shift/reduce conflict
– occurs when a grammar is written in such a way
that a decision between shifting and reducing can
not be made.
– e.g., the ambiguous IF-ELSE (dangling else).
• To resolve this conflict, yacc will choose to shift.
Shift/Reduce Parsing
• When yacc processes a parser, it creates a set of
states each of which reflects a possible position
in one or more partially parsed rules.
Shift parsing
• As the parser reads tokens, each time it reads a
token that doesn't complete a rule it pushes the
token on an internal stack and switches to a new
state reflecting the token it just read.
• This action is called a shift.
Reduce parsing
• When it has found all the symbols that constitute
the right-hand side of a rule, it pops the right-hand
side symbols off the stack, pushes the left-hand side
symbol onto the stack, and switches to a new state
reflecting the new symbol on the stack. This action
is called a reduction, since it usually reduces the
number of items on the stack.
• (Not always, since it is possible to have rules with
empty right-hand sides.) Whenever yacc reduces a
rule, it executes user code associated with the rule.
Example "fred = 12 + 13"
The parser starts by shifting tokens on to the internal
stack one at a time:
Shift (push):
fred
fred =
fred = 12
fred = 12 +
fred = 12 + 13
Now reduce the rule "expression->NUMBER +
NUMBER" so it pops the 12, the plus, and the 13 from
the stack and replaces them with expression
Reduce (pop):
fred = expression
Now it reduces the rule "statement -> NAME =
expression", so it pops fred, =, and expression and
replaces them with statement:
statement
At the end of the input the stack has been reduced
to the start symbol, so the input was valid according
to the grammar.
yacc& lex in Together
• The grammar:
program -> program expr | ε
expr -> expr + expr | expr * expr | id
• program and expr are nonterminals.
• id is a terminal (a token returned by lex).
• An expression may be:
– the sum of two expressions,
– the product of two expressions,
– or an identifier.
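A matching lex specification for this grammar might look like the following sketch (the token name ID is an assumption; y.tab.h is the header that yacc -d generates, giving the lexer the token codes):

```lex
%{
#include "y.tab.h"
%}
%%
[a-z]+    { return ID; }
[+*]      { return yytext[0]; }
[ \t]     ;   /* skip whitespace */
\n        return 0;
%%
```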
When Not to Use Precedence Rules
• You can use precedence rules to fix any
shift/reduce conflict that occurs in the grammar.
• Use precedence in only two situations: in
expression grammars, and to resolve the "dangling
else" conflict in grammars for if-then-else language
constructs.
• Otherwise, if you can, you should fix the grammar
to remove the conflict.
How the Parser Works
• Yacc turns the specification file into a C program,
which parses the input according to the
specification given.
• The parser produced by Yacc consists of a finite
state machine with a stack.
• The parser is also capable of reading and
remembering the next input token (called the
lookahead token).
Continued ..
• The current state is always the one on the top
of the stack.
• The states of the finite state machine are given
small integer labels; initially, the machine is in
state 0, the stack contains only state 0, and no
lookahead token has been read.
The machine has only four actions available to it, called
shift, reduce, accept, and error.
• A move of the parser is done as follows:
1. Based on its current state, the parser decides
whether it needs a lookahead token to decide what
action should be done; if it needs one, and does not
have one, it calls yylex to obtain the next token.
2. Using the current state, and the lookahead token if
needed, the parser decides on its next action, and
carries it out. This may result in states being pushed
onto the stack, or popped off the stack, and in the
lookahead token being processed or left alone.
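In outline, the driver loop behaves like this pseudocode (schematic only; the real generated parser is table-driven and keeps state numbers, not symbols, on its stack):

```
loop:
    if the current state needs a lookahead and none is held:
        token = yylex()
    consult the action table for (state, token):
        shift:  push the new state; consume the token
        reduce: pop one state per right-hand-side symbol,
                run the rule's action code,
                push the state given by the goto table for the
                left-hand-side symbol
        accept: parse succeeded; return 0
        error:  call yyerror("syntax error")
```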
Variables and Typed Tokens
%{
double vbltable[26];
%}
%union {
double dval;
int vblno;
}
%type <dval> expression
Symbol Values and %union
1. Why not have the lexer return the value of the variable as a double, to
make the parser simpler?
The problem is that there are two contexts where a variable name
can occur: as part of an expression, in which case we want the double
value, and to the left of an equal sign, in which case we need to
remember which variable it is so we can update vbltable.
To define the possible symbol types, in the definition section we add a
%union declaration:
%union {
double dval;
int vblno;
}
The y.tab.h generated from this grammar:
#define NAME 257
#define NUMBER 258
#define UMINUS 259
typedef union {
double dval;
int vblno;
} YYSTYPE;
extern YYSTYPE yylval;
We have to tell the parser which symbols use which type of value.
%token <vblno> NAME
%token <dval> NUMBER
%type <dval> expression
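With those declarations in place, rule actions can use the typed values directly. For example (a sketch based on the vbltable array declared above; it assumes '+' has been given a precedence with %left so the grammar is unambiguous):

```yacc
statement: NAME '=' expression        { vbltable[$1] = $3; }
         ;
expression: expression '+' expression { $$ = $1 + $3; }
          | NUMBER                    { $$ = $1; }
          | NAME                      { $$ = vbltable[$1]; }
          ;
```

Here $1 in the statement rule is the int vblno that the lexer stored for NAME, while $3 is the double dval computed for the expression.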
Yacc Library
• You can include the library by giving the -ly flag at
the end of the cc command line on UNIX systems,
or the equivalent on other systems.
• main()
• yyerror()
• yyparse()
yyerror()
• Whenever a yacc parser detects a syntax error, it
calls yyerror() to report the error to the user,
passing it a single argument, a string describing the
error. (Usually the only error you ever get is "syntax
error.")
• The default version of yyerror in the yacc library
merely prints its argument on the standard error.
Syntax:
void yyerror(const char *s)
{
printf("invalid: %s\n", s);
}
yyparse()
• The entry point to a yacc-generated parser is yyparse().
• When your program calls yyparse(), the parser attempts
to parse an input stream.
• The parser returns a value of zero if the parse succeeds and
non-zero if not.
• Every time you call yyparse() the parser starts parsing
anew, forgetting whatever state it might have been in the
last time it returned.
Syntax:
int main(void)
{
return yyparse(); /* 0 on success, non-zero on error */
}
Lex file: bas.l
Yacc file: bas.y
Linking lex & yacc
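The usual build sequence for these two files is (a sketch for a UNIX system; yacc's -d flag produces the y.tab.h that the lexer includes, and -ly/-ll link the yacc and lex libraries):

```sh
yacc -d bas.y    # generates y.tab.c and y.tab.h
lex bas.l        # generates lex.yy.c
cc y.tab.c lex.yy.c -ly -ll -o bas
```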
Lex v/s Yacc
• Lex
– Lex generates C code for a lexical analyzer, or scanner
– Lex uses patterns that match strings in the input and
converts the strings to tokens
• Yacc
– Yacc generates C code for a syntax analyzer, or parser.
– Yacc uses grammar rules that allow it to analyze tokens
from Lex and create a syntax tree.
Lex with Yacc
• Lex source (lexical rules) is compiled by lex into lex.yy.c,
which defines yylex().
• Yacc source (grammar rules) is compiled by yacc into y.tab.c,
which defines yyparse().
• yyparse() calls yylex(); each call returns the next token,
and the input is parsed.
RECOMMENDED QUESTIONS:
1. Give the specification of a yacc program, with an example. (8)
2. What is a grammar? How does yacc parse a tree? (5)
3. How do you compile a yacc file? (5)
4. Explain the ambiguity occurring in a grammar with an example. (6)
5. Explain shift/reduce and reduce/reduce conflicts. (8)
6. Write a yacc program to test the validity of an arithmetic
expression. (8)
7. Write a yacc program to accept strings of the form a^n b^n,
n > 0. (8)
Module4 lex and yacc.ppt

More Related Content

What's hot

Planning the development process
Planning the development processPlanning the development process
Planning the development processSiva Priya
 
Lecture 02 lexical analysis
Lecture 02 lexical analysisLecture 02 lexical analysis
Lecture 02 lexical analysisIffat Anjum
 
Lex and Yacc ppt
Lex and Yacc pptLex and Yacc ppt
Lex and Yacc pptpssraikar
 
9. Software Implementation
9. Software Implementation9. Software Implementation
9. Software Implementationghayour abbas
 
Peephole optimization techniques
Peephole optimization techniquesPeephole optimization techniques
Peephole optimization techniquesgarishma bhatia
 
Unit 1-problem solving with algorithm
Unit 1-problem solving with algorithmUnit 1-problem solving with algorithm
Unit 1-problem solving with algorithmrajkumar1631010038
 
Language processing activity
Language processing activityLanguage processing activity
Language processing activityDhruv Sabalpara
 
Stacks overview with its applications
Stacks overview with its applicationsStacks overview with its applications
Stacks overview with its applicationsSaqib Saeed
 
Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design MAHASREEM
 
Lexical Analysis
Lexical AnalysisLexical Analysis
Lexical AnalysisNayemid4676
 

What's hot (20)

Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
 
Planning the development process
Planning the development processPlanning the development process
Planning the development process
 
Lex
LexLex
Lex
 
C function presentation
C function presentationC function presentation
C function presentation
 
Predictive parser
Predictive parserPredictive parser
Predictive parser
 
Lecture 02 lexical analysis
Lecture 02 lexical analysisLecture 02 lexical analysis
Lecture 02 lexical analysis
 
Lex and Yacc ppt
Lex and Yacc pptLex and Yacc ppt
Lex and Yacc ppt
 
1.Role lexical Analyzer
1.Role lexical Analyzer1.Role lexical Analyzer
1.Role lexical Analyzer
 
9. Software Implementation
9. Software Implementation9. Software Implementation
9. Software Implementation
 
Peephole optimization techniques
Peephole optimization techniquesPeephole optimization techniques
Peephole optimization techniques
 
Code Generation
Code GenerationCode Generation
Code Generation
 
Parsing LL(1), SLR, LR(1)
Parsing LL(1), SLR, LR(1)Parsing LL(1), SLR, LR(1)
Parsing LL(1), SLR, LR(1)
 
Unit 1-problem solving with algorithm
Unit 1-problem solving with algorithmUnit 1-problem solving with algorithm
Unit 1-problem solving with algorithm
 
Lexical analysis-using-lex
Lexical analysis-using-lexLexical analysis-using-lex
Lexical analysis-using-lex
 
Language processing activity
Language processing activityLanguage processing activity
Language processing activity
 
Lexical analyzer
Lexical analyzerLexical analyzer
Lexical analyzer
 
Parsing
ParsingParsing
Parsing
 
Stacks overview with its applications
Stacks overview with its applicationsStacks overview with its applications
Stacks overview with its applications
 
Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design Syntax Analysis in Compiler Design
Syntax Analysis in Compiler Design
 
Lexical Analysis
Lexical AnalysisLexical Analysis
Lexical Analysis
 

Similar to Module4 lex and yacc.ppt (20)

Lex & yacc
Lex & yaccLex & yacc
Lex & yacc
 
11700220036.pdf
11700220036.pdf11700220036.pdf
11700220036.pdf
 
Compiler design Project
Compiler design ProjectCompiler design Project
Compiler design Project
 
Ch 2.pptx
Ch 2.pptxCh 2.pptx
Ch 2.pptx
 
Language for specifying lexical Analyzer
Language for specifying lexical AnalyzerLanguage for specifying lexical Analyzer
Language for specifying lexical Analyzer
 
module 4.pptx
module 4.pptxmodule 4.pptx
module 4.pptx
 
Lexical Analyzer Implementation
Lexical Analyzer ImplementationLexical Analyzer Implementation
Lexical Analyzer Implementation
 
Compiler Design
Compiler DesignCompiler Design
Compiler Design
 
Cd ch2 - lexical analysis
Cd   ch2 - lexical analysisCd   ch2 - lexical analysis
Cd ch2 - lexical analysis
 
Compiler design and lexical analyser
Compiler design and lexical analyserCompiler design and lexical analyser
Compiler design and lexical analyser
 
Lex and Yacc Tool M1.ppt
Lex and Yacc Tool M1.pptLex and Yacc Tool M1.ppt
Lex and Yacc Tool M1.ppt
 
Lexical
LexicalLexical
Lexical
 
Lex and Yacc.pdf
Lex and Yacc.pdfLex and Yacc.pdf
Lex and Yacc.pdf
 
LANGUAGE TRANSLATOR
LANGUAGE TRANSLATORLANGUAGE TRANSLATOR
LANGUAGE TRANSLATOR
 
Handout#02
Handout#02Handout#02
Handout#02
 
COMPILER CONSTRUCTION KU 1.pptx
COMPILER CONSTRUCTION KU 1.pptxCOMPILER CONSTRUCTION KU 1.pptx
COMPILER CONSTRUCTION KU 1.pptx
 
automata theroy and compiler designc.pptx
automata theroy and compiler designc.pptxautomata theroy and compiler designc.pptx
automata theroy and compiler designc.pptx
 
1._Introduction_.pptx
1._Introduction_.pptx1._Introduction_.pptx
1._Introduction_.pptx
 
Pcd question bank
Pcd question bank Pcd question bank
Pcd question bank
 
CD U1-5.pptx
CD U1-5.pptxCD U1-5.pptx
CD U1-5.pptx
 

Recently uploaded

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 

Recently uploaded (20)

9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 

Module4 lex and yacc.ppt

  • 1. Introduction to Lex and Yacc Prepared by, Prof. Aruna M.G Computer Science & Engineering Department M.S.E.C Bangalore
  • 2. Lex and Yacc • Two Compiler Writing Tools that are Utilized to easily Specify: – Lexical Tokens and their Order of Processing (Lex) – Context Free Grammar for LALR(1) (Yacc) • Both Lex and Yacc have Long History in Computing – Lex and Yacc – Earliest Days of Unix Minicomputers – Flex and Bison – From GNU – JFlex - Fast Scanner Generator for Java – BYacc/J – Berkeley – CUP, ANTRL, PCYACC, … – PCLEX and PCYACC from Abacus
  • 5. General Compiler Infra-structure Scanner (tokenizer) Parser Semantic Routines Analysis/ Transformations/ optimizations Code Generator Program source (stream of characters) Tokens Syntactic Structure IR: Intermediate Representation (1) Assembly code IR: Intermediate Representation (2) Symbol and Attribute Tables
  • 7. flex - fast lexical analyzer generator • Flex is a tool for generating scanners. • Flex source is a table of regular expressions and corresponding program fragments. • Generates lex.yy.c which defines a routine yylex()
  • 8. Lex  Written by Eric Schmidt and Mike Lesk.  lex is a program (generator) that generates lexical analyzers, (widely used on Unix).  It is mostly used with Yacc parser generator.  It reads the input stream (specifying the lexical analyzer ) and outputs source code implementing the lexical analyzer in the C programming language.  Lex will read patterns (regular expressions); then produces C code for a lexical analyzer that scans for identifiers.
  • 9. What is Lex? • The main job of a lexical analyzer (scanner) is to break up an input stream into more usable elements (tokens) a = b + c * d; ID ASSIGN ID PLUS ID MULT ID SEMI
  • 10. Lex – Lexical Analyzer • A set of descriptions of possible tokens and producing a C routine is called a lexical analyzer or scanner or lexer. • Lexical analyzers tokenize input streams • Tokens are the terminals of a language – English • words, punctuation marks, … – Programming language • Identifiers, operators, keywords, … • Regular expressions define terminals/tokens
  • 11. LEXER • Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. • A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner. • A lexer often exists as a single function which is called by a parser or another function
  • 12. Token • A token is a string of characters, categorized according to the rules as a symbol (e.g. IDENTIFIER, NUMBER, COMMA, etc.).
  • 13. Tokens • Tokens in Lex are declared like variable names in C. Every token has an associated expression. Token Associated expression Meaning number ([0-9])+ 1 or more occurrences of a digit chars [A-Za-z] Any character blank " " A blank space word (chars)+ 1 or more occurrences of chars variable (chars)+(number)*(chars)*( number)*
  • 14. Consider this expression in the C programming language: sum=3+2; Tokenized in the following table: lexeme token type sum Identifier = Assignment operator 3 Number + Addition operator 2 Number ; End of statement
  • 15. Lex – A Lexical Analyzer Generator • A Unix Utility from early 1970s • A Compiler that Takes as Source a Specification for: – Tokens/Patterns of a Language – Generates a “C” Lexical Analyzer Program • Pictorially:Creating a Lexical Analyzer with Lex Lex Compiler C Compiler a.out Lex Source Program: lex.l lex.yy.c lex.yy.c a.out Input stream Sequence of tokens
  • 16. Step for executing lex program • First a specification of a lexical analyzer is prepared by creating a program Lex-l(filename.l) in the Lex language. • Lex-l is run through the Lex compiler to produce a C program Lex.YY.C. • The program Lex.YY.C consists of a tabular representation of a transition diagram constructed from the regular expressions of lex.l, together with a standard routine that uses the table to recognize LEXEMER.
  • 17. Continued.. • The lexical analyses phase reads the characters in the source program and groups them into a stream of tokens in which each token represents a logically cohesive sequence of characters, such as an identifier, a keyword (if, while, etc.) a punctuation character or a multi-character operator like : = . • The character sequence forming a token is called the lexeme for the token. • The actions associated with regular expressions in lex - a are pieces of C code and are carried over directly to lex. YY.C. • Finally, lex .YY.C is run through the C compiler to produce an object program a.out.
  • 18. LEX SPECIFICATION • The set of descriptions you(we) give to lex is called a lex specification.
  • 19. (optional) (required) Lex Source • Lex source is separated into three sections by %% delimiters • The general format of Lex source is • The absolute minimum Lex program is thus {definitions} %% {transition rules} %% {user subroutines} %%
  • 20. Format of the Input File • The flex /lex input file consists of three sections, separated by a line with just %% in it: definitions %% rules %% user code
  • 21. Note • where the definitions and the user subroutines are often omitted. • The second %% is optional, but the first is required to mark the beginning of the rules. • The absolute minimum Lex program is thus %% (no definitions, no rules) which translates into a program which copies the input to the output unchanged.
  • 22. Format of a Lexical Specification – 3 Parts • Declarations: – Literal block contains Defs, Constants, Types, #includes, etc. that can Occur in a C Program. – Regular Definitions (expressions),internal table declaration, start condition and translation. • Translation Rules: – Pairs of (Regular Expression, Action) – Informs Lexical Analyzer of Action when Pattern is Recognized • Auxiliary Procedures: – Designer Defined C Code – Can Replace System Calls Lex.y File Format: DECLARATIONS %% TRANSLATION RULES %% AUXILIARY PROCEDURES
  • 23. Skeleton of a lex specification (.l file) x.l %{ < C global variables, prototypes, comments > %} [DEFINITION SECTION] %% [RULES SECTION] %% < C auxiliary subroutines> lex.yy.c is generated after running > lex x.l LITERAL BLOCK This part will be embedded into lex.yy.c substitutions, internal table, character translation code and start states; will be copied into lex.yy.c define how to scan and what action to take for each token any user code. For example, a main function to call the scanning function yylex().
  • 24. Literal Block • Any initial C program code want to copied into the final program should be written in definition section. • Lex copies the contents between “%{“ and “%}” directly to the generated C file. • In the definitions and rules sections, any indented text or text enclosed in %{ and %} is copied verbatim to the generated C source file (i.eoutput) near beginning, before the beginning of yylex() (with the %{%}'s removed).
  • 25. Example %{ #include <stdio.h> #include "y.tab.h" int c; extern int yylval; /* * This simple demo: comments in side def section * Example */ %}
  • 26. Definitions Section(substitutions) • Definitions intended for Lex are given before the first %% delimiter. Any line in this section not contained between %{ and %}, and beginning in column 1, is assumed to define Lex substitution strings. • The definitions section contains declarations of simple name definitions to simplify the scanner specification. • Name definitions have the form: name definition or NAME expression • Example: DIGIT [0-9] ID [a-z][a-z0-9]*
  • 27. Continued.. • The format of such lines is name translation and it causes the string given as a translation to be associated with the name. • The name and translation must be separated by at least one blank or tab, and the name must begin with a letter. • The name can contain letters, digits & underscores, & must not start with a digit. • The translation can then be called out by the {name} syntax in a rule.
  • 28. Example Using {D} for the digits and {E} for an exponent field, for example, might abbreviate rules to recognize numbers: D [0-9] E [DEde][-+]?{D}+ %% {D}+ printf("integer"); {D}+"."{D}*({E})? | {D}*"."{D}+({E})? | {D}+{E}
  • 29. Internal Tables (%N Declarations) • Lex use internal tables of a fixed size which may not be big enough for large scanners, they allow the programmer to increase the size of the tables explicitly. • To increase the size of the tables with “%a”, “%e”, “%k”,“%n”, “%o”, and “%p” lines in the definition section. • The old lex accept “%r” to make lex generate a lexer in Ratfor and “%c” for a lexer in C. • Ex: %p 6000 %e 3000 To run lex with –v flag to know current statistics.
  • 30. Character Translations • A lexer uses native character code that the C complier uses. • Ex: The code for the character “A” is the C value “A”. • It is convenient to use some other character code, either because the i/p stream uses different code, EBCDIC or baudot or lex looks for patterns in an i/p stream not consisting of text at all. • Lex character translations allow to define an explicit mapping b/w bytes that are read by input() and characters used in lex patterns.
  • 31. Syntax: %T • Ex: %T 1 aA 2 bB 3 cC %T An i/p byte with value 1 will match anywhere there is an “A” or “a” in a pattern, so on. Note: if translation is used, every literal character used in lex program must appear on RHS of translation line.
  • 32. BEGIN • The BEGIN macro switches among start states. • It invokes, usually in action code for a pattern as: Syntax : BEGIN statename; The scanner starts in state 0(zero), also known as INITIAL. All other states must be named in %s or %x in the definition section. Even BEGIN is a marco, it doesn’t take any arguments itself, and statements need not be enclosed in parentheses.
  • 33. Start States • Start states also called start conditions or start rules in the definition section. • It used to apply a set rules only at certain times, and which makes to limit the scope of certain rules, or to change the way the lexer treats part of the file. Syntax: %s CMNT Or %x CMNT • In rule section, we added start state in angle brackets < > ( ex: <CMNT> ) • The rules that do not have start states can apply in any state. • The standard/default state in which lex starts is state ZERO, also known as INITIAL.
  • 34. Example: %x CMNT /* create new start state in the lexer. This is current start state */ %% “/*” BEGIN CMNT; /* switch to comment mode*/ <CMNT>. | /* these rules are recognized when lexer is in state CMNT */ <CMNT>n ; /* throw away comment state */ <CMNT>”*/” BEGIN INITIAL; /* once it matches the pattern it return back to regular state */ %%
  • 35. Difference B/W Regular And Exclusive Start States • A rule with no start state is not matched when an exclusive state is active. Example: %s NORMAL CMNT /* create new start state in the lexer. This is current start state */ %% %{ BEGIN NORMAL; /* start in Normal state */ %} <NORMAL>“/*” BEGIN CMNT; /* switch to comment mode*/ <CMNT>. | <CMNT>n ; /* throw away comment state */ <CMNT>”*/” BEGIN NORMAL; /* return to regular state */ %%
  • 36. Rules Section • Each rule is made of two parts: pattern and action, separated by whitespace. • The rules section of the lex input contains a series of rules of the form: PATTERN ACTION • Example: {ID} printf( "An identifier: %s\n", yytext ); • The matched text and its length are available in the yytext and yyleng variables.
  • 37. Example [\t ]+ ; /* ignore whitespace */ If the action is empty, the matched token is discarded. The pattern matches one or more copies of the subpattern (tab or space); the semicolon is an empty C statement, so its effect is to ignore the input.
  • 38. ACTION • If the action contains a ‘{‘, the action spans till the balancing ‘}‘ is found, as in C. • An action consisting only of a vertical bar ('|') means "same as the action for the next rule.“ • The return statement, as in C. • In case no rule matches: simply copy the input to the standard output (A default rule).
  • 39. Example1 : Single statements in Action Part %% {letter}({letter}|{digit})* printf(“id: %s\n”, yytext); \n printf(“new line\n”); %% • If a single statement is present in the action part then the { } braces are not needed. If the action contains more than one statement, or is more than one line long, write it within { } braces. Note: some lex versions take everything after the pattern, up to the end of the line, as the action, while others read only the first statement and ignore anything else, so braces are the safe choice.
  • 40. Example2 %% .|\n ECHO; /* prints the matched pattern on the o/p, copying any punctuation or other characters */ %%
  • 42. Multiple statements in Action Part %% " " ; [a-z] { c = yytext[0]; yylval = c - 'a'; return(LETTER); } [0-9]* { yylval = atoi(yytext); return(NUMBER); } [^a-z0-9\b] { c = yytext[0]; return(c); } %%
  • 43. User Code Section • It can consist of any legal C code. • The lex copies it to the C file after the end of the lex generated code. • The user code section is simply copied to lex.yy.c verbatim. • The presence of this section is optional; if it is missing, the second %% in the input file may be skipped.
  • 44. Example main() { yylex(); /* runs the generated lexer until it has processed the entire i/p */ }
  • 45. Comment Statements • Outside “%{“ and “%}”, comments must be indented with whitespace for lex to recognize them correctly. Example: %% [\t ]+ ; /* ignore whitespace */ %% main() { /* user code */ yylex(); /* run the lexer generated from the lex specification */ }
  • 46. Disambiguating rules • Lex has a set of disambiguating rules. The two that make a lexer work are: 1. Lex patterns match a given i/p character or string only once. 2. Lex executes the action for the longest possible string match for the current i/p. Ex: “island” is not matched by a rule for “is” (verb/not-verb program). Ex: “well-being” is counted as a single word (word-count program).
  • 47. Precedence Problem • For example: a “<“ in the input can be matched by both “<“ and “<=“. • The rule matching the most text has higher precedence. • If two or more rules match the same length of text, the rule listed first in the lex input has higher precedence.
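The longest-match rule can be sketched as a tiny lex fragment (a sketch, not a complete program; the printed token names are illustrative):

```lex
%%
"<="    printf("LE\n");   /* on input "<=" both patterns start matching here,  */
"<"     printf("LT\n");   /* but "<=" matches more text, so the LE rule wins   */
%%
```

On input `<=`, the first disambiguating rule (longest match) selects the `LE` action regardless of the order the two rules are listed in; on a lone `<`, only the second rule can match.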
  • 48. Lex program structure … definitions … %% … rules … %% … subroutines … %{ #include <stdio.h> #include "y.tab.h" int c; extern int yylval; %} %% " " ; [a-z] { c = yytext[0]; yylval = c - 'a'; return(LETTER); } [0-9]* { yylval = atoi(yytext); return(NUMBER); } [^a-z0-9\b] { c = yytext[0]; return(c); }
  • 49. Lex Source Program example1 • Lex source is a table of – regular expressions and – corresponding program fragments digit [0-9] letter [a-zA-Z] %% {letter}({letter}|{digit})* printf(“id: %s\n”, yytext); \n printf(“new line\n”); %% main() { yylex(); }
  • 50. A Simple Example2 %{ int num_lines = 0, num_chars = 0; %} %% \n { ++num_lines; ++num_chars; } . ++num_chars; %% main() { yylex(); printf( "# of lines = %d, # of chars = %d\n", num_lines, num_chars ); }
  • 51. Lex Source to C Program • The table is translated to a C program (lex.yy.c) which – reads an input stream – partitioning the input into strings which match the given expressions and – copying it to an output stream if necessary
  • 52. Steps to create & run/execute a lex program 1. Create a file: vi filename.l 2. Type the source code in vi/gedit and save it (press Esc, then type :wq, then Enter). 3. Run lex on the file (lex filename.l) to generate the C routine lex.yy.c. 4. Compile it (cc lex.yy.c -ll) and run the resulting object file ./a.out. 5. Or name the executable: cc lex.yy.c -o <execfilename> -ll.
  • 53. Ex: cc lex.yy.c -o first -ll (lex library). 1. Lex translates the lex specification into a C source file called lex.yy.c; this file is then compiled and linked with the lex library (-ll). 2. ./a.out 3. Type the i/p data, press the Enter key, then press Ctrl+D.
  • 54. Continued.. The execution is as follows: 1. If -o <execfilename> is not given, run the program using ./a.out. 2. If -o <execfilename> is given, run the program using ./execfilename. 3. Enter the data after execution. 4. Use ^D (end of file) to terminate the program and obtain the result.
  • 56. Lex Regular Expressions (Extended Regular Expressions) • A regular expression matches a set of strings. It contains text characters and operator characters. • A regular expression is a pattern description using a “meta” language. • Regular expression – Operators – Character classes – Arbitrary character – Optional expressions – Alternation and grouping – Context sensitivity – Repetitions and definitions
  • 57. Operators “ ” : quotation marks \ : backslash (escape character) [ ] : square brackets (character class) ^ : caret (negation) - : minus (range) ? : question mark . : period (dot) * : asterisk (star) + : plus | : vertical bar (alternation) ( ) : parentheses $ : dollar / : forward slash (trailing context) { } : curly braces % : percent < > : angle brackets
  • 58. Metacharacter Matches . Matches any character except newline \ Used to escape metacharacters * Matches zero or more copies of the preceding expression + Matches one or more copies of the preceding expression ? Matches zero or one copy of the preceding expression ^ Matches beginning of line as first character; complement (negation) inside [ ] $ Matches end of line as last character | Matches either the preceding or the following expression ( ) Groups a series of REs [ ] Matches any single character from the class { } Indicates how many times the previous pattern is allowed to match, or expands a definition “ ” Takes everything within it literally / Trailing context: the preceding RE matches only if followed by the RE after the slash; only one slash is permitted - Used to denote a range. Example: A-Z implies all characters from A to Z Pattern Matching Primitives
  • 59. Quotation Mark Operator “ ” • The quotation mark operator “…” indicates that whatever is contained between a pair of quotes is to be taken as text characters. Ex: xyz"++" or "xyz++" • If operator characters are to be used as text characters, an escape should be used Ex: xyz\+\+ = "xyz++" \$ = “$” • Every character except blank, tab (\t), newline (\n) and the operator characters listed above is always a text character. • Any blank character not contained within [ ] must be quoted.
  • 60. Character Classes [ ] • Classes of characters can be specified using the operator pair [ ]. • A class matches any single character that is in the [ ]. • Inside a class, every operator meaning is ignored except - and ^. • The - character indicates ranges. • If it is desired to include the character - in a character class, it should be placed first or last. • If the first character is a circumflex (“^”), the meaning changes to match any character except the ones within the brackets. • C escape sequences starting with “\” are recognized.
  • 61. examples [ab] => a or b [a-z] => a or b or c or … or z [-+0-9] => all the digits and the two signs [^a-zA-Z] => any character which is not a letter [a-z0-9<>_] all the lower case letters, the digits, the angle brackets, and underline. joke[rs] Matches either jokes or joker.
  • 62. Arbitrary Character . • To match almost any character, the operator character . is the class of all characters except newline • [\40-\176] matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde ~)
  • 63. Optional & Repeated Expressions • The operator ? indicates an optional element of an expression. Thus ab?c matches either ac or abc. • Repetitions of classes are indicated by the operators * and +. • a? => zero or one instance of a • a* => zero or more instances of a • a+ => one or more instances of a • E.g. ab?c => ac or abc [a-z]+ => all strings of lower case letters [a-zA-Z][a-zA-Z0-9]* => all alphanumeric strings with a leading alphabetic character
  • 64. Examples • an integer: 12345 [1-9][0-9]* • a word: cat [a-zA-Z]+ • a (possibly) signed integer: 12345 or -12345 [-+]?[1-9][0-9]* • a floating point number: 1.2345 [0-9]*"."[0-9]+
  • 65. Examples • a delimiter for an English sentence “.” | “?” | ! or [“.””?”!] • C++ comment: // call foo() here!! “//”.* • white space [ \t]+ • English sentence: Look at this! ([ \t]+|[a-zA-Z]+)+(“.”|”?”|!)
  • 66. Alternation and Grouping The operator | indicates alternation: (ab|cd) matches either ab or cd. Note that parentheses are used for grouping, although they are not necessary on the outside level; ab|cd would have sufficed. Parentheses can be used for more complex expressions: (ab|cd+)?(ef)* matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.
  • 67. Context Sensitivity • Lex will recognize a small amount of surrounding context. The two simplest operators for this are ^ and $. • If the first character of an expression is ^, the expression will only be matched at the beginning of a line (after a newline character, or at the beginning of the input stream). This can never conflict with the other meaning of ^, complementation of character classes, since that only applies within the [] operators.
  • 68. Continued.. • If the very last character is $, the expression will only be matched at the end of a line (when immediately followed by newline). • The latter operator is a special case of the / operator character, which indicates trailing context. • The expression ab/cd matches the string ab, but only if followed by cd. • Thus ab$ is the same as ab/\n
  • 69. Continued.. Left context is handled in Lex by start conditions. If a rule is only to be executed when the Lex automaton interpreter is in start condition x, the rule should be prefixed by <x> using the angle bracket operator characters. If we considered “being at the beginning of a line” to be start condition ONE, then the ^ operator would be equivalent to <ONE>. Start conditions were explained more fully earlier.
  • 70. Repetitions and Definitions The operators {} specify either repetitions (if they enclose numbers) or definition expansion (if they enclose a name). For example {digit} looks for a predefined string named digit and inserts it at that point in the expression. The definitions are given in the first part of the Lex input, before the rules. In contrast, a{1,5} looks for 1 to 5 occurrences of a.
  • 71. PLLab, NTHU, CS2403 Programming Languages Pattern Matching Primitives Metacharacter Example . a.b, we.78 \n [\t \n] * [\t \n]*, a*. + [\t \n]+, a+, a.+r ? -?[0-9]+, ab?c ^ [^\t \n], ^AD, ^(.*)\n $ end of line as last character a|b a or b (ab)+ one or more copies of ab (grouping) [ab] a or b a{3} exactly 3 instances of a “a+b” literal “a+b” (C escapes still work) A{1,2}shis+ Matches AAshis, Ashis, AAshiss, Ashiss (A[b-e])+ Matches one or more occurrences of A followed by any character from b to e.
  • 72. Finally, an initial % is special, being the separator for Lex source segments • [a-z]+ printf("%s", yytext); will print the string in yytext. The C function printf accepts a format argument and data to be printed; in this case, the format is “print string” (% indicating data conversion, and s indicating string type), and the data are the characters in yytext. So this just places the matched string on the output. This action is so common that it may be written as ECHO: • [a-z]+ ECHO;
  • 73. Lex – Pattern Matching Examples
  • 74. Regular Expression (1/3) x match the character 'x' . any character (byte) except newline [xyz] a "character class"; in this case, the pattern matches either an 'x', a 'y', or a 'z‘ [abj-oZ] a "character class" with a range in it; matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z‘ [^A-Z] a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter. [^A-Zn] any character EXCEPT an uppercase letter or a newline
  • 75. Regular Expression (2/3) r* zero or more r's, where r is any regular expression r+ one or more r's r? zero or one r's (that is, "an optional r") r{2,5} anywhere from two to five r's r{2,} two or more r's r{4} exactly 4 r's {name} the expansion of the "name" definition (see above) "[xyz]\"foo" the literal string: [xyz]"foo \X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \X; otherwise, a literal 'X' (used to escape operators such as '*')
  • 76. Regular Expression (3/3) \0 a NUL character (ASCII code 0) \123 the character with octal value 123 \x2a the character with hexadecimal value 2a (r) match an r; parentheses are used to override precedence (see below) rs the regular expression r followed by the regular expression s; called "concatenation" r|s either an r or an s ^r an r, but only at the beginning of a line (i.e., when just starting to scan, or right after a newline has been scanned) r$ an r, but only at the end of a line (i.e., just before a newline). Equivalent to "r/\n".
  • 77. Precedence of Operators • Level of precedence – Kleene closure (*), ?, + – concatenation – alternation (|) • All operators are left associative. • Ex: a*b|cd* = ((a*)b)|(c(d*))
  • 78. Lex Predefined Variables yytext • Whenever a scanner matches a token, the text of the token is stored in the null-terminated string yytext. • The contents of yytext are replaced each time a new token is matched. extern char yytext[]; // array extern char *yytext; // pointer To increase the size of the buffer (format in AT&T lex and MKS lex): %{ #undef YYLMAX /* remove default definition */ #define YYLMAX 500 /* new size */ %}
  • 79. Continued.. • If yytext is an array, any token longer than yytext will overflow the end of the array and cause the lexer to fail. • The default size of yytext[] differs among lex tools (e.g., 100 or 200 characters). • Flex has a default I/O buffer of 16K, which can handle tokens up to 8K.
  • 80. yyleng The length of the matched token is stored in yyleng; it is the same as strlen(yytext). • Example : [a-z]+ printf(“%s”, yytext); [a-z]+ ECHO; [a-zA-Z]+ {words++; chars += yyleng;}
  • 81. A Lex action may decide that a rule has not recognized the correct span of characters. Two routines are provided to aid with this situation. First, yymore () can be called to indicate that the next input expression recognized is to be tacked on to the end of this input. Normally, the next input string would overwrite the current entry in yytext. Second, yyless (n) may be called to indicate that not all the characters matched by the currently successful expression are wanted right now. The argument n indicates the number of characters in yytext to be retained. Further characters previously matched are returned to the input.
  • 82. Example \"[^"]* { if (yytext[yyleng-1] == '\\') yymore(); else ... normal user processing } which will, when faced with a string such as "abc\" def" first match the five characters "abc\ ; then the call to yymore() will cause the next part of the string, " def, to be tacked on the end. Note that the final quote terminating the string should be picked up in the code labeled “normal processing”.
  • 83. I/O routines Lex also permits access to the I/O routines it uses. They are: 1) input() which returns the next input character; 2) output(c) which writes the character c on the output; and 3) unput(c) pushes the character c back onto the input stream to be read later by input(). By default these routines are provided as macro definitions, but the user can override them and supply private versions. These routines define the relationship between external files and internal characters, and must all be retained or modified consistently.
  • 84. Ambiguous Source Rules Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows: 1) The longest match is preferred. 2) Among rules which matched the same number of characters, the rule given first is preferred. Thus, suppose the rules integer keyword action ...; [a-z]+ identifier action ...; to be given in that order. If the input is integers, it is taken as an identifier, because [a-z]+ matches 8 characters while integer matches only 7. If the input is integer, both rules match 7 characters, and the keyword rule is selected because it was given first. Anything shorter (e.g. int) will not match the expression integer, and so the identifier interpretation is used.
  • 85. yywrap() • yywrap is a built-in macro. When the lexer encounters the end of file, it calls the routine yywrap() to find out what to do next. • If yywrap() returns 0, the lexer (scanner) continues scanning: this indicates that there is more i/p, so the lexer has to continue working. • It first needs to adjust yyin to point to a new file, e.g. using fopen(). • If yywrap() returns 1, the lexer (scanner) halts scanning (i.e. the scanner returns a 0 token to report the EOF).
  • 86. The user can define his own yywrap. To do so, put this at the beginning of the rules section: %{ #undef yywrap %} Note: disable the default macro by undefining it before defining your own version.
  • 87. yylex() • The scanner created by lex has the entry point yylex(). • You call yylex() to start or resume scanning. • If a lex action does a return to pass a value to the calling program, the next call to yylex() will continue from the point where it left off. • All code in the rules section is copied into yylex(). • Lines of code immediately after the “%%” line are placed near the beginning of the scanner, before the first executable statement.
  • 88. Example %{ int counter = 0; %} letter [a-zA-Z] %% {letter}+ {printf(“a word\n”); counter++;} %% main() { yylex(); printf(“There are total %d words\n”, counter); }
  • 89. REJECT • The action REJECT means “go do the next alternative.” It causes whatever rule was second choice after the current rule to be executed. The position of the input pointer is adjusted accordingly. • Lex normally separates the i/p into non-overlapping tokens. • If tokens overlap and we need all occurrences of a token, the special action REJECT is used.
  • 90. When to use REJECT? In general, REJECT is useful whenever the purpose of Lex is not to partition the input stream but to detect all examples of some items in the input, and the instances of these items may overlap or include each other.
  • 91. Some Lex rules to do this might be she s++; he h++; \n | . ; where the last two rules ignore everything besides he and she. Remember that . does not include newline. Since she includes he, Lex will normally not recognize the instances of he included in she, since once it has passed a she those characters are gone.
  • 92. Examples she {s++; REJECT;} he {h++; REJECT;} n | . ; a[bc]+ { ... ; REJECT;} a[cd]+ { ... ; REJECT;} If the input is ab, only the first rule matches, and on ad only the second matches. The input string accb matches the first rule for four characters and then the second rule for three characters. In contrast, the input accd agrees with the second rule for four characters and then the first rule for three.
  • 93. Example … %% pink {npink++; REJECT;} ink {nink++; REJECT;} pin {npin++; REJECT;} . | \n ; %% … i/p data: pink — all three patterns will match. Without the REJECT statement only pink matches. When a REJECT action executes, it puts back the text matched by the pattern and finds the next best match for it. Note: REJECT is necessary to pick up a pattern beginning at every character, rather than at every other character.
  • 94. Lex Predefined Variables • yytext -- a string containing the lexeme • yyleng -- the length of the lexeme • yyin -- the input stream pointer – the default input of default main() is stdin • yyout -- the output stream pointer – the default output of default main() is stdout. • cs20: %./a.out < inputfile > outfile • E.g. [a-z]+ printf(“%s”, yytext); [a-z]+ ECHO; [a-zA-Z]+ {words++; chars += yyleng;}
  • 95. Lex Library Routines • yylex() – The default main() contains a call to yylex() • yymore() – append the next matched text to the current token • yyless(n) – retain the first n characters in yytext • yywrap() – is called whenever Lex reaches end-of-file – The default yywrap() always returns 1
  • 96. Review of Lex Predefined Variables Name Function char *yytext pointer to matched string int yyleng length of matched string FILE *yyin input stream pointer FILE *yyout output stream pointer int yylex(void) call to invoke lexer, returns token char* yymore(void) append next match to current token int yyless(int n) retain the first n characters in yytext int yywrap(void) wrapup, return 1 if done, 0 if not done ECHO write matched string REJECT go to the next alternative rule INITIAL initial start condition BEGIN condition switch start condition
  • 97. Revisiting Internal Variables in Lex • char *yytext; – Pointer to the current lexeme, terminated by ‘\0’ • int yyleng; – Number of characters in yytext, not counting ‘\0’ • yylval: – Global variable through which the token value can be returned to Yacc – The parser (Yacc) can access yylval, yyleng, and yytext • How are these used? – Consider Integer Tokens: – yylval = ascii_to_integer (yytext); – Conversion from String to actual Integer Value
  • 99. Symbol Tables • The table of words is a simple symbol table, a common structure in lex and yacc applications. • We use a symbol table to build a table of words as the lexer is running, so we can add new words without modifying and recompiling the lex program. • Ex: A C compiler, for example, stores the variable and structure names, labels, enumeration tags, and all other names used in the program in its symbol table. Each name is stored along with information describing the name. In a C compiler the information is the type of symbol, declaration scope, variable type, etc.
  • 100. Continued.. • add_word(), which puts a new word into the symbol table, and • lookup_word(), which looks up a word which should already be entered. • A variable state keeps track of whether we are looking up words (state LOOKUP) or declaring them.
  • 103. RECOMMENDED QUESTIONS: 1. Write the specification of lex with an example. (10) 2. What are regular expressions? Explain with examples. (8) 3. Write a lex program to count the number of words, lines, spaces and characters. (8) 4. Write a lex program to count the number of vowels and consonants. (8) 5. What is lexer-parser communication? Explain. (5) 6. Write a program to count the number of words by the method of substitution. (7)
  • 104. Yacc - Yet Another Compiler-Compiler
  • 105. The parser is that phase of the compiler which takes a token string as input and with the help of existing grammar, converts it into the corresponding Intermediate Representation. The parser is also known as Syntax Analyzer.
  • 106. Yacc Theory: ◦ Yacc reads the grammar and generate C code for a parser . ◦ Grammars written in Backus Naur Form (BNF) . ◦ BNF grammar used to express context-free languages . ◦ e.g. to parse an expression , do reverse operation( reducing the expression) ◦ This known as bottom-up or shift-reduce parsing . ◦ Using stack for storing (LIFO).
  • 107. What is YACC ? – Tool which will produce a parser for a given grammar. – YACC (Yet Another Compiler Compiler) is a program designed to compile a LALR(1) grammar and to produce the source code of the syntactic analyzer of the language produced by this grammar
  • 108. Parsing • The i/p is divided into tokens; a program then needs to establish the relationships among the tokens. • A C compiler needs to find the expressions, statements, declarations, blocks, and procedures in the program. • This task is known as parsing.
  • 109. Parser • Yacc takes a concise description of a grammar and produces a C routine that can parse that grammar, a parser. • The yacc parser automatically detects whenever a sequences of i/p tokens matches one of the rules in the grammar and also detects a syntax error whenever its i/p tokens doesn’t match any of the rules.
  • 110. Grammar • The list/set of rules that define the relationships that the program understands is a grammar, i.e. a series of rules that the parser uses to recognize syntactically valid i/p. For example, one grammar rule might be date : month_name day ',' year • Here, date, month_name, day, and year represent structures of interest in the input process; presumably, month_name, day, and year are defined elsewhere. The comma ',' is enclosed in single quotes; this implies that the comma is to appear literally in the input. The colon and semicolon merely serve as punctuation in the rule, and have no significance in controlling the input. Thus, with proper definitions, the input • July 4, 1776 • might be matched by the above rule.
  • 111. Example Ex: CFG expr : expr '+' term | term; term : term '*' factor | factor; factor : '(' expr ')' | ID | NUM; (a + b) *c
  • 113. Example 1. E -> E + E 2. E -> E * E 3. E -> id Three productions have been specified. Terms that appear on the left-hand side (lhs) of a production, such as E (expression) are nonterminals. Terms such as id (identifier) are terminals (tokens returned by lex) and only appear on the right-hand side (rhs) of a production. This grammar specifies that an expression may be the sum of two expressions, the product of two expressions, or an identifier.
  • 114. Example :x + y * z E -> E * E (r2) -> E * z (r3) -> E + E * z (r1) -> E + y * z (r3) -> x + y * z (r3)
  • 115. Symbols • A yacc grammar is constructed from symbols. • Symbols are strings of letter, digits, periods and underscores that do not start with a digit. • The symbol error is reserved for error recovery. • There are two types of symbols: 1. Terminal symbols or tokens 2. Non-terminal symbols or non-terminals
  • 116. Terminal symbols v/s Non-Terminal symbols • Terminal Symbols: Symbols that actually appear in the i/p and are returned by the lexer are called terminal symbols or tokens. • They are conventionally written in lower case. • They appear only on the RHS of the arrow or colon. • They cannot be derived further. • Ex: a, b & c • Non-Terminal Symbols: Symbols that appear on the LHS of some rule are called non-terminal symbols or non-terminals. • They are conventionally written in upper case. • They may appear on both sides of the arrow or colon, LHS & RHS. • They can be derived further. • Ex: E, T & F
  • 117. Continued. • The symbols to the left of the colon form the left-hand side (LHS) of the rule. • The symbols to the right of the colon form the right-hand side (RHS) of the rule. • Terminal and non-terminal symbols must be different; it is an error to write a rule with a token on the left side. • Every grammar includes a start symbol, the one that has to be at the root of the parse tree. • Ex: E is the start symbol.
  • 118. Example : ( a + b ) * c. E E + T | T T T * F | F F ( E ) |a| b | c Start State Vertical bar means two possibilities for the same symbol. Production Rules
  • 119. Recursive Rules • Rules can refer directly or indirectly to themselves; this important ability makes it possible to parse arbitrarily long i/p sequences by applying the rules repeatedly. • Ex: fred = 14 + 23 - 11 + 7 Rule : E : NUM | E '+' NUM | E '-' NUM ; Note: E refers to itself again and again.
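The recursive rule above can be written as a yacc fragment with actions that compute the value as the parse proceeds (a sketch; NUM is assumed to be a token whose numeric value the lexer has placed in yylval):

```yacc
expr : NUM            { $$ = $1; }        /* non-recursive base case */
     | expr '+' NUM   { $$ = $1 + $3; }   /* left-recursive alternatives */
     | expr '-' NUM   { $$ = $1 - $3; }
     ;
```

Because the rule is left-recursive, `14 + 23 - 11 + 7` is reduced left to right: `14 + 23` first, then `- 11`, then `+ 7`, which also gives subtraction the expected left-to-right grouping.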
  • 120. Left and Right Recursion • A recursive rule can put the recursive reference at the left end or the right end of the RHS of the rule. • Ex: exp : exp '+' e ; /* left recursion */ • Ex: exp : e '+' exp ; /* right recursion */
  • 123. Continued.. • Note: Any recursive rule must have at least one non-recursive alternative (one that does not refer to itself). Otherwise there is no way to terminate the string that it matches, which is an error. • Yacc handles left recursion more efficiently than right recursion. • This is because its internal stack keeps track of all symbols for all partially parsed rules, and right recursion forces the stack to grow with the length of the input.
  • 124. Yacc • Input to yacc is divided into three sections. • Every specification file consists of three sections: the declarations, (grammar) rules, and programs. The sections are separated by double percent “%%” marks. (The percent “%” is generally used in Yacc specifications as an escape character.) Format : ... definitions ... %% ... rules ... %% ... subroutines ...
  • 125. YACC File Format %{ C declarations %} yacc declarations %% Grammar rules %% Additional C code – Comments enclosed in /* ... */ may appear in any of the sections.
  • 126. Format of a Yacc Specification – 3 Parts • Definition Section: – Literal block containing defs, constants, types, #includes, etc. that can occur in a C program. – Regular definitions (expressions), declarations, start conditions. – These may be %union, %start, %token, %type, %left, %right, and %nonassoc declarations. – All of these are optional; the section may even be completely empty. • Translation Rules: – Pairs of (grammar rule, action). – Informs yacc of the action to take when a grammar rule is recognized. • Auxiliary Procedures: – Designer-defined C code. – Can replace system calls. yacc.y File Format: DECLARATIONS %% TRANSLATION RULES %% AUXILIARY PROCEDURES
  • 128. Definitions Section  The definitions section consists of: ◦ token declarations. ◦ Types of values used on the parser stack, and other odds and ends. ◦ A literal block: C code bracketed by “%{“ and “%}”. ◦ The declaration section may be empty. • There may be %union, %start, %token, %type, %left, %right, and %nonassoc declarations. (See "%union Declaration," "Start Declaration," "Tokens," "%type Declarations," and "Precedence and Operator Declarations.")
  • 129. Continued.. • It can also contain comments in the usual C format, surrounded by “/*” and “*/”. • All of these are optional, so in a very simple parser the definition section may be completely empty. • You can use single-quoted characters as tokens without declaring them, so we don't need to declare '=', '+', or '-'.
  • 130. Literal Block • Any initial C program code you want copied into the final program should be written in the definition section. • Yacc copies the contents between “%{“ and “%}” directly to the generated C file. • In the definitions and rules sections, any indented text or text enclosed in %{ and %} is copied verbatim to the generated C source file (i.e., the output) near the beginning, before the beginning of yyparse() (with the %{ %}'s removed).
  • 131. Example %{ #include <stdio.h> #include "y.tab.h" int c; extern int yylval; /* * This simple demo: comments inside the definition section * Example */ %}
  • 132. Token Declarations • Tokens may either be symbols defined by %token or individual characters in single quotes. • All symbols used as tokens must be defined explicitly in the definitions section, e.g.: Format : %token NAME1 NAME2 ... • Tokens can also be declared by %left, %right, or %nonassoc declarations, each of which has exactly the same syntax options as %token.
  • 133. Token Numbers • Within the lexer and parser, tokens are identified by small integers. • The token number of a literal token is the numeric value in the local character set, usually ASCII, and is the same as the C value of the quoted character. • %token NAME integervalue
  • 134. Continued.. With example – To specify tokens AAA and BBB: • %token AAA BBB • %token UP DOWN LEFT RIGHT – To assign a token number to a token (needed when using lex), a nonnegative integer is placed immediately after the first appearance of the token: • %token EOFnumber 0 • %token SEMInumber 101 • %token UP 50 DOWN 60 LEFT 17 RIGHT 25 – Non-terminals do not need to be declared unless you want to associate them with a type to store attributes.
  • 135. Token Values • Each symbol in a yacc parser can have an associated value. (See "Symbol Values.") • Since tokens can have values, you need to set the values as the lexer returns tokens to the parser. • The token value is always stored in the variable yylval. • Example : [0-9]+ { yylval = atoi (yytext ) ; return NUMBER; }
  • 136. Symbol Values • Every symbol in a yacc parser, both tokens and non- terminals, can have a value associated with it. • If the token were NUMBER, the value might be the particular number, if it were STRING, the value might be a pointer to a copy of the string, and if it were SYMBOL, the value might be a pointer to an entry in the symbol table that describes the symbol.
  • 137. Continued.. • Ex: C type, int or double for the number, char * for the string, and a pointer to a structure for the symbol. • Yacc makes it easy to assign types to symbols so that it automatically uses the correct type for each symbol.
  • 138. Declaring Symbol Types • Internally, yacc declares each value as a C union that includes all of the types. • You list all of the types in a %union declaration, q.v. • Yacc turns this into a typedef for a union type called YYSTYPE. • Then for each symbol whose value is set or used in action code, you have to declare its type.
  • 139. %type Declaration • You declare the types of non-terminals using %type. Each declaration has the form: %type <type> name,name,.. • The type name must have been defined by a %union. • Each name is the name of a non-terminal symbol. • Use %type to declare non-terminals
  • 140. %union Declaration • The %union declaration identifies all of the possible C types that a symbol value can have. The declaration takes this form: %union { . . Field declarations ... } • The field declarations are copied verbatim into a C union declaration of the type YYSTYPE in the output file. Yacc does not check to see if the contents of the %union are valid C. • In the absence of a %union declaration, yacc defines YYSTYPE to be int so all of the symbol values are integers.
  • 141. Start Declaration / Start Symbol • Normally the start rule, the one the parser starts trying to parse, is the one named in the first rule. • If you want to start with some other rule, in the declarations section you can write: Format: %start somename // start with rule somename • By default the start symbol is the first non-terminal specified in the grammar specification section. • To override it, use a %start declaration: %start non-terminal
  • 142. Example Definitions Section %{ #include <stdio.h> #include <stdlib.h> %} %token ID NUM /* terminals */ %start expr /* parsing starts at expr */
  • 143. Operator Declarations • Operator declarations appear in the definitions section. • The possible declarations are %left, %right, and %nonassoc. (In very old grammars you may find the obsolete equivalents %<, %>, and %2 or %binary.) • The %left and %right declarations make an operator left or right associative, respectively. • You declare non-associative operators with %nonassoc.
  • 144. Continued.. • Operators are declared in increasing order of precedence. • All operators declared on the same line are at the same precedence level. • Example: %left PLUS MINUS %left TIMES DIVIDE
  • 145. Example %union { int iValue; /* integer value */ char sIndex; /* symbol table index */ nodeType *nPtr; /* node pointer */ }; %token <iValue> INTEGER %token <sIndex> VARIABLE %token WHILE IF PRINT %nonassoc IFX %nonassoc ELSE %left GE LE EQ NE '>' '<' %left '+' '-' %left '*' '/' %nonassoc UMINUS %type <nPtr> stmt expr stmt_list
  • 146. YACC Declaration Summary `%start' Specify the grammar's start symbol `%union' Declare the collection of data types that semantic values may have `%token' Declare a terminal symbol (token type name) with no precedence or associativity specified `%type' Declare the type of semantic values for a nonterminal symbol
  • 147. YACC Declaration Summary `%right' Declare a terminal symbol (token type name) that is right-associative `%left' Declare a terminal symbol (token type name) that is left-associative `%nonassoc' Declare a terminal symbol (token type name) that is nonassociative (using it associatively is a syntax error, e.g. x op y op z)
  • 148. Yacc RULE SECTION • The rules section contains grammar rules and actions containing C code. • A yacc rule consists of two parts: a grammar and an action. ◦ the rules section consists of:  BNF grammar  ACTION
  • 149. RULES • A yacc grammar consists of a set of rules. • Each rule starts with a nonterminal symbol and a colon, and is followed by a possibly empty list of symbols, literal tokens, and actions. • Rules by convention end with a semicolon, although in most versions of yacc the semicolon is optional. • The rules section is made up of one or more grammar rules.
  • 150. A grammar rule has the form: A : BODY ; A represents a nonterminal name, and BODY represents a sequence of zero or more names and literals. The colon and the semicolon are Yacc punctuation. A nonterminal symbol that matches the empty string is written: empty : ;
  • 151. Continued.. • If there are several grammar rules with the same left hand side, the vertical bar ``|'' can be used to avoid rewriting the left hand side. In addition, the semicolon at the end of a rule can be dropped before a vertical bar. Thus the grammar rules A : B C D ; A : E F ; A : G ; • can be given to Yacc as A : B C D | E F | G ;
  • 152. Example Rules Section • This section defines grammar • Example expr : expr '+' term | term; term : term '*' factor | factor; factor : '(' expr ')' | ID | NUM;
  • 153. EXAMPLE %token NAME NUMBER %% statement: NAME '=' expression | expression ; expression: NUMBER '+' NUMBER | NUMBER '-' NUMBER ; Note: Unlike lex, yacc pays no attention to line boundaries in the rules section, and you will find that a lot of whitespace makes grammars easier to read. The symbol on the left-hand side of the first rule in the grammar is normally the start symbol.
  • 154. Actions • An action is C code executed when yacc matches a rule in the grammar. • The action must be a C compound statement, Example: date: month '/' day ' / ' year { printf("date found"); } ; The action can refer to the values associated with the symbols in the rule by using a dollar sign followed by a number, with the first symbol after the colon being number 1.
  • 155. CONTINUED.. EXAMPLE: date: month '/' day '/' year { printf("date %d-%d-%d found", $1, $3, $5); } ; The name "$$" refers to the value for the symbol to the left of the colon. • For rules with no action, yacc uses a default of: { $$ = $1; }
  • 156. THE RULE'S ACTION • Whenever the parser reduces a rule, it executes user C code associated with the rule, known as the rule's action. • The action appears in braces after the end of the rule, before the semicolon or vertical bar. • The action code can refer to the values of the right-hand side symbols as $1, $2, ..., and can set the value of the left-hand side by setting $$. • In our parser, the value of an expression symbol is the value of the expression it represents.
  • 157. Rule Reduction and Action stat: expr {printf("%d\n",$1);} | LETTER '=' expr {regs[$1] = $3;} ; expr: expr '+' expr {$$ = $1 + $3;} | LETTER {$$ = regs[$1];} ; Grammar rule Action "or" operator: for multiple RHS
  • 158. Rules Section • Normally written like this • Example: expr : expr '+' term | term ; term : term '*' factor | factor ; factor : '(' expr ')' | ID | NUM ;
  • 160-162. The Position of Rules expr : expr '+' term { $$ = $1 + $3; } | term { $$ = $1; } ; term : term '*' factor { $$ = $1 * $3; } | factor { $$ = $1; } ; factor : '(' expr ')' { $$ = $2; } | ID | NUM ; In expr '+' term the positions are $1 = expr, $2 = '+', $3 = term. Default: $$ = $1;
  • 164. SUBROUTINES SECTION the subroutines section consists of: ◦ user subroutines .
  • 165. User Code Section • It can consist of any legal C code. • Yacc copies it to the C file after the end of the yacc-generated code; when using a lex-generated lexer, this is where main() and yyerror() are typically supplied. • The user code section is simply copied to y.tab.c verbatim. • The presence of this section is optional; if it is missing, the second %% in the input file may be skipped.
  • 167. Example void yyerror(char *s) { fprintf(stderr, "%s\n", s); } int main(void) { yyparse(); return 0; }
  • 169. Example yyerror(const char *str) { printf("yyerror: %s at line %d\n", str, yyline); } main() { if (!yyparse()) { printf("accept\n"); } else printf("reject\n"); }
  • 170. Yacc Program Structure %{ #include <stdio.h> int regs[26]; int base; %} %token NUMBER LETTER %left '+' '-' %left '*' '/' %% list : /* empty */ | list stat '\n' | list error '\n' { yyerrok; } ; stat : expr { printf("%d\n", $1); } | LETTER '=' expr { regs[$1] = $3; } ; expr : '(' expr ')' { $$ = $2; } | expr '+' expr { $$ = $1 + $3; } | LETTER { $$ = regs[$1]; } ; %% main() { return yyparse(); } yyerror(char *s) { fprintf(stderr, "%s\n", s); } yywrap() { return 1; } … definitions … %% … rules … %% … subroutines …
  • 171. A YACC File Example %{ #include <stdio.h> %} %token NAME NUMBER %% statement: NAME '=' expression | expression { printf("= %d\n", $1); } ; expression: expression '+' NUMBER { $$ = $1 + $3; } | expression '-' NUMBER { $$ = $1 - $3; } | NUMBER { $$ = $1; } ; %% int yyerror(char *s) { fprintf(stderr, "%s\n", s); return 0; } int main(void) { yyparse(); return 0; }
  • 172. A YACC PARSER • A literal consists of a character enclosed in single quotes. As in C, the backslash \ is an escape character within literals, and all the C escapes are recognized. Thus • '\n' newline • '\r' return • '\'' single quote • '\\' backslash • '\t' tab • '\b' backspace • '\f' form feed • '\xxx' the character with octal code xxx • For a number of technical reasons, the NUL character ('\0' or 0) should never be used in grammar rules.
  • 173. How YACC Works a.out File containing desired grammar in yacc format yacc program C source program created by yacc C compiler Executable program that will parse grammar given in gram.y gram.y yacc y.tab.c cc or gcc
  • 174. yacc How YACC Works (1) Parser generation time YACC source (*.y) y.tab.h y.tab.c C compiler/linker (2) Compile time y.tab.c a.out a.out (3) Run time Token stream Abstract Syntax Tree y.output
  • 175. Creating, Compiling and Running a Simple Parser • Yacc environment – Yacc processes a yacc specification file and produces a y.tab.c file. – An integer function yyparse() is produced by Yacc. • Calls yylex() to get tokens. • Returns non-zero when an error is found. • Returns 0 if the program is accepted. – Need main() and yyerror() functions.
  • 176. Steps to create & run/execute a yacc program 1. Create a file: vi filename.y 2. Type the source code in vi/gedit and save it (press esc, then :wq, then enter). 3. Compile the yacc file (filename.y) to generate the C routines y.tab.c and y.tab.h: yacc -d filename.y (the -d flag creates the token definitions in y.tab.h) 4. Compile with cc y.tab.c -ly to generate the output file ./a.out 5. % cc y.tab.c -o <execfilename>
  • 177. Parser-Lexer Communication • To try out our parser, we need a lexer to feed it tokens. • When you use a lex scanner and a yacc parser together, the parser is the higher level routine. • It calls the lexer yylex() whenever it needs a token from the input. • The lexer then scans through the input recognizing tokens. • As soon as it finds a token of interest to the parser, it returns to the parser, returning the token's code as the value of yylex().
  • 178. Continued.. • Not all tokens are of interest to the parser-in most programming languages the parser doesn't want to hear about comments and whitespace. • Yacc defines the token names in the parser as C preprocessor names in y.tab.h
  • 179. Works with Lex YACC yyparse() Input programs 12 + 26 LEX yylex() How to work ?
  • 180. Works with Lex YACC yyparse() Input programs 12 + 26 LEX yylex() call yylex() [0-9]+ next token is NUM NUM ‘+’ NUM
  • 181. Communication between LEX and YACC YACC yyparse() Input programs 12 + 26 LEX yylex() call yylex() [0-9]+ next token is NUM NUM ‘+’ NUM LEX and YACCtoken
  • 182. Communication between LEX and YACC yacc -d gram.y Will produce: y.tab.h • Use enumeration / define •Include •YACC y.tab.h • LEX include y.tab.h
  • 183. Communication between LEX and YACC %{ #include <stdio.h> #include "y.tab.h" %} id [_a-zA-Z][_a-zA-Z0-9]* %% int { return INT; } char { return CHAR; } float { return FLOAT; } {id} { return ID;} %{ #include <stdio.h> #include <stdlib.h> %} %token CHAR, FLOAT, ID, INT %% yacc -d xxx.y Produced y.tab.h: # define CHAR 258 # define FLOAT 259 # define ID 260 # define INT 261 parser.y scanner.l
  • 184.
  • 185. Yacc Example • Taken from Lex & Yacc • Simple calculator a = 4 + 6 a a=10 b = 7 c = a + b c c = 17 $
  • 186. Creating, Compiling and Running a Simple lex & yacc Parser • Lex part: • % vi ch1-n.l • % lex ch1-n.l • Yacc part: • % vi ch1-n.y • % yacc -d ch1-n.y • Compile both lex & yacc output: • % cc -c lex.yy.c y.tab.c • % cc -o example-n lex.yy.o y.tab.o -ll
  • 187. Example % yacc -d ch3-01.y # makes y.tab.c and y.tab.h % lex ch3-01.l # makes lex.yy.c % cc -o ch3-01 y.tab.c lex.yy.c -ly -ll # compile and link C files % ch3-01 99+12 = 111 % ch3-01
  • 188. Example of lex and yacc program
  • 189. Grammar expression ::= expression '+' term | expression '-' term | term term ::= term '*' factor | term '/' factor | factor factor ::= '(' expression ')' | '-' factor | NUMBER | NAME
  • 190. Parser (cont’d) statement_list: statement '\n' | statement_list statement '\n' ; statement: NAME '=' expression { $1->value = $3; } | expression { printf("= %g\n", $1); } ; expression: expression '+' term { $$ = $1 + $3; } | expression '-' term { $$ = $1 - $3; } | term ; parser.y
  • 191. Parser (cont’d) term: term '*' factor { $$ = $1 * $3; } | term '/' factor { if ($3 == 0.0) yyerror("divide by zero"); else $$ = $1 / $3; } | factor ; factor: '(' expression ')' { $$ = $2; } | '-' factor { $$ = -$2; } | NUMBER { $$ = $1; } | NAME { $$ = $1->value; } ; %% parser.y
  • 192. Scanner %{ #include "y.tab.h" #include "parser.h" #include <math.h> %} %% ([0-9]+|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?) { yylval.dval = atof(yytext); return NUMBER; } [ \t] ; /* ignore white space */ scanner.l
  • 193. Scanner (cont’d) [A-Za-z][A-Za-z0-9]* { /* return symbol pointer */ yylval.symp = symlook(yytext); return NAME; } "$" { return 0; /* end of input */ } \n|"="|"+"|"-"|"*"|"/" return yytext[0]; %% scanner.l
  • 194. YACC • Rules may be recursive • Rules may be ambiguous • Rules may have conflicts • Uses bottom-up shift/reduce parsing – Get a token – Push it onto the stack – Can it be reduced? (How do we know?) • If yes: reduce using a rule • If no: get another token • Yacc cannot look ahead more than one token phrase -> cart_animal AND CART | work_animal AND PLOW …
  • 195. Define ambiguity and conflicts. How do they arise? • Ambiguous means there are multiple possible parses (outputs) for the same input. • Conflicts mean that yacc can't properly parse a grammar, probably because it's ambiguous. • They arise when the precedence and associativity of operators are not specified. • Conflicts may arise because of mistakes in input or logic, or because the grammar rules, while consistent, require a more complex parser than Yacc can construct.
  • 196. Example: "2+3*4" can be parsed as either • (2 + 3) * 4 or 2 + (3 * 4)
  • 197. What Yacc Cannot Parse: Ambiguity and Conflicts • In some cases the grammar is truly ambiguous, that is, there are two possible parses for a single input string, and yacc cannot handle that. • In others, the grammar is unambiguous, but the parsing technique that yacc uses is not powerful enough to parse the grammar. • The problem in an unambiguous grammar with conflicts is that the parser would need to look more than one token ahead to decide which of two possible parses to use. • Yacc takes a default action when there is a conflict.
  • 198. Example: input is HORSE AND CART phrase : cart_animal AND CART | work_animal AND PLOW cart_animal : HORSE | GOAT work_animal : HORSE | OX • yacc can't handle this because it requires two symbols of lookahead; it cannot look ahead more than one token. We could fix it by changing the first rule to: phrase : cart_animal CART | work_animal PLOW
  • 199. Continued.. • The given grammar should not be ambiguous or have conflicts. • Yacc may fail to translate a grammar specification because the grammar is ambiguous or contains conflicts. Example of ambiguity: A + B * C can be grouped two ways: 1. (A + B) * C 2. A + (B * C) Example of a conflict — an 'X' could be either proga or progb: %% prog: proga | progb ; proga : 'X' ; progb : 'X' ;
  • 200. Types of conflicts Shift/Reduce Conflicts (bottom-up) • A shift/reduce conflict occurs when there are two possible parses for an input string, and one of the parses completes a rule (the reduce option) and one doesn't (the shift option). • Ex: E : 'X' | E '+' E ; • For the input string X + X + X there are two possible parses, "(X+X)+X" or "X+(X+X)" Reduce/Reduce Conflicts • A reduce/reduce conflict occurs when the same token could complete two different rules. • Example: S : E | T ; E : id ; T : id ; An "id" could complete either an E or a T.
  • 201. Disambiguating Rule • A rule describing which choice to make in a given situation is called a disambiguating rule. • Yacc invokes two disambiguating rules by default: 1. In a shift/reduce conflict, the default is to do the shift. 2. In a reduce/reduce conflict, the default is to reduce by the earlier grammar rule (in the input sequence).
  • 202. Arithmetic Expressions • To make the arithmetic expressions more general and realistic, we extend the expression rules to handle multiplication and division, unary negation, and parenthesized expressions: expression: expression '+' expression { $$ = $1 + $3; } | expression '-' expression { $$ = $1 - $3; } | expression '*' expression { $$ = $1 * $3; } | expression '/' expression { if ($3 == 0) yyerror("divide by zero"); else $$ = $1 / $3; } | '-' expression { $$ = -$2; } | '(' expression ')' { $$ = $2; } | NUMBER { $$ = $1; } ;
  • 203. Precedence, Associativity, and Operator Declarations • All yacc grammars have to be unambiguous. • Unambiguous means that there is only one possible way to parse any legal input using the rules in the grammar. • Ambiguous grammars cause conflicts, situations where there are two possible parses and hence two different ways that yacc can process a token. • When yacc processes an ambiguous grammar, it uses default rules to decide which way to parse an ambiguous sequence. • Often these rules do not produce the desired result, so yacc includes operator declarations that let you change the way it handles shift/reduce conflicts that result from ambiguous grammars.
  • 204. Precedence and Associativity • The rules for determining which operands group with which operators are known as precedence and associativity. Ex: a = b = c + d / e / f a = (b = (c + ((d / e) / f)))
  • 205. Precedence • Precedence controls which operators to execute first in an expression. • Precedence assigns each operator a precedence "level." • In any expression grammar, operators are grouped into levels of precedence from lowest to highest. • Operators at higher levels bind more tightly, e.g., if "*" has higher precedence than "+", "A+B*C" is treated as "A+(B*C)", while "D*E+F" is "(D*E)+F".
  • 206. Associativity • Associativity controls the grouping of operators at the same precedence level. • It controls how the grammar groups expressions using the same operator or different operators with the same precedence: whether they group from the left, from the right, or not at all. • If "-" were left associative, the expression "A-B-C" would mean "(A-B)-C", while if it were right associative it would mean "A-(B-C)".
  • 207. How to specify Precedence and Associativity ? • There are two ways to specify precedence and associativity in a grammar, implicitly and explicitly. • To specify them implicitly, rewrite the grammar using separate non-terminal symbols for each precedence level.
  • 208. Declarations IMPLICITLY Ex: expression: expression '+' mulexp | expression '-' mulexp | mulexp ; mulexp: mulexp '*' primary | mulexp '/' primary | primary ; primary: '(' expression ')' | '-' primary | NUMBER ; EXPLICITLY
  • 209. Precedence / Association 1. 1-2-3 = (1-2)-3? or 1-(2-3)? Define the '-' operator as left-associative. 2. 1-2*3 = 1-(2*3) Define the "*" operator to have higher precedence than the "-" operator expr: expr '-' expr | expr '*' expr | expr '<' expr | '(' expr ')' ... ; (1) 1 - 2 - 3 (2) 1 - 2 * 3
  • 210. Precedence / Association %right ‘=‘ %left '<' '>' NE LE GE %left '+' '-‘ %left '*' '/' highest precedence
  • 211. Precedence / Association expr : expr '+' expr { $$ = $1 + $3; } | expr '-' expr { $$ = $1 - $3; } | expr '*' expr { $$ = $1 * $3; } | expr '/' expr { if ($3 == 0) yyerror("divide 0"); else $$ = $1 / $3; } | '-' expr %prec UMINUS { $$ = -$2; }
  • 212. Shift/Reduce Conflicts • shift/reduce conflict – occurs when a grammar is written in such a way that a decision between shifting and reducing can not be made. – ex: IF-ELSE ambiguous. • To resolve this conflict, yacc will choose to shift.
  • 213. Shift/Reduce Parsing • When yacc processes a parser, it creates a set of states each of which reflects a possible position in one or more partially parsed rules. Shift parsing • As the parser reads tokens, each time it reads a token that doesn't complete a rule it pushes the token on an internal stack and switches to a new state reflecting the token it just read. • This action is called a shift.
  • 214. Reduce parsing • When it has found all the symbols that constitute the right-hand side of a rule, it pops the right-hand side symbols off the stack, pushes the left-hand side symbol onto the stack, and switches to a new state reflecting the new symbol on the stack. This action is called a reduction, since it usually reduces the number of items on the stack. • (Not always, since it is possible to have rules with empty right-hand sides.) Whenever yacc reduces a rule, it executes user code associated with the rule.
  • 215. Example "fred = 12 + 13" The parser starts by shifting tokens on to the internal stack one at a time: Shift :(PUSH) RULES: fred fred = fred = 12 fred = 12 + fred = 12 + 13 Now reduce the rule "expression->NUMBER + NUMBER" so it pops the 12, the plus, and the 13 from the stack and replaces them with expression
  • 216. Reduce (POPS): fred = expression statement Now it reduces the rule "statement -> NAME = expression", so it pops fred, =, and expression and replaces them with statement. At the end of the input the stack has been reduced to the start symbol, so the input was valid according to the grammar.
  • 217. yacc & lex Together • The grammar: program -> program expr | ε expr -> expr + expr | expr - expr | id • program and expr are nonterminals. • id is a terminal (a token returned by lex). • An expression may be: – the sum of two expressions – the difference of two expressions – or an identifier
  • 218. When Not to Use Precedence Rules • You can use precedence rules to fix any shift/reduce conflict that occurs in the grammar. • Use precedence in only two situations: in expression grammars, and to resolve the "dangling else" conflict in grammars for if-then-else language constructs. • Otherwise, if you can, you should fix the grammar to remove the conflict.
  • 219. How the Parser Works • Yacc turns the specification file into a C program, which parses the input according to the specification given. • The parser produced by Yacc consists of a finite state machine with a stack. • The parser is also capable of reading and remembering the next input token (called the lookahead token).
  • 220. Continued .. • The current state is always the one on the top of the stack. • The states of the finite state machine are given small integer labels; initially, the machine is in state 0, the stack contains only state 0, and no lookahead token has been read.
  • 221. The machine has only four actions available to it, called shift, reduce, accept, and error. • A move of the parser is done as follows: 1. Based on its current state, the parser decides whether it needs a lookahead token to decide what action should be done; if it needs one, and does not have one, it calls yylex to obtain the next token. 2. Using the current state, and the lookahead token if needed, the parser decides on its next action, and carries it out. This may result in states being pushed onto the stack, or popped off the stack, and in the lookahead token being processed or left alone.
  • 222. Variables and Typed Tokens %{ double vbltable[26]; %} %union { double dval; int vblno; } %type <dval> expression
  • 223. Symbol Values and %union 1.Why not have the lexer return the value of the variable as a double, to make the parser simpler? The problem is that there are two contexts where a variable name can occur: as part of an expression, in which case we want the double value, and to the left of an equal sign, in which case we need to remember which variable it is so we can update vbltable. To define the possible symbol types, in the definition section we add a %union declaration: %union { double dval; int vblno; }
  • 224. The y.tab.h generated from this grammar: #define NAME 257 #define NUMBER 258 #define UMINUS 259 typedef union { double dval; int vblno; } YYSTYPE; extern YYSTYPE yylval; We have to tell the parser which symbols use which type of value: %token <vblno> NAME %token <dval> NUMBER %type <dval> expression
  • 225. Yacc Library • You can include the library by giving the -ly flag at the end of the cc command line on UNIX systems, or the equivalent on other systems. • main() • yyerror() • yyparse()
  • 226. yyerror() • Whenever a yacc parser detects a syntax error, it calls yyerror() to report the error to the user, passing it a single argument, a string describing the error. (Usually the only error you ever get is "syntax error.") • The default version of yyerror in the yacc library merely prints its argument on the standard output. Syntax: yyerror() { printf("invalid"); exit(0); }
  • 227. yyparse() • The entry point to a yacc-generated parser is yyparse(). • When your program calls yyparse(), the parser attempts to parse an input stream. • The parser returns a value of zero if the parse succeeds and non-zero if not. • Every time you call yyparse() the parser starts parsing anew, forgetting whatever state it might have been in the last time it returned. Syntax: main() { yyparse(); }
  • 231. Lex v/s Yacc • Lex – Lex generates C code for a lexical analyzer, or scanner – Lex uses patterns that match strings in the input and converts the strings to tokens • Yacc – Yacc generates C code for syntax analyzer, or parser. – Yacc uses grammar rules that allow it to analyze tokens from Lex and create a syntax tree.
  • 232. Lex with Yacc Lex Yacc yylex() yyparse() Lex source (Lexical Rules) Yacc source (Grammar Rules) Input Parsed Input lex.yy.c y.tab.c return token call
  • 233. RECOMMENDED QUESTIONS: 1. Give the specification of a yacc program, with an example. (8) 2. What is a grammar? How does yacc parse a tree? (5) 3. How do you compile a yacc file? (5) 4. Explain the ambiguity occurring in a grammar with an example. (6) 5. Explain shift/reduce and reduce/reduce conflicts. (8) 6. Write a yacc program to test the validity of an arithmetic expression. (8) 7. Write a yacc program to accept strings of the form a^n b^n, n > 0. (8)

Editor's Notes

  1. LR parsers are also known as LR(k) parsers, where L stands for left-to-right scanning of the input stream, R stands for the construction of a right-most derivation in reverse, and k denotes the number of lookahead symbols used to make decisions. An LL parser is denoted LL(k): the first L means parsing the input from left to right, the second L stands for left-most derivation, and k is the number of lookaheads. LALR stands for Look-Ahead LR parser.