Module 4: Lex and Yacc
1. Introduction to Lex and Yacc
Prepared by,
Prof. Aruna M.G
Computer Science & Engineering Department
M.S.E.C
Bangalore
2. Lex and Yacc
• Two Compiler-Writing Tools that are Used to Easily
Specify:
– Lexical Tokens and their Order of Processing (Lex)
– A Context-Free Grammar for LALR(1) Parsing (Yacc)
• Both Lex and Yacc have a Long History in Computing
– Lex and Yacc – Earliest Days of Unix Minicomputers
– Flex and Bison – From GNU
– JFlex – Fast Scanner Generator for Java
– BYacc/J – Berkeley
– CUP, ANTLR, PCYACC, …
– PCLEX and PCYACC from Abacus
7. flex - fast lexical analyzer generator
• Flex is a tool for generating scanners.
• Flex source is a table of regular expressions
and corresponding program fragments.
• Generates lex.yy.c which defines a routine
yylex()
8. Lex
Written by Eric Schmidt and Mike Lesk.
lex is a program (generator) that generates lexical analyzers (widely
used on Unix).
It is mostly used with the Yacc parser generator.
It reads an input specification (describing the lexical analyzer) and outputs
source code implementing the lexical analyzer in the C programming
language.
Lex reads patterns (regular expressions) and produces C code
for a lexical analyzer that scans for those patterns.
9. What is Lex?
• The main job of a lexical analyzer (scanner) is
to break up an input stream into more usable
elements (tokens)
a = b + c * d;
ID ASSIGN ID PLUS ID MULT ID SEMI
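As a sketch of what the generated scanner does for this statement, here is a small hand-written C classifier. This is illustration only — Lex would generate equivalent code from patterns, and `next_token()` is our own hypothetical helper (identifiers are single letters, as in the example):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return the token name for the next lexeme in *s and advance
   past it; returns NULL at end of input. */
static const char *next_token(const char **s) {
    while (**s == ' ') (*s)++;        /* skip blanks between tokens */
    char c = **s;
    if (c == '\0') return NULL;       /* end of input */
    (*s)++;
    if (isalpha((unsigned char)c)) return "ID";
    switch (c) {
    case '=': return "ASSIGN";
    case '+': return "PLUS";
    case '*': return "MULT";
    case ';': return "SEMI";
    default:  return "UNKNOWN";
    }
}
```

Calling next_token() repeatedly on "a = b + c * d;" yields ID ASSIGN ID PLUS ID MULT ID SEMI, the token stream above.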
10. Lex – Lexical Analyzer
• Lex takes a set of descriptions of possible tokens and
produces a C routine called a lexical analyzer,
scanner, or lexer.
• Lexical analyzers tokenize input streams
• Tokens are the terminals of a language
– English
• words, punctuation marks, …
– Programming language
• Identifiers, operators, keywords, …
• Regular expressions define terminals/tokens
11. LEXER
• Lexical analysis is the process of converting a
sequence of characters into a sequence of tokens.
• A program or function which performs lexical
analysis is called a lexical analyzer, lexer or scanner.
• A lexer often exists as a single function which is
called by a parser or another function
12. Token
• A token is a string of characters, categorized
according to the rules as a symbol (e.g.
IDENTIFIER, NUMBER, COMMA, etc.).
13. Tokens
• Tokens in Lex are declared like variable names
in C. Every token has an associated expression.
Token      Associated expression                 Meaning
number     ([0-9])+                              1 or more occurrences of a digit
chars      [A-Za-z]                              any letter
blank      " "                                   a blank space
word       (chars)+                              1 or more occurrences of chars
variable   (chars)+(number)*(chars)*(number)*    letters optionally followed by digits and more letters
14. Consider this expression in the C programming language:
sum=3+2;
Tokenized in the following table:
lexeme token type
sum Identifier
= Assignment operator
3 Number
+ Addition operator
2 Number
; End of statement
15. Lex – A Lexical Analyzer Generator
• A Unix Utility from early 1970s
• A Compiler that Takes as Source a Specification for:
– Tokens/Patterns of a Language
– Generates a “C” Lexical Analyzer Program
• Pictorially: Creating a Lexical Analyzer with Lex

Lex Source Program (lex.l) --[Lex Compiler]--> lex.yy.c
lex.yy.c --[C Compiler]--> a.out
Input stream --[a.out]--> Sequence of tokens
16. Step for executing lex program
• First, a specification of a lexical analyzer is prepared
by writing a program lex.l (filename.l) in the Lex
language.
• lex.l is run through the Lex compiler to produce a C
program lex.yy.c.
• The program lex.yy.c consists of a tabular
representation of a transition diagram constructed
from the regular expressions of lex.l, together with
a standard routine that uses the table to recognize
lexemes.
17. Continued..
• The lexical analysis phase reads the characters in the
source program and groups them into a stream of
tokens in which each token represents a logically
cohesive sequence of characters, such as an identifier, a
keyword (if, while, etc.), a punctuation character, or a
multi-character operator like :=.
• The character sequence forming a token is called the
lexeme for the token.
• The actions associated with regular expressions in lex.l
are pieces of C code and are carried over directly to
lex.yy.c.
• Finally, lex.yy.c is run through the C compiler to
produce an object program a.out.
18. LEX SPECIFICATION
• The set of descriptions you(we) give to lex is
called a lex specification.
19. Lex Source
• Lex source is separated into three sections by %%
delimiters
• The general format of Lex source is:
{definitions}         (optional)
%%                    (required)
{transition rules}
%%                    (optional)
{user subroutines}
20. Format of the Input File
• The flex /lex input file consists of three
sections, separated by a line with just %% in it:
definitions
%%
rules
%%
user code
21. Note
• The definitions and the user subroutines are
often omitted.
• The second %% is optional, but the first is required
to mark the beginning of the rules.
• The absolute minimum Lex program is thus
%%
(no definitions, no rules) which translates into a
program which copies the input to the output
unchanged.
22. Format of a Lexical Specification – 3 Parts
• Declarations:
– Literal block contains Defs, Constants, Types, #includes,
etc. that can Occur in a C Program.
– Regular Definitions (expressions), internal table
declarations, start conditions, and translations.
• Translation Rules:
– Pairs of (Regular Expression, Action)
– Informs Lexical Analyzer of Action when Pattern is
Recognized
• Auxiliary Procedures:
– Designer Defined C Code
– Can Replace System Calls
Lex .l File Format:
DECLARATIONS
%%
TRANSLATION RULES
%%
AUXILIARY PROCEDURES
23. Skeleton of a lex specification (.l file)
x.l:
%{
< C global variables, prototypes, comments >
%}
[DEFINITION SECTION]
%%
[RULES SECTION]
%%
< C auxiliary subroutines >

• Literal block (%{ … %}): this part will be embedded into lex.yy.c
• Definition section: substitutions, internal tables, character
translation code and start states; will be copied into lex.yy.c
• Rules section: defines how to scan and what action to
take for each token
• Auxiliary subroutines: any user code, for example a main
function to call the scanning function yylex()
lex.yy.c is generated after running:
> lex x.l
24. Literal Block
• Any initial C code you want copied into the final
program should be written in the definitions
section.
• Lex copies the contents between "%{" and
"%}" directly to the generated C file.
• In the definitions and rules sections, any
indented text or text enclosed in %{ and %} is
copied verbatim to the generated C source file
(i.e. the output) near the beginning, before the
start of yylex() (with the %{ %}'s
removed).
26. Definitions Section(substitutions)
• Definitions intended for Lex are given before the first
%% delimiter. Any line in this section not contained
between %{ and %}, and beginning in column 1, is
assumed to define Lex substitution strings.
• The definitions section contains declarations of simple
name definitions to simplify the scanner specification.
• Name definitions have the form:
name definition or NAME expression
• Example:
DIGIT [0-9]
ID [a-z][a-z0-9]*
27. Continued..
• The format of such lines is name translation and
it causes the string given as a translation to be
associated with the name.
• The name and translation must be separated by
at least one blank or tab, and the name must
begin with a letter.
• The name can contain letters, digits &
underscores, & must not start with a digit.
• The translation can then be called out by the
{name} syntax in a rule.
28. Example
Using {D} for the digits and {E} for an
exponent field, for example, might abbreviate
rules to recognize numbers:
D [0-9]
E [DEde][-+]?{D}+
%%
{D}+ printf("integer");
{D}+"."{D}*({E})? |
{D}*"."{D}+({E})? |
{D}+{E} printf("real");
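Outside of Lex, the same number patterns can be exercised with POSIX `<regex.h>` extended REs — a rough analogue for experimentation, not what Lex generates. `matches()` and `classify_number()` are our own illustrative helpers, with {D} and {E} expanded by hand:

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Full-string match of an extended RE against text. */
static int matches(const char *pattern, const char *text) {
    char anchored[128];
    regex_t re;
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern); /* anchor both ends */
    if (regcomp(&re, anchored, REG_EXTENDED) != 0) return 0;
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

/* {D} = [0-9] and {E} = [DEde][-+]?[0-9]+ expanded by hand. */
static const char *classify_number(const char *text) {
    if (matches("[0-9]+", text)) return "integer";
    if (matches("[0-9]+\\.[0-9]*([DEde][-+]?[0-9]+)?", text) ||
        matches("[0-9]*\\.[0-9]+([DEde][-+]?[0-9]+)?", text) ||
        matches("[0-9]+[DEde][-+]?[0-9]+", text))
        return "real";
    return "unknown";
}
```

Note that Lex would pick the longest match automatically; here the anchored full-string match plays the same role.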
29. Internal Tables (%N Declarations)
• Lex uses internal tables of a fixed size which may not be
big enough for large scanners; these declarations allow the
programmer to increase the size of the tables explicitly.
• Increase the size of the tables with "%a", "%e",
"%k", "%n", "%o", and "%p" lines in the definition
section.
• The old lex accepts "%r" to make lex generate a lexer in
Ratfor and "%c" for a lexer in C.
• Ex: %p 6000
%e 3000
Run lex with the -v flag to see current table statistics.
30. Character Translations
• A lexer uses the native character code that the C
compiler uses.
• Ex: The code for the character "A" is the C value 'A'.
• It is sometimes convenient to use some other character code,
either because the i/p stream uses a different code,
e.g. EBCDIC or Baudot, or because lex looks for patterns in an i/p
stream not consisting of text at all.
• Lex character translations allow you to define an explicit
mapping b/w bytes that are read by input() and
characters used in lex patterns.
31. Syntax: %T
• Ex:
%T
1 aA
2 bB
3 cC
%T
An i/p byte with value 1 will match anywhere there is
an "A" or "a" in a pattern, and so on.
Note: if translation is used, every literal character
used in the lex program must appear on the RHS of a
translation line.
32. BEGIN
• The BEGIN macro switches among start states.
• It is invoked, usually in the action code for a pattern, as:
Syntax: BEGIN statename;
The scanner starts in state 0 (zero), also known as
INITIAL.
All other states must be named in %s or %x lines in the
definition section.
Although BEGIN is a macro, it doesn't take any arguments
itself, and the statename need not be enclosed in
parentheses.
33. Start States
• Start states are also called start conditions or start rules;
they are declared in the definition section.
• They are used to apply a set of rules only at certain times,
which makes it possible to limit the scope of certain rules, or
to change the way the lexer treats part of the file.
Syntax: %s CMNT or %x CMNT
• In the rules section, we add the start state in angle
brackets < > (ex: <CMNT>)
• Rules that do not have start states can apply in
any state.
• The standard/default state in which lex starts is
state ZERO, also known as INITIAL.
34. Example:
%x CMNT /* create a new start state in the lexer */
%%
"/*" BEGIN CMNT; /* switch to comment mode */
<CMNT>. | /* these rules are recognized when the
lexer is in state CMNT */
<CMNT>\n ; /* throw away comment text */
<CMNT>"*/" BEGIN INITIAL; /* once it matches the
pattern it returns to the regular state */
%%
35. Difference B/W Regular And Exclusive Start
States
• A rule with no start state is not matched when an exclusive state
is active.
Example:
%s NORMAL CMNT /* create new start states in the lexer */
%%
%{
BEGIN NORMAL; /* start in NORMAL state */
%}
<NORMAL>"/*" BEGIN CMNT; /* switch to comment mode */
<CMNT>. |
<CMNT>\n ; /* throw away comment text */
<CMNT>"*/" BEGIN NORMAL; /* return to regular state */
%%
36. Rules Section
• Each rule is made of two parts: pattern and
action, separated by whitespace.
• The rules section of the lex input contains a
series of rules of the form:
PATTERN ACTION
• Example:
{ID} printf( "An identifier: %s\n", yytext );
• The yytext and yyleng variables hold the matched text and its length.
37. Example
[\t ]+ ; /* ignore whitespace */
• The pattern [\t ]+ matches 1 or more copies of the
subpattern (a tab or a space).
• If the action is empty, the matched token is discarded.
• The semicolon is a do-nothing C statement;
its effect is to ignore the input.
38. ACTION
• If the action contains a '{', the action spans until the
balancing '}' is found, as in C.
• An action consisting only of a vertical bar ('|') means
"same as the action for the next rule."
• The return statement works as in C.
• In case no rule matches: simply copy the input to the
standard output (the default rule).
39. Example1: Single statements in the Action Part
%%
{letter}({letter}|{digit})* printf("id: %s\n", yytext);
\n printf("new line\n");
%%
• If a single statement is present in the action part, there is no
need for { } braces.
Otherwise,
if the action is more than one statement or more than one line long,
write it within { } braces.
Note: lex takes everything after the pattern as the action, while other
versions read only the first statement and ignore anything
else.
40. Example2
%%
.|\n ECHO; /* prints the matched pattern on the o/p,
copying any punctuation or other characters. */
%%
42. Multiple statements in Action Part
%%
" " ;
[a-z] { c = yytext[0]; yylval = c - 'a'; return(LETTER);
}
[0-9]* { yylval = atoi(yytext); return(NUMBER); }
[^a-z0-9\b] { c = yytext[0]; return(c); }
%%
43. User Code Section
• It can consist of any legal C code.
• The lex copies it to the C file after the end of
the lex generated code.
• The user code section is simply copied to
lex.yy.c verbatim.
• The presence of this section is optional; if it is
missing, the second %% in the input file may
be skipped.
45. Comment Statements
• Outside "%{" and "%}", comments must be indented
with whitespace for lex to recognize them correctly.
Example:
%%
[\t ]+ ; /* ignore whitespace */
%%
main()
{
/* user code */
yylex(); /* run the lexer */
}
46. Disambiguating rules
• Lex has a set of disambiguating rules. The two
that make a lexer work are:
1. Lex patterns only match a given i/p character or
string once.
2. Lex executes the action for the longest possible
string match for the current i/p.
Ex: island (program for verb/not verb).
Ex: well-being as a single word (program for no. of
words).
47. Precedence Problem
• For example: a "<" can be matched by both
"<" and "<=".
• The rule matching the most text has higher
precedence.
• If two or more rules match the same length of text, the
rule listed first in the lex input has higher
precedence.
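The longest-match rule can be sketched in plain C by trying longer operators before shorter ones. This is a hand-rolled illustration — real Lex compiles a DFA that achieves this automatically, and `next_op()` is our own hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Consume the longest relational/assignment operator at *s.
   Longer patterns are listed first, so "<=" beats "<". */
static const char *next_op(const char **s) {
    static const char *ops[] = { "<=", ">=", "==", "<", ">", "=" };
    for (size_t i = 0; i < sizeof ops / sizeof ops[0]; i++) {
        size_t n = strlen(ops[i]);
        if (strncmp(*s, ops[i], n) == 0) {
            *s += n;          /* advance past the matched operator */
            return ops[i];
        }
    }
    return NULL;              /* no operator at this position */
}
```

On the input "<=<" this yields "<=" and then "<", never three separate "<" tokens.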
48. Lex program structure
… definitions …
%%
… rules …
%%
… subroutines …
%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}
%%
" " ;
[a-z] { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]* { yylval = atoi(yytext); return(NUMBER); }
[^a-z0-9\b] { c = yytext[0]; return(c); }
49. Lex Source Program example1
• Lex source is a table of
– regular expressions and
– corresponding program fragments
digit [0-9]
letter [a-zA-Z]
%%
{letter}({letter}|{digit})* printf("id: %s\n", yytext);
\n printf("new line\n");
%%
main()
{
yylex();
}
50. A Simple Example2
%{
int num_lines = 0, num_chars = 0;
%}
%%
\n ++num_lines; ++num_chars;
. ++num_chars;
%%
main()
{
yylex();
printf( "# of lines = %d, # of chars = %d\n",
num_lines, num_chars );
}
51. Lex Source to C Program
• The table is translated to a C program (lex.yy.c)
which
– reads an input stream
– partitioning the input into strings which match the
given expressions and
– copying it to an output stream if necessary
52. Step to create & run/execute the lex
program
1. Create a file: vi filename.l
2. Type the source code in vi/gedit and save it by
pressing Esc, then :wq, then Enter
(Esc + Shift + : + w + q).
3. Compile the lex file (filename.l) to generate the C routine
(lex.yy.c), i.e. lex filename.l
4. Compile lex.yy.c with cc to generate the output file ./a.out
(object file)
5. % cc lex.yy.c -o <execfilename> -ll
53. Ex: cc lex.yy.c -o first -ll (lex library).
1. lex translates the lex specification into a C
source file called lex.yy.c; this file is then compiled
and linked with the lex library -ll.
2. ./a.out
3. Type the i/p data, press the Enter key, then press Ctrl+D.
54. Continued..
Execution is as follows:
1. If -o <execfilename> is not given, run the
program using ./a.out
2. If -o <execfilename> is given, run the program
using ./execfilename.
3. Enter the data after execution.
4. Use ^D to terminate the program and obtain the
result.
56. Lex Regular Expressions (Extended Regular
Expressions)
• A regular expression matches a set of strings. It contains
text characters and operator characters.
• A regular expression is a pattern description using a “meta”
language.
• Regular expression
– Operators
– Character classes
– Arbitrary character
– Optional expressions
– Alternation and grouping
– Context sensitivity
– Repetitions and definitions
58. Pattern Matching Primitives
Metacharacter   Matches
.    any character except newline
\    used to escape metacharacters
*    zero or more copies of the preceding expression
+    one or more copies of the preceding expression
?    zero or one copy of the preceding expression
^    beginning of line as first character / complement (negation) inside []
$    end of line as last character
|    either the preceding or the following expression
( )  groups a series of REs
[]   any one character from the class within the brackets
{}   indicates how many times the previous pattern is allowed to match
" "  everything within the quotes is taken literally
/    trailing context: matches the preceding RE only if followed by the following RE; only one slash is permitted
-    used to denote a range. Example: A-Z implies all characters from A to Z
59. Quotation Mark Operator “ “
• The quotation mark operator “…”indicates that
whatever is contained between a pair of quotes is to be
taken as text characters.
Ex: xyz"++" or "xyz++"
• If operators are to be used as text characters, an escape
\ should be used
Ex: xyz\+\+ = "xyz++"
\$ = "$"
\\ = "\\"
• Every character but blank, tab (\t), newline (\n),
backspace (\b) and the list above is always a text character.
• Any blank character not contained within [] must be
quoted.
60. Character Classes []
• Classes of characters can be specified using the operator
pair [].
• It matches any single character, which is in the [ ].
• Every operator meaning is ignored except - and ^.
• The - character indicates ranges.
• If it is desired to include the character - in a character
class, it should be first or last.
• If the first character is a circumflex ("^") it changes the
meaning to match any character except the ones within
the brackets.
• C escape sequences starting with "\" are recognized.
61. examples
[ab] => a or b
[a-z] => a or b or c or … or z
[-+0-9] => all the digits and the two signs
[^a-zA-Z] => any character which is not a
letter
[a-z0-9<>_] all the lower case letters, the digits,
the angle brackets, and underline.
joke[rs] Matches either jokes or joker.
62. Arbitrary Character .
• To match almost any character, the operator
character . is the class of all characters
except newline
• [\40-\176] matches all printable
characters in the ASCII character set, from
octal 40 (blank) to octal 176 (tilde ~)
63. Optional & Repeated Expressions
• The operator ? indicates an optional element of an
expression. Thus ab?c matches either ac or abc.
• Repetitions of classes are indicated by the operators * and +.
• a? => zero or one instance of a
• a* => zero or more instances of a
• a+ => one or more instances of a
• E.g.
ab?c => ac or abc
[a-z]+ => all strings of lower case letters
[a-zA-Z][a-zA-Z0-9]* => all alphanumeric strings
with a leading alphabetic character
64. Examples
• an integer: 12345
[1-9][0-9]*
• a word: cat
[a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
["-"+]?[1-9][0-9]*
• a floating point number: 1.2345
[0-9]*"."[0-9]+
65. Examples
• a delimiter for an English sentence
"." | "?" | ! or [".?!"]
• C++ comment: // call foo() here!!
"//".*
• white space
[ \t]+
• English sentence: Look at this!
([ \t]+|[a-zA-Z]+)+("."|"?"|!)
66. Alternation and Grouping
The operator | indicates alternation:
(ab|cd)
matches either ab or cd. Note that parentheses are used for
grouping, although they are not necessary on the outside
level;
ab|cd
would have sufficed. Parentheses can be used for more
complex expressions:
(ab|cd+)?(ef)*
matches such strings as abefef, efefef, cdef, or cddd; but not
abc, abcd, or abcdef.
67. Context Sensitivity
• Lex will recognize a small amount of surrounding
context. The two simplest operators for this are ^
and $.
• If the first character of an expression is ^, the
expression will only be matched at the beginning of
a line (after a newline character, or at the beginning
of the input stream). This can never conflict with
the other meaning of ^, complementation of
character classes, since that only applies within the
[] operators.
68. Continued..
• If the very last character is $, the expression will
only be matched at the end of a line (when
immediately followed by newline).
• The latter operator is a special case of the /
operator character, which indicates trailing context.
• The expression ab/cd matches the string ab, but
only if followed by cd.
• Thus ab$ is the same as ab/\n
69. Continued..
Left context is handled in Lex by start conditions. If a
rule is only to be executed when the Lex
automaton interpreter is in start condition x, the
rule should be prefixed by
<x>
using the angle bracket operator characters. If we
considered “being at the beginning of a line'' to be
start condition ONE, then the ^ operator would be
equivalent to
<ONE>
Start conditions were explained more fully earlier.
70. Repetitions and Definitions
The operators {} specify either repetitions (if they enclose
numbers) or definition expansion (if they enclose a
name).
For example
{digit}
looks for a predefined string named digit and inserts it at
that point in the expression. The definitions are given in
the first part of the Lex input, before the rules. In
contrast,
a{1,5}
looks for 1 to 5 occurrences of a.
71. PLLab, NTHU, CS2403 Programming Languages
Pattern Matching Primitives

Metacharacter   Example
.            a.b, we.78
\n           [\t \n]
*            [\t \n]*, a*, a.*z
+            [\t \n]+, a+, a.+r
?            -?[0-9]+, ab?c
^            [^\t \n], ^AD, ^(.*)\n
$            end of line as last character
a|b          a or b
(ab)+        one or more copies of ab (grouping)
[ab]         a or b
a{3}         3 instances of a
"a+b"        literal "a+b" (C escapes still work)
A{1,2}shis+  matches Ashis, AAshis, Ashiss, AAshiss, …
(A[b-e])+    matches one or more occurrences of A followed by any character from b to e
72. Finally, initial % is special, being the separator
for Lex source segments
• [a-z]+ printf("%s", yytext);
will print the string in yytext. The C function printf
accepts a format argument and data to be printed; in
this case, the format is "print string" (% indicating
data conversion, and %s indicating string type), and
the data are the characters in yytext. So this just
places the matched string on the output. This action is
so common that it may be written as ECHO:
• [a-z]+ ECHO;
74. Regular Expression (1/3)
x match the character 'x'
. any character (byte) except newline
[xyz] a "character class"; in this case, the pattern matches either
an 'x', a 'y', or a 'z'
[abj-oZ] a "character class" with a range in it; matches an 'a', a 'b',
any letter from 'j' through 'o', or a 'Z'
[^A-Z] a "negated character class", i.e., any character but those in
the class. In this case, any character EXCEPT an uppercase
letter.
[^A-Z\n] any character EXCEPT an uppercase letter or a newline
75. Regular Expression (2/3)
r* zero or more r's, where r is any regular expression
r+ one or more r's
r? zero or one r's (that is, "an optional r")
r{2,5} anywhere from two to five r's
r{2,} two or more r's
r{4} exactly 4 r's
{name} the expansion of the "name" definition (see above)
"[xyz]\"foo" the literal string: [xyz]"foo
\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \X.
Otherwise, a literal 'X' (used to escape operators such as '*')
76. Regular Expression (3/3)
\0 a NUL character (ASCII code 0)
\123 the character with octal value 123
\x2a the character with hexadecimal value 2a
(r) match an r; parentheses are used to override precedence (see below)
rs the regular expression r followed by the regular expression s; called
"concatenation"
r|s either an r or an s
^r an r, but only at the beginning of a line (i.e., when just starting to scan,
or right after a newline has been scanned).
r$ an r, but only at the end of a line (i.e., just before a newline).
Equivalent to "r/\n".
77. Precedence of Operators
• Level of precedence
– Kleene closure (*), ?, +
– concatenation
– alternation (|)
• All operators are left associative.
• Ex: a*b|cd* = ((a*)b)|(c(d*))
78. Lex Predefined Variables
yytext
• Whenever the scanner matches a token, the text of the
token is stored in the null-terminated string yytext.
• The contents of yytext are replaced each time a new
token is matched.
extern char yytext[]; // array
extern char *yytext; // pointer
To increase the size of the buffer (format in AT&T and MKS lex):
%{
#undef YYLMAX /* remove default definition */
#define YYLMAX 500 /* new size */
%}
79. Continued..
• If yytext is an array, any token which is longer
than yytext will overflow the end of the array
and cause the lexer to fail.
• yytext[] is 200 or 100 characters in different lex
tools.
• Flex has a default I/O buffer of 16K, which can
handle tokens up to 8K.
80. yyleng
The length of the token is stored in it. It is similar
to strlen(yytext).
• Example :
[a-z]+ printf("%s", yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
81. A Lex action may decide that a rule has not recognized the
correct span of characters. Two routines are provided to aid
with this situation.
First, yymore () can be called to indicate that the next
input expression recognized is to be tacked on to the end
of this input. Normally, the next input string would
overwrite the current entry in yytext.
Second, yyless (n) may be called to indicate that not all the
characters matched by the currently successful expression
are wanted right now. The argument n indicates the
number of characters in yytext to be retained.
Further characters previously matched are
returned to the input.
82. Example
\"[^"]* {
if (yytext[yyleng-1] == '\\')
yymore();
else
... normal user processing
}
which will, when faced with a string such as "abc\"def",
first match the five characters "abc\ ; then the call to
yymore() will cause the next part of the string, "def, to be
tacked on the end.
Note that the final quote terminating the string should be
picked up in the code labeled "normal processing".
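What the rule accomplishes can be mimicked in plain C: scan a double-quoted string, treating \" as an escaped quote rather than a terminator. This is an illustrative sketch; `quoted_len()` is our own hypothetical helper, not part of Lex:

```c
#include <assert.h>
#include <stddef.h>

/* Given s pointing at an opening '"', return the length of the
   complete quoted string including both quotes, or 0 if it is
   unterminated. A backslash escapes the following character. */
static size_t quoted_len(const char *s) {
    if (s[0] != '"') return 0;
    size_t i = 1;
    while (s[i] != '\0') {
        if (s[i] == '\\' && s[i + 1] != '\0') { i += 2; continue; }
        if (s[i] == '"') return i + 1;   /* found the closing quote */
        i++;
    }
    return 0;                            /* ran off the end */
}
```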
83. I/O routines
Lex also permits access to the I/O routines it uses.
They are:
1) input() which returns the next input character;
2) output(c) which writes the character c on the output; and
3) unput(c) pushes the character c back onto the input stream to
be read later by input().
By default these routines are provided as macro definitions, but
the user can override them and supply private versions. These
routines define the relationship between external files and
internal characters, and must all be retained or modified
consistently.
84. Ambiguous Source Rules
Lex can handle ambiguous specifications. When more than one
expression can match the current input, Lex chooses as follows:
1) The longest match is preferred.
2) Among rules which matched the same number of characters,
the rule given first is preferred. Thus, suppose the rules
integer keyword action ...;
[a-z]+ identifier action ...;
to be given in that order. If the input is integers, it is taken as an
identifier, because [a-z]+ matches 8 characters while integer
matches only 7. If the input is integer, both rules match 7
characters, and the keyword rule is selected because it was given
first. Anything shorter (e.g. int) will not match the expression
integer and so the identifier interpretation is used.
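Both disambiguation rules can be demonstrated with a small hand-written sketch. This is illustration only; `classify_token()` and the two `match_*` helpers are our own assumptions, not Lex output:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static size_t match_keyword(const char *s) {   /* rule 1: "integer" */
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}
static size_t match_ident(const char *s) {     /* rule 2: [a-z]+ */
    size_t n = 0;
    while (islower((unsigned char)s[n])) n++;
    return n;
}

/* Longest match wins; on a tie the earlier rule (keyword) wins. */
static const char *classify_token(const char *s) {
    size_t k = match_keyword(s), i = match_ident(s);
    if (k == 0 && i == 0) return "no match";
    return (k >= i) ? "keyword" : "identifier";
}
```

With this, "integer" ties at 7 characters and the keyword rule (listed first) wins, while "integers" is an identifier because [a-z]+ matches one character more.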
85. yywrap()
• yywrap is a built-in macro. When a lexer encounters
the end of file, it calls the routine yywrap() to find
out what to do next.
• If yywrap() returns 0, the lexer (scanner) continues
the scanning process. It indicates that there is more i/p
& hence the lexer has to continue working.
• It first needs to adjust yyin to point to a new file,
e.g. by using fopen().
• If yywrap() returns 1, the lexer (scanner) halts
the scanning process (i.e. the scanner returns a 0 token to
report the EOF).
86. The user can define their own yywrap(). To do so, put this at the beginning of
the rules section.
Format:
%{
#undef yywrap
%}
Note: disable the macro by undefining it before
defining your own.
87. yylex()
• The scanner created by lex has the entry point
yylex().
• You call yylex() to start or resume scanning.
• If a lex action does a return to pass a value to the
calling program, the next call to yylex() will
continue from the point where it left off.
• All code in the rules section is copied into yylex().
• Lines of code immediately after the “%%” line are
placed near the beginning of the scanner, before
the first executable statement.
88. Example
%{
int counter = 0;
%}
letter [a-zA-Z]
%%
{letter}+ {printf("a word\n"); counter++;}
%%
main()
{
yylex();
printf("There are total %d words\n", counter);
}
89. REJECT
• The action REJECT means "go do the next alternative." It
causes whatever rule was the second choice after the current
rule to be executed. The position of the input pointer is
adjusted accordingly.
• Lex separates the i/p into non-overlapping tokens.
• If tokens overlap and we still need all
occurrences of each token, the special action REJECT is
used to handle this.
90. When to use REJECT?
In general, REJECT is useful whenever the purpose of
Lex is not to partition the input stream but to detect all
examples of some items in the input, and the instances
of these items may overlap or include each other.
91. Some Lex rules to do this might be
she s++;
he h++;
\n |
. ;
where the last two rules ignore everything besides he and
she. Remember that . does not include newline. Since
she includes he, Lex will normally not recognize the
instances of he included in she, since once it has passed
a she those characters are gone.
92. Examples
she {s++; REJECT;}
he {h++; REJECT;}
\n |
. ;
a[bc]+ { ... ; REJECT;}
a[cd]+ { ... ; REJECT;}
If the input is ab, only the first
rule matches, and on ad only the
second matches.
The input string accb matches
the first rule for four characters
and then the second rule for
three characters.
In contrast, the input accd agrees
with the second rule for four
characters and then the first rule
for three.
93. Example
…
%%
pink {npink++; REJECT;}
ink {nink++; REJECT;}
pin {npin++; REJECT;}
. |
\n ;
%%
…
i/p data: pink
All three patterns will match. Without the REJECT statement only pink is matched.
When the REJECT action executes, it puts back the text matched by the pattern
& finds the next best match for it.
Note: REJECT is necessary here to pick up a letter pair beginning at every
character, rather than at every other character.
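The effect of REJECT — counting every pattern at every position, even when matches overlap — can be imitated in plain C. A sketch under our own naming; `count_overlapping()` is a hypothetical helper, not a Lex routine:

```c
#include <assert.h>
#include <string.h>

/* Count occurrences of pat in text, including overlapping ones,
   by testing every starting position — much as REJECT lets the
   lexer retry other patterns at the same input position. */
static int count_overlapping(const char *text, const char *pat) {
    int n = 0;
    size_t plen = strlen(pat);
    for (const char *p = text; *p != '\0'; p++)
        if (strncmp(p, pat, plen) == 0)
            n++;
    return n;
}
```

For the input "pink", all of pink, ink, and pin are counted once each, just as the three REJECT rules above would.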
94. Lex Predefined Variables
• yytext -- a string containing the lexeme
• yyleng -- the length of the lexeme
• yyin -- the input stream pointer
– the default input of default main() is stdin
• yyout -- the output stream pointer
– the default output of default main() is stdout.
• cs20: %./a.out < inputfile > outfile
• E.g.
[a-z]+ printf("%s", yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
95. Lex Library Routines
• yylex()
– The default main() contains a call of yylex()
• yymore()
– append the next matched token to the current one
• yyless(n)
– retain the first n characters in yytext
• yywrap()
– is called whenever Lex reaches an end-of-file
– The default yywrap() always returns 1
96. Review of Lex Predefined Variables
Name Function
char *yytext pointer to matched string
int yyleng length of matched string
FILE *yyin input stream pointer
FILE *yyout output stream pointer
int yylex(void) call to invoke lexer, returns token
char* yymore(void) append the next match to yytext
int yyless(int n) retain the first n characters in yytext
int yywrap(void) wrapup, return 1 if done, 0 if not done
ECHO write matched string
REJECT go to the next alternative rule
INITIAL initial start condition
BEGIN condition switch start condition
97. Revisiting Internal Variables in Lex
• char *yytext;
– Pointer to the current lexeme, terminated by '\0'
• int yyleng;
– Number of characters in yytext, not counting '\0'
• yylval:
– Global variable through which the token value can be
returned to Yacc
– The parser (Yacc) can access yylval, yyleng, and yytext
• How are these used?
– Consider Integer Tokens:
– yylval = ascii_to_integer (yytext);
– Conversion from String to actual Integer Value
99. Symbol Tables
• The table of words is a simple symbol table, a
common structure in lex and yacc applications.
• We use a symbol table to build a table of words as
the lexer is running, so we can add new words
without modifying and recompiling the lex program.
• Ex: A C compiler, for example, stores the variable
and structure names, labels, enumeration tags, and
all other names used in the program in its symbol
table. Each name is stored along with information
describing the name. In a C compiler the
information is the type of symbol, declaration scope,
variable type, etc.
100. Continued..
• add_word(), which puts a new word into the symbol
table, and
• lookup_word(), which looks up a word which should
already be entered.
• A variable state keeps track of whether we're
looking up words, state LOOKUP, or declaring them.
103. RECOMMENDED QUESTIONS:
1. Write the specification of lex with an example? (10)
2. What are regular expressions? Explain with
examples? (8)
3. Write a lex program to count the number of words, lines,
spaces, and characters? (8)
4. Write a lex program to count the number of vowels and
consonants? (8)
5. What is lexer-parser communication? Explain? (5)
6. Write a program to count the number of words by the
method of substitution? (7)
105. The parser is that phase of the compiler which takes a token
string as input and with the help of existing grammar, converts
it into the corresponding Intermediate Representation. The
parser is also known as Syntax Analyzer.
106. Yacc
Theory:
◦ Yacc reads a grammar and generates C code for a parser.
◦ Grammars are written in Backus-Naur Form (BNF).
◦ BNF grammars are used to express context-free languages.
◦ e.g. to parse an expression, the parser performs the reverse of a
derivation (reducing the expression).
◦ This is known as bottom-up or shift-reduce parsing.
◦ It uses a stack (LIFO) for storage.
107. What is YACC ?
– Tool which will produce a parser for a given
grammar.
– YACC (Yet Another Compiler Compiler) is a program
designed to compile a LALR(1) grammar and to
produce the source code of the syntactic analyzer of the
language produced by this grammar
108. Parsing
• Once the input is divided into tokens, a program needs
to establish the relationships among the tokens.
• A C compiler needs to find the expressions,
statements, declarations, blocks, and procedures
in the program.
• This task is known as parsing.
109. Parser
• Yacc takes a concise description of a grammar and
produces a C routine that can parse that grammar, a
parser.
• The yacc parser automatically detects whenever a
sequence of input tokens matches one of the rules in
the grammar, and also detects a syntax error
whenever its input tokens don't match any of the
rules.
110. Grammar
• The list/set of rules that define the relationships that a program
understands is a grammar.
Or
It is a series of rules that the parser uses to recognize
syntactically valid input.
For example, one grammar rule might be
date : month_name day ',' year
• Here, date, month_name, day, and year represent structures
of interest in the input process; presumably, month_name,
day, and year are defined elsewhere. The comma ',' is
enclosed in single quotes; this implies that the comma is to
appear literally in the input. The colon and semicolon merely
serve as punctuation in the rule, and have no significance in
controlling the input. Thus, with proper definitions, the input
• July 4, 1776
• might be matched by the above rule.
111. Example
Ex: CFG
expr : expr '+' term | term;
term : term '*' factor | factor;
factor : '(' expr ')' | ID | NUM;
(a + b) *c
113. Example
1. E -> E + E
2. E -> E * E
3. E -> id
Three productions have been specified.
Terms that appear on the left-hand side (lhs) of a
production, such as E (expression) are nonterminals.
Terms such as id (identifier) are terminals (tokens
returned by lex) and only appear on the right-hand side
(rhs) of a production.
This grammar specifies that an expression may be the
sum of two expressions, the product of two expressions,
or an identifier.
114. Example : x + y * z
E -> E * E (r2)
-> E * z (r3)
-> E + E * z (r1)
-> E + y * z (r3)
-> x + y * z (r3)
(This is a rightmost derivation; the parser's reductions replay it in reverse.)
115. Symbols
• A yacc grammar is constructed from symbols.
• Symbols are strings of letters, digits, periods, and
underscores that do not start with a digit.
• The symbol error is reserved for error recovery.
• There are two types of symbols:
1. Terminal symbols or tokens
2. Non-terminal symbols or non-terminals
116. Terminal Symbols v/s Non-Terminal Symbols
• Terminal Symbols
– Symbols that actually appear in the input and are
returned by the lexer are called terminal symbols or tokens.
– Conventionally represented by lowercase letters.
– Appear only on the RHS of the arrow or colon.
– Cannot be derived further.
– Ex: a, b & c
• Non-Terminal Symbols
– Symbols that appear on the LHS of some rule are called
non-terminal symbols or non-terminals.
– Conventionally represented by uppercase letters.
– May appear on both sides of the arrow or colon (LHS & RHS).
– Can be derived further.
– Ex: E, T & F
117. Continued.
• The symbol to the left of the colon is known
as the left-hand side (LHS) of the rule.
• The symbols to the right of the colon are known
as the right-hand side (RHS) of the rule.
• Terminal & non-terminal symbols must be
different; it is an error to write a rule with a
token on the left side.
• Every grammar includes a start symbol, the one
that has to be at the root of the parse tree.
• Ex : E is the start symbol.
118. Example : ( a + b ) * c
Production rules (E is the start symbol; the vertical bar
means two possibilities for the same symbol):
E → E + T | T
T → T * F | F
F → ( E ) | a | b | c
119. Recursive Rules
• Rules can refer directly or indirectly to
themselves; this important ability makes it
possible to parse arbitrarily long input sequences
by applying the expression rules repeatedly.
• Ex: fred = 14 + 23 - 11 + 7
Rule : E : NUM
| E '+' NUM
| E '-' NUM ;
Note : E is invoked again & again.
120. Left and Right Recursion
• A recursive rule can put the recursive reference at
the left end or right end of the RHS of the rule.
• Ex: Exp : Exp '+' E ; /* left recursion */
• Ex: Exp : E '+' Exp ; /* right recursion */
123. Continued..
• Note: Any recursive rule must have at least
one non-recursive alternative (one that does not
refer to itself). Otherwise there is no way to
terminate the string that it matches, which is an error.
• Yacc handles left recursion more efficiently
than right recursion.
• This is because its internal stack keeps track of
all symbols for all partially parsed rules; with right
recursion, every symbol stays on the stack until the
end of the sequence is reached.
124. Yacc
• Input to yacc is divided into three sections.
• Every specification file consists of three sections: the declarations,
(grammar) rules, and programs. The sections are separated by
double percent "%%" marks. (The percent "%" is generally used
in Yacc specifications as an escape character.)
Format :
... definitions ...
%%
... rules ...
%%
... subroutines ...
125. YACC File Format
%{
C declarations
%}
yacc declarations
%%
Grammar rules
%%
Additional C code
– Comments enclosed in /* ... */ may appear in any of the
sections.
126. Format of a Yacc Specification – 3 Parts
• Definition Section:
–Literal block contains Defs, Constants, Types, #includes, etc. that
can Occur in a C Program.
–Regular Definitions (expressions), declarations, start condition .
–They may be %union, %start, %token, %type, %left, %right, and
%nonassoc declarations.
–All of these are optional; the section may even be completely empty.
•Translation Rules:
–Pairs of (grammar rule, action).
–Informs parser/yacc of the action to take when a grammar rule
is recognized.
•Auxiliary Procedures:
–Designer Defined C Code
–Can Replace System Calls
yacc.y File Format:
DECLARATIONS
%%
TRANSLATION RULES
%%
AUXILIARY
PROCEDURES
128. Definitions Section
The definitions section consists of:
◦ token declarations .
◦ Types of values used on the parser stack, and other odds
and ends.
◦ A literal block: C code bracketed by "%{" and "%}".
◦ The declaration section may be empty.
• There may be %union, %start, %token, %type, %left,
%right, and %nonassoc declarations. (See "%union
Declaration," "Start Declaration,“ "Tokens," "%type
Declarations," and "Precedence and Operator
Declarations.")
129. Continued..
• It can also contain comments in the usual C
format, surrounded by "/*" and "*/".
• All of these are optional, so in a very simple
parser the definition section may be
completely empty.
• You can use single quoted characters as tokens
without declaring them, so we don't need to
declare "=", "+", or "-".
130. Literal Block
• Any initial C code that you want copied
into the final program should be written in the
definition section.
• Yacc copies the contents between "%{" and
"%}" directly to the generated C file.
• In the definitions and rules sections, any
indented text or text enclosed in %{ and %} is
copied verbatim to the generated C source file
(i.e., the output) near the beginning, before
yyparse() (with the %{ and %} markers
removed).
132. Tokens Declarations
• Tokens may either be symbols defined by %token
or individual characters in single quotes.
• All symbols used as tokens must be defined
explicitly in the definitions section, e.g.:
Format : %token NAME1 NAME2 ...
• Tokens can also be declared by %left, %right, or
%nonassoc declarations, each of which has
exactly the same syntax options as has %token
133. Token Numbers
• Within the lexer and parser, tokens are
identified by small integers.
• The token number of a literal token is the
numeric value in the local character set, usually
ASCII, and is the same as the C value of the
quoted character.
• %token NAME integervalue
134. Continued.. With example
– To specify token AAA BBB
• %token AAA BBB
• %token UP DOWN LEFT RIGHT
– To assign a token number to a token (needed when using
lex), write a nonnegative integer immediately after the
first appearance of the token
• %token EOF 0
• %token SEMI 101
• %token UP 50 DOWN 60 LEFT 17 RIGHT 25
– Non-terminals do not need to be declared unless you
want to associate them with a type to store attributes
135. Token Values
• Each symbol in a yacc parser can have an associated
value. (See "Symbol Values.")
• Since tokens can have values, you need to set the
values as the lexer returns tokens to the parser.
• The token value is always stored in the variable
yylval.
• Example :
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
136. Symbol Values
• Every symbol in a yacc parser, both tokens and non-
terminals, can have a value associated with it.
• If the token were NUMBER, the value might be the
particular number, if it were STRING, the value
might be a pointer to a copy of the string, and if it
were SYMBOL, the value might be a pointer to an
entry in the symbol table that describes the symbol.
137. Continued..
• Ex: C type, int or double for the number,
char * for the string, and a pointer to a
structure for the symbol.
• Yacc makes it easy to assign types to symbols
so that it automatically uses the correct type
for each symbol.
138. Declaring Symbol Types
• Internally, yacc declares each value as a C union
that includes all of the types.
• You list all of the types in a %union declaration,
q.v.
• Yacc turns this into a typedef for a union type
called YYSTYPE.
• Then for each symbol whose value is set or used
in action code, you have to declare its type.
139. %type Declaration
• You declare the types of non-terminals using
%type. Each declaration has the form:
%type <type> name,name,..
• The type name must have been defined by a
%union.
• Each name is the name of a non-terminal symbol.
• Use %type to declare non-terminals
140. %union Declaration
• The %union declaration identifies all of the possible C
types that a symbol value can have. The declaration
takes this form:
%union {
... field declarations ...
}
• The field declarations are copied verbatim into a C
union declaration of the type YYSTYPE in the output
file. Yacc does not check to see if the contents of the
%union are valid C.
• In the absence of a %union declaration, yacc defines
YYSTYPE to be int so all of the symbol values are
integers.
141. Start Declaration/ Start Symbol
• Normally, the start rule, the one that the parser starts
trying to parse, is the one named in the first rule.
• If you want to start with some other rule, in the
declaration section you can write:
Format : %start somename
//to start with rule somename
EX:
• By default, the start symbol is the first non-terminal
specified in the grammar rules section.
• To override it, use a %start declaration:
%start non-terminal
143. Operator Declarations
• Operator declarations appear in the definitions
section.
• The possible declarations are %left, %right, and
%nonassoc. (In very old grammars you may
find the obsolete equivalents %<, %>, and %2 or
%binary.)
• The %left and %right declarations make an operator
left or right associative, respectively.
• You declare non-associative operators with
%nonassoc.
144. Continued..
• Operators are declared in increasing order of
precedence.
• All operators declared on the same line are at the
same precedence level.
• Example :
%left PLUS MINUS
%left TIMES DIVIDE
145. Example
%union {
int iValue; /* integer value */
char sIndex; /* symbol table index */
nodeType *nPtr; /* node pointer */
};
%token <iValue> INTEGER
%token <sIndex> VARIABLE
%token WHILE IF PRINT
%nonassoc IFX
%nonassoc ELSE
%left GE LE EQ NE '>' '<'
%left '+' '-'
%left '*' '/'
%nonassoc UMINUS
%type <nPtr> stmt expr stmt_list
146. YACC Declaration Summary
`%start'
Specify the grammar's start symbol
`%union'
Declare the collection of data types that semantic values may have
`%token'
Declare a terminal symbol (token type name) with no precedence
or associativity specified
`%type'
Declare the type of semantic values for a nonterminal symbol
147. YACC Declaration Summary
`%right'
Declare a terminal symbol (token type name) that is
right-associative
`%left'
Declare a terminal symbol (token type name) that is left-associative
`%nonassoc'
Declare a terminal symbol (token type name) that is nonassociative
(using it in a way that would be associative is a syntax error,
e.g., x op y op z is a syntax error)
148. Yacc RULE SECTION
• The rules section contains grammar rules and actions
containing C code.
• A yacc rule consists of two parts: a grammar rule and an action.
◦ The rules section consists of:
BNF grammar
ACTION
149. RULES
• A yacc grammar consists of a set of rules.
• Each rule starts with a nonterminal symbol and a
colon, and is followed by a possibly empty list of
symbols, literal tokens, and actions.
• Rules by convention end with a semicolon,
although in most versions of yacc the semicolon is
optional.
• The rules section is made up of one or more
grammar rules.
150. A grammar rule has the form:
A : BODY ;
A represents a nonterminal name, and BODY
represents a sequence of zero or more names and
literals. The colon and the semicolon are Yacc
punctuation.
A nonterminal symbol that matches the empty string is written as:
empty : ;
151. Continued..
• If there are several grammar rules with the same left-
hand side, the vertical bar "|" can be used to avoid
rewriting the left-hand side. In addition, the semicolon
at the end of a rule can be dropped before a vertical bar.
Thus the grammar rules
A : B C D ;
A : E F ;
A : G ;
• can be given to Yacc as
A : B C D
| E F
| G
;
152. Example Rules Section
• This section defines grammar
• Example
expr : expr '+' term | term;
term : term '*' factor | factor;
factor : '(' expr ')' | ID | NUM;
153. EXAMPLE
%token NAME NUMBER
%%
statement: NAME '=' expression
| expression
;
expression: NUMBER '+' NUMBER
| NUMBER '-' NUMBER
;
Note : Unlike lex, yacc pays no attention to line boundaries in
the rules section, and you will find that a lot of whitespace
makes grammars easier to read.
The symbol on the left-hand side of the first rule in the
grammar is normally the start symbol.
154. Actions
• An action is C code executed when yacc matches a
rule in the grammar.
• The action must be a C compound statement,
Example:
date: month '/' day '/' year { printf("date found"); }
;
The action can refer to the values associated with the
symbols in the rule by using a dollar sign followed by
a number, with the first symbol after the colon being
number 1.
155. CONTINUED..
EXAMPLE :
date: month '/' day '/' year
{ printf("date %d-%d-%d found", $1, $3, $5); }
;
The name "$$" refers to the value for the symbol to
the left of the colon.
• For rules with no action, yacc uses a default of:
{ $$ = $1; }
156. THE RULE'S ACTION
• Whenever the parser reduces a rule, it executes
user C code associated with the rule, known as the
rule's action.
• The action appears in braces after the end of the
rule, before the semicolon or vertical bar.
• The action code can refer to the values of the right-
hand side symbols as $1, $2, ..., and can set
the value of the left-hand side by setting $$.
• In our parser, the value of an expression symbol is
the value of the expression it represents.
157. Rule Reduction and Action
stat: expr {printf("%d\n", $1);}
| LETTER '=' expr {regs[$1] = $3;} ;
expr:
expr '+' expr {$$ = $1 + $3;} |
LETTER {$$ = regs[$1];} ;
(To the left of the braces: grammar rule; inside the braces: action.
The "|" ("or") operator introduces an alternative RHS.)
158. Rules Section
• Normally written like this
• Example:
expr : expr '+' term
| term
;
term : term '*' factor
| factor
;
factor : '(' expr ')'
| ID
| NUM
;
159. The Position of Rules
expr : expr '+' term { $$ = $1 + $3; }
| term { $$ = $1; }
;
term : term '*' factor { $$ = $1 * $3; }
| factor { $$ = $1; }
;
factor : '(' expr ')' { $$ = $2; }
| ID
| NUM
;
160. The Position of Rules
(Slides 160–162 repeat the grammar from slide 159, highlighting the
positions $1, $2, and $3 in turn: in expr : expr '+' term the term
is $3; in factor : '(' expr ')' the inner expr is $2.
Default action: $$ = $1;)
165. User Code Section
• It can consist of any legal C code.
• Yacc copies it to the C file after the end of the
yacc-generated code; when using a lex-generated
lexer, this is where main() and yyerror() usually go.
• The user code section is simply copied to
y.tab.c verbatim.
• The presence of this section is optional; if it is
missing, the second %% in the input file may
be skipped.
169. Example
yyerror(const char *str)
{ printf("yyerror: %s at line %d\n", str, yyline);
}
main()
{
if (!yyparse()) { printf("accept\n"); }
else
printf("reject\n");
}
170. Yacc Program Structure
%{
#include <stdio.h>
int regs[26];
int base;
%}
%token NUMBER LETTER
%left '+' '-'
%left '*' '/'
%%
list: | list stat '\n' | list error '\n' { yyerrok; } ;
stat: expr { printf("%d\n", $1); }
| LETTER '=' expr { regs[$1] = $3; } ;
expr:
'(' expr ')' { $$ = $2; } |
expr '+' expr { $$ = $1 + $3; } |
LETTER { $$ = regs[$1]; }
%%
main() { return yyparse(); }
yyerror(char *s) { fprintf(stderr, "%s\n", s); }
yywrap() { return 1; }
… definitions …
%%
… rules …
%%
… subroutines …
171. An YACC File Example
%{
#include <stdio.h>
%}
%token NAME NUMBER
%%
statement: NAME '=' expression
| expression { printf("= %d\n", $1); }
;
expression: expression '+' NUMBER { $$ = $1 + $3; }
| expression '-' NUMBER { $$ = $1 - $3; }
| NUMBER { $$ = $1; }
;
%%
int yyerror(char *s)
{
fprintf(stderr, "%s\n", s);
return 0;
}
int main(void)
{
yyparse();
return 0;
}
172. A YACC PARSER
• A literal consists of a character enclosed in single
quotes ('). As in C, the backslash (\) is an escape
character within literals, and all the C escapes are
recognized. Thus
• '\n' newline
• '\r' return
• '\'' single quote
• '\\' backslash
• '\t' tab
• '\b' backspace
• '\f' form feed
• '\xxx' "xxx" in octal
• For a number of technical reasons, the NUL character
('\0' or 0) should never be used in grammar rules.
173. How YACC Works
gram.y (file containing the desired grammar in yacc format)
→ yacc (the yacc program)
→ y.tab.c (C source program created by yacc)
→ cc or gcc (C compiler)
→ a.out (executable program that will parse the grammar given in gram.y)
174. How YACC Works
(1) Parser generation time:
YACC source (*.y) → yacc → y.tab.c, y.tab.h, y.output
(2) Compile time:
y.tab.c → C compiler/linker → a.out
(3) Run time:
Token stream → a.out → Abstract Syntax Tree
175. Creating, Compiling and Running a
Simple Parser
• Yacc environment
– Yacc processes a yacc specification file and produces
a y.tab.c file.
– An integer function yyparse() is produced by Yacc.
• Calls yylex() to get tokens.
• Return non-zero when an error is found.
• Return 0 if the program is accepted.
– Need main() and yyerror() functions.
176. Steps to create & run/execute the yacc
program
1. Create a file: vi filename.y
2. Type the source code in vi/gedit and save it by
pressing esc, shift, colon, w, q, then
enter (esc+shift+:+w+q).
3. Compile the yacc file (filename.y) to generate the C routines
(y.tab.c and y.tab.h): yacc -d filename.y (the token
definitions are created by -d)
4. Compile cc y.tab.c -ly to generate the output file ./a.out
(object file)
5. % cc y.tab.c -o <execfilename>
177. Parser-Lexer Communication
• To try out our parser, we need a lexer to feed it
tokens.
• When you use a lex scanner and a yacc parser
together, the parser is the higher level routine.
• It calls the lexer yylex() whenever it needs a token
from the input.
• The lexer then scans through the input recognizing
tokens.
• As soon as it finds a token of interest to the parser, it
returns to the parser, returning the token's code as
the value of yylex().
178. Continued..
• Not all tokens are of interest to the parser; in
most programming languages the parser
doesn't want to hear about comments and
whitespace.
• Yacc defines the token names in the parser as
C preprocessor names in y.tab.h
181. Communication between LEX and YACC
The input program (e.g. 12 + 26) goes to YACC's yyparse(),
which calls yylex() in LEX whenever it needs a token.
LEX matches a pattern such as [0-9]+ and answers: the next
token is NUM.
The parser thus sees the token stream NUM '+' NUM.
LEX and YACC share the token definitions.
182. Communication between LEX and
YACC
yacc -d gram.y
will produce:
y.tab.h
• y.tab.h defines each token as an enumeration / #define
• YACC includes y.tab.h in the generated parser
• LEX must #include "y.tab.h" so the scanner returns
the same token codes
183. Communication between LEX and YACC
scanner.l:
%{
#include <stdio.h>
#include "y.tab.h"
%}
id [_a-zA-Z][_a-zA-Z0-9]*
%%
int { return INT; }
char { return CHAR; }
float { return FLOAT; }
{id} { return ID; }
parser.y:
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token CHAR FLOAT ID INT
%%
yacc -d parser.y produces y.tab.h:
#define CHAR 258
#define FLOAT 259
#define ID 260
#define INT 261
185. Yacc Example
• Taken from Lex & Yacc
• Simple calculator
a = 4 + 6
a
a=10
b = 7
c = a + b
c
c = 17
$
186. Creating, Compiling and Running a
Simple Lex & Parser (yacc)
• Lex part:
• % vi ch1-n.l
• % lex ch1-n.l
• Yacc part:
• % vi ch1-m.y
• % yacc -d ch1-m.y
• Compile both the lex & yacc output:
• % cc -c lex.yy.c y.tab.c
• % cc -o example-m.n lex.yy.o y.tab.o -ll
187. Example
% yacc -d ch3-01.y # makes y.tab.c and y.tab.h
% lex ch3-01.l # makes lex.yy.c
% cc -o ch3-01 y.tab.c lex.yy.c -ly -ll # compile and link
C files
% ch3-01
99+12
= 111
% ch3-01
189. Grammar
expression ::= expression '+' term |
expression '-' term |
term
term ::= term '*' factor |
term '/' factor |
factor
factor ::= '(' expression ')' |
'-' factor |
NUMBER |
NAME
194. YACC
• Rules may be recursive
• Rules may be ambiguous*
• Rules may have conflicts
• Uses bottom-up shift/reduce parsing
– Get a token
– Push it onto the stack
– Can it be reduced? (How do we know?)
• If yes: Reduce using a rule
• If no: Get another token
• Yacc cannot look ahead more than one token
phrase -> cart_animal AND CART
| work_animal AND PLOW
…
195. Define an ambiguity and conflicts.
How it arises .
• Ambiguous means there are multiple possible
parses(o/p) for the same input.
• Conflicts mean that yacc can't properly parse a
grammar, probably because it's ambiguous.
• It arises when the precedence and associativity of
operators are not specified.
• Conflicts may arise because of mistakes in input
or logic, or because the grammar rules, while
consistent, require a more complex parser than
Yacc can construct.
197. What Yacc Cannot Parse:
Ambiguity, Unambiguity and Conflicts
• In some cases the grammar is truly ambiguous, that
is, there are two possible parses(o/p) for a single
input string and yacc cannot handle that.
• In others, the grammar is unambiguous, but the
parsing technique that yacc uses is not powerful
enough to parse the grammar.
• The problem in an unambiguous grammar with
conflicts is that the parser would need to look more
than one token ahead to decide which of two
possible parses to use.
• Yacc takes a default action when there is a conflict.
198. Example : Input is HORSE AND CART
phrase : cart_animal AND CART
| work_animal AND PLOW
cart_animal : HORSE | GOAT
work_animal : HORSE | OX
• yacc can't handle this because it requires two
symbols of lookahead; it cannot look ahead more
than one token. It would work if we changed the first rule to this:
phrase : cart_animal CART
| work_animal PLOW
199. Continued..
• The given grammar should not be ambiguous or
contain conflicts.
• Yacc may fail to translate a grammar specification because
the grammar is ambiguous or contains conflicts.
Example of ambiguity: A + B * C can be parsed two ways:
1. ( A + B ) * C
2. A + ( B * C )
Example of a conflict: an 'X' could be either a proga or a progb:
%%
prog: proga | progb ;
proga : 'X' ;
progb : 'X' ;
200. Types of conflicts
Shift/Reduce Conflicts
• A shift/reduce conflict occurs
when there are two possible parses
for an input string, and one of the
parses completes a rule (the
reduce option) and one doesn't
(the shift option).
• Ex:
E : 'X' | E '+' E ;
• For the input string X + X + X there are
two possible parses, "(X+X)+X" or
"X+(X+X)".
Reduce/Reduce Conflicts
• A reduce/reduce conflict occurs
when the same token could
complete two different rules.
• Example:
E : T | E ;
E : id ;
T : id ;
An "id" could either be an E or a T.
201. Disambiguating Rule
• A rule describing which choice to make in a given
situation is called a disambiguating rule.
• Yacc invokes two disambiguating rules by default:
1. In a shift/reduce conflict, the default is to do the
shift.
2. In a reduce/reduce conflict, the default is to reduce
by the earlier grammar rule (in the input sequence).
203. Precedence, Associativity, and Operator
Declarations
• All yacc grammars have to be unambiguous.
• Unambiguous means that is, there is only one possible way
to parse any legal input using the rules in the grammar.
• Ambiguous grammars cause conflicts, situations where
there are two possible parses and hence two different ways
that yacc can process a token.
• When yacc processes an ambiguous grammar, it uses default
rules to decide which way to parse an ambiguous sequence.
• Often these rules do not produce the desired result, so yacc
includes operator declarations that let you change the way it
handles shift/reduce conflicts that result from ambiguous
grammars.
204. Precedence and Associativity
• The rules for determining what operands group
with which operators are known as precedence
and associativity.
Ex :
a = b = c + d / e / f
a = (b = (c + ((d / e) / f)))
205. Precedence
• Precedence controls which operators to execute first
in an expression, or:
• Precedence assigns each operator a precedence
"level."
• In any expression grammar, operators are grouped
into levels of precedence from lowest to highest.
• Operators at higher levels bind more tightly,
e.g., if "*" has higher precedence than "+",
"A+B*C" is treated as "A+(B*C)", while "D*E+F" is
"(D*E)+F".
206. Associativity
• Associativity controls the grouping of operators at
the same precedence level, or:
• Associativity controls how the grammar groups
expressions using the same operator or different
operators with the same precedence: whether they
group from the left, from the right, or not at all.
• If "-" were left associative, the expression "A-B-C"
would mean "(A-B)-C”, while if it were right
associative it would mean "A-(B-C)".
207. How to specify Precedence and Associativity ?
• There are two ways to specify precedence and
associativity in a grammar, implicitly and
explicitly.
• To specify them implicitly, rewrite the
grammar using separate non-terminal symbols
for each precedence level.
208. Declarations
IMPLICITLY
Ex:
expression: expression '+' mlexp
| expression '-' mlexp
| mlexp
;
mlexp: mlexp '*' primary
| mlexp '/' primary
| primary
;
primary: '(' expression ')'
| '-' primary
| NUMBER
;
EXPLICITLY: use the %left, %right, and %nonassoc operator
declarations in the definitions section (see slide 143).
212. Shift/Reduce Conflicts
• shift/reduce conflict
– occurs when a grammar is written in such a way
that a decision between shifting and reducing can
not be made.
– ex: the IF-ELSE ("dangling else") ambiguity.
• To resolve this conflict, yacc will choose to shift.
213. Shift/Reduce Parsing
• When yacc processes a parser, it creates a set of
states each of which reflects a possible position
in one or more partially parsed rules.
Shift parsing
• As the parser reads tokens, each time it reads a
token that doesn't complete a rule it pushes the
token on an internal stack and switches to a new
state reflecting the token it just read.
• This action is called a shift.
214. Reduce parsing
• When it has found all the symbols that constitute
the right-hand side of a rule, it pops the right-hand
side symbols off the stack, pushes the left-hand side
symbol onto the stack, and switches to a new state
reflecting the new symbol on the stack. This action
is called a reduction, since it usually reduces the
number of items on the stack.
• (Not always, since it is possible to have rules with
empty right-hand sides.) Whenever yacc reduces a
rule, it executes user code associated with the rule.
215. Example "fred = 12 + 13"
The parser starts by shifting tokens on to the internal
stack one at a time:
Shift :(PUSH) RULES:
fred
fred =
fred = 12
fred = 12 +
fred = 12 + 13
Now reduce the rule "expression->NUMBER +
NUMBER" so it pops the 12, the plus, and the 13 from
the stack and replaces them with expression
216. Reduce : (POPs)
fred = expression
Now it reduces the rule "statement -> NAME =
expression", so it pops fred, =, and expression and
replaces them with:
statement
It has reached the end of the input and the stack has
been reduced to the start symbol, so the input was
valid according to the grammar.
217. Yacc & Lex Together
• The grammar:
program -> program expr | ε
expr -> expr + expr | expr - expr | id
• program and expr are nonterminals.
• id is a terminal (a token returned by lex).
• An expression may be:
– the sum of two expressions,
– the difference of two expressions,
– or an identifier.
218. When Not to Use Precedence Rules
• You can use precedence rules to fix any
shift/reduce conflict that occurs in the grammar.
• Use precedence in only two situations: in
expression grammars, and to resolve the "dangling
else" conflict in grammars for if-then-else language
constructs.
• Otherwise, if you can, you should fix the grammar
to remove the conflict.
219. How the Parser Works
• Yacc turns the specification file into a C program,
which parses the input according to the
specification given.
• The parser produced by Yacc consists of a finite
state machine with a stack.
• The parser is also capable of reading and
remembering the next input token (called the
lookahead token).
220. Continued ..
• The current state is always the one on the top
of the stack.
• The states of the finite state machine are given
small integer labels; initially, the machine is in
state 0, the stack contains only state 0, and no
lookahead token has been read.
221. The machine has only four actions available to it, called
shift, reduce, accept, and error.
• A move of the parser is done as follows:
1. Based on its current state, the parser decides
whether it needs a lookahead token to decide what
action should be done; if it needs one, and does not
have one, it calls yylex to obtain the next token.
2. Using the current state, and the lookahead token if
needed, the parser decides on its next action, and
carries it out. This may result in states being pushed
onto the stack, or popped off the stack, and in the
lookahead token being processed or left alone.
223. Symbol Values and %union
1.Why not have the lexer return the value of the variable as a double, to
make the parser simpler?
The problem is that there are two contexts where a variable name
can occur: as part of an expression, in which case we want the double
value, and to the left of an equal sign, in which case we need to
remember which variable it is so we can update vbltable.
To define the possible symbol types, in the definition section we add a
%union declaration:
%union {
double dval;
int vblno;
}
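On the lexer side, each rule stores its value into the matching member of `yylval` before returning the token. A sketch in the style of the calculator example (single-letter variable names indexing `vbltable` are assumed):

```lex
[0-9]+(\.[0-9]+)?   { yylval.dval = atof(yytext); return NUMBER; }
[a-z]               { yylval.vblno = *yytext - 'a'; return NAME; }
```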
224. The y.tab.h generated from this grammar:
#define NAME 257
#define NUMBER 258
#define UMINUS 259
typedef union {
double dval;
int vblno;
} YYSTYPE;
extern YYSTYPE yylval;
We have to tell the parser which symbols use which type of value.
%token <vblno> NAME
%token <dval> NUMBER
%type <dval> expression
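With the value types declared, the actions can use `$$`, `$1`, … with the right C types. A sketch (the `vbltable` array is assumed from the surrounding example):

```yacc
statement  : NAME '=' expression        { vbltable[$1] = $3; }
           ;
expression : expression '+' expression  { $$ = $1 + $3; }
           | NUMBER                     { $$ = $1; }
           | NAME                       { $$ = vbltable[$1]; }
           ;
```

Here `$1` in the NAME rules is an int index (vblno), while `$$` and the NUMBER value are doubles (dval), exactly as declared with %token and %type.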
225. Yacc Library
• You can include the library by giving the -ly flag at
the end of the cc command line on UNIX systems,
or the equivalent on other systems.
• main()
• yyerror()
• yyparse()
226. yyerror()
• Whenever a yacc parser detects a syntax error, it
calls yyerror() to report the error to the user,
passing it a single argument, a string describing the
error. (Usually the only error you ever get is "syntax
error.")
• The default version of yyerror in the yacc library
merely prints its argument on the standard output.
Syntax: yyerror(char *s)
{ printf("invalid");
exit(0);
}
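In a C file, a minimal replacement for the library version can be sketched as follows (the int return type matches the traditional declaration; the message format is a choice, not fixed by yacc):

```c
#include <stdio.h>

/* Called by the generated parser with a message such as "syntax error". */
int yyerror(const char *s)
{
    fprintf(stderr, "error: %s\n", s);
    return 0;
}
```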
227. yyparse()
• The entry point to a yacc-generated parser is yyparse().
• When your program calls yyparse(), the parser attempts
to parse an input stream.
• The parser returns a value of zero if the parse succeeds and
non-zero if not.
• Every time you call yyparse(), the parser starts parsing
anew, forgetting whatever state it might have been in the last
time it returned.
Syntax: main()
{
yyparse();
}
231. Lex v/s Yacc
• Lex
– Lex generates C code for a lexical analyzer, or scanner
– Lex uses patterns that match strings in the input and
converts the strings to tokens
• Yacc
– Yacc generates C code for syntax analyzer, or parser.
– Yacc uses grammar rules that allow it to analyze tokens
from Lex and create a syntax tree.
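Putting the two tools together, a typical Unix build looks roughly like the following (file names are illustrative):

```
yacc -d calc.y                        # writes y.tab.c and y.tab.h
lex calc.l                            # writes lex.yy.c
cc lex.yy.c y.tab.c -ly -ll -o calc   # -ly / -ll link the yacc and lex libraries
```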
233. RECOMMENDED QUESTIONS:
1. Give the specification of a yacc program, with an example.
(8)
2. What is a grammar? How does yacc parse a tree? (5)
3. How do you compile a yacc file? (5)
4. Explain the ambiguity occurring in a grammar with an
example. (6)
5. Explain shift/reduce and reduce/reduce conflicts. (8)
6. Write a yacc program to test the validity of an arithmetic
expression. (8)
7. Write a yacc program to accept strings of the form aⁿbⁿ,
n > 0. (8)
Editor's Notes
LR parsers are also known as LR(k) parsers, where L stands for left-to-right scanning of the input stream; R stands for the construction of right-most derivation in reverse, and k denotes the number of lookahead symbols to make decisions.
LL parser is denoted as LL(k). The first L in LL(k) means parsing the input from left to right, the second L stands for left-most derivation, and k represents the number of lookaheads.
Look Ahead LR Parser (LALR) − LALR Parser is Look Ahead LR Parser.