Writing Parsers and Compilers
          with PLY
             David Beazley
        http://www.dabeaz.com

          February 23, 2007
Overview

• Crash course on compilers
• An introduction to PLY
• Notable PLY features (why use it?)
• Experience writing a compiler in Python
Background
• Programs that process other programs
• Compilers
• Interpreters
• Wrapper generators
• Domain-specific languages
• Code-checkers
Example
• Parse and generate assembly code
  /* Compute GCD of two integers */
  fun gcd(x:int, y:int)
      g: int;
      begin
          g := y;
          while x > 0 do
              begin
                  g := x;
                  x := y - (y/x)*x;
                  y := g
              end;
          return g
      end
Compilers 101
• Compilers have multiple phases
• First phase usually concerns "parsing"
• Read program and create abstract representation
/* Compute GCD of two integers */
fun gcd(x:int, y:int)
    g: int;
    begin
        g := y;
        while x > 0 do
            begin
                g := x;
                x := y - (y/x)*x;
                y := g
            end;
        return g
    end

      source text  --->  parser  --->  abstract representation
Compilers 101
• Code generation phase
• Process the abstract representation
• Produce some kind of output
                                        LOAD R1, A
                                        LOAD R2, B
abstract       --->   codegen   --->    ADD R1,R2,R1
representation                          STORE C, R1
                                        ...
Commentary

• There are many advanced details
• Most people care about code generation
• Yet, parsing is often the most annoying problem
• A major focus of tool building
Parsing in a Nutshell
• Lexing : Input is split into tokens
                 b = 40 + 20*(2+3)/37.5


     NAME = NUM + NUM * ( NUM + NUM ) / FLOAT

• Parsing : Applying language grammar rules
                 =
               /   \
           NAME     +
                  /   \
               NUM     /
                     /   \
                    *     FLOAT
                  /   \
               NUM     +
                     /   \
                   NUM   NUM
Lex & Yacc
• Programming tools for writing parsers
• Lex - Lexical analysis (tokenizing)
• Yacc - Yet Another Compiler Compiler (parsing)
• History:
    - Yacc : ~1973. Stephen Johnson (AT&T)
    - Lex : ~1974. Eric Schmidt and Mike Lesk (AT&T)

• Variations of both tools are widely known
• Covered in compilers classes and textbooks
Lex/Yacc Big Picture
• lexer.l contains a token specification:

  /* lexer.l */
  %{
  #include "header.h"
  int lineno = 1;
  %}
  %%
  [ \t]*                  ;  /* Ignore whitespace */
  \n                      { lineno++; }
  [0-9]+                  { yylval.val = atoi(yytext);
                            return NUMBER; }
  [a-zA-Z_][a-zA-Z0-9_]*  { yylval.name = strdup(yytext);
                            return ID; }
  \+                      { return PLUS; }
  -                       { return MINUS; }
  \*                      { return TIMES; }
  /                       { return DIVIDE; }
  =                       { return EQUALS; }
  %%

• parser.y contains a grammar specification:

  /* parser.y */
  %{
  #include "header.h"
  %}
  %union {
      char *name;
      int   val;
  }
  %token PLUS MINUS TIMES DIVIDE EQUALS
  %token<name> ID;
  %token<val> NUMBER;
  %%
  start : ID EQUALS expr;
  expr : expr PLUS term
        | expr MINUS term
        | term
        ;
  ...

• lex compiles lexer.l into lexer.c; yacc compiles parser.y into parser.c
• These link with the remaining passes to form the final compiler:

  lexer.l                      parser.y
     token                       grammar
  specification               specification
      |                            |
     lex                         yacc
      |                            |
   lexer.c                      parser.c
      |                            |
  typecheck.c     codegen.c     otherstuff.c
        \             |             /
                 mycompiler
What is PLY?

• PLY = Python Lex-Yacc
• A Python version of the lex/yacc toolset
• Same functionality as lex/yacc
• But a different interface
• Influences : Unix yacc, SPARK (John Aycock)
Some History
• Late 90's : "Why isn't SWIG written in Python?"
• 2001 : Taught a compilers course. Students
  write a compiler in Python as an experiment.
• 2001 : PLY-1.0 developed and released
• 2001-2005: Occasional maintenance
• 2006 : Major update to PLY-2.x.
PLY Package
• PLY consists of two Python modules
   ply.lex
   ply.yacc


• You simply import the modules to use them
• However, PLY is not a code generator
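• Typical use is just the two imports (as in the examples that follow):

   import ply.lex as lex
   import ply.yacc as yacc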
ply.lex

• A module for writing lexers
• Tokens specified using regular expressions
• Provides functions for reading input text
• An annotated example follows...
ply.lex example
import ply.lex as lex

tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES',
           'DIVIDE','EQUALS' ]

t_ignore = ' \t'

t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_EQUALS = r'='
t_NAME   = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

lex.lex()         # Build the lexer

• The tokens list specifies all of the possible tokens
• Each token has a matching declaration of the form t_TOKNAME,
  and these names must match the names in the tokens list
• Tokens are defined by regular expressions; for simple tokens,
  strings are used
• Functions are used when special action code must execute;
  the docstring holds the regular expression
• t_ignore specifies characters ignored between tokens
  (usually whitespace)
• lex.lex() builds the lexer by creating a master regular expression
• Introspection is used to examine the contents of the calling module:

   __dict__ = {
       'tokens'   : [ 'NAME', ... ],
       't_ignore' : ' \t',
       't_PLUS'   : '\\+',
       ...
       't_NUMBER' : <function ...>
   }
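
• In practice, a spec usually also tracks line numbers and handles
  bad input; a minimal sketch following PLY's conventions:

   def t_newline(t):
       r'\n+'
       t.lexer.lineno += len(t.value)     # Track line numbers

   def t_error(t):
       print "Illegal character '%s'" % t.value[0]
       t.lexer.skip(1)                    # Skip the bad character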
ply.lex use
• Two functions: input() and token()

   ...
   lex.lex()         # Build the lexer
   ...
   lex.input("x = 3 * 4 + 5 * 6")
   while True:
        tok = lex.token()
        if not tok: break
        # Use token
        ...

• input() feeds a string into the lexer
• token() returns the next token or None
• Each token carries attributes:

   tok.type     - token type (the TOKNAME part of t_TOKNAME)
   tok.value    - matching text
   tok.lineno   - line number
   tok.lexpos   - position in input text
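
• For the input above, the loop sees a token stream like this
  (a sketch; LexToken fields are type, value, lineno, lexpos):

   LexToken(NAME,'x',1,0)
   LexToken(EQUALS,'=',1,2)
   LexToken(NUMBER,3,1,4)
   LexToken(TIMES,'*',1,6)
   LexToken(NUMBER,4,1,8)
   LexToken(PLUS,'+',1,10)
   LexToken(NUMBER,5,1,12)
   LexToken(TIMES,'*',1,14)
   LexToken(NUMBER,6,1,16)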
ply.lex Commentary


• Normally you don't use the tokenizer directly
• Instead, it's used by the parser module
ply.yacc preliminaries
• ply.yacc is a module for creating a parser
• Assumes you have defined a BNF grammar
    assign : NAME EQUALS expr
    expr   : expr PLUS term
           | expr MINUS term
           | term
    term   : term TIMES factor
           | term DIVIDE factor
           | factor
    factor : NUMBER
ply.yacc example
import ply.yacc as yacc
import mylexer               # Import lexer information
tokens = mylexer.tokens      # Need token list

def p_assign(p):
    '''assign : NAME EQUALS expr'''

def p_expr(p):
    '''expr : expr PLUS term
            | expr MINUS term
            | term'''

def p_term(p):
    '''term : term TIMES factor
            | term DIVIDE factor
            | factor'''

def p_factor(p):
    '''factor : NUMBER'''

yacc.yacc()               # Build the parser

• Token information is imported from the lexer
• Grammar rules are encoded as functions with names of the form
  p_rulename (the name doesn't matter as long as it starts with p_)
• docstrings contain the grammar rules from the BNF
• yacc.yacc() builds the parser using introspection
ply.yacc parsing
• yacc.parse() function
    yacc.yacc()      # Build the parser
    ...
    data = "x = 3*4+5*6"
    yacc.parse(data)   # Parse some text



• This feeds the data into the lexer
• Parses the text and invokes grammar rules
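• yacc.parse() also returns whatever value the start rule left in
  p[0] (None if the rule functions assign nothing), so a driver can
  capture the result:

    result = yacc.parse(data)   # p[0] of the start rule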
A peek inside

• PLY uses LR parsing (specifically LALR(1))
• AKA: Shift-reduce parsing
• Widely used parsing technique
• Table driven
General Idea
• Input tokens are shifted onto a parsing stack
     Stack                 Input
                           X = 3 *   4   +   5
     NAME                    = 3 *   4   +   5
     NAME =                    3 *   4   +   5
     NAME = NUM                  *   4   +   5



• This continues until a complete grammar rule
  appears on the top of the stack
General Idea
• If rules are found, a "reduction" occurs
    Stack                       Input
                                X = 3 *   4   +   5
    NAME                          = 3 *   4   +   5
    NAME =                          3 *   4   +   5
    NAME = NUM                        *   4   +   5
                 reduce   factor : NUM

    NAME = factor


• RHS of grammar rule replaced with LHS
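
• Carrying the trace forward with the expression grammar used in
  the following slides (a sketch):

     Stack                     Input
     NAME = factor                 *   4   +   5
     NAME = term                   *   4   +   5
     NAME = term *                     4   +   5
     NAME = term * NUM                     +   5
     NAME = term * factor                  +   5
     NAME = term                           +   5
     NAME = expr                           +   5
     ...
     NAME = expr
     assign

• Shifts and reductions alternate until only the start symbol
  (assign) remains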
Rule Functions

• During reduction, rule functions are invoked
    def p_factor(p):
        'factor : NUMBER'


• Parameter p contains grammar symbol values

    def p_factor(p):
        'factor : NUMBER'

          p[0]     p[1]
Using an LR Parser

• Rule functions generally process values on
  right hand side of grammar rule
• Result is then stored in left hand side
• Results propagate up through the grammar
• Bottom-up parsing
Example: Calculator
vars = { }                   # Dictionary of variables

def p_assign(p):
    '''assign : NAME EQUALS expr'''
    vars[p[1]] = p[3]

def p_expr_plus(p):
    '''expr : expr PLUS term'''
    p[0] = p[1] + p[3]

def p_term_mul(p):
    '''term : term TIMES factor'''
    p[0] = p[1] * p[3]

def p_term_factor(p):
    '''term : factor'''
    p[0] = p[1]

def p_factor(p):
    '''factor : NUMBER'''
    p[0] = p[1]
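
With the remaining pass-through rules filled in (e.g., expr : term,
as in the earlier ply.yacc example), parsing an assignment updates
the dictionary:

    yacc.yacc()
    yacc.parse("x = 3*4 + 5*6")
    print vars['x']              # prints 42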
Example: Parse Tree
def p_assign(p):
    '''assign : NAME EQUALS expr'''
    p[0] = ('ASSIGN',p[1],p[3])

def p_expr_plus(p):
    '''expr : expr PLUS term'''
    p[0] = ('+',p[1],p[3])

def p_term_mul(p):
    '''term : term TIMES factor'''
    p[0] = ('*',p[1],p[3])

def p_term_factor(p):
    '''term : factor'''
    p[0] = p[1]

def p_factor(p):
    '''factor : NUMBER'''
    p[0] = ('NUM',p[1])
Example: Parse Tree
>>> t = yacc.parse("x = 3*4 + 5*6")
>>> t
('ASSIGN','x',('+',
                  ('*',('NUM',3),('NUM',4)),
                  ('*',('NUM',5),('NUM',6))
               )
)
>>>

                 ASSIGN
                /      \
             'x'        '+'
                       /   \
                    '*'     '*'
                   /  \    /  \
                  3    4  5    6
Why use PLY?

• There are many Python parsing tools
• Some use more powerful parsing algorithms
• Isn't parsing a "solved" problem anyway?
PLY is Informative
• Compiler writing is hard
• Tools should not make it even harder
• PLY provides extensive diagnostics
• Major emphasis on error reporting
• Provides the same information as yacc
PLY Diagnostics
• PLY produces the same diagnostics as yacc
• Yacc
    % yacc grammar.y
    4 shift/reduce conflicts
    2 reduce/reduce conflicts

• PLY
    % python mycompiler.py
    yacc: Generating LALR parsing table...
    4 shift/reduce conflicts
    2 reduce/reduce conflicts

• PLY also produces the same debugging output
Debugging Output
Grammar

Rule 1   statement -> NAME = expression
Rule 2   statement -> expression
Rule 3   expression -> expression + expression
Rule 4   expression -> expression - expression
Rule 5   expression -> expression * expression
Rule 6   expression -> expression / expression
Rule 7   expression -> NUMBER

Terminals, with rules where they appear

*      : 5
+      : 3
-      : 4
/      : 6
=      : 1
NAME   : 1
NUMBER : 7
error  :

Nonterminals, with rules where they appear

expression : 1 2 3 3 4 4 5 5 6 6
statement  : 0

Parsing method: LALR

state 0

    (0) S' -> . statement
    (1) statement -> . NAME = expression
    (2) statement -> . expression
    (3) expression -> . expression + expression
    (4) expression -> . expression - expression
    (5) expression -> . expression * expression
    (6) expression -> . expression / expression
    (7) expression -> . NUMBER

    NAME        shift and go to state 1
    NUMBER      shift and go to state 2

    expression  shift and go to state 4
    statement   shift and go to state 3

state 1

    (1) statement -> NAME . = expression

    =           shift and go to state 5

state 10

    (1) statement -> NAME = expression .
    (3) expression -> expression . + expression
    (4) expression -> expression . - expression
    (5) expression -> expression . * expression
    (6) expression -> expression . / expression

    $end        reduce using rule 1 (statement -> NAME = expression .)
    +           shift and go to state 7
    -           shift and go to state 6
    *           shift and go to state 8
    /           shift and go to state 9

state 11

    (4) expression -> expression - expression .
    (3) expression -> expression . + expression
    (4) expression -> expression . - expression
    (5) expression -> expression . * expression
    (6) expression -> expression . / expression

  ! shift/reduce conflict for + resolved as shift.
  ! shift/reduce conflict for - resolved as shift.
  ! shift/reduce conflict for * resolved as shift.
  ! shift/reduce conflict for / resolved as shift.

    $end        reduce using rule 4 (expression -> expression - expression .)
    +           shift and go to state 7
    -           shift and go to state 6
    *           shift and go to state 8
    /           shift and go to state 9

  ! +           [ reduce using rule 4 (expression -> expression - expression .) ]
  ! -           [ reduce using rule 4 (expression -> expression - expression .) ]
  ! *           [ reduce using rule 4 (expression -> expression - expression .) ]
  ! /           [ reduce using rule 4 (expression -> expression - expression .) ]
PLY Validation
• PLY validates all token/grammar specs
• Duplicate rules
• Malformed regexes and grammars
• Missing rules and tokens
• Unused tokens and rules
• Improper function declarations
• Infinite recursion
Error Example
import ply.lex as lex
tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES',
           'DIVIDE','EQUALS' ]
t_ignore = ' \t'
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_EQUALS = r'='
t_NAME   = r'[a-zA-Z_][a-zA-Z0-9_]*'
t_MINUS  = r'-'
t_POWER  = r'\^'

def t_NUMBER():
    r'\d+'
    t.value = int(t.value)
    return t

lex.lex()         # Build the lexer

PLY reports each of the mistakes above:

    example.py:12: Rule t_MINUS redefined.
                   Previously defined on line 6

    lex: Rule 't_POWER' defined for an
         unspecified token POWER

    example.py:15: Rule 't_NUMBER' requires
                   an argument.
Commentary

• PLY was developed for classroom use
• Major emphasis on identifying and reporting
  potential problems
• Report errors rather than failing with an exception
PLY is Yacc
• PLY supports all of the major features of
  Unix lex/yacc
• Syntax error handling and synchronization
• Precedence specifiers
• Character literals
• Start conditions
• Inherited attributes
Precedence Specifiers
• Yacc
   %left PLUS MINUS
   %left TIMES DIVIDE
   %nonassoc UMINUS
   ...
   expr : MINUS expr %prec UMINUS {
       $$ = -$2;
   }

• PLY
   precedence = (
      ('left','PLUS','MINUS'),
      ('left','TIMES','DIVIDE'),
      ('nonassoc','UMINUS'),
   )
   def p_expr_uminus(p):
      'expr : MINUS expr %prec UMINUS'
      p[0] = -p[2]
Character Literals
• Yacc
  expr : expr '+' expr    { $$ = $1 + $3; }
       | expr '-' expr    { $$ = $1 - $3; }
       | expr '*' expr    { $$ = $1 * $3; }
       | expr '/' expr    { $$ = $1 / $3; }
       ;


• PLY
  def p_expr(p):
     '''expr : expr '+' expr
             | expr '-' expr
             | expr '*' expr
             | expr '/' expr'''
     ...
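
• One caveat: PLY only recognizes character literals that the lexer
  declares in its literals list; a minimal sketch:

   # In the lexer module:
   literals = ['+','-','*','/']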
Error Productions
• Yacc
  funcall_err : ID LPAREN error RPAREN {
           printf("Syntax error in arguments\n");
       }
       ;


• PLY
  def p_funcall_err(p):
     '''funcall_err : ID LPAREN error RPAREN'''
     print "Syntax error in arguments\n"
Commentary

• Books and documentation on yacc/bison were
  used to guide the development of PLY
• Tried to copy all of the major features
• Usage is as similar to lex/yacc as is reasonable
PLY is Simple
• Two pure-Python modules. That's it.
• Not part of a "parser framework"
• Use doesn't involve exotic design patterns
• Doesn't rely upon C extension modules
• Doesn't rely on third party tools
PLY is Fast
• For a parser written entirely in Python
• Underlying parser is table driven
• Parsing tables are saved (in parsetab.py) and only
  regenerated if the grammar changes
• Considerable work went into optimization
  from the start (developed on a 200MHz PC)
PLY Performance
• Example: Generating the LALR tables
     •   Input: SWIG C++ grammar

     •   459 grammar rules, 892 parser states

     •   3.6 seconds (PLY-2.3, 2.66GHz Intel Xeon)

     •   0.026 seconds (bison/ANSI C)

• Fast enough not to be annoying
• Tables only generated once and reused
PLY Performance
• Parse file with 1000 random expressions
  (805KB) and build an abstract syntax tree
      •   PLY-2.3   : 2.95 sec,     10.2 MB   (Python)

      •   YAPPS2    : 6.57 sec,     32.5 MB   (Python)

      •   PyParsing : 13.11 sec,    15.6 MB   (Python)

      •   ANTLR      : 53.16 sec,   94 MB     (Python)

      •   SPARK      : 235.88 sec, 347 MB     (Python)


• System: Mac Pro 2.66GHz Xeon, Python-2.5
PLY Performance
• Parse file with 1000 random expressions
  (805KB) and build an abstract syntax tree
     •   PLY-2.3   : 2.95 sec,   10.2 MB   (Python)

     •   DParser   : 0.71 sec,   72 MB     (Python/C)

     •   BisonGen : 0.25 sec,    13 MB     (Python/C)

     •   Bison     : 0.063 sec, 7.9 MB     (C)

• 12x slower than BisonGen (mostly C)
• 47x slower than pure C
• System: Mac Pro 2.66GHz Xeon, Python-2.5
Perf. Breakdown
• Parse file with 1000 random expressions
  (805KB) and build an abstract syntax tree
     •   Total time : 2.95 sec

     •   Startup    : 0.02 sec

     •   Lexing     : 1.20 sec

     •   Parsing    : 1.12 sec

     •   AST        : 0.61 sec


• System: Mac Pro 2.66GHz Xeon, Python-2.5
Advanced PLY

• PLY has many advanced features
• Lexers/parsers can be defined as classes
• Support for multiple lexers and parsers
• Support for optimized mode (python -O)
Class Example
import ply.yacc as yacc

class MyParser:
    def p_assign(self,p):
        '''assign : NAME EQUALS expr'''
    def p_expr(self,p):
        '''expr : expr PLUS term
                | expr MINUS term
                | term'''
    def p_term(self,p):
        '''term : term TIMES factor
                | term DIVIDE factor
                | factor'''
    def p_factor(self,p):
        '''factor : NUMBER'''
    def build(self):
        self.parser = yacc.yacc(object=self)
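
A usage sketch (illustrative; assumes the matching lexer has been
built separately):

    m = MyParser()
    m.build()                        # Build the parser
    m.parser.parse("x = 3*4 + 5*6")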
Experience with PLY

• In 2001, I taught a compilers course
• Students wrote a full compiler
• Lexing, parsing, type checking, code generation
• Procedures, nested scopes, and type inference
• Produced working SPARC assembly code
Classroom Results

• You can write a real compiler in Python
• Students were successful with projects
• However, many projects were quite "hacky"
• Still unsure about dynamic nature of Python
• May be too easy to create a "bad" compiler
General PLY Experience

• May be very useful for prototyping
• PLY's strength is in its diagnostics
• Significantly faster than most Python parsers
• Not sure I'd rewrite gcc in Python just yet
• I'm still thinking about SWIG.
Limitations
• LALR(1) parsing
• Not easy to work with very complex grammars
  (e.g., C++ parsing)
• Retains all of yacc's black magic
• Not as powerful as more general parsing
  algorithms (ANTLR, SPARK, etc.)
• Tradeoff : Speed vs. Generality
PLY Usage

• Current version : Ply-2.3
• >100 downloads/week
• People are obviously using it
• Largest project I know of : Ada parser
• Many other small projects
Future Directions
• PLY was written for Python-2.0
• Not yet updated to use modern Python
  features such as iterators and generators
• May update, but not at the expense of
  performance
• Working on some add-ons to ease transition
  between yacc <---> PLY.
Acknowledgements
• Many people have contributed to PLY
    Thad Austin          Adam Kerrison
    Shannon Behrens      Daniel Larraz
    Michael Brown        David McNab
    Russ Cox             Patrick Mezard
    Johan Dahl           Pearu Peterson
    Andrew Dalke         François Pinard
    Michael Dyck         Eric Raymond
    Joshua Gerth         Adam Ring
    Elias Ioup           Rich Salz
    Oldrich Jedlicka     Markus Schoepflin
    Sverre Jørgensen     Christoper Stawarz
    Lee June             Miki Tebeka
    Andreas Jung         Andrew Waters
    Cem Karan

• Apologies to anyone I forgot
Resources

• PLY homepage
     http://www.dabeaz.com/ply


• Mailing list/group
http://groups.google.com/group/ply-hack

Install Stable Diffusion in windows machine
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 

Parsing and Compilers with PLY

  • 1. Writing Parsers and Compilers with PLY David Beazley http://www.dabeaz.com February 23, 2007
  • 2. Overview • Crash course on compilers • An introduction to PLY • Notable PLY features (why use it?) • Experience writing a compiler in Python
  • 3. Background • Programs that process other programs • Compilers • Interpreters • Wrapper generators • Domain-specific languages • Code-checkers
  • 4. Example • Parse and generate assembly code /* Compute GCD of two integers */ fun gcd(x:int, y:int) g: int; begin g := y; while x > 0 do begin g := x; x := y - (y/x)*x; y := g end; return g end
  • 5. Compilers 101 • Compilers have multiple phases • First phase usually concerns "parsing" • Read program and create abstract representation (figure: the GCD source from slide 4 feeding into a parser)
  • 6. Compilers 101 • Code generation phase • Process the abstract representation • Produce some kind of output (figure: codegen emitting LOAD R1, A / LOAD R2, B / ADD R1,R2,R1 / STORE C, R1 / ...)
  • 7. Commentary • There are many advanced details • Most people care about code generation • Yet, parsing is often the most annoying problem • A major focus of tool building
  • 8. Parsing in a Nutshell • Lexing : Input is split into tokens b = 40 + 20*(2+3)/37.5 becomes NAME = NUM + NUM * ( NUM + NUM ) / FLOAT • Parsing : Applying language grammar rules (figure: the resulting parse tree, = at the root with NAME on the left and the expression subtree of +, * and / nodes over the NUM and FLOAT leaves on the right)
  • 9. Lex & Yacc • Programming tools for writing parsers • Lex - Lexical analysis (tokenizing) • Yacc - Yet Another Compiler Compiler (parsing) • History: - Yacc : ~1973. Stephen Johnson (AT&T) - Lex : ~1974. Eric Schmidt and Mike Lesk (AT&T) • Variations of both tools are widely known • Covered in compilers classes and textbooks
  • 10. Lex/Yacc Big Picture lexer.l token specification
  • 11. Lex/Yacc Big Picture lexer.l (token specification), expanded: /* lexer.l */ %{ #include "header.h" int lineno = 1; %} %% [ \t]* ; /* Ignore whitespace */ \n { lineno++; } [0-9]+ { yylval.val = atoi(yytext); return NUMBER; } [a-zA-Z_][a-zA-Z0-9_]* { yylval.name = strdup(yytext); return ID; } \+ { return PLUS; } - { return MINUS; } \* { return TIMES; } / { return DIVIDE; } = { return EQUALS; } %%
  • 12. Lex/Yacc Big Picture lexer.l token specification lex lexer.c
  • 13. Lex/Yacc Big Picture lexer.l parser.y token grammar specification specification lex lexer.c
  • 14. Lex/Yacc Big Picture parser.y (grammar specification), expanded alongside lexer.l and the generated lexer.c: /* parser.y */ %{ #include "header.h" %} %union { char *name; int val; } %token PLUS MINUS TIMES DIVIDE EQUALS %token<name> ID; %token<val> NUMBER; %% start : ID EQUALS expr; expr : expr PLUS term | expr MINUS term | term ; ...
  • 15. Lex/Yacc Big Picture lexer.l parser.y token grammar specification specification lex yacc lexer.c parser.c
  • 16. Lex/Yacc Big Picture lexer.l parser.y token grammar specification specification lex yacc lexer.c parser.c typecheck.c codegen.c otherstuff.c
  • 17. Lex/Yacc Big Picture lexer.l parser.y token grammar specification specification lex yacc lexer.c parser.c typecheck.c codegen.c otherstuff.c mycompiler
  • 18. What is PLY? • PLY = Python Lex-Yacc • A Python version of the lex/yacc toolset • Same functionality as lex/yacc • But a different interface • Influences : Unix yacc, SPARK (John Aycock)
  • 19. Some History • Late 90's : "Why isn't SWIG written in Python?" • 2001 : Taught a compilers course. Students write a compiler in Python as an experiment. • 2001 : PLY-1.0 developed and released • 2001-2005: Occasional maintenance • 2006 : Major update to PLY-2.x.
  • 20. PLY Package • PLY consists of two Python modules ply.lex ply.yacc • You simply import the modules to use them • However, PLY is not a code generator
  • 21. ply.lex • A module for writing lexers • Tokens specified using regular expressions • Provides functions for reading input text • An annotated example follows...
  • 22. ply.lex example import ply.lex as lex tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES','DIVIDE','EQUALS' ] t_ignore = ' \t' t_PLUS = r'\+' t_MINUS = r'-' t_TIMES = r'\*' t_DIVIDE = r'/' t_EQUALS = r'=' t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*' def t_NUMBER(t): r'\d+' t.value = int(t.value) return t lex.lex() # Build the lexer
  • 23. ply.lex example (same code, annotated): the tokens list specifies all of the possible tokens
  • 24. ply.lex example (annotated): each token has a matching declaration of the form t_TOKNAME
  • 25. ply.lex example (annotated): these names must match
  • 26. ply.lex example (annotated): tokens are defined by regular expressions
  • 27. ply.lex example (annotated): for simple tokens, strings are used
  • 28. ply.lex example (annotated): functions are used when special action code must execute
  • 29. ply.lex example (annotated): the docstring holds the regular expression
  • 30. ply.lex example (annotated): t_ignore specifies ignored characters between tokens (usually whitespace)
  • 31. ply.lex example (annotated): lex.lex() builds the lexer by creating a master regular expression
  • 32. ply.lex example (annotated): introspection is used to examine the contents of the calling module
  • 33. ply.lex example (annotated): introspection sees __dict__ = { 'tokens' : [ 'NAME' ...], 't_ignore' : ' \t', 't_PLUS' : '\+', ... 't_NUMBER' : <function ...> }
  • 34. ply.lex use • Two functions: input() and token() ... lex.lex() # Build the lexer ... lex.input("x = 3 * 4 + 5 * 6") while True: tok = lex.token() if not tok: break # Use token ...
  • 35. ply.lex use (annotated): input() feeds a string into the lexer
  • 36. ply.lex use (annotated): token() returns the next token or None
  • 37. ply.lex use (annotated): each token carries tok.type, tok.value, tok.lineno, and tok.lexpos
  • 38. ply.lex use (annotated): tok.type is the token name from the matching rule, e.g. t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
  • 39. ply.lex use (annotated): tok.value is the matching text
  • 40. ply.lex use (annotated): tok.lexpos is the position in the input text
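
  As a worked companion to slides 22-40, here is a minimal, self-contained sketch of the calculator lexer. It is a reconstruction, not the slide code verbatim: the t_newline and t_error handlers are additions, and the line number attribute is tok.lineno in released PLY.

      import ply.lex as lex

      tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES','DIVIDE','EQUALS' ]

      t_ignore  = ' \t'                 # characters skipped between tokens
      t_PLUS    = r'\+'
      t_MINUS   = r'-'
      t_TIMES   = r'\*'
      t_DIVIDE  = r'/'
      t_EQUALS  = r'='
      t_NAME    = r'[a-zA-Z_][a-zA-Z0-9_]*'

      def t_NUMBER(t):
          r'\d+'
          t.value = int(t.value)        # convert the lexeme to an integer
          return t

      def t_newline(t):                 # addition: keep lineno accurate
          r'\n+'
          t.lexer.lineno += len(t.value)

      def t_error(t):                   # addition: report and skip bad input
          print("Illegal character %r" % t.value[0])
          t.lexer.skip(1)

      lexer = lex.lex()
      lexer.input("x = 3 * 4 + 5 * 6")
      for tok in lexer:                 # same as calling token() until None
          print(tok.type, tok.value, tok.lineno, tok.lexpos)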
  • 41. ply.lex Commentary • Normally you don't use the tokenizer directly • Instead, it's used by the parser module
  • 42. ply.yacc preliminaries • ply.yacc is a module for creating a parser • Assumes you have defined a BNF grammar assign : NAME EQUALS expr expr : expr PLUS term | expr MINUS term | term term : term TIMES factor | term DIVIDE factor | factor factor : NUMBER
  • 43. ply.yacc example import ply.yacc as yacc import mylexer # Import lexer information tokens = mylexer.tokens # Need token list def p_assign(p): '''assign : NAME EQUALS expr''' def p_expr(p): '''expr : expr PLUS term | expr MINUS term | term''' def p_term(p): '''term : term TIMES factor | term DIVIDE factor | factor''' def p_factor(p): '''factor : NUMBER''' yacc.yacc() # Build the parser
  • 44. ply.yacc example (same code, annotated): token information is imported from the lexer via tokens = mylexer.tokens
  • 45. ply.yacc example (annotated): grammar rules are encoded as functions named p_rulename (the name doesn't matter as long as it starts with p_)
  • 46. ply.yacc example (annotated): docstrings contain the grammar rules from the BNF
  • 47. ply.yacc example (annotated): yacc.yacc() builds the parser using introspection
  • 48. ply.yacc parsing • yacc.parse() function yacc.yacc() # Build the parser ... data = "x = 3*4+5*6" yacc.parse(data) # Parse some text • This feeds data into lexer • Parses the text and invokes grammar rules
  • 49. A peek inside • PLY uses LR-parsing. LALR(1) • AKA: Shift-reduce parsing • Widely used parsing technique • Table driven
  • 50. General Idea • Input tokens are shifted onto a parsing stack (stack / remaining input): (empty) / X = 3 * 4 + 5; NAME / = 3 * 4 + 5; NAME = / 3 * 4 + 5; NAME = NUM / * 4 + 5 • This continues until a complete grammar rule appears on the top of the stack
  • 51. General Idea • If rules are found, a "reduction" occurs: with NAME = NUM on the stack and * 4 + 5 remaining, reduce by factor : NUM to get NAME = factor • RHS of grammar rule replaced with LHS
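
  To watch these shift/reduce steps on real input, PLY can emit a trace while parsing. A hedged sketch: in PLY 3.x the debug argument accepts a logger, while older releases took a simple flag instead.

      import logging
      logging.basicConfig(level=logging.DEBUG)

      parser = yacc.yacc()
      parser.parse("x = 3 * 4 + 5", debug=logging.getLogger())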
  • 52. Rule Functions • During reduction, rule functions are invoked def p_factor(p): 'factor : NUMBER' • Parameter p contains grammar symbol values: p[0] is the factor on the left-hand side, p[1] the NUMBER on the right
  • 53. Using an LR Parser • Rule functions generally process values on right hand side of grammar rule • Result is then stored in left hand side • Results propagate up through the grammar • Bottom-up parsing
  • 54. Example: Calculator def p_assign(p): '''assign : NAME EQUALS expr''' vars[p[1]] = p[3] def p_expr_plus(p): '''expr : expr PLUS term''' p[0] = p[1] + p[3] def p_term_mul(p): '''term : term TIMES factor''' p[0] = p[1] * p[3] def p_term_factor(p): '''term : factor''' p[0] = p[1] def p_factor(p): '''factor : NUMBER''' p[0] = p[1]
  • 55. Example: Parse Tree def p_assign(p): '''assign : NAME EQUALS expr''' p[0] = ('ASSIGN',p[1],p[3]) def p_expr_plus(p): '''expr : expr PLUS term''' p[0] = ('+',p[1],p[3]) def p_term_mul(p): '''term : term TIMES factor''' p[0] = ('*',p[1],p[3]) def p_term_factor(p): '''term : factor''' p[0] = p[1] def p_factor(p): '''factor : NUMBER''' p[0] = ('NUM',p[1])
  • 56. Example: Parse Tree >>> t = yacc.parse("x = 3*4 + 5*6") >>> t ('ASSIGN','x',('+', ('*',('NUM',3),('NUM',4)), ('*',('NUM',5),('NUM',6)) ) ) >>> (figure: the same result drawn as a tree, ASSIGN at the root over 'x' and '+', with two '*' subtrees over 3, 4 and 5, 6)
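
  Putting slides 42-56 together, a sketch of the complete grammar half of the file. It assumes the lexer sketch shown after slide 40 sits in the same module; the MINUS/DIVIDE rules and p_error are additions the slides elide.

      import ply.yacc as yacc

      def p_assign(p):
          '''assign : NAME EQUALS expr'''
          p[0] = ('ASSIGN', p[1], p[3])

      def p_expr_plus(p):
          '''expr : expr PLUS term'''
          p[0] = ('+', p[1], p[3])

      def p_expr_minus(p):
          '''expr : expr MINUS term'''
          p[0] = ('-', p[1], p[3])

      def p_expr_term(p):
          '''expr : term'''
          p[0] = p[1]

      def p_term_mul(p):
          '''term : term TIMES factor'''
          p[0] = ('*', p[1], p[3])

      def p_term_div(p):
          '''term : term DIVIDE factor'''
          p[0] = ('/', p[1], p[3])

      def p_term_factor(p):
          '''term : factor'''
          p[0] = p[1]

      def p_factor(p):
          '''factor : NUMBER'''
          p[0] = ('NUM', p[1])

      def p_error(p):                   # addition: invoked on a bad token
          print("Syntax error at", p)

      parser = yacc.yacc()
      print(parser.parse("x = 3*4 + 5*6"))
      # -> ('ASSIGN', 'x', ('+', ('*', ('NUM', 3), ('NUM', 4)),
      #                          ('*', ('NUM', 5), ('NUM', 6))))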
  • 57. Why use PLY? • There are many Python parsing tools • Some use more powerful parsing algorithms • Isn't parsing a "solved" problem anyways?
  • 58. PLY is Informative • Compiler writing is hard • Tools should not make it even harder • PLY provides extensive diagnostics • Major emphasis on error reporting • Provides the same information as yacc
  • 59. PLY Diagnostics • PLY produces the same diagnostics as yacc • Yacc % yacc grammar.y 4 shift/reduce conflicts 2 reduce/reduce conflicts • PLY % python mycompiler.py yacc: Generating LALR parsing table... 4 shift/reduce conflicts 2 reduce/reduce conflicts • PLY also produces the same debugging output
  • 60. Debugging Output (the parser.out file, columns untangled): Grammar: Rule 1 statement -> NAME = expression; Rule 2 statement -> expression; Rule 3 expression -> expression + expression; Rule 4 expression -> expression - expression; Rule 5 expression -> expression * expression; Rule 6 expression -> expression / expression; Rule 7 expression -> NUMBER. Terminals, with rules where they appear: * : 5, + : 3, - : 4, / : 6, = : 1, NAME : 1, NUMBER : 7, error :. Nonterminals, with rules where they appear: expression : 1 2 3 3 4 4 5 5 6 6, statement : 0. Parsing method: LALR. state 0: (0) S' -> . statement, (1) statement -> . NAME = expression, (2) statement -> . expression, plus the dotted expression rules; NAME shift and go to state 1, NUMBER shift and go to state 2, expression go to state 4, statement go to state 3. state 1: (1) statement -> NAME . = expression; = shift and go to state 5. state 10: (1) statement -> NAME = expression . plus the dotted expression rules; $end reduce using rule 1, while +, -, *, / shift and go to states 7, 6, 8, 9. state 11: (4) expression -> expression - expression . plus the dotted expression rules; shift/reduce conflicts for +, -, *, / all resolved as shift; $end reduce using rule 4.
  • 61. Debugging Output (zoomed detail of the same parser.out): state 11 carries (4) expression -> expression - expression . together with the dotted items for +, -, * and /; each shift/reduce conflict is resolved as shift, with the reduce by rule 4 listed as the rejected alternative.
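
  To reproduce this dump for your own grammar, build the parser with debugging enabled; a hedged sketch (debug output has been on by default in many releases, and the exact file location can vary):

      parser = yacc.yacc(debug=True)
      # PLY writes parser.out next to the script: every grammar rule,
      # every LALR state, and how each shift/reduce conflict was resolved.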
  • 62. PLY Validation • PLY validates all token/grammar specs • Duplicate rules • Malformed regexs and grammars • Missing rules and tokens • Unused tokens and rules • Improper function declarations • Infinite recursion
  • 63. Error Example import ply.lex as lex tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES','DIVIDE','EQUALS' ] t_ignore = ' \t' t_PLUS = r'\+' t_MINUS = r'-' t_TIMES = r'\*' t_DIVIDE = r'/' t_EQUALS = r'=' t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*' t_MINUS = r'-' t_POWER = r'\^' def t_NUMBER(): r'\d+' t.value = int(t.value) return t lex.lex() # Build the lexer. PLY reports: example.py:12: Rule t_MINUS redefined. Previously defined on line 6
  • 64. Error Example (same code): lex: Rule 't_POWER' defined for an unspecified token POWER
  • 65. Error Example (same code): example.py:15: Rule 't_NUMBER' requires an argument.
  • 66. Commentary • PLY was developed for classroom use • Major emphasis on identifying and reporting potential problems • Report errors rather than fail with an exception
  • 67. PLY is Yacc • PLY supports all of the major features of Unix lex/yacc • Syntax error handling and synchronization • Precedence specifiers • Character literals • Start conditions • Inherited attributes
  • 68. Precedence Specifiers • Yacc %left PLUS MINUS %left TIMES DIVIDE %nonassoc UMINUS ... expr : MINUS expr %prec UMINUS { $$ = -$2; } • PLY precedence = ( ('left','PLUS','MINUS'), ('left','TIMES','DIVIDE'), ('nonassoc','UMINUS'), ) def p_expr_uminus(p): 'expr : MINUS expr %prec UMINUS' p[0] = -p[2]
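
  A sketch of why the table matters: with precedence declared, the expr/term/factor layering from slide 42 can collapse into one deliberately ambiguous rule that PLY disambiguates from the table. Token names follow slide 22; the operator dispatch is our own illustration.

      precedence = (
          ('left', 'PLUS', 'MINUS'),
          ('left', 'TIMES', 'DIVIDE'),
          ('nonassoc', 'UMINUS'),
      )

      def p_expr_binop(p):
          '''expr : expr PLUS expr
                  | expr MINUS expr
                  | expr TIMES expr
                  | expr DIVIDE expr'''
          # p[2] holds the matched operator text: '+', '-', '*' or '/'
          if   p[2] == '+': p[0] = p[1] + p[3]
          elif p[2] == '-': p[0] = p[1] - p[3]
          elif p[2] == '*': p[0] = p[1] * p[3]
          else:             p[0] = p[1] / p[3]

      def p_expr_uminus(p):
          "expr : MINUS expr %prec UMINUS"
          p[0] = -p[2]

      def p_expr_number(p):
          "expr : NUMBER"
          p[0] = p[1]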
  • 69. Character Literals • Yacc expr : expr '+' expr { $$ = $1 + $3; } | expr '-' expr { $$ = $1 - $3; } | expr '*' expr { $$ = $1 * $3; } | expr '/' expr { $$ = $1 / $3; } ; • PLY def p_expr(p): '''expr : expr '+' expr | expr '-' expr | expr '*' expr | expr '/' expr''' ...
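
  One detail the slide leaves implicit: in PLY the lexer must say which single characters may be used as literals, via a module-level literals declaration. A sketch; the grammar rule shown is our own minimal example.

      literals = ['+', '-', '*', '/']   # declared in the lexer module

      def p_expr_plus(p):               # the literal's character is its token type
          "expr : expr '+' expr"
          p[0] = p[1] + p[3]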
  • 70. Error Productions • Yacc funcall_err : ID LPAREN error RPAREN { printf("Syntax error in arguments\n"); } ; • PLY def p_funcall_err(p): '''funcall_err : ID LPAREN error RPAREN''' print "Syntax error in arguments\n"
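
  Besides error productions, PLY also supports a global error hook, the counterpart of yacc's yyerror; a minimal sketch (the message wording is ours):

      def p_error(p):
          # p is the offending LexToken, or None at end-of-input
          if p:
              print("Syntax error at %r (line %d)" % (p.value, p.lineno))
          else:
              print("Syntax error at end of input")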
  • 71. Commentary • Books and documentation on yacc/bison used to guide the development of PLY • Tried to copy all of the major features • Usage as similar to lex/yacc as reasonable
  • 72. PLY is Simple • Two pure-Python modules. That's it. • Not part of a "parser framework" • Use doesn't involve exotic design patterns • Doesn't rely upon C extension modules • Doesn't rely on third party tools
  • 73. PLY is Fast • For a parser written entirely in Python • Underlying parser is table driven • Parsing tables are saved and only regenerated if the grammar changes • Considerable work went into optimization from the start (developed on a 200MHz PC)
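
  The table caching is controllable when the parser is built; a hedged sketch (keyword names as in the PLY documentation; defaults have shifted between releases):

      # Tables are cached in a generated module (parsetab.py by default)
      # and rebuilt only when the grammar's signature changes.
      parser = yacc.yacc(tabmodule='calc_parsetab',  # cache module name
                         outputdir='build',          # where to write it
                         debug=False)                # skip writing parser.out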
  • 74. PLY Performance • Example: Generating the LALR tables • Input: SWIG C++ grammar • 459 grammar rules, 892 parser states • 3.6 seconds (PLY-2.3, 2.66Ghz Intel Xeon) • 0.026 seconds (bison/ANSI C) • Fast enough not to be annoying • Tables only generated once and reused
  • 75. PLY Performance • Parse file with 1000 random expressions (805KB) and build an abstract syntax tree • PLY-2.3 : 2.95 sec, 10.2 MB (Python) • YAPPS2 : 6.57 sec, 32.5 MB (Python) • PyParsing : 13.11 sec, 15.6 MB (Python) • ANTLR : 53.16 sec, 94 MB (Python) • SPARK : 235.88 sec, 347 MB (Python) • System: MacPro 2.66Ghz Xeon, Python-2.5
  • 76. PLY Performance • Parse file with 1000 random expressions (805KB) and build an abstract syntax tree • PLY-2.3 : 2.95 sec, 10.2 MB (Python) • DParser : 0.71 sec, 72 MB (Python/C) • BisonGen : 0.25 sec, 13 MB (Python/C) • Bison : 0.063 sec, 7.9 MB (C) • 12x slower than BisonGen (mostly C) • 47x slower than pure C • System: MacPro 2.66Ghz Xeon, Python-2.5
  • 77. Perf. Breakdown • Parse file with 1000 random expressions (805KB) and build an abstract syntax tree • Total time : 2.95 sec • Startup : 0.02 sec • Lexing : 1.20 sec • Parsing : 1.12 sec • AST : 0.61 sec • System: MacPro 2.66Ghz Xeon, Python-2.5
  • 78. Advanced PLY • PLY has many advanced features • Lexers/parsers can be defined as classes • Support for multiple lexers and parsers • Support for optimized mode (python -O)
  • 79. Class Example import ply.yacc as yacc class MyParser: def p_assign(self,p): '''assign : NAME EQUALS expr''' def p_expr(self,p): '''expr : expr PLUS term | expr MINUS term | term''' def p_term(self,p): '''term : term TIMES factor | term DIVIDE factor | factor''' def p_factor(self,p): '''factor : NUMBER''' def build(self): self.parser = yacc.yacc(object=self)
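
  A hedged usage sketch for the class on slide 79. Assumptions: the class additionally carries a tokens attribute (e.g. tokens = mylexer.tokens, with mylexer the lexer module from slide 43), since yacc.yacc(object=self) introspects the object the same way it would a module.

      m = MyParser()
      m.build()                         # yacc.yacc(object=self) reads the p_* methods
      result = m.parser.parse("x = 3*4 + 5*6")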
  • 80. Experience with PLY • In 2001, I taught a compilers course • Students wrote a full compiler • Lexing, parsing, type checking, code generation • Procedures, nested scopes, and type inference • Produced working SPARC assembly code
  • 81. Classroom Results • You can write a real compiler in Python • Students were successful with projects • However, many projects were quite "hacky" • Still unsure about dynamic nature of Python • May be too easy to create a "bad" compiler
  • 82. General PLY Experience • May be very useful for prototyping • PLY's strength is in its diagnostics • Significantly faster than most Python parsers • Not sure I'd rewrite gcc in Python just yet • I'm still thinking about SWIG.
  • 83. Limitations • LALR(1) parsing • Not easy to work with very complex grammars (e.g., C++ parsing) • Retains all of yacc's black magic • Not as powerful as more general parsing algorithms (ANTLR, SPARK, etc.) • Tradeoff : Speed vs. Generality
  • 84. PLY Usage • Current version : Ply-2.3 • >100 downloads/week • People are obviously using it • Largest project I know of : Ada parser • Many other small projects
  • 85. Future Directions • PLY was written for Python-2.0 • Not yet updated to use modern Python features such as iterators and generators • May update, but not at the expense of performance • Working on some add-ons to ease transition between yacc <---> PLY.
  • 86. Acknowledgements • Many people have contributed to PLY Thad Austin Adam Kerrison Shannon Behrens Daniel Larraz Michael Brown David McNab Russ Cox Patrick Mezard Johan Dahl Pearu Peterson Andrew Dalke François Pinard Michael Dyck Eric Raymond Joshua Gerth Adam Ring Elias Ioup Rich Salz Oldrich Jedlicka Markus Schoepflin Sverre Jørgensen Christoper Stawarz Lee June Miki Tebeka Andreas Jung Andrew Waters Cem Karan • Apologies to anyone I forgot
  • 87. Resources • PLY homepage http://www.dabeaz.com/ply • Mailing list/group http://groups.google.com/group/ply-hack