Writing Parsers and Compilers
          with PLY

         David Beazley
     http://www.dabeaz.com

       February 23, 2007
Character Literals
• Yacc
  expr : expr '+' expr   { $$ = $1 + $3; }
       | expr '-' expr   {...
Error Productions
• Yacc
  funcall_err : ID LPAREN error RPAREN {
           printf("Syntax error in arguments\n");
       ...
Commentary

• Books and documentation on yacc/bison
  used to guide the development of PLY
• Tried to copy all of the majo...
PLY is Simple
• Two pure-Python modules. That's it.
• Not part of a "parser framework"
• Use doesn't involve exotic design...
PLY is Fast
• For a parser written entirely in Python
• Underlying parser is table driven
• Parsing tables are saved and o...
PLY Performance
• Example: Generating the LALR tables
   • Input: SWIG C++ grammar
   • 459 grammar rules, 892 pa...
PLY Performance
• Parse file with 1000 random expressions
  (805KB) and build an abstract syntax tree
      •   PLY-2.3   :...
Perf. Breakdown
• Parse file with 1000 random expressions
  (805KB) and build an abstract syntax tree
     •   Total time :...
Advanced PLY

• PLY has many advanced features
• Lexers/parsers can be defined as classes
• Support for multiple lexers and...
Class Example
import ply.yacc as yacc

class MyParser:
    def p_assign(self,p):
        '''assign : NAME EQUALS expr'''
    ...
Experience with PLY

• In 2001, I taught a compilers course
• Students wrote a full compiler
• Lexing, parsing, type check...
Classroom Results

• You can write a real compiler in Python
• Students were successful with projects
• However, many proj...
General PLY Experience

• May be very useful for prototyping
• PLY's strength is in its diagnostics
• Significantly faster ...
Limitations
• LALR(1) parsing
• Not easy to work with very complex grammars
  (e.g., C++ parsing)
• Retains all of yacc's ...
PLY Usage

• Current version : Ply-2.3
• >100 downloads/week
• People are obviously using it
• Largest project I know of :...
Future Directions
• PLY was written for Python-2.0
• Not yet updated to use modern Python
  features such as iterators and...
Acknowledgements
• Many people have contributed to PLY
    Thad Austin          Adam Kerrison
    Shannon Behrens      Dan...
Resources

• PLY homepage
  http://www.dabeaz.com/ply

• Mailing list/group
  http://groups.google.com/group/ply-hack
Writing Parsers and Compilers with PLY

Presentation from PyCon'2007, February 23, 2007. Dallas, Texas.
  1. 1. Writing Parsers and Compilers with PLY David Beazley http://www.dabeaz.com February 23, 2007
  2. 2. Overview • Crash course on compilers • An introduction to PLY • Notable PLY features (why use it?) • Experience writing a compiler in Python
  3. 3. Background • Programs that process other programs • Compilers • Interpreters • Wrapper generators • Domain-specific languages • Code-checkers
  4. 4. Example • Parse and generate assembly code /* Compute GCD of two integers */ fun gcd(x:int, y:int) g: int; begin g := y; while x > 0 do begin g := x; x := y - (y/x)*x; y := g end; return g end
  5. 5. Compilers 101 • Compilers have multiple phases • First phase usually concerns "parsing" • Read program and create abstract representation /* Compute GCD of two integers */ fun gcd(x:int, y:int) g: int; begin g := y; parser while x > 0 do begin g := x; x := y - (y/x)*x; y := g end; return g end
  6. 6. Compilers 101 • Code generation phase • Process the abstract representation • Produce some kind of output LOAD R1, A LOAD R2, B codegen ADD R1,R2,R1 STORE C, R1 ...
  7. 7. Commentary • There are many advanced details • Most people care about code generation • Yet, parsing is often the most annoying problem • A major focus of tool building
  8. 8. Parsing in a Nutshell • Lexing : Input is split into tokens

       b = 40 + 20*(2+3)/37.5

       NAME = NUM + NUM * ( NUM + NUM ) / FLOAT

     • Parsing : Applying language grammar rules

                  =
                /   \
            NAME     +
                   /   \
                NUM     /
                      /   \
                     *     FLOAT
                   /   \
                NUM     +
                      /   \
                   NUM     NUM
  9. 9. Lex & Yacc • Programming tools for writing parsers • Lex - Lexical analysis (tokenizing) • Yacc - Yet Another Compiler Compiler (parsing) • History: - Yacc : ~1973. Stephen Johnson (AT&T) - Lex : ~1974. Eric Schmidt and Mike Lesk (AT&T) • Variations of both tools are widely known • Covered in compilers classes and textbooks
  10. 10. Lex/Yacc Big Picture lexer.l token specification
  11. 11. Lex/Yacc Big Picture: lexer.l (token specification)

      /* lexer.l */
      %{
      #include "header.h"
      int lineno = 1;
      %}
      %%
      [ \t]*     ;   /* Ignore whitespace */
      \n         { lineno++; }
      [0-9]+     { yylval.val = atoi(yytext); return NUMBER; }
      [a-zA-Z_][a-zA-Z0-9_]*  { yylval.name = strdup(yytext);
                                return ID; }
      \+         { return PLUS; }
      -          { return MINUS; }
      \*         { return TIMES; }
      \/         { return DIVIDE; }
      =          { return EQUALS; }
      %%
  12. 12. Lex/Yacc Big Picture lexer.l token specification lex lexer.c
  13. 13. Lex/Yacc Big Picture lexer.l parser.y token grammar specification specification lex lexer.c
  14. 14. Lex/Yacc Big Picture: parser.y (grammar specification)

      /* parser.y */
      %{
      #include "header.h"
      %}
      %union {
          char *name;
          int val;
      }
      %token PLUS MINUS TIMES DIVIDE EQUALS
      %token<name> ID;
      %token<val> NUMBER;
      %%
      start : ID EQUALS expr;
      expr  : expr PLUS term
            | expr MINUS term
            | term
            ;
      ...
  15. 15. Lex/Yacc Big Picture

      lexer.l                    parser.y
      (token specification)      (grammar specification)
          |                          |
         lex                        yacc
          |                          |
      lexer.c                    parser.c

  16. 16. Lex/Yacc Big Picture

      lexer.c  parser.c  typecheck.c  codegen.c  otherstuff.c

  17. 17. Lex/Yacc Big Picture

      lexer.c  parser.c  typecheck.c  codegen.c  otherstuff.c
                              |
                         mycompiler
  18. 18. What is PLY? • PLY = Python Lex-Yacc • A Python version of the lex/yacc toolset • Same functionality as lex/yacc • But a different interface • Influences : Unix yacc, SPARK (John Aycock)
  19. 19. Some History • Late 90's : "Why isn't SWIG written in Python?" • 2001 : Taught a compilers course. Students write a compiler in Python as an experiment. • 2001 : PLY-1.0 developed and released • 2001-2005: Occasional maintenance • 2006 : Major update to PLY-2.x.
  20. 20. PLY Package • PLY consists of two Python modules ply.lex ply.yacc • You simply import the modules to use them • However, PLY is not a code generator
  21. 21. ply.lex • A module for writing lexers • Tokens specified using regular expressions • Provides functions for reading input text • An annotated example follows...
  22. 22. ply.lex example

      import ply.lex as lex

      tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES',
                 'DIVIDE','EQUALS' ]

      t_ignore = ' \t'

      t_PLUS   = r'\+'
      t_MINUS  = r'-'
      t_TIMES  = r'\*'
      t_DIVIDE = r'/'
      t_EQUALS = r'='
      t_NAME   = r'[a-zA-Z_][a-zA-Z0-9_]*'

      def t_NUMBER(t):
          r'\d+'
          t.value = int(t.value)
          return t

      lex.lex()    # Build the lexer

      Notes:
      • The tokens list specifies all of the possible tokens
      • Each token has a matching declaration of the form t_TOKNAME,
        and these names must match the entries in tokens
      • Tokens are defined by regular expressions; for simple tokens,
        strings are used
      • Functions are used when special action code must execute;
        the docstring holds the regular expression
      • t_ignore specifies ignored characters between tokens
        (usually whitespace)
      • lex.lex() builds the lexer by creating a master regular
        expression; introspection is used to examine the contents
        of the calling module:
            __dict__ = {
              'tokens'   : ['NAME', ...],
              't_ignore' : ' \t',
              't_PLUS'   : '\\+',
              ...
              't_NUMBER' : <function ...> }
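The "master regular expression" trick that lex.lex() performs can be sketched with nothing but the re module. The names below (TOKEN_SPECS, MASTER, tokenize) are invented for illustration; this is not PLY's implementation, just the same idea: each token rule becomes a named group in one combined pattern.

```python
import re

# One named group per token rule, joined into a single master pattern.
# Order matters for overlapping patterns (the first alternative wins).
TOKEN_SPECS = [
    ('NUMBER', r'\d+'),
    ('NAME',   r'[a-zA-Z_][a-zA-Z0-9_]*'),
    ('PLUS',   r'\+'),
    ('MINUS',  r'-'),
    ('TIMES',  r'\*'),
    ('DIVIDE', r'/'),
    ('EQUALS', r'='),
]
MASTER = re.compile('|'.join('(?P<%s>%s)' % spec for spec in TOKEN_SPECS))

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] in ' \t':            # the t_ignore equivalent
            pos += 1
            continue
        m = MASTER.match(text, pos)
        if not m:
            raise SyntaxError('Illegal character %r at %d' % (text[pos], pos))
        value = m.group()
        if m.lastgroup == 'NUMBER':       # special action, like t_NUMBER
            value = int(value)
        tokens.append((m.lastgroup, value))
        pos = m.end()
    return tokens
```

For example, tokenize("x = 3*4") yields [('NAME', 'x'), ('EQUALS', '='), ('NUMBER', 3), ('TIMES', '*'), ('NUMBER', 4)].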
  34. 34. ply.lex use • Two functions: input() and token()

      lex.lex()       # Build the lexer
      ...
      lex.input("x = 3 * 4 + 5 * 6")
      while True:
          tok = lex.token()
          if not tok: break
          # Use token
          ...

      Notes:
      • input() feeds a string into the lexer
      • token() returns the next token, or None at the end of input
      • Each token carries tok.type, tok.value, tok.lineno, and tok.lexpos
      • tok.type is the token name (e.g. NAME for the rule
        t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'); tok.value is the matching text
      • tok.lexpos is the position in the input text
  41. 41. ply.lex Commentary • Normally you don't use the tokenizer directly • Instead, it's used by the parser module
  42. 42. ply.yacc preliminaries • ply.yacc is a module for creating a parser • Assumes you have defined a BNF grammar assign : NAME EQUALS expr expr : expr PLUS term | expr MINUS term | term term : term TIMES factor | term DIVIDE factor | factor factor : NUMBER
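Before looking at how ply.yacc handles this grammar, it can help to see the same BNF accepted by a hand-written top-down (recursive-descent) parser. This is only a contrast sketch, not what yacc does (yacc generates a bottom-up, table-driven parser); the Parser class and the (type, value) token tuples are invented for illustration.

```python
class Parser:
    """Recursive-descent parser for the assign/expr/term/factor BNF."""
    def __init__(self, tokens):
        self.tokens = list(tokens)   # [(type, value), ...]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def take(self, ttype):
        tok = self.tokens[self.pos]
        assert tok[0] == ttype, 'expected %s, got %s' % (ttype, tok[0])
        self.pos += 1
        return tok[1]

    def assign(self):                # assign : NAME EQUALS expr
        name = self.take('NAME')
        self.take('EQUALS')
        return ('ASSIGN', name, self.expr())

    def expr(self):                  # expr : expr PLUS term | expr MINUS term | term
        node = self.term()           # (left recursion rewritten as a loop)
        while self.peek() in ('PLUS', 'MINUS'):
            op = '+' if self.peek() == 'PLUS' else '-'
            self.pos += 1
            node = (op, node, self.term())
        return node

    def term(self):                  # term : term TIMES factor | term DIVIDE factor | factor
        node = self.factor()
        while self.peek() in ('TIMES', 'DIVIDE'):
            op = '*' if self.peek() == 'TIMES' else '/'
            self.pos += 1
            node = (op, node, self.factor())
        return node

    def factor(self):                # factor : NUMBER
        return ('NUM', self.take('NUMBER'))
```

Feeding it the token stream for "x = 3*4 + 5*6" produces the same tuple tree that the parse-tree example later builds inside yacc rule functions.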
  43. 43. ply.yacc example

      import ply.yacc as yacc
      import mylexer            # Import lexer information
      tokens = mylexer.tokens   # Need token list

      def p_assign(p):
          '''assign : NAME EQUALS expr'''

      def p_expr(p):
          '''expr : expr PLUS term
                  | expr MINUS term
                  | term'''

      def p_term(p):
          '''term : term TIMES factor
                  | term DIVIDE factor
                  | factor'''

      def p_factor(p):
          '''factor : NUMBER'''

      yacc.yacc()    # Build the parser

      Notes:
      • The token list is imported from the lexer module
      • Grammar rules are encoded as functions with names p_rulename
        (the name doesn't matter as long as it starts with p_)
      • docstrings contain the grammar rules from the BNF
      • yacc.yacc() builds the parser using introspection
  48. 48. ply.yacc parsing • yacc.parse() function yacc.yacc() # Build the parser ... data = "x = 3*4+5*6" yacc.parse(data) # Parse some text • This feeds data into lexer • Parses the text and invokes grammar rules
  49. 49. A peek inside • PLY uses LR-parsing. LALR(1) • AKA: Shift-reduce parsing • Widely used parsing technique • Table driven
  50. 50. General Idea • Input tokens are shifted onto a parsing stack

      Stack                  Input
                             X = 3 * 4 + 5
      NAME                   = 3 * 4 + 5
      NAME =                 3 * 4 + 5
      NAME = NUM             * 4 + 5

      • This continues until a complete grammar rule appears on the
        top of the stack

  51. 51. General Idea • If rules are found, a "reduction" occurs

      Stack                  Input
                             X = 3 * 4 + 5
      NAME                   = 3 * 4 + 5
      NAME =                 3 * 4 + 5
      NAME = NUM             * 4 + 5
                reduce factor : NUM
      NAME = factor          * 4 + 5

      • RHS of grammar rule replaced with LHS
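The shift/reduce mechanics can be made concrete with a toy hand-coded evaluator for the expr/term/factor part of the grammar. Real yacc/PLY drives this loop from generated LALR tables; here the reductions and the single-token lookahead checks are written out by hand, and all names (shift_reduce, the token format) are invented for illustration.

```python
def shift_reduce(tokens):
    """Evaluate tokens like [('NUM', 3), '*', ('NUM', 4)] with an explicit
    parsing stack: shift tokens, reduce when a rule's RHS is on top."""
    stack, pos = [], 0
    while True:
        look = tokens[pos] if pos < len(tokens) else '$'   # 1-token lookahead
        top = stack[-1] if stack else None
        if top and top[0] == 'NUM':                         # factor : NUMBER
            stack[-1] = ('factor', top[1])
        elif (len(stack) >= 3 and stack[-3][0] == 'term'
              and stack[-2] == '*' and top[0] == 'factor'):
            stack[-3:] = [('term', stack[-3][1] * top[1])]  # term : term TIMES factor
        elif top and top[0] == 'factor':                    # term : factor
            stack[-1] = ('term', top[1])
        elif (len(stack) >= 3 and stack[-3][0] == 'expr'
              and stack[-2] == '+' and top[0] == 'term' and look != '*'):
            stack[-3:] = [('expr', stack[-3][1] + top[1])]  # expr : expr PLUS term
        elif top and top[0] == 'term' and look != '*':      # expr : term
            stack[-1] = ('expr', top[1])                    # (wait for * first)
        elif look != '$':
            stack.append(look)                              # shift
            pos += 1
        else:
            break
    return stack[0][1]
```

The `look != '*'` guards play the role of the lookahead column in a real LALR table: they delay reducing to expr until any pending multiplication has been handled.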
  52. 52. Rule Functions • During reduction, rule functions are invoked

      def p_factor(p):
          'factor : NUMBER'

      • Parameter p contains grammar symbol values

      def p_factor(p):
          'factor : NUMBER'
          #  p[0]    p[1]
  53. 53. Using an LR Parser • Rule functions generally process values on right hand side of grammar rule • Result is then stored in left hand side • Results propagate up through the grammar • Bottom-up parsing
  54. 54. Example: Calculator

      def p_assign(p):
          '''assign : NAME EQUALS expr'''
          vars[p[1]] = p[3]

      def p_expr_plus(p):
          '''expr : expr PLUS term'''
          p[0] = p[1] + p[3]

      def p_term_mul(p):
          '''term : term TIMES factor'''
          p[0] = p[1] * p[3]

      def p_term_factor(p):
          '''term : factor'''
          p[0] = p[1]

      def p_factor(p):
          '''factor : NUMBER'''
          p[0] = p[1]

  55. 55. Example: Parse Tree

      def p_assign(p):
          '''assign : NAME EQUALS expr'''
          p[0] = ('ASSIGN', p[1], p[3])

      def p_expr_plus(p):
          '''expr : expr PLUS term'''
          p[0] = ('+', p[1], p[3])

      def p_term_mul(p):
          '''term : term TIMES factor'''
          p[0] = ('*', p[1], p[3])

      def p_term_factor(p):
          '''term : factor'''
          p[0] = p[1]

      def p_factor(p):
          '''factor : NUMBER'''
          p[0] = ('NUM', p[1])
  56. 56. Example: Parse Tree

      >>> t = yacc.parse("x = 3*4 + 5*6")
      >>> t
      ('ASSIGN', 'x', ('+', ('*', ('NUM', 3), ('NUM', 4)),
                            ('*', ('NUM', 5), ('NUM', 6))))
      >>>

                 ASSIGN
                /      \
             'x'        '+'
                       /   \
                     '*'    '*'
                    /  \   /  \
                   3    4 5    6
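The tuple tree is now ready for a later compiler phase. As a sketch of that hand-off, a hypothetical eval_node walker (not part of PLY) can execute it directly:

```python
def eval_node(node, env):
    """Evaluate the ('ASSIGN'/'NUM'/operator, ...) tuple tree from the slides.
    env is a dict mapping variable names to values."""
    kind = node[0]
    if kind == 'NUM':
        return node[1]
    if kind == 'ASSIGN':
        env[node[1]] = eval_node(node[2], env)
        return env[node[1]]
    left = eval_node(node[1], env)
    right = eval_node(node[2], env)
    if kind == '+': return left + right
    if kind == '-': return left - right
    if kind == '*': return left * right
    if kind == '/': return left / right
    raise ValueError('unknown node %r' % (kind,))

t = ('ASSIGN', 'x', ('+', ('*', ('NUM', 3), ('NUM', 4)),
                          ('*', ('NUM', 5), ('NUM', 6))))
env = {}
eval_node(t, env)   # env is now {'x': 42}
```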
  57. 57. Why use PLY? • There are many Python parsing tools • Some use more powerful parsing algorithms • Isn't parsing a "solved" problem anyways?
  58. 58. PLY is Informative • Compiler writing is hard • Tools should not make it even harder • PLY provides extensive diagnostics • Major emphasis on error reporting • Provides the same information as yacc
  59. 59. PLY Diagnostics • PLY produces the same diagnostics as yacc • Yacc % yacc grammar.y 4 shift/reduce conflicts 2 reduce/reduce conflicts • PLY % python mycompiler.py yacc: Generating LALR parsing table... 4 shift/reduce conflicts 2 reduce/reduce conflicts • PLY also produces the same debugging output
  60. 60. Debugging Output

      Grammar

      Rule 1   statement  -> NAME = expression
      Rule 2   statement  -> expression
      Rule 3   expression -> expression + expression
      Rule 4   expression -> expression - expression
      Rule 5   expression -> expression * expression
      Rule 6   expression -> expression / expression
      Rule 7   expression -> NUMBER

      Parsing method: LALR

      state 0

          (0) S' -> . statement
          (1) statement -> . NAME = expression
          (2) statement -> . expression
          ...
          NAME        shift and go to state 1
          NUMBER      shift and go to state 2

      ...

      state 11

          (4) expression -> expression - expression .
          (3) expression -> expression . + expression
          (4) expression -> expression . - expression
          (5) expression -> expression . * expression
          (6) expression -> expression . / expression

        ! shift/reduce conflict for + resolved as shift.
        ! shift/reduce conflict for - resolved as shift.
        ! shift/reduce conflict for * resolved as shift.
        ! shift/reduce conflict for / resolved as shift.

          $end    reduce using rule 4 (expression -> expression - expression .)
          +       shift and go to state 7
          -       shift and go to state 6
          *       shift and go to state 8
          /       shift and go to state 9
62. 62. PLY Validation
• PLY validates all token/grammar specs
• Duplicate rules
• Malformed regexes and grammars
• Missing rules and tokens
• Unused tokens and rules
• Improper function declarations
• Infinite recursion
63. 63. Error Example

import ply.lex as lex
tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES',
           'DIVIDE','EQUALS' ]
t_ignore = ' \t'
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_EQUALS = r'='
t_NAME   = r'[a-zA-Z_][a-zA-Z0-9_]*'
t_MINUS  = r'-'      <-- example.py:12: Rule t_MINUS redefined.
t_POWER  = r'\^'         Previously defined on line 6
def t_NUMBER():
    r'\d+'
    t.value = int(t.value)
    return t
lex.lex()   # Build the lexer
64. 64. Error Example

import ply.lex as lex
tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES',
           'DIVIDE','EQUALS' ]
t_ignore = ' \t'
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_EQUALS = r'='
t_NAME   = r'[a-zA-Z_][a-zA-Z0-9_]*'
t_MINUS  = r'-'
t_POWER  = r'\^'     <-- lex: Rule 't_POWER' defined for an
def t_NUMBER():          unspecified token POWER
    r'\d+'
    t.value = int(t.value)
    return t
lex.lex()   # Build the lexer
65. 65. Error Example

import ply.lex as lex
tokens = [ 'NAME','NUMBER','PLUS','MINUS','TIMES',
           'DIVIDE','EQUALS' ]
t_ignore = ' \t'
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_EQUALS = r'='
t_NAME   = r'[a-zA-Z_][a-zA-Z0-9_]*'
t_MINUS  = r'-'
t_POWER  = r'\^'
def t_NUMBER():      <-- example.py:15: Rule 't_NUMBER'
    r'\d+'               requires an argument.
    t.value = int(t.value)
    return t
lex.lex()   # Build the lexer
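As a rough, stdlib-only illustration of the lexing idea behind the t_* rules above (this is not PLY's actual implementation), the individual token regexes can be combined into one master pattern with named groups; the name of the matching group identifies the token type:

```python
# Sketch only: combine PLY-style t_* rules into one master regex.
# The rule names and token list mirror the example slides above.
import re

token_rules = [
    ('NUMBER', r'\d+'),
    ('NAME',   r'[a-zA-Z_][a-zA-Z0-9_]*'),
    ('PLUS',   r'\+'),
    ('MINUS',  r'-'),
    ('TIMES',  r'\*'),
    ('DIVIDE', r'/'),
    ('EQUALS', r'='),
    ('SKIP',   r'[ \t]+'),      # plays the role of t_ignore
]

master = re.compile('|'.join('(?P<%s>%s)' % rule for rule in token_rules))

def tokenize(text):
    tokens = []
    for m in master.finditer(text):
        kind = m.lastgroup          # which named group matched
        if kind == 'SKIP':
            continue
        value = int(m.group()) if kind == 'NUMBER' else m.group()
        tokens.append((kind, value))
    return tokens

print(tokenize('x = 3 + 42 * s'))
# [('NAME', 'x'), ('EQUALS', '='), ('NUMBER', 3), ('PLUS', '+'),
#  ('NUMBER', 42), ('TIMES', '*'), ('NAME', 's')]
```

Note that alternation order matters here (NUMBER before NAME, for instance), which is one reason PLY validates rule definitions so aggressively.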
66. 66. Commentary
• PLY was developed for classroom use
• Major emphasis on identifying and reporting potential problems
• Reports errors rather than failing with an exception
67. 67. PLY is Yacc
• PLY supports all of the major features of Unix lex/yacc
• Syntax error handling and synchronization
• Precedence specifiers
• Character literals
• Start conditions
• Inherited attributes
68. 68. Precedence Specifiers
• Yacc
    %left PLUS MINUS
    %left TIMES DIVIDE
    %nonassoc UMINUS
    ...
    expr : MINUS expr %prec UMINUS { $$ = -$2; }
• PLY
    precedence = (
        ('left','PLUS','MINUS'),
        ('left','TIMES','DIVIDE'),
        ('nonassoc','UMINUS'),
    )
    def p_expr_uminus(p):
        'expr : MINUS expr %prec UMINUS'
        p[0] = -p[2]
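The precedence table declares how operators bind. As a hedged, stdlib-only illustration of what those declarations mean (PLY itself compiles them into LALR tables rather than parsing this way), a precedence-climbing evaluator enforcing the same rules might look like:

```python
# Illustration only: enforce the binding rules declared above
# (+,- bind looser than *,/; unary minus binds tightest; all left-assoc).
import operator
import re

PREC = {'+': 1, '-': 1, '*': 2, '/': 2}   # like %left PLUS MINUS / %left TIMES DIVIDE
OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def tokenize(s):
    return re.findall(r'\d+|[-+*/()]', s) + ['$']   # '$' = end marker

def parse(tokens):
    def atom():
        tok = tokens.pop(0)
        if tok == '-':              # plays the role of %prec UMINUS
            return -atom()
        if tok == '(':
            val = expr(0)
            tokens.pop(0)           # consume ')'
            return val
        return int(tok)

    def expr(min_prec):
        left = atom()
        while tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
            op = tokens.pop(0)
            right = expr(PREC[op] + 1)   # +1 gives left associativity
            left = OPS[op](left, right)
        return left

    return expr(0)

print(parse(tokenize('-2 + 3 * 4')))   # -2 + 12 = 10
```

The `+1` in the recursive call is what makes `8 - 2 - 3` group as `(8 - 2) - 3`, exactly the behavior the `%left` declarations request.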
69. 69. Character Literals
• Yacc
    expr : expr '+' expr { $$ = $1 + $3; }
         | expr '-' expr { $$ = $1 - $3; }
         | expr '*' expr { $$ = $1 * $3; }
         | expr '/' expr { $$ = $1 / $3; }
         ;
• PLY
    def p_expr(p):
        '''expr : expr '+' expr
                | expr '-' expr
                | expr '*' expr
                | expr '/' expr'''
        ...
70. 70. Error Productions
• Yacc
    funcall_err : ID LPAREN error RPAREN {
        printf("Syntax error in arguments\n");
    }
    ;
• PLY
    def p_funcall_err(p):
        '''funcall_err : ID LPAREN error RPAREN'''
        print "Syntax error in arguments"
71. 71. Commentary
• Books and documentation on yacc/bison were used to guide the development of PLY
• Tried to copy all of the major features
• Usage kept as similar to lex/yacc as reasonable
72. 72. PLY is Simple
• Two pure-Python modules. That's it.
• Not part of a "parser framework"
• Use doesn't involve exotic design patterns
• Doesn't rely upon C extension modules
• Doesn't rely on third-party tools
73. 73. PLY is Fast
• For a parser written entirely in Python
• Underlying parser is table-driven
• Parsing tables are saved and only regenerated if the grammar changes
• Considerable work went into optimization from the start (developed on a 200 MHz PC)
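The save-and-regenerate-only-on-change idea can be sketched in a few lines of plain Python. This is a toy (hypothetical file name, pickle instead of PLY's generated table module), but it shows the fingerprint check that makes repeated runs cheap:

```python
# Sketch: fingerprint the grammar; reuse saved tables unless it changes.
# PLY writes its real tables to a generated Python module instead.
import hashlib
import os
import pickle
import tempfile

def build_tables(grammar):
    # Stand-in for the expensive LALR table construction.
    return {'rules': sorted(grammar)}

def load_parser(grammar, cache_file):
    sig = hashlib.md5('\n'.join(sorted(grammar)).encode()).hexdigest()
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            saved = pickle.load(f)
        if saved['signature'] == sig:
            return saved['tables'], True    # cache hit: no regeneration
    tables = build_tables(grammar)          # cache miss: rebuild and save
    with open(cache_file, 'wb') as f:
        pickle.dump({'signature': sig, 'tables': tables}, f)
    return tables, False

grammar = ['statement : NAME = expression', 'expression : NUMBER']
cache = os.path.join(tempfile.mkdtemp(), 'parser.tab')
_, hit1 = load_parser(grammar, cache)
_, hit2 = load_parser(grammar, cache)
print(hit1, hit2)   # False True
```

Only the first run pays the table-construction cost; every later run with an unchanged grammar is a cheap load, which is why the bison comparison on the next slide matters less in practice than it looks.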
74. 74. PLY Performance
• Example: Generating the LALR tables
• Input: SWIG C++ grammar
• 459 grammar rules, 892 parser states
• 3.6 seconds (PLY-2.3, 2.66 GHz Intel Xeon)
• 0.026 seconds (bison/ANSI C)
• Fast enough not to be annoying
• Tables only generated once and reused
75. 75. PLY Performance
• Parse file with 1000 random expressions (805KB) and build an abstract syntax tree
• PLY-2.3   : 2.95 sec, 10.2 MB (Python)
• YAPPS2    : 6.57 sec, 32.5 MB (Python)
• PyParsing : 13.11 sec, 15.6 MB (Python)
• ANTLR     : 53.16 sec, 94 MB (Python)
• SPARK     : 235.88 sec, 347 MB (Python)
• System: Mac Pro, 2.66 GHz Xeon, Python-2.5
76. 76. PLY Performance
• Parse file with 1000 random expressions (805KB) and build an abstract syntax tree
• PLY-2.3  : 2.95 sec, 10.2 MB (Python)
• DParser  : 0.71 sec, 72 MB (Python/C)
• BisonGen : 0.25 sec, 13 MB (Python/C)
• Bison    : 0.063 sec, 7.9 MB (C)
• 12x slower than BisonGen (mostly C)
• 47x slower than pure C
• System: Mac Pro, 2.66 GHz Xeon, Python-2.5
77. 77. Perf. Breakdown
• Parse file with 1000 random expressions (805KB) and build an abstract syntax tree
• Total time : 2.95 sec
• Startup    : 0.02 sec
• Lexing     : 1.20 sec
• Parsing    : 1.12 sec
• AST        : 0.61 sec
• System: Mac Pro, 2.66 GHz Xeon, Python-2.5
78. 78. Advanced PLY
• PLY has many advanced features
• Lexers/parsers can be defined as classes
• Support for multiple lexers and parsers
• Support for optimized mode (python -O)
79. 79. Class Example

import ply.yacc as yacc

class MyParser:
    def p_assign(self,p):
        '''assign : NAME EQUALS expr'''
    def p_expr(self,p):
        '''expr : expr PLUS term
                | expr MINUS term
                | term'''
    def p_term(self,p):
        '''term : term TIMES factor
                | term DIVIDE factor
                | factor'''
    def p_factor(self,p):
        '''factor : NUMBER'''
    def build(self):
        self.parser = yacc.yacc(object=self)
80. 80. Experience with PLY
• In 2001, I taught a compilers course
• Students wrote a full compiler
• Lexing, parsing, type checking, code generation
• Procedures, nested scopes, and type inference
• Produced working SPARC assembly code
81. 81. Classroom Results
• You can write a real compiler in Python
• Students were successful with projects
• However, many projects were quite "hacky"
• Still unsure about the dynamic nature of Python
• May be too easy to create a "bad" compiler
82. 82. General PLY Experience
• May be very useful for prototyping
• PLY's strength is in its diagnostics
• Significantly faster than most Python parsers
• Not sure I'd rewrite gcc in Python just yet
• I'm still thinking about SWIG
83. 83. Limitations
• LALR(1) parsing
• Not easy to work with very complex grammars (e.g., C++ parsing)
• Retains all of yacc's black magic
• Not as powerful as more general parsing algorithms (ANTLR, SPARK, etc.)
• Tradeoff: speed vs. generality
84. 84. PLY Usage
• Current version: PLY-2.3
• >100 downloads/week
• People are obviously using it
• Largest project I know of: an Ada parser
• Many other small projects
85. 85. Future Directions
• PLY was written for Python-2.0
• Not yet updated to use modern Python features such as iterators and generators
• May update, but not at the expense of performance
• Working on some add-ons to ease the transition between yacc <---> PLY
86. 86. Acknowledgements
• Many people have contributed to PLY:
  Thad Austin, Shannon Behrens, Michael Brown, Russ Cox, Johan Dahl,
  Andrew Dalke, Michael Dyck, Joshua Gerth, Elias Ioup, Oldrich Jedlicka,
  Sverre Jørgensen, Lee June, Andreas Jung, Cem Karan, Adam Kerrison,
  Daniel Larraz, David McNab, Patrick Mezard, Pearu Peterson,
  François Pinard, Eric Raymond, Adam Ring, Rich Salz, Markus Schoepflin,
  Christoper Stawarz, Miki Tebeka, Andrew Waters
• Apologies to anyone I forgot
87. 87. Resources
• PLY homepage: http://www.dabeaz.com/ply
• Mailing list/group: http://groups.google.com/group/ply-hack