1. Text parsing with Python and PLY
Daniil Baturin <daniil@baturin.org>
LVEE 2014
2. Formal language
Set of strings composed of symbols of some alphabet.
A language can be defined by a formal grammar.
Two or more grammars may produce the same language.
Forward problem: given a grammar, generate a string that
belongs to the language (trivial).
Parsing is the inverse problem: given a string, decide whether
it belongs to the language defined by the grammar.
3. Formal grammar
A formal grammar is a set of production rules.
A production rule has one or more symbols on its
left-hand and right-hand sides.
You need to specify the start symbol.
Example:
<sentence> ::= <verb phrase> <noun phrase>
<noun phrase> ::= <article> <adjective> <noun>
<verb phrase> ::= <verb> <article> <noun>
<article> ::= "a" | "the" | <empty>
...
“sentence” is the start symbol.
4. Lexers and tokens
A lexer breaks the text into tokens: string
literals with metadata.
Tokens are identified by regular expressions.
Parsers usually operate on tokens rather than
string literals. E.g. all numbers ([0-9]+) become
TOKEN(NUMBER, value).
A lexer can be autogenerated or hand-coded.
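A minimal hand-coded sketch of the idea, using plain re
(the token names and regexes here are hypothetical examples,
not PLY's API):

import re

TOKEN_SPECS = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("IDENT",  re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OPSIGN", re.compile(r"[+\-*/]")),
    ("SKIP",   re.compile(r"\s+")),  # whitespace: matched but not emitted
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        for name, regex in TOKEN_SPECS:
            m = regex.match(text, pos)
            if m:
                if name != "SKIP":
                    yield (name, m.group())  # e.g. ("NUMBER", "42")
                pos = m.end()
                break
        else:
            raise ValueError("no token matches at position %d" % pos)

print(list(tokenize("foo + 42")))
# [('IDENT', 'foo'), ('OPSIGN', '+'), ('NUMBER', '42')]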
5. Symbols
Terminal symbols are symbols that can't be
further reduced (rewritten).
Non-terminal symbols are built from references to
terminals and other non-terminals.
<article> ::= "a" | "the" | <empty>
“Article” is a non-terminal; “a”, “the”, empty are
terminals.
7. Regular grammar
All production rules are of the following form:
<some nonterminal> ::= <some terminal>
<some nonterminal> ::= <some terminal> <another nonterminal>
<some nonterminal> ::= <empty>
Equivalent to regular expressions.
The regex “(a|b)*” as a grammar:
<start> ::= "a" <start>
          | "b" <start>
          | <empty>
This is good for a hand-coded parser (or a regex library).
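For example, a hand-coded recognizer for the <start> grammar
above (a sketch; a regex library does the same job):

def matches(s):
    # apply <start> ::= "a" <start> | "b" <start> while possible
    pos = 0
    while pos < len(s) and s[pos] in ("a", "b"):
        pos += 1
    # accept via <start> ::= <empty> only if all input is consumed
    return pos == len(s)

assert matches("abba")
assert matches("")         # the <empty> rule alone
assert not matches("abc")  # "c" is not in the alphabet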
8. Context-free grammar
All productions are of the following form:
<some nonterminal> ::= <some terminals or nonterminals>
Good for autogenerated parsers. Ad-hoc
parsers tend to be messy.
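A classic context-free language that no regular grammar can
describe is balanced parentheses:

<parens> ::= "(" <parens> ")" <parens> | <empty>

A hand-rolled recursive-descent sketch of a recognizer for it
(plain Python, not PLY):

def parens(s, pos=0):
    # returns the position after a <parens> match, or None on failure
    if pos < len(s) and s[pos] == "(":
        inner = parens(s, pos + 1)
        if inner is None or inner >= len(s) or s[inner] != ")":
            return None              # unmatched "("
        return parens(s, inner + 1)
    return pos                       # the <empty> alternative consumes nothing

def matches(s):
    return parens(s) == len(s)

assert matches("(()())")
assert not matches("(()")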
9. More complex grammars
Context-sensitive: rules may contain terminals
and nonterminals on both sides.
Can be conquered with special tools or lexer
hacks.
Not even context-sensitive: you are in trouble. ;)
10. Parser generators
Usually use the LALR(1) method.
Read tokens one by one, keeping a stack of symbols.
If the symbols on top of the stack match a rule's
right-hand side, replace them with the rule's
left-hand side and proceed (reduce).
If not, push the token onto the stack and proceed
(shift).
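A toy illustration of the shift/reduce loop (greedy reduce-first;
a real LALR(1) generator also builds lookahead tables to decide):

# toy grammar: expr : NUMBER  |  expr PLUS expr
RULES = {
    ("NUMBER",): "expr",
    ("expr", "PLUS", "expr"): "expr",
}

def parse(tokens):
    stack, pos = [], 0
    while True:
        for rhs, lhs in RULES.items():
            if len(stack) >= len(rhs) and tuple(stack[-len(rhs):]) == rhs:
                del stack[-len(rhs):]   # reduce: pop the rule body...
                stack.append(lhs)       # ...and push its left-hand side
                break
        else:
            if pos < len(tokens):
                stack.append(tokens[pos])  # shift: push the next token
                pos += 1
            else:
                break
    return stack == ["expr"]

print(parse(["NUMBER", "PLUS", "NUMBER"]))  # True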
11. PLY Lex—the lexer generator
Token recognizers are functions.
Token regexes are in docstrings.
You should export a tuple of all token names for use in
the parser.
def t_NUMBER(t):
    r'[0-9]+'
    t.value = int(t.value)
    return t
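A complete runnable lexer built from this (NUMBER and OPSIGN are
the running example; t_ignore and t_error follow PLY conventions):

import ply.lex as lex

tokens = ('NUMBER', 'OPSIGN')

t_OPSIGN = r'[+\-*/]'  # simple tokens can be plain module-level regexes
t_ignore = ' \t'       # characters skipped between tokens

def t_NUMBER(t):
    r'[0-9]+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("3 + 42")
for tok in lexer:
    print(tok.type, tok.value)  # NUMBER 3, OPSIGN +, NUMBER 42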
12. PLY YACC—the parser generator
Production rule recognizers are functions.
Rules are in docstrings.
Rules can refer to other rules and can be recursive.
You can refer to lexer tokens in rules.
The argument is a sequence of symbol values: p[1] through p[n]
hold the right-hand side, and p[0] receives the result.
def p_expr(p):
    ''' expr : NUMBER OPSIGN NUMBER '''
    p[0] = (p[2], p[1], p[3])
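A self-contained sketch combining this rule with the lexer from
the previous slide:

import ply.lex as lex
import ply.yacc as yacc

tokens = ('NUMBER', 'OPSIGN')

t_OPSIGN = r'[+\-*/]'
t_ignore = ' \t'

def t_NUMBER(t):
    r'[0-9]+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

def p_expr(p):
    ''' expr : NUMBER OPSIGN NUMBER '''
    p[0] = (p[2], p[1], p[3])  # AST node: (operator, lhs, rhs)

def p_error(p):
    print("Syntax error at %r" % (p,))

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse("3 + 4"))  # ('+', 3, 4)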
13. Grammar “patterns”
The rule form used by YACC doesn't have a
notation for repetition etc.
Empty:
something :
This or that:
something : foo | bar
One or more of something:
list_of_something : list_of_something something
| something
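In PLY, the "one or more" pattern is usually paired with actions
that accumulate a Python list (a sketch with the slide's
illustrative names):

def p_list_of_something_many(p):
    ''' list_of_something : list_of_something something '''
    p[0] = p[1] + [p[2]]  # append to the already-collected list

def p_list_of_something_one(p):
    ''' list_of_something : something '''
    p[0] = [p[1]]         # the base case starts the list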
14. Common problems
Shift-reduce conflict: two or more rules share a common
prefix of tokens, so one token of lookahead can't tell
them apart.
foo : baz quux xyzzy
bar : baz quux fgsfds
Usually can be solved by left factorization:
foobar_start : baz quux
foo : foobar_start xyzzy
bar : foobar_start fgsfds
Default resolution is shift.
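Rendered as hypothetical PLY rules (BAZ, QUUX, XYZZY, FGSFDS
stand in for real tokens), the left-factored grammar becomes:

def p_foobar_start(p):
    ''' foobar_start : BAZ QUUX '''
    p[0] = (p[1], p[2])

def p_foo(p):
    ''' foo : foobar_start XYZZY '''
    p[0] = ('foo', p[1])

def p_bar(p):
    ''' bar : foobar_start FGSFDS '''
    p[0] = ('bar', p[1])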
Reduce-reduce conflict: same string matches more
than one rule. Often indicates a grammar design error.
15. What I didn't cover
Exclusive lexer states.
Precedence rules.
Wrapping lexers and parsers into classes.
Error recovery.