Successfully reported this slideshow.

LVEE 2014: Text parsing with Python and PLY



Loading in …3
1 of 17
1 of 17

More Related Content

Related Books

Free with a 14 day trial from Scribd

See all

LVEE 2014: Text parsing with Python and PLY

  1. 1. Text parsing with Python and PLY Daniil Baturin <> LVEE 2014
  2. 2. Formal language Set of strings composed of symbols of some alphabet. A language can be defined by a formal grammar. Two or more grammars may produce the same language. Forward problem: given a grammar, generate a string that belongs a language (trivial). Parsing is the inverse problem: given a string, decide if it belongs a language defined by a grammar.
  3. 3. Formal grammar Formal grammar is a set of production rules. A production rule has a one or more symbols at its left-hand and right-hand sides. You need to specify the start symbol. Example: <sentence> ::= <verb phrase> <noun phrase> <noun phrase> ::= <article> <adjective> <noun> <verb phrase> ::= <verb> <article> <noun> <article> ::= <”a” | “the” | empty> ... “sentence” is the start symbol.
  4. 4. Lexers and tokens Lexer—breaks the text into tokens, string literals with metadata. Tokens are identified by regular expressions. Parsers usually operate on tokens rather than string literals. E.g. all numbers ([0-9]+) become TOKEN(NUMBER, value). Lexer can be autogenerated or hand-coded.
  5. 5. Symbols Terminal symbols are symbols that can't be further reduced (rewritten). Non-terminal symbols are composed from references to terminals and other non-terminals. <article> ::= <”a” | “the” | empty> “Article” is a non-terminal; “a”, “the”, empty are terminals.
  6. 6. Approaches to parsing ● Hand-coded ad-hoc parser ● Autogenerated parser ● Hand-coded advanced algorithm
  7. 7. Regular grammar All production rules are of the following form: <some nonterminal> ::= <some terminal> <some nonterminal> ::= <some terminal> <another nonterminal> <some nonterminal> ::= <empty> Equivalent to regular expressions. “(a|b)*” regex: <start> ::= “a” <start> | “b” <start> | <empty> This is good for a hand-coded parser (or a regex library).
  8. 8. Context-free grammar All productions are of the following form: <some nonterminal> ::= <some terminals or nonterminals> Good for autogenerated parsers. Ad-hoc parsers tend to be messy.
  9. 9. More complex grammars Context-sensitive: rules may contain terminals and nonterminals in both sides. Can be conquered with special tools or lexer hacks. Not even context-sensitive: you are in trouble. ;)
  10. 10. Parser generators Usually use LALR(1) method. Read a token and push it on stack. If some rule is matched, empty the stack and proceed (reduce). If not, push the token on stack and proceed (shift).
  11. 11. PLY Lex—the lexer generator Token recognizers are functions. Token regexes are in docstrings. You should export a tuple of all tokens for use in a parser. def t_NUMBER(t): r'[0­9]+' t.value = int(t.value) return t
  12. 12. PLY YACC—the parser generator Production rule recognizers are functions. Rules are in docstrings. Rules can refer to other rules and can be recursive. You can refer to lexer tokens in rules. The argument is the token list. def p_expr(p): ''' expr : NUMBER OPSIGN NUMBER ''' p[0] = (p[2], p[1], p[3])
  13. 13. Grammar “patterns” The rule form used by YACC doesn't have a notation for repetition etc. Empty: something : This or that: something : foo | bar One or more of something: list_of_something : list_of_something something | something
  14. 14. Common problems Shift-reduce conflict: two or more rules have more than one common token at the beginning. foo : baz quux xyzzy bar : baz quux fgsfds Usually can be solved by left factorization: foobar_start : baz quux foo : foobar_start xyzzy bar : foobar_start fgsfds Default resolution is shift. Reduce-reduce conflict: same string matches more than one rule. Often indicates a grammar design error.
  15. 15. What I didn't tell Exclusive lexer states. Precedence rules. Wrapping lexers and parsers into classes. Error recovery.
  16. 16. The sample code
  17. 17. Questions?