Let's build a parser!

https://joind.in/talk/view/6257

  1. Let’s build a Parser! A short introduction to parsing with PHP. Boy Baukema, June 9th 2012, Amsterdam
  2. Source: http://www.sxc.hu/photo/1384894
  3. Boy Baukema, Software Engineer @ Ibuildings
  4. Reasons for common fear of writing parsers: 1. Never took compiler class, think it is scary. 2. Did take compiler - Martin Fowler
  5. Language cacophony. Source: http://www.wordle.net/show/wrdl/5292561/Languages_used_in_PHP_Web_Development
  6. Lookahead (?=
       - Languages
       - Parsing QueryLang
       - Parsing PHP code
       - Resources
  7. RegExes: And now you have two problems...
  8. Mail::RFC822::Address

         (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
         )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
         \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ [...]

     Source: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
  9. Chomsky hierarchy. Source: http://en.wikipedia.org/wiki/File:Chomsky-hierarchy.svg
  10. HTTP 1.1 Accept Header BNF

          Accept           = "Accept" ":" #( media-range [ accept-params ] )
          media-range      = ( "*/*"
                             | ( type "/" "*" )
                             | ( type "/" subtype )
                             ) *( ";" parameter )
          accept-params    = ";" "q" "=" qvalue *( accept-extension )
          accept-extension = ";" token [ "=" ( token | quoted-string ) ]

      Source: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
  11. Arithmetic expression BNF

          <expression> ::= <term> | <expression> "+" <term>
          <term>       ::= <factor> | <term> "*" <factor>
          <factor>     ::= <constant> | <variable> | "(" <expression> ")"
          <variable>   ::= "x" | "y" | "z"
          <constant>   ::= <digit> | <digit> <constant>
          <digit>      ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

      Source: http://en.wikipedia.org/wiki/Syntax_diagram
  12. Recursion in BNF

          <constant> ::= <digit> | <digit> <constant>
          <digit>    ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

      Production: each "::=" rule. Terminal: each quoted literal, such as "3".
  13. Matching 123

          <constant>
          |-- <digit> 1
          |-- <constant>
              |-- <digit> 2
              |-- <constant>
                  |-- <digit> 3    (<constant> ::= <digit>)

      Source: https://secure.flickr.com/photos/threedots/110586879/
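The recursion in `<constant>` can be sketched as a recursive PHP function. `matchConstant` is a hypothetical helper, not something from the talk. It tries the recursive alternative `<digit> <constant>` first and falls back to a single `<digit>`, which is the last step in the match of 123 above.

```php
<?php
// Hypothetical helper: recognise <constant> starting at offset $pos.
// Returns the offset just past the match, or null when nothing matches.
function matchConstant(string $input, int $pos): ?int
{
    // <digit> ::= "0" | "1" | ... | "9"
    if ($pos >= strlen($input) || !ctype_digit($input[$pos])) {
        return null;
    }
    // <constant> ::= <digit> <constant>: try the recursive alternative first
    $rest = matchConstant($input, $pos + 1);
    // ... and fall back to <constant> ::= <digit>
    return $rest ?? $pos + 1;
}

var_dump(matchConstant('123', 0));  // int(3)
var_dump(matchConstant('12+3', 0)); // int(2)
var_dump(matchConstant('x', 0));    // NULL
```

Each call consumes one digit and recurses, so the call stack mirrors the tree on the slide.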
  14. Arithmetic expression EBNF

          expression = term , {"+" , term};
          term       = factor , {"*" , factor};
          factor     = constant | variable | "(" , expression , ")";
          variable   = "x" | "y" | "z";
          constant   = digit , {digit};
          digit      = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";

      Source: http://en.wikipedia.org/wiki/Syntax_diagram
  15. Parsing Expression Grammar

          expression = term ("+" term)*
          term       = factor ("*" factor)*
          factor     = constant / variable / "(" expression ")"
          variable   = "x" / "y" / "z"
          constant   = [0-9]+

      Source: https://secure.flickr.com/photos/sasastro/5590210866/
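A grammar in this form maps almost rule for rule onto a hand-written recursive descent parser. The following is a minimal sketch under a few assumptions: the class `ArithmeticParser`, its method names and the nested-array tree layout are invented for illustration, the second rule is named `term` (the name `expression` refers to), and whitespace is not allowed because the grammar does not mention it.

```php
<?php
// Illustrative only: class, method names and tree layout are not from the talk.
final class ArithmeticParser
{
    private string $input;
    private int $pos = 0;

    public function __construct(string $input)
    {
        $this->input = $input;
    }

    public function parse(): array
    {
        $tree = $this->expression();
        if ($this->pos !== strlen($this->input)) {
            throw new RuntimeException("Unexpected input at offset {$this->pos}");
        }
        return $tree;
    }

    // expression = term ("+" term)*
    private function expression(): array
    {
        $node = $this->term();
        while ($this->accept('+')) {
            $node = ['+', $node, $this->term()];
        }
        return $node;
    }

    // term = factor ("*" factor)*
    private function term(): array
    {
        $node = $this->factor();
        while ($this->accept('*')) {
            $node = ['*', $node, $this->factor()];
        }
        return $node;
    }

    // factor = constant / variable / "(" expression ")"
    private function factor(): array
    {
        // \G anchors the match at the offset passed as the fifth argument.
        if (preg_match('/\G[0-9]+/', $this->input, $m, 0, $this->pos)) {
            $this->pos += strlen($m[0]);
            return ['constant', $m[0]];
        }
        if (preg_match('/\G[xyz]/', $this->input, $m, 0, $this->pos)) {
            $this->pos++;
            return ['variable', $m[0]];
        }
        if ($this->accept('(')) {
            $node = $this->expression();
            if (!$this->accept(')')) {
                throw new RuntimeException("Expected ')' at offset {$this->pos}");
            }
            return $node;
        }
        throw new RuntimeException("Expected a factor at offset {$this->pos}");
    }

    // Consume one literal character if it is next in the input.
    private function accept(string $char): bool
    {
        if (substr($this->input, $this->pos, 1) === $char) {
            $this->pos++;
            return true;
        }
        return false;
    }
}

print_r((new ArithmeticParser('2*(x+13)'))->parse());
```

Because `term` is called from inside `expression`, multiplication binds tighter than addition without any explicit precedence table.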
  16. So how will this help me parse a language?
  17. Parser Generators for PHP
        - Lime-php: LALR(1), 2008, abandoned
        - PHP_ParserGenerator: LALR(1), 2010, abandoned
        - Loco: combinatory parsing, 2011, alpha
        - php-peg: PEG, 2012, active?, alpha
  18. QueryLang: https://github.com/relaxnow/QueryLang
  19. QueryLang: Example query

          parsers OR 123 AND (dpc OR phpbnl)

          Query (OR)
          |-- Term - "parsers"
          |-- Query (AND)
              |-- Term - "123"
              |-- Query (OR)
                  |-- Term - "dpc"
                  |-- Term - "phpbnl"
  20. v1/Peg/grammar.peg.inc

          /*!* QueryLangV1
          Term: /[\w\d]+/
          */

          public function parse()
          {
              $match = $this->match_Term();
              if (!$match) {
                  return '';
              }
              return $match['text'];
          }

  21. v1/Peg/Parser.php - generated match_Term

          /* Term: /[\w\d]+/ */
          protected $match_Term_typestack = array('Term');
          function match_Term ($stack = array()) {
              $matchrule = "Term";
              $result = $this->construct($matchrule, $matchrule, null);
              if (( $subres = $this->rx( '/[\w\d]+/' ) ) !== FALSE) {
                  $result["text"] .= $subres;
                  return $this->finalise($result);
              }
              else { return FALSE; }
          }

  22. v1/Peg/grammar.peg.inc test

          $parser = new Parser('test');
          print_r($parser->parse()); // test

          $parser = new Parser('test 123');
          print_r($parser->parse()); // test
  23. v2/Peg/grammar.peg.inc

          /*!* QueryLangV2
          Query: Term (> Term)*
          Term: /[\w\d]+/
          */

          public function parse()
          {
              $result = $this->match_Query();
              return $result['query'];
          }

  24. v2/Peg/grammar.peg.inc (cont.)

          public function Query__construct(&$result)
          {
              $result['query'] = new NodeQuery();
          }

          public function Query_Term(&$result, $sub)
          {
              $term = new NodeTerm($sub['text']);
              $result['query']->addTerm($term);
          }

  25. v2/Peg/grammar.peg.inc test

          $parser = new Parser('test 123');
          print_r($parser->parse());

          Query
          |-- Term - "test"
          |-- Term - "123"
  26. v3/Peg/grammar.peg.inc

          /*!* QueryLangV3
          Query: AndQuery ([ "OR" ] AndQuery)*
          AndQuery: Term ([ "AND" ] Term)*
          Term: "(" Query ")" | Value:/[\w\d]+/
          */

          public function parse()
          {
              $node = $this->match_Query();
              return $node['query'];
          }

  27. v3/Peg/grammar.peg.inc (cont.)

          Query: AndQuery ([ "OR" ] AndQuery)*
          AndQuery: Term ([ "AND" ] Term)*

          public function Query__construct(&$r)
          {
              $r['query'] = new NodeQuery('OR');
          }

          public function Query_AndQuery(&$r, $s)
          {
              $r['query']->add($s['query']);
          }

          public function AndQuery__construct(&$r)
          {
              $r['query'] = new NodeQuery('AND');
          }

          public function AndQuery_Term(&$r, $s)
          {
              $r['query']->add($s['query']);
          }

  28. v3/Peg/grammar.peg.inc (cont.)

          /*!* QueryLangV3
          Term: "(" Query ")" | Value:/[\w\d]+/
          */

          public function Term_Query(&$r, $s)
          {
              $r['query'] = $s['query'];
          }

          public function Term_Value(&$r, $s)
          {
              $r['query'] = new NodeTerm($s['text']);
          }

  29. v3/Peg/grammar.peg.inc test

          $parser = new Parser('a AND b OR c');

          Query (OR)
          |-- Query (AND)
          |   |-- Term - "a"
          |   |-- Term - "b"
          |-- Query (AND)
              |-- Term - "c"
  30. Optional: Optimizer / Semantic checking
  31. Optimized query

          $parser = new Parser('a AND b OR c');
          $query = $parser->parse();
          $queryOptimizer = new Optimizer($query);
          $query = $queryOptimizer->optimize();

          Query (OR)
          |-- Term - "c"
          |-- Query (AND)
          |   |-- Term - "a"
          |   |-- Term - "b"
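The slides do not show how `Optimizer` works. One rewrite that is consistent with the tree above is to replace every query that has a single child by that child, so the `Query (AND)` wrapper around "c" disappears. The sketch below assumes exactly that; `NodeTerm`, `NodeQuery` and `collapse` are simplified stand-ins written for this example, not the QueryLang classes, and the sketch does not reorder children the way the slide's output does.

```php
<?php
// Hypothetical stand-ins for the node classes named on the slides.
final class NodeTerm
{
    public string $value;

    public function __construct(string $value)
    {
        $this->value = $value;
    }
}

final class NodeQuery
{
    public string $operator;
    public array $children = [];

    public function __construct(string $operator)
    {
        $this->operator = $operator;
    }

    public function add($node): void
    {
        $this->children[] = $node;
    }
}

// Replace every query that has exactly one child with that child.
function collapse($node)
{
    if ($node instanceof NodeTerm) {
        return $node;
    }
    $optimized = new NodeQuery($node->operator);
    foreach ($node->children as $child) {
        $optimized->add(collapse($child));
    }
    return count($optimized->children) === 1 ? $optimized->children[0] : $optimized;
}

// The tree for 'a AND b OR c' in the shape the v3 grammar builds.
$left = new NodeQuery('AND');
$left->add(new NodeTerm('a'));
$left->add(new NodeTerm('b'));
$right = new NodeQuery('AND');
$right->add(new NodeTerm('c'));
$root = new NodeQuery('OR');
$root->add($left);
$root->add($right);

$optimized = collapse($root);
print_r($optimized);
```

The function builds a new tree instead of mutating the parsed one, which keeps the parser's output available for error reporting or semantic checks.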
  32. Manual Parser Building: Predictive parsing. Source: http://en.wikipedia.org/wiki/File:PsychicBoston.jpg
  33. Manual Parser Building: Lexing
      Characters get turned into tokens by a lexical analyzer, also called a lexer, scanner or tokenizer.

          "a OR (b)"

          'term' => "a"
          'OR'
          'LeftParen'
          'term' => "b"
          'RightParen'

  34. Manual Parser Building: Lexing

          if ($this->_match('LeftParen', '/^(\()/')) {continue;}
          if ($this->_match('RightParen', '/^(\))/')) {continue;}
          if ($this->_match('OR', '/^(OR)/i')) {continue;}
          if ($this->_match('AND', '/^(AND)/i')) {continue;}
          if ($this->_match('TermValue', '/^([\w\d]+)/i')) {continue;}
          if ($this->_match('WS', '/^\s+/', true)) {continue;}
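The `_match` calls above live inside a loop that the slide does not show. A self-contained sketch of the same idea follows. The function name `tokenize` and the token layout are assumptions. The sketch also deviates from the slide in two small ways: it adds `\b` to the keyword patterns so that a word like "orange" is not split into `OR` plus "ange", and it uses `\w` alone because `\w` already matches digits.

```php
<?php
// Sketch of a regex-driven lexer; function name and token layout are assumptions.
function tokenize(string $input): array
{
    // Order matters: keywords must be tried before the generic term pattern.
    $patterns = [
        'LeftParen'  => '/^\(/',
        'RightParen' => '/^\)/',
        'OR'         => '/^OR\b/i',
        'AND'        => '/^AND\b/i',
        'TermValue'  => '/^\w+/',
        'WS'         => '/^\s+/',
    ];
    $tokens = [];
    while ($input !== '') {
        foreach ($patterns as $type => $pattern) {
            if (preg_match($pattern, $input, $m)) {
                if ($type !== 'WS') { // whitespace is matched but dropped
                    $tokens[] = ['type' => $type, 'text' => $m[0]];
                }
                // Cut the matched prefix off and restart with the first pattern.
                $input = (string) substr($input, strlen($m[0]));
                continue 2;
            }
        }
        throw new InvalidArgumentException("Cannot tokenize: $input");
    }
    return $tokens;
}

foreach (tokenize('a OR (b)') as $token) {
    echo $token['type'], ' ', $token['text'], PHP_EOL;
}
```

For "a OR (b)" this yields the same five tokens as the lexing slide: a term, `OR`, a left parenthesis, a term and a right parenthesis.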
  35. Manual Parser Building: Lexing - UML. Source: http://commons.wikimedia.org/wiki/File:Willem-Alexander,_Prince_of_Orange.jpg
  36. Manual Parser Building: Parsing
      Non-terminals become methods:

          protected function _query();
          protected function _andQuery();
          protected function _term();

      Parse to a tree structure.
  37. Manual Parser Building: Parsing - UML
  38. Manual Parser Building: example non-terminal

          protected function _query()
          {
              $query = new NodeQuery('OR');
              $leftTerm = $this->_andQuery();
              $query->add($leftTerm);
              while ($this->_tokenStream->look()->getType() === 'OR') {
                  $this->_tokenStream->expect('OR');
                  $rightTerm = $this->_andQuery();
                  $query->add($rightTerm);
              }
              return $query;
          }
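To show how the three non-terminal methods fit together, here is a compact, runnable sketch of a predictive parser for the v3 grammar. It is not the QueryLang implementation: `QueryParser` is a hypothetical class, tokens are plain strings, the tree is built from nested arrays instead of `NodeQuery` and `NodeTerm` objects, and the keywords are case-sensitive to keep the code short.

```php
<?php
// Hypothetical, array-based sketch of a predictive parser for the v3 grammar.
final class QueryParser
{
    private array $tokens;
    private int $pos = 0;

    public function __construct(string $input)
    {
        // Minimal tokenizer: parentheses and runs of word characters.
        preg_match_all('/[()]|\w+/', $input, $m);
        $this->tokens = $m[0];
    }

    public function parse(): array
    {
        $tree = $this->query();
        if ($this->look() !== null) {
            throw new RuntimeException("Unexpected token '{$this->look()}'");
        }
        return $tree;
    }

    // Query: AndQuery ("OR" AndQuery)*
    private function query(): array
    {
        $children = [$this->andQuery()];
        while ($this->look() === 'OR') {
            $this->expect('OR');
            $children[] = $this->andQuery();
        }
        return ['OR', $children];
    }

    // AndQuery: Term ("AND" Term)*
    private function andQuery(): array
    {
        $children = [$this->term()];
        while ($this->look() === 'AND') {
            $this->expect('AND');
            $children[] = $this->term();
        }
        return ['AND', $children];
    }

    // Term: "(" Query ")" | Value
    private function term()
    {
        if ($this->look() === '(') {
            $this->expect('(');
            $query = $this->query();
            $this->expect(')');
            return $query;
        }
        $value = $this->look();
        if ($value === null || !preg_match('/^\w+$/', $value)
            || in_array($value, ['OR', 'AND'], true)) {
            throw new RuntimeException('Expected a term');
        }
        $this->pos++;
        return $value;
    }

    // One token of lookahead, without consuming it.
    private function look(): ?string
    {
        return $this->tokens[$this->pos] ?? null;
    }

    // Consume the next token, which must equal $token.
    private function expect(string $token): void
    {
        if ($this->look() !== $token) {
            throw new RuntimeException("Expected '$token'");
        }
        $this->pos++;
    }
}

print_r((new QueryParser('a AND b OR c'))->parse());
```

Every method decides what to do by looking at a single upcoming token, which is what makes the parser predictive: it never has to back up and retry.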
  39. Predictive Parsing: Warning!
      Tokens must be decidable with a fixed lookahead:

          <term> ::= <TermValue> "-" <TermValue>
                   | <TermValue>
                   | "(" <Query> ")"

      No left recursion:

          <orQuery> ::= <orQuery> ("OR" <orQuery>)? | <term>
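In the `<term>` rule the first two alternatives both begin with `<TermValue>`, so a single token of lookahead cannot choose between them. One common way around this is left-factoring: consume the shared prefix first and only then look at the next token. The sketch below assumes the rewritten rule `<term> ::= <TermValue> ("-" <TermValue>)?` and leaves out the parenthesised alternative. `parseTerm`, `expectTermValue` and the `'dashed'` label are hypothetical names. Left recursion is usually handled in a similar spirit, by turning the rule into a loop as the `_query()` method on slide 38 does.

```php
<?php
// Hypothetical helpers, not from the talk. Parses the left-factored rule
//   <term> ::= <TermValue> ("-" <TermValue>)?
// from a flat list of token strings, advancing $pos past what was consumed.
function parseTerm(array $tokens, int &$pos): array
{
    $first = expectTermValue($tokens, $pos);
    // The shared prefix is consumed; one token of lookahead now decides.
    if (($tokens[$pos] ?? null) !== '-') {
        return ['term', $first];
    }
    $pos++; // consume "-"
    $second = expectTermValue($tokens, $pos);
    return ['dashed', $first, $second];
}

function expectTermValue(array $tokens, int &$pos): string
{
    $token = $tokens[$pos] ?? null;
    if ($token === null || !preg_match('/^\w+$/', $token)) {
        throw new RuntimeException("Expected a term value at token $pos");
    }
    $pos++;
    return $token;
}

$pos = 0;
print_r(parseTerm(['a'], $pos));
$pos = 0;
print_r(parseTerm(['a', '-', 'b'], $pos));
```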
  40. But I wanna parse
  41. PHP Parsers
        - PHP_Depend: 1.0.0, PHP 5.4
        - PHP-Parser: alpha, PHP 5.4
        - phc: 0.3.0.1 (unmaintained), PHP 5.2 (?)
  42. PHPDepend Abstract Syntax Tree example

          $string = "Manuel $Pichler <{$email}>";

          PHP_Depend_Code_ASTString
          |-- ASTLiteral - "Manuel "
          |-- ASTVariable - $Pichler
          |-- ASTLiteral - " <"
          |-- ASTCompoundExpression - {...}
          |   |-- ASTVariable - $email
          |-- ASTLiteral - ">"
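The tree above comes from PHP_Depend, whose API is not shown on the slides. As a rough point of comparison, PHP's bundled tokenizer extension can show how PHP's own lexer splits the same string. This is only the lexing step: `token_get_all` returns a flat token list, not a syntax tree, and it is not how PHP_Depend is used.

```php
<?php
// Uses PHP's bundled tokenizer extension (token_get_all), not PHP_Depend.
// Single quotes keep PHP from interpolating the variables in the sample source.
$source = '<?php $string = "Manuel $Pichler <{$email}>";';

$names = [];
foreach (token_get_all($source) as $token) {
    // Tokens are either [id, text, line] arrays or single-character strings.
    $names[] = is_array($token) ? token_name($token[0]) : $token;
}

echo implode(' ', $names), PHP_EOL;
```

The literal pieces of the string show up as `T_ENCAPSED_AND_WHITESPACE` tokens and the `{$email}` part is introduced by `T_CURLY_OPEN`, which corresponds loosely to the `ASTLiteral` and `ASTCompoundExpression` nodes in the tree.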
  43. Resources
  44. More resources
      Examples of modern parsers in PHP:
        - Twig (Predictive Parser)
        - Behat's Gherkin (Predictive Parser)
        - Smarty 3 (LALR parser)
      More information:
        - Rich Programmer Food by Steve Yegge
        - Let’s Build a Compiler, by Jack Crenshaw
        - nathansuniversity.com
        - Coursera: Compilers by Stanford University
        - SE-Radio: Episode 182: DSLs
  45. QUESTIONS?
        - Joind.in: https://joind.in/6257
        - Twitter: @relaxnow
        - E-mail: boy@ibuildings.nl
        - Slideshare: http://slidesha.re/INY43R
        - GitHub: https://github.com/relaxnow/QueryLang
