Let's build a parser!

https://joind.in/talk/view/6257

Published in: Technology

  1. Let’s build a Parser! A short introduction to parsing with PHP. Boy Baukema, June 9th 2012, Amsterdam
  2. Source: http://www.sxc.hu/photo/1384894
  3. Boy Baukema, Software Engineer @ Ibuildings
  4. Reasons for common fear of writing parsers:
     1. Never took compiler class, think it is scary.
     2. Did take compiler
     - Martin Fowler
  5. Language cacophony
     Source: http://www.wordle.net/show/wrdl/5292561/Languages_used_in_PHP_Web_Development
  6. Lookahead (?=
     • Languages
     • Parsing QueryLang
     • Parsing PHP code
     • Resources
  7. RegExes: And now you have two problems...
  8. Mail::RFC822::Address
     The slide is filled edge to edge with the module's address-matching regular expression, cut off mid-pattern. It begins:
     (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+ …
     Source: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
  9. Chomsky hierarchy
     Source: http://en.wikipedia.org/wiki/File:Chomsky-hierarchy.svg
  10. HTTP 1.1 Accept Header BNF
      Accept = "Accept" ":" #( media-range [ accept-params ] )
      media-range = ( "*/*" | ( type "/" "*" ) | ( type "/" subtype ) ) *( ";" parameter )
      accept-params = ";" "q" "=" qvalue *( accept-extension )
      accept-extension = ";" token [ "=" ( token | quoted-string ) ]
      Source: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
  11. Arithmetic expression BNF
      <expression> ::= <term> | <expression> "+" <term>
      <term> ::= <factor> | <term> "*" <factor>
      <factor> ::= <constant> | <variable> | "(" <expression> ")"
      <variable> ::= "x" | "y" | "z"
      <constant> ::= <digit> | <digit> <constant>
      <digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
      Source: http://en.wikipedia.org/wiki/Syntax_diagram
  12. Recursion in BNF
      <constant> ::= <digit> | <digit> <constant>
      <digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
      Callouts on the slide: "Production", "Terminal"
  13. Matching 123
      <constant>
      |-- <digit> 1
      |-- <constant>
          |-- <digit> 2
          |-- <constant>
              |-- <digit> 3
      The recursion ends with: <constant> ::= <digit>
      Source: https://secure.flickr.com/photos/threedots/110586879/
  14. Arithmetic expression EBNF
      expression = term , {"+" , term};
      term = factor , {"*" , factor};
      factor = constant | variable | "(" , expression , ")";
      variable = "x" | "y" | "z";
      constant = digit , {digit};
      digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
      Source: http://en.wikipedia.org/wiki/Syntax_diagram
  15. Parsing Expression Grammar
      expression = term ("+" term)*
      term = factor ("*" factor)*
      factor = constant / variable / "(" expression ")"
      variable = "x" / "y" / "z"
      constant = [0-9]+
      Source: https://secure.flickr.com/photos/sasastro/5590210866/
  16. So how will this help me parse a language?
  17. Parser Generators for PHP
      • Lime-php: LALR(1), 2008, abandoned
      • PHP_ParserGenerator: LALR(1), 2010, abandoned
      • Loco: combinatory parsing, 2011, alpha
      • php-peg: PEG, 2012, active?, alpha
  18. QueryLang: https://github.com/relaxnow/QueryLang
  19. QueryLang: Example query
      parsers OR 123 AND (dpc OR phpbnl)
      Query (OR)
      |-- Term - "parsers"
      |-- Query (AND)
          |-- Term - "123"
          |-- Query (OR)
              |-- Term - "dpc"
              |-- Term - "phpbnl"
  20. v1/Peg/grammar.peg.inc
      /*!* QueryLangV1
      Term: /[\w\d]+/
      */
      public function parse()
      {
          $match = $this->match_Term();
          if (!$match) {
              return '';
          }
          return $match['text'];
      }
  21. v1/Peg/Parser.php - generated match_Term
      /* Term: /[\w\d]+/ */
      protected $match_Term_typestack = array('Term');
      function match_Term ($stack = array()) {
          $matchrule = "Term";
          $result = $this->construct($matchrule, $matchrule, null);
          if (( $subres = $this->rx( '/[\w\d]+/' ) ) !== FALSE) {
              $result["text"] .= $subres;
              return $this->finalise($result);
          }
          else {
              return FALSE;
          }
      }
  22. v1/Peg/grammar.peg.inc test
      $parser = new Parser('test');
      print_r($parser->parse()); // test
      $parser = new Parser('test 123');
      print_r($parser->parse()); // test
  23. v2/Peg/grammar.peg.inc
      /*!* QueryLangV2
      Query: Term (> Term)*
      Term: /[\w\d]+/
      */
      public function parse()
      {
          $result = $this->match_Query();
          return $result['query'];
      }
  24. v2/Peg/grammar.peg.inc (cont.)
      public function Query__construct(&$result)
      {
          $result['query'] = new NodeQuery();
      }
      public function Query_Term(&$result, $sub)
      {
          $term = new NodeTerm($sub['text']);
          $result['query']->addTerm($term);
      }
  25. v2/Peg/grammar.peg.inc test
      $parser = new Parser('test 123');
      print_r($parser->parse());
      Query
      |-- Term - "test"
      |-- Term - "123"
  26. v3/Peg/grammar.peg.inc
      /*!* QueryLangV3
      Query: AndQuery ([ "OR" ] AndQuery)*
      AndQuery: Term ([ "AND" ] Term)*
      Term: "(" Query ")" | Value:/[\w\d]+/
      */
      public function parse()
      {
          $node = $this->match_Query();
          return $node['query'];
      }
  27. v3/Peg/grammar.peg.inc (cont.)
      Query: AndQuery ([ "OR" ] AndQuery)*
      AndQuery: Term ([ "AND" ] Term)*
      public function Query__construct(&$r) {
          $r['query'] = new NodeQuery('OR');
      }
      public function Query_AndQuery(&$r, $s) {
          $r['query']->add($s['query']);
      }
      public function AndQuery__construct(&$r) {
          $r['query'] = new NodeQuery('AND');
      }
      public function AndQuery_Term(&$r, $s) {
          $r['query']->add($s['query']);
      }
  28. v3/Peg/grammar.peg.inc (cont.)
      /*!* QueryLangV3
      Term: "(" Query ")" | Value:/[\w\d]+/
      */
      public function Term_Query(&$r, $s)
      {
          $r['query'] = $s['query'];
      }
      public function Term_Value(&$r, $s)
      {
          $r['query'] = new NodeTerm($s['text']);
      }
  29. v3/Peg/grammar.peg.inc test
      $parser = new Parser('a AND b OR c');
      Query (OR)
      |-- Query (AND)
      |   |-- Term - "a"
      |   |-- Term - "b"
      |-- Query (AND)
          |-- Term - "c"
  30. Optional: Optimizer / Semantic checking
  31. Optimized query
      $parser = new Parser('a AND b OR c');
      $query = $parser->parse();
      $queryOptimizer = new Optimizer($query);
      $query = $queryOptimizer->optimize();
      Query (OR)
      |-- Term - "c"
      |-- Query (AND)
      |   |-- Term - "a"
      |   |-- Term - "b"
  32. Manual Parser Building: Predictive parsing
      Source: http://en.wikipedia.org/wiki/File:PsychicBoston.jpg
  33. Manual Parser Building: Lexing
      Characters get turned into tokens by a lexical analyzer, also called a lexer, scanner or tokenizer.
      "a OR (b)" becomes:
      term => "a"
      OR
      LeftParen
      term => "b"
      RightParen
  34. Manual Parser Building: Lexing
      if ($this->_match('LeftParen', '/^(\()/')) { continue; }
      if ($this->_match('RightParen', '/^(\))/')) { continue; }
      if ($this->_match('OR', '/^(OR)/i')) { continue; }
      if ($this->_match('AND', '/^(AND)/i')) { continue; }
      if ($this->_match('TermValue', '/^([\w\d]+)/i')) { continue; }
      if ($this->_match('WS', '/^\s+/', true)) { continue; }
  35. Manual Parser Building: Lexing - UML
      Source: http://commons.wikimedia.org/wiki/File:Willem-Alexander,_Prince_of_Orange.jpg
  36. Manual Parser Building: Parsing
      Non-terminals become methods:
      protected function _query();
      protected function _andQuery();
      protected function _term();
      Parse to a tree structure.
  37. Manual Parser Building: Parsing - UML
  38. Manual Parser Building: example non-terminal
      protected function _query()
      {
          $query = new NodeQuery('OR');
          $leftTerm = $this->_andQuery();
          $query->add($leftTerm);
          while ($this->_tokenStream->look()->getType() === 'OR') {
              $this->_tokenStream->expect('OR');
              $rightTerm = $this->_andQuery();
              $query->add($rightTerm);
          }
          return $query;
      }
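  39. Predictive Parsing: Warning!
      Tokens must be decidable with a fixed lookahead. In this rule the first two alternatives both start with <TermValue>, so one token of lookahead cannot choose between them:
      <term> ::= <TermValue> "-" <TermValue> | <TermValue> | "(" <Query> ")"
      No left recursion. A method for this rule would call itself before consuming any token:
      <orQuery> ::= <orQuery> ("OR" <orQuery>)? | <term>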
  40. But I wanna parse
  41. PHP Parsers
      • PHP_Depend: 1.0.0, PHP 5.4
      • PHP-Parser: alpha, PHP 5.4
      • phc: 0.3.0.1 (unmaintained), PHP 5.2 (?)
  42. PHPDepend Abstract Syntax Tree example
      $string = "Manuel $Pichler <{$email}>";
      PHP_Depend_Code_ASTString
      |-- ASTLiteral - "Manuel "
      |-- ASTVariable - $Pichler
      |-- ASTLiteral - " <"
      |-- ASTCompoundExpression - {...}
      |   |-- ASTVariable - $email
      |-- ASTLiteral - ">"
  43. Resources
  44. More resources
      Examples of modern parsers in PHP:
      • Twig (Predictive Parser)
      • Behat's Gherkin (Predictive Parser)
      • Smarty 3 (LALR parser)
      More information:
      • Rich Programmer Food, by Steve Yegge
      • Let’s Build a Compiler, by Jack Crenshaw
      • nathansuniversity.com
      • Coursera: Compilers by Stanford University
      • SE-Radio: Episode 182: DSLs
  45. QUESTIONS?
      Joind.in: https://joind.in/6257
      Twitter: @relaxnow
      E-mail: boy@ibuildings.nl
      Slideshare: http://slidesha.re/INY43R
      GitHub: https://github.com/relaxnow/QueryLang
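
The manual lexing and predictive parsing steps from slides 33 to 38 can be tried end to end with a small standalone sketch. This is not the QueryLang code: `tokenize()`, `QueryParser` and `render()` are hypothetical names, the tree is plain nested arrays instead of node classes, and the `\b` after the keywords is an added assumption so that a term such as "order" is not lexed as `OR` followed by "der".

```php
<?php
// Illustrative sketch only; all names here are hypothetical, not QueryLang's API.

/** Lexer: turn a query string into a list of [type, text] tokens. */
function tokenize(string $input): array
{
    // Order matters: keywords are tried before the generic term pattern.
    $patterns = [
        'LeftParen'  => '/^\(/',
        'RightParen' => '/^\)/',
        'OR'         => '/^OR\b/i',
        'AND'        => '/^AND\b/i',
        'TermValue'  => '/^\w+/',
        'WS'         => '/^\s+/',
    ];
    $tokens = [];
    while ($input !== '') {
        foreach ($patterns as $type => $regex) {
            if (preg_match($regex, $input, $m)) {
                if ($type !== 'WS') { // whitespace is matched but discarded
                    $tokens[] = [$type, $m[0]];
                }
                $input = (string) substr($input, strlen($m[0]));
                continue 2; // back to the while loop with the remaining input
            }
        }
        throw new RuntimeException("Unexpected input: $input");
    }
    return $tokens;
}

/** Predictive (recursive descent) parser: one method per non-terminal. */
class QueryParser
{
    private $tokens;
    private $pos = 0;

    public function __construct(array $tokens)
    {
        $this->tokens = $tokens;
    }

    // Type of the next token without consuming it (one token of lookahead).
    private function look(): string
    {
        return $this->tokens[$this->pos][0] ?? 'EOF';
    }

    // Consume the next token, which must have the given type; return its text.
    private function expect(string $type): string
    {
        if ($this->look() !== $type) {
            throw new RuntimeException("Expected $type, got " . $this->look());
        }
        return $this->tokens[$this->pos++][1];
    }

    public function parse(): array
    {
        $tree = $this->query();
        if ($this->look() !== 'EOF') {
            throw new RuntimeException('Unexpected ' . $this->look());
        }
        return $tree;
    }

    // Query: AndQuery ("OR" AndQuery)*
    private function query(): array
    {
        $children = [$this->andQuery()];
        while ($this->look() === 'OR') {
            $this->expect('OR');
            $children[] = $this->andQuery();
        }
        return ['OR', $children];
    }

    // AndQuery: Term ("AND" Term)*
    private function andQuery(): array
    {
        $children = [$this->term()];
        while ($this->look() === 'AND') {
            $this->expect('AND');
            $children[] = $this->term();
        }
        return ['AND', $children];
    }

    // Term: "(" Query ")" | TermValue
    private function term(): array
    {
        if ($this->look() === 'LeftParen') {
            $this->expect('LeftParen');
            $query = $this->query();
            $this->expect('RightParen');
            return $query;
        }
        return ['Term', $this->expect('TermValue')];
    }
}

/** Render a tree compactly: operators as NAME(child, ...), terms as their text. */
function render(array $node): string
{
    if ($node[0] === 'Term') {
        return $node[1];
    }
    return $node[0] . '(' . implode(', ', array_map('render', $node[1])) . ')';
}

echo render((new QueryParser(tokenize('a AND b OR c')))->parse()), "\n";
// prints: OR(AND(a, b), AND(c))
```

The printed shape mirrors the unoptimized tree on slide 29: every operand of OR is wrapped in an AND node, even when it holds a single term. Both loops consume tokens iteratively, which is how the grammar avoids the left recursion warned about on slide 39.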