Let's build a parser!

Transcript

  • 1. Let’s build a Parser! A short introduction to parsing with PHP. Boy Baukema, June 9th 2012, Amsterdam
  • 2. Source: http://www.sxc.hu/photo/1384894
  • 3. Boy Baukema, Software Engineer @ Ibuildings
  • 4. Reasons for common fear of writing parsers:
      1. Never took compiler class, think it is scary.
      2. Did take compiler
      - Martin Fowler
  • 5. Language cacophony. Source: http://www.wordle.net/show/wrdl/5292561/Languages_used_in_PHP_Web_Development
  • 6. Lookahead (?=
      Languages
      Parsing QueryLang
      Parsing PHP code
      Resources
  • 7. RegExes: And now you have two problems...
  • 8. Mail::RFC822::Address: a very long regular expression for validating RFC 822 e-mail addresses (excerpt shown on the slide). Source: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
  • 9. Chomsky hierarchy. Source: http://en.wikipedia.org/wiki/File:Chomsky-hierarchy.svg
  • 10. HTTP 1.1 Accept Header BNF
      Accept           = "Accept" ":" #( media-range [ accept-params ] )
      media-range      = ( "*/*" | ( type "/" "*" ) | ( type "/" subtype ) ) *( ";" parameter )
      accept-params    = ";" "q" "=" qvalue *( accept-extension )
      accept-extension = ";" token [ "=" ( token | quoted-string ) ]
      Source: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
  • 11. Arithmetic expression BNF
      <expression> ::= <term> | <expression> "+" <term>
      <term>       ::= <factor> | <term> "*" <factor>
      <factor>     ::= <constant> | <variable> | "(" <expression> ")"
      <variable>   ::= "x" | "y" | "z"
      <constant>   ::= <digit> | <digit> <constant>
      <digit>      ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
      Source: http://en.wikipedia.org/wiki/Syntax_diagram
  • 12. Recursion in BNF
      <constant> ::= <digit> | <digit> <constant>    (a production; it refers to itself)
      <digit>    ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"    (each quoted digit is a terminal)
  • 13. Matching 123
      <constant>
      |-- <digit> 1
      |-- <constant>
          |-- <digit> 2
          |-- <constant>    (<constant> ::= <digit>)
              |-- <digit> 3
      Source: https://secure.flickr.com/photos/threedots/110586879/
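The recursive <constant> rule from slides 12 and 13 translates almost mechanically into a recursive function. The following is a minimal sketch, not code from the talk: the function names matchDigit and matchConstant are hypothetical, and each function returns the offset after the match, or null when nothing matches.

```php
<?php
// Hypothetical helpers illustrating <constant> ::= <digit> | <digit> <constant>.

// <digit> ::= "0" | ... | "9". Returns the offset after the digit, or null.
function matchDigit(string $input, int $pos): ?int
{
    if ($pos < strlen($input) && strpos('0123456789', $input[$pos]) !== false) {
        return $pos + 1;
    }
    return null;
}

// <constant>: one digit, then as many further digits as the recursion finds.
function matchConstant(string $input, int $pos = 0): ?int
{
    $afterDigit = matchDigit($input, $pos);
    if ($afterDigit === null) {
        return null;                       // not even one digit here
    }
    // Try the recursive alternative first; fall back to the single digit.
    $afterRest = matchConstant($input, $afterDigit);
    return $afterRest ?? $afterDigit;
}
```

Calling matchConstant('123') recurses three levels deep, one per digit, which mirrors the tree on slide 13.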
  • 14. Arithmetic expression EBNF
      expression = term , {"+" , term};
      term       = factor , {"*" , factor};
      factor     = constant | variable | "(" , expression , ")";
      variable   = "x" | "y" | "z";
      constant   = digit , {digit};
      digit      = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
      Source: http://en.wikipedia.org/wiki/Syntax_diagram
  • 15. Parsing Expression Grammar
      expression = term ("+" term)*
      term       = factor ("*" factor)*
      factor     = constant / variable / "(" expression ")"
      variable   = "x" / "y" / "z"
      constant   = [0-9]+
      Source: https://secure.flickr.com/photos/sasastro/5590210866/
  • 16. So how will this help me parse a language?
  • 17. Parser Generators for PHP
      Lime-php: LALR(1), 2008, abandoned
      PHP_ParserGenerator: LALR(1), 2010, abandoned
      Loco: combinatory parsing, 2011, alpha
      php-peg: PEG, 2012, active?, alpha
  • 18. QueryLang: https://github.com/relaxnow/QueryLang
  • 19. QueryLang: Example query
      parsers OR 123 AND (dpc OR phpbnl)

      Query (OR)
      |-- Term - "parsers"
      |-- Query (AND)
          |-- Term - "123"
          |-- Query (OR)
              |-- Term - "dpc"
              |-- Term - "phpbnl"
  • 20. v1/Peg/grammar.peg.inc
      /*!* QueryLangV1
      Term: /[\w\d]+/
      */
      public function parse()
      {
          $match = $this->match_Term();
          if (!$match) {
              return;
          }
          return $match['text'];
      }
  • 21. v1/Peg/Parser.php - generated match_Term
      /* Term: /[\w\d]+/ */
      protected $match_Term_typestack = array('Term');
      function match_Term ($stack = array()) {
          $matchrule = "Term";
          $result = $this->construct($matchrule, $matchrule, null);
          if (( $subres = $this->rx( '/[\w\d]+/' ) ) !== FALSE) {
              $result["text"] .= $subres;
              return $this->finalise($result);
          }
          else { return FALSE; }
      }
  • 22. v1/Peg/grammar.peg.inc test
      $parser = new Parser('test');
      print_r($parser->parse()); // test
      $parser = new Parser('test 123');
      print_r($parser->parse()); // test
  • 23. v2/Peg/grammar.peg.inc
      /*!* QueryLangV2
      Query: Term (> Term)*
      Term: /[\w\d]+/
      */
      public function parse()
      {
          $result = $this->match_Query();
          return $result['query'];
      }
  • 24. v2/Peg/grammar.peg.inc (cont.)
      public function Query__construct(&$result)
      {
          $result['query'] = new NodeQuery();
      }
      public function Query_Term(&$result, $sub)
      {
          $term = new NodeTerm($sub['text']);
          $result['query']->addTerm($term);
      }
  • 25. v2/Peg/grammar.peg.inc test
      $parser = new Parser('test 123');
      print_r($parser->parse());

      Query
      |-- Term - "test"
      |-- Term - "123"
  • 26. v3/Peg/grammar.peg.inc
      /*!* QueryLangV3
      Query: AndQuery ([ "OR" ] AndQuery)*
      AndQuery: Term ([ "AND" ] Term)*
      Term: "(" Query ")" | Value:/[\w\d]+/
      */
      public function parse()
      {
          $node = $this->match_Query();
          return $node['query'];
      }
  • 27. v3/Peg/grammar.peg.inc (cont.)
      Query: AndQuery ([ "OR" ] AndQuery)*
      AndQuery: Term ([ "AND" ] Term)*

      public function Query__construct(&$r) {
          $r['query'] = new NodeQuery('OR');
      }
      public function Query_AndQuery(&$r, $s) {
          $r['query']->add($s['query']);
      }
      public function AndQuery__construct(&$r) {
          $r['query'] = new NodeQuery('AND');
      }
      public function AndQuery_Term(&$r, $s) {
          $r['query']->add($s['query']);
      }
  • 28. v3/Peg/grammar.peg.inc (cont.)
      /*!* QueryLangV3
      Term: "(" Query ")" | Value:/[\w\d]+/
      */
      public function Term_Query(&$r, $s)
      {
          $r['query'] = $s['query'];
      }
      public function Term_Value(&$r, $s)
      {
          $r['query'] = new NodeTerm($s['text']);
      }
  • 29. v3/Peg/grammar.peg.inc test
      $parser = new Parser('a AND b OR c');

      Query (OR)
      |-- Query (AND)
      |   |-- Term - "a"
      |   |-- Term - "b"
      |-- Query (AND)
          |-- Term - "c"
  • 30. Optional: Optimizer / Semantic checking
  • 31. Optimized query
      $parser = new Parser('a AND b OR c');
      $query = $parser->parse();
      $queryOptimizer = new Optimizer($query);
      $query = $queryOptimizer->optimize();

      Query (OR)
      |-- Term - "c"
      |-- Query (AND)
      |   |-- Term - "a"
      |   |-- Term - "b"
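The slides do not show what the Optimizer class does internally. One plausible optimization that would explain the unwrapped "c" term is collapsing query nodes that have only a single child. The sketch below illustrates that idea under stated assumptions: the function name collapseSingleChildren and the nested-array tree shape are hypothetical, it is not the QueryLang implementation, and it does not reproduce the reordering of children visible on slide 31.

```php
<?php
// Hypothetical tree shape: a node is either a string (a term)
// or a two-element array [operator, list of child nodes].
function collapseSingleChildren($node)
{
    if (is_string($node)) {
        return $node;                      // terms are already minimal
    }
    [$operator, $children] = $node;
    // Optimize bottom-up so nested single-child chains collapse fully.
    $children = array_map('collapseSingleChildren', $children);
    // A query with one child means the same as that child alone.
    return count($children) === 1 ? $children[0] : [$operator, $children];
}
```

Applied to the tree from slide 29, the single-child AND query around "c" is replaced by the term itself, while the two-child AND query is left alone.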
  • 32. Manual Parser Building: Predictive parsing. Source: http://en.wikipedia.org/wiki/File:PsychicBoston.jpg
  • 33. Manual Parser Building: Lexing
      Characters get turned into tokens by a lexical analyzer, also called a lexer, scanner or tokenizer.
      "a OR (b)" becomes:
        term => "a"
        OR
        LeftParen
        term => "b"
        RightParen
  • 34. Manual Parser Building: Lexing
      if ($this->_match('LeftParen', '/^(\()/')) { continue; }
      if ($this->_match('RightParen', '/^(\))/')) { continue; }
      if ($this->_match('OR', '/^(OR)/i')) { continue; }
      if ($this->_match('AND', '/^(AND)/i')) { continue; }
      if ($this->_match('TermValue', '/^([\w\d]+)/i')) { continue; }
      if ($this->_match('WS', '/^\s+/', true)) { continue; }
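The chain of _match() calls on slide 34 sits inside a loop that keeps consuming the front of the input until nothing is left. A stand-alone sketch of such a loop could look like the following. The function name tokenize and the token array shape are assumptions for illustration, not the QueryLang code. It also adds a word boundary (\b) to the keyword patterns, which the slide does not have, so that a term such as "ORANGE" is not split into OR plus a term.

```php
<?php
// Hypothetical stand-alone lexer; the rule table mirrors the slide's patterns.
function tokenize(string $input): array
{
    // Order matters: keywords are tried before the generic term pattern.
    $rules = [
        'LeftParen'  => '/^\(/',
        'RightParen' => '/^\)/',
        'OR'         => '/^OR\b/i',
        'AND'        => '/^AND\b/i',
        'TermValue'  => '/^[\w\d]+/',
        'WS'         => '/^\s+/',
    ];
    $tokens = [];
    while ($input !== '') {
        foreach ($rules as $type => $pattern) {
            if (preg_match($pattern, $input, $m)) {
                if ($type !== 'WS') {      // whitespace is matched but not emitted
                    $tokens[] = ['type' => $type, 'text' => $m[0]];
                }
                // Drop the matched prefix and start over with the first rule.
                $input = (string) substr($input, strlen($m[0]));
                continue 2;
            }
        }
        throw new InvalidArgumentException("Unexpected input: $input");
    }
    return $tokens;
}
```

For the input "a OR (b)" this yields the token types TermValue, OR, LeftParen, TermValue, RightParen, matching the example on slide 33.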
  • 35. Manual Parser Building: Lexing - UML. Source: http://commons.wikimedia.org/wiki/File:Willem-Alexander,_Prince_of_Orange.jpg
  • 36. Manual Parser Building: Parsing
      Non-terminals become methods:
        protected function _query();
        protected function _andQuery();
        protected function _term();
      Parse to a tree structure.
  • 37. Manual Parser Building: Parsing - UML
  • 38. Manual Parser Building: example non-terminal
      protected function _query() {
          $query = new NodeQuery('OR');
          $leftTerm = $this->_andQuery();
          $query->add($leftTerm);
          while ($this->_tokenStream->look()->getType() === 'OR') {
              $this->_tokenStream->expect('OR');
              $rightTerm = $this->_andQuery();
              $query->add($rightTerm);
          }
          return $query;
      }
  • 39. Predictive Parsing: Warning!
      Tokens must be decidable with a fixed lookahead:
        <term> ::= <TermValue> "-" <TermValue> | <TermValue> | "(" <Query> ")"
      No left recursion:
        <orQuery> ::= <orQuery> ("OR" <orQuery>)? | <term>
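A method written directly from the left-recursive <orQuery> rule would call itself before consuming any token and never terminate. The usual fix is to rewrite the rule as a repetition, as the while loop on slide 38 does. The sketch below shows the whole idea in one self-contained class. The class name TinyQueryParser, the use of plain strings as tokens and nested arrays as tree nodes are all assumptions made for brevity, not the QueryLang design.

```php
<?php
// Hypothetical sketch: tokens are plain strings, tree nodes are [operator, children].
final class TinyQueryParser
{
    private array $tokens;
    private int $pos = 0;

    public function __construct(array $tokens)
    {
        $this->tokens = $tokens;
    }

    public function parse(): array
    {
        $tree = $this->orQuery();
        if ($this->look() !== null) {
            throw new RuntimeException('Unexpected token: ' . $this->look());
        }
        return $tree;
    }

    // orQuery = andQuery { "OR" andQuery }   (iteration instead of left recursion)
    private function orQuery(): array
    {
        $children = [$this->andQuery()];
        while ($this->look() === 'OR') {
            $this->pos++;
            $children[] = $this->andQuery();
        }
        return ['OR', $children];
    }

    // andQuery = term { "AND" term }
    private function andQuery(): array
    {
        $children = [$this->term()];
        while ($this->look() === 'AND') {
            $this->pos++;
            $children[] = $this->term();
        }
        return ['AND', $children];
    }

    // term = "(" orQuery ")" | value; one token of lookahead picks the branch.
    private function term()
    {
        $token = $this->look();
        if ($token === null || in_array($token, ['OR', 'AND', ')'], true)) {
            throw new RuntimeException('Expected a term or "("');
        }
        $this->pos++;
        if ($token !== '(') {
            return $token;
        }
        $inner = $this->orQuery();
        if ($this->look() !== ')') {
            throw new RuntimeException('Expected ")"');
        }
        $this->pos++;
        return $inner;
    }

    private function look(): ?string
    {
        return $this->tokens[$this->pos] ?? null;
    }
}
```

Parsing the tokens a, AND, b, OR, c gives an OR node holding two AND nodes, the same shape as the tree on slide 29.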
  • 40. But I wanna parse
  • 41. PHP Parsers
      PHP_Depend 1.0.0: PHP 5.4
      PHP-Parser alpha: PHP 5.4
      phc 0.3.0.1 (unmaintained): PHP 5.2 (?)
  • 42. PHPDepend Abstract Syntax Tree example
      $string = "Manuel $Pichler <{$email}>";

      PHP_Depend_Code_ASTString
      |-- ASTLiteral - "Manuel "
      |-- ASTVariable - $Pichler
      |-- ASTLiteral - " <"
      |-- ASTCompoundExpression - {...}
      |   |-- ASTVariable - $email
      |-- ASTLiteral - ">"
  • 43. Resources
  • 44. More resources
      Examples of modern parsers in PHP:
        Twig (Predictive Parser)
        Behat's Gherkin (Predictive Parser)
        Smarty 3 (LALR parser)
      More information:
        Rich Programmer Food, by Steve Yegge
        Let’s Build a Compiler, by Jack Crenshaw
        nathansuniversity.com
        Coursera: Compilers, by Stanford University
        SE-Radio: Episode 182: DSLs
  • 45. QUESTIONS?
      Joind.in: https://joind.in/6257
      Twitter: @relaxnow
      E-mail: boy@ibuildings.nl
      Slideshare: http://slidesha.re/INY43R
      GitHub: https://github.com/relaxnow/QueryLang