Let's build a parser!

https://joind.in/talk/view/6257


Transcript

  • 1. Let’s build a Parser! A short introduction to parsing with PHP. Boy Baukema, June 9th 2012, Amsterdam
  • 2. Source: http://www.sxc.hu/photo/1384894
  • 3. Boy Baukema, Software Engineer @ Ibuildings
  • 4. Reasons for common fear of writing parsers:
    1. Never took compiler class, think it is scary.
    2. Did take compiler [...]
    - Martin Fowler
  • 5. Language cacophony
    Source: http://www.wordle.net/show/wrdl/5292561/Languages_used_in_PHP_Web_Development
  • 6. Lookahead (?=
    Languages
    Parsing QueryLang
    Parsing PHP code
    Resources
  • 7. RegExes: And now you have two problems...
  • 8. Mail::RFC822::Address
    [The slide is filled with a truncated excerpt of the regular expression that Mail::RFC822::Address uses to validate email addresses. It begins (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+ and continues in the same vein for thousands of characters.]
    Source: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
  • 9. Chomsky hierarchy
    Source: http://en.wikipedia.org/wiki/File:Chomsky-hierarchy.svg
  • 10. HTTP 1.1 Accept Header BNF
    Accept = "Accept" ":" #( media-range [ accept-params ] )
    media-range = ( "*/*" | ( type "/" "*" ) | ( type "/" subtype ) ) *( ";" parameter )
    accept-params = ";" "q" "=" qvalue *( accept-extension )
    accept-extension = ";" token [ "=" ( token | quoted-string ) ]
    Source: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
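To make the Accept grammar concrete, here is a simplified sketch of reading such a header value in PHP. The function name `parseAccept()` and the returned array shape are assumptions made for illustration, not code from the talk, and quoted-string parameters are deliberately not handled.

```php
<?php
// Simplified sketch, not from the talk: splits an Accept header value into
// media ranges with their q-values. Quoted-string parameters are ignored,
// which a complete implementation of the BNF would have to handle.
function parseAccept($headerValue)
{
    $ranges = array();
    // #( ... ) in the BNF denotes a comma-separated list
    foreach (explode(',', $headerValue) as $element) {
        $parts = array_map('trim', explode(';', $element));
        $mediaRange = array_shift($parts);
        if ($mediaRange === '') {
            continue; // skip empty list elements
        }
        $q = 1.0; // quality defaults to 1 when no q parameter is given
        foreach ($parts as $param) {
            // Looser than the qvalue rule in the RFC, for brevity
            if (preg_match('/^q=([01](?:\.\d{0,3})?)$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        $ranges[] = array('range' => $mediaRange, 'q' => $q);
    }
    return $ranges;
}

$ranges = parseAccept('text/html;q=0.8, */*;q=0.1, application/json');
```

Splitting on `,` and `;` only works because this sketch ignores quoted strings; as soon as a parameter value may contain those characters, a real tokenizer is needed, which is the point the talk builds towards.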
  • 11. Arithmetic expression BNF
    <expression> ::= <term> | <expression> "+" <term>
    <term> ::= <factor> | <term> "*" <factor>
    <factor> ::= <constant> | <variable> | "(" <expression> ")"
    <variable> ::= "x" | "y" | "z"
    <constant> ::= <digit> | <digit> <constant>
    <digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
    Source: http://en.wikipedia.org/wiki/Syntax_diagram
  • 12. Recursion in BNF
    Production: <constant> ::= <digit> | <digit> <constant>
    <digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
    Terminal: each quoted digit, such as "3"
  • 13. Matching 123
    <constant>
    1 <digit> <constant>
      2 <digit> <constant>
        3 <digit> (<constant> ::= <digit>)
    Source: https://secure.flickr.com/photos/threedots/110586879/
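The recursive rule can be mirrored almost literally in code. The following is a minimal sketch, assuming a hypothetical helper `matchConstant()` that is not part of the talk: it consumes one digit and then recurses for the rest, just as `<constant> ::= <digit> <constant>` does.

```php
<?php
// Hypothetical helper, not from the slides: recognises <constant> at $pos.
// <constant> ::= <digit> | <digit> <constant>
// Returns the matched digits, or null when there is no digit at $pos.
function matchConstant($input, $pos = 0)
{
    // <digit>: a single terminal "0".."9"
    if ($pos >= strlen($input) || strpos('0123456789', $input[$pos]) === false) {
        return null;
    }
    // Try the recursive alternative first: <digit> <constant>
    $rest = matchConstant($input, $pos + 1);
    // Fall back to the non-recursive alternative: <digit>
    return $rest === null ? $input[$pos] : $input[$pos] . $rest;
}

var_dump(matchConstant('123')); // string(3) "123"
var_dump(matchConstant('12a')); // string(2) "12"
var_dump(matchConstant('abc')); // NULL
```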
  • 14. Arithmetic expression EBNF
    expression = term , {"+" , term};
    term = factor , {"*" , factor};
    factor = constant | variable | "(" , expression , ")";
    variable = "x" | "y" | "z";
    constant = digit , {digit};
    digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
    Source: http://en.wikipedia.org/wiki/Syntax_diagram
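The EBNF form maps neatly onto hand-written code: each rule becomes a method and each `{...}` repetition becomes a loop. Below is a sketch of an evaluator built that way. The class name `ArithmeticEvaluator` and the way variable values are passed in are assumptions for illustration; the talk itself does not show such a class.

```php
<?php
// Hypothetical evaluator for the EBNF grammar above (not from the talk).
// Whitespace is not part of the grammar, so the input must not contain any.
class ArithmeticEvaluator
{
    private $input;
    private $pos = 0;
    private $vars;

    public function __construct($input, array $vars = array())
    {
        $this->input = $input;
        $this->vars  = $vars; // values for x, y and z
    }

    public function evaluate()
    {
        $value = $this->expression();
        if ($this->pos !== strlen($this->input)) {
            throw new RuntimeException("Unexpected input at offset {$this->pos}");
        }
        return $value;
    }

    // expression = term , {"+" , term};
    private function expression()
    {
        $value = $this->term();
        while ($this->peek() === '+') {
            $this->pos++;
            $value += $this->term();
        }
        return $value;
    }

    // term = factor , {"*" , factor};
    private function term()
    {
        $value = $this->factor();
        while ($this->peek() === '*') {
            $this->pos++;
            $value *= $this->factor();
        }
        return $value;
    }

    // factor = constant | variable | "(" , expression , ")";
    private function factor()
    {
        $char = $this->peek();
        if ($char === '(') {
            $this->pos++;
            $value = $this->expression();
            if ($this->peek() !== ')') {
                throw new RuntimeException("Expected ')' at offset {$this->pos}");
            }
            $this->pos++;
            return $value;
        }
        // variable = "x" | "y" | "z";
        if ($char !== null && strpos('xyz', $char) !== false) {
            if (!isset($this->vars[$char])) {
                throw new RuntimeException("No value given for variable $char");
            }
            $this->pos++;
            return $this->vars[$char];
        }
        // constant = digit , {digit};
        $length = strspn($this->input, '0123456789', $this->pos);
        if ($length === 0) {
            throw new RuntimeException("Expected a factor at offset {$this->pos}");
        }
        $digits = substr($this->input, $this->pos, $length);
        $this->pos += $length;
        return (int) $digits;
    }

    // The next character, or null at the end of the input
    private function peek()
    {
        return $this->pos < strlen($this->input) ? $this->input[$this->pos] : null;
    }
}

$evaluator = new ArithmeticEvaluator('(2+3)*x', array('x' => 4));
echo $evaluator->evaluate(); // 20
```

Because `term` is called from inside `expression`, multiplication binds tighter than addition without any explicit precedence table; the nesting of the grammar rules does that work.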
  • 15. Parsing Expression Grammar
    expression = term ("+" term)*
    term = factor ("*" factor)*
    factor = constant / variable / "(" expression ")"
    variable = "x" / "y" / "z"
    constant = [0-9]+
    Source: https://secure.flickr.com/photos/sasastro/5590210866/
  • 16. So how will this help me parse a language?
  • 17. Parser Generators for PHP
    Lime-php: LALR(1), 2008, abandoned
    PHP_ParserGenerator: LALR(1), 2010, abandoned
    Loco: combinatory parsing, 2011, alpha
    php-peg: PEG, 2012, active?, alpha
  • 18. QueryLang
    https://github.com/relaxnow/QueryLang
  • 19. QueryLang: Example query
    parsers OR 123 AND (dpc OR phpbnl)
    Query (OR)
    |-- Term - "parsers"
    |-- Query (AND)
        |-- Term - "123"
        |-- Query (OR)
            |-- Term - "dpc"
            |-- Term - "phpbnl"
  • 20. v1/Peg/grammar.peg.inc
    /*!* QueryLangV1
    Term: /[\w\d]+/
    */
    public function parse()
    {
        $match = $this->match_Term();
        if (!$match) {
            return '';
        }
        return $match['text'];
    }
  • 21. v1/Peg/Parser.php - generated match_Term
    /* Term: /[\w\d]+/ */
    protected $match_Term_typestack = array('Term');
    function match_Term ($stack = array()) {
        $matchrule = "Term";
        $result = $this->construct($matchrule, $matchrule, null);
        if (( $subres = $this->rx( '/[\w\d]+/' ) ) !== FALSE) {
            $result["text"] .= $subres;
            return $this->finalise($result);
        }
        else { return FALSE; }
    }
  • 22. v1/Peg/grammar.peg.inc test
    $parser = new Parser('test');
    print_r($parser->parse()); // test
    $parser = new Parser('test 123');
    print_r($parser->parse()); // test
  • 23. v2/Peg/grammar.peg.inc
    /*!* QueryLangV2
    Query: Term (> Term)*
    Term: /[\w\d]+/
    */
    public function parse()
    {
        $result = $this->match_Query();
        return $result['query'];
    }
  • 24. v2/Peg/grammar.peg.inc (cont.)
    public function Query__construct(&$result)
    {
        $result['query'] = new NodeQuery();
    }
    public function Query_Term(&$result, $sub)
    {
        $term = new NodeTerm($sub['text']);
        $result['query']->addTerm($term);
    }
  • 25. v2/Peg/grammar.peg.inc test
    $parser = new Parser('test 123');
    print_r($parser->parse());
    Query
    |-- Term - "test"
    |-- Term - "123"
  • 26. v3/Peg/grammar.peg.inc
    /*!* QueryLangV3
    Query: AndQuery ([ "OR" ] AndQuery)*
    AndQuery: Term ([ "AND" ] Term)*
    Term: "(" Query ")" | Value:/[\w\d]+/
    */
    public function parse()
    {
        $node = $this->match_Query();
        return $node['query'];
    }
  • 27. v3/Peg/grammar.peg.inc (cont.)
    Query: AndQuery ([ "OR" ] AndQuery)*
    AndQuery: Term ([ "AND" ] Term)*
    public function Query__construct(&$r) {
        $r['query'] = new NodeQuery('OR');
    }
    public function Query_AndQuery(&$r, $s) {
        $r['query']->add($s['query']);
    }
    public function AndQuery__construct(&$r) {
        $r['query'] = new NodeQuery('AND');
    }
    public function AndQuery_Term(&$r, $s) {
        $r['query']->add($s['query']);
    }
  • 28. v3/Peg/grammar.peg.inc (cont.)
    /*!* QueryLangV3
    Term: "(" Query ")" | Value:/[\w\d]+/
    */
    public function Term_Query(&$r, $s)
    {
        $r['query'] = $s['query'];
    }
    public function Term_Value(&$r, $s)
    {
        $r['query'] = new NodeTerm($s['text']);
    }
  • 29. v3/Peg/grammar.peg.inc test
    $parser = new Parser('a AND b OR c');
    Query (OR)
    |-- Query (AND)
    |   |-- Term - "a"
    |   |-- Term - "b"
    |-- Query (AND)
        |-- Term - "c"
  • 30. Optional: Optimizer / Semantic checking
  • 31. Optimized query
    $parser = new Parser('a AND b OR c');
    $query = $parser->parse();
    $queryOptimizer = new Optimizer($query);
    $query = $queryOptimizer->optimize();
    Query (OR)
    |-- Term - "c"
    |-- Query (AND)
    |   |-- Term - "a"
    |   |-- Term - "b"
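The slides do not show how the Optimizer works internally. As a rough sketch of what such a pass might do, the function below collapses queries with a single child and merges nested queries that use the same operator. Everything here is an assumption: the name `optimizeQuery()`, and the use of nested arrays of the form `array(operator, children)` in place of the talk's node classes. It also does not reorder children, so "c" stays last rather than moving to the front as on the slide.

```php
<?php
// Hypothetical optimizer over nested arrays of the form
// array(operator, children); plain strings are terms.
// This is a guess at what such a pass could do, not the talk's Optimizer.
function optimizeQuery($node)
{
    if (is_string($node)) {
        return $node; // a term cannot be simplified further
    }
    list($operator, $children) = $node;
    $optimized = array();
    foreach ($children as $child) {
        $child = optimizeQuery($child);
        if (is_array($child) && $child[0] === $operator) {
            // (a OR (b OR c)) means the same as (a OR b OR c): merge
            $optimized = array_merge($optimized, $child[1]);
        } else {
            $optimized[] = $child;
        }
    }
    // A query with exactly one child is equivalent to that child
    return count($optimized) === 1 ? $optimized[0] : array($operator, $optimized);
}

// The tree for "a AND b OR c" as produced by the v3 grammar
$tree = array('OR', array(
    array('AND', array('a', 'b')),
    array('AND', array('c')),
));
print_r(optimizeQuery($tree));
```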
  • 32. Manual Parser Building: Predictive parsing
    Source: http://en.wikipedia.org/wiki/File:PsychicBoston.jpg
  • 33. Manual Parser Building: Lexing
    Characters get turned into tokens by a lexical analyzer, also called a lexer, scanner or tokenizer.
    "a OR (b)" becomes:
    term => "a"
    OR
    LeftParen
    term => "b"
    RightParen
  • 34. Manual Parser Building: Lexing
    if ($this->_match('LeftParen', '/^(\()/')) { continue; }
    if ($this->_match('RightParen', '/^(\))/')) { continue; }
    if ($this->_match('OR', '/^(OR)/i')) { continue; }
    if ($this->_match('AND', '/^(AND)/i')) { continue; }
    if ($this->_match('TermValue', '/^([\w\d]+)/i')) { continue; }
    if ($this->_match('WS', '/^\s+/', true)) { continue; }
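The slide shows only the matching loop body, so here is a self-contained sketch of the same idea. The function name `tokenize()` and the token array shape are assumptions; only the token types and patterns follow the slide. One deliberate difference: the `OR` and `AND` patterns end in `\b`, so that a word such as "ORDER" is lexed as one term rather than as `OR` followed by "DER".

```php
<?php
// Hypothetical stand-alone lexer (not the talk's class). Tries each pattern
// at the start of the remaining input and emits array('type', 'text') pairs.
function tokenize($input)
{
    // Order matters: OR and AND must be tried before the generic TermValue
    $patterns = array(
        'LeftParen'  => '/^\(/',
        'RightParen' => '/^\)/',
        'OR'         => '/^OR\b/i',
        'AND'        => '/^AND\b/i',
        'TermValue'  => '/^[\w\d]+/',
        'WS'         => '/^\s+/',
    );
    $tokens = array();
    while ($input !== '') {
        foreach ($patterns as $type => $pattern) {
            if (preg_match($pattern, $input, $match)) {
                if ($type !== 'WS') { // whitespace is matched but discarded
                    $tokens[] = array('type' => $type, 'text' => $match[0]);
                }
                // Drop the matched text and restart with the remaining input
                $input = (string) substr($input, strlen($match[0]));
                continue 2;
            }
        }
        throw new RuntimeException('Unexpected character: ' . $input[0]);
    }
    return $tokens;
}

foreach (tokenize('a OR (b)') as $token) {
    echo $token['type'], ' ', $token['text'], "\n";
}
// TermValue a
// OR OR
// LeftParen (
// TermValue b
// RightParen )
```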
  • 35. Manual Parser Building: Lexing - UML
    Source: http://commons.wikimedia.org/wiki/File:Willem-Alexander,_Prince_of_Orange.jpg
  • 36. Manual Parser Building: Parsing
    Non-terminals become methods:
    protected function _query();
    protected function _andQuery();
    protected function _term();
    Parse to a tree structure.
  • 37. Manual Parser Building: Parsing - UML
  • 38. Manual Parser Building: example non-terminal
    protected function _query() {
        $query = new NodeQuery('OR');
        $leftTerm = $this->_andQuery();
        $query->add($leftTerm);
        while ($this->_tokenStream->look()->getType() === 'OR') {
            $this->_tokenStream->expect('OR');
            $rightTerm = $this->_andQuery();
            $query->add($rightTerm);
        }
        return $query;
    }
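To show how the three non-terminal methods fit together, here is a compact sketch of a complete predictive parser for the same language. The class name `QueryParser`, the token format (`array(type, text)` pairs) and the use of nested arrays instead of node objects are all assumptions; the talk's own token stream and node classes are not shown in full.

```php
<?php
// Hypothetical predictive parser over a plain token array.
// Tokens are array(type, text); the tree is array(operator, children).
class QueryParser
{
    private $tokens;
    private $pos = 0;

    public function __construct(array $tokens)
    {
        $this->tokens = $tokens;
    }

    public function parse()
    {
        $query = $this->query();
        if ($this->look() !== null) {
            throw new RuntimeException('Unexpected token ' . $this->look());
        }
        return $query;
    }

    // query = andQuery , {"OR" , andQuery};
    private function query()
    {
        $children = array($this->andQuery());
        while ($this->look() === 'OR') {
            $this->expect('OR');
            $children[] = $this->andQuery();
        }
        return array('OR', $children);
    }

    // andQuery = term , {"AND" , term};
    private function andQuery()
    {
        $children = array($this->term());
        while ($this->look() === 'AND') {
            $this->expect('AND');
            $children[] = $this->term();
        }
        return array('AND', $children);
    }

    // term = TermValue | "(" , query , ")";
    private function term()
    {
        if ($this->look() === 'LeftParen') {
            $this->expect('LeftParen');
            $query = $this->query();
            $this->expect('RightParen');
            return $query;
        }
        return $this->expect('TermValue');
    }

    // One token of lookahead: the next token's type, or null at the end
    private function look()
    {
        return isset($this->tokens[$this->pos]) ? $this->tokens[$this->pos][0] : null;
    }

    // Consume the next token if it has the expected type; return its text
    private function expect($type)
    {
        if ($this->look() !== $type) {
            throw new RuntimeException("Expected $type at token {$this->pos}");
        }
        return $this->tokens[$this->pos++][1];
    }
}

// Tokens for "a AND b OR c"
$tokens = array(
    array('TermValue', 'a'), array('AND', 'AND'), array('TermValue', 'b'),
    array('OR', 'OR'), array('TermValue', 'c'),
);
$parser = new QueryParser($tokens);
print_r($parser->parse());
```

Every decision in this parser is made by looking at the type of the single next token, which is what makes it predictive: it never has to back up and try another alternative.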
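  • 39. Predictive Parsing: Warning!
    Tokens must be decidable with a fixed lookahead. Two alternatives that start with the same token cannot be told apart with one token of lookahead:
    <term> ::= <TermValue> "-" <TermValue> | <TermValue> | "(" <Query> ")"
    No left recursion. A rule that starts with itself would make its method call itself without ever consuming a token:
    <orQuery> ::= <orQuery> ("OR" <orQuery>)? | <term>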
  • 40. But I wanna parse
  • 41. PHP Parsers
    PHP_Depend: 1.0.0, PHP 5.4
    PHP-Parser: alpha, PHP 5.4
    phc: 0.3.0.1 (unmaintained), PHP 5.2 (?)
  • 42. PHPDepend Abstract Syntax Tree example
    $string = "Manuel $Pichler <{$email}>";
    PHP_Depend_Code_ASTString
    |-- ASTLiteral - "Manuel "
    |-- ASTVariable - $Pichler
    |-- ASTLiteral - " <"
    |-- ASTCompoundExpression - {...}
    |   |-- ASTVariable - $email
    |-- ASTLiteral - ">"
  • 43. Resources
  • 44. More resources
    Examples of modern parsers in PHP:
    Twig (Predictive Parser)
    Behat's Gherkin (Predictive Parser)
    Smarty 3 (LALR parser)
    More information:
    Rich Programmer Food by Steve Yegge
    Let’s Build a Compiler, by Jack Crenshaw
    nathansuniversity.com
    Coursera: Compilers by Stanford University
    SE-Radio: Episode 182: DSLs
  • 45. QUESTIONS?
    Joind.in: https://joind.in/6257
    Twitter: @relaxnow
    E-mail: boy@ibuildings.nl
    Slideshare: http://slidesha.re/INY43R
    GitHub: https://github.com/relaxnow/QueryLang