Let's build a parser!

Published on

https://joind.in/talk/view/6257

Published in: Technology
  • Let's build a parser!

    1. Let’s build a Parser! A short introduction to parsing with PHP. Boy Baukema, June 9th 2012, Amsterdam
    2. Source: http://www.sxc.hu/photo/1384894
    3. Boy Baukema, Software Engineer @ Ibuildings
    4. Reasons for common fear of writing parsers: 1. Never took compiler class, think it is scary. 2. Did take compiler - Martin Fowler
    5. Language cacophony. Source: http://www.wordle.net/show/wrdl/5292561/Languages_used_in_PHP_Web_Development
    6. Lookahead (?= Languages, Parsing, QueryLang, Parsing PHP code, Resources
    7. RegExes: And now you have two problems...
    8. Mail::RFC822::Address (the slide is filled edge to edge with the RFC 822 e-mail address validation regular expression; the extracted copy lost its backslashes and is cut off mid-pattern, so it is not reproduced here). Source: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
    9. Chomsky hierarchy. Source: http://en.wikipedia.org/wiki/File:Chomsky-hierarchy.svg
    10. HTTP 1.1 Accept Header BNF

            Accept           = "Accept" ":" #( media-range [ accept-params ] )
            media-range      = ( "*/*" | ( type "/" "*" ) | ( type "/" subtype ) ) *( ";" parameter )
            accept-params    = ";" "q" "=" qvalue *( accept-extension )
            accept-extension = ";" token [ "=" ( token | quoted-string ) ]

        Source: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
    11. Arithmetic expression BNF

            <expression> ::= <term> | <expression> "+" <term>
            <term>       ::= <factor> | <term> "*" <factor>
            <factor>     ::= <constant> | <variable> | "(" <expression> ")"
            <variable>   ::= "x" | "y" | "z"
            <constant>   ::= <digit> | <digit> <constant>
            <digit>      ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

        Source: http://en.wikipedia.org/wiki/Syntax_diagram
    12. Recursion in BNF

            <constant> ::= <digit> | <digit> <constant>                                       <- production
            <digit>    ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"          <- terminals
    13. Matching 123

            <constant>
              1 <digit>
              <constant>
                2 <digit>
                <constant>
                  3 <digit>        (<constant> ::= <digit>)

        Source: https://secure.flickr.com/photos/threedots/110586879/
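The derivation of `123` above can be traced in code. The following is a minimal sketch, not code from the talk: the function names `matchDigit` and `matchConstant` are made up for illustration, and the recursive alternative is tried before the single-digit one so that the longest run of digits is consumed.

```php
<?php
// Hypothetical helpers mirroring: <constant> ::= <digit> | <digit> <constant>

/** <digit>: returns the offset just past one digit, or null if there is none. */
function matchDigit(string $input, int $pos): ?int
{
    if ($pos < strlen($input) && strpos('0123456789', $input[$pos]) !== false) {
        return $pos + 1;
    }
    return null;
}

/** <constant>: returns the offset just past the digit run, or null on no match. */
function matchConstant(string $input, int $pos = 0): ?int
{
    $afterDigit = matchDigit($input, $pos);
    if ($afterDigit === null) {
        return null;
    }
    // Try <digit> <constant> first; fall back to the plain <digit> alternative.
    $afterRest = matchConstant($input, $afterDigit);
    return $afterRest ?? $afterDigit;
}

var_dump(matchConstant('123')); // int(3): all three characters matched
var_dump(matchConstant('abc')); // NULL: no leading digit
```

Each recursive call corresponds to one nested `<constant>` in the tree on the slide; the innermost call is the one that ends with the plain `<digit>` alternative.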
    14. Arithmetic expression EBNF

            expression = term , {"+" , term};
            term       = factor , {"*" , factor};
            factor     = constant | variable | "(" , expression , ")";
            variable   = "x" | "y" | "z";
            constant   = digit , {digit};
            digit      = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";

        Source: http://en.wikipedia.org/wiki/Syntax_diagram
    15. Parsing Expression Grammar

            expression = term ("+" term)*
            term       = factor ("*" factor)*
            factor     = constant / variable / "(" expression ")"
            variable   = "x" / "y" / "z"
            constant   = [0-9]+

        Source: https://secure.flickr.com/photos/sasastro/5590210866/
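A grammar in this shape maps almost line for line onto a hand-written recursive descent parser. The sketch below is an illustration only and not part of the talk: the class name `ArithmeticParser` and the idea of passing variable values in as an array are assumptions. It evaluates the expression while parsing instead of building a tree.

```php
<?php
// Illustrative sketch; class name and variable bindings are assumptions.
final class ArithmeticParser
{
    private string $input;
    private int $pos = 0;
    private array $variables;

    public function __construct(string $input, array $variables = [])
    {
        $this->input = str_replace(' ', '', $input); // spaces carry no meaning here
        $this->variables = $variables;
    }

    public function parse(): int
    {
        $value = $this->expression();
        if ($this->pos !== strlen($this->input)) {
            throw new RuntimeException('Unexpected input at offset ' . $this->pos);
        }
        return $value;
    }

    // expression = term ("+" term)*
    private function expression(): int
    {
        $value = $this->term();
        while ($this->accept('+')) {
            $value += $this->term();
        }
        return $value;
    }

    // term = factor ("*" factor)*
    private function term(): int
    {
        $value = $this->factor();
        while ($this->accept('*')) {
            $value *= $this->factor();
        }
        return $value;
    }

    // factor = constant / variable / "(" expression ")"
    private function factor(): int
    {
        if (preg_match('/\G[0-9]+/', $this->input, $m, 0, $this->pos)) { // constant
            $this->pos += strlen($m[0]);
            return (int) $m[0];
        }
        if (preg_match('/\G[xyz]/', $this->input, $m, 0, $this->pos)) {  // variable
            $this->pos++;
            if (!array_key_exists($m[0], $this->variables)) {
                throw new RuntimeException('Unbound variable ' . $m[0]);
            }
            return $this->variables[$m[0]];
        }
        if ($this->accept('(')) {
            $value = $this->expression();
            if (!$this->accept(')')) {
                throw new RuntimeException('Expected ")"');
            }
            return $value;
        }
        throw new RuntimeException('Unexpected input at offset ' . $this->pos);
    }

    // Consumes $char if it is next in the input.
    private function accept(string $char): bool
    {
        if (substr($this->input, $this->pos, 1) === $char) {
            $this->pos++;
            return true;
        }
        return false;
    }
}

echo (new ArithmeticParser('2 + 3 * 4'))->parse(), "\n"; // 14
```

Because `term` is called from inside `expression`, multiplication binds tighter than addition without any explicit precedence table.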
    16. So how will this help me parse a language?
    17. Parser Generators for PHP

            Lime-php              LALR(1), 2008, abandoned
            PHP_ParserGenerator   LALR(1), 2010, abandoned
            Loco                  combinatory parsing, 2011, alpha
            php-peg               PEG, 2012, active?, alpha
    18. QueryLang: https://github.com/relaxnow/QueryLang
    19. QueryLang: Example query

            parsers OR 123 AND (dpc OR phpbnl)

            Query (OR)
            |-- Term - "parsers"
            |-- Query (AND)
                |-- Term - "123"
                |-- Query (OR)
                    |-- Term - "dpc"
                    |-- Term - "phpbnl"
    20. v1/Peg/grammar.peg.inc

            /*!* QueryLangV1
            Term: /[\w\d]+/
            */
            public function parse()
            {
                $match = $this->match_Term();
                if (!$match) {
                    return '';
                }
                return $match['text'];
            }
    21. v1/Peg/Parser.php - generated match_Term

            /* Term: /[\w\d]+/ */
            protected $match_Term_typestack = array('Term');
            function match_Term ($stack = array()) {
                $matchrule = "Term";
                $result = $this->construct($matchrule, $matchrule, null);
                if (( $subres = $this->rx( '/[\w\d]+/' ) ) !== FALSE) {
                    $result["text"] .= $subres;
                    return $this->finalise($result);
                }
                else { return FALSE; }
            }
    22. v1/Peg/grammar.peg.inc test

            $parser = new Parser('test');
            print_r($parser->parse()); // test
            $parser = new Parser('test 123');
            print_r($parser->parse()); // test
    23. v2/Peg/grammar.peg.inc

            /*!* QueryLangV2
            Query: Term (> Term)*
            Term: /[\w\d]+/
            */
            public function parse()
            {
                $result = $this->match_Query();
                return $result['query'];
            }
    24. v2/Peg/grammar.peg.inc (cont.)

            public function Query__construct(&$result)
            {
                $result['query'] = new NodeQuery();
            }
            public function Query_Term(&$result, $sub)
            {
                $term = new NodeTerm($sub['text']);
                $result['query']->addTerm($term);
            }
    25. v2/Peg/grammar.peg.inc test

            $parser = new Parser('test 123');
            print_r($parser->parse());

            Query
            |-- Term - "test"
            |-- Term - "123"
    26. v3/Peg/grammar.peg.inc

            /*!* QueryLangV3
            Query: AndQuery ([ "OR" ] AndQuery)*
            AndQuery: Term ([ "AND" ] Term)*
            Term: "(" Query ")" | Value:/[\w\d]+/
            */
            public function parse()
            {
                $node = $this->match_Query();
                return $node['query'];
            }
    27. v3/Peg/grammar.peg.inc (cont.)

            Query: AndQuery ([ "OR" ] AndQuery)*
            AndQuery: Term ([ "AND" ] Term)*

            public function Query__construct(&$r) {
                $r['query'] = new NodeQuery('OR');
            }
            public function Query_AndQuery(&$r, $s) {
                $r['query']->add($s['query']);
            }
            public function AndQuery__construct(&$r) {
                $r['query'] = new NodeQuery('AND');
            }
            public function AndQuery_Term(&$r, $s) {
                $r['query']->add($s['query']);
            }
    28. v3/Peg/grammar.peg.inc (cont.)

            /*!* QueryLangV3
            Term: "(" Query ")" | Value:/[\w\d]+/
            */
            public function Term_Query(&$r, $s)
            {
                $r['query'] = $s['query'];
            }
            public function Term_Value(&$r, $s)
            {
                $r['query'] = new NodeTerm($s['text']);
            }
    29. v3/Peg/grammar.peg.inc test

            $parser = new Parser('a AND b OR c');

            Query (OR)
            |-- Query (AND)
            |   |-- Term - "a"
            |   |-- Term - "b"
            |-- Query (AND)
                |-- Term - "c"
    30. Optional: Optimizer / Semantic checking
    31. Optimized query

            $parser = new Parser('a AND b OR c');
            $query = $parser->parse();
            $queryOptimizer = new Optimizer($query);
            $query = $queryOptimizer->optimize();

            Query (OR)
            |-- Term - "c"
            |-- Query (AND)
            |   |-- Term - "a"
            |   |-- Term - "b"
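The slides do not show how `Optimizer` works, so the following is only a guess at a transformation that would turn the tree from the v3 test into the optimized tree. It assumes two rules: a query with a single child is replaced by that child, and plain terms are listed before nested queries. For brevity the tree is modelled as nested arrays (`['OR', [...children]]`, `['Term', 'a']`) rather than the talk's node classes, and the function name `optimize` is hypothetical.

```php
<?php
// Hypothetical sketch; array-based nodes stand in for the talk's node classes.
function optimize(array $node): array
{
    if ($node[0] === 'Term') {
        return $node; // leaves are already minimal
    }
    $children = array_map('optimize', $node[1]);
    if (count($children) === 1) {
        return $children[0]; // a query with one child is just that child
    }
    // Keep relative order, but list plain terms before nested queries.
    $terms   = array_filter($children, fn ($child) => $child[0] === 'Term');
    $queries = array_filter($children, fn ($child) => $child[0] !== 'Term');
    return [$node[0], array_merge($terms, $queries)];
}

// Tree for "a AND b OR c" as produced by the v3 grammar.
$tree = ['OR', [
    ['AND', [['Term', 'a'], ['Term', 'b']]],
    ['AND', [['Term', 'c']]],
]];
print_r(optimize($tree));
```

With this input the single-child `AND` around `"c"` disappears and the term moves ahead of the remaining `AND` query, which matches the shape shown on the slide.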
    32. Manual Parser Building: Predictive parsing. Source: http://en.wikipedia.org/wiki/File:PsychicBoston.jpg
    33. Manual Parser Building: Lexing. Characters get turned into tokens by a lexical analyzer. Also called lexer, scanner or tokenizer.

            "a OR (b)"

            term => "a"
            OR
            LeftParen
            term => "b"
            RightParen
    34. Manual Parser Building: Lexing

            if ($this->_match('LeftParen', '/^(\()/')) { continue; }
            if ($this->_match('RightParen', '/^(\))/')) { continue; }
            if ($this->_match('OR', '/^(OR)/i')) { continue; }
            if ($this->_match('AND', '/^(AND)/i')) { continue; }
            if ($this->_match('TermValue', '/^([\w\d]+)/i')) { continue; }
            if ($this->_match('WS', '/^\s+/', true)) { continue; }
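The chain of `_match` calls above can be sketched as a standalone function. This is an approximation rather than the QueryLang lexer: the function name `tokenize` and the array-based token format are assumptions, and a word boundary (`\b`) has been added to the `OR` and `AND` patterns so that a word such as `android` is not split into `AND` plus `roid`.

```php
<?php
// Sketch of a regex-driven lexer; token type names follow the slide.
function tokenize(string $input): array
{
    // Order matters: keywords must be tried before the generic TermValue.
    $patterns = [
        'LeftParen'  => '/^\(/',
        'RightParen' => '/^\)/',
        'OR'         => '/^OR\b/i',
        'AND'        => '/^AND\b/i',
        'TermValue'  => '/^[\w\d]+/',
        'WS'         => '/^\s+/',
    ];
    $tokens = [];
    while ($input !== '') {
        foreach ($patterns as $type => $pattern) {
            if (preg_match($pattern, $input, $m)) {
                if ($type !== 'WS') { // whitespace is matched but not emitted
                    $tokens[] = ['type' => $type, 'value' => $m[0]];
                }
                $input = (string) substr($input, strlen($m[0]));
                continue 2; // restart matching on the remaining input
            }
        }
        throw new InvalidArgumentException('Unexpected character: ' . $input[0]);
    }
    return $tokens;
}

print_r(array_column(tokenize('a OR (b)'), 'type'));
```

For `a OR (b)` this yields the same sequence as the lexing slide: a term, `OR`, `LeftParen`, a term, `RightParen`.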
    35. Manual Parser Building: Lexing - UML. Source: http://commons.wikimedia.org/wiki/File:Willem-Alexander,_Prince_of_Orange.jpg
    36. Manual Parser Building: Parsing. Non-terminals become methods:

            protected function _query();
            protected function _andQuery();
            protected function _term();

        Parse to a tree structure.
    37. Manual Parser Building: Parsing - UML
    38. Manual Parser Building: example non-terminal

            protected function _query()
            {
                $query = new NodeQuery('OR');
                $leftTerm = $this->_andQuery();
                $query->add($leftTerm);
                while ($this->_tokenStream->look()->getType() === 'OR') {
                    $this->_tokenStream->expect('OR');
                    $rightTerm = $this->_andQuery();
                    $query->add($rightTerm);
                }
                return $query;
            }
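To show how the three non-terminal methods fit together, here is a compact sketch of a complete predictive parser for the v3 grammar. It is not the QueryLang implementation: the class name `QueryParser` is made up, tokens are plain `[type, value]` pairs instead of token objects, and the tree is built from nested arrays instead of node classes.

```php
<?php
// Hypothetical sketch of a predictive (recursive descent) parser.
final class QueryParser
{
    private array $tokens;
    private int $pos = 0;

    public function __construct(array $tokens)
    {
        $this->tokens = $tokens;
    }

    public function parse(): array
    {
        $tree = $this->query();
        if ($this->look() !== null) {
            throw new RuntimeException('Unexpected token ' . $this->look());
        }
        return $tree;
    }

    // Type of the next token, or null at end of input (one token of lookahead).
    private function look(): ?string
    {
        return $this->tokens[$this->pos][0] ?? null;
    }

    // Consumes the next token if it has the given type and returns its value.
    private function expect(string $type): string
    {
        if ($this->look() !== $type) {
            throw new RuntimeException('Expected ' . $type);
        }
        return $this->tokens[$this->pos++][1];
    }

    // Query: AndQuery ("OR" AndQuery)*
    private function query(): array
    {
        $children = [$this->andQuery()];
        while ($this->look() === 'OR') {
            $this->expect('OR');
            $children[] = $this->andQuery();
        }
        return ['OR', $children];
    }

    // AndQuery: Term ("AND" Term)*
    private function andQuery(): array
    {
        $children = [$this->term()];
        while ($this->look() === 'AND') {
            $this->expect('AND');
            $children[] = $this->term();
        }
        return ['AND', $children];
    }

    // Term: "(" Query ")" | Value
    private function term(): array
    {
        if ($this->look() === 'LeftParen') {
            $this->expect('LeftParen');
            $query = $this->query();
            $this->expect('RightParen');
            return $query;
        }
        return ['Term', $this->expect('TermValue')];
    }
}

// Tokens for "a AND b OR c", written out by hand for this example.
$tokens = [['TermValue', 'a'], ['AND', 'AND'], ['TermValue', 'b'], ['OR', 'OR'], ['TermValue', 'c']];
print_r((new QueryParser($tokens))->parse());
```

Every decision is made by inspecting only the next token type, which is what makes the parser predictive.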
    39. Predictive Parsing: Warning!

        Tokens must be decidable with a fixed lookahead:

            <term> ::= <TermValue> "-" <TermValue> | <TermValue> | "(" <Query> ")"

        No left recursion:

            <orQuery> ::= <orQuery> ("OR" <orQuery>)? | <term>
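The first warning can be made concrete. In the `<term>` rule above, two alternatives both start with `<TermValue>`, so a single token of lookahead cannot choose between them. One common remedy, assumed here and not taken from the talk, is to left-factor the rule into `<TermValue> ( "-" <TermValue> )?` and decide after the shared prefix has been consumed. The token type `Minus`, the node name `Range`, and both function names are invented for this sketch, and the parenthesised alternative is omitted.

```php
<?php
// Hypothetical sketch of the left-factored rule:
//   <term> ::= <TermValue> ( "-" <TermValue> )?
function parseTerm(array $tokens, int &$pos): array
{
    $left = expectValue($tokens, $pos);
    // The shared prefix is consumed; one token of lookahead now decides.
    if (($tokens[$pos][0] ?? null) === 'Minus') {
        $pos++;
        return ['Range', $left, expectValue($tokens, $pos)];
    }
    return ['Term', $left];
}

// Consumes one TermValue token and returns its text.
function expectValue(array $tokens, int &$pos): string
{
    if (($tokens[$pos][0] ?? null) !== 'TermValue') {
        throw new RuntimeException('Expected TermValue at token ' . $pos);
    }
    return $tokens[$pos++][1];
}

$pos = 0;
print_r(parseTerm([['TermValue', 'a'], ['Minus', '-'], ['TermValue', 'b']], $pos));
```

The left-recursive `<orQuery>` rule is handled in the same spirit: a method that begins by calling itself would never consume a token, so the recursion is replaced by a loop, as in the `_query()` method shown earlier.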
    40. But I wanna parse
    41. PHP Parsers

            PHP_Depend   1.0.0                    PHP 5.4
            PHP-Parser   alpha                    PHP 5.4
            phc          0.3.0.1 (unmaintained)   PHP 5.2 (?)
    42. PHPDepend Abstract Syntax Tree example

            $string = "Manuel $Pichler <{$email}>";

            PHP_Depend_Code_ASTString
            |-- ASTLiteral - "Manuel "
            |-- ASTVariable - $Pichler
            |-- ASTLiteral - " <"
            |-- ASTCompoundExpression - {...}
            |   |-- ASTVariable - $email
            |-- ASTLiteral - ">"
    43. Resources
    44. More resources

        Examples of modern parsers in PHP:
          - Twig (Predictive Parser)
          - Behat's Gherkin (Predictive Parser)
          - Smarty 3 (LALR parser)

        More information:
          - Rich Programmer Food, by Steve Yegge
          - Let’s Build a Compiler, by Jack Crenshaw
          - nathansuniversity.com
          - Coursera: Compilers by Stanford University
          - SE-Radio: Episode 182: DSLs
    45. QUESTIONS?

        Joind.in: https://joind.in/6257
        Twitter: @relaxnow
        E-mail: boy@ibuildings.nl
        Slideshare: http://slidesha.re/INY43R
        GitHub: https://github.com/relaxnow/QueryLang
