Let’s build a Parser!
A short introduction to parsing with PHP

                                   Boy Baukema
                      June 9th 2012, Amsterdam
Source: http://www.sxc.hu/photo/1384894
Boy Baukema
Software Engineer @ Ibuildings




Reasons for common fear of writing parsers:
1. Never took compiler class, think it is scary.
2. Did take compiler
- Martin Fowler

Language cacophony




Source: http://www.wordle.net/show/wrdl/5292561/
     Languages_used_in_PHP_Web_Development
Lookahead (?=

   Languages

   Parsing

   QueryLang

   Parsing PHP code

   Resources



RegExes
And now you have two problems...




Mail::RFC822::Address
(?:(?:rn)?[ t])*(?:(?:(?:[^()<>@,;:".[] 000-031]+(?:(?:(?:rn)?[ t]
)+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rn)?[ t]))*"(?:(?:
rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-031]+(?:(?:(
?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rn)?[
t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-0
31]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*
](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-031]+
(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*](?:
(?:rn)?[ t])*))*|(?:[^()<>@,;:".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z
|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rn)?[ t]))*"(?:(?:rn)
?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;:".[] 000-031]+(?:(?:(?:
rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*](?:(?:rn)?[
 t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-031]+(?:(?:(?:rn)
?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*](?:(?:rn)?[ t]
)*))*(?:,@(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-031]+(?:(?:(?:rn)?[
 t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*](?:(?:rn)?[ t])*
)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-031]+(?:(?:(?:rn)?[ t]
)+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*](?:(?:rn)?[ t])*))*)
*:(?:(?:rn)?[ t])*)?(?:[^()<>@,;:".[] 000-031]+(?:(?:(?:rn)?[ t])+
|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rn)?[ t]))*"(?:(?:r
n)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-031]+(?:(?:(?:
rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rn)?[ t
]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-031
]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*](
?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>@,;:".[] 000-031]+(?
:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([^[]r]|.)*](?:(?
:rn)?[ t])*))*>(?:(?:rn)?[ t])*)|(?:[^()<>@,;:".[] 000-031]+(?:(?
:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(?:(?:rn)?
[ t]))*"(?:(?:rn)?[ t])*)*:(?:(?:rn)?[ t])*(?:(?:(?:[^()<>@,;:".[]
000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|
.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?:[^()<>
@,;:".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|"
(?:[^"r]|.|(?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*))*@(?:(?:rn)?[ t]
)*(?:[^()<>@,;:".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:
".[]]))|[([^[]r]|.)*](?:(?:rn)?[ t])*)(?:.(?:(?:rn)?[ t])*(?
:[^()<>@,;:".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[
]]))|[([^[]r]|.)*](?:(?:rn)?[ t])*))*|(?:[^()<>@,;:".[] 000-
031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|"(?:[^"r]|.|(
?:(?:rn)?[ t]))*"(?:(?:rn)?[ t])*)*<(?:(?:rn)?[ t])*(?:@(?:[^()<>@,;
:".[] 000-031]+(?:(?:(?:rn)?[ t])+|Z|(?=[["()<>@,;:".[]]))|[([
Source: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html
Chomsky hierarchy




Source: http://en.wikipedia.org/wiki/File:Chomsky-hierarchy.svg
HTTP 1.1 Accept Header BNF
Accept        = "Accept" ":"
         #( media-range [ accept-params ] )

media-range = ( "*/*"
         | ( type "/" "*" )
         | ( type "/" subtype )
         ) *( ";" parameter )

accept-params = ";" "q" "=" qvalue
       *( accept-extension )

accept-extension = ";" token
       [ "=" ( token | quoted-string ) ]
Source: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
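
As a rough illustration of how such a grammar maps onto code, here is a minimal sketch that splits a simplified Accept header into media ranges and q-values. The function name parseAccept is hypothetical, and the sketch assumes no quoted-string values containing "," or ";", so it covers only part of the BNF above.

```php
<?php
// Hypothetical helper, not part of any library.
// Assumption: parameter values never contain "," or ";" (no quoted-strings).
function parseAccept($header)
{
    $ranges = array();
    foreach (explode(',', $header) as $element) {
        // media-range *( ";" parameter ) [ accept-params ]
        $parts      = array_map('trim', explode(';', $element));
        $mediaRange = array_shift($parts);

        // media-range = "*/*" | type "/" "*" | type "/" subtype
        if (!preg_match('~^(\*/\*|[\w.+-]+/(\*|[\w.+-]+))$~', $mediaRange)) {
            continue; // skip anything that does not look like a media range
        }

        $q = 1.0; // RFC 2616 default when no q parameter is given
        foreach ($parts as $param) {
            // accept-params = ";" "q" "=" qvalue (matched loosely here)
            if (preg_match('/^q\s*=\s*([01](\.\d{1,3})?)$/', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        $ranges[] = array('range' => $mediaRange, 'q' => $q);
    }
    return $ranges;
}

print_r(parseAccept('text/html;q=0.8, */*;q=0.1, application/json'));
```

Splitting on "," and ";" only works because of the simplifying assumption; a header with quoted parameter values would need a real tokenizer.
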
Arithmetic expression BNF

<expression> ::= <term>
         | <expression> "+" <term>
<term>        ::= <factor>
         | <term> "*" <factor>
<factor>     ::= <constant>
         | <variable>
         | "(" <expression> ")"
<variable> ::= "x" | "y" | "z"
<constant> ::= <digit>
         | <digit> <constant>
<digit>     ::= "0" | "1" | "2" | "3"
         | "4" | "5" | "6" | "7"
         | "8" | "9"
    Source: http://en.wikipedia.org/wiki/Syntax_diagram
Recursion in BNF

Production:
<constant> ::= <digit>
         | <digit> <constant>

Terminal:
<digit>       ::= "0" | "1" | "2" | "3"
           | "4" | "5" | "6" | "7"
           | "8" | "9"



Matching 123

    <constant>
1    <digit>
     <constant>
 2    <digit>
      <constant>
  3     <digit>




<constant> ::= <digit>

Source: https://secure.flickr.com/photos/threedots/110586879/
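
The recursive match of 123 can be sketched directly in PHP. The function names matchDigit and matchConstant are made up for this illustration; each one takes the input and a position and returns the position after the match, or false.

```php
<?php
// <digit> ::= "0" | "1" | ... | "9"   (consumes exactly one character)
function matchDigit($input, $pos)
{
    if ($pos < strlen($input) && ctype_digit($input[$pos])) {
        return $pos + 1;
    }
    return false;
}

// <constant> ::= <digit> | <digit> <constant>
// Returns the position after the longest constant starting at $pos, or false.
function matchConstant($input, $pos = 0)
{
    $afterDigit = matchDigit($input, $pos);
    if ($afterDigit === false) {
        return false;                                  // not even one digit
    }
    $afterRest = matchConstant($input, $afterDigit);   // try <digit> <constant>
    return $afterRest !== false ? $afterRest : $afterDigit; // else plain <digit>
}

var_dump(matchConstant('123'));
```

Each call consumes one digit and then asks itself for the rest, which is the same nesting the slide draws for 1, 2 and 3.
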
Arithmetic expression EBNF

   expression = term , {"+" , term};
   term      = factor , {"*" , factor};
   factor    = constant
          | variable
          | "(" , expression , ")";
   variable = "x"
          | "y"
          | "z";
   constant = digit , {digit};
   digit    = "0" | "1" | "2" | "3"
          | "4" | "5" | "6" | "7"
          | "8" | "9";
    Source: http://en.wikipedia.org/wiki/Syntax_diagram
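
One way to read the EBNF above: every rule becomes a method and every { ... } repetition becomes a while loop. The sketch below is a hand-written evaluator along those lines. The class name ArithmeticParser, the choice to ignore spaces, and passing variable values in as an array are all assumptions made for the example.

```php
<?php
// Hypothetical class for illustration only.
class ArithmeticParser
{
    private $input;
    private $pos = 0;
    private $vars;

    public function __construct($input, array $vars = array())
    {
        $this->input = str_replace(' ', '', $input); // assumption: spaces are ignored
        $this->vars  = $vars;
    }

    public function evaluate()
    {
        $value = $this->expression();
        if ($this->pos !== strlen($this->input)) {
            throw new RuntimeException("Unexpected input at offset {$this->pos}");
        }
        return $value;
    }

    // expression = term , {"+" , term};
    private function expression()
    {
        $value = $this->term();
        while ($this->accept('+')) {
            $value += $this->term();
        }
        return $value;
    }

    // term = factor , {"*" , factor};
    private function term()
    {
        $value = $this->factor();
        while ($this->accept('*')) {
            $value *= $this->factor();
        }
        return $value;
    }

    // factor = constant | variable | "(" , expression , ")";
    private function factor()
    {
        if ($this->accept('(')) {
            $value = $this->expression();
            if (!$this->accept(')')) {
                throw new RuntimeException('Expected ")"');
            }
            return $value;
        }
        // constant = digit , {digit};
        if (preg_match('/\G[0-9]+/', $this->input, $m, 0, $this->pos)) {
            $this->pos += strlen($m[0]);
            return (int) $m[0];
        }
        // variable = "x" | "y" | "z";
        $char = substr($this->input, $this->pos, 1);
        if (in_array($char, array('x', 'y', 'z'), true) && isset($this->vars[$char])) {
            $this->pos++;
            return $this->vars[$char];
        }
        throw new RuntimeException("Unexpected input at offset {$this->pos}");
    }

    // Consume $char if it is the next character.
    private function accept($char)
    {
        if (substr($this->input, $this->pos, 1) === $char) {
            $this->pos++;
            return true;
        }
        return false;
    }
}

$parser = new ArithmeticParser('2+3*(x+1)', array('x' => 4));
var_dump($parser->evaluate());
```

Because term() is called from expression(), "*" binds tighter than "+" without any explicit precedence table.
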
Parsing Expression Grammar

   expression = term ("+" term)*
   term     = factor ("*" factor)*
   factor   = constant
          / variable
          / "(" expression ")"
   variable = "x"
          / "y"
          / "z"
   constant = [0-9]+


Source: https://secure.flickr.com/photos/sasastro/5590210866/
So how will this help
me parse a language?



Parser Generators for PHP
   Lime-php
   LALR(1), 2008, abandoned

   PHP_ParserGenerator
   LALR(1), 2010, abandoned

   Loco
   combinatory parsing, 2011, alpha

   php-peg
   PEG, 2012, active?, alpha
QueryLang
https://github.com/relaxnow/QueryLang




QueryLang: Example query

 parsers OR 123 AND (dpc OR phpbnl)

Query (OR)
|-- Term        -   "parsers"
|-- Query (AND)
   |-- Term     -   "123"
   |-- Query (OR)
      |-- Term -    "dpc"
      |-- Term -    "phpbnl"




v1/Peg/grammar.peg.inc

  /*!* QueryLangV1
  Term: /[\w\d]+/
  */

  public function parse()
  {
    $match = $this->match_Term();
    if (!$match) {
        return '';
    }
    return $match['text'];
  }
v1/Peg/Parser.php - generated match_Term

  /* Term: /[\w\d]+/ */
  protected $match_Term_typestack = array('Term');
  function match_Term ($stack = array()) {
    $matchrule = "Term";
    $result = $this->construct($matchrule, $matchrule, null);
    if (( $subres = $this->rx( '/[\w\d]+/' ) ) !== FALSE) {
      $result["text"] .= $subres;
      return $this->finalise($result);
    }
    else { return FALSE; }
  }
v1/Peg/grammar.peg.inc test



    $parser = new Parser('test');
    print_r($parser->parse());
    // test

    $parser = new Parser('test 123');
    print_r($parser->parse());
    // test




v2/Peg/grammar.peg.inc

  /*!* QueryLangV2
  Query: Term (> Term)*
  Term: /[\w\d]+/
  */

  public function parse()
  {
    $result = $this->match_Query();
    return $result['query'];
  }



v2/Peg/grammar.peg.inc (cont.)



public function Query__construct(&$result)
{
  $result['query'] = new NodeQuery();
}

public function Query_Term(&$result, $sub)
{
  $term = new NodeTerm($sub['text']);
  $result['query']->addTerm($term);
}

v2/Peg/grammar.peg.inc test



    $parser = new Parser('test 123');
    print_r($parser->parse());

    Query
    |-- Term        - "test"
    |-- Term        - "123"




v3/Peg/grammar.peg.inc

  /*!* QueryLangV3
  Query: AndQuery ([ "OR" ] AndQuery)*
  AndQuery: Term ([ "AND" ] Term)*
  Term: "(" Query ")" | Value:/[\w\d]+/
  */

  public function parse()
  {
    $node = $this->match_Query();
    return $node['query'];
  }


v3/Peg/grammar.peg.inc (cont.)

  Query: AndQuery ([ "OR" ] AndQuery)*
  AndQuery: Term ([ "AND" ] Term)*
public function Query__construct(&$r) {
  $r['query'] = new NodeQuery('OR');
}
public function Query_AndQuery(&$r, $s) {
  $r['query']->add($s['query']);
}
public function AndQuery__construct(&$r) {
  $r['query'] = new NodeQuery('AND');
}
public function AndQuery_Term(&$r, $s) {
  $r['query']->add($s['query']);
}
v3/Peg/grammar.peg.inc (cont.)

  /*!* QueryLangV3
  Term: "(" Query ")" | Value:/[\w\d]+/
  */

public function Term_Query(&$r, $s){
  $r['query'] = $s['query'];
}

public function Term_Value(&$r, $s){
  $r['query']= new NodeTerm($s['text']);
}


v3/Peg/grammar.peg.inc test



   $parser = new Parser('a AND b OR c');

   Query (OR)
   |-- Query (AND)
   | |-- Term      - "a"
   | |-- Term      - "b"
   |-- Query (AND)
      |-- Term     - "c"
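
The grammar callbacks build a tree out of NodeQuery and NodeTerm objects. The slides only show how those classes are used, so the bodies below are assumptions, and renderTree() is a hypothetical helper that prints a tree in roughly the style used on these slides.

```php
<?php
// Class names come from the slides; the bodies here are guesses.
class NodeTerm
{
    private $value;

    public function __construct($value) { $this->value = $value; }
    public function getValue()          { return $this->value; }
}

class NodeQuery
{
    private $operator;
    private $children = array();

    public function __construct($operator) { $this->operator = $operator; }
    public function add($node)              { $this->children[] = $node; }
    public function getOperator()           { return $this->operator; }
    public function getChildren()           { return $this->children; }
}

// Hypothetical helper: renders a node and its children as indented text.
function renderTree($node, $depth = 0)
{
    $prefix = $depth === 0 ? '' : str_repeat('   ', $depth - 1) . '|-- ';
    if ($node instanceof NodeTerm) {
        return $prefix . 'Term - "' . $node->getValue() . '"' . "\n";
    }
    $out = $prefix . 'Query (' . $node->getOperator() . ")\n";
    foreach ($node->getChildren() as $child) {
        $out .= renderTree($child, $depth + 1);
    }
    return $out;
}

// The tree for 'a AND b OR c', built by hand instead of by the parser.
$left = new NodeQuery('AND');
$left->add(new NodeTerm('a'));
$left->add(new NodeTerm('b'));
$right = new NodeQuery('AND');
$right->add(new NodeTerm('c'));
$root = new NodeQuery('OR');
$root->add($left);
$root->add($right);

echo renderTree($root);
```

Note that the lone term "c" still sits inside its own AND query, because the AndQuery rule always constructs a node even when there is nothing to combine.
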



Optional: Optimizer / Semantic checking




Optimized query

$parser = new Parser('a AND b OR c');
$query = $parser->parse();

$queryOptimizer = new Optimizer($query);
$query = $queryOptimizer->optimize();

Query (OR)
|-- Term        - "c"
|-- Query (AND)
| |-- Term      - "a"
| |-- Term      - "b"
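
The slides do not show what the Optimizer does internally. The sketch below is one guess that happens to reproduce the tree above: it unwraps queries with a single child, merges nested queries that share an operator, and lists bare terms before sub-queries. The node classes are minimal stand-ins with public properties, used only to keep the example short.

```php
<?php
// Minimal stand-ins for the slides' node classes (bodies are assumptions).
class NodeTerm
{
    public $value;
    public function __construct($value) { $this->value = $value; }
}

class NodeQuery
{
    public $operator;
    public $children = array();
    public function __construct($operator) { $this->operator = $operator; }
    public function add($node) { $this->children[] = $node; }
}

// The class name and optimize() come from the slides; the rules are guesses.
class Optimizer
{
    private $query;

    public function __construct(NodeQuery $query) { $this->query = $query; }

    public function optimize()
    {
        return $this->simplify($this->query);
    }

    private function simplify($node)
    {
        if ($node instanceof NodeTerm) {
            return $node;
        }
        $children = array();
        foreach ($node->children as $child) {
            $child = $this->simplify($child);
            if ($child instanceof NodeQuery && $child->operator === $node->operator) {
                // (a OR (b OR c)) becomes (a OR b OR c)
                $children = array_merge($children, $child->children);
            } else {
                $children[] = $child;
            }
        }
        if (count($children) === 1) {
            return $children[0]; // a query around a single child adds nothing
        }
        $result = new NodeQuery($node->operator);
        foreach ($children as $child) {          // bare terms first
            if ($child instanceof NodeTerm) { $result->add($child); }
        }
        foreach ($children as $child) {          // then sub-queries
            if ($child instanceof NodeQuery) { $result->add($child); }
        }
        return $result;
    }
}

// Parse tree for 'a AND b OR c', built by hand.
$and1 = new NodeQuery('AND');
$and1->add(new NodeTerm('a'));
$and1->add(new NodeTerm('b'));
$and2 = new NodeQuery('AND');
$and2->add(new NodeTerm('c'));
$or = new NodeQuery('OR');
$or->add($and1);
$or->add($and2);

$optimizer = new Optimizer($or);
$optimized = $optimizer->optimize();
print_r($optimized);
```

Keeping this as a separate pass means the grammar callbacks stay trivial and the rewriting rules can be tested on hand-built trees.
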


Manual Parser Building: Predictive parsing




     Source: http://en.wikipedia.org/wiki/File:PsychicBoston.jpg
Manual Parser Building: Lexing

A lexical analyzer (also called a lexer, scanner or
tokenizer) turns characters into tokens.

"a OR (b)"

'term'      => "a"
'OR'
'LeftParen'
'term'      => "b"
'RightParen'
Manual Parser Building: Lexing

if ($this->_match('LeftParen', '/^(\()/')) {continue;}
if ($this->_match('RightParen', '/^(\))/')) {continue;}
if ($this->_match('OR', '/^(OR)/i')) {continue;}
if ($this->_match('AND', '/^(AND)/i')) {continue;}
if ($this->_match('TermValue', '/^([\w\d]+)/i')) {continue;}
if ($this->_match('WS', '/^\s+/', true)) {continue;}
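
Here is a minimal sketch of a lexer built around those _match() calls, with the patterns written with their backslash escapes. The Lexer class, its tokenize() method and the token array layout are assumptions. The \b after OR and AND is also an addition, so that a word such as "ORANGE" stays a single TermValue.

```php
<?php
// Hypothetical lexer; only the token names and the _match() calls
// are taken from the slides.
class Lexer
{
    private $input;
    private $tokens = array();

    public function tokenize($input)
    {
        $this->input  = $input;
        $this->tokens = array();
        while ($this->input !== '') {
            if ($this->_match('LeftParen', '/^(\()/')) { continue; }
            if ($this->_match('RightParen', '/^(\))/')) { continue; }
            if ($this->_match('OR', '/^(OR)\b/i')) { continue; }
            if ($this->_match('AND', '/^(AND)\b/i')) { continue; }
            if ($this->_match('TermValue', '/^([\w\d]+)/i')) { continue; }
            if ($this->_match('WS', '/^\s+/', true)) { continue; }
            throw new RuntimeException('Unexpected character: ' . $this->input[0]);
        }
        return $this->tokens;
    }

    // Try $regex at the start of the remaining input. On success, record a
    // token (unless $skip is true) and drop the matched text from the input.
    private function _match($type, $regex, $skip = false)
    {
        if (!preg_match($regex, $this->input, $matches)) {
            return false;
        }
        if (!$skip) {
            $this->tokens[] = array('type' => $type, 'value' => $matches[1]);
        }
        $this->input = (string) substr($this->input, strlen($matches[0]));
        return true;
    }
}

$lexer = new Lexer();
print_r($lexer->tokenize('a OR (b)'));
```

The order of the checks matters: OR and AND have to be tried before TermValue, otherwise every keyword would be lexed as a plain term.
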




Manual Parser Building: Lexing - UML




     Source: http://commons.wikimedia.org/wiki/File:Willem-Alexander,_Prince_of_Orange.jpg
Manual Parser Building: Parsing

Non-terminals become methods

protected function _query();
protected function _andQuery();
protected function _term();
Parse to a tree structure.




Manual Parser Building: Parsing - UML




Manual Parser Building: example non-terminal
protected function _query() {
 $query = new NodeQuery('OR');

    $leftTerm = $this->_andQuery();
    $query->add($leftTerm);

    while($this->_tokenStream->look()->getType()
         === 'OR') {
      $this->_tokenStream->expect('OR');
      $rightTerm = $this->_andQuery();
      $query->add($rightTerm);
    }
    return $query;
}
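
To show how _query() fits together with the other non-terminals, here is a self-contained sketch of a whole predictive parser. Only look(), expect(), getType() and the three method names appear in the slides. The Token, TokenStream and ManualParser classes, the EOF sentinel, and the use of nested arrays instead of NodeQuery objects are assumptions made to keep the example short.

```php
<?php
// Hypothetical supporting classes, written for this example.
class Token
{
    private $type;
    private $value;
    public function __construct($type, $value = null)
    {
        $this->type  = $type;
        $this->value = $value;
    }
    public function getType()  { return $this->type; }
    public function getValue() { return $this->value; }
}

class TokenStream
{
    private $tokens;
    private $pos = 0;
    public function __construct(array $tokens)
    {
        $tokens[]     = new Token('EOF'); // sentinel, so look() never runs off the end
        $this->tokens = $tokens;
    }
    // Peek at the current token without consuming it.
    public function look()
    {
        return $this->tokens[$this->pos];
    }
    // Consume the current token, or fail if it has the wrong type.
    public function expect($type)
    {
        $token = $this->look();
        if ($token->getType() !== $type) {
            throw new RuntimeException("Expected $type, got " . $token->getType());
        }
        if ($type !== 'EOF') {
            $this->pos++;
        }
        return $token;
    }
}

class ManualParser
{
    protected $_tokenStream;
    public function __construct(TokenStream $stream)
    {
        $this->_tokenStream = $stream;
    }
    public function parse()
    {
        $tree = $this->_query();
        $this->_tokenStream->expect('EOF'); // reject trailing tokens
        return $tree;
    }
    // Query: AndQuery ("OR" AndQuery)*
    protected function _query()
    {
        $children = array($this->_andQuery());
        while ($this->_tokenStream->look()->getType() === 'OR') {
            $this->_tokenStream->expect('OR');
            $children[] = $this->_andQuery();
        }
        return array('OR', $children);
    }
    // AndQuery: Term ("AND" Term)*
    protected function _andQuery()
    {
        $children = array($this->_term());
        while ($this->_tokenStream->look()->getType() === 'AND') {
            $this->_tokenStream->expect('AND');
            $children[] = $this->_term();
        }
        return array('AND', $children);
    }
    // Term: "(" Query ")" | TermValue
    protected function _term()
    {
        if ($this->_tokenStream->look()->getType() === 'LeftParen') {
            $this->_tokenStream->expect('LeftParen');
            $query = $this->_query();
            $this->_tokenStream->expect('RightParen');
            return $query;
        }
        return $this->_tokenStream->expect('TermValue')->getValue();
    }
}

// Tokens for 'a AND b OR c', written out by hand instead of lexed.
$stream = new TokenStream(array(
    new Token('TermValue', 'a'), new Token('AND'), new Token('TermValue', 'b'),
    new Token('OR'), new Token('TermValue', 'c'),
));
$parser = new ManualParser($stream);
print_r($parser->parse());
```

Every decision in this parser is made by looking at exactly one upcoming token, which is what makes it predictive.
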
Predictive Parsing: Warning!

Tokens must be decidable with a fixed lookahead

<term> ::= <TermValue> "-" <TermValue>
     | <TermValue>
     | "(" <Query> ")"

No left recursion

<orQuery> ::= <orQuery> ("OR" <orQuery>)?
      | <term>
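
Both problems can usually be fixed by rewriting the grammar. Left-factoring gives <term> ::= <TermValue> ("-" <TermValue>)?, and replacing the left recursion with repetition gives <orQuery> ::= <term> ("OR" <term>)*. The sketch below follows those rewritten rules. The function names are made up, tokens are plain strings, and the parenthesised alternative of <term> is left out to keep it short.

```php
<?php
// Left-factored: <term> ::= <TermValue> ("-" <TermValue>)?
// The shared prefix is consumed first, then one token of lookahead decides.
function parseTerm(array &$tokens)
{
    $first = array_shift($tokens);
    if ($tokens && $tokens[0] === '-') {
        array_shift($tokens);                 // consume "-"
        $second = array_shift($tokens);
        return array('dash', $first, $second);
    }
    return array('value', $first);
}

// Left recursion removed: <orQuery> ::= <term> ("OR" <term>)*
// A method for the original rule would call itself before consuming any
// input and never terminate; the loop consumes a token on every iteration.
function parseOrQuery(array &$tokens)
{
    $terms = array(parseTerm($tokens));
    while ($tokens && $tokens[0] === 'OR') {
        array_shift($tokens);                 // consume "OR"
        $terms[] = parseTerm($tokens);
    }
    return $terms;
}

$tokens = array('a', '-', 'b', 'OR', 'c');
print_r(parseOrQuery($tokens));
```
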


But I wanna parse




PHP Parsers

   PHP_Depend
   1.0.0
   PHP 5.4

   PHP-Parser
   alpha
   PHP 5.4

   phc
   0.3.0.1 (unmaintained)
   PHP 5.2 (?)

PHPDepend Abstract Syntax Tree example

$string = "Manuel $Pichler <{$email}>";

PHP_Depend_Code_ASTString
|-- ASTLiteral    - "Manuel "
|-- ASTVariable    - $Pichler
|-- ASTLiteral    - " <"
|-- ASTCompoundExpression - {...}
| |-- ASTVariable  - $email
|-- ASTLiteral    - ">"
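
For comparison, the lexing step for the same string can be seen with PHP's built-in token_get_all() from the tokenizer extension. This is not how PHP_Depend is shown to work in these slides. It is only a small sketch of the token stream that a PHP parser would start from.

```php
<?php
// Single quotes keep PHP from interpolating the variables in the sample code.
$code = '<?php $string = "Manuel $Pichler <{$email}>";';

$variables = array();
foreach (token_get_all($code) as $token) {
    if (is_array($token)) {
        // Most tokens are arrays of (id, text, line number).
        list($id, $text) = $token;
        echo token_name($id), ' ', var_export($text, true), "\n";
        if ($id === T_VARIABLE) {
            $variables[] = $text;
        }
    } else {
        // Single-character tokens such as "=" or ";" are plain strings.
        echo $token, "\n";
    }
}
```

The tokenizer only produces a flat list; grouping the literal parts and variables of the string into a tree, as in the AST above, is the parser's job.
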




Resources




More resources

Examples of modern parsers in PHP:
   Twig (Predictive Parser)
   Behat's Gherkin (Predictive Parser)
   Smarty 3 (LALR parser)

More information:
Rich Programmer Food by Steve Yegge
Let’s Build a Compiler, by Jack Crenshaw
nathansuniversity.com
Coursera: Compilers by Stanford University
SE-Radio: Episode 182: DSLs
QUESTIONS?

Joind.in: https://joind.in/6257
Twitter: @relaxnow
E-mail: boy@ibuildings.nl
Slideshare: http://slidesha.re/INY43R
GitHub: https://github.com/relaxnow/QueryLang
