Out With Regex,
 In With Tokens
     Sean Coates
     php|tek 2009
Who is this Sean guy?

• Web Architect at OmniTI (http://omniti.com/)
• Former Editor-in-Chief of php|architect and former
  organizer of php|tek
• PHP Community, Habari, Phergie
• Other conferences (PHP Quebec earlier this year)
• the Twitter: @coates
• Beer Lover (and brewer)
• (I speak too quickly)
“A token is a
categorized block of
text. It can look like
anything; it just needs
to be a useful part of
the structured text.”
                 -Wikipedia
$a = 5 + 7;

    (10 tokens)

  Variable      $a
  Whitespace
  Assign        =
  Whitespace
  Number        5
  Whitespace
  Add           +
  Whitespace
  Number        7
  Terminator    ;
Grammar Matters

$a = 5 + 7; // $b

 Variable   $a
 Comment    // $b    (not a Variable)
PHP Example
T_OPEN_TAG     <?php

T_VARIABLE     $a
T_WHITESPACE
               =
T_WHITESPACE
T_LNUMBER      5
T_WHITESPACE
               +
T_WHITESPACE
T_LNUMBER      7
               ;
T_WHITESPACE
T_COMMENT      // $b
“Lexing”
• A lexer converts a sequence of characters
  into tokens
• “Lexical Analysis”
• Lex, Flex, re2c (lexer generators)
Static vs. Dynamic
          Analysis
• Dynamic: actual execution; practical
  implementations include penetration testing
• Static: analysis of code, tokens, opcodes,
  etc. to determine if a particular action will
  take place


• (not the only use for Tokens, though)
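A minimal sketch of static analysis with tokens, assuming PHP 7.1+ with the tokenizer extension loaded. The helper name `find_eval_lines` is made up for illustration. It reports where `eval` is actually used as code, without executing anything in the inspected source:

```php
<?php
// Hypothetical helper: report the line numbers on which the eval keyword
// appears as real code. Nothing in $source is executed.
function find_eval_lines(string $source): array
{
    $lines = [];
    foreach (token_get_all($source) as $token) {
        // Named tokens are arrays of [id, text, line]; T_EVAL is the eval keyword.
        if (is_array($token) && $token[0] === T_EVAL) {
            $lines[] = $token[2];
        }
    }
    return $lines;
}

$source = <<<'PHP'
<?php
$code = 'eval("inside a string");'; // eval() inside a comment
eval($code);
PHP;

print_r(find_eval_lines($source));
```

Only the third line of the sample is reported: the `eval` text inside the string literal and the comment belongs to other token types.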
Out with Regex
• Find all variables
• Regex:
  /(\$[a-z_][a-z0-9_]*)/i
• context matters:
  $str = '$a = 5 + 7; // $b';
Regex Fail
<?php
$str = '$a = 5 + 7; // $b';
preg_match_all(
     '/(\$[a-z_][a-z0-9_]*)/i', $str, $m
);
var_dump($m[0]);
Regex Fail
array(2) {
    [0]=> string(2) "$a"
    [1]=> string(2) "$b"
}
Out with Regex
• Find all variables      WRONG!
• Regex:
  /(\$[a-z_][a-z0-9_]*)/i
• context matters:
  $str = '$a = 5 + 7; // $b';
Remember?

$a = 5 + 7; // $b

 Variable
            Not a Variable!
Token Approach
<?php
// look ma, no regex!
$str = '<?php $a = 5 + 7; // $b';
foreach (token_get_all($str) as $t) {
    if (is_array($t) && $t[0] == T_VARIABLE) {
        echo $t[1] . "\n";
    }
}
// outputs: $a
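The same idea can be wrapped in a small function to show how far the grammar awareness goes. This is a sketch assuming PHP 7.1+; `variables_in` is a hypothetical name. A `$` inside a single-quoted string or a comment is ignored, while a variable interpolated into a double-quoted string is still reported, because the lexer treats it as a variable:

```php
<?php
// Hypothetical helper: collect the text of every T_VARIABLE token.
function variables_in(string $source): array
{
    $names = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && $token[0] === T_VARIABLE) {
            $names[] = $token[1];
        }
    }
    return $names;
}

$source = <<<'PHP'
<?php
$single = 'costs $five';   // $comment
$double = "hello $name";
PHP;

print_r(variables_in($source));
```

The result lists `$single`, `$double` and `$name`, but neither `$five` nor `$comment`.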
PHP Example (again)
T_OPEN_TAG     <?php

T_VARIABLE     $a
T_WHITESPACE
               =
T_WHITESPACE
T_LNUMBER      5
T_WHITESPACE
               +
T_WHITESPACE
T_LNUMBER      7
               ;
T_WHITESPACE
T_COMMENT      // $b
Regex can be complicated
            (email validation from MRE)
[Slide: the complete email-address validation regex from Mastering Regular Expressions, several thousand characters long, filling the whole slide]
Difficult validation
       made simpler
• Email validation is haaaard!
• Validate logical units separately:
   sean       @          php.net
   Localpart  Separator  Domain
• Still hard, but validation is restricted to
  different types of data
• BTW, don’t bother (-:
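A rough sketch of "validate logical units separately", assuming PHP 7.1+. The function name and the specific rules are illustrative choices, not the speaker's code, and they are deliberately far looser than the full address grammar:

```php
<?php
// Hypothetical helper: split an address into its logical units and check
// each one with a rule suited to that unit.
function looks_like_email(string $email): bool
{
    // Separator: split on the last "@", since a quoted local part may contain one.
    $at = strrpos($email, '@');
    if ($at === false) {
        return false;
    }
    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // Local part: only require it to be non-empty and at most 64 octets
    // (the limit in RFC 5321).
    if ($local === '' || strlen($local) > 64) {
        return false;
    }

    // Domain: dot-separated labels of letters, digits and inner hyphens.
    $label = '[a-z0-9](?:[a-z0-9-]*[a-z0-9])?';
    return preg_match('/^' . $label . '(?:\.' . $label . ')+$/i', $domain) === 1;
}

var_dump(looks_like_email('sean@php.net'));   // bool(true)
var_dump(looks_like_email('sean.php.net'));   // bool(false)
var_dump(looks_like_email('sean@php..net'));  // bool(false)
```

Each unit gets a short, readable check instead of one pattern that has to describe everything at once.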
Regex can be complicated
            (email validation from MRE)
[Slide: the same regex, with this overlaid on top:]

    strpos($email, '@') !== false
Dirty Little Secret
• Most tokenizers (lexers) use regular
  expressions to separate tokens
• re2c
• Multiple ways to represent separators,
  whitespace, etc., simplified with regex
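To make the "secret" concrete, here is a toy regex-driven lexer for the arithmetic line used earlier in the deck. It is a sketch assuming PHP 7.1+. The token names mirror the early slides and are not PHP's `T_*` constants, and this is not how re2c or PHP's own lexer is implemented:

```php
<?php
// Toy lexer: one named regex alternative per token type.
function toy_lex(string $input): array
{
    $pattern = '/(?<Variable>\$[a-z_][a-z0-9_]*)'
             . '|(?<Number>[0-9]+)'
             . '|(?<Whitespace>\s+)'
             . '|(?<Comment>\/\/[^\n]*)'
             . '|(?<Assign>=)'
             . '|(?<Add>\+)'
             . '|(?<Terminator>;)/Ai';

    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        // The A modifier anchors the match at $offset, so nothing is skipped.
        if (!preg_match($pattern, $input, $m, 0, $offset)) {
            throw new InvalidArgumentException("Unexpected input at offset $offset");
        }
        foreach ($m as $name => $text) {
            // Ignore numeric keys and the named groups that did not take part.
            if (is_string($name) && $text !== '') {
                $tokens[] = [$name, $text];
                break;
            }
        }
        $offset += strlen($m[0]);
    }
    return $tokens;
}

foreach (toy_lex('$a = 5 + 7; // $b') as [$name, $text]) {
    echo str_pad($name, 12), $text, "\n";
}
```

The regexes only decide where one token ends and the next begins. The categories they produce are what the rest of the program works with.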
Practical Uses
• Compile source code
• Simple, contextual replacement (e.g. BBCode)
• Friendly line breaks
• “Curly” quotes, special punctuation
• Input validation/stripping
• Refactoring
PHP’s Tokenizer
• Similar in other languages
• Available (and useful!) in userspace
• Built in to PHP (always available)
PHP Execution
• Lex
• Parse     Tokeny Goodness
• Compile
• Execute
• Cleanup
Tokenizer in Userspace
• token_get_all()
• token_name()
Tokenizer in Userspace
• token_get_all() returns an array of scalars
  and arrays
• A bit hard to work with
• Needs opening tag (<?php or <? depending
  on config)
Tokenizer in Userspace
          (Example)
  print_r(token_get_all('<?php $a = 5 + 7; // $b'));
Array
(
  [0]  => Array ( [0] => 367  [1] => <?php   [2] => 1 )
  [1]  => Array ( [0] => 309  [1] => $a      [2] => 1 )
  [2]  => Array ( [0] => 370  [1] =>         [2] => 1 )
  [3]  => =
  [4]  => Array ( [0] => 370  [1] =>         [2] => 1 )
  [5]  => Array ( [0] => 305  [1] => 5       [2] => 1 )
  [6]  => Array ( [0] => 370  [1] =>         [2] => 1 )
  [7]  => +
  [8]  => Array ( [0] => 370  [1] =>         [2] => 1 )
  [9]  => Array ( [0] => 305  [1] => 7       [2] => 1 )
  [10] => ;
  [11] => Array ( [0] => 370  [1] =>         [2] => 1 )
  [12] => Array ( [0] => 365  [1] => // $b   [2] => 1 )
)
Tokenizer in Userspace
       (Example)
[0] => Array
   (
     [0] => 367     Token Number
     [1] => <?php   Token Text
     [2] => 1       Line Number
   )

[1] => Array
    (
      [0] => 309    token_name(309) == 'T_VARIABLE'
      [1] => $a
      [2] => 1
    )
(...)
[3] => =            Scalar (not array)
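Because the result mixes arrays and bare strings, a thin normalizing layer makes it easier to work with. This is a sketch assuming PHP 7.1+. The helper name and the output keys are invented for illustration:

```php
<?php
// Hypothetical helper: give every token the same shape, whether
// token_get_all() returned it as an array or as a bare string.
function normalized_tokens(string $source): array
{
    $tokens = [];
    $line = 1;
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            [$id, $text, $line] = $token;
            $name = token_name($id);
        } else {
            // Scalars such as "=" or ";" carry no id or line number, so reuse
            // the line tracked from the preceding tokens.
            $id = null;
            $text = $token;
            $name = $token;
        }
        $tokens[] = ['id' => $id, 'name' => $name, 'text' => $text, 'line' => $line];
        // Advance past any newlines contained in this token.
        $line += substr_count($text, "\n");
    }
    return $tokens;
}

foreach (normalized_tokens('<?php $a = 5 + 7; // $b') as $t) {
    printf("%-14s %s\n", $t['name'], trim($t['text']));
}
```

Comparing against the `T_*` constants or `token_name()` is safer than hard-coding numbers like 309, since the numeric ids are not stable across PHP versions. On PHP 8 and later, `PhpToken::tokenize()` returns uniform objects and makes a helper like this unnecessary.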
Practical Example:
      Simple Highlighter
<pre>
<?php
// token id => colour
$c = array(
    T_VARIABLE => 'red',
    T_LNUMBER => 'blue',
);
foreach (token_get_all(fread(STDIN, 9999999)) as $t) {
    if (!is_array($t)) {
        // scalar token such as "=" or ";"
        echo htmlentities($t);
    } elseif (!isset($c[$t[0]])) {
        // token type with no colour assigned
        echo htmlentities($t[1]);
    } else {
        echo '<span style="color: ' . $c[$t[0]] . '">'
        . htmlentities($t[1]) . '</span>';
    }
}
?>
</pre>
Highlighter Output
<?php
$a = 5 + 7; // $b
<pre>
&lt;?php
<span style="color: red">$a</span> =
<span style="color: blue">5</span> +
<span style="color: blue">7</span>; // $b
</pre>
Entities
•   Hi... I'm Sean

•   Hi&#8230; I&#8217;m Sean

•   Hi… I’m Sean
Entities
•   Here's some code <code>$foo = 'bar';</code>

•   Here&#8217;s some code <code>$foo = 'bar';</code>

•   Here’s some code <code>$foo = 'bar';</code>
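One reading of these slides is that punctuation should be converted in prose but left alone inside `<code>`. A minimal sketch of that, assuming PHP 7.1+ and assuming that every straight apostrophe in prose should become a right single quote (a simplification), with `smarten` as a hypothetical name:

```php
<?php
// Hypothetical helper: convert apostrophes and ellipses to entities in
// prose, but leave anything inside <code>...</code> untouched.
function smarten(string $html): string
{
    // Split into alternating prose and <code> segments. The capture group
    // keeps the <code> blocks in the result, at the odd indexes.
    $parts = preg_split('~(<code>.*?</code>)~s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = str_replace(['...', "'"], ['&#8230;', '&#8217;'], $part);
        }
    }
    return implode('', $parts);
}

echo smarten("Here's some code <code>\$foo = 'bar';</code>"), "\n";
```

Splitting the input into "code" and "not code" segments is a very small tokenizer. The replacement itself stays trivial because it only ever sees the segments it is allowed to change.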
Tokalizer
• PHP token analysis wrapper
• Object-oriented
• Normalized
• Includes a partial parser (in PHP, so it’s
  slow). Doesn’t work with new 5.3
  constructs... yet.
• http://github.com/scoates/tokalizer
Context-aware tools
• phpgrep
regular grep:
     file.php:123: matched line
php grep:
     file.php:123(foo::bar()): matched line
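The context in that output can be recovered from the token stream. The following is a rough sketch assuming PHP 7.1+, not the actual phpgrep or Tokalizer code. Both function names are hypothetical, and it tracks only named functions (no class prefix) by counting braces:

```php
<?php
// Hypothetical sketch: which named function encloses each line of $source?
function function_by_line(string $source): array
{
    $map = [];
    $stack = [];         // entries are [function name, brace depth of its body]
    $depth = 0;
    $pending = null;     // name seen after "function", body not opened yet
    $expectName = false;
    $line = 1;

    foreach (token_get_all($source) as $token) {
        [$id, $text] = is_array($token) ? $token : [null, $token];
        $char = $id === null ? $text : null;   // set only for scalar tokens
        if (is_array($token)) {
            $line = $token[2];
        }
        // Record the enclosing function for every line this token touches.
        $last = $line + substr_count($text, "\n");
        if ($stack) {
            for ($l = $line; $l <= $last; $l++) {
                $map[$l] = end($stack)[0];
            }
        }
        if ($id === T_FUNCTION) {
            $expectName = true;
        } elseif ($expectName && $id === T_STRING) {
            [$pending, $expectName] = [$text, false];
        } elseif ($expectName && $char === '(') {
            $expectName = false;               // closure: no name to record
        } elseif ($char === '{' || $id === T_CURLY_OPEN || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
            $depth++;
            if ($pending !== null) {
                $stack[] = [$pending, $depth];
                $pending = null;
            }
        } elseif ($char === '}') {
            if ($stack && end($stack)[1] === $depth) {
                array_pop($stack);
            }
            $depth--;
        } elseif ($char === ';') {
            [$pending, $expectName] = [null, false];   // declaration without a body
        }
        $line = $last;
    }
    return $map;
}

// Hypothetical grep: prefix each matching line with its number and context.
function php_grep(string $source, string $needle): array
{
    $map = function_by_line($source);
    $hits = [];
    foreach (explode("\n", $source) as $i => $text) {
        if (strpos($text, $needle) !== false) {
            $n = $i + 1;
            $context = isset($map[$n]) ? '(' . $map[$n] . '())' : '';
            $hits[] = $n . $context . ': ' . trim($text);
        }
    }
    return $hits;
}

$source = <<<'PHP'
<?php
function bar()
{
    $needle = 1;
}
$needle = 2;
PHP;

print_r(php_grep($source, '$needle'));
```

Adding the class name would mean tracking `T_CLASS` the same way. The sketch leaves that out to stay short.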
Context-aware tools
• diff-php
regular diff:
@@ -68,6 +68,7 @@
php diff:
@@ -68,6 +68,7 @@ GeshiHighlighterFormatPlugin::do_highlight()
Token dumps
• text token dump
• definition dump (*cough* currently broken)
• html dump
Habari’s HTML
          Tokenizer
• Filter user input (can strip tags intelligently)
• Allow plugins to inject/replace whole
  blocks of HTML without (developer-facing)
  regex
• Facilitate autop, introspection
HTMLPurifier
• Intelligently filters/escapes potentially
  dangerous data
• Token-based approach
• Really difficult
• Code is slow and memory-intensive, but the
  problem it solves is extremely complicated
Questions? Contact...

• http://seancoates.com/
• sean@php.net
• http://omniti.com/is/sean-coates
• IRC: scoates (Freenode and EFNet)
• @coates on Twitter (if it happens to be up)

Get Soaked - An In Depth Look At PHP StreamsDavey Shafik
 
Erlang with Regexp Perl And Port
Erlang with Regexp Perl And PortErlang with Regexp Perl And Port
Erlang with Regexp Perl And Port
Keiichi Daiba
 
[Erlang LT] Regexp Perl And Port
[Erlang LT] Regexp Perl And Port[Erlang LT] Regexp Perl And Port
[Erlang LT] Regexp Perl And Port
Keiichi Daiba
 
Exploiting Php With Php
Exploiting Php With PhpExploiting Php With Php
Exploiting Php With Php
Jeremy Coates
 
Ruby 程式語言簡介
Ruby 程式語言簡介Ruby 程式語言簡介
Ruby 程式語言簡介Wen-Tien Chang
 
R57php 1231677414471772-2
R57php 1231677414471772-2R57php 1231677414471772-2
R57php 1231677414471772-2
ady36
 
Scala + WattzOn, sitting in a tree....
Scala + WattzOn, sitting in a tree....Scala + WattzOn, sitting in a tree....
Scala + WattzOn, sitting in a tree....
Raffi Krikorian
 
Cより速いRubyプログラム
Cより速いRubyプログラムCより速いRubyプログラム
Cより速いRubyプログラム
kwatch
 
Perl Presentation
Perl PresentationPerl Presentation
Perl Presentation
Sopan Shewale
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009
Aslak Hellesøy
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009
Aslak Hellesøy
 

Similar to Out with Regex, In with Tokens (20)

My First Rails Plugin - Usertext
My First Rails Plugin - UsertextMy First Rails Plugin - Usertext
My First Rails Plugin - Usertext
 
And now you have two problems. Ruby regular expressions for fun and profit by...
And now you have two problems. Ruby regular expressions for fun and profit by...And now you have two problems. Ruby regular expressions for fun and profit by...
And now you have two problems. Ruby regular expressions for fun and profit by...
 
Unsung Heroes of PHP
Unsung Heroes of PHPUnsung Heroes of PHP
Unsung Heroes of PHP
 
Impacta - Show Day de Rails
Impacta - Show Day de RailsImpacta - Show Day de Rails
Impacta - Show Day de Rails
 
Rack Middleware
Rack MiddlewareRack Middleware
Rack Middleware
 
LAMP_TRAINING_SESSION_6
LAMP_TRAINING_SESSION_6LAMP_TRAINING_SESSION_6
LAMP_TRAINING_SESSION_6
 
recycle
recyclerecycle
recycle
 
JSARToolKit / LiveChromaKey / LivePointers - Next gen of AR
JSARToolKit / LiveChromaKey / LivePointers - Next gen of ARJSARToolKit / LiveChromaKey / LivePointers - Next gen of AR
JSARToolKit / LiveChromaKey / LivePointers - Next gen of AR
 
Get Soaked - An In Depth Look At PHP Streams
Get Soaked - An In Depth Look At PHP StreamsGet Soaked - An In Depth Look At PHP Streams
Get Soaked - An In Depth Look At PHP Streams
 
Erlang with Regexp Perl And Port
Erlang with Regexp Perl And PortErlang with Regexp Perl And Port
Erlang with Regexp Perl And Port
 
[Erlang LT] Regexp Perl And Port
[Erlang LT] Regexp Perl And Port[Erlang LT] Regexp Perl And Port
[Erlang LT] Regexp Perl And Port
 
Exploiting Php With Php
Exploiting Php With PhpExploiting Php With Php
Exploiting Php With Php
 
Ruby 程式語言簡介
Ruby 程式語言簡介Ruby 程式語言簡介
Ruby 程式語言簡介
 
Php 2
Php 2Php 2
Php 2
 
R57php 1231677414471772-2
R57php 1231677414471772-2R57php 1231677414471772-2
R57php 1231677414471772-2
 
Scala + WattzOn, sitting in a tree....
Scala + WattzOn, sitting in a tree....Scala + WattzOn, sitting in a tree....
Scala + WattzOn, sitting in a tree....
 
Cより速いRubyプログラム
Cより速いRubyプログラムCより速いRubyプログラム
Cより速いRubyプログラム
 
Perl Presentation
Perl PresentationPerl Presentation
Perl Presentation
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009
 
Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009Ruby presentasjon på NTNU 22 april 2009
Ruby presentasjon på NTNU 22 april 2009
 

Recently uploaded

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 

Recently uploaded (20)

AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 

Out with Regex, In with Tokens

  • 1. Out With Regex, In With Tokens Sean Coates php|tek 2009
  • 2. Who is this Sean guy? • Web Architect at OmniTI (http://omniti.com/) • Former Editor-in-Chief of php|architect and former organizer of php|tek • PHP Community, Habari, Phergie • Other conferences (PHP Quebec earlier this year) • the Twitter: @coates • Beer Lover (and brewer) • (I speak too quickly)
  • 3. “A token is a categorized block of text. It can look like anything; it just needs to be a useful part of the structured text.” -Wikipedia
  • 4. $a = 5 + 7 ;
  • 5. $a = 5 + 7 ; (10 tokens)
  • 6. $a = 5 + 7 ; (labeled: Whitespace)
  • 7. $a = 5 + 7 ; (labeled: Variable $a, Whitespace)
  • 8. $a = 5 + 7 ; (labeled: Variable $a, Whitespace, Assign =)
  • 9. $a = 5 + 7 ; (labeled: Variable $a, Whitespace, Assign =, Number 5)
  • 10. $a = 5 + 7 ; (labeled: Variable $a, Whitespace, Assign =, Number 5, Add +)
  • 11. $a = 5 + 7 ; (labeled: Variable $a, Whitespace, Assign =, Number 5, Add +, Number 7)
  • 12. $a = 5 + 7 ; (labeled: Variable $a, Whitespace, Assign =, Number 5, Add +, Number 7, Terminator ;)
  • 13. Grammar Matters $a = 5 + 7; // $b
  • 14. Grammar Matters $a = 5 + 7; // $b ($a: Variable; $b: Not a Variable)
  • 15. Grammar Matters $a = 5 + 7; // $b ($a: Variable; // $b: Comment)
  • 16. PHP Example <?php $a = 5 + 7 ; // $b
  • 17. PHP Example T_OPEN_TAG (<?php), T_VARIABLE ($a), T_WHITESPACE, =, T_WHITESPACE, T_LNUMBER (5), T_WHITESPACE, +, T_WHITESPACE, T_LNUMBER (7), ;, T_WHITESPACE, T_COMMENT (// $b)
  • 18. “Lexing” • a Lexer converts a sequence of characters into tokens • “Lexical Analysis” • Lex, Flex, re2c (lexer generators)
  • 19. Static vs. Dynamic Analysis • Dynamic: actual execution, practical implementations such as pen. testing. • Static: analysis of code, tokens, opcodes, etc. to determine if a particular action will take place • (not the only use for Tokens, though)
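As a rough illustration of static analysis with tokens, the sketch below reports where a script calls `eval()` without ever running it. The helper name `find_eval_lines` and the sample source are assumptions made up for this example; only `token_get_all()` and the `T_EVAL` constant come from PHP's tokenizer.

```php
<?php
// Hypothetical helper (not from the talk): return the line numbers of eval() calls.
function find_eval_lines(string $source): array
{
    $lines = [];
    foreach (token_get_all($source) as $token) {
        // Single-character tokens such as ";" are plain strings; skip them.
        if (is_array($token) && $token[0] === T_EVAL) {
            $lines[] = $token[2]; // index 2 holds the token's line number
        }
    }
    return $lines;
}

// Sample input: only the last line is a real eval() call.
$code = <<<'CODE'
<?php
$msg = 'eval($x) inside a string';
// eval($x) inside a comment
eval($x);
CODE;

print_r(find_eval_lines($code));
```

The string and the comment both contain the text `eval(`, but the tokenizer classifies them as a string literal and a comment, so only the call on the last line is reported.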
  • 20. Out with Regex • Find all variables • Regex: /(\$[a-z_][a-z0-9_]*)/i
  • 21. Out with Regex • Find all variables • Regex: /(\$[a-z_][a-z0-9_]*)/i • context matters: $str = '$a = 5 + 7; // $b';
  • 22. Regex Fail <?php $str = '$a = 5 + 7; // $b'; preg_match_all( '/(\$[a-z_][a-z0-9_]*)/i', $str, $m ); var_dump($m[0]);
  • 23. Regex Fail array(2) { [0]=> string(2) "$a" [1]=> string(2) "$b" }
  • 24. Out with Regex • Find all variables (WRONG!) • Regex: /(\$[a-z_][a-z0-9_]*)/i • context matters: $str = '$a = 5 + 7; // $b';
  • 25. Remember? $a = 5 + 7; // $b Variable Not a Variable!
  • 26. Token Approach <?php // look ma, no regex! $str = '<?php $a = 5 + 7; // $b'; foreach (token_get_all($str) as $t) { if (is_array($t) && $t[0] == T_VARIABLE) { echo $t[1] . "\n"; } } // outputs: $a
  • 27. PHP Example (again) T_OPEN_TAG (<?php), T_VARIABLE ($a), T_WHITESPACE, =, T_WHITESPACE, T_LNUMBER (5), T_WHITESPACE, +, T_WHITESPACE, T_LNUMBER (7), ;, T_WHITESPACE, T_COMMENT (// $b)
  • 28. Regex can be complicated (email validation from MRE) [slide filled edge to edge with a single, very long email-matching regular expression; the extracted text lost its backslashes and quotes and is not reproduced here]
  • 29. Difficult validation made simpler • Email validation is haaaard! • Validate logical units separately: sean@php.net
  • 30. Difficult validation made simpler • Email validation is haaaard! • Validate logical units separately: sean@php.net (Localpart: sean, Separator: @, Domain: php.net)
  • 31. Difficult validation made simpler • Email validation is haaaard! • Validate logical units separately: sean@php.net • Still hard, but validation is restricted to different types of data • BTW, don’t bother (-:
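One way to picture "validate logical units separately" is to split the address at its last `@` and check each side with its own small rule. This is only a sketch: the function name `check_email_parts` and both rules are assumptions chosen for illustration, far looser than the real address grammar, and not something shown in the talk.

```php
<?php
// Hypothetical helper: split an address into local part and domain, then check each alone.
function check_email_parts(string $email): bool
{
    $at = strrpos($email, '@'); // position of the last separator
    if ($at === false) {
        return false;
    }
    $local  = substr($email, 0, $at);
    $domain = (string) substr($email, $at + 1);

    // Illustrative rules only (assumptions, not the full address grammar):
    // a non-empty local part of at most 64 bytes, and a dotted host name.
    $localOk  = $local !== '' && strlen($local) <= 64;
    $domainOk = preg_match('/^[a-z0-9-]+(\.[a-z0-9-]+)+$/i', $domain) === 1;

    return $localOk && $domainOk;
}

var_dump(check_email_parts('sean@php.net'));
var_dump(check_email_parts('sean.php.net'));
```

Each rule now only has to understand one kind of data, which is the point the slide makes; getting either rule fully right is still hard.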
  • 32. Regex can be complicated (email validation from MRE) [the same full-slide regular expression as slide 28, overlaid with:] strpos($email, '@') !== false
  • 33. Dirty Little Secret • Most tokenizers (lexers) use regular expressions to separate tokens • re2c (a lexer generator built on regex rules) • The many ways to represent separators, whitespace, etc. are simpler to express with regex
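To illustrate the point that lexers lean on regular expressions, here is a minimal sketch of a regex-driven lexer for the `$a = 5 + 7;` example. It is not how PHP's own lexer is implemented; the function name `lex_simple()` and the token labels are hypothetical, chosen to match the labels used on the earlier slides.

```php
<?php
// A toy lexer: each token type is a regex anchored at the current offset (\G).
// The rule table and names are assumptions for illustration only.
function lex_simple(string $src): array
{
    $rules = [
        'WHITESPACE' => '/\G\s+/',
        'VARIABLE'   => '/\G\$[A-Za-z_][A-Za-z0-9_]*/',
        'NUMBER'     => '/\G\d+/',
        'ASSIGN'     => '/\G=/',
        'ADD'        => '/\G\+/',
        'TERMINATOR' => '/\G;/',
    ];
    $tokens = [];
    $offset = 0;
    $length = strlen($src);
    while ($offset < $length) {
        foreach ($rules as $type => $regex) {
            // The fifth argument starts matching at $offset; \G pins it there.
            if (preg_match($regex, $src, $m, 0, $offset)) {
                $tokens[] = [$type, $m[0]];
                $offset += strlen($m[0]);
                continue 2; // try all rules again at the new offset
            }
        }
        throw new RuntimeException("Unexpected character at offset $offset");
    }
    return $tokens;
}

// Print one token per line: type, then the quoted text.
foreach (lex_simple('$a = 5 + 7;') as [$type, $text]) {
    printf("%-10s %s\n", $type, json_encode($text));
}
```

Each rule stays tiny because it only has to recognize one kind of token; the loop supplies the structure that a single giant expression would otherwise have to encode.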
  • 34. Practical Uses • Compile source code • Simple, contextual replacement (e.g. BBCode) • Friendly line breaks • “Curly” quotes, special punctuation • Input validation/stripping • Refactoring
  • 35. PHP’s Tokenizer • Similar in other languages • Available (and useful!) in userspace • Built in to PHP (always available)
  • 36. PHP Execution • Lex • Parse • Compile • Execute • Cleanup
  • 37. PHP Execution • Lex • Parse [“Tokeny Goodness”] • Compile • Execute • Cleanup
  • 38. Tokenizer in Userspace • token_get_all() • token_name()
  • 39. Tokenizer in Userspace • token_get_all() returns an array of scalars and arrays • A bit hard to work with • Needs opening tag (<?php or <? depending on config)
  • 40. Tokenizer in Userspace (Example)
      print_r(token_get_all('<?php $a = 5 + 7; // $b'));
      Array
      (
          [0]  => Array ( [0] => 367 [1] => <?php [2] => 1 )
          [1]  => Array ( [0] => 309 [1] => $a [2] => 1 )
          [2]  => Array ( [0] => 370 [1] => [2] => 1 )
          [3]  => =
          [4]  => Array ( [0] => 370 [1] => [2] => 1 )
          [5]  => Array ( [0] => 305 [1] => 5 [2] => 1 )
          [6]  => Array ( [0] => 370 [1] => [2] => 1 )
          [7]  => +
          [8]  => Array ( [0] => 370 [1] => [2] => 1 )
          [9]  => Array ( [0] => 305 [1] => 7 [2] => 1 )
          [10] => ;
          [11] => Array ( [0] => 370 [1] => [2] => 1 )
          [12] => Array ( [0] => 365 [1] => // $b [2] => 1 )
      )
  • 41. Tokenizer in Userspace (Example)
      [0] => Array
          (
              [0] => 367      (token number)
              [1] => <?php    (token text)
              [2] => 1        (line number)
          )
      [1] => Array
          (
              [0] => 309      (token_name(309) == 'T_VARIABLE')
              [1] => $a
              [2] => 1
          )
      (...)
      [3] => =                (scalar, not array)
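Since `token_get_all()` mixes arrays and bare strings, a small wrapper can make the output uniform. The sketch below is one possible approach, not part of PHP or of the speaker's code: `normalized_tokens()` and the `T_CHAR` label are hypothetical names. It reports token names via `token_name()` instead of numbers, because the numeric IDs shown above (367, 309, ...) are specific to the PHP build used for the slides and differ between PHP versions.

```php
<?php
// Normalize token_get_all() output so every token has a name, text and line.
// normalized_tokens() and the 'T_CHAR' label are assumptions for illustration.
function normalized_tokens(string $source): array
{
    $out = [];
    $line = 1;
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array tokens carry [id, text, starting line].
            [$id, $text, $line] = $token;
            $name = token_name($id);
        } else {
            // Scalar tokens are single characters such as = + ; with no line info,
            // so reuse the line tracked from the preceding tokens.
            $text = $token;
            $name = 'T_CHAR';
        }
        $out[] = ['name' => $name, 'text' => $text, 'line' => $line];
        // Advance past any newlines inside this token for the next scalar token.
        $line += substr_count($text, "\n");
    }
    return $out;
}

foreach (normalized_tokens('<?php $a = 5 + 7; // $b') as $t) {
    printf("%-14s line %d  %s\n", $t['name'], $t['line'], json_encode($t['text']));
}
```

With every token in the same shape, later code can switch on `name` without first checking `is_array()`.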
  • 42. Practical Example: Simple Highlighter
      <pre>
      <?php
      // Map token types to colors; anything not listed is printed uncolored.
      $c = array(
          T_VARIABLE => 'red',
          T_LNUMBER  => 'blue',
      );
      foreach (token_get_all(fread(STDIN, 9999999)) as $t) {
          if (!is_array($t)) {
              // Scalar token (a single character such as = or ;)
              echo htmlentities($t);
          } elseif (!isset($c[$t[0]])) {
              // Array token with no color assigned
              echo htmlentities($t[1]);
          } else {
              echo '<span style="color: ' . $c[$t[0]] . '">' . htmlentities($t[1]) . '</span>';
          }
      }
      ?>
      </pre>
  • 43. Highlighter Output
      <?php $a = 5 + 7; // $b
      <pre> &lt;?php <span style="color: red">$a</span> = <span style="color: blue">5</span> + <span style="color: blue">7</span>; // $b </pre>
  • 44. Entities • Hi... I'm Sean
  • 45. Entities • Hi... I'm Sean • Hi&#8230; I&#8217;m Sean • Hi… I’m Sean
  • 46. Entities • Here's some code <code>$foo = 'bar';</code> • Here&#8217;s some code
  • 47. Entities • Here's some code <code>$foo = 'bar';</code> • Here&#8217;s some code <code>$foo = 'bar';</code> • Here’s some code <code>$foo = 'bar';</code>
  • 48. Entities • Here's some code <code>$foo = 'bar';</code> • Here&#8217;s some code <code>$foo = 'bar';</code> • Here’s some code <code>$foo = 'bar';</code>
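The Entities slides show typographic replacement applied to prose while the contents of `<code>` stay untouched. A rough sketch of that idea follows, assuming a very simple notion of an HTML token (anything between `<` and `>` is a tag). The function name `smart_entities()` is hypothetical, and treating every apostrophe as a right single quote is a deliberate simplification.

```php
<?php
// Replace ... and ' with entities in text, but never inside <code> elements.
// smart_entities() is a hypothetical helper; the tag regex is intentionally naive.
function smart_entities(string $html): string
{
    // With one capture group and DELIM_CAPTURE, the result alternates:
    // even indexes are text, odd indexes are the captured tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $codeDepth = 0;
    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Tag token: track whether we are inside <code> ... </code>.
            if (preg_match('/^<code\b/i', $part)) {
                $codeDepth++;
            } elseif (preg_match('/^<\/code\s*>/i', $part)) {
                $codeDepth = max(0, $codeDepth - 1);
            }
            $out .= $part;
        } elseif ($codeDepth > 0) {
            // Text token inside <code>: pass through unchanged.
            $out .= $part;
        } else {
            // Text token in prose: apply the typographic replacements.
            $out .= str_replace(['...', "'"], ['&#8230;', '&#8217;'], $part);
        }
    }
    return $out;
}

echo smart_entities("Hi... I'm Sean"), "\n";
echo smart_entities('Here\'s some code <code>$foo = \'bar\';</code>'), "\n";
```

A plain `str_replace()` over the whole string could not tell the apostrophe in "Here's" from the quotes around `'bar'`; splitting into tokens first is what provides that context.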
  • 49. Tokalizer • PHP token analysis wrapper • Object-oriented • Normalized • Includes a partial parser (in PHP, so it’s slow). Doesn’t work with new 5.3 constructs... yet. • http://github.com/scoates/tokalizer
  • 50. Context-aware tools • phpgrep
      regular grep: file.php:123: matched line
      php grep:     file.php:123(foo::bar()): matched line
  • 51. Context-aware tools • diff-php
      regular diff: @@ -68,6 +68,7 @@
      php diff:     @@ -68,6 +68,7 @@ GeshiHighlighterFormatPlugin::do_highlight()
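How a tool might attach `class::method()` context to a matched line can be sketched with `token_get_all()` and a brace counter. This is not Tokalizer's implementation; `line_contexts()` and `php_grep()` are hypothetical names, and the sketch ignores cases such as by-reference functions (`function &foo()`) and alternative syntax.

```php
<?php
// For each source line, find the enclosing class/function by counting braces.
// Both function names here are assumptions for illustration.
function line_contexts(string $source): array
{
    $contexts = [];   // line number => list of enclosing scope names
    $scopes = [];     // stack of [name, brace depth at which the scope opened]
    $depth = 0;
    $expect = null;   // T_CLASS or T_FUNCTION when the next T_STRING is a name
    $pending = null;  // declared name waiting for its opening brace
    $line = 1;
    foreach (token_get_all($source) as $token) {
        [$id, $text] = is_array($token) ? $token : [null, $token];
        if (is_array($token)) {
            $line = $token[2];
        }
        if (!in_array($id, [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true)) {
            if ($expect !== null) {
                if ($id === T_STRING) {
                    $pending = $expect === T_CLASS ? $text : $text . '()';
                }
                $expect = null;
            }
            if ($id === T_CLASS || $id === T_FUNCTION) {
                $expect = $id;
            } elseif (($id === null && $text === '{')
                || $id === T_CURLY_OPEN || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
                // Braces opened inside strings ("{$x}") also close with a scalar }.
                $depth++;
                if ($pending !== null) {
                    $scopes[] = [$pending, $depth];
                    $pending = null;
                }
            } elseif ($id === null && $text === '}') {
                if ($scopes && end($scopes)[1] === $depth) {
                    array_pop($scopes);
                }
                $depth--;
            } elseif ($id === null && $text === ';') {
                $pending = null; // abstract method or "use function": no body
            }
        }
        // Record the deepest scope seen on every line this token touches.
        $names = array_column($scopes, 0);
        $last = $line + substr_count($text, "\n");
        for ($l = $line; $l <= $last; $l++) {
            if (!isset($contexts[$l]) || count($names) > count($contexts[$l])) {
                $contexts[$l] = $names;
            }
        }
        $line = $last;
    }
    return $contexts;
}

function php_grep(string $pattern, string $source, string $file = 'file.php'): array
{
    $contexts = line_contexts($source);
    $hits = [];
    foreach (explode("\n", $source) as $i => $text) {
        if (preg_match($pattern, $text)) {
            $n = $i + 1;
            $scope = implode('::', $contexts[$n] ?? []);
            $text = trim($text);
            $hits[] = $scope === '' ? "$file:$n: $text" : "$file:$n($scope): $text";
        }
    }
    return $hits;
}
```

Calling `php_grep('/\$needle/', file_get_contents('file.php'))` would then return lines in the `file.php:123(foo::bar()): matched line` shape shown on the slide, with the parenthesized part omitted for matches outside any class or function.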
  • 52. Token dumps • text token dump • definition dump (*cough* currently broken) • html dump
  • 53. Habari’s HTML Tokenizer • Filter user input (can strip tags intelligently) • Allow plugins to inject/replace whole blocks of HTML without (developer-facing) regex • Facilitate autop, introspection
  • 54. HTMLPurifier • Intelligently filters/escapes potentially dangerous data • Token-based approach • Really difficult • Code is slow and memory-intensive, but the problem it solves is extremely complicated
  • 56. Questions? Contact... • http://seancoates.com/ • sean@php.net • http://omniti.com/is/sean-coates • IRC: scoates (Freenode and EFNet) • @coates on Twitter (if it happens to be up)