Out With Regex,
 In With Tokens
     Sean Coates
     php|tek 2009
Who is this Sean guy?

• Web Architect at OmniTI (http://omniti.com/)
• Former Editor-in-Chief of php|architect and former
  organizer of php|tek
• PHP Community, Habari, Phergie
• Other conferences (PHP Quebec earlier this year)
• the Twitter: @coates
• Beer Lover (and brewer)
• (I speak too quickly)
“A token is a
categorized block of
text. It can look like
anything; it just needs
to be a useful part of
the structured text.”
                 -Wikipedia
$a = 5 + 7;

    (10 tokens)

  $a     Variable
  " "    Whitespace
  =      Assign
  5      Number
  +      Add
  7      Number
  ;      Terminator
Grammar Matters

$a = 5 + 7; // $b

  $a       Variable
  // $b    Comment (the $b inside it is not a variable)
PHP Example
T_OPEN_TAG     <?php

T_VARIABLE     $a
T_WHITESPACE
               =
T_WHITESPACE
T_LNUMBER      5
T_WHITESPACE
               +
T_WHITESPACE
T_LNUMBER      7
               ;
T_WHITESPACE
T_COMMENT      // $b
“Lexing”
• A lexer converts a sequence of characters
  into tokens
• “Lexical Analysis”
• Lex, Flex, re2c (lexer generators)
Static vs. Dynamic
          Analysis
• Dynamic: actual execution; practical
  implementations include penetration testing
• Static: analysis of code, tokens, opcodes,
  etc. to determine whether a particular action
  will take place


• (not the only use for Tokens, though)
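As a minimal sketch of the static side, the snippet below looks for `eval` calls by inspecting tokens only, so the code under inspection is never executed. The helper name `find_eval_lines()` and the sample source are hypothetical, made up for illustration.

```php
<?php
// Hypothetical helper: return the line numbers on which eval() is called.
function find_eval_lines(string $source): array
{
    $lines = [];
    foreach (token_get_all($source) as $token) {
        // Array tokens are [id, text, line]; single characters are plain strings.
        if (is_array($token) && $token[0] === T_EVAL) {
            $lines[] = $token[2];
        }
    }
    return $lines;
}

// Sample input: only the last line contains a real eval call.
$source = <<<'CODE'
<?php
$msg = 'eval($x) in a string is harmless';
// eval($x) in a comment is harmless too
eval($x);
CODE;

print_r(find_eval_lines($source));
```

The string and the comment both contain the text `eval(`, but neither is a `T_EVAL` token, so only the real call is reported.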
Out with Regex
• Find all variables
• Regex:
  /(\$[a-z_][a-z0-9_]*)/i
• context matters:
  $str = '$a = 5 + 7; // $b';
Regex Fail
<?php
$str = '$a = 5 + 7; // $b';
preg_match_all(
     '/(\$[a-z_][a-z0-9_]*)/i', $str, $m
);
var_dump($m[0]);
Regex Fail
array(2) {
    [0]=> string(2) "$a"
    [1]=> string(2) "$b"
}
Out with Regex
• Find all variables      WRONG!
• Regex:
  /(\$[a-z_][a-z0-9_]*)/i
• context matters:
  $str = '$a = 5 + 7; // $b';
Remember?

$a = 5 + 7; // $b

 Variable
            Not a Variable!
Token Approach
<?php
// look ma, no regex!
$str = '<?php $a = 5 + 7; // $b';
foreach (token_get_all($str) as $t) {
    if (is_array($t) && $t[0] == T_VARIABLE) {
        echo $t[1] . "\n";
    }
}
// outputs: $a
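On PHP 8.0 and later, the same loop can be written against the object-based `PhpToken` class, which removes the need for the `is_array()` check. This is a sketch that assumes PHP 8.0+, which is newer than anything shown in the slides.

```php
<?php
// Assumes PHP 8.0+: PhpToken::tokenize() returns an object for every token,
// including single-character ones.
$str = '<?php $a = 5 + 7; // $b';
$variables = [];
foreach (PhpToken::tokenize($str) as $token) {
    if ($token->is(T_VARIABLE)) {
        $variables[] = $token->text;
    }
}
echo implode("\n", $variables), "\n";
```

As before, the `$b` inside the comment is part of a `T_COMMENT` token and is not reported.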
PHP Example (again)
T_OPEN_TAG     <?php

T_VARIABLE     $a
T_WHITESPACE
               =
T_WHITESPACE
T_LNUMBER      5
T_WHITESPACE
               +
T_WHITESPACE
T_LNUMBER      7
               ;
T_WHITESPACE
T_COMMENT      // $b
Regex can be complicated
            (email validation from MRE)
[Slide: a full page of dense regular expression, the email-address validator from Mastering Regular Expressions]
Difficult validation
       made simpler
• Email validation is haaaard!
• Validate logical units separately:
    sean         @         php.net

  Localpart   Separator    Domain
• Still hard, but validation is restricted to
  different types of data
• BTW, don’t bother (-:
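Validating the units separately could be sketched as follows. The function names and the rules for each unit are hypothetical and deliberately loose; this is an illustration of the splitting idea, not an RFC-compliant validator.

```php
<?php
// Hypothetical helper: split an address into its logical units.
function split_email(string $email): ?array
{
    $at = strrpos($email, '@'); // the last "@" is treated as the separator
    if ($at === false) {
        return null;
    }
    return [
        'localpart' => substr($email, 0, $at),
        'domain'    => substr($email, $at + 1),
    ];
}

// Hypothetical check: each unit gets its own small, readable rule.
function looks_like_email(string $email): bool
{
    $parts = split_email($email);
    if ($parts === null) {
        return false;
    }
    // Localpart: only require that it is not empty.
    $localOk = $parts['localpart'] !== '';
    // Domain: two or more dot-separated labels of letters, digits and hyphens.
    $domainOk = preg_match('/^[a-z0-9-]+(\.[a-z0-9-]+)+$/i', $parts['domain']) === 1;
    return $localOk && $domainOk;
}

var_dump(looks_like_email('sean@php.net'));
var_dump(looks_like_email('no-at-sign'));
```

Each rule now only has to describe one kind of data, which is what keeps the individual checks short.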
Regex can be complicated
            (email validation from MRE)
[Slide: the same full page of regular expression, overlaid with:]

                                       strpos($email, '@') !== false
Dirty Little Secret
• Most tokenizers (lexers) use regular
  expressions to separate tokens
• re2c
• Multiple ways to represent separators,
  whitespace, etc., simplified with regex
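To illustrate the point, here is a minimal hand-rolled lexer that uses one small regular expression per token type. The token names mirror the earlier `$a = 5 + 7;` slide; the function name `lex()` and the rule table are assumptions made for this sketch, and real lexer generators such as re2c work differently under the hood.

```php
<?php
// Hypothetical mini-lexer for statements like "$a = 5 + 7;".
function lex(string $input): array
{
    // One tiny regex per token type; \G anchors each match at the current offset.
    $rules = [
        'Variable'   => '/\G\$[a-z_][a-z0-9_]*/i',
        'Number'     => '/\G\d+/',
        'Whitespace' => '/\G\s+/',
        'Assign'     => '/\G=/',
        'Add'        => '/\G\+/',
        'Terminator' => '/\G;/',
    ];
    $tokens = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        foreach ($rules as $name => $pattern) {
            if (preg_match($pattern, $input, $m, 0, $offset) === 1) {
                $tokens[] = [$name, $m[0]];
                $offset += strlen($m[0]);
                continue 2; // token consumed: start over at the new offset
            }
        }
        // No rule matched the text at this position.
        throw new RuntimeException("Unexpected character at offset $offset");
    }
    return $tokens;
}

foreach (lex('$a = 5 + 7;') as [$name, $text]) {
    printf("%-10s %s\n", $name, $text);
}
```

Each regex stays trivial because it only has to recognise one kind of token; the context comes from the order in which tokens are consumed.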
Practical Uses
• Compile source code
• Simple, contextual replacement (e.g. BBCode)
• Friendly line breaks
• “Curly” quotes, special punctuation
• Input validation/stripping
• Refactoring
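As one example of the refactoring use case, the sketch below renames a variable without touching look-alike text in single-quoted strings or comments. The helper name `rename_variable()` is hypothetical.

```php
<?php
// Hypothetical helper: rename a variable by rewriting only T_VARIABLE tokens.
function rename_variable(string $source, string $from, string $to): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            $out .= $token;        // single-character token such as "=" or ";"
        } elseif ($token[0] === T_VARIABLE && $token[1] === $from) {
            $out .= $to;           // a real variable: rename it
        } else {
            $out .= $token[1];     // everything else is copied verbatim
        }
    }
    return $out;
}

$source = '<?php $a = 5 + 7; echo \'$a\'; // $a';
echo rename_variable($source, '$a', '$total'), "\n";
```

Only the first `$a` is a variable token; the one in the single-quoted string and the one in the comment are left alone.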
PHP’s Tokenizer
• Similar in other languages
• Available (and useful!) in userspace
• Built in to PHP (always available)
PHP Execution
• Lex
• Parse     Tokeny Goodness
• Compile
• Execute
• Cleanup
Tokenizer in Userspace
• token_get_all()
• token_name()
Tokenizer in Userspace
• token_get_all() returns an array of scalars
  and arrays
• A bit hard to work with
• Needs opening tag (<?php or <? depending
  on config)
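One way to smooth over the mixed scalars and arrays is a small wrapper that gives every token the same shape. The helper name `normalized_tokens()` and the use of `null` as the id of single-character tokens are assumptions made for this sketch.

```php
<?php
// Hypothetical helper: return every token as [id, text, line].
// Single-character tokens get an id of null and a computed line number.
function normalized_tokens(string $source): array
{
    $tokens = [];
    $line = 1;
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            [$id, $text, $line] = $token;
        } else {
            [$id, $text] = [null, $token];
        }
        $tokens[] = [$id, $text, $line];
        // Keep the line counter in step for the scalar tokens that follow.
        $line += substr_count($text, "\n");
    }
    return $tokens;
}

foreach (normalized_tokens('<?php $a = 5 + 7; // $b') as [$id, $text, $line]) {
    $name = $id === null ? '(char)' : token_name($id);
    printf("%-14s %s\n", $name, $text);
}
```

Looking names up with `token_name()` is safer than hard-coding token numbers, since the numeric values differ between PHP versions.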
Tokenizer in Userspace
          (Example)
  print_r(token_get_all('<?php $a = 5 + 7; // $b'));
Array
(
  [0]  => Array ( [0] => 367  [1] => <?php  [2] => 1 )
  [1]  => Array ( [0] => 309  [1] => $a     [2] => 1 )
  [2]  => Array ( [0] => 370  [1] =>        [2] => 1 )
  [3]  => =
  [4]  => Array ( [0] => 370  [1] =>        [2] => 1 )
  [5]  => Array ( [0] => 305  [1] => 5      [2] => 1 )
  [6]  => Array ( [0] => 370  [1] =>        [2] => 1 )
  [7]  => +
  [8]  => Array ( [0] => 370  [1] =>        [2] => 1 )
  [9]  => Array ( [0] => 305  [1] => 7      [2] => 1 )
  [10] => ;
  [11] => Array ( [0] => 370  [1] =>        [2] => 1 )
  [12] => Array ( [0] => 365  [1] => // $b  [2] => 1 )
)
Tokenizer in Userspace
       (Example)
[0] => Array
   (
     [0] => 367     Token Number
     [1] => <?php   Token Text
     [2] => 1       Line Number
   )

[1] => Array
    (
      [0] => 309    token_name(309)
      [1] => $a      == 'T_VARIABLE'
      [2] => 1
    )
(...)
[3] => =            Scalar (not array)
Practical Example:
      Simple Highlighter
<pre>
<?php
// Map token types to highlight colors.
$c = array(
    T_VARIABLE => 'red',
    T_LNUMBER => 'blue',
);
foreach (token_get_all(fread(STDIN, 9999999)) as $t) {
    if (!is_array($t)) {
        // Single-character token such as "=" or ";"
        echo htmlentities($t);
    } elseif (!isset($c[$t[0]])) {
        // Token type with no color assigned
        echo htmlentities($t[1]);
    } else {
        echo '<span style="color: ' . $c[$t[0]] . '">'
        . htmlentities($t[1]) . '</span>';
    }
}
?>
</pre>
Highlighter Output
<?php
$a = 5 + 7; // $b
<pre>
&lt;?php
<span style="color: red">$a</span> =
<span style="color: blue">5</span> +
<span style="color: blue">7</span>; // $b
</pre>
Entities
•   Hi... I'm Sean

•   Hi&#8230; I&#8217;m Sean

•   Hi… I’m Sean
Entities
•   Here's some code <code>$foo = 'bar';</code>

•   Here&#8217;s some code <code>$foo = 'bar';</code>

•   Here’s some code <code>$foo = 'bar';</code>
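A rough sketch of how this could work: split the markup into tag tokens and text tokens, then only rewrite text that sits outside `<code>`. The function name `smarten()` is hypothetical, and the sketch makes simplifying assumptions: it only recognises a bare `<code>` tag without attributes, and it turns every apostrophe into a right single quote.

```php
<?php
// Hypothetical sketch: add typographic entities to text, skipping <code> blocks.
function smarten(string $html): string
{
    // Split into tags and text, keeping the tags as tokens of their own.
    $tokens = preg_split(
        '/(<[^>]+>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    $inCode = false;
    $out = '';
    foreach ($tokens as $token) {
        if (preg_match('/^<[^>]+>$/', $token) === 1) {
            // Tag token: track whether we are entering or leaving <code>.
            if (strcasecmp($token, '<code>') === 0) {
                $inCode = true;
            } elseif (strcasecmp($token, '</code>') === 0) {
                $inCode = false;
            }
            $out .= $token;
        } elseif ($inCode) {
            $out .= $token; // code is copied verbatim
        } else {
            $out .= str_replace(["'", '...'], ['&#8217;', '&#8230;'], $token);
        }
    }
    return $out;
}

echo smarten("Here's some code <code>\$foo = 'bar';</code>"), "\n";
```

The quotes around `bar` survive because the text token they belong to is seen while `$inCode` is true.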
Tokalizer
• PHP token analysis wrapper
• Object-oriented
• Normalized
• Includes a partial parser (in PHP, so it’s
  slow). Doesn’t work with new 5.3
  constructs... yet.
• http://github.com/scoates/tokalizer
Context-aware tools
• phpgrep
regular grep:
     file.php:123: matched line
php grep:
     file.php:123(foo::bar()): matched line
Context-aware tools
• diff-php
regular diff:
@@ -68,6 +68,7 @@
php diff:
@@ -68,6 +68,7 @@ GeshiHighlighterFormatPlugin::do_highlight()
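The context shown by both tools comes down to knowing which function a given line belongs to. The sketch below is one hypothetical way to work that out from tokens; it is not Tokalizer's implementation, it reports only the function name (not the class), and it ignores nested and anonymous functions.

```php
<?php
// Hypothetical sketch: name of the function that contains $targetLine, or null.
function enclosing_function(string $source, int $targetLine): ?string
{
    $current = null;      // function we are currently inside, if any
    $pending = null;      // name seen after "function", body not yet opened
    $expectName = false;  // true right after the "function" keyword
    $depth = 0;           // current brace depth
    $bodyDepth = 0;       // brace depth of the current function's body
    $line = 1;            // line on which the current token starts

    foreach (token_get_all($source) as $token) {
        if ($line > $targetLine) {
            break; // everything up to the target line has been processed
        }
        [$id, $text] = is_array($token) ? $token : [null, $token];
        $char = $id === null ? $text : null; // set only for single-character tokens

        if ($id === T_FUNCTION) {
            $expectName = true;
        } elseif ($id === T_STRING && $expectName) {
            $pending = $text;
            $expectName = false;
        } elseif ($char === '(') {
            $expectName = false; // "function (" is an anonymous function
        } elseif ($char === ';') {
            $pending = null;     // declaration without a body
        } elseif ($char === '{' || $id === T_CURLY_OPEN || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
            $depth++;
            if ($pending !== null && $current === null) {
                $current = $pending;
                $bodyDepth = $depth;
            }
            $pending = null;
        } elseif ($char === '}') {
            if ($current !== null && $depth === $bodyDepth) {
                $current = null; // closing brace of the function body
            }
            $depth--;
        }
        $line += substr_count($text, "\n");
    }
    return $current;
}

$source = <<<'CODE'
<?php
class foo {
    function bar() {
        return 1;
    }
}
$x = 2;
CODE;

var_dump(enclosing_function($source, 4));
var_dump(enclosing_function($source, 7));
```

A grep or diff wrapper could call a helper like this for each matched line or hunk start and append the result to its output.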
Token dumps
• text token dump
• definition dump (*cough* currently broken)
• html dump
Habari’s HTML
          Tokenizer
• Filter user input (can strip tags intelligently)
• Allow plugins to inject/replace whole
  blocks of HTML without (developer-facing)
  regex
• Facilitate autop, introspection
HTMLPurifier
• Intelligently filters/escapes potentially
  dangerous data
• Token-based approach
• Really difficult
• Code is slow and memory-intensive, but the
  problem it solves is extremely complicated
Questions? Contact...

• http://seancoates.com/
• sean@php.net
• http://omniti.com/is/sean-coates
• IRC: scoates (Freenode and EFNet)
• @coates on Twitter (if it happens to be up)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Out with Regex, In with Tokens

  • 1. Out With Regex, In With Tokens Sean Coates php|tek 2009
  • 2. Who is this Sean guy? • Web Architect at OmniTI (http://omniti.com/) • Former Editor-in-Chief of php|architect and former organizer of php|tek • PHP Community, Habari, Phergie • Other conferences (PHP Quebec earlier this year) • the Twitter: @coates • Beer Lover (and brewer) • (I speak too quickly)
  • 3. “A token is a categorized block of text. It can look like anything; it just needs to be a useful part of the structured text.” -Wikipedia
  • 4. $a = 5 + 7 ;
  • 5. $a = 5 + 7 ; (10 tokens)
  • 6. $a = 5 + 7 ; Whitespace
  • 7. $a = 5 + 7 ; Whitespace Variable
  • 8. $a = 5 + 7 ; Assign Whitespace Variable
  • 9. $a = 5 + 7 ; Number Assign Whitespace Variable
  • 10. $a = 5 + 7 ; Add Number Assign Whitespace Variable
  • 11. $a = 5 + 7 ; Add Number Assign Number Whitespace Variable
  • 12. $a = 5 + 7 ; Add Number Assign Number Whitespace Terminator Variable
  • 13. Grammar Matters $a = 5 + 7; // $b
  • 14. Grammar Matters $a = 5 + 7; // $b Not a Variable Variable
  • 15. Grammar Matters $a = 5 + 7; // $b Variable Comment
  • 16. PHP Example <?php $a = 5 + 7 ; // $b
  • 17. PHP Example T_OPEN_TAG <?php T_VARIABLE $a T_WHITESPACE = T_WHITESPACE T_LNUMBER 5 T_WHITESPACE + T_WHITESPACE T_LNUMBER 7 ; T_WHITESPACE T_COMMENT // $b
  • 18. “Lexing” • a Lexer converts a sequence of characters into tokens • “Lexical Analysis” • Lex, Flex, re2c (lexer generators)
  • 19. Static vs. Dynamic Analysis • Dynamic: actual execution, practical implementations such as pen. testing. • Static: analysis of code, tokens, opcodes, etc. to determine if a particular action will take place • (not the only use for Tokens, though)
  • 20. Out with Regex • Find all variables • Regex: /(\$[a-z_][a-z0-9_]*)/i
  • 21. Out with Regex • Find all variables • Regex: /(\$[a-z_][a-z0-9_]*)/i • context matters: $str = '$a = 5 + 7; // $b';
  • 22. Regex Fail <?php $str = '$a = 5 + 7; // $b'; preg_match_all( '/(\$[a-z_][a-z0-9_]*)/i', $str, $m ); var_dump($m[0]);
  • 23. Regex Fail array(2) { [0]=> string(2) "$a" [1]=> string(2) "$b" }
  • 24. Out with Regex • Find all variables WRONG! • Regex: /(\$[a-z_][a-z0-9_]*)/i • context matters: $str = '$a = 5 + 7; // $b';
  • 25. Remember? $a = 5 + 7; // $b Variable Not a Variable!
  • 26. Token Approach <?php // look ma, no regex! $str = '<?php $a = 5 + 7; // $b'; foreach (token_get_all($str) as $t) { if (is_array($t) && $t[0] == T_VARIABLE) { echo $t[1] . "\n"; } } // outputs: $a
  • 27. PHP Example (again) T_OPEN_TAG <?php T_VARIABLE $a T_WHITESPACE = T_WHITESPACE T_LNUMBER 5 T_WHITESPACE + T_WHITESPACE T_LNUMBER 7 ; T_WHITESPACE T_COMMENT // $b
  • 28. Regex can be complicated (email validation from MRE) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn 015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?! [^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xff n015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^ x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn 015()] * )* ) [040t]* )* (?: . 
[040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn 015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* | (?: [^(040)<>@,;:quot;.[]000-037x80- xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [^()<>@,;:quot;.[]x80-xff000-010012-037] * (?: (?: ( [^x80- xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80- xffn015quot;] * )* quot; ) [^()<>@,;:quot;.[]x80-xff000-010012-037] * )* < [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xff n015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . 
[040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^ x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* (?: , [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80- xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80- xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80- xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* )* : [040t]* (?: ( [^ x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )? (?: [^(040)<>@,;:quot;.[]000- 037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80- xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . 
[040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^ x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000- 037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80- xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^ x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^ x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* > )
  • 29. Difficult validation made simpler • Email validation is haaaard! • Validate logical units separately: sean@php.net
  • 30. Difficult validation made simpler • Email validation is haaaard! • Validate logical units separately: sean@php.net (Localpart: sean, Separator: @, Domain: php.net)
  • 31. Difficult validation made simpler • Email validation is haaaard! • Validate logical units separately: sean@php.net • Still hard, but validation is restricted to different types of data • BTW, don’t bother (-:
  • 32. Regex can be complicated (email validation from MRE) [the same regex as slide 28, overlaid with:] strpos($email, '@') !== false
  • 33. Dirty Little Secret • Most tokenizers (lexers) use regular expressions to separate tokens • re2c • Multiple ways to represent separators, whitespace, etc. are simplified with regex
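The point of slide 33 can be sketched with a toy lexer for the `$a = 5 + 7; // $b` example from the earlier slides. This is only an illustration under stated assumptions: `toy_lex()` and its rule table are hypothetical names, the token type names are borrowed from slides 6 to 15, and this is not how PHP's own re2c-generated lexer is written. Each token type is described by a small regex, and the lexer tries the rules in order at the current offset.

```php
<?php
// Toy lexer (hypothetical, for illustration only): one small regex per token type.
function toy_lex(string $src): array
{
    // Order matters: the first rule that matches at the current offset wins.
    $rules = [
        'Comment'    => '\/\/.*',              // everything from // to end of line
        'Variable'   => '\$[a-z_][a-z0-9_]*',
        'Number'     => '\d+',
        'Whitespace' => '\s+',
        'Assign'     => '=',
        'Add'        => '\+',
        'Terminator' => ';',
    ];

    $tokens = [];
    $pos = 0;
    $len = strlen($src);

    while ($pos < $len) {
        foreach ($rules as $type => $re) {
            // \G anchors the match at the offset passed as the fifth argument.
            if (preg_match('/\G(?:' . $re . ')/i', $src, $m, 0, $pos)) {
                $tokens[] = [$type, $m[0]];
                $pos += strlen($m[0]);
                continue 2; // token consumed: restart the rule list at the new offset
            }
        }
        throw new RuntimeException("Unexpected character at offset $pos");
    }

    return $tokens;
}

foreach (toy_lex('$a = 5 + 7; // $b') as [$type, $text]) {
    printf("%-10s %s\n", $type, $text);
}
```

Because the comment rule consumes `// $b` as a single token, the `$b` inside it never reaches the variable rule, which is exactly the context that the plain variable regex on slide 22 cannot see.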
  • 34. Practical Uses • Compile source code • Simple, contextual replacement (e.g. BBCode) • Friendly line breaks • “Curly” quotes, special punctuation • Input validation/stripping • Refactoring
  • 35. PHP’s Tokenizer • Similar in other languages • Available (and useful!) in userspace • Built in to PHP (always available)
  • 36. PHP Execution • Lex • Parse • Compile • Execute • Cleanup
  • 37. PHP Execution • Lex • Parse Tokeny Goodness • Compile • Execute • Cleanup
  • 38. Tokenizer in Userspace • token_get_all() • token_name()
  • 39. Tokenizer in Userspace • token_get_all() returns an array of scalars and arrays • A bit hard to work with • Needs opening tag (<?php or <? depending on config)
  • 40. Tokenizer in Userspace (Example) print_r(token_get_all('<?php $a = 5 + 7; // $b')); Array ( [0] => Array ( [0] => 367 [1] => <?php [2] => 1 ) [1] => Array ( [0] => 309 [1] => $a [2] => 1 ) [2] => Array ( [0] => 370 [1] => [2] => 1 ) [3] => = [4] => Array ( [0] => 370 [1] => [2] => 1 ) [5] => Array ( [0] => 305 [1] => 5 [2] => 1 ) [6] => Array ( [0] => 370 [1] => [2] => 1 ) [7] => + [8] => Array ( [0] => 370 [1] => [2] => 1 ) [9] => Array ( [0] => 305 [1] => 7 [2] => 1 ) [10] => ; [11] => Array ( [0] => 370 [1] => [2] => 1 ) [12] => Array ( [0] => 365 [1] => // $b [2] => 1 ) )
  • 41. Tokenizer in Userspace (Example) [0] => Array ( [0] => 367 (token number) [1] => <?php (token text) [2] => 1 (line number) ) [1] => Array ( [0] => 309 (token_name(309) == 'T_VARIABLE') [1] => $a [2] => 1 ) (...) [3] => = (scalar, not an array)
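Slides 39 to 41 note that the mixed scalar/array output of `token_get_all()` is a bit hard to work with. One possible way to smooth that over is a small wrapper that gives every token the same shape. The sketch below is an assumption, not code from the talk: `normalize_tokens()` and the `'CHAR'` label are hypothetical, and the line number attached to single-character tokens is only approximate (it is the starting line of the most recent array token). It uses `token_name()` rather than the raw numbers because the numeric token IDs differ between PHP versions.

```php
<?php
// Hypothetical helper (not part of PHP or of the talk): give every token the same shape.
function normalize_tokens(string $source): array
{
    $out = [];
    $line = 1; // last line number reported by an array token

    foreach (token_get_all($source) as $t) {
        if (is_array($t)) {
            // Array tokens are [token id, token text, line number].
            [$id, $text, $line] = $t;
            $out[] = ['name' => token_name($id), 'text' => $text, 'line' => $line];
        } else {
            // Scalar tokens such as "=", "+" and ";" carry no id and no line number.
            $out[] = ['name' => 'CHAR', 'text' => $t, 'line' => $line];
        }
    }

    return $out;
}

foreach (normalize_tokens('<?php $a = 5 + 7; // $b') as $tok) {
    printf("%-14s %s\n", $tok['name'], json_encode($tok['text']));
}
```

With a uniform shape, the `is_array()` check from slide 26 is only needed in one place.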
  • 42. Practical Example: Simple Highlighter <pre> <?php $c = array( T_VARIABLE => 'red', T_LNUMBER => 'blue', ); foreach (token_get_all(fread(STDIN, 9999999)) as $t) { if (!is_array($t)) { echo htmlentities($t); } elseif (!isset($c[$t[0]])) { echo htmlentities($t[1]); continue; } else { echo '<span style="color: ' . $c[$t[0]] . '">' . htmlentities($t[1]) . '</span>'; } } ?> </pre>
  • 43. Highlighter Output <?php $a = 5 + 7; // $b <pre> &lt;?php <span style="color: red">$a</span> = <span style="color: blue">5</span> + <span style="color: blue">7</span>; // $b </pre>
  • 44. Entities • Hi... I'm Sean
  • 45. Entities • Hi... I'm Sean • Hi&#8230; I&#8217;m Sean • Hi… I’m Sean
  • 46. Entities • Here's some code <code>$foo = 'bar';</code> • Here&#8217;s some code
  • 47. Entities • Here's some code <code>$foo = 'bar';</code> • Here&#8217;s some code <code>$foo = 'bar';</code> • Here’s some code <code>$foo = 'bar';</code>
  • 48. Entities • Here's some code <code>$foo = 'bar';</code> • Here&#8217;s some code <code>$foo = 'bar';</code> • Here’s some code <code>$foo = 'bar';</code>
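The entity slides show punctuation being "curled" in prose while the contents of `<code>` stay untouched. A minimal sketch of that idea, assuming a very simple rule set: split the input into code tokens and text tokens, then transform only the text tokens. The `smarten()` function is a hypothetical name, it turns every apostrophe into a right single quote (real typography needs opening quotes too), and it does not handle nested or attribute-bearing `<code>` tags.

```php
<?php
// Hypothetical sketch: replace punctuation with entities everywhere except inside <code>...</code>.
function smarten(string $html): string
{
    // The capture group makes preg_split() keep the <code> chunks as their own tokens.
    $parts = preg_split('~(<code>.*?</code>)~s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $part) {
        if (strncmp($part, '<code>', 6) === 0) {
            $out .= $part; // code token: leave exactly as written
        } else {
            // text token: apostrophe and ellipsis become numeric entities
            $out .= strtr($part, ["'" => '&#8217;', '...' => '&#8230;']);
        }
    }

    return $out;
}

echo smarten("Hi... I'm Sean"), "\n";
echo smarten("Here's some code <code>\$foo = 'bar';</code>"), "\n";
```

The same split-then-transform shape applies to the other uses listed on slide 34, such as friendly line breaks or BBCode-style replacement.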
  • 49. Tokalizer • PHP token analysis wrapper • Object-oriented • Normalized • Includes a partial parser (in PHP, so it’s slow). Doesn’t work with new 5.3 constructs... yet. • http://github.com/scoates/tokalizer
  • 50. Context-aware tools • phpgrep regular grep: file.php:123: matched line php grep: file.php:123(foo::bar()): matched line
  • 51. Context-aware tools • diff-php regular diff: @@ -68,6 +68,7 @@ php diff: @@ -68,6 +68,7 @@ GeshiHighlighterFormatPlugin::do_highlight()
  • 52. Token dumps • text token dump • definition dump (*cough* currently broken) • html dump
  • 53. Habari’s HTML Tokenizer • Filter user input (can strip tags intelligently) • Allow plugins to inject/replace whole blocks of HTML without (developer-facing) regex • Facilitate autop, introspection
  • 54. HTMLPurifier • Intelligently filters/escapes potentially dangerous data • Token-based approach • Really difficult • Code is slow and memory-intensive, but it’s extremely complicated
  • 55.
  • 56. Questions? Contact... • http://seancoates.com/ • sean@php.net • http://omniti.com/is/sean-coates • IRC: scoates (Freenode and EFNet) • @coates on Twitter (if it happens to be up)