Out with Regex, In with Tokens
2. Who is this Sean guy?
• Web Architect at OmniTI (http://omniti.com/)
• Former Editor-in-Chief of php|architect and former
organizer of php|tek
• PHP Community, Habari, Phergie
• Other conferences (PHP Quebec earlier this year)
• the Twitter: @coates
• Beer Lover (and brewer)
• (I speak too quickly)
3. “A token is a
categorized block of
text. It can look like
anything; it just needs
to be a useful part of
the structured text.”
-Wikipedia
7-12. $a = 5 + 7 ;
Variable: $a
Whitespace: the spaces between the other tokens
Assign: =
Number: 5
Add: +
Number: 7
Terminator: ;
17. PHP Example
T_OPEN_TAG <?php
T_VARIABLE $a
T_WHITESPACE
=
T_WHITESPACE
T_LNUMBER 5
T_WHITESPACE
+
T_WHITESPACE
T_LNUMBER 7
;
T_WHITESPACE
T_COMMENT // $b
18. “Lexing”
• a Lexer converts a sequence of characters
into tokens
• “Lexical Analysis”
• Lex, Flex, re2c (lexer generators)
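To make "converts a sequence of characters into tokens" concrete, here is a minimal hand-written lexer for the `$a = 5 + 7 ;` example. It is only a sketch: the function name `lex()` and the token names (taken from the earlier slides) are illustrative, and real lexers generated by Lex, Flex, or re2c handle far more cases.

```php
<?php
// Hypothetical toy lexer: walks the input one character at a time
// and groups characters into [type, text] pairs.
function lex(string $src): array
{
    // Single-character tokens and their categories.
    $single = ['=' => 'Assign', '+' => 'Add', ';' => 'Terminator'];
    $tokens = [];
    $len = strlen($src);
    $i = 0;
    while ($i < $len) {
        $start = $i;
        $ch = $src[$i];
        if (isset($single[$ch])) {
            $type = $single[$ch];
            $i++;
        } elseif (ctype_space($ch)) {
            $type = 'Whitespace';
            while ($i < $len && ctype_space($src[$i])) {
                $i++;
            }
        } elseif (ctype_digit($ch)) {
            $type = 'Number';
            while ($i < $len && ctype_digit($src[$i])) {
                $i++;
            }
        } elseif ($ch === '$') {
            // Simplification: "$" followed by any run of letters, digits, or "_".
            $type = 'Variable';
            $i++;
            while ($i < $len && (ctype_alnum($src[$i]) || $src[$i] === '_')) {
                $i++;
            }
        } else {
            throw new InvalidArgumentException("Unexpected character '$ch' at offset $i");
        }
        $tokens[] = [$type, substr($src, $start, $i - $start)];
    }
    return $tokens;
}

foreach (lex('$a = 5 + 7 ;') as [$type, $text]) {
    echo str_pad($type, 12), '[', $text, "]\n";
}
```

Each token keeps its original text, so concatenating the text of all tokens reproduces the input exactly.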
19. Static vs. Dynamic Analysis
• Dynamic: actual execution; practical
implementations include penetration testing.
• Static: analysis of code, tokens, opcodes,
etc. to determine if a particular action will
take place
• (not the only use for Tokens, though)
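As a small illustration of static analysis with tokens, the sketch below reports where `eval` is called in a piece of source code without ever running that code. The function name `find_eval_lines()` and the sample source are made up for this example; `token_get_all()` and the `T_EVAL` constant come from PHP's tokenizer extension.

```php
<?php
// Hypothetical static check: list the line numbers on which eval appears
// as real code (not inside a string or a comment).
function find_eval_lines(string $source): array
{
    $lines = [];
    foreach (token_get_all($source) as $t) {
        if (is_array($t) && $t[0] === T_EVAL) {
            $lines[] = $t[2]; // element 2 is the line number
        }
    }
    return $lines;
}

$code = "<?php\n\$a = 'eval(1);';\n// eval(2);\neval('echo 1;');\n";
print_r(find_eval_lines($code)); // only line 4 is reported
```

The occurrences on lines 2 and 3 are ignored because the tokenizer classifies them as a string and a comment.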
21. Out with Regex
• Find all variables
• Regex:
/(\$[a-z_][a-z0-9_]*)/i
• context matters:
$str = '$a = 5 + 7; // $b';
22. Regex Fail
<?php
$str = '$a = 5 + 7; // $b';
preg_match_all(
    '/(\$[a-z_][a-z0-9_]*)/i', $str, $m
);
// finds both $a and $b, even though $b is only in a comment
var_dump($m[0]);
24. Out with Regex
• Find all variables: WRONG!
• Regex:
/(\$[a-z_][a-z0-9_]*)/i
• context matters:
$str = '$a = 5 + 7; // $b';
26. Token Approach
<?php
// look ma, no regex!
$str = '<?php $a = 5 + 7; // $b';
foreach (token_get_all($str) as $t) {
    if (is_array($t) && $t[0] == T_VARIABLE) {
        echo $t[1] . "\n";
    }
}
// outputs: $a
27. PHP Example (again)
T_OPEN_TAG <?php
T_VARIABLE $a
T_WHITESPACE
=
T_WHITESPACE
T_LNUMBER 5
T_WHITESPACE
+
T_WHITESPACE
T_LNUMBER 7
;
T_WHITESPACE
T_COMMENT // $b
28. Regex can be complicated
(email validation from MRE)
[040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn
015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] *
(?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?!
[^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xff
n015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()]
* (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^
x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn
015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )*
) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn
015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* | (?: [^(040)<>@,;:quot;.[]000-037x80-
xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [^()<>@,;:quot;.[]x80-xff000-010012-037] * (?: (?: ( [^x80-
xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-
xffn015quot;] * )* quot; ) [^()<>@,;:quot;.[]x80-xff000-010012-037] * )* < [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xff
n015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()]
* )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )*
] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: .
[040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^
x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* (?: , [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff]
| ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-
xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-
xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) )
[^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-
xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?:
( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* )* : [040t]* (?: ( [^
x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )? (?: [^(040)<>@,;:quot;.[]000-
037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-
xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^
x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-
037x80-xff]) | quot; [^x80-xffn015quot;] * (?: [^x80-xff] [^x80-xffn015quot;] * )* quot; ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff]
[^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* @ [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-
xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?: [^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^
x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )*
(?: . [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* (?:
[^(040)<>@,;:quot;.[]000-037x80-xff]+ (?![^(040)<>@,;:quot;.[]000-037x80-xff]) | [ (?: [^x80-xffn015[]] | [^x80-xff] )* ] ) [040t]* (?: ( [^x80-xffn015()] * (?: (?: [^
x80-xff] | ( [^x80-xffn015()] * (?: [^x80-xff] [^x80-xffn015()] * )* ) ) [^x80-xffn015()] * )* ) [040t]* )* )* > )
29. Difficult validation made simpler
• Email validation is haaaard!
• Validate logical units separately:
sean@php.net
Localpart: sean
Separator: @
Domain: php.net
• Still hard, but validation is restricted to
different types of data
• BTW, don’t bother (-:
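A rough sketch of "validate logical units separately": split the address at the separator, then apply a small rule to each part. The helper names and the rules themselves (a non-empty local part of at most 64 bytes, a dotted hostname-like domain) are deliberately loose assumptions made for this example and are nowhere near full RFC validation.

```php
<?php
// Hypothetical helper: split an address into its three logical units.
function split_email(string $email): ?array
{
    $at = strrpos($email, '@'); // treat the last "@" as the separator
    if ($at === false) {
        return null;
    }
    return [
        'localpart' => substr($email, 0, $at),
        'separator' => '@',
        'domain'    => (string) substr($email, $at + 1),
    ];
}

// Hypothetical, intentionally loose per-unit checks.
function looks_like_email(string $email): bool
{
    $parts = split_email($email);
    if ($parts === null) {
        return false;
    }
    $localOk  = $parts['localpart'] !== '' && strlen($parts['localpart']) <= 64;
    $domainOk = preg_match('/^[a-z0-9-]+(\.[a-z0-9-]+)+$/i', $parts['domain']) === 1;
    return $localOk && $domainOk;
}

var_dump(looks_like_email('sean@php.net')); // bool(true)
var_dump(looks_like_email('sean'));         // bool(false)
```

Each rule now only has to understand one kind of data, which is the point of the slide, even if the rules stay imperfect.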
32. Regex can be complicated
(email validation from MRE)
(the same expression as on slide 28, replaced with:)
strpos($email, '@') !== false
33. Dirty Little Secret
• Most tokenizers (lexers) use regular
expressions to separate tokens
• re2c
• Multiple ways to represent separators,
whitespace, etc., simplified with regex
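The "dirty little secret" can be sketched as a table-driven lexer in which every token type is described by a small regular expression. This is an illustration only: `regex_lex()` and its rule table are invented here, and generators such as re2c compile their rules to much faster code than a loop of `preg_match()` calls.

```php
<?php
// Hypothetical regex-driven lexer. \G anchors each pattern at the current offset.
function regex_lex(string $src): array
{
    $rules = [
        'Whitespace' => '/\G\s+/',
        'Variable'   => '/\G\$[a-z_][a-z0-9_]*/i',
        'Number'     => '/\G\d+/',
        'Assign'     => '/\G=/',
        'Add'        => '/\G\+/',
        'Terminator' => '/\G;/',
    ];
    $tokens = [];
    $offset = 0;
    $len = strlen($src);
    while ($offset < $len) {
        foreach ($rules as $name => $re) {
            if (preg_match($re, $src, $m, 0, $offset) === 1) {
                $tokens[] = [$name, $m[0]];
                $offset += strlen($m[0]);
                continue 2; // token found: resume the outer loop at the new offset
            }
        }
        throw new InvalidArgumentException("Unexpected character at offset $offset");
    }
    return $tokens;
}

foreach (regex_lex('$a = 5 + 7;') as [$name, $text]) {
    echo str_pad($name, 12), '[', $text, "]\n";
}
```

Each regex stays tiny because it only has to recognize one token; the context ("is this inside a comment?") comes from the order in which tokens are consumed.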
34. Practical Uses
• Compile source code
• Simple, contextual replacement (e.g. BBCode)
• Friendly line breaks
• “Curly” quotes, special punctuation
• Input validation/stripping
• Refactoring
39. Tokenizer in Userspace
• token_get_all() returns an array of scalars
and arrays
• A bit hard to work with
• Needs opening tag (<?php or <? depending
on config)
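A quick way to see why the opening tag is needed: without it, the tokenizer treats the whole input as inline HTML. The variable names below are only for illustration.

```php
<?php
// Same statement, with and without an opening tag.
$without = token_get_all('$a = 5 + 7;');
$with    = token_get_all('<?php $a = 5 + 7;');

echo count($without), ' token: ', token_name($without[0][0]), "\n"; // 1 token: T_INLINE_HTML
echo token_name($with[0][0]), ', ', token_name($with[1][0]), "\n";  // T_OPEN_TAG, T_VARIABLE
```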
40. Tokenizer in Userspace
(Example)
print_r(token_get_all('<?php $a = 5 + 7; // $b'));
Array
(
    [0] => Array ( [0] => 367 [1] => <?php [2] => 1 )
    [1] => Array ( [0] => 309 [1] => $a [2] => 1 )
    [2] => Array ( [0] => 370 [1] => [2] => 1 )
    [3] => =
    [4] => Array ( [0] => 370 [1] => [2] => 1 )
    [5] => Array ( [0] => 305 [1] => 5 [2] => 1 )
    [6] => Array ( [0] => 370 [1] => [2] => 1 )
    [7] => +
    [8] => Array ( [0] => 370 [1] => [2] => 1 )
    [9] => Array ( [0] => 305 [1] => 7 [2] => 1 )
    [10] => ;
    [11] => Array ( [0] => 370 [1] => [2] => 1 )
    [12] => Array ( [0] => 365 [1] => // $b [2] => 1 )
)
41. Tokenizer in Userspace
(Example)
[0] => Array
(
    [0] => 367      // token number
    [1] => <?php    // token text
    [2] => 1        // line number
)
[1] => Array
(
    [0] => 309      // token_name(309) == 'T_VARIABLE'
    [1] => $a
    [2] => 1
)
(...)
[3] => =            // scalar (not an array)
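Because the result mixes arrays and scalars, a thin wrapper that gives every token the same shape makes it easier to work with. The sketch below is one possible approach; `normalized_tokens()` is a made-up helper, and the numeric token IDs shown above differ between PHP versions, which is why it converts them with `token_name()`.

```php
<?php
// Hypothetical helper: return every token as [name, text, line].
function normalized_tokens(string $source): array
{
    $out = [];
    $line = 1; // line on which the next scalar token starts
    foreach (token_get_all($source) as $t) {
        if (is_array($t)) {
            [$id, $text, $tokLine] = $t;
            $out[] = [token_name($id), $text, $tokLine];
            // Scalar tokens carry no line number, so track it ourselves.
            $line = $tokLine + substr_count($text, "\n");
        } else {
            // Scalar token such as "=" or ";": use its text as its name.
            $out[] = [$t, $t, $line];
        }
    }
    return $out;
}

foreach (normalized_tokens('<?php $a = 5 + 7; // $b') as [$name, $text, $line]) {
    printf("%-14s %-8s line %d\n", $name, trim($text), $line);
}
```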
42. Practical Example: Simple Highlighter
<pre>
<?php
// Map token types to highlight colours.
$c = array(
    T_VARIABLE => 'red',
    T_LNUMBER  => 'blue',
);
// Read PHP source from standard input and tokenize it.
foreach (token_get_all(stream_get_contents(STDIN)) as $t) {
    if (!is_array($t)) {
        // Scalar token such as "=" or ";"
        echo htmlentities($t);
    } elseif (!isset($c[$t[0]])) {
        // Token type with no colour assigned
        echo htmlentities($t[1]);
    } else {
        echo '<span style="color: ' . $c[$t[0]] . '">'
            . htmlentities($t[1]) . '</span>';
    }
}
?>
</pre>
43. Highlighter Output
<?php
$a = 5 + 7; // $b
<pre>
&lt;?php
<span style="color: red">$a</span> =
<span style="color: blue">5</span> +
<span style="color: blue">7</span>; // $b
</pre>
45. Entities
• Hi... I'm Sean
• Hi… I’m Sean
• Hi… I’m Sean
46. Entities
• Here's some code <code>$foo = 'bar';</code>
• Here’some code
47. Entities
• Here's some code <code>$foo = 'bar';</code>
• Here’some code <code>$foo = 'bar';</code>
• Here’s some code <code>$foo = 'bar';</code>
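The entity slides can be sketched with a tiny HTML tokenizer: split the input into tags and text, then replace punctuation only in text that is not inside `<code>`. The function name `smarten()` and the choice of `&hellip;` and `&rsquo;` are assumptions made for this example, and nested or malformed markup is not handled.

```php
<?php
// Hypothetical helper: curl apostrophes and ellipses, but leave <code> contents alone.
function smarten(string $html): string
{
    // Tokenize into tags and text; DELIM_CAPTURE keeps the tags as tokens.
    $tokens = preg_split(
        '/(<[^>]+>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    $inCode = false;
    $out = '';
    foreach ($tokens as $token) {
        if (preg_match('/^<[^>]+>$/', $token) === 1) {
            // Tag token: track whether we are inside <code>...</code>.
            if (preg_match('/^<code\b/i', $token) === 1) {
                $inCode = true;
            } elseif (preg_match('/^<\/code\b/i', $token) === 1) {
                $inCode = false;
            }
            $out .= $token;
        } elseif ($inCode) {
            $out .= $token; // code is copied through untouched
        } else {
            $out .= str_replace(['...', "'"], ['&hellip;', '&rsquo;'], $token);
        }
    }
    return $out;
}

echo smarten('Hi... I\'m Sean'), "\n";
echo smarten('Here\'s some code <code>$foo = \'bar\';</code>'), "\n";
```

The split itself uses a regular expression, but the decision about what to replace is made per token, with the `<code>` context known.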
49. Tokalizer
• PHP token analysis wrapper
• Object-oriented
• Normalized
• Includes a partial parser (in PHP, so it’s
slow). Doesn’t work with new 5.3
constructs... yet.
• http://github.com/scoates/tokalizer
53. Habari’s HTML Tokenizer
• Filter user input (can strip tags intelligently)
• Allow plugins to inject/replace whole
blocks of HTML without (developer-facing)
regex
• Facilitate autop, introspection