From my November 3, 2011 talk at MNPHP. Regular expressions are a powerful tool available in nearly every programming language or platform, including PHP. I go over the history of POSIX vs. PCRE, examples in PHP, and tips for writing faster expressions.
2. Regular Expressions
Regular expressions provide a concise, flexible means for
matching strings of text, such as words or patterns of
characters.
POSIX: Portable Operating System Interface
• Traditional Unix regular expression syntax
• PHP’s ereg_ functions
• Basic and extended versions

PCRE: Perl Compatible Regular Expressions
• Perl 5 Extended Features
• Native C Extension
• Generally Faster
• Optimization Qualifiers
Used by:
• Programming languages
• Apache and other servers
3. Why Use Them?
• Input Validation
• Input Filtering
• Search and Replace
• Parsing and Data Extraction
• Dynamic Recursion
• Automation
4. In PHP, POSIX = Deprecated
ereg_* functions are deprecated as of PHP 5.3 and were removed in PHP 7.0.
Switching to preg_* is generally painless. Pain points:
• Different matching criteria (greed)
• preg_* requires delimiters
• Different characters require escape sequences
• preg favors option modifiers over functions
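As a rough illustration of the switch, the sketch below puts POSIX-style checks next to PCRE equivalents. The patterns and sample values are made up for this example, and the old ereg()/eregi() calls appear only as comments because those functions no longer exist in current PHP.

```php
<?php
// Old POSIX style (comments only, these functions are gone in PHP 7+):
//   ereg('^[0-9]+$', $input);
//   eregi('^hello$', $greeting);

$input = '12345';
$greeting = 'HeLLo';

// preg_* needs delimiters around the pattern (here: /).
$isDigits = preg_match('/^[0-9]+$/', $input) === 1;

// Case-insensitivity is an option modifier (i), not a separate function.
$isHello = preg_match('/^hello$/i', $greeting) === 1;

// A literal delimiter inside the pattern must be escaped,
// so a different delimiter (#) is easier for paths.
$hasPath = preg_match('#^/usr/local/#', '/usr/local/bin/php') === 1;

var_dump($isDigits, $isHello, $hasPath);
```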
5. Anatomy of a PHP Regular Expression
/foo/i
• Delimiters: the two / characters
• Pattern to match: foo
• Options/modifiers: the trailing i
preg_replace(
    '/(href|src)=\'([^\']*)\'/i',
    '\1="\2"',
    $str
);
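To see what a call like this does, here is a minimal runnable sketch. It assumes the intent is to turn single-quoted href/src attribute values into double-quoted ones. The sample markup in $str is invented for illustration.

```php
<?php
// Hypothetical input: attributes quoted with single quotes.
$str = "<a HREF='/home'><img src='logo.png'></a>";

$result = preg_replace(
    '/(href|src)=\'([^\']*)\'/i', // group 1: attribute name, group 2: its value
    '\1="\2"',                    // re-emit both, wrapped in double quotes
    $str
);

echo $result, PHP_EOL;
```

The i modifier lets the pattern match HREF as well as href, and the replacement keeps whatever casing was captured.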
6. PHP Regular Expressions
• Must use a delimiter: ! @ # /
• Use PHP’s single-quoted strings (backslashes rarely need escaping)
preg_match       Match against a pattern and extract text
preg_replace     Like str_replace with a pattern (and sub-patterns)
preg_match_all   Like preg_match, but returns an array of every match and a count
preg_split       Like explode() but with a pattern
preg_quote       Escapes text for use in a regular expression
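The five functions can be tried side by side in a short sketch. The sample strings and variable names below are assumptions chosen for the example, not taken from the talk.

```php
<?php
$text = 'Order 66 shipped 3 items on day 12';

// preg_match: stops at the first match; returns 1 or 0.
preg_match('/\d+/', $text, $first);

// preg_match_all: collects every match; returns how many were found.
$count = preg_match_all('/\d+/', $text, $all);

// preg_replace: like str_replace, but the search term is a pattern.
$masked = preg_replace('/\d+/', '#', $text);

// preg_split: like explode(), but the separator is a pattern.
$words = preg_split('/\s+/', 'a  b   c');

// preg_quote: escapes special characters (and the given delimiter)
// so arbitrary text can be embedded in a pattern literally.
$quoted = preg_quote('1+1=2', '/');
$found = preg_match('/' . $quoted . '/', 'so 1+1=2 holds');

print_r([$first[0], $count, $all[0], $masked, $words, $quoted, $found]);
```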
7. Modifiers and Options
i   PCRE_CASELESS – Ignores case
m   PCRE_MULTILINE – ^ and $ match at the start and end of every line
s   PCRE_DOTALL – Dot (.) also matches new-lines
U   PCRE_UNGREEDY – Don’t be greedy
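A small sketch of each modifier in action, comparing the same pattern with and without the flag. The two-line sample text and the variable names are assumptions made for this example.

```php
<?php
$text = "first line\nSecond line";

// i: ignore case.
$withoutI = preg_match('/second/', $text);
$withI    = preg_match('/second/i', $text);

// m: ^ and $ also match at the start and end of each line.
$withoutM = preg_match('/^Second/', $text);
$withM    = preg_match('/^Second/m', $text);

// s: the dot also matches new-lines.
$withoutS = preg_match('/first.*Second/', $text);
$withS    = preg_match('/first.*Second/s', $text);

// U: quantifiers stop at the smallest match instead of the largest.
preg_match('/l.*e/', 'first line', $greedy);
preg_match('/l.*e/U', 'first line', $ungreedy);

print_r([$withoutI, $withI, $withoutM, $withM, $withoutS, $withS, $greedy[0], $ungreedy[0]]);
```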
8. Performance Killers
Slow-downs in performance generally come from:
• Alternation, the pipe/OR operator (|)
Use [abcd] when possible over (a|b|c|d)
• Dot matching across new-lines (PCRE_DOTALL or /s)
• Recursion: (\d+)\d*
Use explicit lengths ({n} or {n,m}) when possible
It’s not that slow!
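One way to check claims like these on your own machine is a small timing loop. The sketch below compares alternation with a character class. The helper time_pattern(), the subject string, and the run count are all hypothetical, and the timings will vary by PHP version and PCRE settings.

```php
<?php
// Hypothetical helper: time $runs executions of preg_match_all().
function time_pattern(string $pattern, string $subject, int $runs): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        preg_match_all($pattern, $subject, $unused);
    }
    return microtime(true) - $start;
}

$subject = str_repeat('xyz abcd ', 500);
$runs = 200;

$patterns = [
    'alternation' => '/(a|b|c|d)/',
    'char class'  => '/[abcd]/',
];

foreach ($patterns as $label => $pattern) {
    printf("%-12s %.4f seconds\n", $label, time_pattern($pattern, $subject, $runs));
}

// Both patterns find the same matches, so the comparison is fair.
$countAlt   = preg_match_all('/(a|b|c|d)/', $subject);
$countClass = preg_match_all('/[abcd]/', $subject);
```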
9. Sub-Patterns
Sub-Patterns allow you to extract relevant text from searches:
• For preg_replace, use either \1 or $1 in your replacement string
• Sub-patterns are numbered from left to right by the position of their opening parenthesis "("
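The numbering rule is easiest to see with nested groups. This is a minimal sketch using an invented date string; the outer group opens first, so it gets index 1 even though it closes after groups 2 and 3.

```php
<?php
// Groups are numbered by where their "(" appears, left to right:
//   1: ((\d{4})-(\d{2}))   2: (\d{4})   3: (\d{2})   4: (\d{2})
preg_match('/((\d{4})-(\d{2}))-(\d{2})/', 'Released 2011-11-03', $m);

// $m[0] is always the full match; $m[1]..$m[4] are the sub-patterns.
print_r($m);

// In preg_replace, $1 (or \1) refers to the same numbered groups.
$usDate = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$2/$3/$1', '2011-11-03');
echo $usDate, PHP_EOL;
```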
16. Non-Greedy with Modifier
The /U modifier returns the SMALLEST match.
100,000 runs took 0.2638 seconds
(a little better, and it’s right)
17. Restrictive Wild-Carding
No greedy flag needed, faster without broad wild-cards.
100,000 runs took 0.2271 seconds
(fastest yet, no options needed)
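The patterns that were benchmarked on these slides are not included in the text, so the sketch below uses a made-up tag-matching example to show the same three approaches: a greedy wild-card, the /U modifier, and a restrictive character class.

```php
<?php
$html = '<b>bold</b> and <i>italic</i>';

// Greedy wild-card: runs from the first "<" to the LAST ">".
preg_match('/<.+>/', $html, $greedy);

// /U modifier: same pattern, but stops at the smallest match.
preg_match('/<.+>/U', $html, $ungreedy);

// Restrictive wild-card: "anything except >" cannot overshoot,
// so no greedy flag is needed.
preg_match('/<[^>]+>/', $html, $restrictive);

// The restrictive form also picks out every tag cleanly.
preg_match_all('/<[^>]+>/', $html, $tags);

print_r([$greedy[0], $ungreedy[0], $restrictive[0], $tags[0]]);
```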
18. grep
Use grep -E or egrep for extended regular expressions (+, ?, |)
and advanced functionality.
-A n Print the next n lines after each match.
-B n Print the previous n lines before each match.
-i Ignore case
-m n Stop after n matches
-r Recursively search the file system
-n Show line numbers
-v Only show lines that don’t match
19. sed
Use -r (-E on OS X / FreeBSD) for extended regular expressions.
20. The End
Web: http://andrewkandels.com
Mail: akandels@gmail.com
Twitter: @andrewkandels