The document discusses regular expressions (regex) in PHP. It begins with a brief introduction to regex, then provides examples of common PHP functions for using regex like preg_match(), preg_replace(), and preg_quote(). The document also shares a very long regex pattern that is intended to match all valid email addresses. It notes that accurately matching email addresses with a regex is challenging.
Introduction to regex in PHP by David Stockton, focusing on its applications and functionalities.
Definition of regular expressions as text patterns, their power, and humor regarding their misuse.
Overview of preg_* functions for regex in PHP, emphasizing the use of preg_match, preg_replace, and others.
Instructions for writing regex patterns to match email addresses; discusses limitations and complexity.
The construction of a comprehensive regex for validating email addresses, suggesting simpler alternatives.
Fundamentals of matching specific characters and words using regex patterns.
Explanation of regex delimiters and character classes for matching specific characters or ranges.
Patterns for repeating characters, special symbols for matching digits, whitespace, and formatting examples.
Explanation of anchors and word boundaries for defining positions within strings.
Using alternation for matching strings and the difference between greedy and lazy matching.
Utilization of capturing groups and backreferences in regex for reusing matched content.
Modifiers for regex behavior and the use of named capture groups for easier data extraction.
Explanation of lookaheads and lookbehinds in regex, focusing on their zero-width nature.
Guidelines on when to use regex, recommended alternatives, and resources for further learning.
Regular Expressions in PHP
/(?:dave@davidstockton\.com)/
Front Range PHP User Group
David Stockton
2.
What is a regular expression?
A pattern used to describe a part of some text
“Regular” has some implications for how it can be built, but that’s not really part of this presentation
Extremely powerful and useful
(And often abused)
3.
Regex Joke
A programmer says, “I have a problem that I can solve with regular expressions.”
Now, he has two problems…
4.
How to use regex in PHP
The preg_* functions
Perl-compatible regular expressions.
Probably the most common regex syntax
The ereg_* functions
POSIX-style regular expressions
I am not covering these functions.
Don’t use the ereg ones. They are deprecated as of PHP 5.3 (and were removed in PHP 7.0).
5.
How can we use regex in PHP?
preg_match( ) – Searches a subject for a match
preg_match_all( ) – Searches a subject for all matches
preg_replace( ) – Searches a subject for a pattern and replaces it with something else
preg_split( ) – Splits a string into an array based on a regex delimiter
preg_filter( ) – Identical to preg_replace except it returns only the subjects where there was a match
preg_replace_callback( ) – Like preg_replace, but the replacement is defined in a callback
preg_grep( ) – Returns the elements of an array that match a pattern
6.
How can we use regex in PHP?
preg_quote( ) – Quotes regular expression characters
preg_last_error( ) – Returns the error code of the last PCRE (Perl Compatible Regular Expression) function execution
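A minimal sketch of a few of the calls listed above. The sample strings and variable names are illustrative and do not come from the slides.

```php
<?php
$subject = 'PHP 5.3 was followed by PHP 5.4';

// preg_match() stops at the first match and returns 1 (match) or 0 (no match).
$found = preg_match('/\d\.\d/', $subject, $first);

// preg_match_all() collects every match and returns how many it found.
$count = preg_match_all('/\d\.\d/', $subject, $all);

// preg_replace() swaps each match for the replacement text.
$masked = preg_replace('/\d\.\d/', 'x.y', $subject);

// preg_split() cuts the subject wherever the pattern matches.
$words = preg_split('/\s+/', 'one  two three');

// preg_grep() keeps only the array elements that match (original keys are preserved).
$numeric = preg_grep('/^\d+$/', ['12', 'ab', '7']);

// preg_quote() escapes regex metacharacters so the text is matched literally.
$literal = preg_quote('a.b');

var_dump($first[0], $all[0], $masked, $words, $numeric, $literal);
```

The third argument of preg_match() and preg_match_all() is filled by reference, which is why $first and $all do not need to be declared beforehand.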
7.
How can we use regex in PHP?
Those are the function calls, and we’ll play with them later.
First, we need to learn how to create regex patterns since we need those for any function call.
8.
Starting Pattern
/[A-Z0-9\._+=-]+@[A-Z0-9\.-]+\.[A-Z]{2,4}/i
This matches a series of letters, numbers, pluses, dashes, dots, underscores and equals signs, followed by an “AT” (@) sign, followed by a series of letters, numbers, dots and dashes, followed by a dot, followed by 2 to 4 letters.
In other words… It matches an email address… Or rather some email addresses.
9.
Matching Email Addresses
What about james@smithsonian.museum?
What about freddie@wherecanI.travel?
Both of those are valid email addresses, but they fail because our pattern only allows 2-4 character TLD parts for the email address.
How can we match all valid email addresses and only valid email addresses?
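A small sketch of the failure described above. It uses a version of the starting pattern with a + after the domain character class, so that it matches a series of characters as the prose describes, and it adds ^ and $ anchors (an assumption, covered later in the slides) so the whole string has to match. Without the anchors, preg_match() would accept a prefix such as james@smithsonian.muse.

```php
<?php
// Anchored variant of the starting pattern; the anchors are an addition for this sketch.
$pattern = '/^[A-Z0-9._+=-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$/i';

$addresses = [
    'dave@davidstockton.com',
    'james@smithsonian.museum',
    'freddie@wherecanI.travel',
];

$results = [];
foreach ($addresses as $address) {
    // true when the whole address fits the pattern
    $results[$address] = preg_match($pattern, $address) === 1;
}

var_dump($results);
```

Only the .com address passes; the six-letter TLDs fall outside the {2,4} range.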
So… How do we write this?
Don’t. Other, much simpler patterns have been written and will match 99.9% of valid email addresses.
Use something like Zend_Validate_EmailAddress
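As one possible alternative that the slides do not mention, PHP's built-in filter_var() with FILTER_VALIDATE_EMAIL performs this kind of check without a hand-written pattern. This is a sketch of that swapped-in approach, not the Zend validator named above.

```php
<?php
// filter_var() returns the address on success and false on failure.
$valid   = filter_var('james@smithsonian.museum', FILTER_VALIDATE_EMAIL);
$invalid = filter_var('not an email', FILTER_VALIDATE_EMAIL);

var_dump($valid !== false);   // the long TLD is accepted
var_dump($invalid !== false); // the malformed string is rejected
```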
13.
So now the real learnin’…
Letters and numbers match… letters and numbers
/a/ - Matches a string that contains an “a”
/7/ - Matches a string that contains a 7.
14.
More learnin’
Match a word
/regex/ - Matches a string with the word “regex” in it
You can use a pipe character to give a choice
/pizza|steak|cheeseburger/ - Matches a string with any of these foods
15.
Delimiters
The examples so far have started with / and ended with /.
These are delimiters and let the regex engine know where the pattern starts and ends.
You can choose another delimiter if you’d like or if it’s more convenient
Match namespace: #/My/PHP/Namespace#
If I used “/” in that example, I’d need to escape each of the forward slashes to differentiate them from the delimiter
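A short sketch of the delimiter choice in practice. The sample path and variable names are made up for illustration.

```php
<?php
$path = 'src/My/PHP/Namespace/Foo.php';

// With "/" as the delimiter, every literal slash has to be escaped.
$withSlashes = preg_match('/\/My\/PHP\/Namespace/', $path);

// With "#" as the delimiter, the slashes can be written as-is.
$withHashes = preg_match('#/My/PHP/Namespace#', $path);

var_dump($withSlashes, $withHashes); // both calls report a match
```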
16.
Character Matching Continued
You can match a selection of characters
/[Pp][Hh][Pp]/ - Matches PHP in any mixture of upper and lowercase
Ranges can be defined
/[abcdefghijklmnopqrstuvwxyz]/ - Matches any lowercase alpha character
/[a-z]/ - Matches any lowercase alpha character
17.
Character Selection Ranges
Ranges can be combined
/[A-Za-z0-9]/ - Matches an alphanumeric character
/[A-Fa-f0-9]/ - Matches any hex character
Character Selection can be inverted
/[^0-9]/ - Matches any non-digit character
/[^ ]/ - Matches any non-space character
/[.!@#$%^*]/ - Matches some punctuation
18.
Special Characters
Dot (.) matches any character (except a newline, by default)
/./ - Matches any one character
/../ - Matches any two characters
To match an actual dot character, you must escape
/\./ - Matches a single dot character
Unless it’s a character selection
/[.]/ - Matches a single dot character
19.
Character classes
\d means [0-9]
\D means non-digits - [^0-9]
\w means word characters - [A-Za-z0-9_]
\W means non-word characters - [^A-Za-z0-9_]
\s means a whitespace character - [ \t\n\r]
\S means non-whitespace characters
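The character selections, the dot and the shorthand classes from the last few slides can be tried out with a handful of preg_match() calls. This is a sketch with made-up sample strings; the array keys are only labels.

```php
<?php
$checks = [
    // "9" is a hex character, so the selection finds a match.
    'hex_found'       => preg_match('/[A-Fa-f0-9]/', 'xyz9'),
    // Every character is a digit, so the inverted selection finds nothing.
    'non_digit'       => preg_match('/[^0-9]/', '2023'),
    // \d needs at least one digit somewhere in the subject.
    'digit_in_abc'    => preg_match('/\d/', 'abc'),
    // Letters and the underscore are all word characters, so \W finds nothing.
    'non_word'        => preg_match('/\W/', 'snake_case'),
    // An unescaped dot stands for any character...
    'dot_any'         => preg_match('/a.c/', 'abc'),
    // ...while an escaped dot, or a dot in a selection, is a literal dot.
    'dot_escaped'     => preg_match('/a\.c/', 'abc'),
    'dot_in_brackets' => preg_match('/a[.]c/', 'a.c'),
];

var_dump($checks);
```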
20.
Repeating Character Classes
Match two digits in a row
/\d\d/
/[0-9][0-9]/
/\d{2}/
/[0-9]{2}/
Match at least one digit (but as many as it can)
/\d+/
Match 0 to infinite digits
/\d*/
21.
Repeating Character Classes cont.
* means match 0 or more
+ means match 1 or more
{x} where x is a number means match exactly x of the preceding selection
{x,} means match at least x
{x,y} means match between x and y
{0,y} means match up to y (older PCRE versions treat {,y} as literal text rather than a quantifier)
22.
More special characters
? means the preceding selection is optional
Putting it together
Telephone Number
/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/
Matches 720-675-7471 or (720)675-7471 or (720) 675-7471 or 7206757471 or 720 675 7471
Find a misspelled word (and get great deals on EBay)
/la[bp]topcomputer[s]?/
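The telephone pattern above can be run against each of the listed formats. A minimal sketch, with illustrative variable names:

```php
<?php
$pattern = '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/';
$samples = [
    '720-675-7471',
    '(720)675-7471',
    '(720) 675-7471',
    '7206757471',
    '720 675 7471',
];

$parts = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m)) {
        // $m[1], $m[2] and $m[3] hold the three captured digit groups.
        $parts[] = [$m[1], $m[2], $m[3]];
    }
}

var_dump($parts); // every sample yields the same three groups
```

The optional pieces (\(?, \)? and [\s-]?) are what let one pattern cover all five spellings.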
23.
Regex Anchors
Anchors allow you to specify a position, like before, after or in between characters
/^ab/ matches abcdefg but not cab
Notice that it’s the caret character… It means start of the string in this context, but it negates a character class when it is the first character inside the square brackets
/ab$/ matches cab but not abcdefg
/^[a-z]+$/ will match a string that consists only of lowercase characters
24.
Word Boundaries
\b means word boundaries
Before first character if first character is a word character
After last character if last character is a word character
Between two characters if one is a word character and the other is not
/\bfish\b/ matches fish, but not fisherman or catfish.
/fish\b/ matches fish and catfish
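The anchor and word boundary examples from these two slides can be checked directly. A sketch; the array keys are only labels and the subject strings are illustrative.

```php
<?php
$checks = [
    // Anchors pin the match to the start (^) or the end ($) of the string.
    'start'     => preg_match('/^ab/', 'abcdefg'),
    'not_start' => preg_match('/^ab/', 'cab'),
    'end'       => preg_match('/ab$/', 'cab'),
    'whole'     => preg_match('/^[a-z]+$/', 'abc123'),
    // \b sits between a word character and a non-word character (or the string edge).
    'fish'      => preg_match('/\bfish\b/', 'one fish'),
    'fisherman' => preg_match('/\bfish\b/', 'fisherman'),
    'catfish'   => preg_match('/fish\b/', 'catfish'),
];

var_dump($checks);
```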
25.
Alternation
/cow|boy/ - Matches cow, or boy or cowboy or coward, etc
/\b(cow|boy)\b/ - Matches cow or boy but not cowboy or coward
The above example also captures the matching word due to the parens. More on this later.
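A sketch comparing the two alternation patterns on one made-up sentence:

```php
<?php
$text = 'The cowboy saw a cow and a boy, the coward.';

// Plain alternation also hits "cow" and "boy" inside longer words.
preg_match_all('/cow|boy/', $text, $loose);

// Word boundaries around the group restrict matches to the whole words.
preg_match_all('/\b(cow|boy)\b/', $text, $strict);

var_dump($loose[0]);  // five hits, including the pieces of "cowboy" and "coward"
var_dump($strict[1]); // the captured words: "cow" and "boy"
```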
26.
Greedy vs Lazy
By default, regular expressions are greedy… That is, they will match as much as they can
Grab a starting html tag: /<.+>/
Matches the entire string: <h1>Welcome to FRPUG</h1>
Not what we want
Make it lazy: /<.+?>/
Now it matches only the <h1> in <h1>Welcome to FRPUG</h1>
27.
Another tag matching solution
/<[^>]+>/
Literally match a less than character followed by one or more non-greater than characters followed by a greater than character
This way eliminates the need for the engine to backtrack (almost certainly faster than the last example).
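The three tag patterns from the last two slides, side by side. A sketch using the same sample string:

```php
<?php
$html = '<h1>Welcome to FRPUG</h1>';

preg_match('/<.+>/', $html, $greedy);      // greedy: runs to the last ">"
preg_match('/<.+?>/', $html, $lazy);       // lazy: stops at the first ">"
preg_match('/<[^>]+>/', $html, $negated);  // selection that cannot run past a ">"

var_dump($greedy[0], $lazy[0], $negated[0]);
```

The lazy and negated versions return the same text here; they differ in how the engine gets there.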
28.
Capturing part of regex (backreference)
/__(construct|destruct)/
Backreference will contain either construct or destruct so you can use it later
/([a-z]+)\1/
Matches a group of characters that is immediately repeated, so the match always has an even length.
Matches aa but not a. In aaaaa it matches the first four characters (aaaa)
/([a-z]{3})\1/
Matches words like booboo or bambam
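A sketch of the backreference patterns above, showing what ends up in the match array. The subject strings are illustrative.

```php
<?php
// \1 must repeat exactly what the first group captured.
preg_match('/([a-z]{3})\1/', 'booboo', $word);
$noRepeat = preg_match('/([a-z]{3})\1/', 'bamboo'); // "bam" is not followed by "bam"

// With +, the engine backtracks until the group can be repeated in full.
preg_match('/([a-z]+)\1/', 'aaaaa', $run);
$single = preg_match('/([a-z]+)\1/', 'a');          // nothing left to repeat

// The captured alternative is available as group 1.
preg_match('/__(construct|destruct)/', 'function __destruct()', $method);

var_dump($word, $noRepeat, $run, $single, $method[1]);
```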
29.
Backreference Continued…
Very useful when performing regex search and replace
preg_replace('/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/', '(\1) \2-\3', $phone)
The above example will take any phone number from the previous example and return it formatted in (xxx) xxx-xxxx format
Non-capturing groups
Match an IPv4 address
/((?:\d{1,3}\.){3}\d{1,3})/
We’re matching 1 to 3 digits followed by a dot 3 times. We don’t care (right now) about the octets, we just want to repeat the match, so ?: says to not capture the group.
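A sketch of what the ?: changes in the match array. The log line is made up, and the second pattern (the same expression without ?:) is included only for comparison.

```php
<?php
$log = 'Connection from 192.168.1.10 accepted';

// Inner group is non-capturing: only the whole address is captured.
preg_match('/((?:\d{1,3}\.){3}\d{1,3})/', $log, $m);

// Inner group is capturing: an extra entry appears, holding the last repetition.
preg_match('/((\d{1,3}\.){3}\d{1,3})/', $log, $c);

var_dump($m); // whole match plus one group
var_dump($c); // whole match plus two groups
```

Like the slide's pattern, this does not check that each octet is 255 or less.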
32.
Pattern Modifiers
Modifiers go after the last delimiter (now you know why there are delimiters) and affect how the regex engine works
i – case insensitive matching (matches are case-sensitive by default)
m – multiline matching (^ and $ also match at the start and end of each line)
s – dot matches all characters, including \n
x – ignore all whitespace characters except if escaped or in a character class
33.
Pattern Modifiers Continued…
D – $ matches only at the very end of the string; otherwise $ also matches just before a trailing \n
Allow username to be alphabetic only
/^[A-Za-z]+$/ - This will match "dave\n" (note the trailing newline)
However, /^[A-Za-z]+$/D will not match
U – Invert the meaning of the greediness. With this on, matches are lazy by default and ? makes them greedy.
There are lots of other modifiers and you can see them at http://us2.php.net/manual/en/reference.pcre.pattern.modifiers.php
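A sketch of several modifiers in action. The username pattern here uses + so it can match more than one letter, and the input ends in a newline, which is the case the D modifier guards against. Sample strings are illustrative.

```php
<?php
$input = "dave\n"; // double quotes, so \n is a real newline

$withoutD = preg_match('/^[A-Za-z]+$/', $input);   // $ may sit before the trailing newline
$withD    = preg_match('/^[A-Za-z]+$/D', $input);  // $ must be at the very end

$caseInsensitive = preg_match('/php/i', 'PHP');    // i ignores case
$dotNoNewline    = preg_match('/a.b/', "a\nb");    // dot skips newlines by default
$dotAll          = preg_match('/a.b/s', "a\nb");   // s lets the dot match a newline

// U flips greediness, so .+ behaves lazily here.
preg_match('/<.+>/U', '<h1>Hi</h1>', $ungreedy);

var_dump($withoutD, $withD, $caseInsensitive, $dotNoNewline, $dotAll, $ungreedy[0]);
```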
34.
Named Capture Groups
Rather than get back a numbered array of matches, get back an associative array.
If you add a new capture group, you don’t have to renumber where you use the capture group
Named Capture Groups cont…
Combined numbered and associative array
Capture group 0 is the whole pattern that is matched.
If our string to match against was abcde720-675 7471foobar, $matches[0] will contain 720-675 7471
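The code from these slides did not survive in the transcript, so the following is a reconstruction under assumptions rather than the author's example: it reuses the telephone pattern from earlier and adds hypothetical group names (area, prefix, line) with the (?P<name>...) syntax.

```php
<?php
// Group names are illustrative; they are not taken from the slides.
$pattern = '/\(?(?P<area>\d{3})\)?[\s-]?(?P<prefix>\d{3})[\s-]?(?P<line>\d{4})/';

preg_match($pattern, 'abcde720-675 7471foobar', $matches);

// $matches holds the whole match at index 0, and every named group
// under both its name and its number.
var_dump($matches[0]);
var_dump($matches['area'], $matches[1]);
var_dump($matches['line'], $matches[3]);
```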
37.
Positive Look Ahead Matches
Look for a pattern followed by another pattern
/p(?=h)/ - Match a “p” followed by an “h” but don’t include the “h”
38.
Negative Look Ahead
Look for a pattern which is not followed by some other pattern
/p(?!h)/ - p not followed by h.
39.
Look Aheads
Positive and negative look aheads do not capture anything. They just determine if the pattern match is possible
They are zero-width
/p[^h]/ is not the same as /p(?!h)/
/ph/ is not the same as /p(?=h)/
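A sketch of why those pairs differ, using the string php. PREG_OFFSET_CAPTURE is used so each result also reports where the match starts.

```php
<?php
// The lookahead checks for the "h" but leaves it out of the match.
preg_match('/p(?=h)/', 'php', $ahead);
preg_match('/ph/', 'php', $plain);

// [^h] needs an actual character after the "p"; (?!h) is also happy at the end of the string.
$classResult = preg_match('/p[^h]/', 'php');
preg_match('/p(?!h)/', 'php', $negative, PREG_OFFSET_CAPTURE);

var_dump($ahead[0], $plain[0], $classResult, $negative[0]);
```

The negative look ahead matches the final p (offset 2), which the selection version cannot, because nothing follows it.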
40.
Look behinds
Positive look behind
/(?<=oo)d/ - d which is preceded by oo
Matches “food”, “mood”, match only contains the “d”
Negative look behind
/(?<!oo)d/ - d which is not preceded by oo
Matches “dude”, “crude”, and “d”
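The look behind examples can be checked the same way. A sketch, with PREG_OFFSET_CAPTURE showing that only the d is part of the match:

```php
<?php
// Positive look behind: the "oo" is required but not included in the match.
preg_match('/(?<=oo)d/', 'food', $positive, PREG_OFFSET_CAPTURE);
$positiveOnDude = preg_match('/(?<=oo)d/', 'dude');

// Negative look behind: the "d" must not have "oo" in front of it.
$negativeOnDude = preg_match('/(?<!oo)d/', 'dude');
$negativeOnFood = preg_match('/(?<!oo)d/', 'food');

var_dump($positive[0], $positiveOnDude, $negativeOnDude, $negativeOnFood);
```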
41.
With great power…
Test your regular expressions before they go to production
It’s much easier to get them wrong than to get them right if you don’t test
42.
When to not use regex
Whenever they aren’t needed.
If you can use strstr or strpos or str_replace to do the job, do that. They are much faster, much simpler and easier to do correctly.
However, if you cannot use those functions, regex may be your best bet.
Don’t use regex when you really need a parser
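A sketch of the plain string functions named above doing jobs that need no pattern at all. The sample string is illustrative.

```php
<?php
$haystack = 'Front Range PHP User Group';

// strpos() returns the position or false, so compare strictly (position 0 is falsy).
$hasPhp = strpos($haystack, 'PHP') !== false;

// str_replace() swaps fixed text without any pattern syntax.
$renamed = str_replace('PHP', 'Regex', $haystack);

// strstr() returns the rest of the string starting at the first occurrence.
$tail = strstr($haystack, 'PHP');

var_dump($hasPhp, $renamed, $tail);
```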