Grokking regex

Understanding regular expressions gives developers a powerful tool for operations that would otherwise be tedious or difficult. This presentation covers how to build and test regular expressions so developers can start using them in their own code.

Published in: Technology

  1. 1. php[tek] 2014 David Stockton May 21, 2014 Grokking Regex
  2. 2. What are regular expressions?
  3. 3. Patterns to describe text
  4. 4. Regular
  5. 5. Extremely Powerful
  6. 6. Often Abused.
  7. 7. Regular Expression Joke
  8. 8. How to use regex in PHP ● The preg_* functions ○ Use Perl compatible regular expressions ○ Probably the most common regex syntax ● Don't use ereg_* functions
  9. 9. PHP Functions ● preg_match - Searches a subject for a match ● preg_match_all - Searches a subject for all matches ● preg_replace - Replaces a pattern with something else ● preg_split - Splits a string on a regex delimiter
  10. 10. PHP Functions ● preg_replace_callback - Replacement defined in a callback ● preg_grep - Returns the array elements that match a pattern ● preg_quote - Quotes regular expression characters ● preg_last_error - Returns the error code of the last regex function call
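
      A minimal sketch of the functions listed on the two slides above. The sample strings and variable names are assumptions made for illustration, not taken from the talk.

      ```php
      <?php
      // Hypothetical sample input, not from the talk.
      $subject = 'php 5.5, php 5.6';

      // preg_match: returns 1 on a match and fills $m with the first match.
      $found = preg_match('/\d\.\d/', $subject, $m);

      // preg_match_all: returns the number of matches and collects all of them.
      $count = preg_match_all('/\d\.\d/', $subject, $all);

      // preg_replace: replaces every match with the replacement string.
      $upper = preg_replace('/php/', 'PHP', $subject);

      // preg_split: splits on a regex delimiter (comma plus optional whitespace).
      $parts = preg_split('/,\s*/', $subject);

      // preg_replace_callback: computes each replacement in a callback.
      $bumped = preg_replace_callback('/\d$/', function (array $match): string {
          return (string) ($match[0] + 1);
      }, '5.5');

      // preg_grep: keeps only the matching array elements (keys are preserved).
      $digits = preg_grep('/^\d+$/', ['12', 'ab', '7']);

      // preg_last_error: PREG_NO_ERROR when the last regex call succeeded.
      $error = preg_last_error();

      // preg_quote: escapes regex metacharacters so text is matched literally.
      $quoted = preg_quote('5.5+');
      ```

      Here `$m[0]` holds "5.5", `$upper` is "PHP 5.5, PHP 5.6", and `$quoted` has a backslash in front of both the dot and the plus.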
  11. 11. Starting Pattern ● Matches letters, numbers, dots, underscores, plus and equals signs (1 or more) ● Followed by @ ● Followed by letters, numbers, dots and dashes (1 or more) ● Followed by a dot ● Followed by 2 to 4 letters /[A-Z0-9._+=]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i
  12. 12. What does it mean?
  13. 13. Email Addresses
  14. 14. Some Email Addresses
  15. 15. The "real" email address regex (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*) ... (the pattern continues in this style for several thousand more characters)
  16. 16. More "real" regex ... (the same pattern, continued to its end)
  17. 17. How do we implement this regex?
  18. 18. Time for real learning
  19. 19. Letters and Numbers Letters and numbers match... letters and numbers /a/ - Matches a string that contains "a" /7/ - Matches a string that contains a 7
  20. 20. Match a word /regex/ - Matches a string with the word "regex" in it
  21. 21. Match a choice of words Use pipe when you want a choice /pizza|steak|cheeseburger/
  22. 22. Delimiters ● So far, the delimiter has been / ● Delimiters tell the regex engine where the pattern starts and ends ● Other delimiters can be used: #My\\PHP\\Namespace#
  23. 23. Character Matching /[Pp][Hh][Pp]/ - Matches PHP in any case Define ranges /[abcdefghijklmnopqrstuvwxyz]/ - Any lower case alpha /[a-z]/ - Any lower case alpha
  24. 24. Character Ranges Combine Ranges: /[A-Za-z0-9]/ - Matches any alphanumeric /[A-Fa-f0-9]/ - Matches hex character Invert Character selection /[^0-9]/ - Non digit characters /[^ ]/ - Non space characters /[.!@#$%^&*]/ - Some punctuation
  25. 25. Special Characters Dot (.) matches any character /./ /../ - Matches any two characters To match an actual dot character, escape it /\./ Not needed in character selection /[.]/
  26. 26. Character Classes \d means [0-9] (digit; with the u modifier it can also match other Unicode digits) \D means [^0-9] \w means word characters - [A-Za-z0-9_] \W means non word - [^A-Za-z0-9_] \s means whitespace character [ \t\n\r] \S means non-whitespace characters
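
      To make the character matching rules from the last few slides concrete, here is a small sketch. The input strings are hypothetical and chosen only to show one match and one non-match per rule.

      ```php
      <?php
      // Hypothetical sample strings, chosen only for illustration.
      $isHex       = preg_match('/^[A-Fa-f0-9]+$/', 'c0ffee'); // 1: only hex characters
      $hasNonDigit = preg_match('/[^0-9]/', '2014');           // 0: every character is a digit
      $anyChar     = preg_match('/a.c/', 'abc');               // 1: dot matches any character
      $literalDot  = preg_match('/a\.c/', 'abc');              // 0: an escaped dot needs a real "."
      $dotInClass  = preg_match('/a[.]c/', 'a.c');             // 1: no escape needed inside [...]
      $wordChars   = preg_match('/^\w+$/', 'php_tek14');       // 1: letters, digits, underscore
      $whitespace  = preg_match('/\s/', 'php tek');            // 1: contains a space
      $noSpace     = preg_match('/^\S+$/', 'php tek');         // 0: the space is not \S
      ```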
  27. 27. Repetition Match two digits in a row ● /\d\d/ ● /[0-9][0-9]/ ● /\d{2}/ ● /[0-9]{2}/ Match at least one, as many as possible /\d+/ Zero or more: /\d*/
  28. 28. Repetition Repeated ● * match 0 or more ● + match 1 or more ● {x} match exactly x ● {x,} match x or more ● {0,y} match up to y (older PCRE versions treat {,y} as literal text) ● {x,y} match between x and y
  29. 29. More special characters ? - Preceding selection is optional
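
      The quantifiers above can be checked one at a time. This is a sketch with made-up inputs; each pattern is anchored with ^ and $ (covered later in the talk) so the count applies to the whole string.

      ```php
      <?php
      $twoDigits  = preg_match('/^\d{2}$/', '42');      // 1: exactly two digits
      $threeFails = preg_match('/^\d{2}$/', '421');     // 0: three digits is too many
      $oneOrMore  = preg_match('/^\d+$/', '');          // 0: + needs at least one digit
      $zeroOrMore = preg_match('/^\d*$/', '');          // 1: * accepts the empty string
      $atLeastTwo = preg_match('/^\d{2,}$/', '12345');  // 1: two or more
      $upToThree  = preg_match('/^\d{0,3}$/', '1234');  // 0: more than three digits
      $twoToFour  = preg_match('/^\d{2,4}$/', '123');   // 1: between two and four
      $optional   = preg_match('/^colou?r$/', 'color'); // 1: the "u" is optional
      ```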
  30. 30. Step by Step /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/
  31. 31. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Opening delimiter
  32. 32. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Optional open paren
  33. 33. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Capture group - Parens capture pattern inside
  34. 34. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Three digits (captured)
  35. 35. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Optional closing paren
  36. 36. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Space or dash character
  37. 37. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Optional space or dash character
  38. 38. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Another three digit capture group
  39. 39. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Optional space or dash character
  40. 40. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Capture group for four digits
  41. 41. Break it down /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Closing delimiter
  42. 42. More special characters Put it together: /\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/ Matches 720-675-7471 or (720)675-7471 or (720) 675-7471 or 7206757471 or 720 675 7471
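
      As a quick check of the walkthrough above, this sketch runs the phone pattern against the five formats listed on the slide plus a dotted number. The `$inputs` and `$results` names are assumptions for illustration.

      ```php
      <?php
      $pattern = '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/';

      // The five formats from the slide, plus a dotted one the pattern should reject.
      $inputs = ['720-675-7471', '(720)675-7471', '(720) 675-7471',
                 '7206757471', '720 675 7471', '720.675.7471'];

      $results = [];
      foreach ($inputs as $input) {
          // On a match, $m[1], $m[2] and $m[3] hold the three captured digit groups.
          $results[$input] = preg_match($pattern, $input, $m)
              ? [$m[1], $m[2], $m[3]]
              : null;
      }
      ```

      Every accepted format yields the same three captures (720, 675, 7471), while the dotted input is left as null because a dot is neither whitespace nor a dash.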
  43. 43. Phone number matching Does not match 720.675.7471 or a number of other formats. Other ways? Replace all non-digits, check for length of 10
  44. 44. PHP Codes $number = preg_replace( '/[^0-9]/', '', $potentialNumber ); $valid = strlen($number) == 10;
  45. 45. Regex Anchors
  46. 46. Specify Position With Anchors /^ab/ - Matches abcdefg but not cab /ab$/ - Matches cab but not abcdefg /^[a-z]+$/ - Matches a string of only lowercase characters
  47. 47. Word Boundaries \b means word boundaries ● Before first character if first character is word character ● After last character if word character ● Between two characters if one is a word character and the other isn't /\bfish\b/ matches fish but not fisherman or catfish /fish\b/ matches fish and catfish
  48. 48. Alternation /cow|boy/ Matches cow or boy or cowboy or coward, etc /\b(cow|boy)\b/ - Matches cow or boy but not cowboy or coward Parens capture the matching word - more on that later
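
      The anchor, word boundary and alternation examples from these slides can be verified directly. This is a minimal sketch; the phrase "one fish" is a made-up input.

      ```php
      <?php
      $startsWithAb = preg_match('/^ab/', 'abcdefg');          // 1
      $cabAtStart   = preg_match('/^ab/', 'cab');              // 0: "ab" is not at the start
      $endsWithAb   = preg_match('/ab$/', 'cab');              // 1
      $onlyLower    = preg_match('/^[a-z]+$/', 'regex');       // 1
      $wholeWord    = preg_match('/\bfish\b/', 'one fish');    // 1
      $fisherman    = preg_match('/\bfish\b/', 'fisherman');   // 0: no boundary after "fish"
      $catfish      = preg_match('/fish\b/', 'catfish');       // 1: only the end needs a boundary
      $cowboyLoose  = preg_match('/cow|boy/', 'cowboy');       // 1: "cow" is found inside
      $cowboyStrict = preg_match('/\b(cow|boy)\b/', 'cowboy'); // 0: neither is a whole word
      ```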
  49. 49. Greedy vs Lazy Default is greedy - match as much as possible Grab starting HTML tag: /<.+>/ Matches the entire string, not just the opening tag: <h1>Welcome to Tek</h1> Not what we want.
  50. 50. Make it lazy.
  51. 51. Lazy Matching /<.+?>/ Now matches only the opening tag <h1> in: <h1>Welcome to FRPUG</h1>
  52. 52. Another way to match tags /<[^>]+>/ Literally match: “Less than” followed by one or more non-“greater than” characters followed by a “greater than” character. Faster than the last example. No backtracking.
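
      The three tag patterns can be compared side by side on the slide's sample string. This is a sketch; the variable names are assumptions.

      ```php
      <?php
      $html = '<h1>Welcome to Tek</h1>';

      preg_match('/<.+>/', $html, $greedy);     // greedy: runs to the last ">"
      preg_match('/<.+?>/', $html, $lazy);      // lazy: stops at the first ">"
      preg_match('/<[^>]+>/', $html, $negated); // negated class: cannot run past ">"
      ```

      `$greedy[0]` is the whole string, while `$lazy[0]` and `$negated[0]` are both just the opening `<h1>` tag.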
  53. 53. Capture Part of Regex
  54. 54. Capturing Regex - Backreference /__(construct|destruct)/ Backreference will contain construct or destruct so you can use it later /([a-z]+)\1/ Matches repeated sequence of characters
  55. 55. Backreference /([a-z]{3})\1/ Matches words like booboo or bambam
  56. 56. Practical Backreference Uses Search and replace preg_replace('/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/', '(\1) \2-\3', $phone); Formats phone numbers from a variety of input styles as (xxx) xxx-xxxx
  57. 57. More Practical Backreferences preg_replace( '/\b(\w+)\s+\1\b/', '\1', $string ); Before: Replace duplicated words that that have inadvertently been left in. After: Replace duplicated words that have inadvertently been left in.
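
      A short sketch of the backreference slides in action, both inside a pattern and inside a replacement string. The sample inputs are hypothetical.

      ```php
      <?php
      // Inside the pattern, \1 must repeat whatever group 1 captured.
      $doubled    = preg_match('/([a-z]{3})\1/', 'bambam', $m); // 1, and $m[1] is "bam"
      $notDoubled = preg_match('/([a-z]{3})\1/', 'bamboo');     // 0

      // Inside the replacement, \1 \2 \3 insert the captured groups.
      $phone = preg_replace(
          '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/',
          '(\1) \2-\3',
          '720 675 7471'
      );

      // Collapse a word that is immediately repeated.
      $deduped = preg_replace('/\b(\w+)\s+\1\b/', '\1', 'words that that repeat');
      ```

      With these inputs `$phone` becomes "(720) 675-7471" and `$deduped` becomes "words that repeat".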
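  58. 58. Non-capturing groups Match an IPv4 address /((?:\d{1,3}\.){3}\d{1,3})/ ● (?:\d{1,3}\.) matches 1-3 digits followed by a dot, without capturing ● {3} repeats that group three times ● \d{1,3} matches the final 1-3 digits
  59. 59. Non-capturing groups Match an IPv4 address /((?:\d{1,3}\.){3}\d{1,3})/ ● (?:\d{1,3}\.) matches 1-3 digits followed by a dot, without capturing ● {3} repeats that group three times ● \d{1,3} matches the final 1-3 digits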
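
      To show what "non-capturing" means for the match array, this sketch applies the IPv4 pattern to a made-up log line. The sample text is an assumption.

      ```php
      <?php
      $pattern = '/((?:\d{1,3}\.){3}\d{1,3})/';

      preg_match($pattern, 'Server at 192.168.0.1 is up', $m);
      // $m holds only the full match and group 1; the (?:...) group adds no entry.

      // The pattern checks shape only, not that each number is at most 255.
      $loose = preg_match($pattern, '999.999.999.999');
      ```

      `$m` ends up with exactly two entries, both "192.168.0.1". Note that `$loose` is 1, so range validation would need a separate step.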
  60. 60. Pattern Modifiers Modifiers after the last delimiter: ● i - case insensitive matching ● m - multiline matching ● s - dot matches all characters, including \n ● x - ignore whitespace characters if not escaped or in a character class
  61. 61. More Pattern Modifiers ● D - $ matches only at the very end of the string, not before a trailing newline ● U - Invert the meaning of greediness Other modifiers can be seen here: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
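
      Each modifier from the two slides above can be seen with a one-line check. This is a sketch with hypothetical inputs, not code from the talk.

      ```php
      <?php
      $caseInsensitive = preg_match('/php/i', 'PHP');               // 1
      $multiline = preg_match_all('/^\w+$/m', "one\ntwo", $lines);  // 2: ^ and $ match on each line
      $dotAll    = preg_match('/a.b/s', "a\nb");                    // 1: dot also matches "\n"
      $dotNoS    = preg_match('/a.b/', "a\nb");                     // 0: without s, dot skips "\n"
      $extended  = preg_match('/ \d{3} - \d{4} /x', '675-7471');    // 1: pattern spaces are ignored
      $dollarLoose  = preg_match('/tek$/', "tek\n");                // 1: $ allows a final newline
      $dollarStrict = preg_match('/tek$/D', "tek\n");               // 0: D restricts $ to the very end
      $ungreedy  = preg_match('/<.+>/U', '<h1>Tek</h1>', $u);       // $u[0] is "<h1>"
      ```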
  62. 62. Named Capture Groups Instead of numbers, get back names No need to renumber in code later if you add another capture group
  63. 63. Named Capture Group - Phone
      preg_match('/
          \(?                    # opt. open paren
          (?P<area_code>\d{3})   # area code
          \)?                    # opt. closed paren
          [ -]?                  # opt. space or dash
          (?P<exchange>\d{3})    # exchange
          [ -]?                  # opt. space or dash
          (?P<number>\d{4})      # last 4 digits
          /x', // ignore spaces and comment stuff
          $number, $matches);
  64. 64. Named Capture Group Result array(7) { [0] => string(10) "7206757471" ['area_code'] => string(3) "720" [1] => string(3) "720" ['exchange'] => string(3) "675" [2] => string(3) "675" ['number'] => string(4) "7471" [3] => string(4) "7471" }
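
      A runnable sketch of the named capture example, assuming a sample input of "(720) 675-7471". One thing to watch when using the x modifier with / as the delimiter: the comments inside the pattern must not contain a slash, because PHP would read it as the end of the pattern.

      ```php
      <?php
      $pattern = '/
          \(?                     # optional open paren
          (?P<area_code>\d{3})    # area code
          \)?                     # optional close paren
          [ -]?                   # optional space or dash
          (?P<exchange>\d{3})     # exchange
          [ -]?                   # optional space or dash
          (?P<number>\d{4})       # last four digits
      /x';

      // Hypothetical input; any of the formats from the earlier slides would do.
      preg_match($pattern, '(720) 675-7471', $matches);

      // Each group is available by name and by number.
      $areaCode = $matches['area_code'];
      $exchange = $matches['exchange'];
      $number   = $matches['number'];
      ```

      The match array has seven entries: the full match plus a named and a numbered entry for each of the three groups.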
  65. 65. Positive Look Ahead Matches Find a pattern followed by another pattern /p(?=h)/ - Match a p followed by an "h" but don't include the "h" Matches "phone", "phish", "telegraph" Does not match "potassium"
  66. 66. Negative Look Ahead Look for a pattern which is not followed by some other pattern /p(?!h)/ - p not followed by h Matches potassium Does not match phone, telegraph or phish
  67. 67. Look aheads ● Positive and negative lookaheads do not capture anything ● They determine if a match is possible ● They are zero-width ● /p[^h]/ is not the same as /p(?!h)/ ● /ph/ is not the same as /p(?=h)/
  68. 68. Look behinds Positive Look Behind /(?<=oo)d/ - d preceded by oo - Matches the d in "food" and "mood" Negative Look Behind /(?<!oo)d/ - d not preceded by oo - Matches "dude", "crude" and "d"
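
      The lookahead and lookbehind slides can be sketched as follows. The word "stop" is an added, hypothetical input that shows why /p[^h]/ and /p(?!h)/ differ.

      ```php
      <?php
      $ahead     = preg_match('/p(?=h)/', 'phone', $m);  // 1, and $m[0] is just "p"
      $noAhead   = preg_match('/p(?=h)/', 'potassium');  // 0: the p is not followed by h
      $negAhead  = preg_match('/p(?!h)/', 'potassium');  // 1
      $endOfText = preg_match('/p(?!h)/', 'stop');       // 1: nothing follows, so no h follows
      $needsChar = preg_match('/p[^h]/', 'stop');        // 0: [^h] must consume a character
      $behind    = preg_match('/(?<=oo)d/', 'food');     // 1
      $negBehind = preg_match('/(?<!oo)d/', 'food');     // 0: the only d is preceded by oo
      $dude      = preg_match('/(?<!oo)d/', 'dude');     // 1
      ```

      Because lookarounds are zero-width, the match for /p(?=h)/ contains only the "p", never the "h".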
  69. 69. With Great Power... Test your regular expressions before they go to production It's much easier to get them wrong than to get them right if you don't test Use tools like Sublime Text, Atom
  70. 70. When to not use regex When they are not needed If you can use strstr, strpos or str_replace If you cannot use those, maybe regex is appropriate Don't use regex when you need a parser
  71. 71. Resources http://regular-expressions.info http://php.net/manual/en/ref.pcre.php http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
  72. 72. Photo Credits ● http://www.flickr.com/photos/justinbaeder/5317820857 (Hammer & Screw) ● http://www.flickr.com/photos/doug88888/5891638442 (Water Pattern) ● http://www.flickr.com/photos/mwparenteau/7566437660 (Laxative Cereal) ● http://www.flickr.com/photos/auyuchuco/3669864253 (Mantis Shrimp) ● http://www.flickr.com/photos/anderspiren/4678572968 (Spray Can) ● http://www.flickr.com/photos/dcmatt/473127479 (Comedy Club) ● http://www.flickr.com/photos/gschueler/72294706 (License Plate) ● http://www.flickr.com/photos/horiavarlan/4514164700 (Puzzle @ sign) ● http://www.flickr.com/photos/proimos/4199675334 (Facepalm) ● http://www.flickr.com/photos/mklapper/5812224468 (Teacher in Classroom) ● http://www.flickr.com/photos/light_arted/3927322326 (Anchor) ● http://www.flickr.com/photos/kpcauchi/5376768095 (Lizard) ● http://www.flickr.com/photos/focusshoot/5617788347 (Spider web) ● http://www.flickr.com/photos/oberazzi/318947873 (Cuff links)
  73. 73. dave@davidstockton.com
  74. 74. Please rate this talk https://joind.in/10642
