SlideShare a Scribd company logo
1 of 236
Andrei Zmievski            andrei@php.net • twitter: @a




                  Andrei’s Regex Clinic
userfriendly: line noise
xkcd: regular expressions
xkcd: regular expressions
xkcd: regular expressions
xkcd: regular expressions
a bit of history


     Regular expressions can be
     traced back to early research on
     the human nervous system
a bit of history


     Neurophysiologists Warren
     McCulloch and Walter Pitts
     developed a mathematical way
     of describing the neural
     networks
a bit of history


     Later, mathematician Stephen
     Kleene published a paper that
     introduced the concept of
     regular expressions that were
     used to describe “the algebra of
     regular sets”
a bit of history
# grep “^[A-Z]” *.txt




                        Ken Thompson, one of the
                        fathers of Unix, found a practical
                        application for them in the
                        various tools of the early OS
what are they good for?
 Literal string searches are fast but
 inflexible

 With regular expressions you can:

   Find out whether a certain pattern
   occurs in the text

   Locate strings matching a pattern and
   remove them or replace them with
   something else

   Extract the strings matching the pattern
terminology
              Regex
terminology
              Regex
              a pattern describing a set of strings




              abcdef
terminology




              Subject String
terminology

                   apple
              Subject String
              text that the regex is applied to
terminology




              Match
terminology

                      a
                      apple
                                    Match
              a portion of the string that is successfully
                       described by the regex
terminology
              Engine
terminology
              Engine
               A program or a library that obtains
               matches given a regex and a string




                   PCRE
regex flavors

Regular expressions are like ice cream

    Common base

    Many flavors
regex flavors

Three main types of engines that determine how
matching is done:

  DFA

  Traditional NFA

  POSIX NFA

For our purposes we discuss the regex flavor that
PHPs’ Perl-compatible regular expressions use
how an NFA engine works


     The engine bumps along the
     string trying to match the
     regex

     Sometimes it goes back and
     tries again
how an NFA engine works
     Two basic things to
     understand about the engine

       It will always return the
       earliest (leftmost) match it
       finds

       The topic of the day is isotopes.

       Given a choice it always
       favors match over a non-
       match
how an NFA engine works
     Two basic things to
     understand about the engine

       It will always return the
       earliest (leftmost) match it
       finds

       The topic of the day is isotopes.

       Given a choice it always
       favors match over a non-
       match
how an NFA engine works
     Two basic things to
     understand about the engine

       It will always return the
       earliest (leftmost) match it
       finds

       The topic of the day is isotopes.

       Given a choice it always
       favors match over a non-
       match
color legend
color legend

          regular expression
color legend

          regular expression



         subject string
color legend

          regular expression



         subject string



          match
Syntax
building blocks


 Regexes are like LEGOs

 Small pieces combined into larger
 ones using connectors

 Arbitrarily complex
characters

                a   4
     ordinary     0 K
                x
characters

                  a     4
     ordinary        0 K
                  x
     special       *  . !
                ^     ?
characters

                                        a     4
 Special set is a well-defined subset of
                                           0 K
 ASCII

 Ordinary set consist of all characters
                                        x
 not designated special

 Special characters are also called
 metacharacters
                                         *  . !
                                      ^     ?
matching literals

              The most basic regex consists of a
              single ordinary character

              It matches the first occurrence of that
              character in the string

              Characters can be added together to
              form longer regexes
matching literals

              The most basic regex consists of a




123
              single ordinary character

              It matches the first occurrence of that
              character in the string

              Characters can be added together to
              form longer regexes
extended characters


To match an extended character, use xhh notation where hh are
hexadecimal digits

To match Unicode characters (in UTF-8 mode) mode use x{hhh..}
notation
Андрей
For example, the following pattern matches my name in Cyrillic:

 x{0410}x{043d}x{0434}x{0440}x{0435}x{0439}




                     extended characters
metacharacters
To use one of these literally, escape it, that
is prepend it with a backslash                   . [] ()
                                                  ^$
                                                 *+?
                                                   {} |
metacharacters
To use one of these literally, escape it, that
is prepend it with a backslash                   . [] ()
                                                  ^$
             $                                  *+?
                                                   {} |
metacharacters
To escape a sequence of characters, put
them between Q and E                    . [] ()
                                           ^$
                                          *+?
                                            {} |
metacharacters
To escape a sequence of characters, put
them between Q and E                    . [] ()
 Price is Q$12.36E                       ^$
             will match                   *+?
     Price is $12.36                        {} |
metacharacters
So will the backslashed version
                                  . [] ()
    Price is $12.36              ^$
             will match           *+?
     Price is $12.36                {} |
character classes   []
character classes                        []

 Consist of a set of characters placed
 inside square brackets

 Matches one and only one of the
 characters specified inside the class
character classes                            []
 matches an English vowel (lowercase)   [aeiou]


 matches surf or turf                   [st]urf
negated classes                                      [^]
 Placing a caret as the first character after the
 opening bracket negates the class

 Will match any character not in the class,
 including newlines

 [^<>] would match a character that is not left or
 right bracket
character ranges                         [-]
     Placing a dash (-) between two
     characters creates a range
     from the first one to the
     second one

     Useful for abbreviating a list of
     characters




          [a-z]
character ranges              [-]
     Ranges can be reversed




         [z-a]
character ranges                  [-]
     Ranges can be reversed

     A class can have more than
     one range and combine
     ranges with normal lists




     [a-z0-9:]
character ranges   [-]
character ranges                        [-]
  [0-9/]   matches a digit or a slash
character ranges                        [-]
  [0-9/]   matches a digit or a slash



  [z-w]    matches z, y, x, or w
character ranges                           [-]
  [0-9/]   matches a digit or a slash



  [z-w]    matches z, y, x, or w



[a-z0-9]   matches digits and lowercase letters
character ranges                              [-]
  [0-9/]      matches a digit or a slash



  [z-w]       matches z, y, x, or w



 [a-z0-9]     matches digits and lowercase letters



[x01-x1f]   matches control characters
shortcuts for ranges   [-]
w   word character         [A-Za-z0-9_]

  d   decimal digit          [0-9]

  s   whitespace             [ nrtf]

  W   not a word character   [^A-Za-z0-9_]

  D   not a decimal digit    [^0-9]

  S   not whitespace         [^ nrtf]



shortcuts for ranges                         [-]
classes and metacharacters


]     
            Inside a character class, most
            metacharacters lose their meaning

            Exceptions are:




^     -
classes and metacharacters


]     
            Inside a character class, most
            metacharacters lose their meaning

            Exceptions are:

              closing bracket




^     -
classes and metacharacters


]     
            Inside a character class, most
            metacharacters lose their meaning

            Exceptions are:

              closing bracket

              backslash



^     -
classes and metacharacters


]     
            Inside a character class, most
            metacharacters lose their meaning

            Exceptions are:

              closing bracket

              backslash



^     -       caret
classes and metacharacters


]     
            Inside a character class, most
            metacharacters lose their meaning

            Exceptions are:

              closing bracket

              backslash



^     -       caret

              dash
classes and metacharacters

[ab]]
           To use them literally, either escape them

 [ab^]
           with a backslash or put them where they
                 do not have special meaning




[a-z-]
dot metacharacter                .
 By default matches any single
 character
dot metacharacter                     .
 By default matches any single
 character

 Except a newline




                                 n
dot metacharacter       .

 Is equivalent to
                    [^n]
dot metacharacter                                .
                                        12345
 Use dot carefully - it might match
 something you did not intend           12945
 12.45 will match literal 12.45         12a45
 But it will also match these:
                                         12-45
                                      78812 45839
dot metacharacter                                .
                                        12345
 Use dot carefully - it might match
 something you did not intend           12945
 12.45 will match literal 12.45         12a45
 But it will also match these:
                                         12-45
                                      78812 45839
quantifiers



             We are almost never sure about
                the contents of the text.
quantifiers



             Quantifiers help us deal with this
                       uncertainty
quantifiers
                                                 ?

                                                 *
             Quantifiers help us deal with this
                       uncertainty
                                                 +
                                                 {}
quantifiers
                                              ?

              They specify how many times a   *
             regex component must repeat in
                order for the match to be
                       successful             +
                                              {}
repeatable components
             a
literal character
repeatable components
                    a
      literal character   w d s
.                         W D S
                           range shortcuts
dot metacharacter



              []                 subpattern
    character class
                                 backreference
zero-or-one                                                ?
 Indicates that the preceding component is optional

 Regex welcome!? will match either welcome or welcome!

 Regex supers?strong means that super and strong may have an
 optional whitespace character between them

 Regex hello[!?]? Will match hello, hello!, or hello?
Indicates that the preceding component has to appear once or
 more

 Regex a+h will match ah, aah, aaah, etc

 Regex -d+ will match negative integers, such as -33

 Regex [^”]+ means to match a sequence (more than one) of
 characters until the next quote




one-or-more                                                 +
zero-or-more                                                     *
 Indicates that the preceding component can match zero or more
 times

 Regex d+.d* will match 2., 3.1, 0.001

 Regex <[a-z][a-z0-9]*> will match an opening HTML tag with no
 attributes, such as <b> or <h2>, but not <> or </i>
general repetition                                    {}
 Specifies the minimum and the maximum number of times a
 component has to match

 Regex ha{1,3} matches ha, haa, haaa

 Regex d{8} matches exactly 8 digits

 If second number is omitted, no upper range is set

 Regex go{2,}al matches gooal, goooal, gooooal, etc
general repetition   {}
   {0,1}       =

   {1,}        =

   {0,}        =
general repetition       {}
   {0,1}       =     ?

   {1,}        =     +
   {0,}        =     *
greediness	



  matching as much as possible, up to a limit
greediness

 PHP 5?      PHP 5 is better than Perl 6




 d{2,4}             10/26/2004
greediness

 PHP 5?      PHP 5 is better than Perl 6




 d{2,4}             10/26/2004
greediness

 PHP 5?      PHP 5 is better than Perl 6




 d{2,4}             10/26/2004
                           2004
greediness


 Quantifiers try to grab as much as
 possible by default

 Applying <.+> to <i>greediness</i>
 matches the whole string rather than
 just <i>
greediness


 If the entire match fails because they
 consumed too much, then they are
 forced to give up as much as needed to
 make the rest of regex succeed
greediness

 To find words ending in ness, you will
 probably use w+ness

 On the first run w+ takes the whole
 word

 But since ness still has to match, it gives
 up the last 4 characters and the match
 succeeds
overcoming greediness

The simplest solution is to
make the repetition operators
non-greedy, or lazy

Lazy quantifiers grab as little
as possible

If the overall match fails, they
grab a little more and the
match is tried again
overcoming greediness
 *?

         To make a greedy quantifier
+?       lazy, append ?

         Note that this use of the
         question mark is different from
{ , }?   its use as a regular quantifier


 ??
overcoming greediness
 *?

         Applying <.+?>
+?
            to <i>
               <i>greediness</i>

{ , }?
         gets us <i>


 ??
overcoming greediness


            Another option is to use
            negated character classes

            More efficient and clearer than
            lazy repetition
overcoming greediness

            <.+?> can be turned into
            <[^>]+>

            Note that the second version
            will match tags spanning
            multiple lines

            Single-line version:
            <[^>rn]+>
assertions and anchors


 An assertion is a regex operator that

   expresses a statement about the current matching point

   consumes no characters
assertions and anchors


 The most common type of an assertion
 is an anchor

 Anchor matches a certain position in
 the subject string
caret                                             ^
                                             ^F
 Caret, or circumflex, is an anchor that
 matches at the beginning of the subject
 string

 ^F basically means that the subject
 string has to start with an F

                                           Fandango
caret                                             ^
                                             ^F
 Caret, or circumflex, is an anchor that
 matches at the beginning of the subject
 string

 ^F basically means that the subject
 string has to start with an F

                                           Fandango
                                           F
dollar sign                                           $
  Dollar sign is an anchor that matches at   d$
  the end of the subject string or right
  before the string-ending newline

  d$ means that the subject string has to
  end with a digit

  The string may be top 10 or top 10n,
  but either one will match
                                             top 10
dollar sign                                           $
  Dollar sign is an anchor that matches at   d$
  the end of the subject string or right
  before the string-ending newline

  d$ means that the subject string has to
  end with a digit

  The string may be top 10 or top 10n,
  but either one will match
                                             top 10
                                                  0
multiline matching

 Often subject strings consist of
                                        ^t.+
 multiple lines

 If the multiline option is set:

   Caret (^) also matches immediately
   after any newlines
                                        one
   Dollar sign ($) also matches
   immediately before any newlines      two
                                        three
multiline matching

 Often subject strings consist of
                                        ^t.+
 multiple lines

 If the multiline option is set:

   Caret (^) also matches immediately
   after any newlines
                                        one
   Dollar sign ($) also matches
   immediately before any newlines      two
                                        three
absolute start/end
 Sometimes you really want to match
 the absolute start or end of the subject
                                            At.+
 string when in the multiline mode

 These assertions are always valid:

   A matches only at the very beginning

   z matches only at the very end          three
   Z matches like $ used in single-line    tasty
   mode
                                            truffles
absolute start/end
 Sometimes you really want to match
 the absolute start or end of the subject
                                            At.+
 string when in the multiline mode

 These assertions are always valid:

   A matches only at the very beginning

   z matches only at the very end          three
   Z matches like $ used in single-line    tasty
   mode
                                            truffles
word boundaries                       b B
 btob         A word boundary is a position in the
                string with a word character (w) on one
                side and a non-word character (or
                string boundary) on the other

                b matches when the current position is
                a word boundary

                B matches when the current position is
right to vote   not a word boundary
word boundaries                         b B
 btob           A word boundary is a position in the
                  string with a word character (w) on one
                  side and a non-word character (or
                  string boundary) on the other

                  b matches when the current position is
                  a word boundary

                  B matches when the current position is
right |to| vote
       to         not a word boundary
word boundaries                       b B
 btob         A word boundary is a position in the
                string with a word character (w) on one
                side and a non-word character (or
                string boundary) on the other

                b matches when the current position is
                a word boundary

                B matches when the current position is
come together   not a word boundary
word boundaries                   b B
            A word boundary is a position in the
            string with a word character (w) on one
            side and a non-word character (or
            string boundary) on the other

            b matches when the current position is
            a word boundary

            B matches when the current position is
            not a word boundary
word boundaries                    b B
 B2B       A word boundary is a position in the
             string with a word character (w) on one
             side and a non-word character (or
             string boundary) on the other

             b matches when the current position is
             a word boundary

             B matches when the current position is
  doc2html
     2       not a word boundary
subpatterns                                  ()

 Parentheses can be used group a part of
 the regex together, creating a subpattern

 You can apply regex operators to a
 subpattern as a whole
grouping                                 ()

Regex is(land)? matches both is and
island

Regex (dd,)*dd will match a comma-
separated list of double-digit numbers
capturing subpatterns                                     ()

 All subpatterns by default are capturing

 A capturing subpattern stores the corresponding matched portion
 of the subject string in memory for later use
capturing subpatterns                                 ()
                                          (dd-(w+)-d{4})
 Subpatterns are numbered by counting
 their opening parentheses from left to
 right

 Regex (dd-(w+)-d{4}) has two
 subpatterns




                                           12-May-2004
capturing subpatterns                                 ()
                                          (dd-(w+)-d{4})
                                                (w+)
 Subpatterns are numbered by counting
 their opening parentheses from left to
 right

 Regex (dd-(w+)-d{4}) has two
 subpatterns

 When run against 12-May-2004 the
 second subpattern will capture May
                                           12-May-2004
capturing subpatterns                                 ()
                                          (dd-(w+)-d{4})
                                                (w+)
 Subpatterns are numbered by counting
 their opening parentheses from left to
 right

 Regex (dd-(w+)-d{4}) has two
 subpatterns

 When run against 12-May-2004 the
 second subpattern will capture May
                                           12-May-2004
                                              May
non-capturing subpatterns



 The capturing aspect of subpatterns is not always necessary

 It requires more memory and more processing time
non-capturing subpatterns


 Using ?: after the opening parenthesis makes a subpattern be a
 purely grouping one

 Regex box(?:ers)? will match boxers but will not capture anything

 The (?:) subpatterns are not included in the subpattern numbering
named subpatterns

 It can be hard to keep track of subpattern numbers in a
 complicated regex

 Using ?P<name> after the opening parenthesis creates a named
 subpattern

 Named subpatterns are still assigned numbers

 Pattern (?P<number>d+) will match and capture 99 into subpattern
 named number when run against 99 bottles
Alternation operator allows testing several sub-expressions at a
 given point

 The branches are tried in order, from left to right, until one
 succeeds

 Empty alternatives are permitted

 Regex sailing|cruising will match either sailing or cruising




alternation                                                       |
Since alternation has the lowest precedence, grouping is often
 necessary

 sixth|seventh sense will match the word sixth or the phrase seventh
 sense

 (sixth|seventh) sense will match sixth sense or seventh sense




alternation                                                       |
Remember that the regex engine is eager

 It will return a match as soon as it finds one

 camel|came|camera will only match came when run against camera

 Put more likely pattern as the first alternative




alternation                                               |
Applying ? to assertions is not permitted but..

 The branches may contain assertions, such as anchors for example

 (^|my|your) friend will match friend at the beginning of the string and
 after my or your




alternation                                                       |
backtracking

               Also known as “if at first you
               don’t succeed, try, try again”

               When faced with several
               options it could try to achieve a
               match, the engine picks one
               and remembers the others
backtracking


               If the picked option does not
               lead to an overall successful
               match, the engine backtracks
               to the decision point and tries
               another option
backtracking


               This continues until an overall
               match succeeds or all the
               options are exhausted

               The decision points include
               quantifiers and alternation
backtracking

                  Two important rules to
                       remember

               With greedy quantifiers the engine
               always attempts the match, and
               with lazy ones it delays the match

               If there were several decision
               points, the engine always goes back
               to the most recent one
backtracking example


d+00           12300
backtracking example


d+00           12300

                 start
backtracking example


d+00           12300
                1

                 add 1
backtracking example


d+00           12300
                12

                 add 2
backtracking example


d+00           12300
                123

                 add 3
backtracking example


d+00           12300
                1230

                 add 0
backtracking example


d+00           12300

                 add 0
backtracking example


d+00            12300

              string exhausted
           still need to match 00
backtracking example


d+00           12300
                1230

                give up 0
backtracking example


d+00           12300
                123

                give up 0
backtracking example


d+00           12300

                add 00
backtracking example


d+00           12300

                success
backtracking example


 d+ff          123dd
backtracking example


 d+ff          123dd

                  start
backtracking example


 d+ff          123dd
                1

                 add 1
backtracking example


 d+ff          123dd
                12

                 add 2
backtracking example


 d+ff          123
                123dd

                 add 3
backtracking example


 d+ff           123
                 123dd

            cannot match f here
backtracking example


 d+ff            123dd
                  12

                    give up 3
             still cannot match f
backtracking example


 d+ff            123dd
                  1

                    give up 2
             still cannot match f
backtracking example


 d+ff           123dd
                 1

            cannot give up more
               because of +
backtracking example


 d+ff          123dd

                 failure
backtracking example


 ab??c           abc
backtracking example


 ab??c           abc

                  start
backtracking example


 ab??c           abc
                 a

                 add a
backtracking example


 ab??c              abc
                    a

            skip matching b at first
backtracking example


 ab??c            abc
                  a

            c cannot match here
backtracking example


 ab??c            abc
                  ab

              go back and try
              matching b now
backtracking example


 ab??c            abc

             c can be matched
backtracking example


 ab??c           abc

                 success
atomic grouping


 Disabling backtracking can be useful

 The main goal is to speed up failed
 matches, especially with nested
 quantifiers
atomic grouping


 (?>regex) will treat regex as a single
 atomic token, no backtracking will
 occur inside it

 All the saved states are forgotten
atomic grouping


 (?>d+)ff will lock up all available digits
 and fail right away if the next two
 characters are not ff

 Atomic groups are not capturing
possessive quantifiers


            Atomic groups can be arbitrarily complex
            and nested

            Possessive quantifiers are simpler and
            apply to a single repeated item
possessive quantifiers


            To make a quantifier possessive append
            a single +

            d++ff is equivalent to (?>d+)ff
possessive quantifiers



            Other ones are *+, ?+, and {m,n}+

            Possessive quantifiers are always greedy
do not over-optimize




                                        !
 Keep in mind that atomic grouping and possessive quantifiers can
 change the outcome of the match

 When run against string abcdef

   w+d will match abcd

   w++d will not match at all

   w+ will match the whole string
backreferences                      n

 A backreference is an alias to a
 capturing subpattern

 It matches whatever the referent
 capturing subpattern has matched
backreferences                              n
 (re|le)w+1 matches words that start
 with re or le and end with the same
 thing

 For example, retire and legible, but not
 revocable or lecture

 Reference to a named subpattern can
 be made with (?P=name)
lookaround

 Assertions that test whether the characters before or after the
 current point match the given regex

 Consume no characters

 Do not capture anything

 Includes lookahead and lookbehind
positive lookahead                       (?=)

             Tests whether the characters after the
             current point match the given regex

            (w+)(?=:)(.*) matches surfing: a sport but
            colon ends up in the second subpattern
negative lookahead                        (?!)

 Tests whether the characters after the
 current point do not match the given
 regex

 fish(?!ing) matches fish not followed by
 ing

 Will match fisherman and fished
negative lookahead                           (?!)

 Difficult to do with character classes

 fish[^i][^n][^g] might work but will
 consume more than needed and fail on
 subjects shorter than 7 letters

 Character classes are no help at all with
 something like fish(?!hook|ing)
positive lookbehind                  (?<=)
             Tests whether the characters
             immediately preceding the current point
             match the given regex

             The regex must be of fixed size, but
             branches are allowed

             (?<=foo)bar matches bar only if
             preceded by foo, e.g. my foobar
negative lookbehind                         (?<!)
 Tests whether the characters
 immediately preceding the current point
 do not match the given regex

 Once again, regex must be of fixed size

 (?<!foo)bar matches bar only if not
 preceded by foo, e.g. in the bar but not
 my foobar
combining lookarounds

 Simply string them together

 Current matching point does not move

 foo(?=d{3})(?!999) matches foo followed by three digits that are not
 999

 Does not mean “match foo followed by 6 characters where the first
 3 are digits and the rest are not 999”
nesting lookarounds

 Nested lookarounds are also possible

 (?<=(?<!bar)foo)zoo matches zoo preceded by foo which in turn is
 not preceded by bar

 foo(?=d{3}(?!999)…) is the proper regex for the example on the
 previous slide
conditionals
 Conditionals let you apply a regex selectively or to choose between
 two regexes depending on a previous match

 (?(condition)yes-regex)

 (?(condition)yes-regex|no-regex)

 There are 3 kinds of conditions

   Subpattern match

   Lookaround assertion

   Recursive call (not discussed here)
subpattern conditions                                  (?(n))
 This condition is satisfied if the capturing subpattern number n has
 previously matched

 (“)? bw+b (?(1)”) matches words optionally enclosed by quotes

 There is a difference between (“)? and (“?) in this case: the second
 one will always capture

 b(?: (z) | w*x) w* (?(1) y )b matches words that contain x, or start
 with z and end in y, like axis and zoology
assertion conditions

 This type of condition relies on lookaround assertions to choose
 one path or the other

 href=(? (?=[‘”]) ([“’])S+1 | S+)

    Matches href=, then

    If the next character is single or double quote match a sequence of
    non-whitespace inside the matching quotes

    Otherwise just match it without quotes
recursion

 Allows matching that is impossible with conventional operators

 Expressions with arbitrary nesting depth present a problem

 Consider a math expression with an unknown number of
 parenthesized levels:

 (2-(3*(…)/4)+5)

 Recursive matching can help solve it
whole-pattern recursion
 The whole pattern can be recursed via (?R) construct

 ( ( (?>[d/*+-]+) | (?R) )* )

   Matches an opening parenthesis

   Then zero or more items which can be:

       A sequence of digits and arithmetic signs

       A recursive match of the whole pattern

   Matches a closing parenthesis

 Imagine that the whole pattern is plugged in place of (?R) over and
 over again
subpattern recursion
 In a larger pattern, it may be more convenient to recurse only a part
 of it

 In that case, (?n) syntax can be used to refer to the recursive
 subpattern

 Math: (( ( (?>[d/*+-]+) | (?1) )* ))

    Matches Math: followed by a recursive subpattern as before

    Only the first subpattern needs to recurse, so (?1) indicates that

 Don’t confuse (?1) with (?(1))
named subpattern recursion


 In still larger patterns it may be difficult to keep track of the number
 of the recursive subpattern

 Named subpatterns can be helpful in this case via (?P>name)
 construct

 Math: (?P<one>( ( (?>[d/*+-]+) | (?P>one) )* ))
recursion conditions


 This type of condition is satisfied if a recursive call to the pattern
 has been made

 At top level, the condition is false

 If the pattern has been recursed at least once, it is true
inline options
             The matching can be modified by options you
             put in the regular expression

    (?i)     enables case-insensitive mode


    (?m)     enables multiline matching for ^ and $


    (?s)     makes dot metacharacter match newline also


    (?x)     ignores literal whitespace


    (?U)     makes quantifiers ungreedy (lazy) by default
inline options

    (?i)         Options can be combined and unset

                 (?im-sx)
    (?m)
                 At top level, apply to the whole pattern
    (?s)         Localized inside subpatterns

    (?x)         (a(?i)b)c


    (?U)
comments   ?#
comments                                                ?#

 Here’s a regex I wrote when working on Smarty templating engine
comments                                                                                                                           ?#

  Here’s a regex I wrote when working on Smarty templating engine
 ^$w+(?>([(d+|$w+|w+(.w+)?)])|((.|->)$?w+))*(?>|@?w+(:(?>"[^"]*(?:.[^"]*)*"|'[^']*(?.[^']*) *'|[^|]+))*)*$
comments                                   ?#
             Let me blow that up for you


   ^$w+(?>([(d+|$w+|w+(.w+)?)])|
 ((.|->)$?w+))*(?>|@?w+(:(?>"[^"]*
      (?:.[^"]*)*"|'[^']*
      (?.[^']*)*'|[^|]+))*)*$
comments                                          ?#
              Let me blow that up for you


   ^$w+(?>([(d+|$w+|w+(.w+)?)])|
 ((.|->)$?w+))*(?>|@?w+(:(?>"[^"]*
      (?:.[^"]*)*"|'[^']*
      (?.[^']*)*'|[^|]+))*)*$

        Would you like some comments with that?
comments                                 ?#

 Most regexes could definitely use some
 comments

 (?#…) specifies a comment
comments                                 ?#

 Most regexes could definitely use some
 comments

 (?#…) specifies a comment


 d+(?# match some digits)
comments                                   ?#
 If (?x) option is set, anything after #
 outside a character class and up to the
 next newline is considered a comment

 To match literal whitespace, escape it
comments                                   ?#
 If (?x) option is set, anything after #
 outside a character class and up to the
 next newline is considered a comment

 To match literal whitespace, escape it



(?x) w+ # start with word characters
     [?!] # and end with ? or !
Regex Toolkit
regex toolkit

  In your day-to-day development, you will frequently find yourself
  running into situations calling for regular expressions

  It is useful to have a toolkit from which you can quickly draw the
  solution

  It is also important to know how to avoid problems in the regexes
  themselves
matching vs. validation


 In matching (extraction) the regex must
 account for boundary conditions

 In validation your boundary conditions
 are known – the whole string
matching vs. validation
 Matching an English word starting with
 a capital letter

  b[A-Z][a-zA-Z’-]*b
 Validating that a string fulfills the same
 condition

  ^[A-Z][a-zA-Z’-]*$
 Do not forget ^ and $ anchors for
 validation!
using dot properly
 One of the most used operators

 One of the most misused

 Remember - dot is a shortcut for [^n]

 May match more than you really want

 <.> will match <b> but also <!>, < >, etc

 Be explicit about what you want

 <[a-z]> is better
using dot properly

             When dot is combined with quantifiers it
             becomes greedy

             <.+> will consume any characters
             between the first bracket in the line and
             the last one

             Including any other brackets!
using dot properly

              It’s better to use negated character class
              instead

            <[^>]+> if bracketed expression spans lines

            <[^>rn]+> otherwise

              Lazy quantifiers can be used, but they
              are not as efficient, due to backtracking
optimizing unlimited repeats

  One of the most common problems is
  combining an inner repetition with an
  outer one

  If the initial match fails, the number of
  ways to split the string between the
  quantifiers grows exponentially

  The problem gets worse when the inner
  regex contains a dot, because it can
  match anything!
optimizing unlimited repeats
                                              (regex1|regex2|..)*
  One of the most common problems is
  combining an inner repetition with an
  outer one
                                                  (regex*)+
  If the initial match fails, the number of
  ways to split the string between the
  quantifiers grows exponentially
                                                  (regex+)*
  The problem gets worse when the inner
  regex contains a dot, because it can
  match anything!                                  (.*?bar)*
optimizing unlimited repeats
                                             (regex1|regex2|..)*

  PCRE has an optimization that helps in
  certain cases, and also has a                  (regex*)+
  hardcoded limit for the backtracking

  The best way to solve this is to prevent
  unnecessary backtracking in the first           (regex+)*
  place via atomic grouping or
  possessive quantifiers
                                                  (.*?bar)*
optimizing unlimited repeats

Consider the expression that is supposed to match a sequence of
words or spaces inside a quoted string

[“’](w+|s{1,2})*[“’]

When applied to the string “aaaaaaaaaa” (with final quote), it
matches quickly

When applied to the string “aaaaaaaaaa (no final quote), it runs 35
times slower!
optimizing unlimited repeats

We can prevent backtracking from going back to the matched
portion by adding a possessive quantifier:

[“’](w+|s{1,2})*+[“’]

With nested unlimited repeats, you should lock up as much of the
string as possible right away
removing duplicate items


 Naïve implementation:
   Match ([a-z]+) 1
   Replace with $1
   Has problems with This is island
removing duplicate items


 Better approach:
   Match b([a-z]+) 1b
   Replace with $1
   Handles This is island just fine
removing duplicate items
        Even better, concentrate on delimiters




     (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
removing duplicate items
        Even better, concentrate on delimiters
        First match a non-delimiter sequence




     (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
                    ([^s.,?!]+)
removing duplicate items
        Even better, concentrate on delimiters
        First match a non-delimiter sequence,
        that is preceded by a delimiter or
        beginning of string




     (?<=[s.,?!]|^)([^s.,?!]+)
     (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
removing duplicate items
        Even better, concentrate on delimiters
        First match a non-delimiter sequence,
        that is preceded by a delimiter or
        beginning of string
        Then match atomically one or more
        duplicates of the first match, separated
        by delimiters




     (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++
     (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
removing duplicate items
        Even better, concentrate on delimiters
        First match a non-delimiter sequence,
        that is preceded by a delimiter or
        beginning of string
        Then match atomically one or more
        duplicates of the first match, separated
        by delimiters
        And make sure it is followed by a
        delimiter or the end of the string



     (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
removing duplicate items
        Even better, concentrate on delimiters
        First match a non-delimiter sequence,
        that is preceded by a delimiter or
        beginning of string
        Then match atomically one or more
        duplicates of the first match, separated
        by delimiters
        And make sure it is followed by a
        delimiter or the end of the string
        Replace with $1

     (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
removing duplicate items
This is unwieldy. Let’s use PHP variable
interpolation.
$dlm = '[s.,?!]';
$n_dlm = '[^s.,?!]';
$s = preg_replace(”/
         (?<=$dlm|^)
         ($n_dlm+)
         ($dlm1)++
         (?=$dlm|$)
     /x”, '$1', $s);
removing multiline comments

 Simple, if the comments are not allowed to nest

 !/*.*?*/!s replaced with an empty string will work for C-like
 comments

 General pattern: /start.*?end/s

 For nested comments, a recursive pattern is necessary
extracting markup
   Possible to use preg_match_all() for grabbing marked up
   portions

   But for tokenizing approach, preg_split() is better



$s = 'a <b><I>test</I></b> of <br /> markup';
$tokens = preg_split(
     '!( < /? [a-zA-Z][a-zA-Z0-9]* [^/>]* /? > ) !x', $s, -1,
     PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

result is array('a','<b>','<I>','test','</I>',
                '</b>','of','<br />','markup')
restricting markup

 Suppose you want to strip all markup except for some allowed
 subset. What are your possible approaches?

   Use strip_tags() - which has limited functionality

   Multiple invocations of str_replace() or preg_replace() to
   remove script blocks, etc

   Custom tokenizer and processor, or..
restricting markup
 $s = preg_replace_callback(
      '! < (/?) ([a-zA-Z][a-zA-Z0-9]*) ([^/>]*) (/?) > !x',
      'my_strip', $s);

 function my_strip($match) {
     static $allowed_tags = array('b', 'i', 'p', 'br', 'a');
     $tag = $match[2];
     $attrs = $match[3];
     if (!in_array($tag, $allowed_tags)) return ‘’;
     if (!empty($match[1])) return "</$tag>";
     /* strip evil attributes here */
     if ($tag == 'a') { $attrs = ''; }
     /* any other kind of processing here */
     return "<$tag$attrs$match[4]>";
 }
matching numbers

Integers are easy: bd+b

Floating point numbers:

integer.fractional

.fractional

Can be covered by (bd+|B).d+b
matching numbers
To match both integers and floating
point numbers, either combine them
with alternation or use:
((bd+)?.)?bd+b

[+-]? can be prepended to any of these,
if sign matching is needed

b can be substituted by more
appropriate assertions based on the
required delimiters
matching quoted strings

 A simple case is a string that does not contain escaped quotes
 inside it

 Matching a quoted string that spans lines:

  “[^”]*”

 Matching a quoted string that does not span lines:

  “[^”rn]*”
matching quoted strings
           Matching a string with escaped quotes inside
                     “([^”]+|(?<=)”)*+”
matching quoted strings
           Matching a string with escaped quotes inside
                     “([^”]+|(?<=)”)*+”

    “      opening quote
matching quoted strings
           Matching a string with escaped quotes inside
                     “([^”]+|(?<=)”)*+”

    “      opening quote
    (      a component that is
matching quoted strings
           Matching a string with escaped quotes inside
                     “([^”]+|(?<=)”)*+”

     “     opening quote
     (     a component that is
   [^”]+   a segment without any quotes
matching quoted strings
           Matching a string with escaped quotes inside
                     “([^”]+|(?<=)”)*+”

     “     opening quote
     (     a component that is
   [^”]+   a segment without any quotes
     |     or
matching quoted strings
               Matching a string with escaped quotes inside
                         “([^”]+|(?<=)”)*+”

      “        opening quote
      (        a component that is
    [^”]+      a segment without any quotes
      |        or
  (?<=)”   a quote preceded by a backslash
matching quoted strings
               Matching a string with escaped quotes inside
                         “([^”]+|(?<=)”)*+”

      “        opening quote
      (        a component that is
    [^”]+      a segment without any quotes
      |        or
  (?<=)”   a quote preceded by a backslash
               component repeated zero or more times
     )*+
               without backtracking
matching quoted strings
               Matching a string with escaped quotes inside
                         “([^”]+|(?<=)”)*+”

      “        opening quote
      (        a component that is
    [^”]+      a segment without any quotes
      |        or
  (?<=)”   a quote preceded by a backslash
               component repeated zero or more times
     )*+
               without backtracking
      “        closing quote
matching e-mail addresses


 Yeah, right

 The complete regex is about as long as a book page in 10-point type

 Buy a copy of Jeffrey Friedl’s book and steal it from there
matching phone numbers

 Assuming we want to match US/Canada-style phone numbers

 800-555-1212          1-800-555-1212

 800.555.1212          1.800.555.1212

 (800) 555-1212        1 (800) 555-1212

 How would we do it?
matching phone numbers

 The simplistic approach could be:

 (1[ .-])? (? d{3} )? [ .-] d{3} [.-] d{4}

 But this would result in a lot of false positives:

 1.(800)-555 1212             800).555-1212

 1-800 555-1212               (800 555-1212
matching phone numbers
            ^(?:               anchor to the start of the string
        (?:1([.-]))?           may have 1. or 1- (remember the separator)
           d{3}               three digits
           ( (?(1)             if we had a separator
            1 |               match the same (and remember), otherwise
           [.-] ) )            match . or - as a separator (and remember)
           d{3}               another three digits
             2                same separator as before
           d{4}               final four digits
              |                or
1[ ]?(d{3})[ ]d{3}-d{4}   just match the third format
             )$                anchor to the end of the string
multi-requirement match
 Suppose you need to match a word between 6-10 characters long
 that contains sun or sail somewhere in it

 First use lookaround for the precondition and then describe the
 match

 b (?=w{6,10}b) w{0,3} (sun|sail) w*

 What if you wanted words that contains those substrings but do not
 start with a capital letter? Easy:

 b(?=w{6,10}b) (?![A-Z]) w{0,3}(sun|sail)w*
tips

       Don’t do everything in regex –
       a lot of tasks are best left to
       PHP

       Use string functions for simple
       tasks

       Make sure you know how
       backtracking works
tips

       Be aware of the context

       Capture only what you intend to
       use

       Don’t use single-character
       classes
tips


       Lazy vs. greedy, be specific

       Put most likely alternative first
       in the alternation list

       Think!
regex tools
 Rubular.com

 Regex buddy

 Komodo Regex tester (Rx toolkit)

 Reggy (reggyapp.com)

 RegExr (http://www.gskinner.com/RegExr/)

 http://www.spaweditor.com/scripts/regex/index.php

 http://regex.larsolavtorvik.com/
Thank You!
    Questions?


http://zmievski.org/talks

More Related Content

Viewers also liked

Viewers also liked (20)

Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
 
Field Extractions: Making Regex Your Buddy
Field Extractions: Making Regex Your BuddyField Extractions: Making Regex Your Buddy
Field Extractions: Making Regex Your Buddy
 
Regular Expressions: Backtracking, and The Little Engine that Could(n't)?
Regular Expressions: Backtracking, and The Little Engine that Could(n't)?Regular Expressions: Backtracking, and The Little Engine that Could(n't)?
Regular Expressions: Backtracking, and The Little Engine that Could(n't)?
 
44CON London 2015 - Indicators of Compromise: From malware analysis to eradic...
44CON London 2015 - Indicators of Compromise: From malware analysis to eradic...44CON London 2015 - Indicators of Compromise: From malware analysis to eradic...
44CON London 2015 - Indicators of Compromise: From malware analysis to eradic...
 
Regular Expression
Regular ExpressionRegular Expression
Regular Expression
 
All The Little Pieces
All The Little PiecesAll The Little Pieces
All The Little Pieces
 
Processing Regex Python
Processing Regex PythonProcessing Regex Python
Processing Regex Python
 
Php Unicode I18n
Php Unicode I18nPhp Unicode I18n
Php Unicode I18n
 
Seo campixx: Kleine Stilkunde "Texte"
Seo campixx: Kleine Stilkunde "Texte"Seo campixx: Kleine Stilkunde "Texte"
Seo campixx: Kleine Stilkunde "Texte"
 
Regular Expression
Regular ExpressionRegular Expression
Regular Expression
 
Regular Expression (Regex) Fundamentals
Regular Expression (Regex) FundamentalsRegular Expression (Regex) Fundamentals
Regular Expression (Regex) Fundamentals
 
Lecture: Regular Expressions and Regular Languages
Lecture: Regular Expressions and Regular LanguagesLecture: Regular Expressions and Regular Languages
Lecture: Regular Expressions and Regular Languages
 
Regular expression (compiler)
Regular expression (compiler)Regular expression (compiler)
Regular expression (compiler)
 
Introduction to Regular Expressions
Introduction to Regular ExpressionsIntroduction to Regular Expressions
Introduction to Regular Expressions
 
Test Mühendisliğine Giriş Eğitimi - Bölüm 1
Test Mühendisliğine Giriş Eğitimi - Bölüm 1Test Mühendisliğine Giriş Eğitimi - Bölüm 1
Test Mühendisliğine Giriş Eğitimi - Bölüm 1
 
SEO: Crawl Budget Optimierung & Onsite SEO
SEO: Crawl Budget Optimierung & Onsite SEOSEO: Crawl Budget Optimierung & Onsite SEO
SEO: Crawl Budget Optimierung & Onsite SEO
 
The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
The Good, the Bad, and the Ugly: What Happened to Unicode and PHP 6
 
Google Analytics fürs Content-Marketing
Google Analytics fürs Content-MarketingGoogle Analytics fürs Content-Marketing
Google Analytics fürs Content-Marketing
 
Pagespeed Learnings aus mehreren Relaunches (SEO Campixx 2017)
Pagespeed Learnings aus mehreren Relaunches (SEO Campixx 2017)Pagespeed Learnings aus mehreren Relaunches (SEO Campixx 2017)
Pagespeed Learnings aus mehreren Relaunches (SEO Campixx 2017)
 
Content-Produktion aus der Praxis – Von der Idee zum Inhalt – SEO Campixx 2017
Content-Produktion aus der Praxis – Von der Idee zum Inhalt – SEO Campixx 2017Content-Produktion aus der Praxis – Von der Idee zum Inhalt – SEO Campixx 2017
Content-Produktion aus der Praxis – Von der Idee zum Inhalt – SEO Campixx 2017
 

Similar to Andrei's Regex Clinic

Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrep
Tri Truong
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
Raghu nath
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
Raj Gupta
 

Similar to Andrei's Regex Clinic (20)

2013 - Andrei Zmievski: Clínica Regex
2013 - Andrei Zmievski: Clínica Regex2013 - Andrei Zmievski: Clínica Regex
2013 - Andrei Zmievski: Clínica Regex
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrep
 
Looking for Patterns
Looking for PatternsLooking for Patterns
Looking for Patterns
 
Python - Regular Expressions
Python - Regular ExpressionsPython - Regular Expressions
Python - Regular Expressions
 
Maxbox starter20
Maxbox starter20Maxbox starter20
Maxbox starter20
 
Adv. python regular expression by Rj
Adv. python regular expression by RjAdv. python regular expression by Rj
Adv. python regular expression by Rj
 
Regular expressions in oracle
Regular expressions in oracleRegular expressions in oracle
Regular expressions in oracle
 
Regular Expressions in Stata
Regular Expressions in StataRegular Expressions in Stata
Regular Expressions in Stata
 
Regex lecture
Regex lectureRegex lecture
Regex lecture
 
Regular_Expressions.pptx
Regular_Expressions.pptxRegular_Expressions.pptx
Regular_Expressions.pptx
 
A regex ekon16
A regex ekon16A regex ekon16
A regex ekon16
 
Python - Lecture 7
Python - Lecture 7Python - Lecture 7
Python - Lecture 7
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Chapter 3: Introduction to Regular Expression
Chapter 3: Introduction to Regular ExpressionChapter 3: Introduction to Regular Expression
Chapter 3: Introduction to Regular Expression
 
C# chap 10
C# chap 10C# chap 10
C# chap 10
 
Bioinformatica 06-10-2011-p2 introduction
Bioinformatica 06-10-2011-p2 introductionBioinformatica 06-10-2011-p2 introduction
Bioinformatica 06-10-2011-p2 introduction
 
Don't Fear the Regex - CapitalCamp/GovDays 2014
Don't Fear the Regex - CapitalCamp/GovDays 2014Don't Fear the Regex - CapitalCamp/GovDays 2014
Don't Fear the Regex - CapitalCamp/GovDays 2014
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Andrei's Regex Clinic

  • 1. Andrei Zmievski andrei@php.net • twitter: @a Andrei’s Regex Clinic
  • 7. a bit of history Regular expressions can be traced back to early research on the human nervous system
  • 8. a bit of history Neurophysiologists Warren McCulloch and Walter Pitts developed a mathematical way of describing the neural networks
  • 9. a bit of history Later, mathematician Stephen Kleene published a paper that introduced the concept of regular expressions that were used to describe “the algebra of regular sets”
  • 10. a bit of history # grep “^[A-Z]” *.txt Ken Thompson, one of the fathers of Unix, found a practical application for them in the various tools of the early OS
  • 11. what are they good for? Literal string searches are fast but inflexible With regular expressions you can: Find out whether a certain pattern occurs in the text Locate strings matching a pattern and remove them or replace them with something else Extract the strings matching the pattern
  • 12. terminology Regex
  • 13. terminology Regex a pattern describing a set of strings abcdef
  • 14. terminology Subject String
  • 15. terminology apple Subject String text that the regex is applied to
  • 16. terminology Match
  • 17. terminology a apple Match a portion of the string that is successfully described by the regex
  • 18. terminology Engine
  • 19. terminology Engine A program or a library that obtains matches given a regex and a string PCRE
  • 20. regex flavors Regular expressions are like ice cream Common base Many flavors
  • 21. regex flavors Three main types of engines that determine how matching is done: DFA Traditional NFA POSIX NFA For our purposes we discuss the regex flavor that PHPs’ Perl-compatible regular expressions use
  • 22. how an NFA engine works The engine bumps along the string trying to match the regex Sometimes it goes back and tries again
  • 23. how an NFA engine works Two basic things to understand about the engine It will always return the earliest (leftmost) match it finds The topic of the day is isotopes. Given a choice it always favors match over a non- match
  • 24. how an NFA engine works Two basic things to understand about the engine It will always return the earliest (leftmost) match it finds The topic of the day is isotopes. Given a choice it always favors match over a non- match
  • 25. how an NFA engine works Two basic things to understand about the engine It will always return the earliest (leftmost) match it finds The topic of the day is isotopes. Given a choice it always favors match over a non- match
  • 27. color legend regular expression
  • 28. color legend regular expression subject string
  • 29. color legend regular expression subject string match
  • 31. building blocks Regexes are like LEGOs Small pieces combined into larger ones using connectors Arbitrarily complex
  • 32. characters a 4 ordinary 0 K x
  • 33. characters a 4 ordinary 0 K x special * . ! ^ ?
  • 34. characters a 4 Special set is a well-defined subset of 0 K ASCII Ordinary set consist of all characters x not designated special Special characters are also called metacharacters * . ! ^ ?
  • 35. matching literals The most basic regex consists of a single ordinary character It matches the first occurrence of that character in the string Characters can be added together to form longer regexes
  • 36. matching literals The most basic regex consists of a 123 single ordinary character It matches the first occurrence of that character in the string Characters can be added together to form longer regexes
  • 37. extended characters To match an extended character, use xhh notation where hh are hexadecimal digits To match Unicode characters (in UTF-8 mode) mode use x{hhh..} notation
  • 38. Андрей For example, the following pattern matches my name in Cyrillic: x{0410}x{043d}x{0434}x{0440}x{0435}x{0439} extended characters
  • 39. metacharacters To use one of these literally, escape it, that is prepend it with a backslash . [] () ^$ *+? {} |
  • 40. metacharacters To use one of these literally, escape it, that is prepend it with a backslash . [] () ^$ $ *+? {} |
  • 41. metacharacters To escape a sequence of characters, put them between Q and E . [] () ^$ *+? {} |
  • 42. metacharacters To escape a sequence of characters, put them between Q and E . [] () Price is Q$12.36E ^$ will match *+? Price is $12.36 {} |
  • 43. metacharacters So will the backslashed version . [] () Price is $12.36 ^$ will match *+? Price is $12.36 {} |
  • 45. character classes [] Consist of a set of characters placed inside square brackets Matches one and only one of the characters specified inside the class
  • 46. character classes [] matches an English vowel (lowercase) [aeiou] matches surf or turf [st]urf
  • 47. negated classes [^] Placing a caret as the first character after the opening bracket negates the class Will match any character not in the class, including newlines [^<>] would match a character that is not left or right bracket
  • 48. character ranges [-] Placing a dash (-) between two characters creates a range from the first one to the second one Useful for abbreviating a list of characters [a-z]
  • 49. character ranges [-] Ranges can be reversed [z-a]
  • 50. character ranges [-] Ranges can be reversed A class can have more than one range and combine ranges with normal lists [a-z0-9:]
  • 52. character ranges [-] [0-9/] matches a digit or a slash
  • 53. character ranges [-] [0-9/] matches a digit or a slash [z-w] matches z, y, x, or w
  • 54. character ranges [-] [0-9/] matches a digit or a slash [z-w] matches z, y, x, or w [a-z0-9] matches digits and lowercase letters
  • 55. character ranges [-] [0-9/] matches a digit or a slash [z-w] matches z, y, x, or w [a-z0-9] matches digits and lowercase letters [x01-x1f] matches control characters
  • 57. w word character [A-Za-z0-9_] d decimal digit [0-9] s whitespace [ nrtf] W not a word character [^A-Za-z0-9_] D not a decimal digit [^0-9] S not whitespace [^ nrtf] shortcuts for ranges [-]
  • 58. classes and metacharacters ] Inside a character class, most metacharacters lose their meaning Exceptions are: ^ -
  • 59. classes and metacharacters ] Inside a character class, most metacharacters lose their meaning Exceptions are: closing bracket ^ -
  • 60. classes and metacharacters ] Inside a character class, most metacharacters lose their meaning Exceptions are: closing bracket backslash ^ -
  • 61. classes and metacharacters ] Inside a character class, most metacharacters lose their meaning Exceptions are: closing bracket backslash ^ - caret
  • 62. classes and metacharacters ] Inside a character class, most metacharacters lose their meaning Exceptions are: closing bracket backslash ^ - caret dash
  • 63. classes and metacharacters [ab]] To use them literally, either escape them [ab^] with a backslash or put them where they do not have special meaning [a-z-]
  • 64. dot metacharacter . By default matches any single character
  • 65. dot metacharacter . By default matches any single character Except a newline n
  • 66. dot metacharacter . Is equivalent to [^n]
  • 67. dot metacharacter . 12345 Use dot carefully - it might match something you did not intend 12945 12.45 will match literal 12.45 12a45 But it will also match these: 12-45 78812 45839
  • 68. dot metacharacter . 12345 Use dot carefully - it might match something you did not intend 12945 12.45 will match literal 12.45 12a45 But it will also match these: 12-45 78812 45839
  • 69. quantifiers We are almost never sure about the contents of the text.
  • 70. quantifiers Quantifiers help us deal with this uncertainty
  • 71. quantifiers ? * Quantifiers help us deal with this uncertainty + {}
  • 72. quantifiers ? They specify how many times a * regex component must repeat in order for the match to be successful + {}
  • 73. repeatable components a literal character
  • 74. repeatable components a literal character w d s . W D S range shortcuts dot metacharacter [] subpattern character class backreference
  • 75. zero-or-one ? Indicates that the preceding component is optional Regex welcome!? will match either welcome or welcome! Regex supers?strong means that super and strong may have an optional whitespace character between them Regex hello[!?]? Will match hello, hello!, or hello?
  • 76. Indicates that the preceding component has to appear once or more Regex a+h will match ah, aah, aaah, etc Regex -d+ will match negative integers, such as -33 Regex [^”]+ means to match a sequence (more than one) of characters until the next quote one-or-more +
  • 77. zero-or-more * Indicates that the preceding component can match zero or more times Regex d+.d* will match 2., 3.1, 0.001 Regex <[a-z][a-z0-9]*> will match an opening HTML tag with no attributes, such as <b> or <h2>, but not <> or </i>
  • 78. general repetition {} Specifies the minimum and the maximum number of times a component has to match Regex ha{1,3} matches ha, haa, haaa Regex d{8} matches exactly 8 digits If second number is omitted, no upper range is set Regex go{2,}al matches gooal, goooal, gooooal, etc
  • 79. general repetition {} {0,1} = {1,} = {0,} =
  • 80. general repetition {} {0,1} = ? {1,} = + {0,} = *
  • 81. greediness matching as much as possible, up to a limit
  • 82. greediness PHP 5? PHP 5 is better than Perl 6 d{2,4} 10/26/2004
  • 83. greediness PHP 5? PHP 5 is better than Perl 6 d{2,4} 10/26/2004
  • 84. greediness PHP 5? PHP 5 is better than Perl 6 d{2,4} 10/26/2004 2004
  • 85. greediness Quantifiers try to grab as much as possible by default Applying <.+> to <i>greediness</i> matches the whole string rather than just <i>
  • 86. greediness If the entire match fails because they consumed too much, then they are forced to give up as much as needed to make the rest of regex succeed
  • 87. greediness To find words ending in ness, you will probably use w+ness On the first run w+ takes the whole word But since ness still has to match, it gives up the last 4 characters and the match succeeds
  • 88. overcoming greediness The simplest solution is to make the repetition operators non-greedy, or lazy Lazy quantifiers grab as little as possible If the overall match fails, they grab a little more and the match is tried again
  • 89. overcoming greediness *? To make a greedy quantifier +? lazy, append ? Note that this use of the question mark is different from { , }? its use as a regular quantifier ??
  • 90. overcoming greediness *? Applying <.+?> +? to <i> <i>greediness</i> { , }? gets us <i> ??
  • 91. overcoming greediness Another option is to use negated character classes More efficient and clearer than lazy repetition
  • 92. overcoming greediness <.+?> can be turned into <[^>]+> Note that the second version will match tags spanning multiple lines Single-line version: <[^>rn]+>
  • 93. assertions and anchors An assertion is a regex operator that expresses a statement about the current matching point consumes no characters
  • 94. assertions and anchors The most common type of an assertion is an anchor Anchor matches a certain position in the subject string
  • 95. caret ^ ^F Caret, or circumflex, is an anchor that matches at the beginning of the subject string ^F basically means that the subject string has to start with an F Fandango
  • 96. caret ^ ^F Caret, or circumflex, is an anchor that matches at the beginning of the subject string ^F basically means that the subject string has to start with an F Fandango F
  • 97. dollar sign $ Dollar sign is an anchor that matches at d$ the end of the subject string or right before the string-ending newline d$ means that the subject string has to end with a digit The string may be top 10 or top 10n, but either one will match top 10
  • 98. dollar sign $ Dollar sign is an anchor that matches at d$ the end of the subject string or right before the string-ending newline d$ means that the subject string has to end with a digit The string may be top 10 or top 10n, but either one will match top 10 0
  • 99. multiline matching Often subject strings consist of ^t.+ multiple lines If the multiline option is set: Caret (^) also matches immediately after any newlines one Dollar sign ($) also matches immediately before any newlines two three
  • 100. multiline matching Often subject strings consist of ^t.+ multiple lines If the multiline option is set: Caret (^) also matches immediately after any newlines one Dollar sign ($) also matches immediately before any newlines two three
  • 101. absolute start/end Sometimes you really want to match the absolute start or end of the subject At.+ string when in the multiline mode These assertions are always valid: A matches only at the very beginning z matches only at the very end three Z matches like $ used in single-line tasty mode truffles
  • 102. absolute start/end Sometimes you really want to match the absolute start or end of the subject At.+ string when in the multiline mode These assertions are always valid: A matches only at the very beginning z matches only at the very end three Z matches like $ used in single-line tasty mode truffles
  • 103. word boundaries b B btob A word boundary is a position in the string with a word character (w) on one side and a non-word character (or string boundary) on the other b matches when the current position is a word boundary B matches when the current position is right to vote not a word boundary
  • 104. word boundaries b B btob A word boundary is a position in the string with a word character (w) on one side and a non-word character (or string boundary) on the other b matches when the current position is a word boundary B matches when the current position is right |to| vote to not a word boundary
  • 105. word boundaries b B btob A word boundary is a position in the string with a word character (w) on one side and a non-word character (or string boundary) on the other b matches when the current position is a word boundary B matches when the current position is come together not a word boundary
  • 106. word boundaries b B A word boundary is a position in the string with a word character (w) on one side and a non-word character (or string boundary) on the other b matches when the current position is a word boundary B matches when the current position is not a word boundary
  • 107. word boundaries b B B2B A word boundary is a position in the string with a word character (w) on one side and a non-word character (or string boundary) on the other b matches when the current position is a word boundary B matches when the current position is doc2html 2 not a word boundary
  • 108. subpatterns () Parentheses can be used group a part of the regex together, creating a subpattern You can apply regex operators to a subpattern as a whole
  • 109. grouping () Regex is(land)? matches both is and island Regex (dd,)*dd will match a comma- separated list of double-digit numbers
  • 110. capturing subpatterns () All subpatterns by default are capturing A capturing subpattern stores the corresponding matched portion of the subject string in memory for later use
  • 111. capturing subpatterns () (dd-(w+)-d{4}) Subpatterns are numbered by counting their opening parentheses from left to right Regex (dd-(w+)-d{4}) has two subpatterns 12-May-2004
  • 112. capturing subpatterns () (dd-(w+)-d{4}) (w+) Subpatterns are numbered by counting their opening parentheses from left to right Regex (dd-(w+)-d{4}) has two subpatterns When run against 12-May-2004 the second subpattern will capture May 12-May-2004
  • 113. capturing subpatterns () (dd-(w+)-d{4}) (w+) Subpatterns are numbered by counting their opening parentheses from left to right Regex (dd-(w+)-d{4}) has two subpatterns When run against 12-May-2004 the second subpattern will capture May 12-May-2004 May
  • 114. non-capturing subpatterns The capturing aspect of subpatterns is not always necessary It requires more memory and more processing time
  • 115. non-capturing subpatterns Using ?: after the opening parenthesis makes a subpattern be a purely grouping one Regex box(?:ers)? will match boxers but will not capture anything The (?:) subpatterns are not included in the subpattern numbering
  • 116. named subpatterns It can be hard to keep track of subpattern numbers in a complicated regex Using ?P<name> after the opening parenthesis creates a named subpattern Named subpatterns are still assigned numbers Pattern (?P<number>d+) will match and capture 99 into subpattern named number when run against 99 bottles
  • 117. Alternation operator allows testing several sub-expressions at a given point The branches are tried in order, from left to right, until one succeeds Empty alternatives are permitted Regex sailing|cruising will match either sailing or cruising alternation |
  • 118. Since alternation has the lowest precedence, grouping is often necessary sixth|seventh sense will match the word sixth or the phrase seventh sense (sixth|seventh) sense will match sixth sense or seventh sense alternation |
  • 119. Remember that the regex engine is eager It will return a match as soon as it finds one camel|came|camera will only match came when run against camera Put more likely pattern as the first alternative alternation |
  • 120. Applying ? to assertions is not permitted but.. The branches may contain assertions, such as anchors for example (^|my|your) friend will match friend at the beginning of the string and after my or your alternation |
  • 121. backtracking Also known as “if at first you don’t succeed, try, try again” When faced with several options it could try to achieve a match, the engine picks one and remembers the others
  • 122. backtracking If the picked option does not lead to an overall successful match, the engine backtracks to the decision point and tries another option
  • 123. backtracking This continues until an overall match succeeds or all the options are exhausted The decision points include quantifiers and alternation
  • 124. backtracking Two important rules to remember With greedy quantifiers the engine always attempts the match, and with lazy ones it delays the match If there were several decision points, the engine always goes back to the most recent one
  • 127. backtracking example d+00 12300 1 add 1
  • 128. backtracking example d+00 12300 12 add 2
  • 129. backtracking example d+00 12300 123 add 3
  • 130. backtracking example d+00 12300 1230 add 0
  • 132. backtracking example d+00 12300 string exhausted still need to match 00
  • 133. backtracking example d+00 12300 1230 give up 0
  • 134. backtracking example d+00 12300 123 give up 0
  • 135. backtracking example d+00 12300 add 00
  • 136. backtracking example d+00 12300 success
  • 139. backtracking example d+ff 123dd 1 add 1
  • 140. backtracking example d+ff 123dd 12 add 2
  • 141. backtracking example d+ff 123 123dd add 3
  • 142. backtracking example d+ff 123 123dd cannot match f here
  • 143. backtracking example d+ff 123dd 12 give up 3 still cannot match f
  • 144. backtracking example d+ff 123dd 1 give up 2 still cannot match f
  • 145. backtracking example d+ff 123dd 1 cannot give up more because of +
  • 146. backtracking example d+ff 123dd failure
  • 150. backtracking example ab??c abc a skip matching b at first
  • 151. backtracking example ab??c abc a c cannot match here
  • 152. backtracking example ab??c abc ab go back and try matching b now
  • 153. backtracking example ab??c abc c can be matched
  • 155. atomic grouping Disabling backtracking can be useful The main goal is to speed up failed matches, especially with nested quantifiers
  • 156. atomic grouping (?>regex) will treat regex as a single atomic token, no backtracking will occur inside it All the saved states are forgotten
  • 157. atomic grouping (?>d+)ff will lock up all available digits and fail right away if the next two characters are not ff Atomic groups are not capturing
  • 158. possessive quantifiers Atomic groups can be arbitrarily complex and nested Possessive quantifiers are simpler and apply to a single repeated item
  • 159. possessive quantifiers To make a quantifier possessive append a single + d++ff is equivalent to (?>d+)ff
  • 160. possessive quantifiers Other ones are *+, ?+, and {m,n}+ Possessive quantifiers are always greedy
  • 161. do not over-optimize ! Keep in mind that atomic grouping and possessive quantifiers can change the outcome of the match When run against string abcdef w+d will match abcd w++d will not match at all w+ will match the whole string
  • 162. backreferences n A backreference is an alias to a capturing subpattern It matches whatever the referent capturing subpattern has matched
  • 163. backreferences n (re|le)w+1 matches words that start with re or le and end with the same thing For example, retire and legible, but not revocable or lecture Reference to a named subpattern can be made with (?P=name)
  • 164. lookaround Assertions that test whether the characters before or after the current point match the given regex Consume no characters Do not capture anything Includes lookahead and lookbehind
  • 165. positive lookahead (?=) Tests whether the characters after the current point match the given regex (w+)(?=:)(.*) matches surfing: a sport but colon ends up in the second subpattern
  • 166. negative lookahead (?!) Tests whether the characters after the current point do not match the given regex fish(?!ing) matches fish not followed by ing Will match fisherman and fished
  • 167. negative lookahead (?!) Difficult to do with character classes fish[^i][^n][^g] might work but will consume more than needed and fail on subjects shorter than 7 letters Character classes are no help at all with something like fish(?!hook|ing)
  • 168. positive lookbehind (?<=) Tests whether the characters immediately preceding the current point match the given regex The regex must be of fixed size, but branches are allowed (?<=foo)bar matches bar only if preceded by foo, e.g. my foobar
  • 169. negative lookbehind (?<!) Tests whether the characters immediately preceding the current point do not match the given regex Once again, regex must be of fixed size (?<!foo)bar matches bar only if not preceded by foo, e.g. in the bar but not my foobar
  • 170. combining lookarounds Simply string them together Current matching point does not move foo(?=d{3})(?!999) matches foo followed by three digits that are not 999 Does not mean “match foo followed by 6 characters where the first 3 are digits and the rest are not 999”
  • 171. nesting lookarounds Nested lookarounds are also possible (?<=(?<!bar)foo)zoo matches zoo preceded by foo which in turn is not preceded by bar foo(?=d{3}(?!999)…) is the proper regex for the example on the previous slide
  • 172. conditionals Conditionals let you apply a regex selectively or to choose between two regexes depending on a previous match (?(condition)yes-regex) (?(condition)yes-regex|no-regex) There are 3 kinds of conditions Subpattern match Lookaround assertion Recursive call (not discussed here)
  • 173. subpattern conditions (?(n)) This condition is satisfied if the capturing subpattern number n has previously matched (“)? bw+b (?(1)”) matches words optionally enclosed by quotes There is a difference between (“)? and (“?) in this case: the second one will always capture b(?: (z) | w*x) w* (?(1) y )b matches words that contain x, or start with z and end in y, like axis and zoology
  • 174. assertion conditions This type of condition relies on lookaround assertions to choose one path or the other href=(? (?=[‘”]) ([“’])S+1 | S+) Matches href=, then If the next character is single or double quote match a sequence of non-whitespace inside the matching quotes Otherwise just match it without quotes
  • 175. recursion Allows matching that is impossible with conventional operators Expressions with arbitrary nesting depth present a problem Consider a math expression with an unknown number of parenthesized levels: (2-(3*(…)/4)+5) Recursive matching can help solve it
  • 176. whole-pattern recursion The whole pattern can be recursed via (?R) construct ( ( (?>[d/*+-]+) | (?R) )* ) Matches an opening parenthesis Then zero or more items which can be: A sequence of digits and arithmetic signs A recursive match of the whole pattern Matches a closing parenthesis Imagine that the whole pattern is plugged in place of (?R) over and over again
  • 177. subpattern recursion In a larger pattern, it may be more convenient to recurse only a part of it In that case, (?n) syntax can be used to refer to the recursive subpattern Math: (( ( (?>[d/*+-]+) | (?1) )* )) Matches Math: followed by a recursive subpattern as before Only the first subpattern needs to recurse, so (?1) indicates that Don’t confuse (?1) with (?(1))
  • 178. named subpattern recursion In still larger patterns it may be difficult to keep track of the number of the recursive subpattern Named subpatterns can be helpful in this case via (?P>name) construct Math: (?P<one>( ( (?>[d/*+-]+) | (?P>one) )* ))
  • 179. recursion conditions This type of condition is satisfied if a recursive call to the pattern has been made At top level, the condition is false If the pattern has been recursed at least once, it is true
  • 180. inline options The matching can be modified by options you put in the regular expression (?i) enables case-insensitive mode (?m) enables multiline matching for ^ and $ (?s) makes dot metacharacter match newline also (?x) ignores literal whitespace (?U) makes quantifiers ungreedy (lazy) by default
  • 181. inline options (?i) Options can be combined and unset (?im-sx) (?m) At top level, apply to the whole pattern (?s) Localized inside subpatterns (?x) (a(?i)b)c (?U)
  • 182. comments ?#
  • 183. comments ?# Here’s a regex I wrote when working on Smarty templating engine
  • 184. comments ?# Here’s a regex I wrote when working on Smarty templating engine ^$w+(?>([(d+|$w+|w+(.w+)?)])|((.|->)$?w+))*(?>|@?w+(:(?>"[^"]*(?:.[^"]*)*"|'[^']*(?.[^']*) *'|[^|]+))*)*$
  • 185. comments ?# Let me blow that up for you ^$w+(?>([(d+|$w+|w+(.w+)?)])| ((.|->)$?w+))*(?>|@?w+(:(?>"[^"]* (?:.[^"]*)*"|'[^']* (?.[^']*)*'|[^|]+))*)*$
  • 186. comments ?# Let me blow that up for you ^$w+(?>([(d+|$w+|w+(.w+)?)])| ((.|->)$?w+))*(?>|@?w+(:(?>"[^"]* (?:.[^"]*)*"|'[^']* (?.[^']*)*'|[^|]+))*)*$ Would you like some comments with that?
  • 187. comments ?# Most regexes could definitely use some comments (?#…) specifies a comment
  • 188. comments ?# Most regexes could definitely use some comments (?#…) specifies a comment d+(?# match some digits)
  • 189. comments ?# If (?x) option is set, anything after # outside a character class and up to the next newline is considered a comment To match literal whitespace, escape it
  • 190. comments ?# If (?x) option is set, anything after # outside a character class and up to the next newline is considered a comment To match literal whitespace, escape it (?x) w+ # start with word characters [?!] # and end with ? or !
  • 192. regex toolkit In your day-to-day development, you will frequently find yourself running into situations calling for regular expressions It is useful to have a toolkit from which you can quickly draw the solution It is also important to know how to avoid problems in the regexes themselves
  • 193. matching vs. validation In matching (extraction) the regex must account for boundary conditions In validation your boundary conditions are known – the whole string
  • 194. matching vs. validation Matching an English word starting with a capital letter b[A-Z][a-zA-Z’-]*b Validating that a string fulfills the same condition ^[A-Z][a-zA-Z’-]*$ Do not forget ^ and $ anchors for validation!
  • 195. using dot properly One of the most used operators One of the most misused Remember - dot is a shortcut for [^n] May match more than you really want <.> will match <b> but also <!>, < >, etc Be explicit about what you want <[a-z]> is better
  • 196. using dot properly When dot is combined with quantifiers it becomes greedy <.+> will consume any characters between the first bracket in the line and the last one Including any other brackets!
  • 197. using dot properly It’s better to use negated character class instead <[^>]+> if bracketed expression spans lines <[^>rn]+> otherwise Lazy quantifiers can be used, but they are not as efficient, due to backtracking
  • 198. optimizing unlimited repeats One of the most common problems is combining an inner repetition with an outer one If the initial match fails, the number of ways to split the string between the quantifiers grows exponentially The problem gets worse when the inner regex contains a dot, because it can match anything!
  • 199. optimizing unlimited repeats (regex1|regex2|..)* One of the most common problems is combining an inner repetition with an outer one (regex*)+ If the initial match fails, the number of ways to split the string between the quantifiers grows exponentially (regex+)* The problem gets worse when the inner regex contains a dot, because it can match anything! (.*?bar)*
  • 200. optimizing unlimited repeats (regex1|regex2|..)* PCRE has an optimization that helps in certain cases, and also has a (regex*)+ hardcoded limit for the backtracking The best way to solve this is to prevent unnecessary backtracking in the first (regex+)* place via atomic grouping or possessive quantifiers (.*?bar)*
  • 201. optimizing unlimited repeats Consider the expression that is supposed to match a sequence of words or spaces inside a quoted string [“’](w+|s{1,2})*[“’] When applied to the string “aaaaaaaaaa” (with final quote), it matches quickly When applied to the string “aaaaaaaaaa (no final quote), it runs 35 times slower!
  • 202. optimizing unlimited repeats We can prevent backtracking from going back to the matched portion by adding a possessive quantifier: [“’](w+|s{1,2})*+[“’] With nested unlimited repeats, you should lock up as much of the string as possible right away
  • 203. removing duplicate items Naïve implementation: Match ([a-z]+) 1 Replace with $1 Has problems with This is island
  • 204. removing duplicate items Better approach: Match b([a-z]+) 1b Replace with $1 Handles This is island just fine
  • 205. removing duplicate items Even better, concentrate on delimiters (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
  • 206. removing duplicate items Even better, concentrate on delimiters First match a non-delimiter sequence (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$) ([^s.,?!]+)
  • 207. removing duplicate items Even better, concentrate on delimiters First match a non-delimiter sequence, that is preceded by a delimiter or beginning of string (?<=[s.,?!]|^)([^s.,?!]+) (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
  • 208. removing duplicate items Even better, concentrate on delimiters First match a non-delimiter sequence, that is preceded by a delimiter or beginning of string Then match atomically one or more duplicates of the first match, separated by delimiters (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++ (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
  • 209. removing duplicate items Even better, concentrate on delimiters First match a non-delimiter sequence, that is preceded by a delimiter or beginning of string Then match atomically one or more duplicates of the first match, separated by delimiters And make sure it is followed by a delimiter or the end of the string (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
  • 210. removing duplicate items Even better, concentrate on delimiters First match a non-delimiter sequence, that is preceded by a delimiter or beginning of string Then match atomically one or more duplicates of the first match, separated by delimiters And make sure it is followed by a delimiter or the end of the string Replace with $1 (?<=[s.,?!]|^)([^s.,?!]+)([s.,?!]1)++(?=[s.,?!]|$)
  • 211. removing duplicate items This is unwieldy. Let’s use PHP variable interpolation. $dlm = '[s.,?!]'; $n_dlm = '[^s.,?!]'; $s = preg_replace(”/ (?<=$dlm|^) ($n_dlm+) ($dlm1)++ (?=$dlm|$) /x”, '$1', $s);
  • 212. removing multiline comments Simple, if the comments are not allowed to nest !/*.*?*/!s replaced with an empty string will work for C-like comments General pattern: /start.*?end/s For nested comments, a recursive pattern is necessary
  • 213. extracting markup Possible to use preg_match_all() for grabbing marked up portions But for tokenizing approach, preg_split() is better $s = 'a <b><I>test</I></b> of <br /> markup'; $tokens = preg_split( '!( < /? [a-zA-Z][a-zA-Z0-9]* [^/>]* /? > ) !x', $s, -1,      PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE); result is array('a','<b>','<I>','test','</I>', '</b>','of','<br />','markup')
  • 214. restricting markup Suppose you want to strip all markup except for some allowed subset. What are your possible approaches? Use strip_tags() - which has limited functionality Multiple invocations of str_replace() or preg_replace() to remove script blocks, etc Custom tokenizer and processor, or..
  • 215. restricting markup $s = preg_replace_callback( '! < (/?) ([a-zA-Z][a-zA-Z0-9]*) ([^/>]*) (/?) > !x', 'my_strip', $s); function my_strip($match) { static $allowed_tags = array('b', 'i', 'p', 'br', 'a');     $tag = $match[2];     $attrs = $match[3];     if (!in_array($tag, $allowed_tags)) return ‘’;     if (!empty($match[1])) return "</$tag>"; /* strip evil attributes here */     if ($tag == 'a') { $attrs = ''; }     /* any other kind of processing here */     return "<$tag$attrs$match[4]>"; }
  • 216. matching numbers Integers are easy: bd+b Floating point numbers: integer.fractional .fractional Can be covered by (bd+|B).d+b
  • 217. matching numbers To match both integers and floating point numbers, either combine them with alternation or use: ((bd+)?.)?bd+b [+-]? can be prepended to any of these, if sign matching is needed b can be substituted by more appropriate assertions based on the required delimiters
  • 218. matching quoted strings A simple case is a string that does not contain escaped quotes inside it Matching a quoted string that spans lines: “[^”]*” Matching a quoted string that does not span lines: “[^”rn]*”
  • 219. matching quoted strings Matching a string with escaped quotes inside “([^”]+|(?<=)”)*+”
  • 220. matching quoted strings Matching a string with escaped quotes inside “([^”]+|(?<=)”)*+” “ opening quote
  • 221. matching quoted strings Matching a string with escaped quotes inside “([^”]+|(?<=)”)*+” “ opening quote ( a component that is
  • 222. matching quoted strings Matching a string with escaped quotes inside “([^”]+|(?<=)”)*+” “ opening quote ( a component that is [^”]+ a segment without any quotes
  • 223. matching quoted strings Matching a string with escaped quotes inside “([^”]+|(?<=)”)*+” “ opening quote ( a component that is [^”]+ a segment without any quotes | or
  • 224. matching quoted strings Matching a string with escaped quotes inside “([^”]+|(?<=)”)*+” “ opening quote ( a component that is [^”]+ a segment without any quotes | or (?<=)” a quote preceded by a backslash
  • 225. matching quoted strings Matching a string with escaped quotes inside “([^”]+|(?<=)”)*+” “ opening quote ( a component that is [^”]+ a segment without any quotes | or (?<=)” a quote preceded by a backslash component repeated zero or more times )*+ without backtracking
  • 226. matching quoted strings Matching a string with escaped quotes inside “([^”]+|(?<=)”)*+” “ opening quote ( a component that is [^”]+ a segment without any quotes | or (?<=)” a quote preceded by a backslash component repeated zero or more times )*+ without backtracking “ closing quote
  • 227. matching e-mail addresses Yeah, right The complete regex is about as long as a book page in 10-point type Buy a copy of Jeffrey Friedl’s book and steal it from there
  • 228. matching phone numbers Assuming we want to match US/Canada-style phone numbers 800-555-1212 1-800-555-1212 800.555.1212 1.800.555.1212 (800) 555-1212 1 (800) 555-1212 How would we do it?
  • 229. matching phone numbers The simplistic approach could be: (1[ .-])? (? d{3} )? [ .-] d{3} [.-] d{4} But this would result in a lot of false positives: 1.(800)-555 1212 800).555-1212 1-800 555-1212 (800 555-1212
  • 230. matching phone numbers ^(?: anchor to the start of the string (?:1([.-]))? may have 1. or 1- (remember the separator) d{3} three digits ( (?(1) if we had a separator 1 | match the same (and remember), otherwise [.-] ) ) match . or - as a separator (and remember) d{3} another three digits 2 same separator as before d{4} final four digits | or 1[ ]?(d{3})[ ]d{3}-d{4} just match the third format )$ anchor to the end of the string
  • 231. multi-requirement match Suppose you need to match a word between 6-10 characters long that contains sun or sail somewhere in it First use lookaround for the precondition and then describe the match b (?=w{6,10}b) w{0,3} (sun|sail) w* What if you wanted words that contains those substrings but do not start with a capital letter? Easy: b(?=w{6,10}b) (?![A-Z]) w{0,3}(sun|sail)w*
  • 232. tips Don’t do everything in regex – a lot of tasks are best left to PHP Use string functions for simple tasks Make sure you know how backtracking works
  • 233. tips Be aware of the context Capture only what you intend to use Don’t use single-character classes
  • 234. tips Lazy vs. greedy, be specific Put most likely alternative first in the alternation list Think!
  • 235. regex tools Rubular.com Regex buddy Komodo Regex tester (Rx toolkit) Reggy (reggyapp.com) RegExr (http://www.gskinner.com/RegExr/) http://www.spaweditor.com/scripts/regex/index.php http://regex.larsolavtorvik.com/
  • 236. Thank You! Questions? http://zmievski.org/talks

Editor's Notes