Regular Expressions
      Redux
Scope

• medium to advanced
• 30 minutes
• performance / backtracking irrelevant
• no compatibility charts (yet)
TOC

• basic matching, quantifiers
• character classes, types, properties, anchors
• groups, options, replace string
• look...
RE overview

              match “foo”           replace with “bar”
  Perl        /foo/     (on $_)        s/foo/bar/ (on ...
Quantifiers
• classic greedy: ?, *, +
• specific:{1,5}, {,5}
  •   ? == {0,1}

  •   * == {0,}

  •   + == {1,}

• non-greed...
Example
This reveals that plain text is in fact the
technical user's way to regard a file or a
sequence of bytes. In this s...
Character Classes /
      Properties
• [0-9a-z]       (classes)
  •   \+420[0-9]{9} = simplified Czech phone number

  •   don’...
Character Types
• . == anything (apart from newline)
• \s == space == [\t\n\v\f\r ]
  •   more in Unicode

• \w == word char == c...
Example
This reveals that plain text is in fact the
technical user's way to regard a file or a
sequence of bytes. In this s...
Anchors

• ^ - beginning (line, string)
• $ - end (line, string)
• \b - word boundary ~ \w\W (almost)
 •   \b.{5}\b != \W\w{5}\W

•...
Options
• /foo/imsx
  •   i - case insensitive

  •   m - multiline (^, $ match at the start/end of each line)

  •   s - single ...
Groups/Replacing
• (...) - matched group
• $1 - $9
  •   alternatively \1 - \9 (not recommended)

• nested groups ordered by...
Example
"foobarman".replace(
  /(?:f)((o)+)(bar)|(baz|man)/g,
  '$1, $2, $3, $4, $5')

    • foobar               ...
Look-ahead/behind
• defines custom zero-width anchors
                   positive negative

          ahead     (?=...)   (...
Example

zdenek@gooddata.com
   /.*?@gooddata/


zdenek@gooddata.com
 /.*?(?=@gooddata)/
Recursive RE

• very important!
 •   quote & bracket matching

 •   technically not part of regular grammar

• two styles
...
Example
(?x:

 \( # match the initial opening parenthesis

 # Now make a named group 'balanced' which
     # matches a bala...

Advanced Regular Expressions Redux

Brief RE refresher with some more advanced topics - non-greedy quantifiers, character properties, nested group ordering, recursive expressions

Published in: Technology
  • escaping???
  • examples!
    possessive (?+, *+, ++)
  • unicode compat table!
  • notice the space at the end, capital reverses
  • how about /g??

  • Transcript of "Advanced Regular Expressions Redux"

    1. 1. Regular Expressions Redux
    2. 2. Scope • medium to advanced • 30 minutes • performance / backtracking irrelevant • no compatibility charts (yet)
    3. 3. TOC • basic matching, quantifiers • character classes, types, properties, anchors • groups, options, replace string • look-ahead/behind • subexpressions
    7. 7. RE overview
                      match “foo”        replace with “bar”
          Perl        /foo/ (on $_)      s/foo/bar/ (on $_)
          JavaScript  /foo/              "foolish".replace(/foo/, "bar")
          Vi          /foo/              :s/foo/bar/
          TextMate    ⌘-F, Find: foo     ⌘-F, Find: foo, Replace: bar
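
The JavaScript row of the overview can be tried directly. This is a minimal sketch; the variable names and the input string "foolish foo" are made up for illustration, and the second call adds the g flag, which the overview itself does not show.

```javascript
// Without the g flag only the first occurrence of "foo" is replaced.
const first = "foolish foo".replace(/foo/, "bar");

// With the g flag every occurrence is replaced.
const all = "foolish foo".replace(/foo/g, "bar");

console.log(first); // barlish foo
console.log(all);   // barlish bar
```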
    14. 14. Quantifiers • classic greedy: ?, *, + • specific:{1,5}, {,5} • ? == {0,1} • * == {0,} • + == {1,} • non-greedy: ??, *?, +?, {5,7}?
    18. 18. Example This reveals that plain text is in fact the technical user's way to regard a file or a sequence of bytes. In this sense, there is no plain text. /reveal(.*)plain/ /reveal(.*?)plain/ /t.{2,3}t/
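
The three patterns from this example can be run against the sample sentence to see how greedy and non-greedy quantifiers differ. A minimal sketch in JavaScript; the variable names are invented for illustration.

```javascript
const text =
  "This reveals that plain text is in fact the technical user's way " +
  "to regard a file or a sequence of bytes. In this sense, there is no plain text.";

// Greedy: .* grabs as much as possible, so the group runs to the LAST "plain".
const greedy = text.match(/reveal(.*)plain/)[1];

// Non-greedy: .*? grabs as little as possible, so the group stops at the FIRST "plain".
const lazy = text.match(/reveal(.*?)plain/)[1];

// Bounded: a "t", then two or three arbitrary characters, then another "t".
const bounded = text.match(/t.{2,3}t/)[0];

console.log(JSON.stringify(lazy));  // "s that "
console.log(greedy.length > lazy.length); // true
console.log(bounded);               // that
```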
    26. 26. Character Classes / Properties • [0-9a-z] (classes) • \+420[0-9]{9} = simplified Czech phone number • don’t: [A-z0-] • [a-z&&[^j-n]] == [a-io-z] • \p{Upper} (properties) • works great on Unicode text (Latin, Katakana) • [:alnum:], [:^space:] (POSIX bracket)
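
A few of these points can be checked in JavaScript, with some caveats. This is a sketch under assumptions: JavaScript spells properties as \p{Lu} or \p{Script=Katakana} and requires the u flag, rather than the \p{Upper} form shown on the slide, and it has no && intersection without the newer v flag, so the intersection is written out by hand. All variable names and test strings are illustrative.

```javascript
// Simplified Czech phone number: the plus sign must be escaped.
const czechPhone = /^\+420[0-9]{9}$/;
const phoneOk = czechPhone.test("+420123456789");

// [A-z] covers the whole ASCII range from "A" to "z",
// which also contains [ \ ] ^ _ and the backtick.
const sloppyMatchesUnderscore = /[A-z]/.test("_");

// Hand-written equivalent of [a-z&&[^j-n]].
const noJtoN = /^[a-io-z]+$/;
const accepted = noJtoN.test("regex"); // no letter between j and n
const rejected = noJtoN.test("plain"); // contains "l" and "n"

// Unicode properties need the u flag in JavaScript.
const upperOk = /\p{Lu}/u.test("Ž");
const lowerRejected = /\p{Lu}/u.test("ž");
const katakanaOk = /^\p{Script=Katakana}+$/u.test("カタカナ");

console.log(phoneOk, sloppyMatchesUnderscore, accepted, rejected);
console.log(upperOk, lowerRejected, katakanaOk);
```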
    32. 32. Character Types • . == anything (apart from newline) • \s == space == [\t\n\v\f\r ] • more in Unicode • \w == word char == approx. [0-9a-zA-Z_] • is complicated in Unicode • \d == digit == [0-9] • \h == hexadecimal digit == [0-9a-fA-F] (Oniguruma/TextMate; in Perl \h means horizontal whitespace) • \S \W \D == [^\s] [^\w] [^\d]
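
The shorthand types behave as follows in JavaScript. A minimal sketch with invented sample strings; note that JavaScript has no \h, and its \w is ASCII-only, which is one concrete case of "complicated in Unicode".

```javascript
// \s matches tabs, newlines and spaces alike.
const words = "tab\there  and\nthere".split(/\s+/);

// \d matches a digit.
const digits = "room 42b".match(/\d+/)[0];

// \w is [0-9a-zA-Z_] in JavaScript, so accented letters are NOT word characters.
const asciiWord = /^\w+$/.test("zdenek_1");
const accentedWord = /^\w+$/.test("zdeněk");

// Upper-case letters negate the type: \S is "anything but whitespace".
const masked = "a b".replace(/\S/g, "x");

console.log(words);        // [ 'tab', 'here', 'and', 'there' ]
console.log(digits);       // 42
console.log(asciiWord);    // true
console.log(accentedWord); // false
console.log(masked);       // x x
```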
    34. 34. Example This reveals that plain text is in fact the technical user's way to regard a file or a sequence of bytes. In this sense, there is no plain text. /\b[\w&&[^aA]]+\b/ /\W{2,}\w+\b/
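
JavaScript (without the v flag) has no && intersection, so the first pattern needs a rewrite there. One possible equivalent, offered as a sketch rather than the author's formulation, is the double negative [^\WaA]: "not a non-word character, and not a or A". The second pattern runs unchanged. Variable names are illustrative.

```javascript
const text =
  "This reveals that plain text is in fact the technical user's way " +
  "to regard a file or a sequence of bytes. In this sense, there is no plain text.";

// Whole words that contain neither "a" nor "A".
const wordsWithoutA = text.match(/\b[^\WaA]+\b/g);

// Two or more non-word characters followed by a word.
const afterPunctuation = text.match(/\W{2,}\w+\b/)[0];

console.log(wordsWithoutA.slice(0, 2));        // [ 'This', 'text' ]
console.log(JSON.stringify(afterPunctuation)); // ". In"
```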
    39. 39. Anchors • ^ - beginning (line, string) • $ - end (line, string) • \b - word boundary ~ \w\W (almost) • \b.{5}\b != \W\w{5}\W • zero width!
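
The difference between \b and \W comes from \b being zero width: it matches a position, while \W has to consume a character. A small JavaScript sketch with an invented sample string:

```javascript
const sample = "hello world";

// \b matches at the edges of the string, so both five-letter words qualify.
const withBoundary = /\b\w{5}\b/.test(sample);

// \W must consume a real character on BOTH sides, which neither word has.
const withNonWord = /\W\w{5}\W/.test(sample);

// Marking every boundary shows that \b consumes nothing.
const marked = sample.replace(/\b/g, "|");

console.log(withBoundary); // true
console.log(withNonWord);  // false
console.log(marked);       // |hello| |world|
```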
    43. 43. Options • /foo/imsx • i - case insensitive • m - multiline (^, $ match at the start/end of each line) • s - single line (. matches newlines) • x - extended! • g - global • can be written inline • (?imsx-imsx) • (?imsx-imsx:...)
          (?x-i)   #this is cool
          (
            foo    #my important value
            |      #don't forget the alternative
            bar
          )        # result equals to (foo|bar)
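
Most of these options can be observed in JavaScript. This sketch assumes an ES2018 or later engine for the s flag; JavaScript has no x flag, so the extended example above is not reproduced here. Variable names are illustrative.

```javascript
// i: case insensitive
const caseInsensitive = /foo/i.test("FOO");

// m: ^ and $ also match at line breaks, not only at the ends of the string
const lineStartDefault = /^b/.test("a\nb");
const lineStartMultiline = /^b/m.test("a\nb");

// s: the dot also matches newlines
const dotDefault = /a.b/.test("a\nb");
const dotAll = /a.b/s.test("a\nb");

// g: find all matches instead of only the first one
const matchCount = "foo foo".match(/foo/g).length;

console.log(caseInsensitive);                      // true
console.log(lineStartDefault, lineStartMultiline); // false true
console.log(dotDefault, dotAll);                   // false true
console.log(matchCount);                           // 2
```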
    48. 48. Groups/Replacing • (...) - matched group • $1 - $9 • alternatively \1 - \9 (not recommended) • nested groups ordered by left bracket • (?:...) - non-capturing group • useful for (?:foo)+ or (?:foo|bar)
    51. 51. Example "foobarman".replace( /(?:f)((o)+)(bar)|(baz|man)/g, '$1, $2, $3, $4, $5') • foobar: 1 -- oo, 2 -- o, 3 -- bar, 4 -- (empty) • man: 1 -- (empty), 2 -- (empty), 3 -- (empty), 4 -- man
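
The group numbering can be inspected directly by listing the captures of each match instead of substituting them. A sketch assuming an engine with String.prototype.matchAll (ES2020); in JavaScript a group that did not take part in the match is reported as undefined.

```javascript
const pattern = /(?:f)((o)+)(bar)|(baz|man)/g;

// One array of captures per match; index 0 of each match (the full text) is dropped.
const captures = [..."foobarman".matchAll(pattern)].map(m => m.slice(1));

// Groups are numbered by their opening bracket; (?:f) does not count.
console.log(captures[0]); // [ 'oo', 'o', 'bar', undefined ]
console.log(captures[1]); // [ undefined, undefined, undefined, 'man' ]
```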
    53. 53. Look-ahead/behind • defines custom zero-width anchors
                   positive   negative
          ahead    (?=...)    (?!...)
          behind   (?<=...)   (?<!...)
    54. 54. Example zdenek@gooddata.com /.*?@gooddata/ zdenek@gooddata.com /.*?(?=@gooddata)/
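
Running both patterns shows what the look-ahead changes: the text inside (?=...) has to be present, but it is not part of the match. A minimal JavaScript sketch; the variable names are illustrative.

```javascript
const address = "zdenek@gooddata.com";

// Plain pattern: "@gooddata" is consumed and ends up in the match.
const consumed = address.match(/.*?@gooddata/)[0];

// Look-ahead: "@gooddata" is only asserted, so the match stops before it.
const asserted = address.match(/.*?(?=@gooddata)/)[0];

console.log(consumed); // zdenek@gooddata
console.log(asserted); // zdenek
```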
    55. 55. Recursive RE • very important! • quote & bracket matching • technically not part of regular grammar • two styles • \g<name> or \g<n> - TextMate • (?R) - Perl
    57. 57. Example
          (?x:
            \(                # match the initial opening parenthesis
                              # Now make a named group 'balanced' which
                              # matches a balanced substring.
            (?<balanced>
              [^()]           # A balanced substring is either something
                              # that is not a parenthesis:
              |               # …or a parenthesised string:
              \(              # A parenthesised string begins with an opening parenthesis
              \g<balanced>*   # …followed by a sequence of balanced substrings
              \)              # …and ends with a closing parenthesis
            )*                # Look for a sequence of balanced substrings
            \)                # Finally, the outer closing parenthesis
          )
          or: \(([^()]|(?R))*\)
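
JavaScript regular expressions support neither \g<name> nor (?R), so the recursive pattern cannot be run there as written. As a rough stand-in, and explicitly not the technique shown on the slide, the recursion can be unrolled to a fixed nesting depth by building the pattern as a string. The helper name balancedPattern and its depth parameter are hypothetical.

```javascript
// Builds a regex matching a parenthesised string with at most `depth`
// further levels of nested parentheses inside the outer pair.
// This only approximates true recursion: deeper nesting is not matched as a whole.
function balancedPattern(depth) {
  let inner = "[^()]*"; // innermost level: no parentheses at all
  for (let i = 0; i < depth; i++) {
    // Either a non-parenthesis character or a parenthesised group one level down.
    inner = `(?:[^()]|\\(${inner}\\))*`;
  }
  return new RegExp(`\\(${inner}\\)`);
}

const nested = "f(a(b)c(d(e)))x";
const deepEnough = balancedPattern(2).exec(nested)[0];
const tooShallow = balancedPattern(0).exec("f(a(b)c)")[0];

console.log(deepEnough); // (a(b)c(d(e)))
console.log(tooShallow); // (b)
```

With a depth that is too small the pattern falls back to the first inner group it can match completely, which is the practical limit of this workaround compared with a truly recursive engine.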