Introduction to regular expressions


A quick-start introduction to the world of regular expressions, covering special characters, quantifiers, character classes and more.

Assumes no knowledge of regular expressions.

Published in: Technology
  • Plurals
  • There's a lot of shorthand when talking about Perl, e.g. "Array of Arrays". I'll try to avoid this shorthand.
  • See handout
  • Reject any match where the cursor is not now at the end of the input
  • There are loads on your handout

    Quick Intro to Regexen
    Brian McCauley (nobull)

    About this talk
      • For Perl Newbies
      • For Regex Newbies
      • Assumes programming experience
      • Only scratches surface
          • Full tutorial could last days
      • Takes some liberties
      • Somewhat revised compared to proceedings
      • Not suitable for world authorities!

    What is a RE?
      • Compact description of a set of strings
      • Notation does not a regex make
      • We're talking Perl notation

    Truly “Regular”?
      • “Regular expression” from formal language theory
      • True regular expressions only a tiny subset of what we commonly mean
      • Perl5 (Java, Ruby etc.) regex perhaps better called “patterns”
          • I'll tend to use the terms interchangeably

    Notational aside
      • Perl patterns conventionally written between //
      • One writes “the pattern /foo/”
          • Looks just like pattern match operator
          • But it's not
      • I'm talking about the pattern
      • I'm not talking about the match operator

    Simple regex syntax
      • Literal characters / tokens match a literal
          • Alphanumerics
          • Escaped non-alphanumerics
          • (Most) double-quotish escapes
      • Anything else may have special meaning
          • Without specials, a pattern describes one string
      • Concatenation is concatenation
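
The literal-token rules above can be sketched in a few lines. This example uses Python rather than Perl (an assumption of the sketch, not something the slides use); Python's `re` module accepts the same notation for these constructs, with the pattern written as a raw string instead of between slashes. The sample strings are made up for illustration.

```python
import re

# A pattern built only from literal tokens describes exactly one string.
literal = re.compile(r"foo")
only_foo = [s for s in ("foo", "fo", "fooo", "Foo") if literal.fullmatch(s)]

# A backslash turns a special non-alphanumeric into a literal:
# \. is a dot and \+ is a plus sign, not "any character" or "one or more".
escaped = re.compile(r"a\.b\+c")
dot_is_literal = escaped.fullmatch("a.b+c") is not None
dot_not_wild = escaped.fullmatch("aXb+c") is None

# Most string-style escapes work inside a pattern too: \t is a tab.
tab_matches = re.fullmatch(r"a\tb", "a\tb") is not None

# Concatenation is concatenation: "foo" then "bar" describes "foobar".
concatenated = re.fullmatch(r"foo" + r"bar", "foobar") is not None
```

`fullmatch` is used here so that each pattern is tested as a description of the whole string, which is the "describes a set of strings" view of a regex.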

    “Matches” v “Describes”
      • Initially said “RE describes a set of strings”
      • Why do I keep saying “matches”?
      • Can also think of a pattern as a bit of code
          • Passed an input string (and a cursor)
          • Locates string described by the RE (following the cursor)
          • May also record additional information

    “Matches” v “Matches”
      • People use “matches” loosely
      • Shorthand terminology
          • Usually clear from context
          • Confusion if shorthand taken literally

    Alternation
      • Match “this or that”
      • Lower precedence than concatenation
      • Parentheses DWIM
      • Grouping with parentheses has a side-effect
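
A small sketch of the precedence point, again in Python's `re` as a stand-in for Perl notation (an assumption of this example). The patterns and test words are illustrative, not taken from the talk.

```python
import re

words = ("gre", "ay", "grey", "gray")

# Alternation binds more loosely than concatenation,
# so this reads as "gre" or "ay".
loose = re.compile(r"gre|ay")
loose_hits = [w for w in words if loose.fullmatch(w)]

# Parentheses limit the alternation to the part you meant:
# "gr", then "e" or "a", then "y".
grouped = re.compile(r"gr(e|a)y")
grouped_hits = [w for w in words if grouped.fullmatch(w)]

# The side effect of grouping: the parentheses also capture
# the text that their part of the pattern matched.
captured = grouped.fullmatch("gray").group(1)
```

Here `loose_hits` is `["gre", "ay"]`, `grouped_hits` is `["grey", "gray"]`, and `captured` is `"a"`.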

    Character classes
      • Alternation of a single token (character)
      • Negation
          • /[^ac]/ any single character other than 'a' or 'c'
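
To make the "alternation of a single character" reading concrete, the sketch below compares a class with the equivalent alternation. It is written in Python (an assumption; the class syntax is the same as in Perl), and the class `[abc]` is a made-up example alongside the slide's `[^ac]`.

```python
import re

# A character class is an alternation over single characters.
as_class = re.compile(r"[abc]")
as_alternation = re.compile(r"a|b|c")
same_as_alternation = all(
    bool(as_class.fullmatch(ch)) == bool(as_alternation.fullmatch(ch))
    for ch in "abcdxyz"
)

# A leading ^ negates the class: any single character except 'a' or 'c'.
negated = re.compile(r"[^ac]")
negated_hits = [ch for ch in "abcd" if negated.fullmatch(ch)]

# A negated class still matches exactly one character, never zero.
negated_needs_a_char = negated.fullmatch("") is None
```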

    Shorthand character classes
      • The (almost) universal class
          • Sometimes any character at all (depends on switches)
      • “Well known” classes
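
A hedged sketch of these shorthands, assuming the "(almost) universal class" is the dot and the "well known" classes are `\d`, `\w` and `\s` (the transcript does not name them). Python's `re` is used in place of Perl, and `re.DOTALL` plays the role of Perl's `/s` switch. Only ASCII input is used, to stay clear of the encoding questions raised on the next slide.

```python
import re

# The dot matches any single character except a newline...
dot_matches_letter = re.fullmatch(r".", "x") is not None
dot_matches_newline = re.fullmatch(r".", "\n") is not None

# ...unless the "dot matches everything" switch is on.
dot_matches_newline_with_switch = (
    re.fullmatch(r".", "\n", re.DOTALL) is not None
)

# Well-known classes: digits, word characters, whitespace.
digits = re.findall(r"\d", "a1 b22")
word_chars = re.findall(r"\w", "a_1 !")
spaces = re.findall(r"\s", "a1 b22")
```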

    Character encoding
      • Beyond chr(127) “DWIM” gets complicated!
          • Locales, Unicode (the utf8 flag)
          • Exact version of Perl
          • Cited as one of the most annoying features in Perl

    Quantifiers
      • Match a number of repeats of pattern
      • Pattern, not string, repeated
      • Range (can be open-ended)
      • Precedence

    Quantifiers
      • Shorthand forms for well known ranges
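
The two quantifier slides can be illustrated together. This is a sketch in Python's `re` (an assumption; the `{n,m}`, `*`, `+` and `?` notation is shared with Perl), and `matches` is a hypothetical helper defined only for this example.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the pattern describes the whole of text."""
    return re.fullmatch(pattern, text) is not None

# A range of repeats, closed or open-ended.
in_range = matches(r"a{2,4}", "aaa")
over_range = matches(r"a{2,4}", "aaaaa")
open_ended = matches(r"a{2,}", "aaaaaaa")

# The pattern is repeated, not the string it matched the first time:
# [ab]{2} means "[ab] then [ab]", so "ba" is fine.
pattern_repeated = matches(r"[ab]{2}", "ba")

# Precedence: a quantifier applies to the single token before it,
# unless parentheses group a longer sub-pattern.
token_only = matches(r"ab{2}", "abb") and not matches(r"ab{2}", "abab")
whole_group = matches(r"(ab){2}", "abab")

# Shorthand forms are just well-known ranges.
pairs = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]
shorthand_agrees = all(
    matches(short, s) == matches(full, s)
    for short, full in pairs
    for s in ("", "a", "aa", "aaa")
)
```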

    Best match
      • Theoretical RE just defines a set of strings
      • Matching in Perl also says what it matched
          • But a lot of possible matches
          • 19 in all!
      • Choose the first match found
          • For some definition of “first”

    First match
      • Must match complete pattern
      • First starting position in input
      • First choice in alternation
      • Most repeats in repeat
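
Each of the four rules above can be observed directly. The sketch uses Python's `re.search`, which is assumed here to follow the same leftmost, first-alternative, greedy ordering as Perl for these simple patterns; the inputs are invented for illustration.

```python
import re

# First starting position wins, even if a longer match starts later.
first_start = re.search(r"b+", "abbbcbbbbb")
first_start_text = first_start.group()
first_start_span = first_start.span()

# First choice in an alternation wins, even if a later one is longer.
first_alternative = re.search(r"a|ab", "xab").group()

# Most repeats: a quantifier takes as many as it can.
most_repeats = re.search(r"a*", "aaab").group()

# But the complete pattern must match, so a quantifier gives back
# repeats when the rest of the pattern needs them.
gives_back = re.search(r"a*ab", "aaab").group()
```

Here `first_start_text` is `"bbb"` at span `(1, 4)`, `first_alternative` is `"a"`, `most_repeats` is `"aaa"`, and `gives_back` is `"aaab"`.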

    Non-greedy
      • Usual rule “as many repeats as possible”
      • Can also go for the fewest
      • Only useful in the context of a larger expression
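
A short sketch of greedy versus non-greedy repeats, using the `?` suffix that Python's `re` shares with Perl (Python is an assumption of the example). The tag-like input is a made-up illustration of why the fewest-repeats form matters mostly inside a larger expression.

```python
import re

text = "<b>bold</b>"

# Greedy: .+ takes as many characters as possible,
# so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group()

# Non-greedy: .+? takes as few as possible,
# so the match stops at the first ">".
lazy = re.search(r"<.+?>", text).group()

# On its own a non-greedy repeat is rarely useful:
# it just settles for the minimum.
alone_greedy = re.search(r"a+", "aaa").group()
alone_lazy = re.search(r"a+?", "aaa").group()
```

Here `greedy` is `"<b>bold</b>"`, `lazy` is `"<b>"`, `alone_greedy` is `"aaa"`, and `alone_lazy` is `"a"`.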

    Greedy but impatient
      • Remember (non-)greediness is local
      • This is sometimes called “eager” or “impatient”
          • I've got a complete match so take it
      • But “must match whole pattern” still applies
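
One reading of this slide is that each quantifier is greedy only for itself, while the engine as a whole accepts the first complete match it reaches. The sketch below illustrates that reading in Python's `re` (an assumption, as are the patterns and inputs); it is not the example used in the talk.

```python
import re

# Eager: an empty match at the first position is a complete match,
# so it is taken even though "xx" appears later in the input.
eager = re.search(r"x*", "aaxx")
eager_text = eager.group()
eager_span = eager.span()

# Greediness is local: a? greedily takes the "a", after which (ab)?
# can only match nothing. The overall result is "a", although "ab"
# was reachable if a? had matched nothing.
local_greed = re.search(r"a?(ab)?", "ab").group()

# "Must match whole pattern" still applies: the non-greedy a+? is
# forced to take every "a" so that the final "b" can match.
lazy_forced = re.search(r"a+?b", "aaab").group()
```

Here `eager_text` is the empty string at span `(0, 0)`, `local_greed` is `"a"`, and `lazy_forced` is `"aaab"`.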

    Anchors
      • Zero-width assertions - match the empty string
      • Only where something that I assert holds true
          • Gross simplification!
      • These assertions also called “anchors”
          • Using term “anchor” for the more complex zero-width assertions can result in false expectations
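
A minimal sketch of three common zero-width assertions, assuming the slide has `^`, `$` and `\b` in mind (the transcript does not list them). Python's `re` is used in place of Perl, and the inputs are illustrative.

```python
import re

# ^ asserts "the cursor is at the start of the input".
at_start = re.search(r"^foo", "foobar") is not None
not_at_start = re.search(r"^foo", "a foo") is not None

# $ asserts "the cursor is at the end of the input".
at_end = re.search(r"bar$", "foobar") is not None
not_at_end = re.search(r"bar$", "barfoo") is not None

# \b asserts "the cursor is at a word boundary".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat")

# The assertion itself consumes nothing: its match has zero width.
boundary = re.search(r"\b", "hi")
boundary_width = boundary.end() - boundary.start()
```

Here `whole_words` contains only the two standalone occurrences of `"cat"`, and `boundary_width` is `0`.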

    Capturing
      • Match can return more than overall position
      • Records last cursor position at each ( )
      • “captures” the bit between
          • $1='g'
          • $2='34'
          • $3='3'
      • There's an overhead so can group without capture
      [Figure: example pattern with its capture groups labelled 1, 2, 3]
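
The pattern and input behind the `$1`, `$2`, `$3` values are not in the transcript, so the sketch below uses a hypothetical pattern and input chosen to produce the same three captures. It is written in Python (an assumption), where `group(1)` corresponds to Perl's `$1`.

```python
import re

# Hypothetical pattern: groups are numbered by their opening parenthesis.
#   group 1: one lowercase letter
#   group 2: two digits
#   group 3: the first of those two digits (nested inside group 2)
pattern = re.compile(r"([a-z])((\d)\d)")
m = pattern.search("reg34x")

captures = m.groups()

# The match also records where each capture started and ended.
second_span = m.span(2)

# (?: ) groups without capturing, which avoids the bookkeeping.
non_capturing = re.search(r"(?:[a-z])(\d\d)", "reg34x").groups()
```

Here `captures` is `("g", "34", "3")`, `second_span` is `(3, 5)`, and `non_capturing` is `("34",)`.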

    Back references
      • Match whatever a previous capture matched
      [Figure annotations: 2nd capture – any single character; as few characters as possible; the character we captured before]
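
The slide's own pattern is not recoverable from the transcript, so this sketch uses hypothetical patterns built from the same ingredients: a captured single character, a non-greedy filler, and a reference back to the capture. Python's `re` is assumed; `\1` there means the same as in Perl.

```python
import re

# (\w)\1 : a word character followed by the same character again.
doubled = re.search(r"(\w)\1", "hello").group()
no_double = re.search(r"(\w)\1", "abc") is None

# (.).*?\1 : any single character, then as few characters as
# possible, then the character captured before.
shortest_repeat = re.search(r"(.).*?\1", "abcabc").group()
```

Here `doubled` is `"ll"`, `no_double` is `True`, and `shortest_repeat` is `"abca"`.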

    Switches
      • Vagueness earlier
      • Controlled by switches
          • Usually referred to as /i /m /x and /s
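
For readers following along in Python rather than Perl (an assumption of this sketch), the four switches correspond to the flags `re.IGNORECASE`, `re.MULTILINE`, `re.VERBOSE` and `re.DOTALL`. The patterns and inputs are invented for illustration.

```python
import re

# /i : ignore case.
case_insensitive = re.search(r"perl", "I like PERL", re.IGNORECASE) is not None

# /m : ^ and $ also match at the start and end of each line.
line_starts_default = re.findall(r"^\w+", "one\ntwo")
line_starts_multi = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# /s : the dot matches a newline as well.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_with_switch = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# /x : whitespace in the pattern is ignored and comments are allowed.
verbose = re.compile(
    r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
    """,
    re.VERBOSE,
)
verbose_ok = verbose.fullmatch("2024-05") is not None
```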

    The rest!
      • This is only a tiny subset
      • Lots more assertions
      • The Perl substitution operator s///
      • Naming your captures
      • Embedding Perl code in your regex
      • Creating complex grammars by defining named subpatterns and using them later
      • It would take an hour just to enumerate them!

    Live floor show
      • Requests?
      • Questions?