Regular Expressions 101

An introduction to regular expressions. With a slight twist towards CFML developers, but useful and appropriate for the general population.

Published in: Technology


  1. Regular Expressions Clinic. Kai Koenig, CFCAMP Munich 2013. Tuesday, 15 October 13
  2. Me
  3.
  4. Kai Koenig
     • Working with CFML since 1999
     • Developing for mobile since 2003
     • Recovering Flex developer being sucked into JS deeper and deeper
     • Recently rediscovered Functional Programming and sometimes enjoy dabbling with formal aspects of computer science
     • bloginblack.de, 2ddu.com, twitter.com/agentK
  5. So... what’s this about?
  6. Agenda
     • What is a RegEx?
     • History
     • Some theory: RE, Automatons & Formal Language
     • Pattern Matching
     • Syntax
     • Greediness and Laziness
     • Lookahead and Lookbehind
     • Common problems and Engines
  7.
  8. Is this how you feel?
  9. RegEx = pattern describing a set of character strings
  10. Use cases
      • Does a certain pattern occur in a text?
      • Locate occurrences and replace them with something else.
      • Grab the matched strings for further processing.
  11. History
      • 1950s: Stephen Kleene, an American mathematician, is credited with inventing regular expressions, using a mathematical notation called regular sets.
      • Ken Thompson and Henry Spencer used Kleene’s theory to implement RegEx-based searching in text editors.
  12. Some terms
      • RegEx: pattern describing a set of character strings
      • Subject: string or text we apply the RegEx to
      • Match: part of the subject that the RegEx successfully describes
  13. Example
      • Subject: Kai lives in Wellington in New Zealand.
      • RegEx: gt
      • Match: the “gt” in “Wellington”
  14. Theory
      • Regular expressions are an algebraic way to describe languages.
      • They describe exactly the regular languages.
      • If E is a regular expression, then L(E) is the language it defines.
  15. Definitions
      • Basis 1: If a is any symbol, then a is a RE, and L(a) = {a}. Note: {a} is the language containing one string, and that string is of length 1.
      • Basis 2: ε is a RE, and L(ε) = {ε}.
      • Basis 3: ∅ is a RE, and L(∅) = ∅.
  16. Definitions
      • Induction 1: If E1 and E2 are regular expressions, then E1+E2 is a regular expression, and L(E1+E2) = L(E1) ∪ L(E2).
      • Induction 2: If E1 and E2 are regular expressions, then E1E2 is a regular expression, and L(E1E2) = L(E1)L(E2).
      • Induction 3: If E is a RE, then E* is a RE, and L(E*) = (L(E))*.
  17. Examples
      • L(01) = {01}.
      • L(01+0) = {01, 0}.
      • L(0(1+0)) = {01, 00}.
      • Note the order of precedence: concatenation binds tighter than +, so 01+0 means (01)+0.
      • L(0*) = {ε, 0, 00, 000, …}.
      • L((0+10)*(ε+1)) = all strings of 0’s and 1’s without two consecutive 1’s.
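
The last example on slide 17 can be checked by brute force. This is a minimal sketch in Python, not part of the original slides. It assumes the textbook `+` (union) is written as `|` and `(ε+1)` as `1?`; the helper names `no_double_one` and `all_binary_strings` are made up for illustration.

```python
import re
from itertools import product

# Textbook (0+10)*(ε+1) in Python syntax: union '+' becomes '|', (ε+1) becomes '1?'.
PATTERN = re.compile(r"(?:0|10)*1?")

def no_double_one(s):
    # Direct statement of the property: no two consecutive 1's.
    return "11" not in s

def all_binary_strings(max_len):
    # Yield every string over {0, 1} of length 0..max_len.
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            yield "".join(chars)

# The regex and the plain-language description should agree on every string.
mismatches = [
    s for s in all_binary_strings(10)
    if bool(PATTERN.fullmatch(s)) != no_double_one(s)
]
print(mismatches)
```

`fullmatch` is used because L(E) is about whole strings, not about finding the pattern somewhere inside a longer text.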
  18. Automatons: a|bc*
  19. NFA/DFA/RegEx
      • Each of the three types of automata (DFA, NFA, ε-NFA), and regular expressions as well, define exactly the same set of languages: the regular languages.
      • (∃N, NFA: L(N) = L) ⇔ (∃D, DFA: L(D) = L) ⇔ (∃R, RegEx: L(R) = L)
  20. Characters
      • Ordinary characters (“alphabet” or “language”): a b c 6 3 9 1 ü etc.
      • Special characters / meta characters: * ^ ! ? + . etc.
  21. Pattern Matching
      • Most basic example: a single, ordinary character: h
      • In the string “Matching”, it’ll match the first occurrence of the character: the “h” in “Matching”
  22. Pattern Matching
      • Matching meta characters: escape them by prepending a backslash: \$
      • NZ\$14.78 would match against “The Price of this book is NZ$14.78”
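
A small sketch of slides 21 and 22 in Python's `re` module (an assumption; the talk itself targets CFML). It shows a literal character match and what happens when the `$` is left unescaped. The variable names are illustrative only.

```python
import re

subject = "The Price of this book is NZ$14.78"

# An ordinary character matches itself; search() reports the first occurrence.
first_h = re.search(r"h", "Matching")
print(first_h.start())  # index of the first "h"

# Escaped: \$ is a literal dollar sign and \. a literal dot.
escaped = re.search(r"NZ\$14\.78", subject)

# Unescaped: $ is the end-of-subject anchor, so "14.78" can never follow it.
unescaped = re.search(r"NZ$14.78", subject)

# re.escape() adds the backslashes when the literal text comes from elsewhere.
auto_escaped = re.search(re.escape("NZ$14.78"), subject)

print(escaped.group(), unescaped, auto_escaped.group())
```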
  23. Syntax: []
      • Character classes contain a set of characters in square brackets
      • [abcde] -> [a-e]
      • [aeiou]
      • [0123456789] -> [0-9]
      • [HW]ahn ... matches either Hahn or Wahn
      • [^abcde] ... negates the class: matches any character that is NOT listed
  24. Class shortcuts
      • \w -> [A-Za-z0-9_]
      • \d -> [0-9]
      • \s -> [ \n\r\f\t]
      • \W, \D and \S are the negated versions.
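
The character classes and shortcuts from slides 23 and 24 can be tried out as follows. This is a hedged sketch in Python; the sample strings are invented, and `re.ASCII` is passed because Python 3 otherwise lets `\w`, `\d` and `\s` match Unicode characters beyond the sets listed on the slide.

```python
import re

# [HW]ahn: exactly one of H or W, followed by "ahn".
names = re.findall(r"[HW]ahn", "Hahn, Wahn, Bahn")

# [^a-e]: any single character that is NOT a, b, c, d or e.
not_a_to_e = re.findall(r"[^a-e]", "abcxyz")

# Shortcuts, restricted to the ASCII sets shown on the slide.
digits = re.findall(r"\d+", "14/10/2013", flags=re.ASCII)
words = re.findall(r"\w+", "snake_case and CamelCase", flags=re.ASCII)
non_space = re.findall(r"\S+", "a\tb\nc", flags=re.ASCII)

print(names, not_a_to_e, digits, words, non_space)
```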
  25. “Any” .
      • The dot matches any character, except a newline
      • . -> [^\n]
      • Careful: 47.11 matches the literal “47.11” but also 47211, 47p11, 47z11 etc.
  26. Quantifiers
      • Specify how often a component of a RegEx must be repeated for a match to occur
      • ?: 0-or-1
      • +: 1-or-more
      • *: 0-or-more
      • {min,max}: general repetition
  27. Examples
      • Optional whitespace between words: meta\s?character will match both “meta character” and “metacharacter”
      • Negative integers: -\d+
      • Opening HTML/XML tag: <[a-z][a-z0-9]*>
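
The dot and the quantifier examples from slides 25 to 27, sketched in Python. The patterns come from the slides; the test strings and variable names are assumptions added for illustration.

```python
import re

candidates = ["47.11", "47211", "47p11", "4711"]

# The dot stands for any one character, so 47.11 is looser than it looks.
loose = [s for s in candidates if re.fullmatch(r"47.11", s)]
# Escaping the dot restricts the match to the literal text.
strict = [s for s in candidates if re.fullmatch(r"47\.11", s)]

# \s? allows zero or one whitespace character, but not two.
spacing = [
    bool(re.fullmatch(r"meta\s?character", s))
    for s in ("meta character", "metacharacter", "meta  character")
]

# -\d+ : a minus sign followed by one or more digits.
negatives = re.findall(r"-\d+", "3 -14 and -7")

# Opening tags only: the closing tag has a "/" right after "<".
tags = re.findall(r"<[a-z][a-z0-9]*>", "<h3>My Heading</h3>")

print(loose, strict, spacing, negatives, tags)
```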
  28. Greediness
      • Quantifiers are greedy: they want to match as much as they can.
      • \d{2,4} applied to “14/10/2013” will match “14”, “10” and all four digits of “2013”
      • <.+> applied to “<h3>My Heading</h3>” will match the whole string and not just <h3>
  29. Laziness
      • Lazy quantifiers grab as little as possible
      • If the overall match fails, they try to grab a little bit more and try again
      • To make a quantifier lazy, append ?: *? +? {min,max}? and ??
      • <.+?> applied to “<h3>My Heading</h3>” will match <h3> (and, on the next search, </h3>)
  30. Laziness alternative
      • Negations: <.+?> can be written as <[^>\r\n]+>
      • Can be more efficient depending on the RegEx engine
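
Greedy, lazy and negated-class matching from slides 28 to 30 side by side. A minimal sketch using Python's `re`; other engines may differ in performance, as the slide notes, but the matches shown here follow from the definitions.

```python
import re

html = "<h3>My Heading</h3>"
date = "14/10/2013"

# Greedy: .+ runs to the end and backs off only as far as the last ">".
greedy = re.search(r"<.+>", html).group()

# Lazy: .+? stops at the first ">" it can reach.
lazy_all = re.findall(r"<.+?>", html)

# Negated class: never crosses a ">" in the first place.
negated_all = re.findall(r"<[^>\r\n]+>", html)

# Greedy {2,4} takes up to four digits; lazy {2,4}? settles for two.
greedy_digits = re.findall(r"\d{2,4}", date)
lazy_digits = re.findall(r"\d{2,4}?", date)

print(greedy, lazy_all, negated_all, greedy_digits, lazy_digits)
```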
  31. ^ and $: anchors
      • ^ is an anchor that matches at the beginning of the subject
      • $ is an anchor that matches at the end of the subject
      • ^Ka[iy] in Germering$ matches “Kay in Germering” but not “Diane and Kai in Germering”
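
The anchor example from slide 31 as a short Python sketch. The unanchored variant is added here only for contrast and is not on the slide.

```python
import re

anchored = re.compile(r"^Ka[iy] in Germering$")
unanchored = re.compile(r"Ka[iy] in Germering")

subjects = ["Kay in Germering", "Kai in Germering", "Diane and Kai in Germering"]

# With ^ and $ the pattern has to cover the subject from start to end.
anchored_hits = [s for s in subjects if anchored.search(s)]

# Without anchors the pattern may match anywhere inside the subject.
unanchored_hits = [s for s in subjects if unanchored.search(s)]

print(anchored_hits)
print(unanchored_hits)
```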
  32. Subpatterns ()
      • Group part of a RegEx together
      • Operators can be applied to the group as a whole
      • Subpatterns are capturing, i.e. they store their contribution to the match in memory. Some RegEx engines allow non-capturing subpatterns by adding ?: right after the opening parenthesis.
  33. Subpatterns
      • RegEx: (\d\d-(\w+)-\d{4})
      • String: “14-Oct-2013”
      • Match 1 (outer subpattern): 14-Oct-2013
      • Match 2 (inner subpattern): Oct
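
Capturing and non-capturing subpatterns from slides 32 and 33, sketched in Python. The date pattern is the one from the slide; the `(?:ab)+` example is an assumption added to show an operator applied to a group.

```python
import re

subject = "14-Oct-2013"

# Two capturing subpatterns: the outer one is group 1, the inner one group 2.
capturing = re.search(r"(\d\d-(\w+)-\d{4})", subject)
outer, inner = capturing.group(1), capturing.group(2)

# (?: ... ) groups without capturing, so only the month is stored.
non_capturing = re.search(r"(?:\d\d-(\w+)-\d{4})", subject)
stored = non_capturing.groups()

# A quantifier after a group repeats the whole group.
repeated = re.fullmatch(r"(?:ab)+", "ababab")

print(outer, inner, stored, bool(repeated))
```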
  34. Backreferences \1
      • Backreferences are used to refer to captured subpatterns
      • (ha|ho)\w+\1 matches words that start with ha or ho and end with the same two letters: the \1 refers to the first subpattern and what it matched
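
A sketch of the backreference from slide 34 in Python. The candidate strings are made up, and the doubled-word replacement is an extra illustration that is not on the slide.

```python
import re

pattern = re.compile(r"(ha|ho)\w+\1")

candidates = ["hahaha", "hohoho", "hahaho", "haha"]

# \1 must repeat exactly what group 1 matched, so "ha...ho" is rejected,
# and "haha" is too short because \w+ needs at least one character in between.
matching = [w for w in candidates if pattern.fullmatch(w)]

# Backreferences also work in the pattern of a replacement: collapse a doubled word.
deduplicated = re.sub(r"\b(\w+) \1\b", r"\1", "the the cat")

print(matching, deduplicated)
```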
  35. Alternation |
      • Lowest precedence operator, often needs grouping
      • Peter|Paul Mueller means “Peter” or “Paul Mueller”
      • (Peter|Paul) Mueller means “Peter Mueller” or “Paul Mueller”
      • This is possible: (^|my|your) friend
      • Concept of eagerness
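
Alternation precedence and eagerness from slide 35 in a short Python sketch. The `Get|GetValue` pattern is an assumption chosen to illustrate eagerness; it does not appear on the slide.

```python
import re

ungrouped = re.compile(r"Peter|Paul Mueller")
grouped = re.compile(r"(Peter|Paul) Mueller")

# Without grouping, the alternatives are "Peter" and "Paul Mueller".
ungrouped_peter = ungrouped.fullmatch("Peter")
ungrouped_full = ungrouped.fullmatch("Peter Mueller")

# With grouping, only the first name varies.
grouped_full = grouped.fullmatch("Peter Mueller")

# An anchor can be one of the alternatives.
owner = re.search(r"(^|my|your) friend", "my friend").group(1)

# Eagerness: the engine takes the first alternative that works,
# even if a later one would match more text.
eager = re.search(r"Get|GetValue", "GetValue").group()
reordered = re.search(r"GetValue|Get", "GetValue").group()

print(bool(ungrouped_peter), ungrouped_full, bool(grouped_full), owner, eager, reordered)
```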
  36. Backtracking
      • When faced with several options to achieve a match, the engine will try one and retreat from it if necessary
      • Decision points are usually quantifiers and alternation
      • Refer to greedy and lazy behaviour...
  37. Example: \w+st applied to “Strongest” (step-by-step animation of the match attempts)
  38. Example: \w+st applied to “Strongest”
      • After backtracking, st matches the final “st”: successful
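
The backtracking walk on slides 37 and 38 can be approximated as follows. This is a toy model, not how Python's `re` is implemented internally: the hypothetical helper `greedy_then_literal` assumes every character of the text is a word character, lets `\w+` take everything, and then hands characters back one at a time.

```python
import re

subject = "Strongest"

# What the real engine ends up with: group 1 is the part \w+ kept.
match = re.search(r"(\w+)st", subject)
kept = match.group(1)

def greedy_then_literal(text, tail):
    # Toy model: \w+ first holds all of `text`, then gives back one
    # character per attempt until `tail` fits directly after it.
    attempts = []
    for end in range(len(text), 0, -1):
        fits = text.startswith(tail, end)
        attempts.append((text[:end], fits))
        if fits:
            break
    return attempts

for taken, fits in greedy_then_literal(subject, "st"):
    print(f"\\w+ holds {taken!r}: 'st' {'matches' if fits else 'fails'}")
```

The printed attempts mirror the slide sequence: two failed tries, then success once `\w+` has given back the final two characters.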
  39. Atomic Grouping
      • (?>regexp) can be used to treat the regexp as an atomic entity; backtracking inside it will not occur.
      • Can speed up the discovery of failed matches
  40. Lookarounds
      • Do not capture or consume
      • Positive lookahead (?=)
      • Negative lookahead (?!): hard to achieve with character classes
      • Positive lookbehind (?<=)
      • Negative lookbehind (?<!)
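
The four lookarounds from slide 40, sketched in Python. The sample strings and the currency scenario are assumptions added for illustration; note that Python's `re` requires lookbehind patterns to have a fixed width, which may differ in other engines.

```python
import re

prices = "10 EUR, 20 USD"

# Positive lookahead: digits that are followed by " EUR"; the " EUR" is not consumed.
eur = re.findall(r"\d+(?= EUR)", prices)

# Negative lookahead: whole numbers NOT followed by " EUR".
# The \b on both sides stops the engine from settling for part of "10".
not_eur = re.findall(r"\b\d+\b(?! EUR)", prices)

# Positive lookbehind: an amount directly preceded by "NZ$".
after_nz = re.findall(r"(?<=NZ\$)\d+\.\d+", "The Price of this book is NZ$14.78")

# Negative lookbehind: numbers that do not have a minus sign in front.
non_negative = re.findall(r"(?<!-)\b\d+", "3 -14 7")

print(eur, not_eur, after_nz, non_negative)
```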
  41. That’s not even it
      • There are things like: assertions, possessive quantifiers (preventing backtracking), inline options, comments etc.
  42. RegEx Engines
      • There’s not just ONE RegEx engine
      • Differences: internal implementation, features, performance
  43. RegEx Engines
      • Some well-known flavours: Perl, PCRE, Java, JavaScript, POSIX and many more
      • Really good overview: http://www.regular-expressions.info/refflavors.html
  44. RegEx Engines
      • What does CFML do?
      • ACF since version 6 uses Apache Jakarta Oro (http://jakarta.apache.org/oro/), discontinued since 2010.
      • Railo uses Apache Jakarta Oro as well, to provide compatibility with ACF.
      • BOTH use Oro for the RegEx support in CFML tags/functions (Perl 5-compatible)
  45. RegEx in CFML
      • Apache Jakarta Oro, for example, doesn’t do lookaheads/lookbehinds well or at all.
      • Better: leverage Java’s internal RegEx handling (java.util.regex)
      • Made easy through CFRegEx (http://cfregex.net)
  46. More about the dot
      • Very commonly used
      • Very often abused
      • It’s often better (more performant) to use [a-z] instead of .
      • . with a quantifier becomes greedy (see earlier example)
  47. Useful tools
      • http://regex.larsolavtorvik.com
      • http://www.gskinner.com/RegExr/
      • http://rubular.com
      • https://github.com/downloads/samsouder/reggy/Reggy_v1.3.tbz
      • http://wiki.tcl.tk/1345
  48. Get in touch
      • Twitter: @AgentK
      • Blog: http://bloginblack.de
      • Podcast (2 Developers Down Under): http://2ddu.com
      • About me: http://about.me/agentk
