Introduction to regular expressions
The slides of my brown-bag session dedicated to introducing regular expressions.


Published in: Software, Technology

  • 1. Gianluca Costa Introduction to regular expressions
  • 2. Before starting Regular expressions are a tool: it's up to you to use them wisely. Like every tool, they require: ● Practice ● Tests ● Patience
  • 3. Why “regular expressions”? ● 1956: mathematical definition of regular sets by Stephen Cole Kleene ● 1968: “Regular Expression Search Algorithm” - by Ken Thompson. Description of a regular expression compiler. ● Regular expressions employed in text editors. Introduction of the grep command.
  • 4. Examples of text matching ● Given an IIS log, keep just the requests to the web app “/PicnicAPI” ● Perform LIKE queries on MongoDB ● Get the dir and basename of a file path ● Get the src attribute of an <img> tag ● Read a key-value file having “\” line continuations
  • 5. Generalized problems ● Determine if a pattern is contained in (matches) a given string ● Extract substrings from a matching string ● Replace one or more substrings ● Generalizable to files and streams
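The three generalized problems can be sketched with Python's standard `re` module. This is only an illustration: the log line, the file path and the text to mask are made-up sample data, not taken from the slides.

```python
import re

# Hypothetical sample data, invented for illustration.
log_line = "GET /PicnicAPI/items 200"
file_path = "/var/log/app.log"

# 1. Does the pattern occur somewhere in the string?
is_api_request = re.search(r"/PicnicAPI", log_line) is not None

# 2. Extract substrings: directory and base name of a path.
path_match = re.match(r"(.*)/([^/]+)$", file_path)
directory, basename = path_match.group(1), path_match.group(2)

# 3. Replace every digit with "#".
masked = re.sub(r"\d", "#", "card 1234")

print(is_api_request, directory, basename, masked)
```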
  • 6. Regular expressions Regular expressions describe text patterns. For example: “At least 3 digits, but not more than 5”.
  • 7. A simple example /\d{3,5}/ Matches “3482”, but not “Hello”
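As a quick check of the slide's example, here is a minimal sketch using Python's `re` module (the choice of Python is an assumption; any backtracking engine should behave the same way here).

```python
import re

# "At least 3 digits, but not more than 5"
digits = re.compile(r"\d{3,5}")

found = digits.search("3482")
missing = digits.search("Hello")

print(found.group(0))  # 3482
print(missing)         # None
```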
  • 8. How to apply regexes ● Functions/classes provided by programming languages/frameworks ● Command-line tools (sed, awk, egrep, …) ● Other interfaces (eg: MongoDB queries)
  • 9. Interactive testing ● - currently provides a free multi-engine test environment, explaining your regex and showing the matches on a text. ● - another regex test environment, targeting Ruby's flavour.
  • 10. The dualism regex-target The regular expression is applied to a string, to check for a match. Both the regex and the string have their own cursor. Which cursor drives the matching process? [Figure: the text “The quick br…” and the regex “qui”, each with its own cursor]
  • 11. Engine types ● DFA ● Traditional NFA ● POSIX NFA ● Hybrid solutions
  • 12. DFA ● Matching is driven by the cursor on the text ● Very fast matching ● Takes longer to compile ● Takes more memory ● Declarative regex ● Always returns the longest possible match.
  • 13. Traditional NFA ● Matching is driven by the cursor on the regex ● Creates a stack of states, and performs backtracking ● Supports more language constructs ● Imperative regex ● Usually returns the first match found ● Employed by standard Java, .NET, Python, PHP, Perl, Ruby, …
  • 14. POSIX NFA ● Very, very similar to traditional NFA, but returns the longest possible match. ● Further performance issues!
  • 15. Hybrid solutions Double engine: first-scan with DFA, then scan with NFA if required by the pattern. Further implementations are possible.
  • 16. Our target: NFAs ● DFAs are less common than NFAs; their regex syntax is almost a subset of the NFA syntax, and they are generally simpler. ● We will concentrate on NFA regexes
  • 17. Know your engine There are common rules, but several engines. Every engine has its own implementation. You must know your engine. And write tests.
  • 18. Regex basics Literal text, such as /rain/ matches if and only if the string contains, somewhere, that sequence, matching character after character.
  • 19. The first rule of matching Matching starts from the leftmost character. Therefore, when /rain/ is applied to “The rainbow shines after the rain”, it matches the “rain” in “rainbow”, not the final word.
  • 20. The second rule of matching The engine returns a success if and only if the regex cursor reaches the end of the regex.
  • 21. Escaping characters ● Some characters (\, *, ?, +, ., (, ), [, ], {, }, |, ^, $, #) must be escaped when they are used literally ● Escape is performed by prepending “\”. For example: /\?/ to represent a literal “?” ● Where raw strings are not supported, a double escape might be required. In Java, the regex /\+/ becomes: “\\+”.
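The same escaping issue can be illustrated in Python, which does support raw strings. The sketch below is only an example: the sample text "1+1=2?" is invented, and `re.escape` is Python's helper for escaping a whole literal string.

```python
import re

# Raw string: the backslash reaches the regex engine untouched.
literal_plus = re.compile(r"\+")

# Ordinary string: the backslash must be doubled, as in Java.
same_pattern = re.compile("\\+")

# re.escape() escapes the special characters of a literal text for us.
escaped = re.escape("1+1=2?")

print(literal_plus.pattern == same_pattern.pattern)
print(re.fullmatch(escaped, "1+1=2?") is not None)
```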
  • 22. Escape sequences ● \r ● \n ● \v ● \f ● \t ● They work just like in C
  • 23. Character classes ● [abc] = “a, b or c in this position” ● [a-z] = “a, b, c, …, z here” ● [A-Za-z] = “A, B, …, Z, a, b, …, z here” ● [A-Za-z0] = “A, …, Z, a, …, z, 0 here” ● [A-Z-] = [-A-Z] = “A, …, Z or - here” ● What about accents? (é, è, …) And cedilla? ● Know your engine.
  • 24. Negated character classes ● [^ab] = “Something not a and not b here” ● [^a-z] = “Something not a, b, c, …, z here” ● [^A-Za-c] = “Something not A, …, Z, a, …, c here” ● A negated character class still requires a character in that position: one that does not belong to the specified class.
  • 25. Common character classes ● \d = a digit ● \D = [^\d] ● \w = a letter, a digit or “_” ● \W = [^\w] ● \s = a space character ● \S = [^\s] ● . = any character except newline
  • 26. What are letters and spaces? ● The answer depends on the encoding and on your engine. ● In ASCII, usually: – \w = [A-Za-z0-9_] – \s = [\r \n\t\f\v] (includes ASCII-32 common space) ● But what about Latin-1 or Unicode? ● Know your engine
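To make "know your engine" concrete, here is a small sketch of how one engine, Python 3's `re`, answers the question: for `str` patterns `\w` is Unicode-aware by default, and the `re.ASCII` flag restricts it to `[A-Za-z0-9_]`. The sample word is invented.

```python
import re

# "caffè_42", with the accented letter written as an escape for clarity.
text = "caff\u00e8_42"

unicode_words = re.findall(r"\w+", text)            # Unicode-aware \w
ascii_words = re.findall(r"\w+", text, re.ASCII)    # \w = [A-Za-z0-9_]

print(unicode_words)
print(ascii_words)
```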
  • 27. Unicode character classes ● \uXXXX: matches the Unicode code point whose hex value is XXXX ● There should also be support for Unicode's categories and scripts, especially via \p ● Many more Unicode-related, non-standard features ● Know your engine
  • 28. Capturing groups ● ( and ) define a capturing group ● Capturing groups are assigned a 1-based index, according to the position of their ( ● /(\w+)bet/ tries to match a string and, if successful, creates a capturing group for the text matching \w+, having index 1 ● If the above regex is applied to “alphabet”, it matches and its group 1 is “alpha”
  • 29. Non-capturing groups ● Groups can just be used to clarify precedence: capturing is not always needed ● Skipping capturing can save memory and speed up the matching process ● To define a non-capturing group, use (?: and ). ● Therefore, /(?:\w+)bet/ is just like /\w+bet/, as no capturing is performed and this grouping alters precedence without effects.
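The “alphabet” example can be tried directly. The sketch below uses Python's `re` module (an assumption about the engine) to show the group indices for the capturing and the non-capturing variant.

```python
import re

capturing = re.search(r"(\w+)bet", "alphabet")
non_capturing = re.search(r"(?:\w+)bet", "alphabet")

whole_match = capturing.group(0)        # group 0 is always the whole match
group_one = capturing.group(1)          # text matched by (\w+)
no_groups = non_capturing.groups()      # empty tuple: nothing was captured

print(whole_match, group_one, no_groups)
```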
  • 30. Backreferences ● Backreference = the content of a capturing group that becomes part of the regex ● Use \N in your regex, replacing N with the index of the captured group in question ● For example: /(['"])\w+\1/ to pair single and double quotes ● Some engines support named capturing and backreferences
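A minimal sketch of the quote-pairing example in Python follows. The sample strings are invented, and the `(?P<name>…)` / `(?P=name)` notation is Python's own syntax for named groups and named backreferences; other engines spell it differently.

```python
import re

# Numbered backreference: \1 must repeat whatever quote group 1 captured.
quoted = re.compile(r"""(['"])\w+\1""")

paired = quoted.search("say 'hello' now")
mismatched = quoted.search("'hello\"")   # opens with ' but closes with "

# Named variant (Python syntax).
named = re.compile(r"""(?P<quote>['"])\w+(?P=quote)""")
named_match = named.search('say "hi" now')

print(paired.group(0), mismatched, named_match.group("quote"))
```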
  • 31. Alternation ● Alternatives are separated by | ● For example: /alpha|beta/ means “alpha” or “beta” ● Alternation has very low precedence; its scope is the current group: use grouping to force precedence. ● For example: /A(?:pril|ugust)/ means “A, followed by “pril” or “ugust”.
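To see why grouping matters for alternation, the sketch below (Python's `re`, chosen for illustration) compares the slide's grouped regex with the same alternation left ungrouped.

```python
import re

grouped = re.compile(r"A(?:pril|ugust)")   # "A", then "pril" or "ugust"
ungrouped = re.compile(r"April|ugust")     # "April", or just "ugust"

print(grouped.fullmatch("August") is not None)    # True
print(grouped.fullmatch("ugust") is not None)     # False
print(ungrouped.fullmatch("ugust") is not None)   # True
```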
  • 32. Alternation VS char classes ● A character class (asserted or negated) always matches one and only one character ● The branches of an alternation can be strings of any length (at least one character, to be consistent)
  • 33. Matching in a DFA /nice|cute/ applied to: “Pandas are cute animals” It scans the string, starting from P, and, at every character, tries to apply the regex. In a DFA regex, the engine only chooses which regex components remain valid at a given position of the text cursor.
  • 34. Matching in NFA ● NFA also keeps a stack of states! ● Each decision point saves a state in the stack ● State = position of the 2 cursors ● If a choice in the regex leads to no match, the engine backtracks (=pops a state from the stack and makes a different choice)
  • 35. Backtracking S1 S2 S5 S3 S4 S6 S7 S8 1 2 4 7 8 10 11 3 5 6 9
  • 36. Performance implications ● In NFA, a failure is returned only when all the regex paths have been explored ● NFA regexes must be written with performance in mind.
  • 37. Alternation in NFA ● Ordered in most implementations. ● Affects what is matched and performance. ● Know your engine
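Ordered alternation is easy to observe. In the sketch below (Python's `re`, with an invented pair of branches), the first branch that allows a match wins, even when a later branch would match more text.

```python
import re

short_first = re.search(r"cat|category", "category").group(0)
long_first = re.search(r"category|cat", "category").group(0)

print(short_first)  # cat
print(long_first)   # category
```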
  • 38. Greedy quantifiers ● All quantifiers can be applied to single characters, classes or even groups ● * = any number of occurrences (even 0) ● ? = 0 or 1 occurrences ● + = 1 or more occurrences ● {n} = exactly n occurrences ● {m,n} = from m to n occurrences (inclusive) ● {m,} = at least m occurrences
  • 39. First example of greedy quantifiers ● Let's consider the regex /be?(er|ar)/ ● How is it applied to “I'd like a chocolate bar” ? ● The regex cursor stays on “b” until the text cursor reaches its “b” too ● Then, the following regex paths are tried: – be => b(er) => b(ar)
  • 40. Greedy quantifiers and backtracking ● Consider the regex /.* are/ ● Applied to: “Pandas are cute animals” ● .* will consume the whole text at first ● However, when reaching the end of the text, it stops matching and the regex cursor goes on.
  • 41. Greedy quantifiers and backtracking (2) ● Now, “ ” can't match (no more text is available), so the engine backtracks! ● Some backtracking is performed, until the first available space is reached (between “cute” and “animals”) ● The regex cursor moves on to “a”, which matches the “a” in “animals”. But “r” doesn't match “n” => more backtracking!
  • 42. Greedy quantifiers and backtracking (3) ● The failures and backtracking go on until the space between “are” and “cute”... “a” doesn't match the “c” in “cute” => backtracking, again! ● The next space is ok: it is followed by “are”, which matches the rest of the regex.
  • 43. Pandas are cute animals! ^__^!
  • 44. Lazy quantifiers ● Quantifiers become lazy if followed by a ? ● *? ● ?? ● +? ● {m,n}? ● {m,}? ● {n} cannot be lazy: it indicates a precise n
  • 45. Lazy quantifiers and backtracking ● When applying /.*? are/ to “Pandas are cute animals”, what happens? ● The engine must choose whether to apply .*? to “P”. But it's lazy, so the engine chooses to move the regex cursor forward ● The regex cursor goes on to “ ”, but it doesn't match “P” so the engine backtracks ● The engine must now take the remaining path – applying .*? to “P”, which is viable
  • 46. Lazy quantifiers and backtracking (2) ● This goes on until the first space in the text is reached: it matches the space in the regex, so the regex cursor can go on ● The matching process continues until the regex ends ● In this case, the match of greedy and lazy evaluation was the same – but the lazy quantifiers required less backtracking
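The panda walkthrough can be reproduced with the sketch below (Python's `re`). The second text is an invented variation, added only to show a case where greedy and lazy evaluation return different matches.

```python
import re

text = "Pandas are cute animals"
greedy = re.search(r".* are", text).group(0)
lazy = re.search(r".*? are", text).group(0)

# Invented text with two occurrences of " are".
longer_text = "Pandas are cute and koalas are cute"
greedy_longer = re.search(r".* are", longer_text).group(0)
lazy_longer = re.search(r".*? are", longer_text).group(0)

print(greedy, "|", lazy)
print(greedy_longer, "|", lazy_longer)
```

With a single “ are” both variants stop at the same place; with two, the greedy version extends to the last one and the lazy version stops at the first.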
  • 47. Apply or skip? Greedy VS Lazy ● When a quantifier is encountered, the regex engine must choose whether to apply its element to the text or not ● Greedy quantifiers prefer the “apply” path whenever possible ● Lazy quantifiers prefer the “skip” path whenever possible ● Choosing greedy VS lazy quantifiers can impact performance and what is matched, but not the presence/absence of a match.
  • 48. Greedy VS Lazy: an example ● Given the text “987”: – /\d{1,3}/ matches the whole “987”: the greedy quantifier tries to consume as much as possible – /\d{1,3}?/ matches just “9”: the lazy quantifier must honour the constraints (at least 1 match), but chooses to skip application whenever possible
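The “987” example, checked with a minimal Python sketch:

```python
import re

greedy_digits = re.search(r"\d{1,3}", "987").group(0)
lazy_digits = re.search(r"\d{1,3}?", "987").group(0)

print(greedy_digits)  # 987
print(lazy_digits)    # 9
```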
  • 49. Atomic grouping ● (?> and ) define an atomic group ● All the states created within an atomic group are removed from the engine's stack as soon as the group closes ● Atomic groups are non-capturing, but can have capturing groups ● Atomic grouping can alter the match/failure result of a regex, as well as affecting performance
  • 50. Possessive quantifiers ● Obtained by adding a “+” to greedy quantifiers ● Possessive quantifiers are equivalent to greedy quantifiers wrapped within an atomic group. ● For example: /\d++/ = /(?>\d+)/
  • 51. Regex flags ● Regex engines can turn on/off features, for customized behaviour ● Enabling and disabling flags usually affects the whole regex, but some engines support flags on just regions. ● Flag manipulation is engine- and API- dependent ● Every engine has its own flags, but some are definitely common.
  • 52. Most common regex flags ● Case insensitive ● Dot-all: . matches any character, including \n ● Multiline anchors: ^ and $ (see later) work on lines instead of the whole text ● Extended: spaces - including newlines - are ignored unless escaped or within a character class; lines starting with # are comments. More readable regexes.
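As an illustration of these four flags, the sketch below uses their Python names (`re.IGNORECASE`, `re.DOTALL`, `re.MULTILINE`, `re.VERBOSE`); other engines name them differently. The sample text and the phone-like pattern are invented.

```python
import re

text = "first line\nSecond LINE"

case_insensitive = re.search(r"second", text, re.IGNORECASE).group(0)

# Without dot-all, "." stops at the newline, so this cannot match.
without_dotall = re.search(r"first.*LINE", text)
with_dotall = re.search(r"first.*LINE", text, re.DOTALL).group(0)

# Multiline: "^" also matches at the beginning of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Extended (verbose) mode: whitespace is ignored and "#" starts a comment.
phone = re.compile(r"""
    \d{3}   # three digits
    -       # a literal dash
    \d{4}   # four digits
""", re.VERBOSE)
phone_match = phone.search("call 555-1234").group(0)

print(case_insensitive, without_dotall, line_starts, phone_match)
```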
  • 53. Anchors ● Anchors do not consume text: they are basic conditions on the text cursor. ● They must be verified for the regex to match
  • 54. Common anchors ● ^: the cursor is at the beginning of the text (of a line, in multiline mode) ● $: the cursor is at the end of the text (of a line, in multiline mode. And before or after \n? Know your engine). ● \A: the cursor is at the beginning of the text ● \Z: the cursor is at the end of the text ● \b: the cursor is at a word boundary (what's a word boundary? Know your engine)
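The “know your engine” questions on this slide have concrete answers per engine. The sketch below shows how Python's `re` behaves, which is not necessarily how other engines behave: `$` also matches just before a trailing newline, while `\Z` matches only at the very end of the text. The sample strings are invented.

```python
import re

text = "one\ntwo\n"

whole_text_only = re.findall(r"^\w+$", text)              # no multiline flag
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

dollar_before_newline = re.search(r"two$", text)          # matches in Python
strict_end = re.search(r"two\Z", text)                    # trailing \n blocks it

whole_words = re.findall(r"\bcat\b", "cat concat cat's")

print(whole_text_only, per_line, whole_words)
```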
  • 55. Lookaround ● Lookaround = a regex-based condition on the text cursor. Can be positive (the regex must match) or negative (the regex must fail). ● Lookahead = a lookaround on the text following the cursor ● Lookbehind = a lookaround on the text preceding the cursor.
  • 56. Lookaround notation ● Positive lookbehind: (?<=regex) ● Positive lookahead: (?=regex) ● Negative lookbehind: (?<!regex) ● Negative lookahead: (?!regex)
  • 57. Lookaround basics ● Their position in the regex matters, as the other characters in the regex consume the text and make the text cursor shift forward. ● On the other hand, lookarounds do not consume text ● Juxtaposed lookarounds all apply, combined by a logical AND, to the position marked by the text cursor
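A short sketch of lookarounds in Python's `re` follows; all sample strings and the password-like rule are invented for illustration. It shows that the lookaround text is checked but not included in the match, and that juxtaposed lookaheads must all hold at the same position.

```python
import re

# Positive lookahead: digits followed by "px"; "px" itself is not consumed.
pixel_values = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Positive and negative lookbehind on the character before the digits.
prices = re.findall(r"(?<=\$)\d+", "cost $5, qty 7")
not_prices = re.findall(r"(?<!\$)\b\d+", "cost $5, qty 7")

# Two juxtaposed lookaheads: a digit AND a lowercase letter must both occur.
rule = re.compile(r"(?=.*\d)(?=.*[a-z]).{6,}")
accepted = rule.fullmatch("abc123") is not None
rejected = rule.fullmatch("abcdef") is not None

print(pixel_values, prices, not_prices, accepted, rejected)
```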
  • 58. Lookaround limitations ● Lookarounds behave like nested regexes having their own stack ● They are also called zero-length assertions ● Lookaheads can be full-fledged regexes ● Lookbehinds are usually much more restricted, depending on the engine
  • 59. Lookarounds and the stack ● Each lookaround maintains its own stack, that gets deleted at the end of the lookaround. ● An important detail: capturing groups within lookarounds are considered capturing groups of the whole regex => their result is saved.
  • 60. Lookahead + Backreference = Atomic group ● Lookaheads are full-fledged regexes with their own stack, which is thrown away. ● This is exactly like an atomic group, but the lookahead does not consume text ● However, capturing groups in a lookahead are stored by the regex => use a backreference to consume that text ● Therefore, for example: /(?=(\d+))\1/ = /(?>\d+)/
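The lookahead-plus-backreference trick is handy in engines or versions that lack the (?> … ) syntax. The sketch below tries it in Python's `re`; a named group is used instead of \1 only so that the backreference can be followed by a literal digit without ambiguity. The trailing “5” and the sample text are invented.

```python
import re

# Plain greedy quantifier: \d+ gives back one digit so that "5" can match.
plain = re.search(r"\d+5", "12345")

# Atomic-like version: the lookahead captures all the digits, the
# backreference consumes exactly that text, and nothing can be given back.
atomic_like = re.search(r"(?=(?P<digits>\d+))(?P=digits)5", "12345")

print(plain.group(0))  # 12345
print(atomic_like)     # None
```

The different results show that atomic grouping can change the match/failure outcome, not just the speed.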
  • 61. Regexes and C# ● .NET encapsulates regexes in a class, System.Text.RegularExpressions.Regex ● Its constructor accepts the regex and, optionally, global flags ● C# supports verbatim strings (preceded by @), which avoid the over-escaping required in Java.
  • 62. Regexes and Java ● Java's regex class is java.util.regex.Pattern ● In lieu of a constructor, it's a static method, Pattern.compile(), that creates a regex ● It takes the regex and, optionally, the global flags ● In Java, the regex /\\test/ becomes “\\\\test”, because each “\” in the regex must be escaped in Java, too, for a total of 4 “\”.
  • 63. Regexes in MongoDB ● MongoDB supports regexes ● Just use /regex/ (with slashes and without double quotes) as the right side of an equality assertion in your query ● Important: a regex could hit indexes on a field, but the best results are achieved when the regex starts with ^
  • 64. Regexes in Python ● Python provides the standard module re ● To create a regex, just use re.compile(), which takes, as usual, the regex string and the optional global flags
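A minimal sketch of `re.compile()` with flags; the pattern and the sample text are invented for illustration.

```python
import re

# Flags are combined with the bitwise OR operator.
first_word = re.compile(r"^[a-z]+", re.IGNORECASE | re.MULTILINE)

# The compiled object can be reused for many searches.
first_words = first_word.findall("Hello world\nGoodbye moon")

print(first_words)  # ['Hello', 'Goodbye']
```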
  • 65. Regexes in JavaScript ● In JavaScript, it's quite common to use this notation to create a regex object: var regex = /regexPattern/; var regexWithFlags = /regexPattern/flags; ● Alternatively, the RegExp class can be used
  • 66. Final notes ● Don't forget that regexes must be kept simple, just like any other construct ● To achieve this result, a good knowledge of the text, as well as of the requirements, is needed. ● Write tests for your regexes
  • 67. Further references ● “Mastering Regular Expressions” - by Jeffrey E. F. Friedl, published by O'Reilly Media