Regular expressions

Training given a few years ago about regular expressions.

  • Anchors do not match any character at all. Instead, they match a position before, after, or between characters, and can be used to "anchor" the regex match at that position.
    The caret ^ matches the position before the first character in the string. Applying ^a to abc matches a. ^b does not match abc at all, because the b cannot be matched right after the start of the string, which is the position ^ matches.
    Similarly, $ matches right after the last character in the string. c$ matches c in abc, while a$ does not match at all.
    Three different positions qualify as word boundaries (the positions matched by \b):
      – Before the first character in the string, if the first character is a word character.
      – After the last character in the string, if the last character is a word character.
      – Between two characters in the string, where one is a word character and the other is not a word character.
  • Limiting Repetition { }
    Modern regex flavors (regex engines), like those discussed in this tutorial, have an additional repetition operator that allows you to specify how many times a token can be repeated. The syntax is {min,max}, where min is a non-negative integer indicating the minimum number of matches, and max is an integer equal to or greater than min indicating the maximum number of matches. If the comma is present but max is omitted, the maximum number of matches is infinite. So {0,} is the same as *, and {1,} is the same as +. Omitting both the comma and max tells the engine to repeat the token exactly min times.
  • If you want to search for the literal text cat or dog, separate both options with a vertical bar or pipe symbol: cat|dog. If you want more options, simply expand the list: cat|dog|mouse|fish.
    The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you need to use round brackets for grouping. To improve the first example so that it matches whole words only, use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either "cat" or "dog", and then another word boundary.
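The precedence point in the note above can be checked with a small sketch. It uses Python's standard `re` module, which the training itself does not cover; the sample text is made up for illustration.

```python
import re

# Hypothetical sample text, chosen so each pattern behaves differently.
text = "concat cat catalog dog"

def matches(pattern, s):
    """Return every full match of pattern in s, left to right."""
    return [m.group(0) for m in re.finditer(pattern, s)]

# Plain alternation also finds "cat" inside longer words.
loose = matches(r"cat|dog", text)

# Without grouping, | splits the whole pattern: (\bcat) OR (dog\b).
half_anchored = matches(r"\bcat|dog\b", text)

# Grouping limits the alternation, so both boundaries apply to both options.
whole_words = matches(r"\b(cat|dog)\b", text)

print(loose)          # ['cat', 'cat', 'cat', 'dog']
print(half_anchored)  # ['cat', 'cat', 'dog']
print(whole_words)    # ['cat', 'dog']
```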

Regular expressions: Presentation Transcript

  • Paolo Carrasco The Nearshore Advantage!
  • • Definition
    • How a regex engine works
    • Applications
        – Matching
            • Regex Pattern
            • Characters
                – Literal Characters
                – Special characters
                – Character classes/sets
            • Anchors
            • Repetition and optional items
        – Extracting
            • Grouping
            • Alternation
            • Groups
        – Replacing
    • Advanced topics
        – Greediness
        – Lookahead and lookbehind
    • Techniques for Faster Expressions
  • • Basically, a regular expression is a pattern describing a certain amount of text.
    • The name comes from the mathematical theory on which regular expressions are based.
    • Regular expressions provide a powerful, flexible, and efficient method for processing text.
    • A regular expression is sometimes also called a regex or regexp.
  • Regex Engine
    The centerpiece of text processing with regular expressions is the regex engine. A regex "engine" is a piece of software that can process regular expressions, trying to match the pattern to the given string. Usually, the engine is part of a larger application and you do not access the engine directly.
  • Regex Engine
    At a minimum, processing text using regular expressions requires that the engine be provided with the following two items of information:
      • The regular expression pattern to identify in the text.
      • The text (input) to parse for the regular expression pattern.
    Optionally, a replacement string can be provided as well.
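As a rough illustration of those inputs, the sketch below hands a pattern, an input text, and an optional replacement string to Python's `re` module. The date pattern, the sample text, and the `<date>` placeholder are all assumptions made for the example, not part of the slides.

```python
import re

pattern = r"\d{4}-\d{2}-\d{2}"                      # hypothetical pattern: an ISO-style date
text = "Released 2013-05-01, updated 2013-06-15."   # hypothetical input text
replacement = "<date>"                              # optional replacement string

# Pattern + text: ask the engine for the first match.
first = re.search(pattern, text)

# Pattern + text + replacement: ask the engine to rewrite every match.
masked = re.sub(pattern, replacement, text)

print(first.group(0))  # 2013-05-01
print(masked)          # Released <date>, updated <date>.
```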
  • • The extensive pattern-matching notation of regular expressions enables you to quickly parse large amounts of text to find specific character patterns; to validate text to ensure that it matches a predefined pattern; to extract, edit, replace, or delete text substrings; and to add the extracted strings to a collection in order to generate a report.
    • The three main applications are matching, extracting, and replacing.
  • Applications: Matching
  • Pattern
    A regular expression is a pattern that the regular expression engine attempts to match in input text. A pattern consists of one or more character literals, operators, or constructs.
  • • Literal characters: every character except the special characters [ \ ^ $ . | ? * + ( ) simply stands for itself.
    • Each literal character matches a single instance of itself.
    • Note: regex engines are case sensitive by default.
  • • Most of the time, complex matches need special characters.
    • These characters are often called metacharacters.
    • The dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused one.
    • To use a special character as a literal, put a backslash in front of it.
    • The backslash escapes the special character and suppresses its special meaning.
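A short sketch of why the dot is easy to misuse, and of what escaping changes. It uses Python's `re` module; the sample text is made up for illustration.

```python
import re

text = "version 1.5 and build 125"  # hypothetical sample text

# Unescaped dot: "." matches any character, so "125" matches too.
unescaped = re.findall(r"1.5", text)

# Escaped dot: "\." matches only a literal period.
escaped = re.findall(r"1\.5", text)

# re.escape builds the escaped form of a literal string for you.
auto = re.escape("1.5")

print(unescaped)  # ['1.5', '125']
print(escaped)    # ['1.5']
print(re.findall(auto, text))  # ['1.5']
```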
  • Escaped characters:
      \t  Matches a tab, \u0009.
          Pattern: (\w+)\t      Matches: "item1\t", "item2\t" in "item1\titem2\t"
      \r  Matches a carriage return, \u000D. (\r is not equivalent to the newline character, \n.)
          Pattern: \r\n(\w+)    Matches: "\r\nThese" in "\r\nThese are\ntwo lines."
      \n  Matches a new line, \u000A.
          Pattern: \r\n(\w+)    Matches: "\r\nThese" in "\r\nThese are\ntwo lines."
      \s  Matches a white-space character.
          Pattern: \w+\s\w+     Matches: "Hello world" in "Hello world"
  • A character class matches any one of a set of characters. The order of the characters inside a character class does not matter.
      • Positive character group, [...]: a character in the input string must match one of a specified set of characters.
      • Negative character group, [^...]: a character in the input string must not match one of a specified set of characters.
      • Any character: the dot or period character is a wildcard that matches any character except \n.
      • Shorthand classes: since certain character classes are used often, a series of shorthand character classes is available, such as \d, \w, and \s. They can be used both inside and outside the square brackets, and some have negated versions, such as \D, \W, and \S.
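The groups described on this slide can be tried out with a minimal sketch in Python's `re` module. The sample strings are made up for illustration.

```python
import re

text = "gray grey gr8y"  # hypothetical sample text

positive = re.findall(r"gr[ae]y", text)   # one of the listed characters
negative = re.findall(r"gr[^ae]y", text)  # anything except the listed characters
wildcard = re.findall(r"gr.y", text)      # any character except a newline

# Shorthand classes work inside and outside square brackets.
inside = re.findall(r"[\d\s]", "a1 b2")   # a digit or a white-space character
negated = re.findall(r"\D+", "abc123def") # runs of non-digits

print(positive)  # ['gray', 'grey']
print(negative)  # ['gr8y']
print(wildcard)  # ['gray', 'grey', 'gr8y']
print(inside)    # ['1', ' ', '2']
print(negated)   # ['abc', 'def']
```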
  • Anchor  Description
    ^       The match must occur at the beginning of the string or line.
    $       The match must occur at the end of the string or line, or before \n at the end of the string or line.
    \b      The match must occur on a word boundary.
    \B      The match must not occur on a word boundary.
    • Anchors match a position before, after, or between characters.
    • They can be used to "anchor" the regex match at a certain position.
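A quick way to see that anchors match positions rather than characters is to run a few patterns through Python's `re` module. This is only a sketch; the strings are made up, and the multiline behavior shown is specific to how Python exposes the "string or line" distinction through `re.MULTILINE`.

```python
import re

# ^ and $ anchor to the start and end of the string.
start_ok = re.search(r"^a", "abc")   # matches "a"
start_no = re.search(r"^b", "abc")   # None: b is not at the start
end_ok = re.search(r"c$", "abc")     # matches "c"
end_no = re.search(r"a$", "abc")     # None: a is not at the end

# \b requires a word boundary, \B forbids one.
whole = re.findall(r"\bcat\b", "cat concat cat.")
inner = re.findall(r"\Bcat\b", "cat concat")

# In Python, ^ matches at each line start only with re.MULTILINE.
per_string = re.findall(r"^\w+", "one\ntwo")
per_line = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

print(whole)       # ['cat', 'cat']
print(inner)       # ['cat']  (the one at the end of "concat")
print(per_string)  # ['one']
print(per_line)    # ['one', 'two']
```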
  • • *   Match zero or more
    • +   Match one or more
    • ?   Match zero or one
    • {}  Interval
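The four repetition operators can be compared side by side with a small sketch in Python's `re` module. The sample strings are assumptions chosen to make the differences visible.

```python
import re

sample = "a ab abbb"            # hypothetical sample text
numbers = "1 12 123 1234 12345"

star = re.findall(r"ab*", sample)   # b zero or more times
plus = re.findall(r"ab+", sample)   # b one or more times
optional = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour")]

# Interval: {min,max}, {n} for exactly n, {min,} for no upper limit.
two_to_four = re.findall(r"\b\d{2,4}\b", numbers)
exactly_three = re.findall(r"\b\d{3}\b", numbers)
same_as_star = re.findall(r"ab{0,}", sample) == star
same_as_plus = re.findall(r"ab{1,}", sample) == plus

print(star)           # ['a', 'ab', 'abbb']
print(plus)           # ['ab', 'abbb']
print(optional)       # [True, True]
print(two_to_four)    # ['12', '123', '1234']
print(exactly_three)  # ['123']
```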
  • Applications: Extracting
  • • A group, also known as a subexpression, consists of an "open-group operator", any number of other operators, and a "close-group operator".
    • Regex treats this sequence as a unit, just as mathematics and programming languages treat a parenthesized expression as a unit.
  • • Alternation matches one of a choice of regular expressions:
        – If you put the character(s) representing the alternation operator between any two regular expressions A and B, the result matches the union of the strings that A and B match.
    • It operates on the largest possible surrounding regular expressions. Thus, the only way you can delimit its arguments is to use grouping.
  • • By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together.
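Grouping is also what makes extraction possible: each parenthesized part can be read back after a match. The sketch below uses Python's `re` module; the address, the key=value text, and the group names `user` and `host` are made up for illustration.

```python
import re

text = "contact: user@example.com"  # hypothetical sample text

# Numbered groups: group(0) is the whole match, group(1..n) the parts.
m = re.search(r"(\w+)@(\w+)\.com", text)
whole, user, host = m.group(0), m.group(1), m.group(2)

# Named groups (Python syntax) make the extracted parts self-describing.
named = re.search(r"(?P<user>\w+)@(?P<host>\w+)\.com", text)

# With several groups, findall returns one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", "a=1, b=22")

print(whole)                # user@example.com
print(user, host)           # user example
print(named.group("host"))  # example
print(pairs)                # [('a', '1'), ('b', '22')]
```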
  • Applications: Replacing
  • Replacement text syntax by flavor:
    Feature                                                      .NET  Java   Perl  ECMA            Ruby
    $& (whole regex match)                                       YES   error  YES   YES             no
    $0 (whole regex match)                                       YES   YES    no    no              no
    $1 through $99 (backreference)                               YES   YES    YES   YES             no
    ${1} through ${99} (backreference)                           YES   error  YES   no              no
    ${group} (named backreference)                               YES   error  no    no              no
    $` (backtick; subject text to the left of the match)         YES   error  YES   YES             no
    $' (straight quote; subject text to the right of the match)  YES   error  YES   YES             no
    $_ (entire subject string)                                   YES   error  YES   IE only         no
    $+ (highest-numbered group in the regex)                     YES   error  no    IE and Firefox  no
    $$ (escape dollar with another dollar)                       YES   error  no    YES             no
    $ (unescaped dollar as literal text)                         YES   error  no    YES             YES
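Python is not one of the flavors in the table, so the sketch below is only an illustration of the same idea in a different syntax: in Python's `re.sub`, backreferences in the replacement text are written `\1`, `\g<1>`, or `\g<name>` rather than with a dollar sign. The sample strings and group names are made up.

```python
import re

# Numbered backreferences in the replacement text.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Doe, Jane")

# Named backreferences.
named = re.sub(r"(?P<last>\w+), (?P<first>\w+)",
               r"\g<first> \g<last>", "Doe, Jane")

# \g<0> stands for the whole match.
bracketed = re.sub(r"\d+", r"[\g<0>]", "a1b22")

# A dollar sign has no special meaning in Python replacement text.
dollar = re.sub(r"EUR", "$", "5 EUR")

print(swapped)    # Jane Doe
print(named)      # Jane Doe
print(bracketed)  # a[1]b[22]
print(dollar)     # 5 $
```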
  • • The repetition operators or quantifiers are greedy.
    • They expand the match as far as they can, and only give back characters if they must to satisfy the remainder of the regex.
    • When greediness makes a match run too far, the quick fix is to make the quantifier lazy instead of greedy. Lazy quantifiers are sometimes also called "ungreedy" or "reluctant". You make a quantifier lazy by putting a question mark after it, for example +? instead of +.
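The classic way to see the difference is to match tags in a short piece of markup. This is a sketch using Python's `re` module with a made-up input; the third pattern is an alternative that avoids lazy matching by being specific about what may repeat.

```python
import re

html = "<em>first</em> and <em>second</em>"  # hypothetical sample text

# Greedy: .+ runs to the end and gives back only up to the last ">".
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first ">" that lets the match succeed.
lazy = re.findall(r"<.+?>", html)

# Specific: a negated class cannot run past ">" in the first place.
specific = re.findall(r"<[^>]+>", html)

print(greedy)    # ['<em>first</em> and <em>second</em>']
print(lazy)      # ['<em>', '</em>', '<em>', '</em>']
print(specific)  # ['<em>', '</em>', '<em>', '</em>']
```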
  • Negative lookahead
      • It is indispensable if you want to match something not followed by something else.
    Positive lookahead
      • It lets you match something only if it is followed by something else.
    Together with their lookbehind counterparts, these are called "lookaround". They do not consume characters in the string, but only assert whether a match is possible or not.
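A minimal sketch of the four lookaround forms, written with Python's `re` syntax: `(?=...)`, `(?!...)`, `(?<=...)`, and `(?<!...)`. The sample strings are made up for illustration.

```python
import re

text = "Iraq quit"  # hypothetical sample text

# Negative lookahead: a q NOT followed by u (the q in "Iraq", index 3).
neg_ahead = [m.start() for m in re.finditer(r"q(?!u)", text)]

# Positive lookahead: a q followed by u (the q in "quit", index 5).
pos_ahead = [m.start() for m in re.finditer(r"q(?=u)", text)]

# The u is asserted but not consumed: the match is still just "q".
consumed = re.search(r"q(?=u)", "quit").group(0)

# Lookbehind asserts what comes before the match instead.
priced = re.findall(r"(?<=\$)\d+", "cost $45, qty 3")
unpriced = re.findall(r"(?<!\$)\b\d+", "cost $45, qty 3")

print(neg_ahead)  # [3]
print(pos_ahead)  # [5]
print(consumed)   # q
print(priced)     # ['45']
print(unpriced)   # ['3']
```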
  • Common Sense Techniques
      • Avoid recompiling
      • Use non-capturing parentheses
      • Don't add superfluous parentheses
      • Don't use superfluous character classes
      • Use leading anchors
    Expose Anchors
      • Expose ^ and \G at the front of expressions
      • Expose $ at the end of expressions
    Lazy Versus Greedy: Be Specific
      • The repetition operators or quantifiers are greedy.
    Lead the Engine to a Match
      • Put the most likely alternative first
      • Distribute into the end of alternation
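Two of the common-sense techniques, avoiding recompilation and using non-capturing parentheses, might look like this in Python. This is a sketch under assumptions: `WORD` and `count_pets` are hypothetical names, and since Python's `re` module keeps its own cache of recently compiled patterns, the gain from precompiling is likely to be modest there.

```python
import re

# Compile once, outside the loop, instead of passing the pattern
# string to re.findall on every call.
# (?:...) groups the alternation without capturing anything.
WORD = re.compile(r"\b(?:cat|dog)\b")

def count_pets(lines):
    """Count whole-word occurrences of cat or dog across all lines."""
    return sum(len(WORD.findall(line)) for line in lines)

total = count_pets(["cat and dog", "catalog", "dog"])
print(total)  # 3
```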
  • • Regex Testers
        – http://www.gskinner.com/RegExr/
        – http://osteele.com/tools/rework/
    • Regex Patterns Library
        – http://regexlib.com
    • Complete Tutorials
        – http://www.regular-expressions.info/
        – Javascript: http://www.w3schools.com/jsref/jsref_obj_regexp.asp
        – .NET: http://msdn.microsoft.com/en-us/library/hs600312.aspx
        – Java: http://java.sun.com/docs/books/tutorial/essential/regex/index.html