Regular Expressions
Jesse Anderson
What Are They?
• Language to parse text
• Apply logic and constraints
• Concise (but not readable)
• Consistent (mostly)
• Widely supported in programming languages
Hello Regex
Source Text                  Regular Expression   Yield
“hello world”                “hello”              { “hello” }
“hello world hello world”    “hello”              { “hello”, “hello” }
“hello world hello world”    “world”              { “world”, “world” }
“hello world hello world”    “hello world”        { “hello world”, “hello world” }
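The rows above can be reproduced with a few lines of Python. This is a minimal sketch that assumes Python's standard re module, whose findall function returns every non-overlapping match as a list; the helper name yields is made up for illustration.

```python
import re

def yields(source_text, regular_expression):
    """Return every non-overlapping match of the pattern in the text."""
    return re.findall(regular_expression, source_text)

# One row of the table per call
print(yields("hello world", "hello"))
print(yields("hello world hello world", "hello"))
print(yields("hello world hello world", "world"))
print(yields("hello world hello world", "hello world"))
```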
Java Regex Code
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern = Pattern.compile("hello");
Matcher matcher = pattern.matcher("hello world");
// Find all matches
while (matcher.find()) {
    // Get the matching string
    String match = matcher.group();
    // match = "hello"
}
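C# Regex Code
using System.Text.RegularExpressions;

// IgnoreCase makes the pattern also match "Hello", "HELLO", and so on
foreach (Match m in Regex.Matches("hello world", "hello",
        RegexOptions.IgnoreCase)) {
    // Get the matching string (the loop variable needs its own name)
    string match = m.Value;
    // match = "hello"
}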
Python Regex Code
import re

regex = re.compile("hello")
# search() returns a match object for the first match, or None
results = regex.search("hello world")
# results.group() == "hello"
Perl Regex Code
$value = "hello world";
# Parentheses capture the matched text into $1
$value =~ m/(hello)/;
$result = $1;
# $result = "hello"
The (Ugly) Alternative
String needle = "hello";
String haystack = "hello world hello world";
int index = 0;
// Find each occurrence by hand, without a regex
while ((index = haystack.indexOf(needle, index)) != -1) {
    String match = haystack.substring(index, index + needle.length());
    // Skip past this match so the next search starts after it
    index += needle.length();
}
Regex Metacharacters
• * - Match the preceding element zero or more times
• ? - Match the preceding element zero or one time
• + - Match the preceding element one or more times
• ^ - Match the start of a string
• $ - Match the end of a string
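A short Python sketch may make the metacharacters concrete. It assumes Python's standard re module; the helper found and the sample patterns are made up for illustration. Each quantifier applies to the single character before it (here, the letter b).

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# (pattern, text) pairs exercising each metacharacter
samples = [
    (r"ab*c", "ac"),            # * allows zero b's
    (r"ab+c", "ac"),            # + requires at least one b
    (r"ab+c", "abbc"),          # + allows several b's
    (r"ab?c", "abc"),           # ? allows one b
    (r"ab?c", "abbc"),          # ? does not allow two b's
    (r"^hello", "hello world"), # ^ anchors at the start
    (r"^world", "hello world"),
    (r"world$", "hello world"), # $ anchors at the end
    (r"hello$", "hello world"),
]

for pattern, text in samples:
    print(pattern, repr(text), found(pattern, text))
```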
Character Classes
POSIX      Shorthand  Longhand         Description
[:word:]   \w         [A-Za-z0-9_]     Alphanumeric characters
           \W         [^A-Za-z0-9_]    Non-alphanumeric characters
[:alpha:]             [A-Za-z]         Alphabetic characters
[:blank:]             [ \t]            Space and tab
[:digit:]  \d         [0-9]            Numeric characters
           \D         [^0-9]           Non-numeric characters
[:space:]  \s         [ \t\r\n\v\f]    Whitespace characters
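The shorthand and longhand forms can be compared side by side in Python. This sketch assumes Python's standard re module, which does not implement the POSIX names, so only the other two columns appear. The sample text is made up, and the re.ASCII flag is used because Python's \w and \d otherwise also match non-ASCII letters and digits.

```python
import re

text = "Room 42,\tfloor_3\n"  # hypothetical sample input

# Shorthand (ASCII-only) and longhand should agree
words_short = re.findall(r"\w+", text, flags=re.ASCII)
words_long = re.findall(r"[A-Za-z0-9_]+", text)

digits_short = re.findall(r"\d+", text, flags=re.ASCII)
digits_long = re.findall(r"[0-9]+", text)

# Each whitespace character is returned as its own match
whitespace = re.findall(r"\s", text)

print(words_short, words_long)
print(digits_short, digits_long)
print(whitespace)
```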
Groups
Source Text          Regular Expression            Yield
“hello world”        “([a-z]+)\s+([a-z]+)”         { “hello world”, “hello”, “world” }
“hello world12345”   “([a-z]+)\s+([a-z]+)”         { “hello world”, “hello”, “world” }
“hello world12345”   “([a-z]+)\s+([a-z]+)(\d+)”    { “hello world12345”, “hello”, “world”, “12345” }
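The yields in the Groups table can be checked with a small Python sketch. It assumes Python's standard re module, where group(0) is the whole match and groups() returns the parenthesized captures in order; the helper name groups_of is made up for illustration.

```python
import re

def groups_of(pattern, text):
    """Return [whole match, group 1, group 2, ...] or None if no match."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return [m.group(0), *m.groups()]

print(groups_of(r"([a-z]+)\s+([a-z]+)", "hello world"))
print(groups_of(r"([a-z]+)\s+([a-z]+)", "hello world12345"))
print(groups_of(r"([a-z]+)\s+([a-z]+)(\d+)", "hello world12345"))
```

In the second row the trailing digits are simply left unmatched, which is why the yield is the same as in the first row.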
Example
• Example that parses, cleans up, and normalizes input
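The slide's own example is not included in the text, so the following is only an illustrative stand-in. It assumes a hypothetical input format of "last, first" with untidy spacing and casing, and uses Python's standard re module; the function name normalize_name and the format are assumptions, not part of the original deck. It uses only the pieces covered so far: anchors, quantifiers, character classes, and groups.

```python
import re

# Hypothetical input format: "last, first" with arbitrary whitespace
NAME_PATTERN = re.compile(r"^\s*([A-Za-z]+)\s*,\s*([A-Za-z]+)\s*$")

def normalize_name(raw):
    """Parse 'last, first', strip stray whitespace, and return 'First Last'.

    Returns None when the input does not fit the assumed format.
    """
    m = NAME_PATTERN.search(raw)
    if m is None:
        return None
    last, first = m.groups()
    return f"{first.capitalize()} {last.capitalize()}"

print(normalize_name("  anderson ,   JESSE  "))
print(normalize_name("12345"))
```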
Recommended Reading
• Mastering Regular Expressions by Jeffrey E. F. Friedl
• Regular Expressions Cheat Sheet
http://www.addedbytes.com/cheat-sheets/regular-e
• Regex Evaluator
http://www.cuneytyilmaz.com/prog/jrx/