Regular Expression
Become a Deft
Manipulator of Text Data
Presented by Lambert Lum
Beginners Welcome
No previous Perl instruction required.
Practitioners of other languages welcome
PCRE
PCRE: Perl compatible Regular Expressions
Available for non-Perl languages.
We don’t cover
Infrequently used regex
Optimizations
Abbreviations
Regex is a shortened term for regular expression.
Perl Refresher
my $var;
$scalar = "jay";
@list = ($scalar, "leno");
Jay Leno
Q: Who made the first American flag?
Actual question from
Jay Leno’s Jaywalker segment
Jay Leno
Hint:
Last name begins with 'r',
ends with 'ss', and
has an 'o' in the middle
Answer:
Betsy Ross
Regex
$last_name =~ m{^r.*o.*ss$};
ross
rouss
rokess
rodiss
ridoss
Symbols
^     Begins with
$     Ends with
.     Any character
*     Zero or more times
.*    Any character, zero or more times
$last_name =~ m{^r.*o.*ss$};
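The pattern can be tried against a handful of names. This is only an illustrative sketch: the first five names are the ones listed on the earlier slide, while 'rss' and 'boss' are assumed non-matches added here for contrast.

```perl
use strict;
use warnings;

# The first five names come from the slides; 'rss' and 'boss' are
# hypothetical non-matches added for contrast.
my @names = qw(ross rouss rokess rodiss ridoss rss boss);

my @matched;
for my $last_name (@names) {
    # ^r    begins with r
    # .*o   any characters, then an o
    # .*    any characters again
    # ss$   ends with ss
    if ($last_name =~ m{^r.*o.*ss$}) {
        push @matched, $last_name;
    }
}

print "matched: @matched\n";
```

'rss' fails because it has no 'o', and 'boss' fails because it does not begin with 'r'.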
Quantifier Symbols
*        Zero or more times
+        One or more times
?        Zero or one time
{2}      2 times
{2,}     2 or more times
{2,4}    2 to 4 times
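One way to get a feel for the quantifiers is to apply each of them to the same letter and see which strings pass. The sketch below does that for the letter 'b'; the test strings are made up for illustration and do not come from the slides.

```perl
use strict;
use warnings;

my @results;

# Hypothetical test strings with 0, 1, 2 and 5 copies of 'b'.
for my $word ('ac', 'abc', 'abbc', 'abbbbbc') {
    my @hits;
    push @hits, '*'     if $word =~ m{^ab*c$};
    push @hits, '+'     if $word =~ m{^ab+c$};
    push @hits, '?'     if $word =~ m{^ab?c$};
    push @hits, '{2}'   if $word =~ m{^ab{2}c$};
    push @hits, '{2,}'  if $word =~ m{^ab{2,}c$};
    push @hits, '{2,4}' if $word =~ m{^ab{2,4}c$};

    # Record which quantifiers accepted this word.
    push @results, "${word}: @hits";
}

print "$_\n" for @results;
```

Each output line names a string followed by the quantifiers that accepted it.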
Practice #1
Q: Which of the following match m{^r.*o.*s{2,4}$}?
a) ros
b) ross
c) rosss
d) rossss
e) rossssss
f) answers b,c,d
g) answers b,c,d,e
Practice #1
Answer: g) answers b,c,d,e
In e), the second .* absorbs the extra 's' characters, leaving 2 to 4 for s{2,4}.
Match, substitute, split
# match
my $last_name = 'ross';
if ($last_name =~ m{^r.*o.*ss$}) {
    print "match found\n";
}
# substitution
my $full_name = "betsy ross";
$full_name =~ s{\s+}{-};
print "substitute space with dash: $full_name\n";
# split
$full_name = "ross, betsy";
my @nomens = split m{,\s*}, $full_name;
print join(':', @nomens) . "\n";
Match, substitute, split
# match
my $last_name = 'ross';
if ($last_name =~ /^r.*o.*ss$/) {
    print "match found\n";
}
# substitution
my $full_name = "betsy ross";
$full_name =~ s/\s+/-/;
print "substitute space with dash: $full_name\n";
# split
$full_name = "ross, betsy";
my @nomens = split /,\s*/, $full_name;
print join(':', @nomens) . "\n";
// vs {}
Traditional: //
Most common. Awkward when matching a literal '/', which must be escaped.
e.g. $str =~ /\//;
Better: {}
Nested (balanced) braces inside the regex do not need escaping.
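The difference between the two delimiter styles shows up as soon as the pattern contains a slash. The following is a small sketch; the sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $path = 'usr/local/bin';    # hypothetical sample string

# With // delimiters, a literal slash has to be escaped.
my $slash_style = ($path =~ /\//) ? 'yes' : 'no';

# With {} delimiters, the same slash needs no escape.
my $brace_style = ($path =~ m{/}) ? 'yes' : 'no';

# Balanced braces inside the pattern, such as a quantifier, are fine too.
my $nested = ('rosss' =~ m{^ros{2,4}$}) ? 'yes' : 'no';

print "slash style: $slash_style\n";
print "brace style: $brace_style\n";
print "nested:      $nested\n";
```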
Practice #2
Using regex, remove all whitespace from the beginning and end of a
string. Hint: Use two regexes.
Practice #2
$last_name =~ s{^\s+}{};
$last_name =~ s{\s+$}{};
Non-destructive
Substitution
$full_name = "betsy ross";
$new_name = $full_name =~ s{\s+}{-}r; # r modifier
print "$full_name becomes $new_name\n";
Character Symbols
.     Any character
\s    Whitespace
\S    Non-whitespace
\d    Digit, 0-9
\D    Non-digit
\\    Backslash
\.    Literal dot
\     Quote meta-character
Negated match
$last_name !~ m{^r.*o.*ss$};
Practice #3
Write a regex that determines if a non-empty string is entirely
composed of digits.
Bonus: do it two different ways
Practice #3
$str =~ m{^\d+$};
$str !~ m{\D};
Character class
split m{[.,;\s]+}, $names;
$names =~ s{[^a-zA-Z0-9]}{-}g;
# leading ^ will negate the character class
# g modifier for global
# run character_class.pl
Modifiers
r    Non-destructive substitution
i    Case insensitive
g    Global match
Case insensitive
$last_name =~ m{^r.*o.*ss$}i;
Global match
@list = $str =~ m{[a-z]+}g;
$scalar = $str =~ m{[a-z]+}g;
$str =~ s{[^a-z]+}{-}g;
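The g modifier behaves differently depending on context. The sketch below shows list context, a scalar-context loop, and a global substitution; the sample string is hypothetical.

```perl
use strict;
use warnings;

my $str = 'betsy ross 1776';    # hypothetical sample string

# List context: g returns every match at once.
my @words = $str =~ m{[a-z]+}g;

# Scalar context: g finds one match per call and remembers its position,
# so a while loop walks through the matches one at a time.
my @seen;
while ($str =~ m{([a-z]+)}g) {
    push @seen, $1;
}

# Substitution: g replaces every run of non-letters, not just the first.
# Copying first keeps $str itself unchanged.
(my $dashed = $str) =~ s{[^a-z]+}{-}g;

print "words:  @words\n";
print "seen:   @seen\n";
print "dashed: $dashed\n";
```

Assigning a g match straight to a scalar, as on the slide, only reports whether the next match succeeded.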
Grouping
($prefix, $last_name)
= $str =~ m{(mr[s]?[.]?)\s+(\S+)}i;
Alternation
($prefix, $last_name) = $str
=~ m{(mister|misses|miss|mr[s]?[.]?)\s+(\S+)}i;
Alternation
non-capturing group
($prefix, $last_name) = $str
=~ m{(mister|miss(?:es)?|mr[s]?[.]?)\s+(\S+)}i;
# the 'es' is in a non-capturing group
Alternation
non-capturing group
($prefix, $last_name) = $str
=~ m{(mi(?:ster|ss(?:es)?)|mr[s]?[.]?)\s+(\S+)}i;
# Too many non-capturing groups. Hard to read.
Modifiers
r    Non-destructive substitution
i    Case insensitive
g    Global match
n    Make all groups non-capturing
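The effect of the n modifier can be seen by running the same pattern with and without it. This is a minimal sketch that assumes Perl 5.22 or newer, where the n modifier is available; the sample string is hypothetical.

```perl
use strict;
use warnings;
use 5.022;    # assumption: the n modifier needs Perl 5.22 or newer

my $str = 'Mrs. Ross';    # hypothetical sample string

# With n: the parentheses only group, so $1 is not filled in.
my $matched_n = ($str =~ m{(mr[s]?[.]?)\s+(\S+)}in) ? 1 : 0;
my $first_n   = defined $1 ? $1 : 'undef';

# Without n: the same parentheses capture the prefix and the last name.
my ($prefix, $last_name) = $str =~ m{(mr[s]?[.]?)\s+(\S+)}i;

print "with n:    matched=$matched_n, \$1=$first_n\n";
print "without n: prefix=$prefix, last name=$last_name\n";
```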
Practice #4
# what value is printed for year?
my $str = "Copyright 2013";
my $year;
($year) = $str =~ m{.*([0-9]+)};
print "$year\n";
Practice #4
Year is 3
Greedy vs. non-greedy
($year) = $str =~ m{.*([0-9]+)};
# $year is 3
# greedy maximizes the matching
($year) = $str =~ m{.*?([0-9]+)};
# $year is 2013
# non-greedy minimizes the matching
Non-greedy
*?
+?
??
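Each non-greedy quantifier can be compared with its greedy counterpart on the same input. The strings below are made up for illustration only.

```perl
use strict;
use warnings;

# All sample strings below are hypothetical.
my $markup = '<b>stars</b> and <b>stripes</b>';

# * versus *?  (as much as possible versus as little as possible)
my ($star_greedy) = $markup =~ m{<b>(.*)</b>};
my ($star_lazy)   = $markup =~ m{<b>(.*?)</b>};

# + versus +?  (the lazy form stops after a single digit)
my ($plus_greedy) = '2013' =~ m{(\d+)};
my ($plus_lazy)   = '2013' =~ m{(\d+?)};

# ? versus ??  (the lazy form prefers to skip the optional 's')
my ($opt_greedy) = 'ross' =~ m{(ros?)};
my ($opt_lazy)   = 'ross' =~ m{(ros??)};

print "*  : $star_greedy\n";
print "*? : $star_lazy\n";
print "+  : $plus_greedy\n";
print "+? : $plus_lazy\n";
print "?  : $opt_greedy\n";
print "?? : $opt_lazy\n";
```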
Back reference
my $str = "****Spangled****";
my ($star, $word) = $str =~ m{^([*]+)([^*]+)\1$};
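Inside a pattern, \1 stands for whatever the first group captured, so the closing run of stars must be identical to the opening run. A short sketch, with a second, lopsided string added here as an assumed counterexample:

```perl
use strict;
use warnings;

my $balanced = '****Spangled****';
my $lopsided = '****Spangled**';    # hypothetical counterexample

# \1 must repeat exactly what group 1 matched at the start.
my ($star, $word) = $balanced =~ m{^([*]+)([^*]+)\1$};

# Four stars in front but only two behind: \1 cannot be satisfied.
my $lopsided_ok = ($lopsided =~ m{^([*]+)([^*]+)\1$}) ? 1 : 0;

print "star=$star word=$word\n";
print "lopsided matches: $lopsided_ok\n";
```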
Match Variables
$str = "Betsy Ross";
$str =~ s{(\S+)\s(\S+)}{$2, $1};
print "$str\n";
print "1: $1\n";
print "2: $2\n";
Practice #5
Using regex, remove all whitespace from the beginning and end of a
string. This time, do it with one regex, not two.
Practice #5
$str =~ s{^\s*(.+?)\s*$}{$1};
Look-around
?<=    Look behind
?=     Look ahead
Look-around
my $pop = 281421906;
# 281,421,906
print "The US population is $pop\n";
# add a comma between each group of three digits
$pop =~ s{(?<=\d)(?=(\d\d\d)+$)}{,}g;
print "The US population is $pop\n";
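The comma-inserting substitution can be wrapped in a small subroutine and tried on numbers of different lengths. This is a sketch only: the name commify is a hypothetical helper, and the shorter inputs are assumed examples.

```perl
use strict;
use warnings;

# commify() is a hypothetical helper name, not something from the slides.
sub commify {
    my ($number) = @_;
    # (?<=\d)          a digit sits just before this position
    # (?=(\d\d\d)+$)   whole groups of three digits run from here to the end
    # Both conditions hold only at the gaps where a comma belongs.
    $number =~ s{(?<=\d)(?=(\d\d\d)+$)}{,}g;
    return $number;
}

print commify(281421906), "\n";
print commify(1234), "\n";
print commify(12), "\n";
```

Neither look-around consumes any characters, so the substitution inserts commas without replacing digits.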
Negated Look-around
?<!    Negated look behind
?!     Negated look ahead
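Negated look-arounds succeed only when the given text is absent next to the current position. The word list below is hypothetical and serves only to illustrate both forms.

```perl
use strict;
use warnings;

my @names = ('ross', 'miss', 'boss', 'moose');    # hypothetical word list

# Negated look behind: 'ss' that is NOT preceded by 'o'.
my @ss_without_o = grep { m{(?<!o)ss} } @names;

# Negated look ahead: 'o' that is NOT followed by 's'.
my @o_without_s = grep { m{o(?!s)} } @names;

print "ss not after o:  @ss_without_o\n";
print "o not before s:  @o_without_s\n";
```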
/xms
/x  allow whitespace and comments inside the regex
/m  ^ and $ match at line boundaries, as in awk/grep/sed
/s  '.' also matches newline, for multi-line text
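The three modifiers can be seen separately in a short sketch. The two-line sample text is an assumption made for illustration.

```perl
use strict;
use warnings;

my $text = "betsy\nross\n";    # hypothetical two-line sample

# m modifier: ^ matches at the start of every line, not just the string.
my @line_starts  = $text =~ m{^(\w)}mg;
my @string_start = $text =~ m{^(\w)}g;

# s modifier: '.' is allowed to match the newline between the two words.
my $dot_default = ($text =~ m{betsy.ross})  ? 1 : 0;
my $dot_with_s  = ($text =~ m{betsy.ross}s) ? 1 : 0;

# x modifier: whitespace and comments inside the pattern are ignored.
my $commented = ('ross' =~ m{
    ^ r      # begins with r
    .* o     # any characters, then an o
    .* ss $  # any characters, then ends with ss
}x) ? 1 : 0;

print "line starts:  @line_starts\n";
print "string start: @string_start\n";
print "dot default=$dot_default, dot with s=$dot_with_s\n";
print "commented pattern matches: $commented\n";
```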
\A and \z
# by default, beginning and end of string
# with m, beginning and end of line
$last_name =~ m{^r.*o.*ss$};
# always beginning and end of string
$last_name =~ m{\Ar.*o.*ss\z};
Anchors
^     Begins with
$     Ends with
\A    String begins with
\z    String ends with
\b    Word boundary
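Two of the anchors are easiest to understand by contrast: \z against $, and \b against no anchor at all. The sample strings in this sketch are hypothetical.

```perl
use strict;
use warnings;

# $ also matches just before a trailing newline; \z does not.
my $with_newline = "ross\n";    # hypothetical sample
my $dollar_ok = ($with_newline =~ m{^ross$})   ? 1 : 0;
my $z_ok      = ($with_newline =~ m{\Aross\z}) ? 1 : 0;

# \b restricts the match to 'ross' standing alone as a word.
my $sentence   = 'ross crossed across';    # hypothetical sample
my @whole_word = $sentence =~ m{\b(ross)\b}g;
my $anywhere   = () = $sentence =~ m{ross}g;

print "dollar=$dollar_ok z=$z_ok\n";
print "whole words: ", scalar(@whole_word), ", anywhere: $anywhere\n";
```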
tr
my $name = "JagerMech";
$name =~ tr/A-Z/a-z/;
# one-for-one translation.
# replace chars on the left with those on the right.
print "$name\n";
# jagermech
# tr is almost never used
\w
Not used on the homework.
Almost never used.
\w matches alphanumerics and '_'
Further Reading
Mastering Regular Expressions
perldoc perlrequick
perldoc perlretut
perldoc perlre
