Regular Expression

Regular Expression
Become a Deft Manipulator of Text Data
Presented by Lambert Lum

Beginners Welcome
No previous Perl instruction required.
Practitioners of other languages welcome

PCRE
PCRE: Perl Compatible Regular Expressions
Available in many non-Perl languages.

We don’t cover
Infrequently used regex
Optimizations

Abbreviations
Regex is a shortened term for regular expression.

Perl Refresher
my $var;                        # declare a variable
my $scalar = "jay";             # scalar: holds a single value
my @list = ($scalar, "leno");   # array: holds a list of values

Jay Leno
Q: Who made the first American flag?
Actual question from
Jay Leno’s Jaywalking segment

Jay Leno
Hint:
Last name begins with 'r',
ends with 'ss', and
has an 'o' in the middle
Answer:
Betsy Ross

Regex
$last_name =~ m{^r.*o.*ss$};

Regex
ross
rouss
rokess
rodiss
ridoss
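
A quick way to check the names above is to run each one through the pattern. The following is a minimal sketch; the @candidates array and the two extra non-matching names (rose, boss) are assumptions added for contrast and are not from the original slide.

```perl
use strict;
use warnings;

# Names from the slide, plus two hypothetical non-matches for contrast.
my @candidates = qw(ross rouss rokess rodiss ridoss rose boss);

# Keep only names that begin with 'r', contain an 'o', and end with 'ss'.
my @matched = grep { m{^r.*o.*ss$} } @candidates;

print "$_\n" for @matched;
```

Only the five names from the slide should be printed; "rose" fails the 'ss' ending and "boss" fails the leading 'r'.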

Symbols
^     Begins with
$     Ends with
.     Any character
*     Zero or more times
.*    Any character, zero or more times

Quantifier Symbols
*        Zero or more times
+        One or more times
?        Zero or one time
{2}      2 times
{2,}     2 or more times
{2,4}    2 to 4 times

Practice #1
Q: Which of the following match m{^r.*o.*s{2,4}$}?
a) ros
b) ross
c) rosss
d) rossss
e) rossssss
f) answers b, c, d
g) answers b, c, d, e

Practice #1
Answer: g) answers b,c,d,e
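
Option e) matches because the second .* is free to absorb the surplus 's' characters, leaving between two and four for s{2,4}. A minimal sketch that could be used to confirm the answer (the %result hash is just an illustrative name):

```perl
use strict;
use warnings;

my @options = qw(ros ross rosss rossss rossssss);
my %result;

for my $word (@options) {
    # s{2,4} needs 2 to 4 trailing 's'; any extra 's' can be consumed by .*
    $result{$word} = $word =~ m{^r.*o.*s{2,4}$} ? 'match' : 'no match';
    print "$word: $result{$word}\n";
}
```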

Match, substitute, split
# match
my $last_name = 'ross';
if ($last_name =~ m{^r.*o.*ss$}) {
    print "match found\n";
}
# substitution
my $full_name = "betsy ross";
$full_name =~ s{\s+}{-};
print "substitute space with dash: $full_name\n";
# split
$full_name = "ross, betsy";
my @nomens = split m{,\s*}, $full_name;
print join(':', @nomens) . "\n";

Match, substitute, split
# match
my $last_name = 'ross';
if ($last_name =~ /^r.*o.*ss$/) {
    print "match found\n";
}
# substitution
my $full_name = "betsy ross";
$full_name =~ s/\s+/-/;
print "substitute space with dash: $full_name\n";
# split
$full_name = "ross, betsy";
my @nomens = split /,\s*/, $full_name;
print join(':', @nomens) . "\n";

// vs {}
Traditional: //
Most common. Awkward when matching a literal '/', which must be escaped,
e.g. $str =~ /\//;
Better: {}
A literal '/' needs no escaping, and nested (balanced) braces inside the regex do not need escaping either.
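
To make the trade-off concrete, here is a small sketch that matches a Unix-style path both ways. The path and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin";   # hypothetical sample input

# With // delimiters, every literal slash must be escaped.
my $with_slashes = $path =~ /^\/usr\/local\// ? 1 : 0;

# With {} delimiters, slashes need no escaping, and the balanced
# braces of the {2,} quantifier are fine too.
my $with_braces = $path =~ m{^/usr/local/[a-z]{2,}$} ? 1 : 0;

print "slashes: $with_slashes, braces: $with_braces\n";
```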

Practice #2
Using regex, remove all whitespace from the beginning and end of a
string. Hint: Use two regexes.

Practice #2
$last_name =~ s{^\s+}{};
$last_name =~ s{\s+$}{};

Non-destructive Substitution
$full_name = "betsy ross";
$new_name = $full_name =~ s{\s+}{-}r; # r modifier
print "$full_name becomes $new_name\n";

Character Symbols
.     Any character
\s    Whitespace
\S    Non-whitespace
\d    Digit, 0-9
\D    Non-digit
\\    Backslash
\.    Literal dot
\     Quote meta-character

Negated match
$last_name !~ m{^r.*o.*ss$};

Practice #3
Write a regex that determines if a non-empty string is entirely
composed of digits.
Bonus: do it two different ways

Practice #3
$str =~ m{^\d+$};
$str !~ m{\D};

Character class
split m{[.,;\s]+}, $names;
$names =~ s{[^a-zA-Z0-9]}{-}g;
# leading ^ will negate the character class
# g modifier for global
# run character_class.pl

Modifiers
r    Non-destructive substitution
i    Case insensitive
g    Global match

Case insensitive
$last_name =~ m{^r.*o.*ss$}i;

Global match
@list = $str =~ m{[a-z]+}g;
$scalar = $str =~ m{[a-z]+}g;
$str =~ s{[^a-z]+}{-}g;
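
The three statements above behave differently depending on context. The following is a minimal sketch; the sample string and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $str = "betsy ross 1776";   # hypothetical sample input

# List context: /g returns every match.
my @words = $str =~ m{[a-z]+}g;

# Scalar context: /g only reports whether a (next) match was found.
my $found = $str =~ m{[a-z]+}g ? 1 : 0;

# Substitution: /g replaces every run of non-letters, not just the first.
# The copy keeps $str itself unchanged.
(my $dashed = $str) =~ s{[^a-z]+}{-}g;

print "words: @words\n";
print "found: $found\n";
print "dashed: $dashed\n";
```

In this sketch @words holds 'betsy' and 'ross', and $dashed becomes 'betsy-ross-' because the trailing " 1776" is one run of non-letters.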

Grouping
($prefix, $last_name)
= $str =~ m{(mr[s]?[.]?)\s+(\S+)}i;

Alternation
($prefix, $last_name) = $str
=~ m{(mister|misses|miss|mr[s]?[.]?)\s+(\S+)}i;

Alternation
non-capturing group
($prefix, $last_name) = $str
=~ m{(mister|miss(?:es)?|mr[s]?[.]?)\s+(\S+)}i;
# the 'es' is in a non-capturing group

Alternation
non-capturing group
($prefix, $last_name) = $str
=~ m{(mi(?:ster|ss(?:es)?)|mr[s]?[.]?)\s+(\S+)}i;
# Too many non-capturing groups; hard to read

Modifiers
r    Non-destructive substitution
i    Case insensitive
g    Global match
n    Make all groups non-capturing

Practice #4
# what value is printed for year?
my $str = "Copyright 2013";
my $year;
($year) = $str =~ m{.*([0-9]+)};
print "$year\n";

Greedy vs. non-greedy
($year) = $str =~ m{.*([0-9]+)};
# $year is 3
# greedy maximizes the matching
($year) = $str =~ m{.*?([0-9]+)};
# $year is 2013
# non-greedy minimizes the matching
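
One way to see why the results differ is to capture what .* consumed as well. This is a sketch only; the extra capture group and the variable names are additions for illustration.

```perl
use strict;
use warnings;

my $str = "Copyright 2013";

# Capture the part consumed by .* too, to see where the split happens.
my ($greedy_prefix, $greedy_year) = $str =~ m{(.*)([0-9]+)};
my ($lazy_prefix,   $lazy_year)   = $str =~ m{(.*?)([0-9]+)};

print "greedy: prefix='$greedy_prefix' year='$greedy_year'\n";
print "lazy:   prefix='$lazy_prefix' year='$lazy_year'\n";
```

The greedy .* grabs as much as it can and gives back only the single digit that [0-9]+ requires; the non-greedy .*? stops as soon as the digits can begin.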

Back reference
my $str = "****Spangled****";
my ($star, $word) = $str =~ m{^([*]+)([^*]+)\1$};
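
A back reference such as \1 must repeat the exact text that group 1 captured. The sketch below illustrates this with a balanced and an unbalanced string; the second string and the $ok flag are assumptions added for contrast.

```perl
use strict;
use warnings;

my $str = "****Spangled****";
my ($star, $word) = $str =~ m{^([*]+)([^*]+)\1$};
print "star='$star' word='$word'\n";

# Unbalanced stars: \1 cannot repeat the text of group 1, so no match.
my $lopsided = "****Spangled**";   # hypothetical counter-example
my $ok = $lopsided =~ m{^([*]+)([^*]+)\1$} ? 1 : 0;
print "lopsided matches: $ok\n";
```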

Match Variables
$str = "Betsy Ross";
$str =~ s{(\S+)\s(\S+)}{$2, $1};
print "$str\n";
print "1: $1\n";
print "2: $2\n";

Practice #5
Using regex, remove all spaces from the beginning and end of a
string. This time, do it with one regex, not two.

Practice #5
$str =~ s{^\s*(.+?)\s*$}{$1};

Look-around
(?<=...)    Look behind
(?=...)     Look ahead

Look-around
my $pop = 281421906;
# 281,421,906
print "The US population is $pop\n";

Look-around
my $pop = 281421906;
# 281,421,906
# add a comma between each group of three digits
$pop =~ s{(?<=\d)(?=(\d\d\d)+$)}{,}g;
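
The look-around substitution can be wrapped in a small helper to try it on several numbers. This is a sketch; commify is a hypothetical name, not something defined in the slides.

```perl
use strict;
use warnings;

# commify() is a hypothetical helper wrapping the look-around regex.
sub commify {
    my ($number) = @_;
    # Insert a comma at every position that has a digit behind it and
    # a multiple of three digits between it and the end of the string.
    $number =~ s{(?<=\d)(?=(\d\d\d)+$)}{,}g;
    return $number;
}

print commify($_), "\n" for (906, 1906, 281421906);
```

Because look-arounds match a position rather than characters, nothing is consumed and only the commas are inserted.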

Negated Look-around
(?<!...)    Negated look behind
(?!...)     Negated look ahead

/xms
/x allows whitespace and comments inside the regex
/m makes ^ and $ match at line boundaries, as in awk/grep/sed
/s treats multi-line text as one string (‘.’ also matches newline)
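
A minimal sketch of the three modifiers in action, assuming a made-up two-line string; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "flag\nross\n";   # hypothetical two-line input

# /x: whitespace and comments inside the pattern are ignored.
# /m: ^ and $ match at the start and end of each line.
my $line_match = $text =~ m{
    ^ r .* o .* ss $    # the Betsy Ross pattern, spaced out
}xm ? 1 : 0;

# /s: '.' also matches newline, so .* can span both lines.
my $span_match = $text =~ m{flag.*ross}s ? 1 : 0;

# Without /s, '.' stops at the newline and the match fails.
my $no_s_match = $text =~ m{flag.*ross} ? 1 : 0;

print "$line_match $span_match $no_s_match\n";
```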

\A and \z
# ^ and $: by default, beginning and end of string
# ^ and $ with /m: beginning and end of line
# \A and \z: always beginning and end of string
$last_name =~ m{\Ar.*o.*ss\z};
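
One practical difference shows up with input that still carries its trailing newline, such as a line read from a file. A short sketch, assuming a sample value of my own choosing:

```perl
use strict;
use warnings;

my $name = "ross\n";   # hypothetical input with a trailing newline

# $ tolerates one trailing newline; \z does not.
my $dollar = $name =~ m{^r.*o.*ss$}   ? 1 : 0;
my $strict = $name =~ m{\Ar.*o.*ss\z} ? 1 : 0;

print "dollar: $dollar, strict: $strict\n";
```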

Anchors
^     Begins with
$     Ends with
\A    String begins with
\z    String ends with
\b    Word boundary
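
The word boundary anchor is the only one in this list not shown in use. A minimal sketch, with a sample sentence and variable names invented for illustration:

```perl
use strict;
use warnings;

my $text = "betsy ross crossed the river";   # hypothetical sample input

# Without \b, 'ross' is also found inside 'crossed'.
my @anywhere = $text =~ m{ross}g;

# With \b on both sides, only the standalone word matches.
my @whole_word = $text =~ m{\bross\b}g;

print scalar(@anywhere), " vs ", scalar(@whole_word), "\n";
```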

tr
my $name = "JagerMech";
$name =~ tr/A-Z/a-z/;
# one-for-one translation.
# replaces each character on the left with the corresponding one on the right.
print "$name\n";
# jagermech
# tr is almost never used

\w
Not used on the homework.
Almost never used.
\w matches alphanumerics and ‘_’

Further Reading
Mastering Regular Expressions
perldoc perlrequick
perldoc perlretut
perldoc perlre
