Regular Expression


Become a deft manipulator of text data. Regular expressions are the miracle of text extraction. If you have a text pattern in mind, you can write your own pattern match as a regular expression.

Published in: Data & Analytics


  1. Regular Expression: Become a Deft Manipulator of Text Data. Presented by Lambert Lum.
  2. Beginners Welcome: No previous Perl instruction required. Practitioners of other languages welcome.
  3. PCRE: Perl Compatible Regular Expressions. Available for non-Perl languages.
  4. We don’t cover: infrequently used regex, optimizations.
  5. Abbreviations: Regex is a shortened term for regular expression.
  6. Perl Refresher

          my $var;
          $scalar = "jay";
          @list = ($scalar, "leno");
  7. Jay Leno. Q: Who made the first American flag? (Actual question from Jay Leno’s Jaywalking segment.)
  8. Jay Leno. Hint: Last name begins with 'r', ends with 'ss', and has an 'o' in the middle. Answer: Betsy Ross.
  9. Regex

          $last_name =~ m{^r.*o.*ss$};
  10. Regex

          $last_name =~ m{^r.*o.*ss$};

      Matches: ross, rouss, rokess, rodiss, ridoss
  11. Symbols

          ^     Begins with
          $     Ends with
          .     Any character
          *     Zero or more times
          .*    Any character, zero or more times

          $last_name =~ m{^r.*o.*ss$};
  12. Quantifier Symbols

          *        Zero or more times
          +        One or more times
          ?        Zero or one time
          {2}      2 times
          {2,}     2 or more times
          {2,4}    2 to 4 times
  13. Practice #1. Q: Which of the following match m{^r.*o.*s{2,4}$}?
      a) ros
      b) ross
      c) rosss
      d) rossss
      e) rossssss
      f) answers b, c, d
      g) answers b, c, d, e
  14. Practice #1. Answer: g) answers b, c, d, e
  15. Match, substitute, split

          # match
          my $last_name = 'ross';
          if ($last_name =~ m{^r.*o.*ss$}) {
              print "match found\n";
          }

          # substitution
          my $full_name = "betsy ross";
          $full_name =~ s{\s+}{-};
          print "substitute space with dash: $full_name\n";

          # split
          $full_name = "ross, betsy";
          my @nomens = split m{,\s*}, $full_name;
          print join(':', @nomens) . "\n";
  16. Match, substitute, split

          # match
          my $last_name = 'ross';
          if ($last_name =~ /^r.*o.*ss$/) {
              print "match found\n";
          }

          # substitution
          my $full_name = "betsy ross";
          $full_name =~ s/\s+/-/;
          print "substitute space with dash: $full_name\n";

          # split
          $full_name = "ross, betsy";
          my @nomens = split /,\s*/, $full_name;
          print join(':', @nomens) . "\n";
  17. // vs {}
      Traditional: //. Most common, but awkward when matching a literal '/', e.g. $str =~ /\//;
      Better: {}. Balanced nested braces in the regex do not need escaping.
  18. Practice #2. Using regex, remove all spaces from the beginning and end of a string. Hint: use two regexes.
  19. Practice #2

          $last_name =~ s{^\s+}{};
          $last_name =~ s{\s+$}{};
  20. Non-destructive Substitution

          $full_name = "betsy ross";
          $new_name = $full_name =~ s{\s+}{-}r; # r modifier
          print "$full_name becomes $new_name\n";
  21. Character Symbols

          .     Any character
          \s    Whitespace
          \S    Non-whitespace
          \d    Digit, 0-9
          \D    Non-digit
          \\    Backslash
          \.    Literal dot
          \     Quote meta-character
  22. Negated match

          $last_name !~ m{^r.*o.*ss$};
  23. Practice #3. Write a regex that determines if a non-empty string is entirely composed of digits. Bonus: do it two different ways.
  24. Practice #3

          $str =~ m{^\d+$};
          $str !~ m{\D};
  25. Character class

          split m{[.,;\s]+}, $names;
          $names =~ s{[^a-zA-Z0-9]}{-}g;
          # leading ^ will negate the character class
          # g modifier for global
          # run character_class.pl
  26. Modifiers

          r    Non-destructive substitution
          i    Case insensitive
          g    Global match
  27. Case insensitive

          $last_name =~ m{^r.*o.*ss$}i;
  28. Global match

          @list = $str =~ m{[a-z]+}g;
          $scalar = $str =~ m{[a-z]+}g;
          $str =~ s{[^a-z]+}{-}g;
  29. Grouping

          ($prefix, $last_name)
              = $str =~ m{(mr[s]?[.]?)\s+(\S+)}i;
  30. Alternation

          ($prefix, $last_name) = $str
              =~ m{(mister|misses|miss|mr[s]?[.]?)\s+(\S+)}i;
  31. Alternation, non-capturing group

          ($prefix, $last_name) = $str
              =~ m{(mister|miss(?:es)?|mr[s]?[.]?)\s+(\S+)}i;
          # the 'es' is in a non-capturing group
  32. Alternation, non-capturing group

          ($prefix, $last_name) = $str
              =~ m{(mi(?:ster|ss(?:es)?)|mr[s]?[.]?)\s+(\S+)}i;
          # Too many non-capturing groups. Hard to read
  33. Modifiers

          r    Non-destructive substitution
          i    Case insensitive
          g    Global match
          n    Make all groups non-capturing
  34. Practice #4

          # what value is printed for year?
          my $str = "Copyright 2013";
          my $year;
          ($year) = $str =~ m{.*([0-9]+)};
          print "$year\n";
  35. Practice #4. Year is 3
  36. Greedy vs. non-greedy

          ($year) = $str =~ m{.*([0-9]+)};
          # $year is 3
          # greedy maximizes the matching
          ($year) = $str =~ m{.*?([0-9]+)};
          # $year is 2013
          # non-greedy minimizes the matching
  37. Non-greedy

          *?
          +?
          ??
  38. Back reference

          my $str = "****Spangled****";
          my ($star, $word) = $str =~ m{^([*]+)([^*]+)\1$};
  39. Match Variables

          $str = "Betsy Ross";
          $str =~ s{(\S+)\s(\S+)}{$2, $1};
          print "$str\n";
          print "1: $1\n";
          print "2: $2\n";
  40. Practice #5. Using regex, remove all spaces from the beginning and end of a string. This time, do it with one regex, not two.
  41. Practice #5

          $str =~ s{^\s*(.+?)\s*$}{$1};
  42. Look-around

          ?<=    Look behind
          ?=     Look ahead
  43. Look-around

          my $pop = 281421906;
          # 281,421,906
          print "The US population is $pop\n";
  44. Look-around

          my $pop = 281421906;
          # 281,421,906
          print "The US population is $pop\n";
          # add a comma between each group of three digits
          $pop =~ s{(?<=\d)(?=(\d\d\d)+$)}{,}g;
          print "The US population is $pop\n";
  45. Negated Look-around

          ?<!    Negated look behind
          ?!     Negated look ahead
  46. /xms

          /x    Allow whitespace and comments in the pattern
          /m    ^ and $ match at each line, as in awk/grep/sed
          /s    For multi-line text: ‘.’ also matches newline
  47. \A and \z

          # by default, beginning and end of string
          # with m, beginning and end of line
          $last_name =~ m{^r.*o.*ss$};
          # always beginning and end of string
          $last_name =~ m{\Ar.*o.*ss\z};
  48. Anchors

          ^     Begins with
          $     Ends with
          \A    String begins with
          \z    String ends with
          \b    Word boundary
  49. tr

          my $name = "JagerMech";
          $name =~ tr/A-Z/a-z/;
          # one for one translation.
          # replace chars of left with those on right.
          print "$name\n"; # jagermech
          # tr is almost never used
  50. \w. Not used on the homework. Almost never used. \w describes alphanumerics and ‘_’.
  51. Further Reading
      Mastering Regular Expressions
      perldoc perlrequick
      perldoc perlretut
      perldoc perlre
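
Worked example for Practice #1 (slides 13 and 14). This is a minimal sketch that runs the five candidate names against the slide's pattern; the helper name `matches_hint` is made up for illustration and is not part of the deck.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the name fits the Practice #1 pattern.
sub matches_hint {
    my ($last_name) = @_;
    return $last_name =~ m{^r.*o.*s{2,4}$} ? 1 : 0;
}

for my $candidate (qw(ros ross rosss rossss rossssss)) {
    my $verdict = matches_hint($candidate) ? 'match' : 'no match';
    print "$candidate: $verdict\n";
}
```

The last candidate has more than four trailing 's' characters and still matches, because the second `.*` is free to absorb the extras before `s{2,4}` takes the final ones.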
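
Worked example for Practice #2 and Practice #5 (slides 19 and 41). The sketch below wraps both trimming answers in subroutines so they can be compared side by side; the subroutine names and the sample strings are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Practice #2 approach: two substitutions, one per end of the string.
sub trim_two_step {
    my ($str) = @_;
    $str =~ s{^\s+}{};
    $str =~ s{\s+$}{};
    return $str;
}

# Practice #5 approach: one substitution that keeps only the captured middle.
sub trim_one_step {
    my ($str) = @_;
    $str =~ s{^\s*(.+?)\s*$}{$1};
    return $str;
}

for my $raw ("  betsy ross  ", "ross", "\tbetsy ross\n") {
    print '[', trim_two_step($raw), '] [', trim_one_step($raw), "]\n";
}
```

The one-regex version relies on the non-greedy `.+?`, which stops growing as soon as the trailing `\s*$` can match, so the inner space in "betsy ross" is kept.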
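
Worked example for Practice #3 (slide 24). A sketch of the two answers as subroutines with hypothetical names. The explicit length check in the second one is an addition, not part of the slide: the negated match alone also succeeds on an empty string, which the exercise excludes.

```perl
use strict;
use warnings;

# Way 1: anchor both ends and require one or more digits in between.
sub all_digits_anchored {
    my ($str) = @_;
    return $str =~ m{^\d+$} ? 1 : 0;
}

# Way 2: succeed when no non-digit is found. An empty string contains no
# non-digit either, so the length check enforces the "non-empty" condition.
sub all_digits_negated {
    my ($str) = @_;
    return (length($str) > 0 && $str !~ m{\D}) ? 1 : 0;
}

for my $sample ('1776', '17a6', '') {
    print "'$sample': ", all_digits_anchored($sample), ' ',
        all_digits_negated($sample), "\n";
}
```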
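
Worked example for the global match slide (slide 28). The three statements on that slide behave differently depending on context, which the sketch below tries to make visible. The sample string is an assumption.

```perl
use strict;
use warnings;

my $str = "betsy ross 1776 flag";

# List context: /g returns every match at once.
my @list = $str =~ m{[a-z]+}g;

# Scalar context: each evaluation finds the next match and remembers where
# it stopped, so a while loop walks through the matches one at a time.
my @walked;
while ($str =~ m{([a-z]+)}g) {
    push @walked, $1;
}

# Substitution: /g replaces every run, not only the first one.
(my $dashed = $str) =~ s{[^a-z]+}{-}g;

print "list:   @list\n";
print "walked: @walked\n";
print "dashed: $dashed\n";
```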
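
Worked example for Practice #4 and the greedy vs. non-greedy slide (slides 34 to 36). A small sketch with hypothetical subroutine names that runs both patterns on the slide's copyright string.

```perl
use strict;
use warnings;

# Greedy: .* grabs as much as it can and gives back only the single
# character that [0-9]+ needs in order to succeed.
sub year_greedy {
    my ($str) = @_;
    my ($year) = $str =~ m{.*([0-9]+)};
    return $year;
}

# Non-greedy: .*? grabs as little as it can, so [0-9]+ starts at the
# first digit and takes the whole run.
sub year_non_greedy {
    my ($str) = @_;
    my ($year) = $str =~ m{.*?([0-9]+)};
    return $year;
}

my $str = "Copyright 2013";
print "greedy:     ", year_greedy($str), "\n";
print "non-greedy: ", year_non_greedy($str), "\n";
```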
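
Worked example for back references and match variables (slides 38 and 39). The sketch contrasts `\1`, which is used inside the pattern, with `$1` and `$2`, which are used in the replacement and afterwards. The variable names `$name`, `$first` and `$last` are introduced here for illustration.

```perl
use strict;
use warnings;

# Back reference: \1 inside the pattern must repeat whatever group 1 matched.
my $str = "****Spangled****";
my ($star, $word) = $str =~ m{^([*]+)([^*]+)\1$};

# Match variables: $1 and $2 hold the captured text, both in the
# replacement and after the substitution has succeeded.
my $name = "Betsy Ross";
$name =~ s{(\S+)\s(\S+)}{$2, $1};
my ($first, $last) = ($1, $2);

print "star=$star word=$word\n";
print "$name (first=$first, last=$last)\n";
```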
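
Worked example for the look-around slides (slides 42 to 45). The first subroutine wraps the comma-insertion substitution from slide 44. The second is an added illustration of a negated look ahead and does not come from the deck. Both subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Insert a comma at every position that has a digit behind it (look behind)
# and a whole number of three-digit groups ahead of it (look ahead).
# Both look-arounds match a position, not characters, so nothing is consumed.
sub commify {
    my ($number) = @_;
    (my $text = $number) =~ s{(?<=\d)(?=(\d\d\d)+$)}{,}g;
    return $text;
}

# Negated look ahead: "ro" at the start, as long as "ss" does not follow.
sub ro_without_ss {
    my ($name) = @_;
    return $name =~ m{^ro(?!ss)} ? 1 : 0;
}

print "The US population is ", commify(281421906), "\n";
print "rodiss: ", ro_without_ss('rodiss'), ", ross: ", ro_without_ss('ross'), "\n";
```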
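
Worked example for the /xms and anchor slides (slides 46 to 48). A sketch, using an invented three-line string, of how /m changes `^` and `$`, how /x lets a pattern carry comments, and how `\z` differs from `$` when the string ends in a newline.

```perl
use strict;
use warnings;

my $text = "ross\nbetsy\nrouss\n";

# Without /m, ^ and $ refer to the whole string, so this finds nothing.
my @whole = $text =~ m{^r.*ss$}g;

# With /m, ^ and $ match at every line, as in grep.
my @per_line = $text =~ m{^r.*ss$}mg;

# /x allows whitespace and comments; \A and \z ignore /m and always mean
# the very beginning and the very end of the string.
my $strict = qr{
    \A r    # string starts with r
    .* o    # an o somewhere after it
    .* ss   # ss at the very end
    \z
}x;

my $loose_ok  = "ross\n" =~ m{^r.*o.*ss$} ? 1 : 0;   # $ tolerates a final newline
my $strict_ok = "ross\n" =~ $strict       ? 1 : 0;   # \z does not

print "whole string: ", scalar(@whole), " matches\n";
print "per line:     @per_line\n";
print "loose=$loose_ok strict=$strict_ok\n";
```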
