Regular Expression

Become a deft manipulator of text data. Regular expressions are the miracle of text extraction. If you have a text pattern in mind, you can write your own pattern match as a regular expression.

Published in: Data & Analytics

  1. 1. Regular Expression Become a Deft Manipulator of Text Data Presented by Lambert Lum
  2. 2. Beginners Welcome No previous Perl instruction required. Practitioners of other languages welcome
  3. 3. PCRE PCRE: Perl Compatible Regular Expressions. Available for non-Perl languages.
  4. 4. We don’t cover: infrequently used regex; optimizations.
  5. 5. Abbreviations Regex is a shortened term for regular expression.
  6. 6. Perl Refresher my $var; $scalar = "jay"; @list = ($scalar, "leno");
  7. 7. Jay Leno Q: Who made the first American flag? Actual question from Jay Leno’s Jaywalker segment
  8. 8. Jay Leno Hint: Last name begins with 'r', ends with 'ss', and has an 'o' in the middle Answer: Betsy Ross
  9. 9. Regex $last_name =~ m{^r.*o.*ss$};
  10. 10. Regex $last_name =~ m{^r.*o.*ss$}; ross rouss rokess rodiss ridoss
  11. 11. Symbols: ^ begins with; $ ends with; . any character; * zero or more times; .* any character, zero or more times. Example: $last_name =~ m{^r.*o.*ss$};
  12. 12. Quantifier Symbols: * zero or more times; + one or more times; ? zero or one time; {2} 2 times; {2,} 2 or more times; {2,4} 2 to 4 times
  13. 13. Practice #1 Q: Which of the following match m{^r.*o.*s{2,4}$}? a) ros b) ross c) rosss d) rossss e) rossssss f) answers b,c,d g) answers b,c,d,e
  14. 14. Practice #1 Answer: g) answers b,c,d,e. In e), the .* before s{2,4} absorbs the extra 's' characters.
  15. 15. Match, substitute, split
      # match
      my $last_name = 'ross';
      if ($last_name =~ m{^r.*o.*ss$}) { print "match found\n"; }
      # substitution
      my $full_name = "betsy ross";
      $full_name =~ s{\s+}{-};
      print "substitute space with dash: $full_name\n";
      # split
      $full_name = "ross, betsy";
      my @nomens = split m{,\s*}, $full_name;
      print join(':', @nomens) . "\n";
  16. 16. Match, substitute, split
      # match
      my $last_name = 'ross';
      if ($last_name =~ /^r.*o.*ss$/) { print "match found\n"; }
      # substitution
      my $full_name = "betsy ross";
      $full_name =~ s/\s+/-/;
      print "substitute space with dash: $full_name\n";
      # split
      $full_name = "ross, betsy";
      my @nomens = split /,\s*/, $full_name;
      print join(':', @nomens) . "\n";
  17. 17. // vs {} Traditional: // Most common. Not so good when matching '/', which must then be escaped, e.g. $str =~ /\//; Better: {} Balanced nested braces inside the regex do not need escaping
  18. 18. Practice #2 Using regex, remove all spaces from the beginning and end of a string. Hint: Use two regex.
  19. 19. Practice #2 $last_name =~ s{^\s+}{}; $last_name =~ s{\s+$}{};
  20. 20. Non-destructive Substitution
      $full_name = "betsy ross";
      $new_name = $full_name =~ s{\s+}{-}r; # r modifier
      print "$full_name becomes $new_name\n";
  21. 21. Character Symbols: . any character; \s whitespace; \S non-whitespace; \d digit, 0-9; \D non-digit; \\ backslash; \. literal dot; \ quotes the next metacharacter
  22. 22. Negated match $last_name !~ m{^r.*o.*ss$};
  23. 23. Practice #3 Write a regex that determines if a non-empty string is entirely composed of digits. Bonus: do it two different ways
  24. 24. Practice #3 $str =~ m{^\d+$}; $str !~ m{\D};
  25. 25. Character class
      split m{[.,;\s]+}, $names;
      $names =~ s{[^a-zA-Z0-9]}{-}g;
      # leading ^ will negate the character class
      # g modifier for global
      # run character_class.pl
  26. 26. Modifiers: r non-destructive substitution; i case insensitive; g global match
  27. 27. Case insensitive $last_name =~ m{^r.*o.*ss$}i;
  28. 28. Global match @list = $str =~ m{[a-z]+}g; $scalar = $str =~ m{[a-z]+}g; $str =~ s{[^a-z]+}{-}g;
  29. 29. Grouping ($prefix, $last_name) = $str =~ m{(mr[s]?[.]?)\s+(\S+)}i;
  30. 30. Alternation ($prefix, $last_name) = $str =~ m{(mister|misses|miss|mr[s]?[.]?)\s+(\S+)}i;
  31. 31. Alternation non-capturing group ($prefix, $last_name) = $str =~ m{(mister|miss(?:es)?|mr[s]?[.]?)\s+(\S+)}i; # the 'es' is in a non-capturing group
  32. 32. Alternation non-capturing group ($prefix, $last_name) = $str =~ m{(mi(?:ster|ss(?:es)?)|mr[s]?[.]?)\s+(\S+)}i; # Too many non-capturing groups. Hard to read
  33. 33. Modifiers: r non-destructive substitution; i case insensitive; g global match; n make all groups non-capturing
  34. 34. Practice #4
      # what value is printed for year?
      my $str = "Copyright 2013";
      my $year;
      ($year) = $str =~ m{.*([0-9]+)};
      print "$year\n";
  35. 35. Practice #4 Year is 3. The greedy .* consumes "Copyright 201", leaving only the last digit for the group.
  36. 36. Greedy vs. non-greedy
      ($year) = $str =~ m{.*([0-9]+)};  # $year is 3
      # greedy maximizes the matching
      ($year) = $str =~ m{.*?([0-9]+)}; # $year is 2013
      # non-greedy minimizes the matching
  37. 37. Non-greedy *? +? ??
  38. 38. Back reference my $str = "****Spangled****"; my ($star, $word) = $str =~ m{^([*]+)([^*]+)\1$};
  39. 39. Match Variables $str = "Betsy Ross"; $str =~ s{(\S+)\s(\S+)}{$2, $1}; print "$str\n"; print "1: $1\n"; print "2: $2\n";
  40. 40. Practice #5 Using regex, remove all spaces from the beginning and end of a string. This time, do it with one regex, not two.
  41. 41. Practice #5 $str =~ s{^\s*(.+?)\s*$}{$1};
  42. 42. Look-around: ?<= look behind; ?= look ahead
  43. 43. Look-around
      my $pop = 281421906; # 281,421,906
      print "The US population is $pop\n";
  44. 44. Look-around
      my $pop = 281421906; # 281,421,906
      print "The US population is $pop\n";
      # add commas between each group of three digits
      $pop =~ s{(?<=\d)(?=(\d\d\d)+$)}{,}g;
      print "The US population is $pop\n";
  45. 45. Negated Look-around: ?<! negated look behind; ?! negated look ahead
  46. 46. /xms: /x allows whitespace and comments in the pattern; /m makes ^ and $ match at line boundaries (awk/grep/sed matching); /s is for multi-line text (‘.’ matches newline)
  47. 47. \A and \z
      # by default, beginning and end of string
      # with m, beginning and end of line
      $last_name =~ m{^r.*o.*ss$};
      # always beginning and end of string
      $last_name =~ m{\Ar.*o.*ss\z};
  48. 48. Anchors: ^ begins with; $ ends with; \A string begins with; \z string ends with; \b word boundary
  49. 49. tr
      my $name = "JagerMech";
      $name =~ tr/A-Z/a-z/;
      # one for one translation.
      # replace chars of left with those on right.
      print "$name\n"; # jagermech
      # tr is almost never used
  50. 50. \w Not used on the homework. Almost never used. \w describes alphanumerics and ‘_’
  51. 51. Further Reading Mastering Regular Expressions perldoc perlrequick perldoc perlretut perldoc perlre
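
The core operations from the slides (match, substitute, split, trimming, and the all-digits check) can be collected into one runnable script. This is a minimal sketch, not code from the presentation: the helper names `trim` and `is_all_digits` are hypothetical, and the `r` modifier assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace (Practice #2).
sub trim {
    my ($s) = @_;
    $s =~ s{^\s+}{};    # remove leading whitespace
    $s =~ s{\s+$}{};    # remove trailing whitespace
    return $s;
}

# Hypothetical helper: true if the string is non-empty and made up
# entirely of digits (Practice #3, written with \A and \z).
sub is_all_digits {
    my ($s) = @_;
    return $s =~ m{\A\d+\z} ? 1 : 0;
}

# Match: begins with 'r', has an 'o' in the middle, ends with 'ss'.
my $last_name = 'ross';
print "match found\n" if $last_name =~ m{^r.*o.*ss$};

# Non-destructive substitution: the r modifier returns the result
# and leaves $full_name unchanged.
my $full_name = "betsy ross";
my $dashed    = $full_name =~ s{\s+}{-}r;
print "$full_name becomes $dashed\n";

# Split on a comma followed by optional whitespace.
my @nomens = split m{,\s*}, "ross, betsy";
print join(':', @nomens), "\n";

print "[", trim("  betsy ross \t"), "]\n";
print is_all_digits("2013") ? "digits\n" : "not digits\n";
```

Wrapping the two trimming substitutions in a sub keeps the caller's string untouched, since the sub works on a copy of its argument.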

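The later slides (greedy versus non-greedy matching, back references, and look-around) can be exercised in the same way. Again this is a sketch under assumptions: `commify` is a hypothetical helper name wrapping the substitution from the look-around slide, and the inner group is written as non-capturing, a small change that does not affect the result.

```perl
use strict;
use warnings;

# Hypothetical helper: insert a comma between each group of three digits,
# using the look-behind/look-ahead substitution from the slides.
sub commify {
    my ($n) = @_;
    # Match the empty position that has a digit behind it and a
    # multiple of three digits ahead of it, up to the end of the string.
    $n =~ s{(?<=\d)(?=(?:\d\d\d)+$)}{,}g;
    return $n;
}

my $str = "Copyright 2013";

# Greedy: .* grabs as much as it can, leaving one digit for the group.
my ($greedy) = $str =~ m{.*([0-9]+)};

# Non-greedy: .*? grabs as little as it can, so the group gets every digit.
my ($lazy) = $str =~ m{.*?([0-9]+)};

print "greedy: $greedy, non-greedy: $lazy\n";

# Back reference: \1 must repeat exactly what the first group captured.
my ($star, $word) = "****Spangled****" =~ m{^([*]+)([^*]+)\1$};
print "stars: $star, word: $word\n";

print "The US population is ", commify(281421906), "\n";
```

Because the look-around assertions consume no characters, the substitution only inserts commas and never replaces a digit.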