Regular Expressions

Transcript

  • 1. Regular Expressions. Satyanarayana D <satyavvd@yahoo-inc.com>
  • 2. Topics
    • What?
    • Why?
    • History - Who?
    • Flavou?rs
    • Grammar
    • Meta Chars
    • Character Classes
    • Shorthand Char Classes
    • Anchors
    • Repeaters or Quantifiers
    • Grouping & Capturing
    • Alternation
    • Match Float
    • Atomic Grouping
    • Look Around
    • Conditional Expr.
    • Recursive Regex
    • Code Evaluation
    • Code Expr.
    • Inline Modifiers
    • Regex Tools
    • Q&A
  • 3. What are Regular Expressions?
    • A Regular expression is a pattern describing a certain amount of text.
    • A regular expression, often called a pattern, is an expression that describes a set of strings. - Wikipedia
  • 4. Why do we need them?
    • Regular expressions allow matching and manipulation of textual data.
    • Requirements
      • Matching/Finding
      • Doing something with matched text
      • Validation of data
      • Case insensitive matching
      • Parsing data (e.g. HTML)
      • Converting data into a different form, etc.
  • 5. History: Stephen Kleene, a mathematician, discovered ‘regular sets’.
  • 6. History: Ken Thompson, 1968, Regular Expression Search Algorithm. QED -> ed -> g/re/p
  • 7. History: Henry Spencer, 1986, wrote a regex library in C.
  • 8. Regex Flavors
    • BRE - Basic Regular Expressions
      • \?, \+, \{, \|, \(, and \)
      • ed, g/re/p, sed
    • ERE - Extended Regular Expressions
      • ?, +, {, |, (, and )
      • grep -E == egrep, awk
    • PCRE - Perl Compatible Regular Expressions (Philip Hazel)
      • Perl, PHP, Tcl etc.
  • 9. Grammar of Regex
    • RE = one or more non-empty ‘branches’ separated by ‘|’
    • Branch = one or more ‘pieces’
    • Piece = atom, optionally followed by a quantifier
    • Quantifier = ‘*’, ‘+’, ‘?’ or a ‘bound’
    • Bound = atom{n}, atom{n,}, atom{m,n}
    • Atom = (RE) or () or ‘^’ or ‘$’ or ‘\’ followed by one of ^.[$()|*+?{ or any-char or a ‘bracket expression’
    • Bracket expression = a list of characters enclosed in ‘[ ]’
  • 10. Meta Chars? In a normal expression like 2 + 4, the ‘+’ has a special meaning.
  • 11. Meta Chars
    • \ Quote the next metacharacter
    • ^ Match the beginning of the line
    • . Match any character (except newline) = [^\n]
    • $ Match the end of the line (or before newline at the end)
    • | Alternation
    • ( ) Grouping
    • [ ] Character class
    • { } Match m to n times
    • * Match 0 or more times
    • + Match 1 or more times
    • ? Match 1 or 0 times
  • 12. Non-printable Chars
    • \t tab (HT, TAB)
    • \n newline (LF, NL)
    • \r return (CR)
    • \f form feed (FF)
    • \a alarm (bell) (BEL)
    • \e escape (think troff) (ESC)
    • \033 octal char (example: ESC)
    • \x1B hex char (example: ESC)
    • \x{263a} long hex char (example: Unicode SMILEY)
    • \cK control char (example: VT)
    • \N{name} named Unicode character
  • 13. Character Classes – [ ]
    • A set of characters placed inside square brackets. Inside the brackets, metacharacters lose their special meaning (except ‘]’, ‘^’ and ‘-’).
    • Requirements
      • Matches one and only one character from the specified set.
      • A range can be specified using ‘-’.
        • a-z matches the 26 lowercase English letters.
        • 0-9 matches any digit.
        • Negation can be specified using ‘^’ at the beginning of the class.
        • To match the exceptional characters above literally, either escape them or place them where they cannot be special (for example ‘-’ at the end).
    • Examples
      • [0-9] Matches any one of 0,1,2,3,4,5,6,7,8,9.
      • [aeiou] Matches one English vowel.
      • [^aeiou] Matches any one character that is not a vowel.
      • [a-z-] Matches a to z and ‘-’.
      • [a-z0-9] Union: matches a to z and 0 to 9.
      • [a-z&&[m-z]] Intersection: matches m to z (Java syntax).
      • [a-z-[m-z]] Subtraction: matches a to l (.NET syntax).
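The class examples above can be tried out with a short Python sketch. Python is chosen here only for illustration; its `re` module has no intersection or subtraction syntax, so the last two lines emulate them with lookaheads, which is a swapped-in technique rather than what the slide shows. The sample words are made up.

```python
import re

# Each bracket expression matches exactly one character from its set.
vowels = re.findall(r"[aeiou]", "regular")              # vowels only
non_vowels = re.findall(r"[^aeiou]", "regex")           # negated class
with_hyphen = re.findall(r"[a-z-]+", "look-ahead 42")   # trailing '-' is literal

# Emulation (assumption: plain `re`, which lacks [a-z&&[m-z]] and [a-z-[m-z]]):
# a positive lookahead narrows the class, a negative lookahead subtracts from it.
intersection = re.findall(r"(?=[m-z])[a-z]", "lamp")
subtraction = re.findall(r"(?![m-z])[a-z]", "lamp")

print(vowels, non_vowels, with_hyphen, intersection, subtraction)
```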
  • 14. POSIX Character Classes – [: … :] Example: [^[:digit:]] = \D = [^0-9]
  • 15. Shorthand Chars
    • \w word character [A-Za-z0-9_]
    • \d decimal digit [0-9]
    • \s whitespace [ \n\r\t\f]
    • \W not a word character [^A-Za-z0-9_]
    • \D not a decimal digit [^0-9]
    • \S not whitespace [^ \n\r\t\f]
  • 16. Anchors/Assertions
    • Anchor matches a certain position in the subject string and it won’t consume any characters.
    • ^ Match the beginning of the line
    • $ Match the end of the line (or before newline at the end)
    • \A Matches only at the very beginning
    • \z Matches only at the very end
    • \Z Matches like $ used in single-line mode
    • \b Matches when the current position is a word boundary
    • \<, \> Match when the current position is a word boundary (start of word, end of word)
    • \B Matches when the current position is not a word boundary
  • 17. ^Anchors
    • Anchor matches a certain position in the subject string and it won’t consume any characters.
    • ^ Match the beginning of the line. Example: /^a/ matches a string that begins with ‘a’.
  • 18. Anchors$
    • Anchor matches a certain position in the subject string and it won’t consume any characters.
    • $ Match the end of the line (or before newline at the end). Example: /s$/ matches a string that ends with ‘s’.
  • 19. \A Anchors
    • Anchor matches a certain position in the subject string and it won’t consume any characters.
    • \A Matches only at the very beginning of the string.
    • ^ vs \A: in multi-line mode ^ also matches right after each newline, while \A still matches only at the start of the string.
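A minimal Python sketch of the ^ versus \A difference. The three-line split of the sample sentence is an assumption made for this example; the slide does not show where its line breaks fall.

```python
import re

# Assumed line breaks; only the wording comes from the slide.
text = (
    "Anchor matches a certain position\n"
    "in the subject string\n"
    "and it won't consume any characters"
)

# With re.MULTILINE, ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# \A ignores the flag and matches only at the start of the whole string.
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)

print(line_starts)
print(string_start)
```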
  • 20. \z, \Z Anchors
    • Anchor matches a certain position in the subject string and it won’t consume any characters.
    • \z Matches only at the very end.
    • \Z Matches like $ used in single-line mode: at the very end, or just before a final newline.
    • $ vs \z, \Z: in multi-line mode $ also matches before every newline; \z and \Z do not.
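Flavors differ on the end anchors, so a quick check in the target language is worthwhile. The sketch below uses Python, where `\Z` behaves like Perl's `\z` (only the very end of the string); the sample string is made up for illustration.

```python
import re

s = "it won't consume any characters\n"

# '$' may match just before a final newline.
dollar = re.search(r"s$", s)

# In Python, \Z matches only at the very end, so the trailing newline blocks it.
strict_end = re.search(r"s\Z", s)

print(dollar is not None, strict_end is not None)
```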
  • 21. \b, \B Anchors
    • Anchor matches a certain position in the subject string and it won’t consume any characters.
    • \b = \W\w|\w\W = Matches a word boundary
    • \B Matches when the current position is not a word boundary
    • Sample: $ xl2twiki file 2> /dev/null
      • /\b2\b/ matches the standalone ‘2’ in ‘2>’
      • /\B2\B/ matches the ‘2’ inside ‘xl2twiki’
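The word-boundary behaviour can be checked with a few lines of Python. The command string is reconstructed from the slide's sample and should be treated as an assumption.

```python
import re

# Reconstructed sample command (assumption).
command = "$ xl2twiki file 2> /dev/null"

# \b2\b: a '2' with a word boundary on both sides (the redirect '2>').
standalone = [m.start() for m in re.finditer(r"\b2\b", command)]

# \B2\B: a '2' with word characters on both sides (inside 'xl2twiki').
embedded = [m.start() for m in re.finditer(r"\B2\B", command)]

print(standalone, embedded)
```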
  • 22. Quantifiers
    • Why? Because we are not sure about the exact text. A quantifier specifies how many times a regex component must repeat.
    • Quantifiers (repetition):
      • {m,n} = Matches a minimum of m and a maximum of n occurrences.
      • * = {0,} = Matches zero or more occurrences (any amount).
      • + = {1,} = Matches one or more occurrences.
      • ? = {0,1} = Matches zero or one occurrence (i.e. optional).
  • 23. Quantifiers
    • By default quantifiers are greedy.
    • /\d{2,4}/ on ‘2010’ matches ‘2010’ (all four digits).
    • /<.+>/ on ‘My first <strong> regex </strong> test.’ matches ‘<strong> regex </strong>’.
    • /\w+sion/ on ‘Expression’ matches ‘Expression’.
    • If the entire match fails because greedy quantifiers consumed too much, they are forced to give up as much as needed to make the rest of the regex succeed.
  • 24. Non Greedy Quantifiers: {m,n}? *? +? ??
    • To make a quantifier non-greedy, append ‘?’.
      • /<.+?>/ on ‘My first <strong> regex </strong> test.’ matches ‘<strong>’.
    • Alternatively, use negated classes.
      • /<[^>]+>/ on ‘My first <strong> regex </strong> test.’ matches ‘<strong>’.
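The greedy, lazy and negated-class variants from the two quantifier slides can be compared side by side. This is a small Python sketch using the slide's own sample sentence.

```python
import re

text = "My first <strong> regex </strong> test."

# Greedy: .+ runs to the end, then backs off to the last '>'.
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? stops at the first '>' that lets the match succeed.
lazy = re.search(r"<.+?>", text).group()

# Negated class: never crosses a '>', so no backtracking is needed.
negated = re.search(r"<[^>]+>", text).group()

print(greedy)
print(lazy)
print(negated)
```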
  • 25. Grouping – ( )
    • Why? – To create sub patterns, so that you can apply regex operators to whole sub patterns or you can reference them by corresponding sub group numbers.
    • \d{2}-\d{2}-\d{2}(\d{2})? will match 01-01-10 and also 01-01-2010.
    • Grouping can be used for alternation.
  • 26. Alternation - |
    • Why? Lets you match more than one sub-expression at the same point.
    • /\b(get|set)Value\b/ matches either getValue or setValue.
    • Branches are tried from left to right.
    • Eagerness: the first branch that matches wins, so put the most likely (or longer) pattern first.
      • (and|android) on ‘robot and an android fight’ matches only ‘and’, both in ‘and’ and inside ‘android’.
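A short Python sketch of the eagerness rule, using the slide's sentence. The `resetValue` word in the second sample is made up to show the effect of `\b`.

```python
import re

text = "robot and an android fight"

# Branches are tried left to right, so 'and' wins even inside 'android'.
short_first = re.findall(r"and|android", text)

# Putting the longer branch first lets 'android' match as a whole.
long_first = re.findall(r"android|and", text)

# Word boundaries keep 'resetValue' (hypothetical name) from matching.
accessors = re.findall(r"\b(?:get|set)Value\b", "getValue setValue resetValue")

print(short_first, long_first, accessors)
```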
  • 27. Capturing – ( )
    • Allows us to access sub-parts of pattern for later processing.
      • All captured sub patterns are stored in memory.
      • Captured patterns are numbered from left to right.
    • /\b((\d{2})-(\d{2})-(\d{2}(\d{2})?))\b/ applied to: Today is ‘18-08-2010’.
      • 1 -> date -> 18-08-2010
      • 2 -> day -> 18
      • 3 -> month -> 08
      • 4 -> year -> 2010
      • 5 -> last two digits of year -> 10
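The group numbering above (counting opening parentheses from left to right) can be confirmed with a small Python sketch that uses the same pattern and sample text.

```python
import re

pattern = re.compile(r"\b((\d{2})-(\d{2})-(\d{2}(\d{2})?))\b")
match = pattern.search("Today is '18-08-2010'.")

# groups() returns groups 1..5 in order of their opening parenthesis.
groups = match.groups()
print(groups)
```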
  • 28. Non-Capturing sub patterns– (?: )
    • If you don’t need back references, make sub-expressions non-capturing; it saves memory and processing time.
    • \d{2}-\d{2}-\d{2}(?:\d{2})? will match 01-01-10 and also 01-01-2010.
  • 29. Named Capture – (?<> )
    • We can give names for sub patterns instead of numbers.
      • (?P<name>pattern) -> Python style, Perl 5.12
      • (?P=name) -> Back reference
      • (?<name>pattern) or (?'name'pattern) -> Perl 5.10
      • \k<name> or \k'name' or \g{name} -> Back reference
      • \g{-1}, \g{-2} -> Relative back reference
    • Examples
      • /(?<vowel>[ai]).\k<vowel>.\1/ on ‘abracadabra’ matches ‘acada’.
      • /(\w+)\s+\g{-1}/ on "Thus joyful Troy Troy maintained the the watch of night..." finds the doubled words ‘Troy Troy’ and ‘the the’.
      • $date="18-08-2010";
      • $date =~ s/(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{4})/$+{year}-$+{month}-$+{day}/;
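The same three examples, sketched in Python with the `(?P<name>...)` style. The Perl `\k<name>` and `\g{-1}` forms are not available in Python's `re`, so `(?P=name)` is used as the back reference instead.

```python
import re

# Reorder a date using named groups in the replacement string.
date = "18-08-2010"
iso = re.sub(
    r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})",
    r"\g<year>-\g<month>-\g<day>",
    date,
)

# Doubled words: (?P=word) must repeat whatever the named group captured.
line = "Thus joyful Troy Troy maintained the the watch of night"
doubled = re.findall(r"\b(?P<word>\w+)\s+(?P=word)\b", line)

# A named group can still be referenced by its number (\1).
vowel = re.search(r"(?P<vowel>[ai]).(?P=vowel).\1", "abracadabra").group()

print(iso, doubled, vowel)
```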
  • 30. Before Evaluating Regex
    • Hits
      • Lines that I want to match.
    • Misses
      • Lines that I don’t want to match.
    • Omissions
      • Lines that I didn’t match but wanted to match.
    • False alarms
      • Lines that I matched but didn’t want to match.
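One way to apply these four categories is a tiny checker that runs a pattern against lines you want and lines you do not want. The function name `evaluate` and the sample lines are hypothetical, chosen only for this sketch.

```python
import re

def evaluate(pattern, wanted, unwanted):
    """Sort sample lines into the four categories (hypothetical helper)."""
    rx = re.compile(pattern)
    return {
        "hits": [s for s in wanted if rx.search(s)],
        "omissions": [s for s in wanted if not rx.search(s)],
        "false_alarms": [s for s in unwanted if rx.search(s)],
        "misses": [s for s in unwanted if not rx.search(s)],
    }

# A first attempt at a float pattern: it omits signed numbers.
report = evaluate(r"^\d+\.\d+$", wanted=["3.14", "-2.5"], unwanted=["abc", "1.2.3"])
print(report)
```

A non-empty "omissions" or "false_alarms" list is the signal to refine the pattern.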
  • 31. Matching a float number. Basic principle: split your task into sub-tasks. Float number = integerpart.fractionalpart
  • 32. Matching a float number. Integer part = \d+ -> will match one or more digits
  • 33. Matching a float number. Integer part = \d+ -> will match one or more digits. Literal dot = \.
  • 34. Matching a float number. Integer part = \d+ -> will match one or more digits. Literal dot = \. Fractional part = \d+ -> will match one or more digits
  • 35. Matching a float number. Integer part = \d+ Literal dot = \. Fractional part = \d+ Combine all of them = \d+\.\d+
  • 36. Matching a float number. /\d+\.\d+/ -> is generic. But it won’t match -123.45 or +123.45
  • 37. Matching a float number. /\d+\.\d+/ won’t match -123.45 or +123.45. /[+-]?\d+\.\d+/ -> will match.
  • 38. Matching a float number. /[+-]?\d+\.\d+/ won’t match - 123.45 or + 123.45. /[+-]? *\d+\.\d+/ -> will match. But it won’t match 123. or .45
  • 39. Matching a float number. /[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)/ -> will match 123. and .45. But it won’t match 10e2 or 101E5
  • 40. Matching a float number. /[+-]? *(?:\d+\.\d+|\d+\.|\.\d+|\d+)(?:[eE]\d+)?/ -> will match 10e2 and 101E5. But it won’t match 10e-2
  • 41. Matching a float number. /^[+-]? *(?:\d+\.\d+|\d+\.|\.\d+|\d+)(?:[eE][+-]?\d+)?$/ -> will match 10e-2.
  • 42. Match a float number
    • /^
    • [+-]?\ * # first, match an optional sign
    • (?: # then match integers or f.p. mantissas:
    • \d+\.\d+ # mantissa of the form a.b
    • |\d+\. # mantissa of the form a.
    • |\.\d+ # mantissa of the form .b
    • |\d+ # integer of the form a
    • )
    • (?:[eE][+-]?\d+)? # finally, optionally match an exponent
    • $/x;
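The final commented pattern carries over to Python almost unchanged with `re.VERBOSE`. The sketch below checks it against the examples used on the previous slides; the strings in `misses` are made up for illustration.

```python
import re

FLOAT = re.compile(r"""
    ^[+-]?\ *               # optional sign, then optional spaces
    (?: \d+\.\d+            # mantissa of the form a.b
      | \d+\.               # mantissa of the form a.
      | \.\d+               # mantissa of the form .b
      | \d+                 # integer of the form a
    )
    (?: [eE][+-]?\d+ )?     # optional exponent
    $
""", re.VERBOSE)

hits = ["123.45", "-123.45", "+ 123.45", "123.", ".45", "10e2", "10e-2", "101E5"]
misses = ["", ".", "e5", "1.2.3", "12a"]  # hypothetical negative examples

matched = [s for s in hits if FLOAT.match(s)]
rejected = [s for s in misses if not FLOAT.match(s)]
print(matched)
print(rejected)
```

In verbose mode a literal space has to be escaped (`\ `), which is why the sign is followed by `\ *` rather than ` *`.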
  • 43. Atomic Grouping – (?> )
    • Before looking into atomic grouping, we need to know about backtracking.
    • Backtracking – If you don’t succeed, try and try again...
    • Matching \d+99 against 19999 (matched part, then remaining text):
      • \d+ : 1 9999 -> Add 1 to match -> 1
      • \d+ : 19 999 -> Add 9 to match -> 19
      • \d+ : 199 99 -> Add 9 to match -> 199
      • \d+ : 1999 9 -> Add 9 to match -> 1999
      • \d+ : 19999 -> Add 9 to match -> 19999
      • \d+ : 19999 -> Still need to match 99
      • \d+99 : 1999 9 -> Give up a 9
      • \d+99 : 199 99 -> Give up one more 9
      • \d+99 : 19999 -> Success
  • 44. Atomic Grouping – (?> )
    • Before looking into atomic grouping, we need to know about backtracking.
    • Backtracking – If you don’t succeed, try and try again...
    • Matching \d+xx against 199Rs (matched part, then remaining text):
      • \d+ : 1 99Rs -> Add 1 to match -> 1
      • \d+ : 19 9Rs -> Add 9 to match -> 19
      • \d+ : 199 Rs -> Add 9 to match -> 199
      • \d+x : 199 Rs -> x not matched with R
      • \d+x : 19 9Rs -> Give up a 9, still cannot match x
      • \d+x : 1 99Rs -> Give up another 9, still cannot match x
      • \d+x : 1 99Rs -> Cannot give up 1, because \d+ needs at least one digit
      • \d+xx : 199Rs -> Failure
  • 45. Atomic Grouping – (?> )
    • Atomic Grouping:
      • Atomic grouping disables backtracking into the group and speeds up failure.
      • (?>pattern): here pattern is treated as an atomic token.
      • (?>\d+)xx: here (?>\d+) locks and won’t give up any digits.
        • Fails right at matching x with R.
      • Atomic groups are not captured and can be nested.
    • Possessive Quantifiers:
      • Use possessive quantifiers for single items to avoid backtracking.
      • Appending ‘+’ makes a quantifier possessive.
      • (?>\d+)xx == \d++xx
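The effect of "no giving back" can be observed in Python as well. The sketch below does not use `(?>...)`; it emulates an atomic group with a lookahead plus a back reference, a swapped-in technique that works on older Python versions (native atomic groups and possessive quantifiers arrived in Python 3.11). The group name `digits` is made up.

```python
import re

# Ordinary greedy quantifier: \d+ gives back one digit so the final 9 can match.
backtracking = re.match(r"\d+9", "1999")

# Emulated atomic group: the lookahead captures the longest digit run once,
# the back reference consumes it, and nothing can be given back afterwards.
atomic_like = re.match(r"(?=(?P<digits>\d+))(?P=digits)9", "1999")

print(backtracking.group() if backtracking else None)
print(atomic_like)
```

The atomic variant fails here because all four digits are locked in before the trailing 9 is tried.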
  • 46. Look Around
    • (?=...) Zero-width positive lookahead assertion
    • (?!...) Zero-width negative lookahead assertion
    • (?<=...) Zero-width positive lookbehind assertion
    • (?<!...) Zero-width negative lookbehind assertion
    • Note: assertions can be nested. Example: /(?<=,(?!(?<=\d,)(?=\d)))/
  • 47. Look Around
      • Subject: “I catch the housecat 'Tom-cat' with catnip”
      • /cat(?=\s+)/ matches ‘cat’ in ‘housecat’
      • /(?<=\s)cat\w+/ matches ‘catch’ and ‘catnip’
      • /\bcat\b/ matches ‘cat’ in ‘Tom-cat’
      • /(?<=\s)cat(?=\s)/ doesn’t match; no isolated ‘cat’
      • /cat(?!\s)/ matches ‘cat’ in ‘catch’, ‘Tom-cat’ and ‘catnip’
      • /(?<!\s)cat/ matches ‘cat’ in ‘housecat’ and ‘Tom-cat’
    • Note: look-behind expressions cannot be of variable length. This means you cannot use quantifiers (?, *, +, or {1,5}) or alternation of different-length items inside them.
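A Python sketch that runs the six lookaround patterns against the same sentence and records where each match starts. The helper name `spans` is made up for this example.

```python
import re

s = "I catch the housecat 'Tom-cat' with catnip"

def spans(pattern):
    """Return (start offset, matched text) for every match in s."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, s)]

followed_by_space = spans(r"cat(?=\s)")          # 'cat' in 'housecat'
after_space_word = spans(r"(?<=\s)cat\w+")       # 'catch', 'catnip'
whole_word = spans(r"\bcat\b")                   # 'cat' in 'Tom-cat'
isolated = spans(r"(?<=\s)cat(?=\s)")            # nothing
not_followed_by_space = spans(r"cat(?!\s)")      # catch, Tom-cat, catnip
not_after_space = spans(r"(?<!\s)cat")           # housecat, Tom-cat

print(followed_by_space, after_space_word, whole_word)
print(isolated, not_followed_by_space, not_after_space)
```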
  • 48. Conditional expressions
    • A conditional expression is a form of if-then-else statement that allows one to choose which patterns are to be matched, based on some condition
      • (?(condition)yes-regexp) is like an 'if () {}' statement
      • (?(condition)yes-regexp|no-regexp) is like an 'if () {} else {}' statement
        • Condition can be
          • Sub pattern match corresponding number
          • Lookaround Assertion
          • Recursive call
    • Match an optionally quoted string -> /^("|')?[^"']*(?(1)\1)$/
      • Matches 'blah blah'
      • Matches "blah blah"
      • Matches blah blah
      • Won’t match 'blah blah"
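Python's `re` supports the group-number form of the conditional, so the quoted-string example can be checked directly. Lookaround and recursion conditions are not available there, so this sketch covers only the first kind of condition listed above.

```python
import re

# If group 1 (the opening quote) took part in the match, require the same
# quote again at the end; otherwise require nothing.
quoted = re.compile(r"""^("|')?[^"']*(?(1)\1)$""")

samples = ["'blah blah'", '"blah blah"', "blah blah", "'blah blah\""]
results = [bool(quoted.match(s)) for s in samples]
print(results)
```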
  • 49. Conditional expression
    • A conditional expression is a form of if-then-else statement that allows one to choose which patterns are to be matched, based on some condition
      • (?(condition)yes-regexp) is like an 'if () {}' statement
      • (?(condition)yes-regexp|no-regexp) is like an 'if () {} else {}' statement
    • /(.)\1(?(?<=AA)G|C)$/ matches ATGAAG, TAGBBC and GATGGC.
    • Run against /usr/share/dict/words, /^(.+)(.+)?(?(2)\2\1|\1)$/ matches aa, baba, beriberi, maam and vetitive.
  • 50. Recursive Patterns – (?)
    • (x(x)y(x)x)
    • Palindrome -> /^((.)(?:(?1)|\w)*(\2))$/
      • qr/
      • ^ # Start of string
      • ( # Start capture group 1
      • \( # Open paren
      • (?> # Atomic (non-backtracking) subgroup
      • [^()]++ # Grab all the non parens we can
      • | # or
      • (?1) # Recurse into group 1
      • )* # Zero or more times
      • \) # Close paren
      • ) # End capture group 1
      • $ # End of string
      • /x;
  • 51. Code Evaluation – (?{ })
    • Perl code can be evaluated inside regular expressions using
      • (?{ }) construct.
    • $x = "aaaa";
    • $x =~ /(a(?{print "Yow\n";}))*aa/;
    • produces: Yow Yow Yow Yow (one per line)
  • 52. Pattern Code Expression – (??{ })
    • Pattern code expression - the result of the code evaluation is treated as a regular expression and matched immediately.
      • Construct is (??{ })
    • $length = 5;
    • $char = 'a';
    • $str = 'aaaaabb';
    • $str =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
  • 53. Inline modifiers & Comments
    • Matching can be modified inline by placing modifiers.
      • (?i) enables case-insensitive mode
      • (?m) enables multiline matching for ^ and $
      • (?s) makes dot metacharacter match newline also
      • (?x) ignores literal whitespace
      • (?U) makes quantifiers ungreedy (lazy) by default
    • $answers =~ /(?i)y(?-i)(?:es)?/ -> Will match ‘y’, ‘Y’, ‘yes’, ‘Yes’ but not ‘YES’.
    • Comments can be inserted inline using the (?#) construct.
      • /^(?#begin)\d+(?#match integer part)\.(?#match dot)\d+(?#match fractional part)$/
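Inline modifier syntax varies between flavors. In Python, switching a flag on and off mid-pattern with `(?i)...(?-i)` is not accepted, so the sketch below uses the scoped form `(?i:...)` (available since Python 3.6) to get the same effect as the slide's example. The `(?#...)` comment construct works as shown on the slide.

```python
import re

# Case-insensitive 'y', followed by an optional case-sensitive 'es'.
answer = re.compile(r"(?i:y)(?:es)?")
accepted = [s for s in ["y", "Y", "yes", "Yes", "YES"] if answer.fullmatch(s)]

# Inline comments are ignored by the engine.
number = re.compile(r"^(?#begin)\d+(?#integer part)\.(?#dot)\d+(?#fractional part)$")
is_float = [bool(number.match(s)) for s in ["3.14", "314"]]

print(accepted, is_float)
```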
  • 54. Regex Tools, Testers and Editors
    • Editors: Vim, TextMate, EditPad Pro, NoteTab, UltraEdit
    • RegexBuddy
    • Reggy - http://reggyapp.com
    • http://rubular.com (Ruby)
    • RegexPal (JavaScript) - http://www.regexpal.com
    • http://www.gskinner.com/RegExr/
    • http://www.spaweditor.com/scripts/regex/index.php
    • http://regex.larsolavtorvik.com/ (PHP, JavaScript)
    • http://www.nregex.com/ (.NET)
    • http://www.myregexp.com/ (Java)
    • http://osteele.com/tools/reanimator (NFA graphic repr.)
    • Expresso - http://www.ultrapico.com/Expresso.htm (.NET)
    • Regulator - http://sourceforge.net/projects/regulator (.NET)
    • RegexRenamer - http://regexrenamer.sourceforge.net/ (.NET)
    • PowerGREP - http://www.powergrep.com/
    • Windows Grep - http://www.wingrep.com/
  • 55. Regex Resources
    • $ perldoc perlre perlretut perlreref
    • $ man re_format
    • “Mastering Regular Expressions” by Jeffrey Friedl - http://oreilly.com/catalog/9780596528126/
    • “Regular Expressions Cookbook” by Jan Goyvaerts & Steven Levithan - http://oreilly.com/catalog/9780596520694
  • 56. Questions? * { } ^ ] + $ [ ( ? . ) - : #
  • 57. Thank Y!ou * { } ^ ] + $ [ ( ? . ) - : #
  • 58. Java Regex
    • import java.util.regex.*;
    • public class MatchTest {
    • public static void main(String[] args) throws Exception {
    • String date = "12/30/1969";
    • Pattern p = Pattern.compile("^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$");
    • Matcher m = p.matcher(date);
    • if (m.find( )) {
        • String month = m.group(1);
        • String day = m.group(2);
        • String year = m.group(3);
        • System.out.printf("Found %s-%s-%s\n", year, month, day);
        • }
    • }
    • }
  • 59. PHP Regex
    • $date = "12/30/1969";
    • $p = '!^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$!';
    • if (preg_match($p, $date, $matches)) {
      • $month = $matches[1];
      • $day = $matches[2];
      • $year = $matches[3];
    • }
    • $text = "Hello world. <br>";
    • $pattern = "{<br>}i";
    • echo preg_replace($pattern, "<br />", $text);
  • 60. JavaScript Regex
    • var date = "12/30/1969";
    • var p = new RegExp("^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$");
    • var result = p.exec(date);
    • if (result != null) {
      • var month = result[1];
      • var day = result[2];
      • var year = result[3];
    • }
    • var text = "Hello world. <br>";
    • var pattern = /<br>/ig;
    • text.replace(pattern, "<br />");
  • 61. .NET Regex
    • using System.Text.RegularExpressions;
    • class MatchTest {
    • static void Main( ) {
      • string date = "12/30/1969";
      • Regex r =
      • new Regex( @"^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$" );
      • Match m = r.Match(date);
      • if (m.Success) {
        • string month = m.Groups[1].Value;
        • string day = m.Groups[2].Value;
        • string year = m.Groups[3].Value;
      • }
      • }
    • }
  • 62. Python Regex
    • import re
    • date = '12/30/1969'
    • regex = re.compile(r'^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$')
    • match = regex.match(date)
    • if match:
      • month = match.group(1) #12
      • day = match.group(2) #30
      • year = match.group(3) #1969
  • 63. Ruby Regex
    • date = '12/30/1969'
    • regexp = Regexp.new('^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$')
    • if md = regexp.match(date)
      • month = md[1] #12
      • day = md[2] #30
      • year = md[3] #1969
    • end
  • 64. Unicode Properties
  • 65. Pattern Code Expression – (??{ })
    • Pattern code expression - the result of the code evaluation is treated as a regular expression and matched immediately.
      • (??{ })
    • Find incremental numbers?
      • $str="abc 123 hai cde 34567 efg 1245 a132 123456789 10adf";
      • print "$1\n" while($str=~/\D ( (\d) (?{$x=$2}) ( (??{++$x%10}) )* ) \D/gx);
  • 66. Commify a number
    • $no=123456789;
    • substr($no,0,length($no)-1)=~s/(?=(?<=\d)(?:\d\d)+$)/,/g;
    • print $no;
    • Produces 12,34,56,789
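The same lookaround idea, sketched in Python. The function name `commify_indian` is made up; like the Perl one-liner, it sets the last digit aside and then inserts a comma wherever a digit precedes and an even number of digits follows.

```python
import re

def commify_indian(number: int) -> str:
    """Group digits as 12,34,56,789 (hypothetical helper name)."""
    digits = str(number)
    head, last = digits[:-1], digits[-1]
    # Zero-width match: a digit behind, and pairs of digits up to the end ahead.
    head = re.sub(r"(?<=\d)(?=(?:\d\d)+$)", ",", head)
    return head + last

print(commify_indian(123456789))
print(commify_indian(1234))
print(commify_indian(123))
```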
  • 67. Find incremental numbers? (Same one-liner as slide 65.) Note that a non-capturing group inside a capturing group still contributes its text to the capture, so $1 here includes the trailing whitespace:
    • perl -e '$x="cat cat cat";$x=~/(cat(?:\s+))/;print ":$1:";'