Regular Expressions

Regular Expressions Presentation Transcript

  • Regular Expressions, by Satyanarayana D <satyavvd@yahoo-inc.com>
  • Topics
    • What?
    • Why?
    • History - Who?
    • Flavou?rs
    • Grammar
    • Meta Chars
    • Character Classes
    • Shorthand Char Classes
    • Anchors
    • Repeaters or Quantifiers
    • Grouping & Capturing
    • Alternation
    • Match Float
    • Atomic Grouping
    • Look Around
    • Conditional Expr.
    • Recursive Regex
    • Code Evaluation
    • Code Expr.
    • Inline Modifiers
    • Regex Tools
    • Q&A
  • What are Regular Expressions?
    • A Regular expression is a pattern describing a certain amount of text.
    • A regular expression, often called a pattern, is an expression that describes a set of strings. - Wikipedia
  • Why do we need them?
    • Regular expressions allow matching and manipulation of textual data.
    • Requirements
      • Matching/Finding
      • Doing something with matched text
      • Validation of data
      • Case insensitive matching
      • Parsing data (e.g. HTML)
      • Converting data into a different form, etc.
  • History: Stephen Kleene, a mathematician, discovered 'regular sets'.
  • History: Ken Thompson, 1968 - Regular Expression Search Algorithm. QED -> ed -> g/re/p
  • History: Henry Spencer, 1986 - wrote a regex library in C.
  • Regex Flavors
    • BRE - Basic Regular Expressions
      • \?, \+, \{, \|, \(, and \)
      • ed, g/re/p, sed
    • ERE - Extended Regular Expressions
      • ?, +, {, |, (, and )
      • grep -E == egrep, awk
    • PCRE - Perl Compatible Regular Expressions, by Philip Hazel
      • Perl, PHP, Tcl etc.
  • Grammar of Regex
    • RE = one or more non-empty 'branches' separated by '|'
    • Branch = one or more 'pieces'
    • Piece = atom, optionally followed by a quantifier
    • Quantifier = '*', '+', '?' or a 'bound'
    • Bound = atom{n}, atom{n,}, atom{m,n}
    • Atom = (RE) or () or '^', '$', or '\' followed by one of `^.[$()|*+?{' or any-char or a 'bracket expression'
    • Bracket Expression = a list of characters enclosed in `[ ]'
  • Meta Chars? In a normal expression like 2 + 4, the '+' has a special meaning.
  • Meta Chars
    • \ Quote the next metacharacter
    • ^ Match the beginning of the line
    • . Match any character (except newline) = [^\n]
    • $ Match the end of the line (or before newline at the end)
    • | Alternation
    • ( ) Grouping
    • [ ] Character class
    • { } Match m to n times
    • * Match 0 or more times
    • + Match 1 or more times
    • ? Match 1 or 0 times
  • Non-printable Chars
    • \t tab (HT, TAB)
    • \n newline (LF, NL)
    • \r return (CR)
    • \f form feed (FF)
    • \a alarm (bell) (BEL)
    • \e escape (think troff) (ESC)
    • \033 octal char (example: ESC)
    • \x1B hex char (example: ESC)
    • \x{263a} long hex char (example: Unicode SMILEY)
    • \cK control char (example: VT)
    • \N{name} named Unicode character
  • Character Classes – [ ]
    • A set of characters placed inside square brackets. Inside brackets, metacharacters lose their special meaning (except '] \ ^ -').
    • Requirements
      • Matches one and only one character out of the specified characters.
      • A range can be specified using '-'.
        • a-z matches the 26 lower case English letters.
        • 0-9 matches any digit.
      • Negation can be specified using '^' at the beginning of the class.
      • To match the exceptional characters above literally, either escape them or place them where they have no special meaning (for example, '-' at the end).
    • Examples
      • [0-9] matches any one of 0,1,2,3,4,5,6,7,8,9.
      • [aeiou] matches one English vowel character.
      • [^aeiou] matches any non-vowel character.
      • [a-z-] matches a to z and '-'.
      • [a-z0-9] union, matches a to z and 0 to 9.
      • [a-z&&[m-z]] intersection, matches m to z (only in some flavors, e.g. Java).
      • [a-z-[m-z]] subtraction, matches a to l (only in some flavors, e.g. .NET).
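The bracket expressions above can be tried out with a short Python sketch. Python is used here only for illustration; the sample word `Regex-101` is an assumption and not part of the slides. Intersection and subtraction are left out because Python's `re` module does not support them.

```python
import re

# Hypothetical sample text, chosen only to exercise each class.
word = "Regex-101"

digits = re.findall(r"[0-9]", word)           # any one digit
vowels = re.findall(r"[aeiou]", word)         # any one lower case vowel
lower_or_dash = re.findall(r"[a-z-]", word)   # '-' placed last is a literal dash
others = re.findall(r"[^a-z0-9]", word)       # negated class

print(digits, vowels, lower_or_dash, others)
```

Each call returns the list of single characters matched by the class, which shows that a class always consumes exactly one character per match.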
  • POSIX Character Classes – [: … :]
    • [^[:digit:]] = \D = [^0-9]
  • Shorthand Chars
    • \w word character [A-Za-z0-9_]
    • \d decimal digit [0-9]
    • \s whitespace [ \n\r\t\f]
    • \W not a word character [^A-Za-z0-9_]
    • \D not a decimal digit [^0-9]
    • \S not whitespace [^ \n\r\t\f]
  • Anchors/Assertions
    • Anchor matches a certain position in the subject string and it won’t consume any characters.
    • ^ Match the beginning of the line
    • $ Match the end of the line (or before newline at the end)
    • \A Matches only at the very beginning
    • \z Matches only at the very end
    • \Z Matches like $ used in single-line mode
    • \b Matches when the current position is a word boundary
    • \<, \> Matches when the current position is a word boundary
    • \B Matches when the current position is not a word boundary
  • ^ Anchor
    • ^ matches the beginning of the line.
    • /^a/ matches a string that begins with 'a'.
  • $ Anchor
    • $ matches the end of the line (or before a newline at the end).
    • /s$/ matches a string that ends with 's'.
  • \A Anchor
    • \A matches only at the very beginning (^ vs \A).
  • \z, \Z Anchors
    • \z matches only at the very end.
    • \Z matches like $ used in single-line mode ($ vs \z, \Z).
  • \b, \B Anchors
    • \b = \W\w|\w\W = matches a word boundary.
    • \B matches when the current position is not a word boundary.
    • In '$ xl2twiki file 2> /dev/null', /\b2\b/ matches the '2' before '>' and /\B2\B/ matches the '2' inside 'xl2twiki'.
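A minimal Python sketch of the anchor behaviour described above. The multi-line sample text is an assumption made for illustration, and the command string is a reading of the slide's \b example. Note that Python spells the end-of-string anchor `\Z` and has no `\z`.

```python
import re

# Hypothetical three-line subject string.
text = "alpha\nbeta\napple"

line_starts_default = re.findall(r"^a\w+", text)        # ^ matches only at string start
line_starts_multi = re.findall(r"^a\w+", text, re.M)    # with re.M, ^ matches after each newline
string_start_only = re.findall(r"\Aa\w+", text, re.M)   # \A ignores re.M

command = "xl2twiki file 2> /dev/null"
standalone_2 = [m.start() for m in re.finditer(r"\b2\b", command)]   # '2' with boundaries on both sides
embedded_2 = [m.start() for m in re.finditer(r"\B2\B", command)]     # '2' inside a word

print(line_starts_default, line_starts_multi, string_start_only)
print(standalone_2, embedded_2)
```

None of the anchors appear in the matched text, since they match positions and consume no characters.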
  • Quantifiers
    • Why? – Because we are not sure about the text. A quantifier specifies how many times a regex component must repeat.
    • Quantifiers (repetition):
      • {m,n} = matches a minimum of m and a maximum of n occurrences.
      • * = {0,} = matches zero or more occurrences (any amount).
      • + = {1,} = matches one or more occurrences.
      • ? = {0,1} = matches zero or one occurrence (means optional).
  • Quantifiers
    • By default quantifiers are greedy.
      • /\d{2,4}/ on '2010' matches '2010'.
      • /<.+>/ on 'My first <strong>regex</strong> test.' matches '<strong>regex</strong>'.
      • /\w+sion/ matches 'Expression'.
    • If the entire match fails because greedy quantifiers consumed too much, they are forced to give up as much as needed to make the rest of the regex succeed.
  • Non Greedy Quantifiers – {m,n}? *? +? ??
    • To make a quantifier non greedy, append '?'.
      • /<.+?>/ on 'My first <strong>regex</strong> test.' matches '<strong>'.
    • Or use negated classes.
      • /<[^>]+>/ on 'My first <strong>regex</strong> test.' matches '<strong>'.
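The difference between greedy, non-greedy and negated-class matching can be checked with a small Python sketch. Python is used for illustration; the subject string follows the slide's example.

```python
import re

text = "My first <strong>regex</strong> test."

greedy = re.search(r"<.+>", text).group()       # .+ runs to the last '>' it can reach
lazy = re.search(r"<.+?>", text).group()        # .+? stops at the first '>'
negated = re.search(r"<[^>]+>", text).group()   # [^>]+ cannot run past a '>'

print(greedy)
print(lazy)
print(negated)
```

The negated class gives the same result as the lazy quantifier here and usually needs less backtracking, which is presumably why the slide recommends it.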
  • Grouping – ( )
    • Why? – To create sub patterns, so that you can apply regex operators to whole sub patterns or you can reference them by corresponding sub group numbers.
    \d{2}-\d{2}-\d{2}(\d{2})? will match 01-01-10 and also 01-01-2010.
    • Grouping can be used for alternation.
  • Alternation - |
    • Why? – Lets you match more than one sub-expression at the same point.
    /\b(get|set)Value\b/ matches either getValue or setValue.
    • Branches are tried from left to right.
    • Eagerness - the first branch that matches wins, so order matters: put the most likely pattern first.
      • (and|android) on 'robot and an android fight' matches 'and' twice (once inside 'android') and never matches 'android'.
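A short Python sketch of left-to-right alternation, using the slide's own sample sentence. The list of method names is an assumption added to show the effect of \b.

```python
import re

sentence = "robot and an android fight"

# The first branch that matches wins, so 'android' is never reached here.
short_first = re.findall(r"and|android", sentence)
# Putting the longer branch first lets it match where it applies.
long_first = re.findall(r"android|and", sentence)

# Hypothetical identifiers: \b keeps 'resetValue' from matching.
names = "getValue setValue resetValue"
accessors = re.findall(r"\b(?:get|set)Value\b", names)

print(short_first, long_first, accessors)
```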
  • Capturing – ( )
    • Allows us to access sub-parts of pattern for later processing.
      • All captured sub patterns are stored in memory.
      • Captured patterns are numbered from left to right.
    /\b((\d{2})-(\d{2})-(\d{2}(\d{2})?))\b/ applied to: Today is '18-08-2010'.
      • 1 -> date -> 18-08-2010
      • 2 -> day -> 18
      • 3 -> month -> 08
      • 4 -> year -> 2010
      • 5 -> year, last two digits -> 10
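The group numbering can be verified with a small Python sketch that applies the slide's date pattern. Groups are numbered by the position of their opening parenthesis, left to right.

```python
import re

DATE_RE = re.compile(r"\b((\d{2})-(\d{2})-(\d{2}(\d{2})?))\b")

match = DATE_RE.search("Today is '18-08-2010'.")
groups = match.groups()   # tuple of groups 1..5 in order

for number, value in enumerate(groups, start=1):
    print(number, "->", value)
```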
  • Non-Capturing sub patterns– (?: )
    • If you don't need back referencing, make sub expressions non-capturing. It saves memory and processing time.
    \d{2}-\d{2}-\d{2}(?:\d{2})? will match 01-01-10 and also 01-01-2010.
  • Named Capture – (?<> )
    • We can give names to sub patterns instead of numbers.
      • (?P<name>pattern) -> Python style, Perl 5.12
      • (?P=name) -> back reference
      • (?<name>pattern) or (?'name'pattern) -> Perl 5.10
      • \k<name> or \k'name' or \g{name} -> back reference
      • \g{-1}, \g{-2} -> relative back reference
    • (?<vowel>[ai]).\k<vowel>.\1 matches 'acada' in 'abracadabra !!'
    • /(\w+)\s+\g{-1}/ finds the doubled words in "Thus joyful Troy Troy maintained the the watch of night..."
    • $date="18-08-2010"; $date =~ s/(?<day>\d{2})-(?<month>\d{2})-(?<year>\d{4})/$+{year}-$+{month}-$+{day}/;
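The same two ideas, reordering a date by name and finding doubled words with a named back reference, can be sketched in Python. This uses Python's `(?P<name>...)` syntax rather than the Perl forms on the slide, and `(?P=word)` stands in for the relative back reference, which Python's `re` module does not offer.

```python
import re

# Named groups make the replacement template self-describing.
DATE_RE = re.compile(r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})")
iso_date = DATE_RE.sub(r"\g<year>-\g<month>-\g<day>", "18-08-2010")

# (?P=word) must match the same text that group 'word' captured.
REPEAT_RE = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")
line = "Thus joyful Troy Troy maintained the the watch of night..."
repeated = [m.group("word") for m in REPEAT_RE.finditer(line)]

print(iso_date)
print(repeated)
```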
  • Before Evaluating Regex
    • Hits
      • Lines that I want to match.
    • Misses
      • Lines that I don't want to match.
    • Omissions
      • Lines that I didn't match but wanted to match.
    • False alarms
      • Lines that I matched but didn't want to match.
  • Matching a float number. Basic principle: split your task into sub tasks. Float number = integerpart.fractionalpart
  • Integer part = \d+ -> will match one or more digits
  • Literal dot = \.
  • Fractional part = \d+ -> will match one or more digits
  • Combine all of them = \d+\.\d+
  • /\d+\.\d+/ -> is generic, but it won't match the sign in -123.45 or +123.45
  • /[+-]?\d+\.\d+/ -> will match them, but it won't match '- 123.45' or '+ 123.45'
  • /[+-]? *\d+\.\d+/ -> will match them, but it won't match '123.' or '.45'
  • /[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)/ -> will match them, but it won't match the exponent in 10e2 or 101E5
  • /[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)(?:[eE]\d+)?/ -> adds an optional exponent, but it won't match 10e-2
  • /^[+-]? *(?:\d+\.\d+|\d+\.|\.\d+)(?:[eE][+-]?\d+)?$/ -> allows a signed exponent and anchors the match to the whole string
  • Match a float number (final form; the extra |\d+ branch lets an integer mantissa such as 10e2 match)
    /^
      [+-]? *              # first, match an optional sign
      (?:                  # then match integers or f.p. mantissas:
          \d+\.\d+         # mantissa of the form a.b
        | \d+\.            # mantissa of the form a.
        | \.\d+            # mantissa of the form .b
        | \d+              # integer of the form a
      )
      (?:[eE][+-]?\d+)?    # finally, optionally match an exponent
    $/x;
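The final float pattern can be sketched in Python with `re.VERBOSE`, which plays the role of Perl's /x. The sample inputs are assumptions chosen to cover each branch. One difference from the Perl form is that a literal space must be written as `[ ]` in verbose mode.

```python
import re

FLOAT_RE = re.compile(
    r"""
    ^ [+-]? [ ]*            # optional sign, then optional spaces
    (?:
        \d+\.\d+            # mantissa of the form a.b
      | \d+\.               # mantissa of the form a.
      | \.\d+               # mantissa of the form .b
      | \d+                 # integer of the form a
    )
    (?: [eE][+-]?\d+ )?     # optional exponent, itself optionally signed
    $
    """,
    re.VERBOSE,
)

# Hypothetical test inputs.
should_match = ["123.45", "-123.45", "+ 123.45", "123.", ".45", "10e-2", "101E5"]
should_fail = ["abc", ".", "1.2.3", "e5"]

matched = [s for s in should_match if FLOAT_RE.match(s)]
rejected = [s for s in should_fail if not FLOAT_RE.match(s)]
print(matched, rejected)
```

Building the pattern one branch at a time, as the slides do, also gives a natural list of hits and misses to test against.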
  • Atomic Grouping – (?> )
    • Before looking into atomic grouping, we need to know about backtracking.
    • Backtracking – if you don't succeed, try and try again...
    • /\d+99/ against '19999':
      • \d+ takes 1 -> matched so far: 1
      • \d+ takes 9 -> 19
      • \d+ takes 9 -> 199
      • \d+ takes 9 -> 1999
      • \d+ takes 9 -> 19999
      • 99 still needs to match, but nothing is left
      • \d+ gives up a 9 -> 1999, leaving 9
      • \d+ gives up one more 9 -> 199, leaving 99
      • 99 matches -> success
    • /\d+xx/ against '199Rs':
      • \d+ takes 1 -> matched so far: 1
      • \d+ takes 9 -> 19
      • \d+ takes 9 -> 199
      • x is not matched with R
      • \d+ gives up a 9 -> 19, still cannot match x
      • \d+ gives up a 9 -> 1, still cannot match x
      • \d+ cannot give up the 1, because it needs at least one digit
      • -> failure
  • Atomic Grouping – (?> )
    • Atomic Grouping:
      • Atomic grouping disables backtracking into the group, which speeds up the process.
      • (?>pattern): here the pattern is treated as an atomic token.
      • (?>\d+)xx: here (?>\d+) won't give up any digits; it locks them.
        • Against '199Rs' it fails right at matching x with R.
      • Atomic groups are not captured and can be nested.
    • Possessive Quantifiers:
      • Use possessive quantifiers on single items to avoid backtracking.
      • Adding '+' makes a quantifier possessive.
      • (?>\d+)xx == \d++xx
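A hedged Python sketch of the effect of an atomic group. Python's `re` module only accepts `(?>...)` and `\d++` from version 3.11 on, so this example swaps in a different technique: a lookahead that captures the digits, followed by a back reference. Lookaheads are never re-entered once they succeed, so `(?=(\d+))\1` behaves like `(?>\d+)`.

```python
import re

# Ordinary greedy quantifier: \d+ backtracks and gives up two 9s.
backtracking = re.search(r"\d+99", "19999")

# Emulated atomic group: the digits are locked in, so '99' can never match.
atomic_like = re.search(r"(?=(\d+))\1(?:99)", "19999")

# Both forms fail on the slide's second example; the atomic form just fails sooner.
plain_fail = re.search(r"\d+xx", "199Rs")
atomic_fail = re.search(r"(?=(\d+))\1xx", "199Rs")

print(backtracking.group(), atomic_like, plain_fail, atomic_fail)
```

The first pair shows that an atomic group can change what matches, not only how fast a match fails, so it should be applied where giving digits back is never wanted.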
  • Look Around
    • (?=...) Zero-width positive lookahead assertion
    • (?!...) Zero-width negative lookahead assertion
    • (?<=...) Zero-width positive lookbehind assertion
    • (?<!...) Zero-width negative lookbehind assertion
    • Note: assertions can be nested. Example: /(?<=, (?! (?<=\d,)(?=\d) ) )/
  • Look Around
    Subject: "I catch the housecat 'Tom-cat' with catnip"
      • /cat(?=\s+)/ matches 'cat' in 'housecat'
      • /(?<=\s)cat\w+/ matches 'catch' and 'catnip'
      • /\bcat\b/ matches 'cat' in 'Tom-cat'
      • /(?<=\s)cat(?=\s)/ matches nothing: there is no isolated 'cat'
      • /cat(?!\s)/ matches 'cat' in 'catch', 'Tom-cat' and 'catnip'
      • /(?<!\s)cat/ matches 'cat' in 'housecat' and 'Tom-cat'
    Note: look-behind expressions cannot be of variable length. This means you cannot use quantifiers (?, *, +, or {1,5}) or alternation of different-length items inside them.
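The lookaround examples above can be checked with a short Python sketch that records where each pattern matches in the slide's sample sentence. The helper name `starts` is hypothetical. In this sentence the word 'cat' begins at offsets 2 (catch), 17 (housecat), 26 (Tom-cat) and 36 (catnip).

```python
import re

sentence = "I catch the housecat 'Tom-cat' with catnip"

def starts(pattern: str) -> list[int]:
    """Return the start offset of every match of pattern in the sentence."""
    return [m.start() for m in re.finditer(pattern, sentence)]

followed_by_space = starts(r"cat(?=\s)")
after_space_word = starts(r"(?<=\s)cat\w+")
whole_word = starts(r"\bcat\b")
isolated = starts(r"(?<=\s)cat(?=\s)")
not_followed_by_space = starts(r"cat(?!\s)")
not_after_space = starts(r"(?<!\s)cat")

print(followed_by_space, after_space_word, whole_word)
print(isolated, not_followed_by_space, not_after_space)
```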
  • Conditional expressions
    • A conditional expression is a form of if-then-else statement that lets one choose which pattern is matched, based on some condition.
      • (?(condition)yes-regexp) is like an 'if () {}' statement
      • (?(condition)yes-regexp|no-regexp) is like an 'if () {} else {}' statement
        • Condition can be
          • The number of a sub pattern (true if that sub pattern matched)
          • Lookaround assertion
          • Recursive call
    • Match an optionally quoted string -> /^("|')?[^"']*(?(1)\1)$/
      • Matches 'blah blah'
      • Matches "blah blah"
      • Matches blah blah
      • Won't match 'blah blah"
    • /(.)\1(?(?<=AA)G|C)$/ matches ATGAAG, TAGBBC and GATGGC
    • On /usr/share/dict/words, /^(.+)(.+)?(?(2)\2\1|\1)$/ matches aa, baba, beriberi, maam, vetitive
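The optionally quoted string example can be sketched in Python, whose `re` module supports conditions on a group number or name. It does not support lookaround conditions, so the second example is not reproduced here.

```python
import re

# If group 1 (the opening quote) matched, the same quote must close the string.
QUOTED_RE = re.compile(r"""^("|')?[^"']*(?(1)\1)$""")

samples = ["'blah blah'", '"blah blah"', "blah blah", "'blah blah\""]
results = [bool(QUOTED_RE.match(s)) for s in samples]

for sample, ok in zip(samples, results):
    print(sample, "->", ok)
```

The last sample fails because the closing quote differs from the opening one, which is the case a plain optional quote at each end would wrongly accept.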
  • Recursive Patterns – (?)
    • (x(x)y(x)x)
    • Palindrome -> /^((.)(?:(?1)|\w)*(\2))$/
      • qr/
      • ^ # Start of string
      • ( # Start capture group 1
      • \( # Open paren
      • (?> # Atomic (non-backtracking) subgroup
      • [^()]++ # Grab all the non parens we can
      • | # or
      • (?1) # Recurse into group 1
      • )* # Zero or more times
      • \) # Close paren
      • ) # End capture group 1
      • $ # End of string
      • /x;
  • Code Evaluation – (?{ })
    • Perl code can be evaluated inside regular expressions using
      • (?{ }) construct.
    $x = "aaaa"; $x =~ /(a(?{print "Yow\n";}))*aa/; produces Yow Yow Yow Yow
  • Pattern Code Expression – (??{ })
    • Pattern code expression - the result of the code evaluation is treated as a regular expression and matched immediately.
      • Construct is (??{ })
    $length = 5; $char = 'a'; $str = 'aaaaabb'; $str =~ /(??{$char x $length})/x; # matches, there are 5 of 'a'
  • Inline modifiers & Comments
    • Matching can be modified inline by placing modifiers.
      • (?i) enables case-insensitive mode
      • (?m) enables multiline matching for ^ and $
      • (?s) makes the dot metacharacter match newline also
      • (?x) ignores literal whitespace
      • (?U) makes quantifiers ungreedy (lazy) by default
    • $answers =~ /(?i)y(?-i)(?:es)?/ -> will match 'y', 'Y', 'yes', 'Yes' but not 'YES'.
    • Comments can be inserted inline using the (?#) construct.
      • /^(?#begin)\d+(?#match integer part)\.(?#match dot)\d+(?#match fractional part)$/
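A minimal Python sketch of a locally scoped modifier. Python's `re` module does not allow switching a flag off mid-pattern with `(?-i)` on its own, so this example assumes the scoped form `(?i:...)` (available since Python 3.6) as the closest equivalent of the slide's `(?i)y(?-i)`.

```python
import re

# Only the 'y' is case-insensitive; the optional 'es' stays case-sensitive.
ANSWER_RE = re.compile(r"(?i:y)(?:es)?")

candidates = ["y", "Y", "yes", "Yes", "YES"]
accepted = [s for s in candidates if ANSWER_RE.fullmatch(s)]

print(accepted)
```

`fullmatch` is used so that 'YES' is rejected as a whole, instead of matching only its leading 'Y'.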
  • Regex Testers / Tools
    • Editors: Vim, TextMate, Edit Pad Pro, NoteTab, UltraEdit
    • RegexBuddy
    • Reggy - http://reggyapp.com
    • http://rubular.com (Ruby)
    • RegexPal (JavaScript) - http://www.regexpal.com
    • http://www.gskinner.com/RegExr/
    • http://www.spaweditor.com/scripts/regex/index.php
    • http://regex.larsolavtorvik.com/ (PHP, JavaScript)
    • http://www.nregex.com/ (.NET)
    • http://www.myregexp.com/ (Java)
    • http://osteele.com/tools/reanimator (NFA graphic representation)
    • Expresso - http://www.ultrapico.com/Expresso.htm (.NET)
    • Regulator - http://sourceforge.net/projects/regulator (.NET)
    • RegexRenamer - http://regexrenamer.sourceforge.net/ (.NET)
    • PowerGREP - http://www.powergrep.com/
    • Windows Grep - http://www.wingrep.com/
  • Regex Resources
    • $ perldoc perlre perlretut perlreref
    • $ man re_format
    • "Mastering Regular Expressions" by Jeffrey Friedl - http://oreilly.com/catalog/9780596528126/
    • "Regular Expressions Cookbook" by Jan Goyvaerts & Steven Levithan - http://oreilly.com/catalog/9780596520694
  • Questions? * { } ^ ] + $ [ ( ? . ) - : #
  • Thank Y!ou * { } ^ ] + $ [ ( ? . ) - : #
  • Java Regex
    • import java.util.regex.*;
    • public class MatchTest {
    • public static void main(String[] args) throws Exception {
    • String date = "12/30/1969";
    • // Groups: 1 = month, 2 = day, 3 = year (two or four digits)
    • Pattern p = Pattern.compile("^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$");
    • Matcher m = p.matcher(date);
    • if (m.find()) {
        • String month = m.group(1);
        • String day = m.group(2);
        • String year = m.group(3);
        • System.out.printf("Found %s-%s-%s\n", year, month, day);
        • }
    • }
    • }
  • PHP Regex
    • $date = "12/30/1969";
    • // '!' is the pattern delimiter; single quotes keep the backslashes literal
    • $p = '!^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$!';
    • if (preg_match($p, $date, $matches)) {
      • $month = $matches[1];
      • $day = $matches[2];
      • $year = $matches[3];
    • }
    • $text = "Hello world. <br>";
    • $pattern = "{<br>}i";
    • echo preg_replace($pattern, "<br />", $text);
  • JavaScript Regex
    • var date = "12/30/1969";
    • // Backslashes are doubled because the pattern is built from a string
    • var p = new RegExp("^(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)$");
    • var result = p.exec(date);
    • if (result != null) {
      • var month = result[1];
      • var day = result[2];
      • var year = result[3];
    • }
    • var text = "Hello world. <br>";
    • var pattern = /<br>/ig;
    • text = text.replace(pattern, "<br />");
  • .NET Regex
    • using System.Text.RegularExpressions;
    • class MatchTest {
    • static void Main( ) {
      • string date = "12/30/1969";
      • Regex r =
      • new Regex( @"^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$" );
      • Match m = r.Match(date);
      • if (m.Success) {
        • string month = m.Groups[1].Value;
        • string day = m.Groups[2].Value;
        • string year = m.Groups[3].Value;
      • }
      • }
    • }
  • Python Regex
    • import re
    • date = '12/30/1969'
    • regex = re.compile(r'^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$')
    • match = regex.match(date)
    • if match:
      • month = match.group(1) #12
      • day = match.group(2) #30
      • year = match.group(3) #1969
  • Ruby Regex
    • date = '12/30/1969'
    • regexp = Regexp.new('^(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)$')
    • if md = regexp.match(date)
      • month = md[1] #12
      • day = md[2] #30
      • year = md[3] #1969
    • end
  • Unicode Properties
  • Pattern Code Expression – (??{ }): find incremental numbers
    $str = "abc 123 hai cde 34567 efg 1245 a132 123456789 10adf"; print "$1\n" while ($str =~ /\D((\d)(?{$x=$2})((??{++$x%10}))*)\D/gx);
  • Commify a number
    $no = 123456789; substr($no, 0, length($no)-1) =~ s/(?=(?<=\d)(?:\d\d)+$)/,/g; print $no;
    Produces 12,34,56,789
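The same grouping (last three digits, then pairs) can be sketched in Python with two lookarounds. The function name `commify_indian` is hypothetical. Instead of Perl's `substr` trick, this version excludes the last digit with `\d$` inside the lookahead, and it assumes a non-negative integer.

```python
import re

def commify_indian(number: int) -> str:
    """Insert commas so the last three digits form one group and the rest form pairs."""
    # A comma goes wherever a digit precedes the position and an odd count (>= 3)
    # of digits follows it up to the end of the string.
    return re.sub(r"(?<=\d)(?=(?:\d\d)+\d$)", ",", str(number))

print(commify_indian(123456789))
print(commify_indian(12345))
print(commify_indian(123))
```

Both assertions are zero-width, so the substitution inserts commas without consuming any digits.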
  • A non-capturing group inside a capturing group does not remove its text from the capture:
    perl -e '$x="cat cat cat"; $x =~ /(cat(?:\s+))/; print ":$1:";'
    prints ":cat :" (the whitespace matched by (?:\s+) is still part of $1).