Regular Expression

Published in: Technology, Business

  1. 1. Regular Expression Supported By : java.util.regex
  2. 2. Introduction & uses <ul><li>A regular expression is a way to describe a group of Strings based on common characteristics shared by each String in the group. </li></ul><ul><li>Put simply, we write a sequence of characters, called a pattern, and use that pattern to collect the character sequences that match it. </li></ul><ul><li>Regular expressions are generally used for text parsing, search and replace, editing and other kinds of text manipulation. </li></ul>
  3. 3. Pattern & Matcher Objects <ul><li>Pattern </li></ul><ul><li>A Pattern object is a compiled representation of a regular expression. </li></ul><ul><li>This class has no public constructors. </li></ul><ul><li>To create a Pattern, we invoke one of its public static compile methods. </li></ul><ul><li>Matcher </li></ul><ul><li>A Matcher object is the engine that interprets the pattern & performs match operations against an input string. </li></ul><ul><li>Like the Pattern class, Matcher defines no public constructors. </li></ul><ul><li>We get a Matcher object by invoking the matcher() method on a Pattern object. </li></ul>
  4. 4. Performing Pattern Matching <ul><li>static Pattern compile(String regex) </li></ul><ul><li>Matcher matcher(CharSequence str) </li></ul><ul><li>Pattern p = Pattern.compile(&quot;sec-58&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;sec-58&quot;); </li></ul><ul><li>System.out.println(m.matches()); </li></ul><ul><li>m = p.matcher(&quot;Sec-58&quot;); </li></ul><ul><li>System.out.println(m.matches()); </li></ul><ul><li>Output </li></ul><ul><li>true </li></ul><ul><li>false </li></ul>
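The second call above prints false because matching is case-sensitive by default. A minimal sketch of the same check follows, using the two-argument compile overload with the Pattern.CASE_INSENSITIVE flag to relax it. The class name MatchesDemo and the helper fullMatch are illustrative names, not part of the slides or of java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesDemo {
    // Returns true only if the ENTIRE input matches the regex.
    // flags is passed straight to Pattern.compile (0 means no flags).
    static boolean fullMatch(String regex, int flags, String input) {
        Pattern p = Pattern.compile(regex, flags);
        Matcher m = p.matcher(input);
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("sec-58", 0, "sec-58"));
        System.out.println(fullMatch("sec-58", 0, "Sec-58"));
        System.out.println(fullMatch("sec-58", Pattern.CASE_INSENSITIVE, "Sec-58"));
    }
}
```

Note that matches() succeeds only when the whole input matches; find(), covered next, looks for a matching subsequence.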
  5. 5. Performing Pattern Matching contd. <ul><li>boolean find() </li></ul><ul><li>Determines whether a subsequence of the input sequence matches the pattern. </li></ul><ul><li>String group() </li></ul><ul><ul><ul><li>Returns the input subsequence matched by the previous match. </li></ul></ul></ul><ul><li>int start() </li></ul><ul><li>Returns the start index of the previous match in the input sequence. </li></ul><ul><li>int end() </li></ul><ul><li>Returns the index one past the last character of the previous match. </li></ul><ul><li>Both start() and end() throw IllegalStateException if there is no match. </li></ul><ul><li>String replaceAll(String) </li></ul><ul><ul><li>Replaces every occurrence of a matching sequence with the given replacement string. </li></ul></ul>
  6. 6. Performing Pattern Matching contd. <ul><li>Pattern p = Pattern.compile(&quot;sec-58&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;C-58,sec-58;D-20,sec-58;F-14,sec-57;C-45,sec-58&quot;); </li></ul><ul><li>while( m.find() ) </li></ul><ul><li>{ </li></ul><ul><li>System.out.println( m.group() +&quot; Starting at &quot;+m.start() ); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>sec-58 Starting at 5 </li></ul><ul><li>sec-58 Starting at 17 </li></ul><ul><li>sec-58 Starting at 41 </li></ul>
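The slides list replaceAll but do not demonstrate it. The sketch below pairs the find() loop with a replaceAll call on a shortened version of the same input. FindReplaceDemo, findAll and replaceEvery are hypothetical names chosen for illustration, and the replacement text sec-62 is an invented value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindReplaceDemo {
    // Collects every match as "text@start-end" (end is one past the last character).
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return hits;
    }

    // Replaces every match of regex in input with the replacement string.
    static String replaceEvery(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "C-58,sec-58;D-20,sec-58";
        System.out.println(findAll("sec-58", input));
        System.out.println(replaceEvery("sec-58", input, "sec-62"));
    }
}
```

In a replacement string, $ and \ have special meaning, so a literal replacement containing them would need Matcher.quoteReplacement.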
  7. 7. Metacharacters <ul><li>The API supports a number of special characters that affect the way a pattern is matched. </li></ul><ul><li>Supported metacharacters are : </li></ul><ul><li>( [ { \ ^ - $ | ] } ) ? * + . </li></ul><ul><li>Note : In certain contexts, the characters listed above are not treated as metacharacters. </li></ul><ul><li>There are 2 ways to force a metacharacter to be treated as an ordinary character : </li></ul><ul><li>Precede the metacharacter with a backslash, or </li></ul><ul><li>Enclose it within \Q and \E. </li></ul><ul><li>Pattern p = Pattern.compile(&quot;\\+\\.&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;+.+&quot;); </li></ul><ul><li>while(m.find()) </li></ul><ul><li>{ </li></ul><ul><li>System.out.println( m.group()); </li></ul><ul><li>} </li></ul><ul><li>Output : +. </li></ul>
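Both quoting styles can be compared side by side. The sketch below matches the literal text +. in three ways: with backslash escapes, with \Q...\E, and with Pattern.quote, a standard helper that produces the quoted form for us. QuoteDemo and firstMatch are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "+.+";
        // In Java source, each regex backslash is written as "\\".
        System.out.println(firstMatch("\\+\\.", input));            // backslash escapes
        System.out.println(firstMatch("\\Q+.\\E", input));          // \Q ... \E quoting
        System.out.println(firstMatch(Pattern.quote("+."), input)); // library helper
    }
}
```

Pattern.quote is usually the safer choice when the literal text comes from user input, since nothing in it needs to be escaped by hand.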
  8. 8. Character Classes <ul><li>A character class is a set of characters enclosed within square brackets. </li></ul><ul><li>It specifies the characters that will successfully match a single character from a given input string. </li></ul><ul><li>Regular Expression : Conditions under which there’ll be a match </li></ul><ul><li>[abc] : a, b or c (simple class) </li></ul><ul><li>[^abc] : Any character except a, b or c (negation) </li></ul><ul><li>[a-zA-Z] : a through z, or A through Z, inclusive (range) </li></ul><ul><li>[a-d[m-p]] : a through d, or m through p: [a-dm-p] (union) </li></ul><ul><li>[a-z&&[def]] : d, e or f (intersection) </li></ul><ul><li>[a-z&&[^bc]] : a through z, except for b and c: [ad-z] (subtraction) </li></ul><ul><li>[a-z&&[^m-p]] : a through z, and not m through p: [a-lq-z] (subtraction) </li></ul>
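Each kind of character class can be checked against single characters. The sketch below does this with the static Pattern.matches method; CharClassDemo and accepts are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True if the single character c is accepted by the given character class.
    static boolean accepts(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        System.out.println(accepts("[a-d[m-p]]", 'n'));    // union
        System.out.println(accepts("[a-z&&[def]]", 'g'));  // intersection
        System.out.println(accepts("[a-z&&[^bc]]", 'b'));  // subtraction
        System.out.println(accepts("[^abc]", 'd'));        // negation
    }
}
```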
  9. 9. Simple Classes <ul><li>Pattern p = Pattern.compile(&quot;[Hh][abit]&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;Hi, how r u?.Hey! Shall we go for the dinner tonight.&quot;); </li></ul><ul><li>while(m.find()) </li></ul><ul><li>{ </li></ul><ul><li>System.out.println( m.group()); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>Hi </li></ul><ul><li>ha </li></ul><ul><li>ht </li></ul>
  10. 10. Negation <ul><li>To match all the characters except those listed within brackets, insert the ^ metacharacter at the beginning of the character class. </li></ul><ul><li>Pattern p = Pattern.compile(&quot;[^Hh][^abit]&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;Hindustan times.&quot;); </li></ul><ul><li>while(m.find()) </li></ul><ul><li>{ </li></ul><ul><li>System.out.print( m.group()+&quot; &quot;); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>in du an im es </li></ul>
  11. 11. Ranges <ul><li>The metacharacter used is - (hyphen). </li></ul><ul><li>Ex : a-z, A-P, 5-8 </li></ul><ul><li>Pattern p = Pattern.compile(&quot;[^2-6][0-7]&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;4500-569-3286-5639&quot;); </li></ul><ul><li>while(m.find()) </li></ul><ul><li>{ </li></ul><ul><li>System.out.print( m.group()+&quot; &quot;); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>00 -5 -3 86 -5 </li></ul>
  12. 12. Union <ul><li>Used to create a single character class comprised of 2 or more separate character classes. </li></ul><ul><li>To create a union, simply nest one class inside the other. Such as [0-5[6-8]]. </li></ul><ul><li>Pattern p = Pattern.compile(&quot;[4-8][[5-9][02]]&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;4590-569-3286-5639&quot;); </li></ul><ul><li>while(m.find()) </li></ul><ul><li>{ </li></ul><ul><li>System.out.print( m.group()+&quot; &quot;); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>45 56 86 56 </li></ul>
  13. 13. Intersections <ul><li>To create a single character class matching only the characters common to all of its nested classes. </li></ul><ul><li>Ex : [0-6&&[345]] </li></ul><ul><li>Pattern p = Pattern.compile(&quot;[2-6&&[23478]]&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;978979321326&quot;); </li></ul><ul><li>while(m.find()) </li></ul><ul><li>{ </li></ul><ul><li>System.out.print(m.group()+&quot; &quot;); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>3 2 3 2 </li></ul>
  14. 14. Subtraction <ul><li>Used to negate one or more nested character classes. </li></ul><ul><li>Ex : [0-6&&[^345]] </li></ul><ul><li>Pattern p = Pattern.compile(&quot;[2-6&&[^234]]&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;978979321326&quot;); </li></ul><ul><li>while(m.find()) </li></ul><ul><li>{ </li></ul><ul><li>System.out.print(m.group()+&quot; &quot;); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>6 </li></ul>
  15. 15. Predefined Character Classes <ul><li>The Pattern API contains a number of useful predefined character classes, which offer convenient shorthands for commonly used regular expressions. </li></ul><ul><li>Shorthand : Character class </li></ul><ul><li>. (Dot) : Any character </li></ul><ul><li>\d : A digit: [0-9] </li></ul><ul><li>\D : A non-digit: [^0-9] </li></ul><ul><li>\s : A whitespace character: [ \t\n\x0B\f\r] </li></ul><ul><li>\S : A non-whitespace character: [^\s] </li></ul><ul><li>\w : A word character: [a-zA-Z_0-9] </li></ul><ul><li>\W : A non-word character: [^\w] </li></ul>
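A short sketch of the predefined classes in use follows. It extracts runs of digits, word characters and non-whitespace characters from one sample string. PredefinedDemo, findAll and the sample input are illustrative choices, not taken from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedDemo {
    // Returns the text of every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "C-58, sec 7";
        // Each backslash of the regex is doubled inside a Java string literal.
        System.out.println(findAll("\\d+", input)); // runs of digits
        System.out.println(findAll("\\w+", input)); // runs of word characters
        System.out.println(findAll("\\S+", input)); // runs of non-whitespace
    }
}
```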
  17. 17. Quantifiers <ul><li>Allow us to specify the number of occurrences to match against. </li></ul><ul><li>Types : 3 </li></ul><ul><ul><ul><li>Greedy </li></ul></ul></ul><ul><ul><ul><li>Reluctant </li></ul></ul></ul><ul><ul><ul><li>Possessive </li></ul></ul></ul><ul><li>Greedy | Reluctant | Possessive | Meaning </li></ul><ul><li>X? | X?? | X?+ | X, once or not at all </li></ul><ul><li>X* | X*? | X*+ | X, zero or more times </li></ul><ul><li>X+ | X+? | X++ | X, one or more times </li></ul><ul><li>X{n} | X{n}? | X{n}+ | X, exactly n times </li></ul><ul><li>X{n,} | X{n,}? | X{n,}+ | X, at least n times </li></ul><ul><li>X{n,m} | X{n,m}? | X{n,m}+ | X, at least n but not more than m times </li></ul>
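The counted forms X{n}, X{n,} and X{n,m} are not exercised elsewhere in the deck, so here is a small sketch of them. QuantifierDemo, findAll and the sample input are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the text of every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "a aa aaaaa";
        System.out.println(findAll("a{2}", input));    // exactly two
        System.out.println(findAll("a{3,}", input));   // three or more
        System.out.println(findAll("a{2,3}", input));  // greedy: takes three when it can
        System.out.println(findAll("a{2,3}?", input)); // reluctant: stops at two
    }
}
```

The lone a is never matched because every pattern here requires at least two occurrences.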
  18. 18. Greedy Quantifiers <ul><li>String regex1 = &quot;a?&quot;; </li></ul><ul><li>String regex2 = &quot;a*&quot;; </li></ul><ul><li>String regex3 = &quot;a+&quot;; </li></ul><ul><li>Pattern p = Pattern.compile(regex1); </li></ul><ul><li>Matcher m = p.matcher(&quot;&quot;); </li></ul><ul><li>if( m.find() ) </li></ul><ul><li>System.out.println(&quot;Match found at &quot;+m.start()+&quot; ending at &quot;+m.end() ); </li></ul><ul><li>p = Pattern.compile(regex2); </li></ul><ul><li>m = p.matcher(&quot;&quot;); </li></ul><ul><li>if( m.find() ) </li></ul><ul><li>System.out.println(&quot;Match found at &quot;+m.start()+&quot; ending at &quot;+m.end() ); </li></ul><ul><li>p = Pattern.compile(regex3); </li></ul><ul><li>m = p.matcher(&quot;&quot;); </li></ul><ul><li>if( m.find() ) </li></ul><ul><li>System.out.println(&quot;Match found at &quot;+m.start()+&quot; ending at &quot;+m.end() ); </li></ul><ul><li>Output </li></ul><ul><li>Match found at 0 ending at 0 </li></ul><ul><li>Match found at 0 ending at 0 </li></ul>
  19. 19. Zero-length Matches <ul><li>Both a* and a? allow zero occurrences of the letter a. </li></ul><ul><li>Cases where a zero-length match can occur : </li></ul><ul><ul><ul><li>In an empty input string. </li></ul></ul></ul><ul><ul><ul><li>At the beginning of an input string. </li></ul></ul></ul><ul><ul><ul><li>After the last character of an input string. </li></ul></ul></ul><ul><ul><ul><li>Between any 2 characters of an input string. </li></ul></ul></ul><ul><li>A zero-length match always starts and ends at the same index position. </li></ul><ul><li>Pattern p = Pattern.compile(&quot;a?&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;a&quot;); </li></ul><ul><li>while( m.find() ) </li></ul><ul><li>{ </li></ul><ul><li>System.out.println(m.group()+&quot; found at &quot;+m.start()+&quot; ending at &quot;+m.end() ); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>a found at 0 ending at 1 </li></ul><ul><li>found at 1 ending at 1 </li></ul>
  20. 20. Zero-length Matches contd. <ul><li>Pattern p = Pattern.compile(&quot;a?&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;aaaa&quot;); </li></ul><ul><li>while( m.find() ) </li></ul><ul><li>{ </li></ul><ul><li>System.out.println(m.group()+&quot; found at &quot;+m.start()+&quot; ending at &quot;+m.end() ); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>a found at 0 ending at 1 </li></ul><ul><li>a found at 1 ending at 2 </li></ul><ul><li>a found at 2 ending at 3 </li></ul><ul><li>a found at 3 ending at 4 </li></ul><ul><li>found at 4 ending at 4 </li></ul>
  21. 21. Zero-length Matches contd. <ul><li>Pattern p = Pattern.compile(&quot;a*&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;aaaa&quot;); </li></ul><ul><li>while( m.find() ) </li></ul><ul><li>{ </li></ul><ul><li>System.out.println(m.group()+&quot; found at &quot;+m.start()+&quot; ending at &quot;+m.end() ); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>aaaa found at 0 ending at 4 </li></ul><ul><li>found at 4 ending at 4 </li></ul>
  22. 22. Zero-length Matches contd. <ul><li>Pattern p = Pattern.compile(&quot;a+&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;aaaa&quot;); </li></ul><ul><li>while( m.find() ) </li></ul><ul><li>{ </li></ul><ul><li>System.out.println(m.group()+&quot; found at &quot;+m.start()+&quot; ending at &quot;+m.end() ); </li></ul><ul><li>} </li></ul><ul><li>Output </li></ul><ul><li>aaaa found at 0 ending at 4 </li></ul>
  23. 23. Capturing Groups & Character classes with Quantifiers <ul><li>A quantifier applies only to the single character, character class or group immediately before it. So the regular expression abc+ means </li></ul><ul><li>a, followed by b, followed by c one or more times. </li></ul><ul><li>[abc]+ means a or b or c, one or more times. </li></ul><ul><li>(abc)+ means the group “abc” one or more times. </li></ul><ul><li>Pattern p = Pattern.compile(&quot;(hp)+&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;hphclhpcompaq IBM hp&quot;); </li></ul><ul><li>while( m.find() ) </li></ul><ul><li>{ System.out.println(m.group()+&quot; found at &quot;+m.start()+&quot; ending at &quot;+m.end() ); } </li></ul><ul><li>Output </li></ul><ul><li>hp found at 0 ending at 2 </li></ul><ul><li>hp found at 5 ending at 7 </li></ul><ul><li>hp found at 18 ending at 20 </li></ul>
  24. 24. Capturing Character classes with Quantifiers <ul><li>Pattern p = Pattern.compile(&quot;[hp]+&quot;); </li></ul><ul><li>Matcher m = p.matcher(&quot;hphclhpcompaq IBM hp&quot;); </li></ul><ul><li>while( m.find() ) </li></ul><ul><li>{ System.out.println(m.group()+&quot; found at &quot;+m.start()+&quot; ending at &quot;+m.end() ); } </li></ul><ul><li>Output </li></ul><ul><li>hph found at 0 ending at 3 </li></ul><ul><li>hp found at 5 ending at 7 </li></ul><ul><li>p found at 10 ending at 11 </li></ul><ul><li>hp found at 18 ending at 20 </li></ul>
  25. 25. Difference Among Greedy, Reluctant & Possessive Quantifiers <ul><li>Greedy Quantifier : </li></ul><ul><li>It forces the matcher to read in the entire input string before attempting the first match. </li></ul><ul><li>If the first match attempt (i.e., the entire string) fails, the matcher backs off the input string by one character and tries again. </li></ul><ul><li>This process is repeated until a match is found or there are no more characters left to back off from. </li></ul><ul><li>Depending on the quantifier used in the expression, the last thing it will try matching against is 1 or 0 characters. </li></ul><ul><li>Ex: </li></ul><ul><li>Pattern = .*foo </li></ul><ul><li>Input String = xfooxxxxxxfoo </li></ul><ul><li>Match found at index 0 & ending at index 13. </li></ul>
  26. 26. Difference Among Greedy, Reluctant & Possessive Quantifiers <ul><li>Reluctant Quantifier : </li></ul><ul><li>Starts at the beginning of the input string, then reads one character at a time and looks for a match. </li></ul><ul><li>The last thing it tries is the entire input string. </li></ul><ul><li>Ex: </li></ul><ul><li>Pattern = .*?foo </li></ul><ul><li>Input String = xfooxxxxxxfoo </li></ul><ul><li>Match found at index 0 & ending at index 4. </li></ul><ul><li>Match found at index 4 & ending at index 13. </li></ul>
  27. 27. Difference Among Greedy, Reluctant & Possessive Quantifiers <ul><li>Possessive Quantifier : </li></ul><ul><li>Reads in the entire input string. </li></ul><ul><li>It tries once and only once for a match, and never backs off. </li></ul><ul><li>Ex: </li></ul><ul><li>Pattern = .*+foo </li></ul><ul><li>Input String = xfooxxxxxxfoo </li></ul><ul><li>No match found. </li></ul>
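The three behaviours can be compared directly on the input string used in these slides. The sketch below records each match as a start-end span; QuantifierModes and spans are hypothetical names introduced for the comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {
    // Returns each match as "start-end", where end is one past the last character.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println("greedy     " + spans(".*foo", input));
        System.out.println("reluctant  " + spans(".*?foo", input));
        System.out.println("possessive " + spans(".*+foo", input));
    }
}
```

The possessive form finds nothing because .*+ consumes the whole input and never gives characters back, leaving nothing for foo to match.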
  28. 28. Capturing Groups <ul><li>A way to treat multiple characters as a single unit. </li></ul><ul><li>A group is created by placing the characters to be grouped inside a set of parentheses, such as (890). </li></ul><ul><li>Numbering </li></ul><ul><li>Capturing groups are numbered by counting their opening parentheses from left to right. </li></ul><ul><li>The expression ((A)(B(C))) has 4 groups </li></ul><ul><ul><li>((A)(B(C))) </li></ul></ul><ul><ul><li>(A) </li></ul></ul><ul><ul><li>(B(C)) </li></ul></ul><ul><ul><li>(C) </li></ul></ul>
  29. 29. Counting the Capturing Groups <ul><li>groupCount() : Returns the number of capturing groups present in the matcher’s pattern. </li></ul><ul><li>There is also a special group, group 0. It always represents the entire expression. It’s not included in the total reported by groupCount(). </li></ul><ul><li>public int start(int group) : Returns the start index of the subsequence captured by the given group during the previous match. </li></ul><ul><li>public int end(int group) : Returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match. </li></ul><ul><li>public String group(int group) : Returns the input subsequence captured by the given group during the previous match operation. </li></ul>
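The numbering rule and the group accessors can be seen together by matching ((A)(B(C))) against the string ABC. This is a minimal sketch; GroupDemo and describeGroups are illustrative names, and the "n=text@start-end" output format is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // For a full match, lists every group (including group 0) as "n=text@start-end".
    // Returns an empty list if the input does not match.
    static List<String> describeGroups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            // groupCount() excludes group 0, hence the inclusive upper bound.
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(i + "=" + m.group(i) + "@" + m.start(i) + "-" + m.end(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeGroups("((A)(B(C)))", "ABC")) {
            System.out.println(line);
        }
    }
}
```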
  30. 30. Backreferences <ul><li>The portion of the input string that matches a capturing group is saved in memory for later recall via backreferences. </li></ul><ul><li>A backreference is specified in the regular expression as a backslash (\) followed by a digit indicating the number of the group to be recalled. </li></ul><ul><li>Ex : (\d\d) defines one capturing group matching 2 digits in a row, which can be recalled later in the expression via the backreference \1. </li></ul><ul><li>Example </li></ul><ul><li>Regex : (\d\d)\1 </li></ul><ul><li>String : 1212 </li></ul><ul><li>String found at index 0 & ending at 4. </li></ul><ul><li>Regex : (\d\d)\1 </li></ul><ul><li>String : 1234 </li></ul><ul><li>No match found. </li></ul><ul><li>Note : For nested capturing groups, backreferencing works in exactly the same way: specify a backslash followed by the number of the group to be recalled. </li></ul>
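The two cases above translate into Java as follows. In Java source each regex backslash is doubled, so the pattern is written "(\\d\\d)\\1". BackrefDemo and isRepeatedPair are hypothetical names for this sketch.

```java
import java.util.regex.Pattern;

class BackrefDemo {
    // True if s is two digits followed by the same two digits again, e.g. "1212".
    static boolean isRepeatedPair(String s) {
        // \1 must match exactly the text that group 1 captured.
        return Pattern.matches("(\\d\\d)\\1", s);
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedPair("1212"));
        System.out.println(isRepeatedPair("1234"));
    }
}
```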
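  31. 31. Boundary Matchers <ul><li>Situation : We want to find a word in a file, but only if it appears at the beginning or end of a line. </li></ul><ul><li>Boundary Matcher : Meaning </li></ul><ul><li>^ : Beginning of a line </li></ul><ul><li>$ : End of a line </li></ul><ul><li>\b : A word boundary </li></ul><ul><li>\B : A non-word boundary </li></ul><ul><li>\A : Beginning of the input </li></ul><ul><li>\G : End of the previous match </li></ul><ul><li>\Z : End of the input but for the final terminator, if any </li></ul><ul><li>\z : The end of the input </li></ul>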
  32. 32. Boundary Matchers <ul><li>Regex : ^dog$ </li></ul><ul><ul><li>Input String : dog </li></ul></ul><ul><ul><li>Match found at index 0 ending at index 3. </li></ul></ul><ul><li>Regex : ^dog$ </li></ul><ul><ul><li>Input String : dog preceded by spaces </li></ul></ul><ul><ul><li>No match found </li></ul></ul><ul><li>Regex : \s*dog$ </li></ul><ul><ul><li>Input String : dog preceded by five spaces </li></ul></ul><ul><ul><li>Match found at index 0 ending at index 8. </li></ul></ul><ul><li>Regex : ^dog\w* </li></ul><ul><ul><li>Input String : dogblahblah </li></ul></ul><ul><ul><li>Match found at index 0 ending at index 11. </li></ul></ul>
  33. 33. Boundary Matchers <ul><li>Regex : \bdog\b </li></ul><ul><ul><li>Input String : The dog plays in the yard. </li></ul></ul><ul><ul><li>&quot;dog&quot; found at index 4 ending at index 7. </li></ul></ul><ul><li>Regex : \bdog\b </li></ul><ul><li>Input String : The doggie plays in the yard. </li></ul><ul><ul><li>No match found. </li></ul></ul><ul><li>Regex : \bdog\B </li></ul><ul><ul><li>Input String : The dog plays in the yard. </li></ul></ul><ul><ul><li>No match found. </li></ul></ul><ul><li>Regex : \bdog\B </li></ul><ul><li>Input String : The doggie plays in the yard. </li></ul><ul><ul><li>&quot;dog&quot; found at index 4 ending at index 7. </li></ul></ul>
  34. 34. Boundary Matchers <ul><li>Regex : dog </li></ul><ul><ul><li>Input String : dog dog </li></ul></ul><ul><ul><li>&quot;dog&quot; found at index 0 ending at index 3. </li></ul></ul><ul><ul><li>&quot;dog&quot; found at index 4 ending at index 7. </li></ul></ul><ul><li>Regex : \Gdog </li></ul><ul><li>Input String : dog dog </li></ul><ul><li>&quot;dog&quot; found at index 0 ending at index 3. </li></ul>
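The word-boundary and previous-match examples can be checked with a simple match counter. In the sketch below the boundary matchers are written \b, \B and \G in the regex, which become "\\b", "\\B" and "\\G" in Java string literals. BoundaryDemo and countMatches are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Counts how many times regex is found in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("\\bdog\\b", "The dog plays in the yard."));
        System.out.println(countMatches("\\bdog\\b", "The doggie plays in the yard."));
        System.out.println(countMatches("\\bdog\\B", "The doggie plays in the yard."));
        System.out.println(countMatches("dog", "dog dog"));
        // \G anchors each attempt at the end of the previous match, so the
        // second "dog" (after the space) is not reachable.
        System.out.println(countMatches("\\Gdog", "dog dog"));
    }
}
```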
