  1. 1. Regular Expressions for Beginners Srikanth Modegunta
  2. 2. Introduction
     • Also referred to as Regex or RegExp.
     • Used to match patterns in text.
       − Ex: “maven” and “maeven” can both be matched with the regex “mae?ven”.
     • Regular expressions are processed by a piece of software called a “regular expression engine”.
     • Most languages support regex.
       − Ex: Perl, Java, C# etc.
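The “mae?ven” example can be tried out with a minimal sketch like the one below. It uses Python's `re` module, which is an assumption made for illustration (the slides do not prescribe a language), and the third test word is a made-up counterexample.

```python
import re

# "e?" makes the second "e" optional, so both spellings match.
pattern = re.compile(r"mae?ven")

# "maeeven" is a hypothetical counterexample with one "e" too many.
words = ["maven", "maeven", "maeeven"]
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['maven', 'maeven']
```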
  3. 3. Introduction (Contd..)
     • Used wherever text processing is required.
     • Matching XML or HTML tags is a typical use, as it is based on pattern matching.
       − We will see how to match an XML or HTML tag.
     • Automation of tasks
       − Ex: if a mail subject contains “<operation> <some task name> <command>”, then start processing the task.
     • Text editors can update the comments on functions automatically (replacing a pattern with some text).
       − Ex: replace “sub subroutine(parameters){<statements>}” with “/* this is a sample subroutine */ sub subroutine(parameters){<statements>}”
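The comment-insertion idea from the last bullet could look roughly like this in Python's `re` module (the language, the sample source text and the exact pattern are assumptions; the slide only describes the before and after shapes).

```python
import re

# Hypothetical Perl-like source text; only its shape follows the slide.
source = "sub greet(name){ print name; }"

# Matches "sub <name>(<parameters>){", i.e. the header of a subroutine.
header = re.compile(r"\bsub\s+\w+\s*\([^)]*\)\s*\{")

# \g<0> in the replacement stands for the whole matched text,
# so the comment is placed in front of the unchanged header.
commented = header.sub(r"/* this is a sample subroutine */ \g<0>", source)
print(commented)
```

This sketch only recognises the header of the subroutine; matching the whole body with nested braces is beyond what a simple pattern like this can do.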
  4. 4. Meta Characters
     The following are the meta characters:
     \ | ( ) [ { ^ $ * + ? .
  5. 5. Meta Characters (Contd..)
     Character   Meaning
     *           0 or more
     +           1 or more
     ?           0 or 1 (optional)
     .           Any character except new-line
     ^           Start of line. But [^abc] means any character other than 'a', 'b' or 'c'
     $           End of line
     \A          Start of string
     \Z          End of string
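The difference between the line anchors and the string anchors shows up on multi-line text. The sketch below uses Python's `re` module (an assumed choice of language) and a made-up two-line string; note that in Python `^` only behaves as "start of line" when the multiline flag is set.

```python
import re

text = "first line\nsecond line"

# By default ^ anchors only at the very start of the string.
default_start = re.findall(r"^\w+", text)                  # ['first']
# In multiline mode ^ also matches right after every newline.
multiline_start = re.findall(r"^\w+", text, re.MULTILINE)  # ['first', 'second']
# \A and \Z ignore multiline mode and always refer to the whole string.
string_start = re.findall(r"\A\w+", text, re.MULTILINE)    # ['first']
string_end = re.findall(r"\w+\Z", text, re.MULTILINE)      # ['line']
# '.' stops at a newline, so ".+" yields one match per line.
per_line = re.findall(r".+", text)                         # ['first line', 'second line']
```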
  6. 6. Meta Characters (Contd..)
     Character   Meaning
     { }         Used when the number of repetitions is known. Ex: a{2,5} matches 'a' repeated a minimum of 2 times and a maximum of 5 times.
     |           Says 'or' in patterns. Ex: cat|dog|mouse
     ( )         Used to capture groups
     [ ]         Exactly one character from the set
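A short sketch of the braces, alternation and set constructs, written with Python's `re` module (an assumption; the sample sentences are invented for illustration):

```python
import re

# a{2,5}: between 2 and 5 consecutive 'a' characters.
repeat = re.compile(r"a{2,5}")
too_few = repeat.fullmatch("a")        # None (only 1)
in_range = repeat.fullmatch("aaa")     # matches (3 is within 2..5)
too_many = repeat.fullmatch("aaaaaa")  # None (6 is above the maximum)

# cat|dog|mouse: any one of the alternatives.
pets = re.findall(r"cat|dog|mouse", "a cat chased a mouse past the dog")
print(pets)  # ['cat', 'mouse', 'dog']

# [aeiou]: exactly one character from the set per match.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']
```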
  7. 7. Quantifiers
     • Quantifiers specify the quantity.
       − Ex: in “ear” and “eaaaar” the quantity of 'a' is 1 and 4 respectively.
     • If a pattern is repeated, we need quantifiers to match the repeated pattern.
     • To match the above case we use the following regex:
       − ea+r means 'a' can occur 1 or more times.
  8. 8. Quantifiers (Contd..)
     *       0 or more times (hungry matching). Ex: ca* matches c, ca, caa, caaa etc. It matches even if the 'a' is absent, and otherwise takes as many consecutive 'a's as it can.
     +       1 or more times (hungry matching). Ex: ca+ matches ca, caa, caaa etc.
     {n}     Exactly n times. Ex: ca{4}r matches caaaar.
     {m,}    At least m times, with no upper limit (hungry matching). Ex: ca{2,}r matches only if 'a' repeats 2 or more times.
     {m,n}   Minimum m times and maximum n times (hungry matching). Ex: ca{2,3}r matches only if 'a' repeats 2 or 3 times.
     Hungry matching (also called greedy matching) means the pattern matches the maximum possible text.
     Ex: for ca{0,4} the text “caaaa” matches in full, i.e. all 4 'a's are matched.
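The hungry (greedy) behaviour can be observed directly. This is a minimal sketch in Python's `re` module, chosen here as an assumption; the inputs are the ones used on the slide plus a few short variations.

```python
import re

text = "caaaa"

# Greedy quantifiers take as many 'a's as the pattern allows.
star = re.match(r"ca*", text).group()           # 'caaaa'
bounded = re.match(r"ca{0,4}", text).group()    # 'caaaa'
capped = re.match(r"ca{2,3}", text).group()     # 'caaa' (stops at the maximum of 3)
# '*' also accepts zero repetitions.
no_a = re.match(r"ca*", "c").group()            # 'c'

# {n} and {m,} checked against whole words.
exact = re.fullmatch(r"ca{4}r", "caaaar")       # matches
at_least_two = re.fullmatch(r"ca{2,}r", "caar") # matches (exactly 2 is enough)
only_one = re.fullmatch(r"ca{2,}r", "car")      # None
```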
  9. 9. Quantifiers (Contd..)
     *?      Lazy matching: matches 0 or more times but stops at the first possible match. Ex: if the text is “caaaaaa” then “ca*?” matches only 'c'.
     +?      Lazy matching: matches 1 or more times but stops at the first possible match. Ex: if the text is “caaaaaa” then “ca+?” matches only 'ca'.
     ??      Lazy matching: matches 0 or 1 times but stops at the first possible match. Ex: if the text is “ca” then “ca??” matches only 'c'.
     {min,}?  {n}?  {min,max}?   Lazy matching for the counted quantifiers
     Lazy matching means the pattern matches the minimum possible text.
     Ex: for ca{0,4}? the text “caaaa” matches only “c”.
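The lazy variants can be compared with the same inputs. The sketch below assumes Python's `re` module; the last line is an added illustration showing that a lazy quantifier still expands when the rest of the pattern requires it.

```python
import re

text = "caaaaaa"

lazy_star = re.match(r"ca*?", text).group()          # 'c'
lazy_plus = re.match(r"ca+?", text).group()          # 'ca'
lazy_optional = re.match(r"ca??", "ca").group()      # 'c'
lazy_range = re.match(r"ca{0,4}?", "caaaa").group()  # 'c'

# Lazy does not mean "always minimal": the 'r' forces a*? to grow.
forced = re.match(r"ca*?r", "caaar").group()         # 'caaar'
```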
  10. 10. Character Sets
      • A character set matches one character from the set of characters.
      • [abcd] is the same as [a-d]
      • [a-di-l] is the same as [abcdijkl]
      • [^abcd] matches any character other than a, b, c, d
      • Quantifiers can be applied to character sets.
        − [a-z]+ matches the string 'hello' in 'hello1234E'
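The set rules above can be checked with a few one-liners. This sketch assumes Python's `re` module, and the alphabet strings are invented test inputs.

```python
import re

# Two ranges in one set: a-d and i-l.
ranges = re.findall(r"[a-di-l]", "abcdefghijklm")
print(ranges)  # ['a', 'b', 'c', 'd', 'i', 'j', 'k', 'l']

# A leading ^ negates the set.
negated = re.findall(r"[^abcd]", "abcdef")
print(negated)  # ['e', 'f']

# A quantifier applied to a set, as in the slide's example.
word = re.search(r"[a-z]+", "hello1234E").group()
print(word)  # 'hello'
```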
  11. 11. Characters for Matching
      Common character class shorthands
      Pattern                       Shorthand
      [a-zA-Z0-9_]                  \w
      [0-9]                         \d
      [ \t\n\r]                     \s
      [^a-zA-Z0-9_]                 \W
      [^0-9]                        \D
      [^ \t\n\r]                    \S
      Word boundary                 \b
      Other than a word boundary    \B
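A brief sketch of the shorthands in use, assuming Python's `re` module and made-up sample strings. One caveat for this particular engine: in Python 3, `\w`, `\d` and `\s` are Unicode-aware by default and only reduce to the ASCII sets in the table when the `re.ASCII` flag is passed.

```python
import re

text = "room 101, floor 7"

digits = re.findall(r"\d+", text)            # ['101', '7']
words = re.findall(r"\w+", text)             # ['room', '101', 'floor', '7']
pieces = re.split(r"\s+", "a  b\tc\nd")      # ['a', 'b', 'c', 'd']

# \b requires a word boundary, so "cat" inside "concat" is skipped.
whole = re.findall(r"\bcat\b", "cat concat cat.")  # ['cat', 'cat']
# \B requires the absence of a boundary, so only the "cat" in "concat" matches.
inner = re.findall(r"\Bcat", "cat concat")         # ['cat']
```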
  12. 12. Simple Matching
      • Rules for the mail id:
        − Mail id should not start with a number or special symbols
        − Mail id can start with _
        − Mail id can have '.' in the middle
        − Should end with
      • Pattern:
        − [a-zA-Z_][a-zA-Z_.]+@\w+\.(com|
        − Meta characters must be escaped in the pattern to match them as normal characters
  13. 13. Modifiers
      Modifier   Meaning
      i          Case insensitive
      g          Global matching (in Perl)
      m          Multiline matching
      s          Dot all ('.' matches \n also)
      x          Extended regex pattern (pretty format; ref: Perl)
      e          Used when replacing a string: evaluate the replacement as an expression (ref: Perl)
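The modifiers above are Perl-style letters. As a rough sketch of the same ideas in Python's `re` module (an assumed target language): i, s and x map to flags, while g and e have no flag and are approximated here with `findall` and a replacement function. The sample strings are invented.

```python
import re

# i: case-insensitive matching; findall plays the role of Perl's /g.
any_case = re.findall(r"regex", "Regex REGEX regex", re.IGNORECASE)

# s: with DOTALL the dot also matches a newline.
without_dotall = re.search(r"a.b", "a\nb")           # None
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)   # matches

# x: VERBOSE allows whitespace and comments inside the pattern.
year_month = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.VERBOSE)
parts = year_month.match("2024-05").groups()         # ('2024', '05')

# e: a function computes the replacement, similar in spirit to Perl's /e.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples, 10 pears")
print(doubled)  # 6 apples, 20 pears
```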
  14. 14. Grouping
      • Groups can be captured using parentheses
        − (<pattern>)
        − Saves the text matched by the group into a back reference (we will see it later)
      • Groups capture part of the text in the matching pattern
        − Ex: take the simple XML element <root>test</root>
        − <(\w+)>.*?</\1>
        − Here \1 is the back reference
      • Java has a “group(int)” method in the “java.util.regex.Matcher” class.
  15. 15. Grouping Example
      • If the command is
        − /sbin/service <service-name> <command>
        − ([^\s]+)\s+([\w-]+)\s+(start|stop|status)
        − Group 0 = the whole matched text
        − Group 1 = “/sbin/service”
        − Group 2 = <service-name>
        − Group 3 = <command>
        − Command can be start, stop or status
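The group numbering can be seen by running the pattern on a concrete command line. This sketch assumes Python's `re` module, whose `group(n)` plays the same role as the Java method mentioned earlier; the service name `httpd` is a hypothetical value chosen for illustration.

```python
import re

pattern = re.compile(r"([^\s]+)\s+([\w-]+)\s+(start|stop|status)")

# "httpd" is a made-up service name.
m = pattern.match("/sbin/service httpd start")

whole = m.group(0)      # '/sbin/service httpd start'
executable = m.group(1) # '/sbin/service'
service = m.group(2)    # 'httpd'
command = m.group(3)    # 'start'

# A command outside start|stop|status does not match at all.
rejected = pattern.match("/sbin/service httpd reload")  # None
```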
  16. 16. Back References
      • A back reference stores the part of the string matched by the part of the regular expression inside the parentheses.
      • If a string occurs multiple times in the input, we can use a back reference to identify the repeated match.
      • Ex: an XML/HTML start-tag should have a matching end-tag.
      • If we capture the start-tag name in the first group, we can write the end-tag name as a back reference (\1).
  17. 17. Back references example
      • For example take the XML tag
        − <root id="E12">test</root>
        − <([\w-]+)\s*([^<>]+)?>\w+</\1> matches the XML element
        − Group 0: <root id="E12">test</root>
        − Group 1: root
        − Group 2: id="E12"
        − \1 in the regex pattern is the back reference to group 1.
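A small sketch of the back reference at work, assuming Python's `re` module. The second input, with a mismatched end-tag, is an invented counterexample.

```python
import re

pattern = re.compile(r"<([\w-]+)\s*([^<>]+)?>\w+</\1>")

m = pattern.search('<root id="E12">test</root>')
element = m.group(0)     # '<root id="E12">test</root>'
tag_name = m.group(1)    # 'root'
attributes = m.group(2)  # 'id="E12"'

# \1 must repeat exactly what group 1 captured, so a different end-tag fails.
mismatched = pattern.search('<root id="E12">test</item>')  # None
```

A pattern like this only handles a flat element with word-character content; nested or mixed content generally calls for a real XML parser.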
  18. 18. No grouping with parenthesis
      • If groups are not required for the parenthesized patterns
        − Use ?: inside the group: (?:)
        − (text1|text2|text3) matches any one of text1, text2 and text3 and captures it as a group
        − (?:text1|text2|text3) matches the same text but does not create a group
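The effect on group numbering can be sketched as follows, assuming Python's `re` module. The `-(\d+)` suffix and the input `text2-42` are hypothetical additions so that there is a second group to compare.

```python
import re

# Capturing alternation: the alternative becomes group 1.
capturing = re.search(r"(text1|text2|text3)-(\d+)", "text2-42").groups()
print(capturing)  # ('text2', '42')

# Non-capturing alternation: only the digits are captured, now as group 1.
non_capturing = re.search(r"(?:text1|text2|text3)-(\d+)", "text2-42").groups()
print(non_capturing)  # ('42',)
```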
  19. 19. Look ahead and Look behind
      • Positive look-ahead
        − \w+(?=:) does not select all words; it selects only the words that come directly before ':'
      • Negative look-ahead
        − \b\w+\b(?!:) selects words other than those coming directly before ':' (without the \b word boundaries, \w+(?!:) would backtrack and match part of such a word, e.g. “ke” in “key:”)
      • With look-ahead, once the main pattern matches, the regex engine looks ahead for the filtering pattern, which is not included in the match.
      • Positive look-behind
        − (?<=a)b selects a 'b' that follows 'a'
      • Negative look-behind
        − (?<!a)b selects a 'b' that doesn't follow 'a'
      • With look-behind, the regex engine looks behind the current position for the filtering pattern.
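The four look-around forms can be compared on small inputs. This is a sketch assuming Python's `re` module, with invented sample strings; it also shows what happens to a negative look-ahead when the word boundaries are left out.

```python
import re

text = "name: Bob age: 42"

# Positive look-ahead: words directly followed by ':' (the ':' is not consumed).
before_colon = re.findall(r"\w+(?=:)", text)         # ['name', 'age']

# Negative look-ahead with word boundaries: whole words not followed by ':'.
not_before_colon = re.findall(r"\b\w+\b(?!:)", text) # ['Bob', '42']

# Without \b the engine backtracks and returns fragments of the keys.
fragments = re.findall(r"\w+(?!:)", text)            # ['nam', 'Bob', 'ag', '42']

sample = "ab cb ab"
# Positive look-behind: positions of 'b' preceded by 'a'.
after_a = [m.start() for m in re.finditer(r"(?<=a)b", sample)]      # [1, 7]
# Negative look-behind: positions of 'b' not preceded by 'a'.
not_after_a = [m.start() for m in re.finditer(r"(?<!a)b", sample)]  # [4]
```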
  21. 21. Thank You