Regular expressions

An old - but still very relevant - short course on regular expressions, plus examples on how to use them, and references where to find more.

Presentation Transcript

  • Regular Expressions
    Powerful string validation and extraction
    Ignaz Wanders – Architect @ Archimiddle
    @ignazw
  • Topics
      • What are regular expressions?
      • Patterns
      • Character classes
      • Quantifiers
      • Capturing groups
      • Boundaries
      • Internationalization
      • Regular expressions in Java
      • Quiz
      • References
  • What are regular expressions?
      • A regex is a string pattern used to search and manipulate text
      • A regex has special syntax
      • Very powerful for any type of String manipulation, ranging from simple to very complex structures:
          – Input validation
          – S(ubs)tring replacement
          – ...
      • Example:
        [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
  • History
      • Originates from automata and formal-language theories of computer science
      • Stephen Kleene, 50’s: Kleene algebra
      • Kenneth Thompson, 1969: Unix: qed, ed
      • 70’s - 90’s: Unix: grep, awk, sed, emacs
      • Programming languages:
          – C, Perl
          – JavaScript, Java
  • Patterns
      • Regex is based on pattern matching: Strings are searched for certain patterns
      • Simplest regex is a string-literal pattern
      • Metacharacters: ( [ { \ ^ $ | ) ? * + .
          – Period means “any character”
          – To search for a period as a string literal, escape it with “\”
    REGEX: fox   TEXT: The quick brown fox   RESULT: fox
    REGEX: fo.   TEXT: The quick brown fox   RESULT: fox
    REGEX: .o.   TEXT: The quick brown fox   RESULT: row, fox
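The three lookups on this slide can be reproduced in Java with a small helper. This is a minimal sketch: the class name PatternBasics and the helper findAll are made up for illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class PatternBasics {
    // Collects every non-overlapping match of regex in text, scanning left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        System.out.println(findAll("fox", text));   // string-literal pattern
        System.out.println(findAll("fo.", text));   // period matches any character
        System.out.println(findAll(".o.", text));   // finds "row" and "fox"
        System.out.println(findAll("\\.", "a.b"));  // escaped period matches only "."
    }
}
```

Note that the backslash itself has to be doubled inside a Java string literal, so the regex \. is written "\\." in source code.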
  • Character classes (1/3)
      • Syntax: any characters between [ and ]
      • A character class matches exactly one character
      • Negation: ^
    REGEX: [rcb]at    TEXT: bat   RESULT: bat
    REGEX: [rcb]at    TEXT: rat   RESULT: rat
    REGEX: [rcb]at    TEXT: cat   RESULT: cat
    REGEX: [rcb]at    TEXT: hat   RESULT: -
    REGEX: [^rcb]at   TEXT: rat   RESULT: -
    REGEX: [^rcb]at   TEXT: hat   RESULT: hat
  • Character classes (2/3)
      • Ranges: [a-z], [0-9], [i-n], [a-zA-Z], ...
      • Unions: [0-4[6-8]], [a-p[r-w]], ...
      • Intersections: [a-f&&[efg]], [a-f&&[e-k]], ...
      • Subtractions: [a-f&&[^efg]], ...
    REGEX: [rcb]at[1-5]          TEXT: bat4   RESULT: bat4
    REGEX: [rcb]at[1-5[7-8]]     TEXT: hat7   RESULT: -
    REGEX: [rcb]at[1-7&&[78]]    TEXT: rat7   RESULT: rat7
    REGEX: [rcb]at[1-5&&[^34]]   TEXT: bat4   RESULT: -
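The union, intersection and subtraction syntax is specific to java.util.regex and a few other engines, so it is worth checking on the platform in use. A minimal sketch, assuming a made-up class name CharClassDemo and helper matches:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class CharClassDemo {
    // True when the whole text matches the regex (Pattern.matches anchors at both ends).
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("[rcb]at[1-5]", "bat4"));        // true: simple range
        System.out.println(matches("[rcb]at[1-5[7-8]]", "rat7"));   // true: union of 1-5 and 7-8
        System.out.println(matches("[rcb]at[1-5[7-8]]", "hat7"));   // false: 'h' is not in [rcb]
        System.out.println(matches("[rcb]at[1-7&&[78]]", "rat7"));  // true: the intersection is just 7
        System.out.println(matches("[rcb]at[1-5&&[^34]]", "bat4")); // false: 3 and 4 were subtracted
    }
}
```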
  • Character classes (3/3)
    Predefined character classes            Equivalence
    .     any character
    \d    any digit                         [0-9]
    \D    any non-digit                     [^0-9], [^\d]
    \s    any white-space character         [ \t\n\x0B\f\r]
    \S    any non-white-space character     [^\s]
    \w    any word character                [a-zA-Z_0-9]
    \W    any non-word character            [^\w]
  • Quantifiers (1/5)
      • Quantifiers allow character classes to match more than one character at a time.
    Quantifiers for character class X
    X?       zero or one time
    X*       zero or more times
    X+       one or more times
    X{n}     exactly n times
    X{n,}    at least n times
    X{n,m}   at least n and at most m times
  • Quantifiers (2/5)
      • Examples of X?, X*, X+
    REGEX: “a?”   TEXT: “”      RESULT: “”
    REGEX: “a*”   TEXT: “”      RESULT: “”
    REGEX: “a+”   TEXT: “”      RESULT: -
    REGEX: “a?”   TEXT: “a”     RESULT: “a”
    REGEX: “a*”   TEXT: “a”     RESULT: “a”
    REGEX: “a+”   TEXT: “a”     RESULT: “a”
    REGEX: “a?”   TEXT: “aaa”   RESULT: “a”, “a”, “a”
    REGEX: “a*”   TEXT: “aaa”   RESULT: “aaa”
    REGEX: “a+”   TEXT: “aaa”   RESULT: “aaa”
  • Quantifiers (3/5)
    REGEX: “[abc]{3}”   TEXT: “abccabaaaccbbbc”      RESULT: “abc”, “cab”, “aaa”, “ccb”, “bbc”
    REGEX: “abc{3}”     TEXT: “abccabaaaccbbbc”      RESULT: -
    REGEX: “(dog){3}”   TEXT: “dogdogdogdogdogdog”   RESULT: “dogdogdog”, “dogdogdog”
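The quantifier examples can be checked with a find loop. This is a sketch with a made-up class QuantifierDemo; it also shows one detail the tables leave out: when a pattern can match the empty string (a?, a*), Matcher.find() in Java additionally reports a zero-length match at the end of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class QuantifierDemo {
    // Collects every match that successive find() calls report, including empty ones.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "abccabaaaccbbbc";
        System.out.println(findAll("[abc]{3}", text));  // {3} applies to the class: five triples
        System.out.println(findAll("abc{3}", text));    // {3} applies to 'c' only: needs "abccc"
        System.out.println(findAll("(dog){3}", "dogdogdogdogdogdog")); // {3} applies to the group
        System.out.println(findAll("a?", "aaa").size()); // 4: three "a" plus one empty match at the end
    }
}
```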
  • Quantifiers (4/5)
      • Greedy quantifiers:
          – read complete string
          – work backwards until match found
          – syntax: X?, X*, X+, ...
      • Reluctant quantifiers:
          – read one character at a time
          – work forward until match found
          – syntax: X??, X*?, X+?, ...
      • Possessive quantifiers:
          – read complete string
          – try match only once
          – syntax: X?+, X*+, X++, ...
  • Quantifiers (5/5)
    REGEX: “.*foo”    TEXT: “xfooxxxxxxfoo”   RESULT: “xfooxxxxxxfoo”        (greedy)
    REGEX: “.*?foo”   TEXT: “xfooxxxxxxfoo”   RESULT: “xfoo”, “xxxxxxfoo”    (reluctant)
    REGEX: “.*+foo”   TEXT: “xfooxxxxxxfoo”   RESULT: -                      (possessive)
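The difference between the three quantifier kinds is easiest to see by running the same text through each. A minimal sketch, with a made-up class name GreedinessDemo:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class GreedinessDemo {
    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* first swallows everything, then backs off until "foo" fits.
        System.out.println(findAll(".*foo", text));
        // Reluctant: .*? grows one character at a time and stops at the first "foo".
        System.out.println(findAll(".*?foo", text));
        // Possessive: .*+ swallows everything and never gives any of it back.
        System.out.println(findAll(".*+foo", text));
    }
}
```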
  • Capturing groups (1/2)
      • Capturing groups treat multiple characters as a single unit
      • Syntax: between parentheses ( and )
      • Example: (dog){3}
      • Numbering from left to right, by opening parenthesis
          – Example: ((A)(B(C)))
              • Group 1: ((A)(B(C)))
              • Group 2: (A)
              • Group 3: (B(C))
              • Group 4: (C)
  • Capturing groups (2/2)
      • Backreferences to capturing groups are denoted by \i, with i an integer number
    REGEX: “(\d\d)\1”   TEXT: “1212”   RESULT: “1212”
    REGEX: “(\d\d)\1”   TEXT: “1234”   RESULT: -
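Group numbering and backreferences can be inspected through Matcher.group(int). This is a sketch only; the class GroupDemo and its helper are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class GroupDemo {
    // Returns what the given group captured when the whole text matches, else null.
    // Group 0 always stands for the entire match.
    static String group(String regex, String text, int index) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.matches() ? m.group(index) : null;
    }

    public static void main(String[] args) {
        String nested = "((A)(B(C)))";
        for (int i = 1; i <= 4; i++) {
            System.out.println("Group " + i + ": " + group(nested, "ABC", i));
        }
        // \1 must repeat exactly what group 1 captured, so "1212" matches and "1234" does not.
        System.out.println(group("(\\d\\d)\\1", "1212", 0));
        System.out.println(group("(\\d\\d)\\1", "1234", 0)); // null: no match
    }
}
```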
  • Boundaries (1/2)
    Boundary characters
    ^     beginning of line
    $     end of line
    \b    a word boundary
    \B    a non-word boundary
    \A    beginning of input
    \G    end of previous match
    \z    end of input
    \Z    end of input, but before final terminator, if any
  • Boundaries (2/2)
      • Be aware:
      • End-of-line marker is $
          – Unix EOL is \n
          – Windows EOL is \r\n
          – JDK uses any of the following as EOL:
              • \n, \r\n, \r, \u0085, \u2028, \u2029
      • Always test your regular expressions on the target OS
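In Java, ^ and $ only match at every line when MULTILINE mode is on; by default they refer to the input as a whole. The sketch below switches the mode on with the embedded flag (?m). The class name BoundaryDemo and the sample strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class BoundaryDemo {
    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // \b keeps "fox" from matching inside "foxes" or "firefox".
        System.out.println(findAll("\\bfox\\b", "fox foxes firefox"));
        // Without MULTILINE, ^ and $ anchor to the whole input, so nothing matches here.
        System.out.println(findAll("^\\w+$", "one\ntwo"));
        // With (?m), each line is anchored separately; \r\n and \n both count as EOL.
        System.out.println(findAll("(?m)^\\w+$", "one\r\ntwo\nthree"));
        // \Z tolerates a final line terminator, \z does not.
        System.out.println(findAll("\\w+\\Z", "one\n"));
        System.out.println(findAll("\\w+\\z", "one\n"));
    }
}
```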
  • Internationalization (1/2)
      • Regular expressions were originally designed for the ASCII Basic Latin set of characters.
          – Thus “België” is not matched by ^\w+$
      • Extension to Unicode character sets denoted by \p{...}
      • Character set: [\p{InCharacterSet}]
          – Create character classes from symbols in character sets.
          – “België” is matched by ^[\w[\p{InLatin-1Supplement}]]+$
  • Internationalization (2/2)
      • Note that there are non-letters in character sets as well:
          – Latin-1 Supplement: ¡¢£¤¥¦§¨©ª«-®¯°±²³´µ·¸¹º»¼½¾¿÷
      • Categories:
          – Letters: \p{L}
          – Uppercase letters: \p{Lu}
          – “België” is matched by ^\p{L}+$
      • Other (POSIX) categories:
          – Unicode currency symbols: \p{Sc}
          – ASCII punctuation characters: \p{Punct}
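The “België” examples can be tried out as follows. This is a sketch with a made-up class UnicodeDemo. The last method uses the Pattern.UNICODE_CHARACTER_CLASS flag, which is not mentioned in the slides: it exists since Java 7 and makes \w follow Unicode rules instead of ASCII.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class UnicodeDemo {
    // The word "België", with the last letter escaped to avoid source-encoding issues.
    static final String BELGIE = "Belgi\u00eb";

    // By default \w covers only [a-zA-Z_0-9].
    static boolean asciiWord(String s) {
        return s.matches("^\\w+$");
    }

    // \w extended with every character of the Latin-1 Supplement block.
    static boolean latin1Word(String s) {
        return s.matches("^[\\w\\p{InLatin-1Supplement}]+$");
    }

    // Unicode category L: letters of any script, and nothing else.
    static boolean letters(String s) {
        return s.matches("^\\p{L}+$");
    }

    // Assumption beyond the slides: requires Java 7 or later.
    static boolean unicodeWord(String s) {
        return Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(asciiWord(BELGIE));   // false
        System.out.println(latin1Word(BELGIE));  // true
        System.out.println(letters(BELGIE));     // true
        System.out.println(unicodeWord(BELGIE)); // true
    }
}
```

The category \p{L} is usually the safer choice for names, because a block such as Latin-1 Supplement also contains symbols like £ and ©.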
  • Regular expressions in Java
      • Since JDK 1.4
      • Package java.util.regex
          – Pattern class
          – Matcher class
      • Convenience methods in java.lang.String
      • Alternative for JDK 1.3
          – Jakarta ORO project
  • java.util.regex.Pattern
      • Wrapper class for regular expressions
      • Useful methods:
          – compile(String regex): Pattern
          – matches(String regex, CharSequence text): boolean
          – split(CharSequence text): String[]
    String regex = "(\\d\\d)\\1";
    Pattern p = Pattern.compile(regex);
  • java.util.regex.Matcher
      • Useful methods:
          – matches(): boolean
          – find(): boolean
          – find(int start): boolean
          – group(): String
          – replaceFirst(String replace): String
          – replaceAll(String replace): String
    String regex = "(\\d\\d)\\1";
    Pattern p = Pattern.compile(regex);
    String text = "1212";
    Matcher m = p.matcher(text);
    boolean matches = m.matches();
  • java.lang.String
      • Pattern and Matcher methods in String:
          – matches(String regex): boolean
          – split(String regex): String[]
          – replaceFirst(String regex, String replace): String
          – replaceAll(String regex, String replace): String
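The Pattern, Matcher and String methods listed on these slides can be combined as in the sketch below. The class JavaRegexApiDemo, the date pattern and the sample text are made up for illustration; the method calls themselves are standard JDK API.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class JavaRegexApiDemo {
    // Compile once and reuse: Pattern instances are immutable and thread-safe.
    private static final Pattern ISO_DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // Rewrites every yyyy-mm-dd date as dd/mm/yyyy; $1..$3 refer to the captured groups.
    static String toDayFirst(String text) {
        return ISO_DATE.matcher(text).replaceAll("$3/$2/$1");
    }

    // Returns the first date found at or after index start, or null when there is none.
    static String firstDateFrom(String text, int start) {
        Matcher m = ISO_DATE.matcher(text);
        return m.find(start) ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "From 2024-01-15 to 2024-02-01";
        System.out.println(toDayFirst(text));
        System.out.println(firstDateFrom(text, 6));
        // The String convenience methods compile the regex anew on every call.
        System.out.println("1212".matches("(\\d\\d)\\1"));
        System.out.println(Arrays.toString("a1b22c333".split("\\d+")));
        System.out.println("x-y-z".replaceFirst("-", "+"));
    }
}
```

For a regex that is applied many times, holding on to a compiled Pattern as above avoids the repeated compilation that the String methods perform.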
  • Examples
      • Validation
      • Searching text
      • Filtering
      • Parsing
      • Removing duplicate lines
      • On-the-fly editing
  • Examples: validation
      • Validate an e-mail address:
        [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
      • A URL:
        (http|https|ftp)://([a-zA-Z0-9](\w+\.)+\w{2,7}|local\w*)(:\d+)?(/(\w+[\w/\-.]*)?)?
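The e-mail pattern lists upper-case letters only, so in Java it needs the CASE_INSENSITIVE flag to accept ordinary addresses. A minimal sketch, with a made-up class EmailValidator and made-up sample addresses; it checks the shape of an address in the sense of this slide and is not a complete validator.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class EmailValidator {
    // The slide's pattern, compiled case-insensitively so lower-case input is accepted.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9._%-]+\\.[A-Z0-9._%-]{2,4}",
            Pattern.CASE_INSENSITIVE);

    // matches() requires the whole input to fit the pattern, not just a part of it.
    static boolean isValid(String candidate) {
        return EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("user@example.com"));
        System.out.println(isValid("user@example"));     // no dot plus suffix after the domain
        System.out.println(isValid("not an e-mail"));
    }
}
```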
  • Examples: searching text
      • Write an HttpUnit test to submit an HTML form and check whether the HTTP response is a confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx:
        9[0-9]{6}-[0-9]{6}
    String regexp = "9[0-9]{6}-[0-9]{6}";
    Pattern p = Pattern.compile(regexp);
    Matcher m = p.matcher(text);
    boolean ok = m.find();
    String nr = ok ? m.group() : null;  // group() throws IllegalStateException when find() failed
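The search can be wrapped in a small helper that a test could call on the response body. This is a sketch: the class FormNumberFinder and the sample HTML are made up, and the HttpUnit part of the test is left out.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original slides.
class FormNumberFinder {
    // A 9, six digits, a hyphen, six more digits.
    private static final Pattern FORM_NUMBER = Pattern.compile("9[0-9]{6}-[0-9]{6}");

    // Returns the first form number found in the response text, if any.
    static Optional<String> findFormNumber(String responseBody) {
        Matcher m = FORM_NUMBER.matcher(responseBody);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String body = "<p>Your form number is 9123456-654321.</p>";
        System.out.println(findFormNumber(body).orElse("no form number found"));
    }
}
```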
  • Examples: filtering
      • Filter e-mails whose subject is written in capitals only, optionally with a leading “Re:”
        (R[eE]:)*[^a-z]*$
  • Examples: parsing
      • Matches any pair of opening and closing XML tags:
          – Note the use of the back reference
        <([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
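The back reference \1 forces the closing tag to carry the same name as the opening tag. A minimal sketch, with a made-up class TagParser and made-up sample markup; the CASE_INSENSITIVE flag is added because the pattern lists upper-case letters only. A regex like this does not cope with nested elements of the same name, so it is no substitute for a real XML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class TagParser {
    // Group 1: tag name, group 2: element content; \1 must repeat the tag name.
    private static final Pattern ELEMENT = Pattern.compile(
            "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE);

    // Returns {tag name, content} of the first element found, or null when there is none.
    static String[] firstElement(String markup) {
        Matcher m = ELEMENT.matcher(markup);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = firstElement("<b class=\"x\">bold</b>");
        System.out.println(parts[0] + " -> " + parts[1]);
        System.out.println(firstElement("<a>text</b>")); // null: tag names differ
    }
}
```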
  • Examples: duplicate lines
      • Suppose you want to remove duplicate lines from a text.
          – requirement here is that the lines are sorted alphabetically
        ^(.*)(\r?\n\1)+$
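In Java the pattern needs MULTILINE mode so that ^ and $ apply to every line; the replacement then keeps one copy of each repeated line. A minimal sketch, assuming a made-up class DuplicateLineRemover and made-up sample text:

```java
// Hypothetical demo class, not part of the original slides.
class DuplicateLineRemover {
    // (?m) switches on MULTILINE; group 1 is a line, followed by one or more identical lines.
    private static final String DUPLICATES = "(?m)^(.*)(\\r?\\n\\1)+$";

    // Collapses each run of identical adjacent lines into a single line.
    // Only adjacent duplicates are caught, hence the requirement that the input is sorted.
    static String removeAdjacentDuplicates(String sortedText) {
        return sortedText.replaceAll(DUPLICATES, "$1");
    }

    public static void main(String[] args) {
        String text = "apple\napple\nbanana\ncherry\ncherry\ncherry";
        System.out.println(removeAdjacentDuplicates(text));
    }
}
```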
  • Examples: on-the-fly editing
      • Suppose you want to edit a file in batch: all occurrences of a certain string pattern should be replaced with another string.
      • In Unix: use the sed command with a regex
      • In Java: use string.replaceAll(regex, "mystring")
      • In Ant: use the replaceregexp optional task to, e.g., edit deployment descriptors depending on environment
  • Quiz
      • What are the following regular expressions looking for?
    \d+                            at least one digit
    [-+]?\d+                       any integer
    ((\d*\.?)?\d+|\d+(\.?\d*))     any positive decimal
    [\p{L}][-.\p{L} ]+             a place name
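The quiz answers can be checked against a few inputs. This is a sketch: the class QuizCheck, its method names and the sample values are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original slides.
class QuizCheck {
    private static final Pattern INTEGER = Pattern.compile("[-+]?\\d+");
    private static final Pattern DECIMAL = Pattern.compile("((\\d*\\.?)?\\d+|\\d+(\\.?\\d*))");
    private static final Pattern PLACE = Pattern.compile("[\\p{L}][-.\\p{L} ]+");

    // Each check requires the whole input to match.
    static boolean isInteger(String s) { return INTEGER.matcher(s).matches(); }
    static boolean isDecimal(String s) { return DECIMAL.matcher(s).matches(); }
    static boolean isPlaceName(String s) { return PLACE.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isInteger("-42"));             // optional sign, then digits
        System.out.println(isDecimal(".5"));              // leading digits are optional
        System.out.println(isDecimal("7."));              // so are trailing digits
        System.out.println(isPlaceName("St. Louis"));     // letters, periods, spaces
        System.out.println(isPlaceName("Aix-en-Provence")); // and hyphens
    }
}
```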
  • Conclusion
      • When doing one of the following:
          – validating strings
          – on-the-fly editing of strings
          – searching strings
          – filtering strings
      • think regex!
  • References
      • http://www.regular-expressions.info/
      • http://www.regexlib.com/
      • http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex/
      • http://java.sun.com/docs/books/tutorial/extra/regex/
      • http://www.wellho.net/regex/javare.html
      • JDK 1.4 API
      • Mastering Regular Expressions