Regular expressions

An old - but still very relevant - short course on regular expressions, plus examples on how to use them, and references where to find more.

Transcript

  • 1. Regular Expressions
      Powerful string validation and extraction
      Ignaz Wanders – Architect @ Archimiddle, @ignazw
  • 2. Topics
      • What are regular expressions?
      • Patterns
      • Character classes
      • Quantifiers
      • Capturing groups
      • Boundaries
      • Internationalization
      • Regular expressions in Java
      • Quiz
      • References
  • 3. What are regular expressions?
      • A regex is a string pattern used to search and manipulate text
      • A regex has its own special syntax
      • Very powerful for any type of string manipulation, ranging from simple to very complex structures:
          – Input validation
          – S(ubs)tring replacement
          – ...
      • Example:
          [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
  • 4. History
      • Originates from the automata and formal-language theories of computer science
      • Stephen Kleene, 1950s: Kleene algebra
      • Kenneth Thompson, 1969: Unix: qed, ed
      • 1970s to 1990s: Unix: grep, awk, sed, emacs
      • Programming languages:
          – C, Perl
          – JavaScript, Java
  • 5. Patterns
      • Regex is based on pattern matching: strings are searched for certain patterns
      • The simplest regex is a string-literal pattern
      • Metacharacters: ( [ { \ ^ $ | ) ? * + .
          – A period means “any character”
          – To search for a period as a string literal, escape it with “\”
      REGEX: fox    TEXT: The quick brown fox    RESULT: fox
      REGEX: fo.    TEXT: The quick brown fox    RESULT: fox
      REGEX: .o.    TEXT: The quick brown fox    RESULT: row, fox
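The three searches on this slide can be reproduced in Java with a small helper that collects every match. This is a minimal sketch: the class name PatternBasics and the helper findAll are illustrative names, not part of the slides. Note that a backslash has to be doubled inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; the name and the helper are not from the slides.
class PatternBasics {

    // Collects every non-overlapping match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        System.out.println(findAll("fox", text));   // string-literal pattern
        System.out.println(findAll("fo.", text));   // period matches any character
        System.out.println(findAll(".o.", text));   // two hits: "row" and "fox"
        System.out.println(findAll("\\.", "a.b"));  // escaped period is a literal "."
    }
}
```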
  • 6. Character classes (1/3)
      • Syntax: any characters between [ and ]
      • A character class matches exactly one character
      • Negation: ^
      REGEX: [rcb]at     TEXT: bat    RESULT: bat
      REGEX: [rcb]at     TEXT: rat    RESULT: rat
      REGEX: [rcb]at     TEXT: cat    RESULT: cat
      REGEX: [rcb]at     TEXT: hat    RESULT: -
      REGEX: [^rcb]at    TEXT: rat    RESULT: -
      REGEX: [^rcb]at    TEXT: hat    RESULT: hat
  • 7. Character classes (2/3)
      • Ranges: [a-z], [0-9], [i-n], [a-zA-Z], ...
      • Unions: [0-4[6-8]], [a-p[r-w]], ...
      • Intersections: [a-f&&[efg]], [a-f&&[e-k]], ...
      • Subtractions: [a-f&&[^efg]], ...
      REGEX: [rcb]at[1-5]           TEXT: bat4    RESULT: bat4
      REGEX: [rcb]at[1-5[7-8]]      TEXT: hat7    RESULT: -
      REGEX: [rcb]at[1-7&&[78]]     TEXT: rat7    RESULT: rat7
      REGEX: [rcb]at[1-5&&[^34]]    TEXT: bat4    RESULT: -
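The negation, union, intersection and subtraction forms from the two slides above can be checked against java.util.regex as follows. A minimal sketch: the class name CharClassDemo and its matches helper are illustrative, and Pattern.matches tests the whole text rather than searching inside it.

```java
import java.util.regex.Pattern;

// Illustrative class; not from the slides.
class CharClassDemo {

    // True when the entire text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("[^rcb]at", "hat"));             // negation
        System.out.println(matches("[rcb]at[1-5[7-8]]", "rat7"));   // union of 1-5 and 7-8
        System.out.println(matches("[rcb]at[1-7&&[78]]", "rat7"));  // intersection: only 7
        System.out.println(matches("[rcb]at[1-5&&[^34]]", "bat4")); // subtraction: 1-5 without 3 and 4
    }
}
```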
  • 8. Character classes (3/3)
      Predefined character classes and their equivalences:
      Class    Meaning                          Equivalence
      .        any character
      \d       any digit                        [0-9]
      \D       any non-digit                    [^0-9], [^\d]
      \s       any white-space character        [ \t\n\x0B\f\r]
      \S       any non-white-space character    [^\s]
      \w       any word character               [a-zA-Z_0-9]
      \W       any non-word character           [^\w]
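As a rough illustration of the predefined classes, the sketch below sorts a single character into one of the categories from the table. The class name PredefinedClasses and the method classify are hypothetical. By default java.util.regex treats \d, \s and \w as ASCII-only, which is why a letter such as ë falls through to "other".

```java
// Illustrative class; not from the slides.
class PredefinedClasses {

    // Classifies one character using the predefined classes \d, \s and \w.
    static String classify(char c) {
        String s = String.valueOf(c);
        if (s.matches("\\d")) return "digit";       // checked before \w, which also covers digits
        if (s.matches("\\s")) return "whitespace";
        if (s.matches("\\w")) return "word";
        return "other";
    }

    public static void main(String[] args) {
        for (char c : new char[] {'7', '\t', '_', 'a', '-', '\u00EB'}) {
            System.out.println((int) c + " -> " + classify(c));
        }
    }
}
```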
  • 9. Quantifiers (1/5)
      • Quantifiers allow character classes to match more than one character at a time.
      Quantifiers for a character class X:
      X?        zero or one time
      X*        zero or more times
      X+        one or more times
      X{n}      exactly n times
      X{n,}     at least n times
      X{n,m}    at least n and at most m times
  • 10. Quantifiers (2/5)
      • Examples of X?, X*, X+ (a RESULT of “” denotes a zero-length match)
      REGEX: “a?”    TEXT: “”       RESULT: “”
      REGEX: “a*”    TEXT: “”       RESULT: “”
      REGEX: “a+”    TEXT: “”       RESULT: -
      REGEX: “a?”    TEXT: “a”      RESULT: “a”
      REGEX: “a*”    TEXT: “a”      RESULT: “a”
      REGEX: “a+”    TEXT: “a”      RESULT: “a”
      REGEX: “a?”    TEXT: “aaa”    RESULT: “a”, “a”, “a”
      REGEX: “a*”    TEXT: “aaa”    RESULT: “aaa”
      REGEX: “a+”    TEXT: “aaa”    RESULT: “aaa”
  • 11. Quantifiers (3/5)
      REGEX: “[abc]{3}”    TEXT: “abccabaaaccbbbc”       RESULT: “abc”, “cab”, “aaa”, “ccb”, “bbc”
      REGEX: “abc{3}”      TEXT: “abccabaaaccbbbc”       RESULT: -
      REGEX: “(dog){3}”    TEXT: “dogdogdogdogdogdog”    RESULT: “dogdogdog”, “dogdogdog”
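The quantifier examples above can be tried out with a loop over Matcher.find(). In this sketch the class name QuantifierDemo and the helper allMatches are illustrative. One detail the slides leave out: when a pattern can match the empty string (a? or a*), find() also reports a zero-length match at the very end of the text, so the list for “a?” on “aaa” ends with an empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; not from the slides.
class QuantifierDemo {

    // Lists every match that successive find() calls report, including zero-length ones.
    static List<String> allMatches(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("[abc]{3}", "abccabaaaccbbbc")); // any three of a, b, c
        System.out.println(allMatches("abc{3}", "abccabaaaccbbbc"));   // {3} applies to "c" only
        System.out.println(allMatches("(dog){3}", "dogdogdogdogdogdog")); // {3} applies to the group
        System.out.println(allMatches("a?", "aaa").size());            // three "a" plus one empty match
    }
}
```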
  • 12. Quantifiers (4/5)
      • Greedy quantifiers:
          – read the complete string
          – work backwards until a match is found
          – syntax: X?, X*, X+, ...
      • Reluctant quantifiers:
          – read one character at a time
          – work forward until a match is found
          – syntax: X??, X*?, X+?, ...
      • Possessive quantifiers:
          – read the complete string
          – try to match only once, without backing off
          – syntax: X?+, X*+, X++, ...
  • 13. Quantifiers (5/5)
      REGEX: “.*foo”     TEXT: “xfooxxxxxxfoo”    RESULT: “xfooxxxxxxfoo”        (greedy)
      REGEX: “.*?foo”    TEXT: “xfooxxxxxxfoo”    RESULT: “xfoo”, “xxxxxxfoo”    (reluctant)
      REGEX: “.*+foo”    TEXT: “xfooxxxxxxfoo”    RESULT: -                      (possessive)
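The difference between the three quantifier flavours shows up already in the first match. A minimal sketch, with the illustrative names GreedinessDemo and firstMatch; the helper returns null when nothing matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; not from the slides.
class GreedinessDemo {

    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        System.out.println(firstMatch(".*foo", text));  // greedy: takes as much as possible
        System.out.println(firstMatch(".*?foo", text)); // reluctant: stops at the first "foo"
        System.out.println(firstMatch(".*+foo", text)); // possessive: .*+ keeps everything, so "foo" cannot match
    }
}
```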
  • 14. Capturing groups (1/2)
      • Capturing groups treat multiple characters as a single unit
      • Syntax: between parentheses ( and )
      • Example: (dog){3}
      • Groups are numbered by their opening parenthesis, from left to right
          – Example: ((A)(B(C)))
              • Group 1: ((A)(B(C)))
              • Group 2: (A)
              • Group 3: (B(C))
              • Group 4: (C)
  • 15. Capturing groups (2/2)
      • Backreferences to capturing groups are denoted by \i, with i an integer group number
      REGEX: “(\d\d)\1”    TEXT: “1212”    RESULT: “1212”
      REGEX: “(\d\d)\1”    TEXT: “1234”    RESULT: -
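Group numbering and backreferences can be inspected through Matcher.group(int). The sketch below uses the illustrative names GroupDemo and groups; index 0 of the returned array is the whole match, and indexes 1 to 4 follow the numbering from the slide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; not from the slides.
class GroupDemo {

    // Returns group 0 (the whole match) followed by every capturing group,
    // or an empty array when the text does not match as a whole.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", groups("((A)(B(C)))", "ABC")));
        // \1 must repeat exactly what group 1 captured.
        System.out.println(Pattern.matches("(\\d\\d)\\1", "1212"));
        System.out.println(Pattern.matches("(\\d\\d)\\1", "1234"));
    }
}
```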
  • 16. Boundaries (1/2)
      Boundary characters:
      ^     beginning of line
      $     end of line
      \b    a word boundary
      \B    a non-word boundary
      \A    beginning of input
      \G    end of previous match
      \z    end of input
      \Z    end of input, but before the final terminator, if any
  • 17. Boundaries (2/2)
      • Be aware:
      • The end-of-line marker is $
          – Unix EOL is \n
          – Windows EOL is \r\n
          – The JDK uses any of the following as EOL:
              • \n, \r\n, \r, \u0085, \u2028, \u2029
      • Always test your regular expressions on the target OS
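A short sketch of the boundary matchers in java.util.regex, with the illustrative names BoundaryDemo, countWord and digitLines. It assumes the Pattern.MULTILINE flag, which makes ^ and $ match at every line rather than only at the start and end of the input, and it mixes Unix and Windows line endings on purpose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; not from the slides.
class BoundaryDemo {

    // Counts occurrences of word as a whole word, using \b on both sides.
    static int countWord(String word, String text) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Returns the lines that consist of digits only.
    static List<String> digitLines(String text) {
        List<String> lines = new ArrayList<>();
        Matcher m = Pattern.compile("^\\d+$", Pattern.MULTILINE).matcher(text);
        while (m.find()) {
            lines.add(m.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(countWord("cat", "cat concat cat's category"));
        System.out.println(digitLines("12\nab\r\n345\n6x"));
    }
}
```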
  • 18. Internationalization (1/2)
      • Regular expressions were originally designed for the ASCII (Basic Latin) set of characters.
          – Thus “België” is not matched by ^\w+$
      • The extension to Unicode character sets is denoted by \p{...}
      • Character set: [\p{InCharacterSet}]
          – Create character classes from symbols in character sets.
          – “België” is matched by ^[\w|[\p{InLatin-1Supplement}]]+$ (inside a character class the | is a literal, so this class also admits a pipe character)
  • 19. Internationalization (2/2)
      • Note that there are non-letters in character sets as well:
          – Latin-1 Supplement: ¡¢£¤¥¦§¨©ª«-®¯°±²³´µ·¸¹º»¼½¾¿÷
      • Categories:
          – Letters: \p{L}
          – Uppercase letters: \p{Lu}
          – “België” is matched by ^\p{L}+$
      • Other (POSIX) categories:
          – Unicode currency symbols: \p{Sc}
          – ASCII punctuation characters: \p{Punct}
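The “België” example can be verified with a few lines of Java. A minimal sketch with the illustrative class name UnicodeDemo; the ë is written as the escape \u00EB so that the source file encoding does not matter, and \w is assumed to be in its default ASCII-only mode.

```java
// Illustrative class; not from the slides.
class UnicodeDemo {

    // "België", with ë written as a Unicode escape.
    static final String BELGIE = "Belgi\u00EB";

    // True when s consists only of ASCII word characters.
    static boolean isAsciiWord(String s) {
        return s.matches("\\w+");
    }

    // True when s consists only of letters from any script.
    static boolean isLetters(String s) {
        return s.matches("\\p{L}+");
    }

    // True when s is a single Unicode currency symbol.
    static boolean isCurrencySymbol(String s) {
        return s.matches("\\p{Sc}");
    }

    public static void main(String[] args) {
        System.out.println(isAsciiWord(BELGIE));        // ë is outside [a-zA-Z_0-9]
        System.out.println(isLetters(BELGIE));          // ë is a letter in category L
        System.out.println(isCurrencySymbol("\u20AC")); // euro sign
    }
}
```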
  • 20. Regular expressions in Java
      • Since JDK 1.4
      • Package java.util.regex
          – Pattern class
          – Matcher class
      • Convenience methods in java.lang.String
      • Alternative for JDK 1.3
          – Jakarta ORO project
  • 21. java.util.regex.Pattern
      • Wrapper class for regular expressions
      • Useful methods:
          – compile(String regex): Pattern
          – matches(String regex, CharSequence text): boolean
          – split(CharSequence text): String[]
      String regex = "(\\d\\d)\\1";
      Pattern p = Pattern.compile(regex);
  • 22. java.util.regex.Matcher
      • Useful methods:
          – matches(): boolean
          – find(): boolean
          – find(int start): boolean
          – group(): String
          – replaceFirst(String replace): String
          – replaceAll(String replace): String
      String regex = "(\\d\\d)\\1";
      Pattern p = Pattern.compile(regex);
      String text = "1212";
      Matcher m = p.matcher(text);
      boolean matches = m.matches();
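A compiled Pattern can be reused for many inputs, and each call to matcher() gives a fresh Matcher. The sketch below contrasts matches() (whole input) with find() (anywhere in the input) and shows replaceAll and Pattern.split. The class name PatternMatcherDemo, its constants and its method names are illustrative.

```java
import java.util.regex.Pattern;

// Illustrative class; not from the slides.
class PatternMatcherDemo {

    // Compiled once and reused; Pattern instances are immutable.
    private static final Pattern REPEATED_PAIR = Pattern.compile("(\\d\\d)\\1");
    private static final Pattern COMMA = Pattern.compile(",\\s*");

    // matches(): the whole text must be a repeated digit pair.
    static boolean isRepeatedPair(String text) {
        return REPEATED_PAIR.matcher(text).matches();
    }

    // find(): a repeated digit pair may occur anywhere in the text.
    static boolean containsRepeatedPair(String text) {
        return REPEATED_PAIR.matcher(text).find();
    }

    // replaceAll(): every repeated digit pair is replaced by "####".
    static String maskRepeatedPairs(String text) {
        return REPEATED_PAIR.matcher(text).replaceAll("####");
    }

    // split(): cuts the text at commas, swallowing whitespace after each comma.
    static String[] splitOnCommas(String text) {
        return COMMA.split(text);
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedPair("1212"));
        System.out.println(containsRepeatedPair("x1212y"));
        System.out.println(maskRepeatedPairs("a1212b3434"));
        System.out.println(String.join("|", splitOnCommas("a, b,c")));
    }
}
```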
  • 23. java.lang.String
      • Pattern and Matcher methods in String:
          – matches(String regex): boolean
          – split(String regex): String[]
          – replaceFirst(String regex, String replace): String
          – replaceAll(String regex, String replace): String
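For one-off operations the String convenience methods avoid handling Pattern and Matcher directly. A small sketch with hypothetical helper names; the date format and the sample strings are made up for illustration.

```java
// Illustrative class; not from the slides.
class StringRegexDemo {

    // String.matches: the whole string must look like yyyy-mm-dd (digits only, no range check).
    static boolean looksLikeIsoDate(String s) {
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    // String.split: fields separated by commas with optional surrounding whitespace.
    static String[] fields(String s) {
        return s.split("\\s*,\\s*");
    }

    // String.replaceFirst: only the first run of digits is replaced.
    static String maskFirstNumber(String s) {
        return s.replaceFirst("\\d+", "#");
    }

    // String.replaceAll: every run of whitespace becomes a single space.
    static String normalizeWhitespace(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIsoDate("2024-01-15"));
        System.out.println(String.join("|", fields("a , b,c")));
        System.out.println(maskFirstNumber("id 42 of 99"));
        System.out.println(normalizeWhitespace("  a \t b\n"));
    }
}
```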
  • 24. Examples
      • Validation
      • Searching text
      • Filtering
      • Parsing
      • Removing duplicate lines
      • On-the-fly editing
  • 25. Examples: validation
      • Validate an e-mail address:
          [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
      • A URL:
          (http|https|ftp)://([a-zA-Z0-9](\w+\.)+\w{2,7}|local\w*)(:\d+)?(/(\w+[\w/\-.]*)?)?
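The two validation patterns could be wrapped in Java as sketched below. Several things here are assumptions: the backslashes in the patterns were lost in the transcript and are restored by inference, the e-mail pattern is compiled with CASE_INSENSITIVE because its classes only list A-Z, and the class and method names are illustrative. Neither pattern is a complete e-mail or URL validator.

```java
import java.util.regex.Pattern;

// Illustrative class; patterns reconstructed from the slide (backslashes are an assumption).
class Validators {

    private static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9._%-]+\\.[A-Z0-9._%-]{2,4}",
            Pattern.CASE_INSENSITIVE);

    private static final Pattern URL = Pattern.compile(
            "(http|https|ftp)://([a-zA-Z0-9](\\w+\\.)+\\w{2,7}|local\\w*)(:\\d+)?(/(\\w+[\\w/\\-.]*)?)?");

    // Validation means the whole input must match, hence matches() rather than find().
    static boolean isEmail(String s) {
        return EMAIL.matcher(s).matches();
    }

    static boolean isUrl(String s) {
        return URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEmail("john.doe@example.com"));
        System.out.println(isEmail("john.doe@example"));
        System.out.println(isUrl("http://www.example.com/index.html"));
        System.out.println(isUrl("http://localhost:8080/"));
    }
}
```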
  • 26. Examples: searching text
      • Write an HttpUnit test to submit an HTML form and check whether the HTTP response is a confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx:
      String regexp = "9[0-9]{6}-[0-9]{6}";
      Pattern p = Pattern.compile(regexp);
      Matcher m = p.matcher(text);
      boolean ok = m.find();
      String nr = ok ? m.group() : null;  // group() throws if no match was found
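The search step can be isolated from the HTTP part into a small method. This sketch leaves HttpUnit out entirely and works on a plain response string; the class name FormNumberFinder, the method name and the sample text are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; not from the slides.
class FormNumberFinder {

    // A 9, six more digits, a hyphen, then six digits.
    private static final Pattern FORM_NUMBER = Pattern.compile("9[0-9]{6}-[0-9]{6}");

    // Returns the first form number in the response body, if there is one.
    static Optional<String> findFormNumber(String responseBody) {
        Matcher m = FORM_NUMBER.matcher(responseBody);
        // group() may only be called after a successful find().
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(findFormNumber("Your form number is 9123456-654321. Thank you."));
        System.out.println(findFormNumber("Something went wrong."));
    }
}
```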
  • 27. Examples: filtering
      • Filter e-mail whose subject is in capitals only, possibly with a leading “Re:”
          (R[eE]:)*[^a-z]*$
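Applied in Java, this filter only behaves as intended when the whole subject has to match. With find() the pattern would succeed on any subject, because both starred parts can match nothing just before the end. A minimal sketch with the illustrative names SubjectFilter and isShouting:

```java
import java.util.regex.Pattern;

// Illustrative class; not from the slides.
class SubjectFilter {

    // Any number of "Re:" or "RE:" prefixes, then no lowercase ASCII letters up to the end.
    private static final Pattern CAPITALS_ONLY = Pattern.compile("(R[eE]:)*[^a-z]*$");

    // matches() anchors the pattern at the start of the subject as well.
    static boolean isShouting(String subject) {
        return CAPITALS_ONLY.matcher(subject).matches();
    }

    public static void main(String[] args) {
        System.out.println(isShouting("RE: URGENT MEETING"));
        System.out.println(isShouting("Re: hello"));
    }
}
```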
  • 28. Examples: parsing
      • Matches any opening and closing XML tag:
          – Note the use of the backreference
          <([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
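A sketch of how the tag pattern might be used to pull out element names and their content. It assumes the CASE_INSENSITIVE flag, since the pattern only lists uppercase letters, and the names TagParser and elements are illustrative. A regex like this does not handle nested elements of the same name and is no substitute for a real XML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class; not from the slides.
class TagParser {

    // Group 1 is the tag name, group 2 the content; \1 forces the closing tag to repeat the name.
    private static final Pattern ELEMENT = Pattern.compile(
            "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>", Pattern.CASE_INSENSITIVE);

    // Returns one "name=content" entry per element found.
    static List<String> elements(String markup) {
        List<String> found = new ArrayList<>();
        Matcher m = ELEMENT.matcher(markup);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(elements("<b>bold</b> and <i class=\"x\">italic</i>"));
    }
}
```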
  • 29. Examples: duplicate lines
      • Suppose you want to remove duplicate lines from a text.
          – The requirement here is that the lines are sorted alphabetically
          ^(.*)(\r?\n\1)+$
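In Java the duplicate-line pattern could be applied as below. The sketch assumes the MULTILINE flag, so that ^ and $ work per line, and replaces each run of identical lines with its first line through the group reference $1. The names DuplicateLines and removeAdjacentDuplicates are illustrative; only adjacent duplicates are removed, which is why the slide asks for sorted input.

```java
import java.util.regex.Pattern;

// Illustrative class; not from the slides.
class DuplicateLines {

    // A line, followed by one or more line breaks each followed by the same line again.
    private static final Pattern REPEATED_LINE =
            Pattern.compile("^(.*)(\\r?\\n\\1)+$", Pattern.MULTILINE);

    // Collapses every run of identical adjacent lines into a single line.
    static String removeAdjacentDuplicates(String text) {
        return REPEATED_LINE.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String sorted = "apple\napple\nbanana\ncherry\ncherry\ncherry";
        System.out.println(removeAdjacentDuplicates(sorted));
    }
}
```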
  • 30. Examples: on-the-fly editing
      • Suppose you want to edit a file in batch: all occurrences of a certain string pattern should be replaced with another string.
      • In Unix: use the sed command with a regex
      • In Java: use string.replaceAll(regex, "mystring")
      • In Ant: use the optional replaceregexp task to, e.g., edit deployment descriptors depending on the environment
  • 31. Quiz
      • What are the following regular expressions looking for?
      \d+                           at least one digit
      [-+]?\d+                      any integer
      ((\d*\.?)?\d+|\d+(\.?\d*))    any positive decimal
      [\p{L}][-.\p{L} ]+            a place name
  • 32. Conclusion
      • When doing one of the following:
          – validating strings
          – on-the-fly editing of strings
          – searching strings
          – filtering strings
      • think regex!
  • 33. References
      • http://www.regular-expressions.info/
      • http://www.regexlib.com/
      • http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex/
      • http://java.sun.com/docs/books/tutorial/extra/regex/
      • http://www.wellho.net/regex/javare.html
      • JDK 1.4 API
      • Mastering Regular Expressions