This document provides an overview of regular expressions including what they are, their history and usage, common patterns and syntax, and examples of using regular expressions in Java. Regular expressions allow complex searches and text manipulation through special pattern syntax. They are very powerful for tasks like validation, extraction, replacement and more. The document covers topics such as character classes, quantifiers, capturing groups, boundaries, and internationalization considerations.
2. Topics
• What are regular expressions?
• Patterns
• Character classes
• Quantifiers
• Capturing groups
• Boundaries
• Internationalization
• Regular expressions in Java
• Quiz
• References
3. What are regular expressions?
• A regex is a string pattern used to search and manipulate text
• A regex has special syntax
• Very powerful for any type of String manipulation ranging from simple to very
complex structures:
– Input validation
– S(ubs)tring replacement
– ...
• Example:
• [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
4. History
• Originates from automata and formal-language theories of computer science
• Stephen Kleene 50’s: Kleene algebra
• Kenneth Thompson 1969: unix: qed, ed
• 70’s - 90’s: unix: grep, awk, sed, emacs
• Programming languages:
– C, Perl
– JavaScript, Java
5. Patterns
• Regex is based on pattern matching: Strings are searched for certain patterns
• Simplest regex is a string-literal pattern
• Metacharacters: ([{\^$|)?*+.
– Period means “any character”
– To search for a period as a string literal, escape it with a backslash: \.
REGEX: fox
TEXT: The quick brown fox
RESULT: fox
REGEX: fo.
TEXT: The quick brown fox
RESULT: fox
REGEX: .o.
TEXT: The quick brown fox
RESULT: row, fox
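The three searches above can be reproduced in Java with a small helper. This is a minimal sketch: the class name PatternDemo and the helper findAll are illustrative names, not part of the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternDemo {
    // Collects every non-overlapping match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        System.out.println(findAll("fox", text));   // [fox]
        System.out.println(findAll("fo.", text));   // [fox]
        System.out.println(findAll(".o.", text));   // [row, fox]
        // In a Java string literal the escaping backslash must itself be doubled.
        System.out.println(findAll("fo\\.", text)); // []
    }
}
```

The last call shows the escaped period: it only matches a literal ".", which the text does not contain after "fo".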
6. Character classes (1/3)
• Syntax: any characters between [ and ]
• A character class matches exactly one character from the set
• Negation: ^ as the first character inside the brackets
REGEX: [rcb]at
TEXT: bat
RESULT: bat
REGEX: [rcb]at
TEXT: rat
RESULT: rat
REGEX: [rcb]at
TEXT: cat
RESULT: cat
REGEX: [rcb]at
TEXT: hat
RESULT: -
REGEX: [^rcb]at
TEXT: rat
RESULT: -
REGEX: [^rcb]at
TEXT: hat
RESULT: hat
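A short Java sketch of the same checks, assuming whole-string matching is what is wanted. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"bat", "rat", "cat", "hat"}) {
            System.out.println(word
                    + "  [rcb]at: " + matches("[rcb]at", word)
                    + "  [^rcb]at: " + matches("[^rcb]at", word));
        }
    }
}
```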
8. Character classes (3/3)
Predefined class   Meaning                          Equivalent
.                  any character
\d                 any digit                        [0-9]
\D                 any non-digit                    [^0-9], [^\d]
\s                 any white-space character        [ \t\n\x0B\f\r]
\S                 any non-white-space character    [^\s]
\w                 any word character               [a-zA-Z_0-9]
\W                 any non-word character           [^\w]
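In Java source, each backslash of a predefined class has to be doubled inside a string literal. The sketch below uses two hypothetical helpers, digitsOnly and words, to illustrate this.

```java
import java.util.Arrays;

class PredefinedClassDemo {
    // Removes every non-digit character (\D) from the input.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Splits on runs of white space (\s+) after trimming the ends.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("Tel: +32 (0)2 555-12 34"));
        System.out.println(Arrays.toString(words("  the quick\tbrown\nfox ")));
    }
}
```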
9. Quantifiers (1/5)
• Quantifiers allow character classes to match more than one character at a time.
Quantifiers for character classes X
X? zero or one time
X* zero or more times
X+ one or more times
X{n} exactly n times
X{n,} at least n times
X{n,m} at least n and at most m times
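The quantifiers in the table can be tried out with whole-string matches. A minimal sketch; the sample patterns and inputs are chosen for illustration only.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("colou?r", "color"));    // ? : zero or one "u"
        System.out.println(matches("ab*c", "ac"));          // * : zero or more "b"
        System.out.println(matches("ab+c", "ac"));          // + : needs at least one "b"
        System.out.println(matches("\\d{3}", "123"));       // {n} : exactly three digits
        System.out.println(matches("\\d{2,}", "12345"));    // {n,} : two or more digits
        System.out.println(matches("\\d{2,4}", "12345"));   // {n,m} : five digits is too many
    }
}
```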
12. Quantifiers (4/5)
• Greedy quantifiers:
– read complete string
– work backwards until match found
– syntax: X?, X*, X+, ...
• Reluctant quantifiers:
– read one character at a time
– work forward until match found
– syntax: X??, X*?, X+?, ...
• Possessive quantifiers:
– read complete string
– try match only once
– syntax: X?+, X*+, X++, ...
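The difference between the three quantifier types shows up when the same input is searched with .*foo, .*?foo and .*+foo. The sketch below assumes the sample input "xfooxxxxxxfoo"; the class and helper names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedinessDemo {
    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* takes the whole input, then backs off until "foo" fits.
        System.out.println(findAll(".*foo", text));   // [xfooxxxxxxfoo]
        // Reluctant: .*? grows one character at a time until "foo" fits.
        System.out.println(findAll(".*?foo", text));  // [xfoo, xxxxxxfoo]
        // Possessive: .*+ takes the whole input and never gives anything back.
        System.out.println(findAll(".*+foo", text));  // []
    }
}
```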
14. Capturing groups (1/2)
• Capturing groups treat multiple characters as a single unit
• Syntax: between parentheses ( and )
• Example: (dog){3}
• Numbering from left to right
– Example: ((A)(B(C)))
• Group 1: ((A)(B(C)))
• Group 2: (A)
• Group 3: (B(C))
• Group 4: (C)
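Group numbers follow the order of the opening parentheses, and group 0 always stands for the whole match. A small sketch that prints what each group captures for the input "ABC"; the helper groups is a hypothetical name.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    // Returns group 0..n of a whole-string match, or null if there is no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Index 0 is the whole match, indexes 1 to 4 are the capturing groups.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // [ABC, ABC, A, BC, C]
    }
}
```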
15. Capturing groups (2/2)
• Backreferences to capturing groups are written \i, where i is the group number
REGEX: (\d\d)\1
TEXT: 1212
RESULT: 1212
REGEX: (\d\d)\1
TEXT: 1234
RESULT: -
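In Java the backreference example looks as follows. Note the doubled backslashes in the string literal; the method name hasRepeatedPair is illustrative.

```java
class BackreferenceDemo {
    // True if the input is two digits followed by the same two digits again.
    static boolean hasRepeatedPair(String input) {
        return input.matches("(\\d\\d)\\1");
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedPair("1212")); // true
        System.out.println(hasRepeatedPair("1234")); // false
    }
}
```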
16. Boundaries (1/2)
Boundary characters
^     beginning of line
$     end of line
\b    a word boundary
\B    a non-word boundary
\A    beginning of input
\G    end of previous match
\z    end of input
\Z    end of input, but before final terminator, if any
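Word boundaries are easiest to understand by counting matches. A minimal sketch with an invented sample text; the helper count is not from the slides.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Counts the non-overlapping matches of regex in text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "cat concat cats category";
        System.out.println(count("cat", text));         // 4: every occurrence
        System.out.println(count("\\bcat\\b", text));   // 1: only the stand-alone word
        System.out.println(count("\\Bcat\\b", text));   // 1: the ending of "concat"
        System.out.println(count("\\Acat", text));      // 1: only at the beginning of input
    }
}
```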
17. Boundaries (2/2)
• Be aware:
• End-of-line marker is $
– Unix EOL is \n
– Windows EOL is \r\n
– JDK uses any of the following as EOL:
• '\n', '\r\n', '\r', '\u0085', '\u2028', '\u2029'
• Always test your regular expressions on the target OS
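How ^ and $ treat line ends in Java depends on the flags passed to Pattern.compile. The sketch below runs the same regex over a text with mixed Windows and Unix line ends; the sample text and helper name are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineEndDemo {
    // Finds lines that consist of word characters only, under the given flags.
    static List<String> wordLines(String text, int flags) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile("^\\w+$", flags).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "one\r\ntwo\nthree";
        // Default: ^ and $ only apply to the input as a whole.
        System.out.println(wordLines(text, 0));                                        // []
        // MULTILINE: ^ and $ apply to each line, with all JDK line terminators.
        System.out.println(wordLines(text, Pattern.MULTILINE));                        // [one, two, three]
        // UNIX_LINES: only \n ends a line, so the \r stays part of the first line.
        System.out.println(wordLines(text, Pattern.MULTILINE | Pattern.UNIX_LINES));   // [two, three]
    }
}
```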
18. Internationalization (1/2)
• Regular expressions were originally designed for the ASCII (Basic Latin) set of characters.
– Thus “België” is not matched by ^\w+$
• Extension to Unicode character sets is denoted by \p{...}
• Character set: [\p{InCharacterSet}]
– Create character classes from symbols in character sets.
– “België” is matched by ^[\w[\p{InLatin-1Supplement}]]+$
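A sketch of the "België" example in Java, using the default behaviour of \w (no UNICODE_CHARACTER_CLASS flag). The ë is written as \u00eb so the source file encoding does not matter. The \p{L} variant (any Unicode letter) is an addition for comparison and is not from the slides.

```java
class UnicodeDemo {
    // \w alone covers only [a-zA-Z_0-9] by default.
    static boolean asciiWord(String s) {
        return s.matches("^\\w+$");
    }

    // Union of \w and the Latin-1 Supplement block, which contains ë.
    static boolean latin1Word(String s) {
        return s.matches("^[\\w[\\p{InLatin-1Supplement}]]+$");
    }

    // Any sequence of Unicode letters, regardless of block.
    static boolean letters(String s) {
        return s.matches("\\p{L}+");
    }

    public static void main(String[] args) {
        String word = "Belgi\u00eb";
        System.out.println(asciiWord(word));  // false
        System.out.println(latin1Word(word)); // true
        System.out.println(letters(word));    // true
    }
}
```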
20. Regular expressions in Java
• Since JDK 1.4
• Package java.util.regex
– Pattern class
– Matcher class
• Convenience methods in java.lang.String
• Alternative for JDK 1.3
– Jakarta ORO project
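The two styles can be put side by side: a precompiled Pattern with a Matcher, and the convenience methods on String (matches, split, replaceAll). A minimal sketch; the date pattern and the names ISO_DATE and firstYear are made up for the example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexApiDemo {
    // Compile once and reuse: Pattern objects are immutable and thread-safe.
    private static final Pattern ISO_DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // Returns the year of the first date found in the text, or null if none.
    static String firstYear(String text) {
        Matcher m = ISO_DATE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstYear("Released on 2004-02-06."));           // 2004

        // Convenience methods on String compile the regex on every call.
        System.out.println("2004-02-06".matches("\\d{4}-\\d{2}-\\d{2}"));   // true
        System.out.println(Arrays.toString("a, b;c".split("[,;]\\s*")));    // [a, b, c]
        System.out.println("a   b".replaceAll("\\s+", " "));                // a b
    }
}
```

Precompiling pays off when the same expression is applied many times, for example in a loop over input lines.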
25. Examples: validation
• Validate an e-mail address:
[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
• Validate a URL:
(http|https|ftp)://([a-zA-Z0-9](\w+\.)+\w{2,7}|local\w*)(:\d+)?(/(\w+[\w/\-.]*)?)?
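A sketch of e-mail validation in Java with the expression [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}. Because the character classes list capitals only, the sketch assumes the CASE_INSENSITIVE flag is wanted; the names EMAIL and isEmail are illustrative.

```java
import java.util.regex.Pattern;

class EmailValidationDemo {
    // CASE_INSENSITIVE is an assumption: the classes in the regex are upper case only.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9._%-]+\\.[A-Z0-9._%-]{2,4}",
            Pattern.CASE_INSENSITIVE);

    // matches() requires the whole input to fit the pattern, not just a part of it.
    static boolean isEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEmail("john.doe@example.com")); // true
        System.out.println(isEmail("john.doe@example"));     // false: no top-level domain
        System.out.println(isEmail("x@y.museum"));           // false: more than 4 characters after the last period
    }
}
```

The last case shows that this expression is a pragmatic filter rather than a complete check: it rejects some valid addresses.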
26. Examples: searching text
• Write an HttpUnit test that submits an HTML form and checks whether the HTTP response is a confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx:
9[0-9]{6}-[0-9]{6}
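import java.util.regex.Matcher;
import java.util.regex.Pattern;
// regexp holds the expression above; text holds the HTTP response body
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(text);
boolean ok = m.find();
String nr = ok ? m.group() : null; // group() throws IllegalStateException if find() failed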
27. Examples: filtering
• Filter e-mails whose subject contains capitals only, including subjects with a leading “Re:”
(R[eE]:)*[^a-z]*$
28. Examples: parsing
• Match any pair of opening and closing XML tags:
– Note the use of the back reference
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
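A sketch of the tag-matching expression in Java. It assumes CASE_INSENSITIVE so that lower-case tag names match as well, and it only handles simple, non-nested elements; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlTagDemo {
    // \1 forces the closing tag to repeat the name captured by group 1.
    private static final Pattern ELEMENT = Pattern.compile(
            "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE);

    // Returns "name=content" for each element found in the text.
    static List<String> elements(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ELEMENT.matcher(text);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(elements("<b>bold</b> and <i class=\"x\">italic</i>")); // [b=bold, i=italic]
        System.out.println(elements("<b>oops</i>"));                               // []
    }
}
```

For real XML documents a parser is the safer choice; the expression is meant for quick checks on small fragments.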
29. Examples: duplicate lines
• Suppose you want to remove duplicate lines from a text.
– requirement here is that the lines are sorted alphabetically
^(.*)(\r?\n\1)+$
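A sketch of the duplicate-line removal in Java. It assumes the MULTILINE flag, so that ^ and $ apply per line, and replaces each run of identical lines with the first captured line ($1). The method name dedupe is illustrative.

```java
import java.util.regex.Pattern;

class DuplicateLineDemo {
    // Group 1 is a line; the second group repeats "line break + same line" one or more times.
    private static final Pattern DUPLICATES =
            Pattern.compile("^(.*)(\\r?\\n\\1)+$", Pattern.MULTILINE);

    // Collapses runs of identical adjacent lines into one line.
    static String dedupe(String sortedText) {
        return DUPLICATES.matcher(sortedText).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "apple\napple\nbanana\ncherry\ncherry\ncherry";
        System.out.println(dedupe(text));
        // apple
        // banana
        // cherry
    }
}
```

Only adjacent duplicates are removed, which is why the lines have to be sorted first.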
30. Examples: on-the-fly editing
• Suppose you want to edit a file in batch: all occurrences of a certain string pattern should be replaced with another string.
• In Unix: use the sed command with a regex
• In Java: use string.replaceAll(regex, "mystring")
• In Ant: use replaceregexp optional task to, e.g., edit deployment descriptors
depending on environment
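A sketch of the Java variant. The descriptor line, host names and method names are invented for the example. Matcher.quoteReplacement guards against $ or \ in the replacement being read as group references, while the second method uses $1 to $3 on purpose.

```java
import java.util.regex.Matcher;

class ReplaceDemo {
    // Replaces every "localhost:<port>" with the given host and port, taken literally.
    static String retarget(String descriptor, String hostAndPort) {
        return descriptor.replaceAll("localhost:\\d+", Matcher.quoteReplacement(hostAndPort));
    }

    // Rewrites yyyy-mm-dd dates as dd/mm/yyyy by reordering the captured groups.
    static String toDayFirst(String text) {
        return text.replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$3/$2/$1");
    }

    public static void main(String[] args) {
        System.out.println(retarget("url=http://localhost:8080/app", "prod.example.com:443"));
        // url=http://prod.example.com:443/app
        System.out.println(toDayFirst("Deployed on 2004-02-06"));
        // Deployed on 06/02/2004
    }
}
```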
31. Quiz
• What are the following regular expressions looking for?
\d+                          at least one digit
[-+]?\d+                     any integer
((\d*\.?)?\d+|\d+(\.?\d*))   any positive decimal
[\p{L}']['-.\p{L} ]+         a place name
32. Conclusion
• When doing one of the following:
– validating strings
– on-the-fly editing of strings
– searching strings
– filtering strings
• think regex!