Regular expressions

An old, but still very relevant, short course on regular expressions, with examples of how to use them and references for further reading.


Published in: Technology
Transcript

  • 1. Regular Expressions
      Powerful string validation and extraction
      Ignaz Wanders – Architect @ Archimiddle
      @ignazw
  • 2. Topics
      • What are regular expressions?
      • Patterns
      • Character classes
      • Quantifiers
      • Capturing groups
      • Boundaries
      • Internationalization
      • Regular expressions in Java
      • Quiz
      • References
  • 3. What are regular expressions?
      • A regex is a string pattern used to search and manipulate text
      • A regex has a special syntax
      • Very powerful for any type of String manipulation, ranging from simple to very complex structures:
          – Input validation
          – S(ubs)tring replacement
          – ...
      • Example:
          [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
  • 4. History
      • Originates from the automata and formal-language theories of computer science
      • Stephen Kleene, 1950s: Kleene algebra
      • Kenneth Thompson, 1969: Unix: qed, ed
      • 1970s to 1990s: Unix: grep, awk, sed, emacs
      • Programming languages:
          – C, Perl
          – JavaScript, Java
  • 5. Patterns
      • Regex is based on pattern matching: Strings are searched for certain patterns
      • The simplest regex is a string-literal pattern
      • Metacharacters: ( [ { \ ^ $ | ) ? * + .
          – A period means “any character”
          – To search for a period as a string literal, escape it with “\”
      REGEX: fox    TEXT: The quick brown fox    RESULT: fox
      REGEX: fo.    TEXT: The quick brown fox    RESULT: fox
      REGEX: .o.    TEXT: The quick brown fox    RESULT: row, fox
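The three examples on this slide can be reproduced with `java.util.regex`. The following is a minimal sketch; the class name `PatternBasics` and the helper `findAll` are made up for illustration and are not part of the original slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternBasics {
    // Returns every non-overlapping match of regex in text, scanning left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        System.out.println(findAll("fox", text)); // string-literal pattern
        System.out.println(findAll("fo.", text)); // period matches any character
        System.out.println(findAll(".o.", text)); // two matches: row, fox
        // An escaped period only matches a literal '.' (note the doubled backslash in Java source).
        System.out.println(findAll("fox\\.", "fox. foxy"));
    }
}
```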
  • 6. Character classes (1/3)
      • Syntax: any characters between [ and ]
      • A character class matches exactly one character
      • Negation: ^
      REGEX: [rcb]at     TEXT: bat    RESULT: bat
      REGEX: [rcb]at     TEXT: rat    RESULT: rat
      REGEX: [rcb]at     TEXT: cat    RESULT: cat
      REGEX: [rcb]at     TEXT: hat    RESULT: -
      REGEX: [^rcb]at    TEXT: rat    RESULT: -
      REGEX: [^rcb]at    TEXT: hat    RESULT: hat
  • 7. Character classes (2/3)
      • Ranges: [a-z], [0-9], [i-n], [a-zA-Z], ...
      • Unions: [0-4[6-8]], [a-p[r-w]], ...
      • Intersections: [a-f&&[efg]], [a-f&&[e-k]], ...
      • Subtractions: [a-f&&[^efg]], ...
      REGEX: [rcb]at[1-5]           TEXT: bat4    RESULT: bat4
      REGEX: [rcb]at[1-5[7-8]]      TEXT: hat7    RESULT: -
      REGEX: [rcb]at[1-7&&[78]]     TEXT: rat7    RESULT: rat7
      REGEX: [rcb]at[1-5&&[^34]]    TEXT: bat4    RESULT: -
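Unions, intersections and subtractions are Java-flavoured character-class syntax, so they are easiest to try in Java itself. A small sketch follows; `CharacterClassDemo` and `matchesFully` are illustrative names only.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // True when the whole text matches the regex (Pattern.matches requires a full match).
    static boolean matchesFully(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Union: last character must be 1-5 or 7-8.
        System.out.println(matchesFully("[rcb]at[1-5[7-8]]", "bat7"));
        System.out.println(matchesFully("[rcb]at[1-5[7-8]]", "bat6"));
        // Intersection: last character must be in 1-7 and also in {7, 8}, so only 7.
        System.out.println(matchesFully("[rcb]at[1-7&&[78]]", "rat7"));
        System.out.println(matchesFully("[rcb]at[1-7&&[78]]", "rat8"));
        // Subtraction: 1-5 without 3 and 4.
        System.out.println(matchesFully("[rcb]at[1-5&&[^34]]", "bat4"));
        System.out.println(matchesFully("[rcb]at[1-5&&[^34]]", "bat5"));
    }
}
```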
  • 8. Character classes (3/3)
      Predefined class    Meaning                          Equivalent
      .                   any character
      \d                  any digit                        [0-9]
      \D                  any non-digit                    [^0-9], [^\d]
      \s                  any white-space character        [ \t\n\x0B\f\r]
      \S                  any non-white-space character    [^\s]
      \w                  any word character               [a-zA-Z_0-9]
      \W                  any non-word character           [^\w]
  • 9. Quantifiers (1/5)
      • Quantifiers allow character classes to match more than one character at a time.
      Quantifiers for a character class X:
      X?        zero or one time
      X*        zero or more times
      X+        one or more times
      X{n}      exactly n times
      X{n,}     at least n times
      X{n,m}    at least n and at most m times
  • 10. Quantifiers (2/5)
      • Examples of X?, X*, X+
      REGEX: “a?”    TEXT: “”       RESULT: “”
      REGEX: “a*”    TEXT: “”       RESULT: “”
      REGEX: “a+”    TEXT: “”       RESULT: -
      REGEX: “a?”    TEXT: “a”      RESULT: “a”
      REGEX: “a*”    TEXT: “a”      RESULT: “a”
      REGEX: “a+”    TEXT: “a”      RESULT: “a”
      REGEX: “a?”    TEXT: “aaa”    RESULT: “a”, “a”, “a”
      REGEX: “a*”    TEXT: “aaa”    RESULT: “aaa”
      REGEX: “a+”    TEXT: “aaa”    RESULT: “aaa”
  • 11. Quantifiers (3/5)
      REGEX: “[abc]{3}”    TEXT: “abccabaaaccbbbc”       RESULT: “abc”, “cab”, “aaa”, “ccb”, “bbc”
      REGEX: “abc{3}”      TEXT: “abccabaaaccbbbc”       RESULT: -
      REGEX: “(dog){3}”    TEXT: “dogdogdogdogdogdog”    RESULT: “dogdogdog”, “dogdogdog”
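The quantifier examples can be checked with a short Java program. This is a sketch with made-up names (`QuantifierDemo`, `findAll`). One detail the slides leave out: in Java, `a?` and `a*` also report a zero-length match at the very end of the input, so `a?` on "aaa" yields three "a" matches plus one empty match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects all matches, including zero-length ones.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // {3} applies to the character class: any three of a, b, c in a row.
        System.out.println(findAll("[abc]{3}", "abccabaaaccbbbc"));
        // {3} applies only to the preceding 'c': the pattern means "abccc".
        System.out.println(findAll("abc{3}", "abccabaaaccbbbc"));
        // {3} applies to the whole group.
        System.out.println(findAll("(dog){3}", "dogdogdogdogdogdog"));
        // Zero-or-one: three single matches plus a trailing empty match.
        System.out.println(findAll("a?", "aaa").size());
    }
}
```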
  • 12. Quantifiers (4/5)
      • Greedy quantifiers:
          – read complete string
          – work backwards until match found
          – syntax: X?, X*, X+, ...
      • Reluctant quantifiers:
          – read one character at a time
          – work forward until match found
          – syntax: X??, X*?, X+?, ...
      • Possessive quantifiers:
          – read complete string
          – try match only once
          – syntax: X?+, X*+, X++, ...
  • 13. Quantifiers (5/5)
      greedy        REGEX: “.*foo”     TEXT: “xfooxxxxxxfoo”    RESULT: “xfooxxxxxxfoo”
      reluctant     REGEX: “.*?foo”    TEXT: “xfooxxxxxxfoo”    RESULT: “xfoo”, “xxxxxxfoo”
      possessive    REGEX: “.*+foo”    TEXT: “xfooxxxxxxfoo”    RESULT: -
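The difference between the three quantifier modes shows up when the same input is run through all of them. A minimal sketch, with `GreedinessDemo` and `findAll` as illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedinessDemo {
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then gives characters back until "foo" fits.
        System.out.println(findAll(".*foo", text));
        // Reluctant: .*? takes as little as possible, so two shorter matches are found.
        System.out.println(findAll(".*?foo", text));
        // Possessive: .*+ takes everything and never gives anything back, so "foo" cannot match.
        System.out.println(findAll(".*+foo", text));
    }
}
```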
  • 14. Capturing groups (1/2)
      • Capturing groups treat multiple characters as a single unit
      • Syntax: between parentheses ( and )
      • Example: (dog){3}
      • Numbering from left to right
          – Example: ((A)(B(C)))
              • Group 1: ((A)(B(C)))
              • Group 2: (A)
              • Group 3: (B(C))
              • Group 4: (C)
  • 15. Capturing groups (2/2)
      • Backreferences to capturing groups are denoted by \i, where i is the group number
      REGEX: “(\d\d)\1”    TEXT: “1212”    RESULT: “1212”
      REGEX: “(\d\d)\1”    TEXT: “1234”    RESULT: -
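Group numbering and backreferences can both be inspected through `Matcher.group(int)`. The sketch below uses made-up names (`GroupDemo`, `groups`); group 0 is always the entire match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns groups 0..groupCount() for a full match, or null when the text does not match.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Groups are numbered by the position of their opening parenthesis.
        System.out.println(String.join(", ", groups("((A)(B(C)))", "ABC")));
        // The backreference must repeat exactly what group 1 captured.
        System.out.println(groups("(\\d\\d)\\1", "1212") != null);
        System.out.println(groups("(\\d\\d)\\1", "1234") != null);
    }
}
```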
  • 16. Boundaries (1/2)
      Boundary characters:
      ^     beginning of line
      $     end of line
      \b    a word boundary
      \B    a non-word boundary
      \A    beginning of input
      \G    end of previous match
      \z    end of input
      \Z    end of input, but before final terminator, if any
  • 17. Boundaries (2/2)
      • Be aware:
      • End-of-line marker is $
          – Unix EOL is \n
          – Windows EOL is \r\n
          – JDK uses any of the following as EOL:
              • \n, \r\n, \u0085, \u2028, \u2029
      • Always test your regular expressions on the target OS
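A short Java sketch of how the boundary matchers behave. One point the slides do not spell out: in Java, ^ and $ match at every line only when `Pattern.MULTILINE` is set; by default they match at the start and end of the whole input. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Counts the matches of regex in text, compiled with the given Pattern flags (0 for none).
    static int count(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Word boundaries: the "cat" inside "concat" is not counted.
        System.out.println(count("\\bcat\\b", 0, "cat concat cat."));
        // Without MULTILINE the two-line input is not a single line of word characters.
        System.out.println(count("^\\w+$", 0, "one\ntwo"));
        // With MULTILINE each line is matched, for Unix and Windows line endings alike.
        System.out.println(count("^\\w+$", Pattern.MULTILINE, "one\ntwo"));
        System.out.println(count("^\\w+$", Pattern.MULTILINE, "one\r\ntwo"));
    }
}
```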
  • 18. Internationalization (1/2)
      • Regular expressions were originally designed for the ASCII Basic Latin set of characters.
          – Thus “België” is not matched by ^\w+$
      • Extension to Unicode character sets denoted by \p{...}
      • Character set: [\p{InCharacterSet}]
          – Create character classes from symbols in character sets.
          – “België” is matched by ^[\w[\p{InLatin-1Supplement}]]+$
  • 19. Internationalization (2/2)
      • Note that there are non-letters in character sets as well:
          – Latin-1 Supplement: ¡¢£¤¥¦§¨©ª«-®¯°±²³´µ·¸¹º»¼½¾¿÷
      • Categories:
          – Letters: \p{L}
          – Uppercase letters: \p{Lu}
          – “België” is matched by ^\p{L}+$
      • Other (POSIX) categories:
          – Unicode currency symbols: \p{Sc}
          – ASCII punctuation characters: \p{Punct}
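The “België” example can be tried out as follows. This is a sketch only: the class and method names are invented, the ë is written as a Unicode escape so the result does not depend on source-file encoding, and the `Pattern.UNICODE_CHARACTER_CLASS` flag is an addition that is not mentioned in the slides and requires Java 7 or later.

```java
import java.util.regex.Pattern;

class UnicodeDemo {
    // By default \w only covers [a-zA-Z_0-9].
    static boolean isAsciiWord(String s) {
        return s.matches("^\\w+$");
    }

    // \p{L} matches a letter in any script.
    static boolean isLetters(String s) {
        return s.matches("^\\p{L}+$");
    }

    // Assumption: Java 7+, where this flag makes \w Unicode-aware.
    static boolean isUnicodeWord(String s) {
        return Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        String belgie = "Belgi\u00eb";
        System.out.println(isAsciiWord(belgie));
        System.out.println(isLetters(belgie));
        System.out.println(isUnicodeWord(belgie));
        // \p{Sc} matches currency symbols such as the euro sign.
        System.out.println("\u20ac".matches("\\p{Sc}"));
    }
}
```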
  • 20. Regular expressions in Java
      • Since JDK 1.4
      • Package java.util.regex
          – Pattern class
          – Matcher class
      • Convenience methods in java.lang.String
      • Alternative for JDK 1.3
          – Jakarta ORO project
  • 21. java.util.regex.Pattern
      • Wrapper class for regular expressions
      • Useful methods:
          – compile(String regex): Pattern
          – matches(String regex, CharSequence text): boolean
          – split(CharSequence text): String[]
      String regex = "(\\d\\d)\\1";
      Pattern p = Pattern.compile(regex);
  • 22. java.util.regex.Matcher
      • Useful methods:
          – matches(): boolean
          – find(): boolean
          – find(int start): boolean
          – group(): String
          – replaceFirst(String replace): String
          – replaceAll(String replace): String
      String regex = "(\\d\\d)\\1";
      Pattern p = Pattern.compile(regex);
      String text = "1212";
      Matcher m = p.matcher(text);
      boolean matches = m.matches();
  • 23. java.lang.String
      • Pattern and Matcher methods in String:
          – matches(String regex): boolean
          – split(String regex): String[]
          – replaceFirst(String regex, String replace): String
          – replaceAll(String regex, String replace): String
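The String convenience methods and a precompiled Pattern do the same work; the String methods are shorter to write, while a Pattern compiled once is usually the better choice when the same regex is applied many times. A sketch with hypothetical names and sample data (the date format and the `StringRegexDemo` class are not from the slides):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class StringRegexDemo {
    // Compiled once and reused for every call.
    private static final Pattern ISO_DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // Rewrites yyyy-mm-dd dates as dd/mm/yyyy; $1..$3 refer to the captured groups.
    static String toDayFirst(String text) {
        return ISO_DATE.matcher(text).replaceAll("$3/$2/$1");
    }

    // String.split takes a regex; here any run of digits acts as the separator.
    static String[] splitOnDigits(String text) {
        return text.split("\\d+");
    }

    public static void main(String[] args) {
        System.out.println(toDayFirst("Due 2024-01-15."));
        System.out.println(Arrays.toString(splitOnDigits("a1b22c333")));
        System.out.println("1212".matches("(\\d\\d)\\1"));
    }
}
```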
  • 24. Examples
      • Validation
      • Searching text
      • Filtering
      • Parsing
      • Removing duplicate lines
      • On-the-fly editing
  • 25. Examples: validation
      • Validate an e-mail address:
          [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
      • A URL:
          (http|https|ftp)://([a-zA-Z0-9](\w+\.)+\w{2,7}|local\w*)(:\d+)?(/(\w+[\w/\-.]*)?)?
  • 26. Examples: searching text
      • Write an HttpUnit test to submit an HTML form and check whether the HTTP response is a confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx:
          9[0-9]{6}-[0-9]{6}
      Pattern p = Pattern.compile(regexp);
      Matcher m = p.matcher(text);
      boolean ok = m.find();
      String nr = ok ? m.group() : null;  // group() throws if find() did not succeed
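The search can be wrapped in a small helper that makes the "not found" case explicit. This is a sketch under assumptions: `FormNumberFinder`, `findFormNumber` and the sample response text are invented, `Optional` requires Java 8 or later, and fetching the response through HttpUnit is left out.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FormNumberFinder {
    private static final Pattern FORM_NUMBER = Pattern.compile("9[0-9]{6}-[0-9]{6}");

    // Returns the first form number found in the response body, if any.
    static Optional<String> findFormNumber(String responseBody) {
        Matcher m = FORM_NUMBER.matcher(responseBody);
        // group() may only be called after a successful find().
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String body = "<p>Your form number is 9123456-654321.</p>";
        System.out.println(findFormNumber(body).orElse("no form number found"));
    }
}
```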
  • 27. Examples: filtering
      • Filter e-mail whose subject is in capitals only, possibly with a leading “Re:”
          (R[eE]:)*[^a-z]*$
  • 28. Examples: parsing
      • Matches any opening XML tag together with its closing tag:
          – Note the use of the back reference
          <([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
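A sketch of the tag-matching regex in Java. The `CASE_INSENSITIVE` flag is an assumption added here so that lower-case tag names match the [A-Z] classes; the names `TagMatcher` and `elements` are illustrative. The regex does not handle nested elements with the same name, so this is not a substitute for an XML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMatcher {
    private static final Pattern ELEMENT =
        Pattern.compile("<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>", Pattern.CASE_INSENSITIVE);

    // Returns "tag=content" for each element whose closing tag repeats the opening tag name.
    static List<String> elements(String markup) {
        List<String> found = new ArrayList<>();
        Matcher m = ELEMENT.matcher(markup);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // The <i> element is closed by </b>, so the backreference rejects it.
        System.out.println(elements("<b class=\"x\">bold</b> and <i>oops</b>"));
    }
}
```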
  • 29. Examples: duplicate lines
      • Suppose you want to remove duplicate lines from a text.
          – The requirement here is that the lines are sorted alphabetically, so that duplicates are adjacent
          ^(.*)(\r?\n\1)+$
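In Java the duplicate-line regex needs `Pattern.MULTILINE` so that ^ and $ apply to each line; replacing every match with group 1 keeps a single copy of each repeated line. A minimal sketch with invented names and sample data:

```java
import java.util.regex.Pattern;

class DuplicateLineRemover {
    private static final Pattern REPEATED_LINE =
        Pattern.compile("^(.*)(\\r?\\n\\1)+$", Pattern.MULTILINE);

    // Collapses runs of identical adjacent lines into one line.
    static String removeAdjacentDuplicates(String sortedText) {
        return REPEATED_LINE.matcher(sortedText).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "apple\napple\nbanana\ncherry\ncherry\ncherry";
        System.out.println(removeAdjacentDuplicates(text));
    }
}
```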
  • 30. Examples: on-the-fly editing
      • Suppose you want to edit a file in batch: all occurrences of a certain string pattern should be replaced with another string.
      • In Unix: use the sed command with a regex
      • In Java: use string.replaceAll(regex, "mystring")
      • In Ant: use the replaceregexp optional task to, e.g., edit deployment descriptors depending on environment
  • 31. Quiz
      • What are the following regular expressions looking for?
      \d+                            at least one digit
      [-+]?\d+                       any integer
      ((\d*\.?)?\d+|\d+(\.?\d*))     any positive decimal
      [\p{L}][-.\p{L} ]+             a place name
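The quiz answers can be checked against a few sample strings. The method names and sample values below are made up for illustration; `String.matches` requires the whole string to match.

```java
class QuizCheck {
    static boolean isInteger(String s) {
        return s.matches("[-+]?\\d+");
    }

    static boolean isPositiveDecimal(String s) {
        return s.matches("((\\d*\\.?)?\\d+|\\d+(\\.?\\d*))");
    }

    // A letter followed by one or more letters, spaces, periods or hyphens.
    static boolean isPlaceName(String s) {
        return s.matches("[\\p{L}][-.\\p{L} ]+");
    }

    public static void main(String[] args) {
        System.out.println(isInteger("-17"));
        System.out.println(isPositiveDecimal(".5"));
        System.out.println(isPositiveDecimal("5."));
        System.out.println(isPlaceName("Sint-Niklaas"));
        System.out.println(isPlaceName("R2D2"));
    }
}
```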
  • 32. Conclusion
      • When doing one of the following:
          – validating strings
          – on-the-fly editing of strings
          – searching strings
          – filtering strings
      • think regex!
  • 33. References
      • http://www.regular-expressions.info/
      • http://www.regexlib.com/
      • http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex/
      • http://java.sun.com/docs/books/tutorial/extra/regex/
      • http://www.wellho.net/regex/javare.html
      • JDK 1.4 API
      • Mastering Regular Expressions
