Regular Expressions
Powerful string validation and extraction
Ignaz Wanders – Architect @ Archimiddle
@ignazw
Topics
• What are regular expressions?
• Patterns
• Character classes
• Quantifiers
• Capturing groups
• Boundaries
• Internationalization
• Regular expressions in Java
• Quiz
• References
What are regular expressions?
• A regex is a string pattern used to search and manipulate text
• A regex has special syntax
• Very powerful for any type of String manipulation ranging from simple to very
complex structures:
– Input validation
– S(ubs)tring replacement
– ...
• Example:
• [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
History
• Originates from automata and formal-language theories of computer science
• Stephen Kleene, 1950s: Kleene algebra
• Kenneth Thompson, 1969: Unix: qed, ed
• 1970s to 1990s: Unix: grep, awk, sed, emacs
• Programming languages:
– C, Perl
– JavaScript, Java
Patterns
• Regex is based on pattern matching: Strings are searched for certain patterns
• Simplest regex is a string-literal pattern
• Metacharacters: ([{\^$|)?*+.
– Period means “any character”
– To search for a period as a string literal, escape it with “\”
REGEX: fox
TEXT: The quick brown fox
RESULT: fox
REGEX: fo.
TEXT: The quick brown fox
RESULT: fox
REGEX: .o.
TEXT: The quick brown fox
RESULT: row, fox
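The three searches above can be reproduced with java.util.regex. This is a minimal sketch: the class name PatternBasics and the findAll helper are illustrative assumptions, not part of the original slides. The last call adds an escaped period, which in Java source is written with a doubled backslash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex is the real API.
final class PatternBasics {

    // Collects every non-overlapping match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        System.out.println(findAll("fox", text)); // string-literal pattern
        System.out.println(findAll("fo.", text)); // period matches any character
        System.out.println(findAll(".o.", text)); // finds "row" and "fox"
        // An escaped period matches only a literal period.
        System.out.println(findAll("e\\.g\\.", "regexes, e.g. fo."));
    }
}
```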
Character classes (1/3)
• Syntax: any characters between [ and ]
• A character class matches exactly one character
• Negation: ^ as the first character inside the brackets
REGEX: [rcb]at
TEXT: bat
RESULT: bat
REGEX: [rcb]at
TEXT: rat
RESULT: rat
REGEX: [rcb]at
TEXT: cat
RESULT: cat
REGEX: [rcb]at
TEXT: hat
RESULT: -
REGEX: [^rcb]at
TEXT: rat
RESULT: -
REGEX: [^rcb]at
TEXT: hat
RESULT: hat
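A short Java sketch of the same checks, assuming whole-word matching with Pattern.matches. The class name SimpleClasses and its helper are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the [rcb]at and [^rcb]at examples.
final class SimpleClasses {

    // True when the whole word matches the regex.
    static boolean matches(String regex, String word) {
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"bat", "rat", "cat", "hat"}) {
            System.out.println(word
                    + "  [rcb]at: " + matches("[rcb]at", word)
                    + "  [^rcb]at: " + matches("[^rcb]at", word));
        }
    }
}
```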
Character classes (2/3)
• Ranges: [a-z], [0-9], [i-n], [a-zA-Z]...
• Unions: [0-4[6-8]], [a-p[r-w]], ...
• Intersections: [a-f&&[efg]], [a-f&&[e-k]], ...
• Subtractions: [a-f&&[^efg]], ...
REGEX: [rcb]at[1-5]
TEXT: bat4
RESULT: bat4
REGEX: [rcb]at[1-5[7-8]]
TEXT: hat7
RESULT: -
REGEX: [rcb]at[1-7&&[78]]
TEXT: rat7
RESULT: rat7
REGEX: [rcb]at[1-5&&[^34]]
TEXT: bat4
RESULT: -
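Unions, intersections and subtractions are Java-specific class syntax, so they are easiest to check in Java itself. The sketch below uses a hypothetical ClassOperations class and a few extra inputs that are not on the slide.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for union, intersection and subtraction.
final class ClassOperations {

    // True when the whole text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Union: digits 1-5 or 7-8.
        System.out.println(matches("[rcb]at[1-5[7-8]]", "bat7"));   // true
        System.out.println(matches("[rcb]at[1-5[7-8]]", "bat6"));   // false
        // Intersection: 1-7 and {7, 8} leaves only 7.
        System.out.println(matches("[rcb]at[1-7&&[78]]", "rat7"));  // true
        System.out.println(matches("[rcb]at[1-7&&[78]]", "rat8"));  // false
        // Subtraction: 1-5 without 3 and 4.
        System.out.println(matches("[rcb]at[1-5&&[^34]]", "bat4")); // false
        System.out.println(matches("[rcb]at[1-5&&[^34]]", "bat5")); // true
    }
}
```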
Character classes (3/3)
Predefined class   Meaning                          Equivalent
.                  any character
\d                 any digit                        [0-9]
\D                 any non-digit                    [^0-9], [^\d]
\s                 any white-space character        [ \t\n\x0B\f\r]
\S                 any non-white-space character    [^\s]
\w                 any word character               [a-zA-Z_0-9]
\W                 any non-word character           [^\w]
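A small sketch of the predefined classes in use. In Java source each backslash is doubled inside a string literal. The class name PredefinedClasses, the helper and the sample texts are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for \d, \w and \S.
final class PredefinedClasses {

    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "Order 66 shipped 2 crates")); // runs of digits
        System.out.println(findAll("\\w+", "hello, world_1!"));           // runs of word characters
        System.out.println(findAll("\\S+", "a b\tc"));                    // runs of non-white-space
    }
}
```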
Quantifiers (1/5)
• Quantifiers allow character classes to match more than one character at a time.
Quantifiers for character classes X
X? zero or one time
X* zero or more times
X+ one or more times
X{n} exactly n times
X{n,} at least n times
X{n,m} at least n and at most m times
Quantifiers (2/5)
• Examples of X?, X*, X+
REGEX: “a?”
TEXT: “”
RESULT: “”
REGEX: “a*”
TEXT: “”
RESULT: “”
REGEX: “a+”
TEXT: “”
RESULT: -
REGEX: “a?”
TEXT: “a”
RESULT: “a”
REGEX: “a*”
TEXT: “a”
RESULT: “a”
REGEX: “a+”
TEXT: “a”
RESULT: “a”
REGEX: “a?”
TEXT: “aaa”
RESULT: “a”, “a”, “a”
REGEX: “a*”
TEXT: “aaa”
RESULT: “aaa”
REGEX: “a+”
TEXT: “aaa”
RESULT: “aaa”
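The table above can be checked with Matcher.find(). One detail the simplified results leave out: a? and a* can also match the empty string, so on “aaa” find() reports one extra zero-length match at the end of the input. The class name BasicQuantifiers and its helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for X?, X* and X+.
final class BasicQuantifiers {

    // Collects every match reported by find(), including zero-length ones.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String text : new String[] {"", "a", "aaa"}) {
            for (String regex : new String[] {"a?", "a*", "a+"}) {
                System.out.println(regex + " on \"" + text + "\": " + findAll(regex, text));
            }
        }
    }
}
```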
Quantifiers (3/5)
REGEX: “[abc]{3}”
TEXT: “abccabaaaccbbbc”
RESULT: “abc”, “cab”, “aaa”, “ccb”, “bbc”
REGEX: “abc{3}”
TEXT: “abccabaaaccbbbc”
RESULT: -
REGEX: “(dog){3}”
TEXT: “dogdogdogdogdogdog”
RESULT: “dogdogdog”, “dogdogdog”
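These three cases show what {3} binds to: a character class, a single character, or a group. A sketch with a hypothetical CountedQuantifiers class; the “xabcccx” input is an added example, not from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for X{n}.
final class CountedQuantifiers {

    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "abccabaaaccbbbc";
        System.out.println(findAll("[abc]{3}", text));                  // any three of a, b, c
        System.out.println(findAll("abc{3}", text));                    // "ab" then exactly three c's: none
        System.out.println(findAll("abc{3}", "xabcccx"));               // {3} applies to the c only
        System.out.println(findAll("(dog){3}", "dogdogdogdogdogdog"));  // {3} applies to the group
    }
}
```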
Quantifiers (4/5)
• Greedy quantifiers:
– consume as much of the input as possible
– back off one character at a time until the rest of the pattern matches
– syntax: X?, X*, X+, ...
• Reluctant quantifiers:
– consume as little of the input as possible
– take one more character at a time until the rest of the pattern matches
– syntax: X??, X*?, X+?, ...
• Possessive quantifiers:
– consume as much of the input as possible
– never back off, so the match is tried only once
– syntax: X?+, X*+, X++, ...
Quantifiers (5/5)
Greedy
REGEX: “.*foo”
TEXT: “xfooxxxxxxfoo”
RESULT: “xfooxxxxxxfoo”
Reluctant
REGEX: “.*?foo”
TEXT: “xfooxxxxxxfoo”
RESULT: “xfoo”, “xxxxxxfoo”
Possessive
REGEX: “.*+foo”
TEXT: “xfooxxxxxxfoo”
RESULT: -
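The three quantifier modes can be compared side by side on the same input. A minimal sketch; QuantifierModes and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy, reluctant and possessive matching.
final class QuantifierModes {

    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then gives back "foo".
        System.out.println(findAll(".*foo", text));
        // Reluctant: .*? grows only until "foo" follows.
        System.out.println(findAll(".*?foo", text));
        // Possessive: .*+ takes everything and never gives it back, so "foo" cannot match.
        System.out.println(findAll(".*+foo", text));
    }
}
```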
Capturing groups (1/2)
• Capturing groups treat multiple characters as a single unit
• Syntax: between parentheses ( and )
• Example: (dog){3}
• Numbering from left to right
– Example: ((A)(B(C)))
• Group 1: ((A)(B(C)))
• Group 2: (A)
• Group 3: (B(C))
• Group 4: (C)
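Group numbering follows the order of the opening parentheses, and group 0 is always the whole match. The sketch below prints the groups for the example above; GroupNumbering and its groups helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for capturing-group numbering.
final class GroupNumbering {

    // Returns group 0 (the whole match) followed by groups 1..n,
    // or an empty list when the text does not match the regex.
    static List<String> groups(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> groups = groups("((A)(B(C)))", "ABC");
        for (int i = 0; i < groups.size(); i++) {
            System.out.println("Group " + i + ": " + groups.get(i));
        }
    }
}
```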
Capturing groups (2/2)
• Backreferences to capturing groups are denoted by \i, where i is the group number
REGEX: “(\d\d)\1”
TEXT: “1212”
RESULT: “1212”
REGEX: “(\d\d)\1”
TEXT: “1234”
RESULT: -
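A sketch of the backreference example in Java, plus one added use that is not on the slide: finding a word typed twice in a row. BackReferences and its helpers are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for backreferences.
final class BackReferences {

    // True when the whole text is two digits followed by the same two digits.
    static boolean isDoubledPair(String text) {
        return Pattern.matches("(\\d\\d)\\1", text);
    }

    // Finds words that are immediately repeated, such as "is is".
    static List<String> repeatedWords(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b").matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(isDoubledPair("1212")); // true
        System.out.println(isDoubledPair("1234")); // false
        System.out.println(repeatedWords("this is is a test"));
    }
}
```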
Boundaries (1/2)
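Boundary   Meaning
^          beginning of line (by default only the beginning of the input; every line in MULTILINE mode)
$          end of line (by default only the end of the input; every line in MULTILINE mode)
\b         a word boundary
\B         a non-word boundary
\A         beginning of input
\G         end of previous match
\z         end of input
\Z         end of input, but before final terminator, if any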
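A few boundary matchers in action. The sketch assumes the embedded flag (?m) to switch on MULTILINE mode; the class name Boundaries, the helper and the sample texts are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for boundary matchers.
final class Boundaries {

    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String cats = "cat concat cats cat.";
        System.out.println(findAll("\\bcat\\b", cats)); // "cat" as a whole word only
        System.out.println(findAll("\\Bcat", cats));    // "cat" inside a word: the one in "concat"
        System.out.println(findAll("^\\w+", "one\ntwo"));     // ^ at the start of the input only
        System.out.println(findAll("(?m)^\\w+", "one\ntwo")); // ^ at the start of every line
        System.out.println(findAll("\\w+\\Z", "one\n"));      // \Z ignores the final terminator
        System.out.println(findAll("\\w+\\z", "one\n"));      // \z does not
    }
}
```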
Boundaries (2/2)
• Be aware:
• End-of-line marker is $
– Unix EOL is \n
– Windows EOL is \r\n
– JDK uses any of the following as EOL:
• '\n', '\r\n', '\r', '\u0085', '\u2028', '\u2029'
• Always test your regular expressions on the target OS
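The effect of the line terminator set can be seen on a text that mixes Windows and Unix line endings. This sketch assumes the standard Pattern.MULTILINE and Pattern.UNIX_LINES flags; LineEndings and its helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for end-of-line handling.
final class LineEndings {

    // Collects every match of regex in text, compiled with the given flags.
    static List<String> findAll(String regex, int flags, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "one\r\ntwo\nthree"; // first line ends Windows-style, second Unix-style
        // Default terminator set: both \r\n and \n end a line.
        System.out.println(findAll("^\\w+$", Pattern.MULTILINE, text));
        // UNIX_LINES: only \n ends a line, so "one" is followed by a stray \r and is skipped.
        System.out.println(findAll("^\\w+$", Pattern.MULTILINE | Pattern.UNIX_LINES, text));
    }
}
```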
Internationalization (1/2)
• Regular expressions were originally designed for the ASCII (Basic Latin) set of characters.
– Thus “België” is not matched by ^\w+$
• Extension to Unicode character sets is denoted by \p{...}
• Character set: [\p{InCharacterSet}]
– Create character classes from symbols in character sets.
– “België” is matched by ^[\w[\p{InLatin-1Supplement}]]+$
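A sketch of the “België” example. The ë is written as the escape \u00eb so the source file encoding does not matter. The third call uses Pattern.UNICODE_CHARACTER_CLASS (Java 7 and later), which is an addition to what the slide covers. LatinBlocks is a hypothetical class name.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for Unicode blocks.
final class LatinBlocks {

    // True when the whole text matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String word = "Belgi\u00eb";
        // By default \w covers only [a-zA-Z_0-9], so the final letter is rejected.
        System.out.println(matches("\\w+", 0, word));
        // Union of \w with the Latin-1 Supplement block accepts it.
        System.out.println(matches("[\\w[\\p{InLatin-1Supplement}]]+", 0, word));
        // Alternative: make \w itself Unicode-aware.
        System.out.println(matches("\\w+", Pattern.UNICODE_CHARACTER_CLASS, word));
    }
}
```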
Internationalization (2/2)
• Note that there are non-letters in character sets as well:
– Latin-1 Supplement: ¡¢£¤¥¦§¨©ª«-®¯°±²³´µ·¸¹º»¼½¾¿÷
• Categories:
– Letters: \p{L}
– Uppercase letters: \p{Lu}
– “België” is matched by ^\p{L}+$
• Other (POSIX) categories:
– Unicode currency symbols: \p{Sc}
– ASCII punctuation characters: \p{Punct}
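The difference between a block and a category can be checked directly: the division sign ÷ (\u00f7) sits in the Latin-1 Supplement block but is not a letter. A sketch with a hypothetical UnicodeCategories class; the sample strings are illustrative and use Unicode escapes for the non-ASCII characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for Unicode categories and POSIX classes.
final class UnicodeCategories {

    // True when the whole text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{L}+", "Belgi\u00eb"));                  // letters only
        System.out.println(matches("\\p{InLatin-1Supplement}", "\u00f7"));      // in the block
        System.out.println(matches("\\p{L}", "\u00f7"));                        // but not a letter
        System.out.println(matches("\\p{Lu}", "B"));                            // uppercase letter
        System.out.println(findAll("\\p{Sc}", "5\u20ac or $6 or \u00a37"));     // euro, dollar, pound
        System.out.println(matches("\\p{Punct}", "!"));                         // ASCII punctuation
        System.out.println(matches("\\p{Punct}", "\u00a1"));                    // inverted ! is not ASCII
    }
}
```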
Regular expressions in Java
• Since JDK 1.4
• Package java.util.regex
– Pattern class
– Matcher class
• Convenience methods in java.lang.String
• Alternative for JDK 1.3
– Jakarta ORO project
java.util.regex.Pattern
• Wrapper class for regular expressions
• Useful methods:
– compile(String regex): Pattern
– matches(String regex, CharSequence text): boolean
– split(CharSequence input): String[]
import java.util.regex.Pattern;
String regex = "(\\d\\d)\\1";   // Java source doubles each backslash
Pattern p = Pattern.compile(regex);
java.util.regex.Matcher
• Useful methods:
– matches(): boolean
– find(): boolean
– find(int start): boolean
– group(): String
– replaceFirst(String replace): String
– replaceAll(String replace): String
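import java.util.regex.Matcher;
import java.util.regex.Pattern;
String regex = "(\\d\\d)\\1";
Pattern p = Pattern.compile(regex);
String text = "1212";
Matcher m = p.matcher(text);
boolean matches = m.matches();   // true: "12" followed by a repeat of group 1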
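The other Matcher methods listed above can be sketched the same way. matches() tests the whole input, find() searches inside it, and replaceAll() can refer to groups with $1. The class MatcherDemo, its method names and the sample text are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for Matcher methods.
final class MatcherDemo {

    private static final Pattern DOUBLED = Pattern.compile("(\\d\\d)\\1");

    // matches(): the whole text must be a doubled pair of digits.
    static boolean isDoubled(String text) {
        return DOUBLED.matcher(text).matches();
    }

    // find(int start): index of the first match at or after start, or -1.
    static int firstMatchFrom(String text, int start) {
        Matcher m = DOUBLED.matcher(text);
        return m.find(start) ? m.start() : -1;
    }

    // replaceAll(): every doubled pair is replaced by its first half in angle brackets.
    static String markDoubled(String text) {
        return DOUBLED.matcher(text).replaceAll("<$1>");
    }

    public static void main(String[] args) {
        String text = "1212 3434 1234";
        System.out.println(isDoubled("1212"));
        System.out.println(firstMatchFrom(text, 0));
        System.out.println(firstMatchFrom(text, 1));
        System.out.println(markDoubled(text));
    }
}
```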
java.lang.String
• Pattern and Matcher methods in String:
– matches(String regex): boolean
– split(String regex): String[]
– replaceFirst(String regex, String replace): String
– replaceAll(String regex, String replace): String
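For one-off use the String methods save the Pattern and Matcher boilerplate; each call compiles the regex again, so a precompiled Pattern is likely the better choice inside loops. A minimal sketch with a hypothetical StringRegexMethods class and made-up inputs.

```java
import java.util.Arrays;

// Hypothetical demo class for the regex methods on java.lang.String.
final class StringRegexMethods {

    // split(): cuts the text at every run of digits.
    static String[] splitOnDigits(String s) {
        return s.split("\\d+");
    }

    // replaceAll(): reorders yyyy-mm-dd into dd/mm/yyyy using group references.
    static String isoToSlashes(String date) {
        return date.replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$3/$2/$1");
    }

    public static void main(String[] args) {
        System.out.println("1212".matches("(\\d\\d)\\1"));
        System.out.println(Arrays.toString(splitOnDigits("a1b22c333d")));
        System.out.println("2004-01-15".replaceFirst("-", "/")); // only the first hyphen
        System.out.println(isoToSlashes("2004-01-15"));
    }
}
```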
Examples
• Validation
• Searching text
• Filtering
• Parsing
• Removing duplicate lines
• On-the-fly editing
Examples: validation
• Validate an e-mail address
[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
• Validate a URL
(http|https|ftp)://([a-zA-Z0-9](\w+\.)+\w{2,7}|local\w*)(:\d+)?(/(\w+[\w/\-.]*)?)?
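A sketch of both validators. The e-mail pattern lists only upper-case ranges, so the sketch assumes it is meant to be compiled with Pattern.CASE_INSENSITIVE. The escaped hyphen in the URL path class is likewise an assumption about the original pattern. Validation, isEmail and isUrl are hypothetical names, and neither pattern is a complete validator for real-world addresses.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for input validation.
final class Validation {

    // Assumption: case-insensitive, since the pattern itself lists only A-Z.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9._%-]+\\.[A-Z0-9._%-]{2,4}",
            Pattern.CASE_INSENSITIVE);

    private static final Pattern URL = Pattern.compile(
            "(http|https|ftp)://([a-zA-Z0-9](\\w+\\.)+\\w{2,7}|local\\w*)"
                    + "(:\\d+)?(/(\\w+[\\w/\\-.]*)?)?");

    // matches() requires the whole input to fit the pattern.
    static boolean isEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    static boolean isUrl(String input) {
        return URL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEmail("john.doe@example.com"));
        System.out.println(isEmail("not-an-email"));
        System.out.println(isUrl("http://www.example.com:8080/path/index.html"));
        System.out.println(isUrl("http://localhost:8080/"));
        System.out.println(isUrl("mailto:someone"));
    }
}
```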
Examples: searching text
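• Write an HttpUnit test that submits an HTML form and checks whether the HTTP response is a
confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx:
9[0-9]{6}-[0-9]{6}
String regexp = "9[0-9]{6}-[0-9]{6}";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(text);          // text holds the HTTP response body
boolean ok = m.find();
String nr = ok ? m.group() : null;    // group() throws IllegalStateException if find() failed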
Examples: filtering
• Filter e-mails whose subject contains no lower-case letters, optionally with a leading “Re:”
(R[eE]:)*[^a-z]*$
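A sketch of the filter, assuming the pattern is applied to the whole subject line with matches(). With find() it would succeed on any subject, because every part of the pattern may match the empty string at the end of the input. SubjectFilter and isShouting are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the subject filter.
final class SubjectFilter {

    private static final Pattern SHOUTING = Pattern.compile("(R[eE]:)*[^a-z]*$");

    // True when the subject, apart from leading "Re:" or "RE:", has no lower-case letters.
    static boolean isShouting(String subject) {
        return SHOUTING.matcher(subject).matches();
    }

    public static void main(String[] args) {
        System.out.println(isShouting("RE: BUY NOW!!!")); // true
        System.out.println(isShouting("Re: URGENT"));     // true
        System.out.println(isShouting("Re: lunch?"));     // false
        System.out.println(isShouting("Meeting notes"));  // false
    }
}
```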
Examples: parsing
• Matches any opening and closing XML tag:
– Note the use of the back reference
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
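A sketch of the tag pattern in Java. Because the pattern lists only upper-case letters, the sketch assumes Pattern.CASE_INSENSITIVE. TagParser and firstElement are hypothetical names, and a regex like this is only a rough tool, not a real XML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for matching an opening tag with its closing tag.
final class TagParser {

    // Group 1 is the tag name, group 2 the content; \1 forces the same name in the closing tag.
    private static final Pattern ELEMENT = Pattern.compile(
            "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>", Pattern.CASE_INSENSITIVE);

    // Returns {tag name, content} for the first element found, or null if there is none.
    static String[] firstElement(String text) {
        Matcher m = ELEMENT.matcher(text);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String[] hit = firstElement("<h1 id='top'>Title</h1>");
        System.out.println(hit[0] + " -> " + hit[1]);
        System.out.println(firstElement("<b>bold</i>") == null); // mismatched tags: no match
    }
}
```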
Examples: duplicate lines
• Suppose you want to remove duplicate lines from a text.
– requirement here is that the lines are sorted alphabetically
^(.*)(\r?\n\1)+$
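A sketch of how the pattern could be applied. It assumes MULTILINE mode, so that ^ and $ anchor at every line, and it replaces each run of identical lines with group 1. Dedupe and removeAdjacentDuplicates are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for removing duplicate lines from sorted text.
final class Dedupe {

    // A line, followed by one or more copies of itself on the next lines.
    private static final Pattern REPEATED =
            Pattern.compile("^(.*)(\\r?\\n\\1)+$", Pattern.MULTILINE);

    // Keeps one copy of each run of identical adjacent lines.
    static String removeAdjacentDuplicates(String sortedText) {
        return REPEATED.matcher(sortedText).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "apple\napple\nbanana\ncherry\ncherry\ncherry";
        System.out.println(removeAdjacentDuplicates(text));
    }
}
```

Only adjacent duplicates are removed, which is why the lines have to be sorted first.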
Examples: on-the-fly editing
• Suppose you want to edit a file in batch: all occurrences of a certain string pattern
should be replaced with another string.
• In Unix: use the sed command with a regex
• In Java: use string.replaceAll(regex, "mystring")
• In Ant: use replaceregexp optional task to, e.g., edit deployment descriptors
depending on environment
Quiz
• What are the following regular expressions looking for?
\d+                          at least one digit
[-+]?\d+                     any integer
((\d*\.?)?\d+|\d+(\.?\d*))   any positive decimal
[\p{L}']['\-.\p{L} ]+        a place name
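The quiz answers can be checked against a few sample inputs. The inputs below are made up for illustration, the escaped hyphen in the place-name pattern is an assumption about the original, and Quiz is a hypothetical class name.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for checking the quiz patterns.
final class Quiz {

    static final String DIGITS = "\\d+";
    static final String INTEGER = "[-+]?\\d+";
    static final String DECIMAL = "((\\d*\\.?)?\\d+|\\d+(\\.?\\d*))";
    static final String PLACE = "[\\p{L}']['\\-.\\p{L} ]+";

    // True when the whole text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches(DIGITS, "2024"));
        System.out.println(matches(INTEGER, "-17"));
        System.out.println(matches(DECIMAL, "3.14"));
        System.out.println(matches(DECIMAL, ".5"));
        System.out.println(matches(DECIMAL, "-1"));             // false: no sign allowed
        System.out.println(matches(PLACE, "'s-Hertogenbosch"));
        System.out.println(matches(PLACE, "Area 51"));          // false: digits are not allowed
    }
}
```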
Conclusion
• When doing one of the following:
– validating strings
– on-the-fly editing of strings
– searching strings
– filtering strings
• think regex!
References
• http://www.regular-expressions.info/
• http://www.regexlib.com/
• http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex/
• http://java.sun.com/docs/books/tutorial/extra/regex/
• http://www.wellho.net/regex/javare.html
• JDK 1.4 (and later) API documentation
• Mastering Regular Expressions

More Related Content

What's hot

Regular expressions
Regular expressionsRegular expressions
Regular expressionsEran Zimbler
 
Regular Expression
Regular ExpressionRegular Expression
Regular Expressionvaluebound
 
Regular expression
Regular expressionRegular expression
Regular expressionLarry Nung
 
Regular expressions
Regular expressionsRegular expressions
Regular expressionsShiraz316
 
Theory of Computation "Chapter 1, introduction"
Theory of Computation "Chapter 1, introduction"Theory of Computation "Chapter 1, introduction"
Theory of Computation "Chapter 1, introduction"Ra'Fat Al-Msie'deen
 
Chapter 5 -Syntax Directed Translation - Copy.ppt
Chapter 5 -Syntax Directed Translation - Copy.pptChapter 5 -Syntax Directed Translation - Copy.ppt
Chapter 5 -Syntax Directed Translation - Copy.pptFamiDan
 
Regular expressions-Theory of computation
Regular expressions-Theory of computationRegular expressions-Theory of computation
Regular expressions-Theory of computationBipul Roy Bpl
 
Advanced regular expressions
Advanced regular expressionsAdvanced regular expressions
Advanced regular expressionsNeha Jain
 
Compiler design syntax analysis
Compiler design syntax analysisCompiler design syntax analysis
Compiler design syntax analysisRicha Sharma
 
Ambiguous & Unambiguous Grammar
Ambiguous & Unambiguous GrammarAmbiguous & Unambiguous Grammar
Ambiguous & Unambiguous GrammarMdImamHasan1
 
Regex - Regular Expression Basics
Regex - Regular Expression BasicsRegex - Regular Expression Basics
Regex - Regular Expression BasicsEterna Han Tsai
 
Syntax analyzer
Syntax analyzerSyntax analyzer
Syntax analyzerahmed51236
 
Automata
AutomataAutomata
AutomataGaditek
 
Lexical Analysis - Compiler design
Lexical Analysis - Compiler design Lexical Analysis - Compiler design
Lexical Analysis - Compiler design Aman Sharma
 

What's hot (20)

Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Regex Basics
Regex BasicsRegex Basics
Regex Basics
 
Regular Expression
Regular ExpressionRegular Expression
Regular Expression
 
Regular expression
Regular expressionRegular expression
Regular expression
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Parse Tree
Parse TreeParse Tree
Parse Tree
 
Theory of Computation "Chapter 1, introduction"
Theory of Computation "Chapter 1, introduction"Theory of Computation "Chapter 1, introduction"
Theory of Computation "Chapter 1, introduction"
 
Regex cheatsheet
Regex cheatsheetRegex cheatsheet
Regex cheatsheet
 
Chapter 5 -Syntax Directed Translation - Copy.ppt
Chapter 5 -Syntax Directed Translation - Copy.pptChapter 5 -Syntax Directed Translation - Copy.ppt
Chapter 5 -Syntax Directed Translation - Copy.ppt
 
Regular expressions-Theory of computation
Regular expressions-Theory of computationRegular expressions-Theory of computation
Regular expressions-Theory of computation
 
Advanced regular expressions
Advanced regular expressionsAdvanced regular expressions
Advanced regular expressions
 
Compiler design syntax analysis
Compiler design syntax analysisCompiler design syntax analysis
Compiler design syntax analysis
 
Ambiguous & Unambiguous Grammar
Ambiguous & Unambiguous GrammarAmbiguous & Unambiguous Grammar
Ambiguous & Unambiguous Grammar
 
Regex - Regular Expression Basics
Regex - Regular Expression BasicsRegex - Regular Expression Basics
Regex - Regular Expression Basics
 
Syntax analyzer
Syntax analyzerSyntax analyzer
Syntax analyzer
 
Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
 
Automata
AutomataAutomata
Automata
 
Finite automata
Finite automataFinite automata
Finite automata
 
Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
 
Lexical Analysis - Compiler design
Lexical Analysis - Compiler design Lexical Analysis - Compiler design
Lexical Analysis - Compiler design
 

Viewers also liked

Lecture: Regular Expressions and Regular Languages
Lecture: Regular Expressions and Regular LanguagesLecture: Regular Expressions and Regular Languages
Lecture: Regular Expressions and Regular LanguagesMarina Santini
 
Learn PHP Lacture1
Learn PHP Lacture1Learn PHP Lacture1
Learn PHP Lacture1ADARSH BHATT
 
Regular Expressions: JavaScript And Beyond
Regular Expressions: JavaScript And BeyondRegular Expressions: JavaScript And Beyond
Regular Expressions: JavaScript And BeyondMax Shirshin
 
Introduction to regular expressions
Introduction to regular expressionsIntroduction to regular expressions
Introduction to regular expressionsBen Brumfield
 
Bitcoin: the future money, or a scam?
Bitcoin: the future money, or a scam?Bitcoin: the future money, or a scam?
Bitcoin: the future money, or a scam?Ignaz Wanders
 
The Service doing "Ping"
The Service doing "Ping"The Service doing "Ping"
The Service doing "Ping"Ignaz Wanders
 
Web Service Versioning
Web Service VersioningWeb Service Versioning
Web Service VersioningIgnaz Wanders
 
Lecture 03 lexical analysis
Lecture 03 lexical analysisLecture 03 lexical analysis
Lecture 03 lexical analysisIffat Anjum
 
Finite Automata
Finite AutomataFinite Automata
Finite AutomataShiraz316
 
Regular expression with DFA
Regular expression with DFARegular expression with DFA
Regular expression with DFAMaulik Togadiya
 
Field Extractions: Making Regex Your Buddy
Field Extractions: Making Regex Your BuddyField Extractions: Making Regex Your Buddy
Field Extractions: Making Regex Your BuddyMichael Wilde
 

Viewers also liked (16)

Lecture: Regular Expressions and Regular Languages
Lecture: Regular Expressions and Regular LanguagesLecture: Regular Expressions and Regular Languages
Lecture: Regular Expressions and Regular Languages
 
Regular expression (compiler)
Regular expression (compiler)Regular expression (compiler)
Regular expression (compiler)
 
Learn PHP Lacture1
Learn PHP Lacture1Learn PHP Lacture1
Learn PHP Lacture1
 
Regular Expressions: JavaScript And Beyond
Regular Expressions: JavaScript And BeyondRegular Expressions: JavaScript And Beyond
Regular Expressions: JavaScript And Beyond
 
Introduction to regular expressions
Introduction to regular expressionsIntroduction to regular expressions
Introduction to regular expressions
 
Bitcoin: the future money, or a scam?
Bitcoin: the future money, or a scam?Bitcoin: the future money, or a scam?
Bitcoin: the future money, or a scam?
 
The Service doing "Ping"
The Service doing "Ping"The Service doing "Ping"
The Service doing "Ping"
 
Reflexive Access List
Reflexive Access ListReflexive Access List
Reflexive Access List
 
Tests
TestsTests
Tests
 
Regular expression examples
Regular expression examplesRegular expression examples
Regular expression examples
 
Lecture2 B
Lecture2 BLecture2 B
Lecture2 B
 
Web Service Versioning
Web Service VersioningWeb Service Versioning
Web Service Versioning
 
Lecture 03 lexical analysis
Lecture 03 lexical analysisLecture 03 lexical analysis
Lecture 03 lexical analysis
 
Finite Automata
Finite AutomataFinite Automata
Finite Automata
 
Regular expression with DFA
Regular expression with DFARegular expression with DFA
Regular expression with DFA
 
Field Extractions: Making Regex Your Buddy
Field Extractions: Making Regex Your BuddyField Extractions: Making Regex Your Buddy
Field Extractions: Making Regex Your Buddy
 

Similar to Regular expressions

Regular Expressions: QA Challenge Accepted Conf (March 2015)
Regular Expressions: QA Challenge Accepted Conf (March 2015)Regular Expressions: QA Challenge Accepted Conf (March 2015)
Regular Expressions: QA Challenge Accepted Conf (March 2015)Svetlin Nakov
 
Regular Expressions and You
Regular Expressions and YouRegular Expressions and You
Regular Expressions and YouJames Armes
 
Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013Ben Brumfield
 
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekingeBioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekingeProf. Wim Van Criekinge
 
Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions Ahmed El-Arabawy
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrepTri Truong
 
Regular expressions-ada-2018
Regular expressions-ada-2018Regular expressions-ada-2018
Regular expressions-ada-2018Emma Burrows
 
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdfFUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdfBryan Alejos
 
Regular Expressions Boot Camp
Regular Expressions Boot CampRegular Expressions Boot Camp
Regular Expressions Boot CampChris Schiffhauer
 
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in PracticeWeek-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in PracticeBertram Ludäscher
 
Js reg正则表达式
Js reg正则表达式Js reg正则表达式
Js reg正则表达式keke302
 
Introduction to Regular Expressions
Introduction to Regular ExpressionsIntroduction to Regular Expressions
Introduction to Regular ExpressionsJesse Anderson
 

Similar to Regular expressions (20)

Regular expression for everyone
Regular expression for everyoneRegular expression for everyone
Regular expression for everyone
 
Regular Expressions: QA Challenge Accepted Conf (March 2015)
Regular Expressions: QA Challenge Accepted Conf (March 2015)Regular Expressions: QA Challenge Accepted Conf (March 2015)
Regular Expressions: QA Challenge Accepted Conf (March 2015)
 
Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
 
Regular Expressions and You
Regular Expressions and YouRegular Expressions and You
Regular Expressions and You
 
Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Bioinformatica p2-p3-introduction
Bioinformatica p2-p3-introductionBioinformatica p2-p3-introduction
Bioinformatica p2-p3-introduction
 
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekingeBioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
 
Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions
 
JavaScript.pptx
JavaScript.pptxJavaScript.pptx
JavaScript.pptx
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrep
 
Json the-x-in-ajax1588
Json the-x-in-ajax1588Json the-x-in-ajax1588
Json the-x-in-ajax1588
 
Regular expressions-ada-2018
Regular expressions-ada-2018Regular expressions-ada-2018
Regular expressions-ada-2018
 
Json demo
Json demoJson demo
Json demo
 
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdfFUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
 
Regular Expressions Boot Camp
Regular Expressions Boot CampRegular Expressions Boot Camp
Regular Expressions Boot Camp
 
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in PracticeWeek-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice
 
Js reg正则表达式
Js reg正则表达式Js reg正则表达式
Js reg正则表达式
 
Introduction to Regular Expressions
Introduction to Regular ExpressionsIntroduction to Regular Expressions
Introduction to Regular Expressions
 
Quick start reg ex
Quick start reg exQuick start reg ex
Quick start reg ex
 

Recently uploaded

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 

Recently uploaded (20)

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 

Regular expressions

  • 1. Regular Expressions Powerful string validation and extraction Ignaz Wanders – Architect @ Archimiddle @ignazw
  • 2. Topics • What are regular expressions? • Patterns • Character classes • Quantifiers • Capturing groups • Boundaries • Internationalization • Regular expressions in Java • Quiz • References
  • 3. What are regular expressions? • A regex is a string pattern used to search and manipulate text • A regex has special syntax • Very powerful for any type of String manipulation ranging from simple to very complex structures: – Input validation – S(ubs)tring replacement – ... • Example: • [A-Z0-9._%-]+@[A-Z0-9._%-]+.[A-Z0-9._%-]{2,4}
  • 4. History • Originates from automata and formal-language theories of computer science • Stephen Kleene  50’s: Kleene algebra • Kenneth Thompson  1969: unix: qed, ed • 70’s - 90’s: unix: grep, awk, sed, emacs • Programming languages: – C, Perl – JavaScript, Java
  • 5. Patterns • Regex is based on pattern matching: Strings are searched for certain patterns • Simplest regex is a string-literal pattern • Metacharacters: ([{^$|)?*+. – Period means “any character” – To search for period as string literal, escape with “” REGEX: fox TEXT: The quick brown fox RESULT: fox REGEX: fo. TEXT: The quick brown fox RESULT: fox REGEX: .o. TEXT: The quick brown fox RESULT: row, fox
  • 6. Character classes (1/3) • Syntax: any characters between [ and ] • Character classes denote one letter • Negation: ^ REGEX: [rcb]at TEXT: bat RESULT: bat REGEX: [rcb]at TEXT: rat RESULT: rat REGEX: [rcb]at TEXT: cat RESULT: cat REGEX: [rcb]at TEXT: hat RESULT: - REGEX: [^rcb]at TEXT: rat RESULT: - REGEX: [^rcb]at TEXT: hat RESULT: hat
  • 7. Character classes (2/3) • Ranges: [a-z], [0-9], [i-n], [a-zA-Z]... • Unions: [0-4[6-8]], [a-p[r-w]], ... • Intersections: [a-f&&[efg]], [a-f&&[e-k]], ... • Subtractions: [a-f&&[^efg]], ... REGEX: [rcb]at[1-5] TEXT: bat4 RESULT: bat4 REGEX: [rcb]at[1-5[7-8]] TEXT: hat7 RESULT: - REGEX: [rcb]at[1-7&&[78]] TEXT: rat7 RESULT: rat7 REGEX: [rcb]at[1-5&&[^34]] TEXT: bat4 RESULT: -
  • 8. Character classes (3/3) predefined character classes equivalence . any character d any digit [0-9] D any non-digit [^0-9], [^d] s any white-space character [ tnx0Bfr] S any non-white-space character [^s] w any word character [a-zA-Z_0-9] W any non-word character [^w]
  • 9. Quantifiers (1/5) • Quantifiers allow character classes to match more than one character at a time. Quantifiers for character classes X X? zero or one time X* zero or more times X+ one or more times X{n} exactly n times X{n,} at least n times X{n,m} at least n and at most m times
  • 10. Quantifiers (2/5) • Examples of X?, X*, X+ REGEX: “a?” TEXT: “” RESULT: “” REGEX: “a*” TEXT: “” RESULT: “” REGEX: “a+” TEXT: “” RESULT: - REGEX: “a?” TEXT: “a” RESULT: “a” REGEX: “a*” TEXT: “a” RESULT: “a” REGEX: “a+” TEXT: “a” RESULT: “a” REGEX: “a?” TEXT: “aaa” RESULT: “a”, “a”, “a” REGEX: “a*” TEXT: “aaa” RESULT: “aaa” REGEX: “a+” TEXT: “aaa” RESULT: “aaa”
  • 11. Quantifiers (3/5) REGEX: “[abc]{3}” TEXT: “abccabaaaccbbbc” RESULT: “abc”, “cab”, “aaa”, “ccb”, “bbc” REGEX: “abc{3}” TEXT: “abccabaaaccbbbc” RESULT: - REGEX: “(dog){3}” TEXT: “dogdogdogdogdogdog” RESULT: “dogdogdog”, “dogdogdog”
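The quantifier examples can be checked with a find loop. This is a sketch under the assumption that matches are collected with Matcher.find(); the class QuantifierDemo is hypothetical. One detail worth knowing: in Java, a pattern that can match the empty string (such as a?) also reports a zero-length match at the end of the input, which the slide results leave out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the quantifier slides.
class QuantifierDemo {
    // Collects all matches, including zero-length ones.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // {3} applies to the whole class: any three of a, b, c in a row.
        System.out.println(findAll("[abc]{3}", "abccabaaaccbbbc"));    // [abc, cab, aaa, ccb, bbc]
        // Without brackets, {3} applies only to the preceding 'c': "abccc".
        System.out.println(findAll("abc{3}", "abccabaaaccbbbc"));      // []
        // A group makes {3} apply to the whole word.
        System.out.println(findAll("(dog){3}", "dogdogdogdogdogdog")); // [dogdogdog, dogdogdog]
        // Three 'a' matches plus one empty match at the end of the input.
        System.out.println(findAll("a?", "aaa").size());               // 4
    }
}
```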
  • 12. Quantifiers (4/5) • Greedy quantifiers: – read complete string – work backwards until match found – syntax: X?, X*, X+, ... • Reluctant quantifiers: – read one character at a time – work forward until match found – syntax: X??, X*?, X+?, ... • Possessive quantifiers: – read complete string – try match only once – syntax: X?+, X*+, X++, ...
  • 13. Quantifiers (5/5) REGEX: “.*foo” (greedy) TEXT: “xfooxxxxxxfoo” RESULT: “xfooxxxxxxfoo” REGEX: “.*?foo” (reluctant) TEXT: “xfooxxxxxxfoo” RESULT: “xfoo”, “xxxxxxfoo” REGEX: “.*+foo” (possessive) TEXT: “xfooxxxxxxfoo” RESULT: -
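The difference between the three quantifier types shows up already in the first match. A minimal sketch, with a hypothetical class GreedinessDemo and helper firstMatch that returns null when nothing is found:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy, reluctant and possessive quantifiers.
class GreedinessDemo {
    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off until "foo" fits.
        System.out.println(firstMatch(".*foo", text));   // xfooxxxxxxfoo
        // Reluctant: .*? takes as little as possible before "foo".
        System.out.println(firstMatch(".*?foo", text));  // xfoo
        // Possessive: .*+ takes everything and never gives any back.
        System.out.println(firstMatch(".*+foo", text));  // null
    }
}
```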
  • 14. Capturing groups (1/2) • Capturing groups treat multiple characters as a single unit • Syntax: between parentheses ( and ) • Example: (dog){3} • Groups are numbered by their opening parenthesis, from left to right – Example: ((A)(B(C))) • Group 1: ((A)(B(C))) • Group 2: (A) • Group 3: (B(C)) • Group 4: (C)
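Group numbering can be inspected through Matcher.group(int). The sketch below assumes the input "ABC" and a hypothetical class GroupNumberingDemo; group 0 always denotes the entire match and is not counted by groupCount().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for group numbering.
class GroupNumberingDemo {
    // Returns group 0 (the whole match) followed by groups 1..n,
    // or an empty list when the regex does not match the entire text.
    static List<String> groups(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Groups are numbered by the position of their opening parenthesis.
        System.out.println(groups("((A)(B(C)))", "ABC")); // [ABC, ABC, A, BC, C]
    }
}
```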
  • 15. Capturing groups (2/2) • Backreferences to capturing groups are denoted by \i, with i an integer number REGEX: “(\d\d)\1” TEXT: “1212” RESULT: “1212” REGEX: “(\d\d)\1” TEXT: “1234” RESULT: -
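The backreference example on this slide (two digits followed by the same two digits) can be tried as follows. This is a sketch with a hypothetical class BackreferenceDemo; in Java source every backslash is doubled, and in a replacement string a group is referenced as $1 rather than \1.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for backreferences.
class BackreferenceDemo {
    // Two digits, followed by the same two digits again.
    private static final Pattern REPEATED_PAIR = Pattern.compile("(\\d\\d)\\1");

    // True when the whole text is a repeated two-digit pair.
    static boolean isRepeatedPair(String text) {
        return REPEATED_PAIR.matcher(text).matches();
    }

    // Replaces every repeated pair by a single copy of the pair.
    static String collapse(String text) {
        return REPEATED_PAIR.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(isRepeatedPair("1212"));    // true
        System.out.println(isRepeatedPair("1234"));    // false
        System.out.println(collapse("1212 and 3434")); // 12 and 34
    }
}
```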
  • 16. Boundaries (1/2) • Boundary characters: – ^ beginning of line – $ end of line – \b a word boundary – \B a non-word boundary – \A beginning of input – \G end of previous match – \z end of input – \Z end of input, but before the final terminator, if any
  • 17. Boundaries (2/2) • Be aware: • End-of-line marker is $ – Unix EOL is \n – Windows EOL is \r\n – JDK uses any of the following as EOL: • '\n', '\r\n', '\r', '\u0085', '\u2028', '\u2029' • Always test your regular expressions on the target OS
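In Java, ^ and $ match at line boundaries only when the Pattern.MULTILINE flag is set; by default they anchor to the whole input. The sketch below illustrates this with mixed Windows and Unix line endings, plus a word-boundary search. The class BoundaryDemo and the sample texts are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for boundary matchers.
class BoundaryDemo {
    // Collects all matches of regex in text, compiled with the given flags.
    static List<String> findAll(String regex, int flags, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "one\r\ntwo\nthree"; // mixed Windows and Unix line endings
        // Default mode: ^ and $ anchor to the whole input, so nothing matches.
        System.out.println(findAll("^\\w+$", 0, text));                 // []
        // MULTILINE: ^ and $ also match around each line terminator.
        System.out.println(findAll("^\\w+$", Pattern.MULTILINE, text)); // [one, two, three]
        // \b matches only where a word starts or ends.
        System.out.println(findAll("\\bcat\\b", 0, "cat concat cats")); // [cat]
    }
}
```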
  • 18. Internationalization (1/2) • Regular expressions were originally designed for the ASCII (Basic Latin) character set. – Thus “België” is not matched by ^\w+$ • Extension to Unicode character sets is denoted by \p{...} • Character set: [\p{InCharacterSet}] – Create character classes from the symbols in a character set. – “België” is matched by ^[\w[\p{InLatin-1Supplement}]]+$
  • 19. Internationalization (2/2) • Note that character sets contain non-letters as well: – Latin-1 Supplement: ¡¢£¤¥¦§¨©ª«-®¯°±²³´µ·¸¹º»¼½¾¿÷ • Categories: – Letters: \p{L} – Uppercase letters: \p{Lu} – “België” is matched by ^\p{L}+$ • Other (POSIX) categories: – Unicode currency symbols: \p{Sc} – ASCII punctuation characters: \p{Punct}
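The effect of the Unicode categories can be checked directly. A minimal sketch with a hypothetical class UnicodeDemo; the non-ASCII characters are written as \u escapes so that the result does not depend on the source file encoding, and the euro sign is an added example of a currency symbol.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for Unicode categories.
class UnicodeDemo {
    // True when the regex matches the entire text.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String belgie = "Belgi\u00eb"; // "België" with a precomposed e-diaeresis
        // By default \w covers only [a-zA-Z_0-9], so the last letter breaks the match.
        System.out.println(matches("\\w+", belgie));               // false
        // \p{L} matches a letter from any script.
        System.out.println(matches("\\p{L}+", belgie));            // true
        // One uppercase letter followed by lowercase letters.
        System.out.println(matches("\\p{Lu}\\p{Ll}+", belgie));    // true
        // \p{Sc} matches currency symbols such as the euro sign.
        System.out.println(matches("\\p{Sc}", "\u20ac"));          // true
    }
}
```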
  • 20. Regular expressions in Java • Since JDK 1.4 • Package java.util.regex – Pattern class – Matcher class • Convenience methods in java.lang.String • Alternative for JDK 1.3 – Jakarta ORO project
  • 21. java.util.regex.Pattern • Wrapper class for regular expressions • Useful methods: – compile(String regex): Pattern – matches(String regex, CharSequence text): boolean – split(CharSequence text): String[] String regex = "(\\d\\d)\\1"; Pattern p = Pattern.compile(regex);
  • 22. java.util.regex.Matcher • Useful methods: – matches(): boolean – find(): boolean – find(int start): boolean – group(): String – replaceFirst(String replace): String – replaceAll(String replace): String String regex = "(\\d\\d)\\1"; Pattern p = Pattern.compile(regex); String text = "1212"; Matcher m = p.matcher(text); boolean matches = m.matches();
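matches() tests the entire input, whereas find() scans for the next occurrence and can be called repeatedly. The following sketch shows both on a longer text; the class MatcherDemo, the helper findWithOffsets and the sample input are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class contrasting matches() and find().
class MatcherDemo {
    // Returns each match together with its start offset, e.g. "1212@0".
    static List<String> findWithOffsets(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group() + "@" + m.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        String regex = "(\\d\\d)\\1";
        String text = "1212 3434 1234";
        // matches() requires the whole input to fit the pattern.
        System.out.println(Pattern.compile(regex).matcher(text).matches()); // false
        // find() locates each occurrence in turn.
        System.out.println(findWithOffsets(regex, text));                   // [1212@0, 3434@5]
    }
}
```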
  • 23. java.lang.String • Pattern and Matcher methods in String: – matches(String regex): boolean – split(String regex): String[] – replaceFirst(String regex, String replace): String – replaceAll(String regex, String replace): String
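For one-off operations the String convenience methods avoid creating Pattern and Matcher objects by hand. A short sketch, with hypothetical method names and sample inputs; each call compiles the regex again, so a precompiled Pattern may be preferable inside loops.

```java
import java.util.Arrays;

// Hypothetical demo class for the regex methods on java.lang.String.
class StringRegexDemo {
    // Splits the text wherever one or more digits occur.
    static String[] splitOnDigits(String text) {
        return text.split("\\d+");
    }

    // Collapses every run of white space into a single space.
    static String normalizeSpaces(String text) {
        return text.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println("2024-01-15".matches("\\d{4}-\\d{2}-\\d{2}")); // true
        System.out.println(Arrays.toString(splitOnDigits("a1b22c333")));  // [a, b, c]
        System.out.println(normalizeSpaces("a  b   c"));                  // a b c
        System.out.println("a  b   c".replaceFirst("\\s+", " "));         // a b   c
    }
}
```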
  • 24. Examples • Validation • Searching text • Filtering • Parsing • Removing duplicate lines • On-the-fly editing
  • 25. Examples: validation • Validate an e-mail address: [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4} • Validate a URL: (http|https|ftp)://([a-zA-Z0-9](\w+\.)+\w{2,7}|local\w*)(:\d+)?(/(\w+[\w/\-.]*)?)?
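The e-mail pattern lists only uppercase letters, so in Java it would typically be compiled with Pattern.CASE_INSENSITIVE. The sketch below assumes that flag and a literal dot before the top-level domain; the class EmailCheck and the sample addresses are hypothetical. It is a plausibility check only: top-level domains longer than four characters, for instance, are rejected.

```java
import java.util.regex.Pattern;

// Hypothetical validator built on the e-mail pattern from the slide.
class EmailCheck {
    // CASE_INSENSITIVE is an assumption: the character classes list only A-Z.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9._%-]+\\.[A-Z0-9._%-]{2,4}",
            Pattern.CASE_INSENSITIVE);

    // True when the whole string has the shape local@domain.tld.
    static boolean looksLikeEmail(String s) {
        return EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("user@example.com"));    // true
        System.out.println(looksLikeEmail("user@example"));        // false: no top-level domain
        System.out.println(looksLikeEmail("user@example.museum")); // false: more than 4 characters
    }
}
```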
  • 26. Examples: searching text • Write an HttpUnit test that submits an HTML form and checks whether the HTTP response is a confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx: 9[0-9]{6}-[0-9]{6} Pattern p = Pattern.compile(regexp); Matcher m = p.matcher(text); boolean ok = m.find(); String nr = ok ? m.group() : null;
  • 27. Examples: filtering • Filter e-mails whose subject is in capitals only, possibly with a leading “Re:” (R[eE]:)*[^a-z]*$
  • 28. Examples: parsing • Matches any pair of opening and closing XML tags with the text between them: – Note the use of the backreference <([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
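Applied in Java, group 1 of this pattern holds the tag name and group 2 the text between the tags. A sketch assuming Pattern.CASE_INSENSITIVE (the pattern lists only uppercase letters) and a hypothetical class TagParserDemo. This works for simple, non-nested markup; it is not a substitute for an XML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the tag-matching pattern.
class TagParserDemo {
    // The backreference \1 forces the closing tag to repeat the opening tag's name.
    private static final Pattern TAG = Pattern.compile(
            "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE);

    // Returns one "name=content" entry per matched element.
    static List<String> parse(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            out.add(m.group(1) + "=" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("<b class=\"x\">bold</b> and <i>it</i>")); // [b=bold, i=it]
        // A mismatched closing tag is not accepted.
        System.out.println(parse("<b>bold</i>"));                           // []
    }
}
```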
  • 29. Examples: duplicate lines • Suppose you want to remove duplicate lines from a text. – The requirement here is that the lines are sorted alphabetically, so that duplicates are adjacent ^(.*)(\r?\n\1)+$
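In Java this pattern needs Pattern.MULTILINE so that ^ and $ match at every line, and the run of duplicates is replaced by a single copy through $1. A minimal sketch with a hypothetical class DuplicateLinesDemo and sample text:

```java
import java.util.regex.Pattern;

// Hypothetical demo class for removing adjacent duplicate lines.
class DuplicateLinesDemo {
    // A line, followed by one or more line breaks plus the same line again.
    private static final Pattern DUPLICATES =
            Pattern.compile("^(.*)(\\r?\\n\\1)+$", Pattern.MULTILINE);

    // Keeps one copy of each run of identical adjacent lines.
    static String removeDuplicates(String sortedText) {
        return DUPLICATES.matcher(sortedText).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "apple\napple\nbanana\ncherry\ncherry\ncherry";
        // Prints apple, banana and cherry on one line each.
        System.out.println(removeDuplicates(text));
    }
}
```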
  • 30. Examples: on-the-fly editing • Suppose you want to edit a file in batch: all occurrences of a certain string pattern should be replaced with another string. • In Unix: use the sed command with a regex • In Java: use string.replaceAll(regex, "mystring") • In Ant: use the replaceregexp optional task to, e.g., edit deployment descriptors depending on the environment
  • 31. Quiz • What are the following regular expressions looking for? – \d+ : at least one digit – [-+]?\d+ : any integer – ((\d*\.?)?\d+|\d+(\.?\d*)) : any positive decimal – [\p{L}']['\-.\p{L} ]+ : a place name
  • 32. Conclusion • When doing one of the following: – validating strings – on-the-fly editing of strings – searching strings – filtering strings • think regex!
  • 33. References • http://www.regular-expressions.info/ • http://www.regexlib.com/ • http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex/ • http://java.sun.com/docs/books/tutorial/extra/regex/ • http://www.wellho.net/regex/javare.html • JDK 1.4 API • Mastering Regular Expressions