3. Introduction
Regular expressions (regex/regexp/rational expressions) are sequences of
characters that define a search pattern.
Types of Regular Expressions
Extended Regular Expressions: Used mainly by programming languages, where
they are the default.
Perl-Like Regular Expressions: Use the syntax and semantics of the Perl language.
Literal/Fixed Regular Expressions: The most basic type, although they can still
be combined with special characters to form complex patterns.
4. Definition of Terms
Characters (Metacharacters): \d = digits while \D represents non-digits, \w
represents word characters, etc.
Quantifiers: + = one or more, ? = zero or one, * = zero or more, and { } for
specifying a minimum, a maximum, or both.
Logic: | = or, (...) = group, \1 \2 \3 = contents of groups 1, 2, 3. You can also
group without capturing by using (?:...), e.g. (?:lunalo|John) matches lunalo
or John without storing a group.
Character Classes: Denoted by [...]; they match any one character from a set or range.
Anchors: Declare boundaries, e.g. ^ for start and $ for end.
POSIX Classes: e.g. [:punct:] for punctuation and [:alpha:] for alphabetic characters.
Lookarounds: (?=...) = positive lookahead, (?<=...) = positive lookbehind,
(?!...) = negative lookahead, (?<!...) = negative lookbehind.
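The terms above can be tried out in a short sketch. It uses Python's `re` module purely as an illustration (an assumption of this example, not something the slides prescribe); the sample strings are made up, and Stata's own functions may not support every construct shown here, such as `\d` or lookarounds.

```python
import re

# \d matches a digit, \D matches a non-digit.
digits = re.findall(r"\d", "a1b22")
non_digits = re.findall(r"\D", "a1b22")

# Quantifiers: ? = zero or one, {2,3} = between two and three repetitions.
optional_u = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour")]
two_to_three = re.findall(r"\d{2,3}", "7 42 12345")

# (...) captures a group; \1 refers back to whatever group 1 matched.
doubled = re.search(r"(\w)\1", "letter").group(0)

# (?:...) groups without capturing, so no group is stored.
non_capturing = re.search(r"(?:lunalo|John)", "Dear John")

# Lookarounds check the surrounding context without including it in the match.
after_dollar = re.search(r"(?<=\$)\d+", "cost: $45").group(0)
before_kg = re.search(r"\d+(?=kg)", "12kg").group(0)

print(digits, non_digits, optional_u, two_to_three)
print(doubled, non_capturing.group(0), non_capturing.groups())
print(after_dollar, before_kg)
```

Raw strings (`r"..."`) are used so that the backslashes reach the regex engine unchanged.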
5. Regular Expressions Symbols in Stata
Quantifier symbols:
* Matches zero or more of the preceding expression
+ Matches one or more of the preceding expression
? Matches either zero or one of the preceding expression
Metacharacters:
a-z Matches a range of characters or numbers. The “a” and “z” are an example; it could also be 1-9, etc.
A range is used inside square brackets, e.g. [1-9].
. Matches any character
\ Escapes a metacharacter so that it is matched literally
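A minimal sketch of how these symbols behave, written with Python's `re` module as a stand-in (an assumption made for illustration; the test strings are invented). The same symbols carry the same meaning in Stata, but the surrounding function calls would differ.

```python
import re

samples = ("ac", "abc", "abbbc")

# * : zero or more b's between a and c.
star = [bool(re.fullmatch(r"ab*c", s)) for s in samples]
# + : one or more b's.
plus = [bool(re.fullmatch(r"ab+c", s)) for s in samples]
# ? : zero or one b.
question = [bool(re.fullmatch(r"ab?c", s)) for s in samples]

# [1-9] : any single character in the range 1 to 9 (0 is excluded).
in_range = re.findall(r"[1-9]", "a0b5c9")

# . matches any character; \. matches only a literal dot.
any_char = bool(re.fullmatch(r"a.c", "a-c"))
literal_dot_miss = bool(re.fullmatch(r"a\.c", "a-c"))
literal_dot_hit = bool(re.fullmatch(r"a\.c", "a.c"))

print(star, plus, question)
print(in_range, any_char, literal_dot_miss, literal_dot_hit)
```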
6. Regular Expressions Symbols in Stata
Anchors
^ Matches the expression at the beginning of the string, e.g. “^[hj]” matches h or j at the beginning of the
string. Be careful: inside brackets the caret negates, so “[^hj]” matches any character other than h or j.
$ Matches the expression at the end of the string, e.g. “hj$” will match hj at the end of the string.
Groups
( ) Subexpression, e.g. ([1-9]), ([a-z]), etc.
Logic
| The vertical bar/pipe character signifies a logical “or”
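The difference between the caret as an anchor and the caret inside brackets is easy to mix up, so here is a small sketch of it, together with `$`, groups, and `|`. Python's `re` module is used only for illustration (an assumption of this example), and the sample strings are invented.

```python
import re

# ^[hj] : the string must start with h or j.
starts = [bool(re.search(r"^[hj]", s)) for s in ("hat", "jam", "ohj")]

# [^hj] : any single character that is NOT h or j.
not_hj = re.findall(r"[^hj]", "hjkl")

# hj$ : the string must end with hj.
ends = [bool(re.search(r"hj$", s)) for s in ("abhj", "hjab")]

# ( ) : each parenthesised part is a subexpression that can be read back.
parts = re.search(r"([a-z]+)([1-9])", "room7").groups()

# | : logical "or" between alternatives.
either = re.findall(r"cat|dog", "cat, dog, cow")

print(starts, not_hj, ends, parts, either)
```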
7. Regular Expressions Commands in Stata
In Stata, three functions use regular expressions in their operations:
1) regexm
2) regexr
3) regexs
m = match; the result is Boolean (1 if the pattern matches, 0 otherwise)
r = replace
s = subexpression
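The division of labour between the three functions can be sketched with rough Python analogues. This is an illustration under stated assumptions, not Stata code: `re.search`, `Match.group`, and `re.sub` are Python's functions, chosen here because they play similar roles to `regexm(s, re)`, `regexs(n)`, and `regexr(s1, re, s2)`; the sample string is invented.

```python
import re

text = "abc123xyz456"

# Role of regexm: did the pattern match? The answer is Boolean.
m = re.search(r"([a-z]+)([0-9]+)", text)
matched = m is not None

# Role of regexs: read back a subexpression from the last match.
# Subexpression 0 is the whole match; 1 and 2 are the parenthesised groups.
whole, letters, numbers = m.group(0), m.group(1), m.group(2)

# Role of regexr: replace the first matching substring (hence count=1).
replaced = re.sub(r"[0-9]+", "#", text, count=1)

print(matched, whole, letters, numbers, replaced)
```

In Stata the match and the subexpression lookup are typically combined in one expression, since `regexs` reads from the most recent `regexm` match.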
Metacharacters: characters with special meaning.
Quantifiers: the number of times a character, literal, or whole pattern needs to appear in a regular expression.
Anchors, groups, and logic: define the arrangement of patterns.