Regular Expressions for Data Management in Stata
John Lunalo
Overview
1. Introduction
2. Definition of terms
3. Regular Expressions Symbols in Stata
4. Regular Expressions Commands in Stata
5. Examples
Introduction
Regular expressions (regex/regexp/rational expressions) are sequences of characters
that define a search pattern.
Types of Regular Expressions
Extended Regular Expressions: Used mainly by programming languages, where they are typically the default.
Perl-Like Regular Expressions: Use the syntax and semantics of the Perl language.
Literal/Fixed Regular Expressions: The most basic type, although literals can still be combined
with special characters to form complex patterns.
Definition of Terms
Characters (Metacharacters): \d matches a digit while \D matches a non-digit, \w matches a word character, etc.
Quantifiers: + = one or more, ? = zero or one, * = zero or more, and { } for specifying a minimum, a maximum, or both, e.g. {2,4}.
Logic: | = or, (...) = group, \1 \2 \3 = contents of groups 1, 2, 3. You can also group without capturing by using (?:...), e.g. (?:lunalo|John) matches either lunalo or John without creating a numbered group.
Character Classes: Denoted by [...]; match any one character from a set or range of characters.
Anchors: Declare boundaries, e.g. ^ for start and $ for end.
POSIX Classes: e.g. [:punct:] for punctuation and [:alpha:] for alphabetic characters.
Lookarounds: (?=...) = positive lookahead, (?<=...) = positive lookbehind, (?!...) = negative lookahead, (?<!...) = negative lookbehind.
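The terms above can be tried out in any regex engine. The sketch below uses Python's re module purely as a stand-in (an assumption, since the slides show no code), and every sample string is hypothetical. Not every construct here is necessarily available in every Stata regex function, so read it as an illustration of the concepts rather than as Stata syntax.

```python
import re

# Quantifiers: {4} means exactly four, ? means zero or one.
years = re.findall(r"\d{4}", "born 1987, hired 2015")
optional_u = [w for w in ["color", "colour", "colouur"]
              if re.fullmatch(r"colou?r", w)]

# Capturing group plus backreference: \1 must repeat what group 1 matched.
doubled_word = re.search(r"\b(\w+) \1\b", "this is is a typo").group(1)

# Non-capturing group: (?:...) groups the alternatives without numbering them,
# so the only numbered group is the word that follows the name.
m = re.search(r"(?:lunalo|John) (\w+)", "John Lunalo")
surname = m.group(1)
n_groups = m.re.groups  # number of capturing groups in the pattern

# Lookarounds: assert context without including it in the match.
price = re.search(r"(?<=\$)\d+", "cost: $45").group()      # digits preceded by "$"
not_kg = re.findall(r"\b\d+\b(?! kg)", "12 kg, 30 cm")     # numbers not followed by " kg"
```

The non-capturing group is the detail worth noting: it selects between alternatives like an ordinary group, but it leaves the group numbering untouched, so it does not negate anything.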
Regular Expressions Symbols in Stata
Quantifiers
* Match zero or more of the preceding expression
+ Match one or more of the preceding expression
? Match either zero or one of the preceding expression
Metacharacters
a-z Match a range of characters or numbers. The “a” and “z” are an example; it could also be 1-9, etc. A range is used inside square brackets, e.g. [1-9].
. Match any character
\ Used for escaping a metacharacter
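To see how the three quantifiers, the dot, and the escape character differ, the following minimal sketch runs each against the same list of values. It uses Python's re module as a stand-in for illustration (an assumption), and the values in codes are hypothetical.

```python
import re

codes = ["A1", "A12", "A", "B7", "a.b", "axb"]  # hypothetical sample values

def full_matches(pattern):
    """Return the values that the pattern matches in their entirety."""
    return [c for c in codes if re.fullmatch(pattern, c)]

star = full_matches(r"A[0-9]*")       # "A" followed by zero or more digits
plus = full_matches(r"A[0-9]+")       # "A" followed by one or more digits
question = full_matches(r"A[0-9]?")   # "A" followed by at most one digit
any_char = full_matches(r"a.b")       # the dot matches any single character
literal_dot = full_matches(r"a\.b")   # the escaped dot matches only "."
```

Comparing star, plus, and question on the value "A" and on "A12" shows the difference between the three quantifiers most directly.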
Regular Expressions Symbols in Stata
Anchors
^ Match expression at the beginning of the string. E.g. “^[hj]” matches h or j at the beginning of the string. Be careful: inside square brackets, as in “[^hj]”, the ^ negates the class and matches any character other than h or j.
$ Match expression at the end of the string. E.g. “hj$” will match hj at the end of the string.
Groups
( ) Subexpression, e.g. ([1-9]) ([a-z]) etc.
Logic
| The vertical bar/pipe character signifies a logical “or”
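The two roles of ^ are easy to confuse, so here is a small sketch contrasting them, together with $, a pair of subexpressions, and the pipe. Python's re module is used as a stand-in (an assumption), and all strings are hypothetical.

```python
import re

words = ["hjalmar", "john", "raj", "bhj"]  # hypothetical sample values

# ^ outside brackets anchors to the start: the string must begin with h or j.
starts_with_h_or_j = [w for w in words if re.search(r"^[hj]", w)]

# $ anchors to the end: the string must end with the two characters "hj".
ends_with_hj = [w for w in words if re.search(r"hj$", w)]

# ^ inside brackets negates the class: match a character that is not h or j.
first_other_char = re.search(r"[^hj]", "hjx").group()
no_other_char = re.search(r"[^hj]", "hj")  # nothing but h and j, so no match

# Parentheses create subexpressions that can be retrieved separately.
letters, digit = re.search(r"([a-z]+)([1-9])", "ward7").groups()

# The pipe means "or".
sexes = [w for w in ["male", "female", "unknown"]
         if re.fullmatch(r"male|female", w)]
```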
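Regular Expressions Commands in Stata
In Stata we have three commands (strictly speaking, functions) that use regular expressions in their operations:
1) regexm: M = Match. It is Boolean, returning 1 if the pattern matches and 0 otherwise.
2) regexr: R = Replace. It replaces the first part of the string that matches the pattern.
3) regexs: S = Subexpression. It returns a subexpression captured by the preceding regexm match.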
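The division of labour between match, replace, and subexpression can be sketched outside Stata as well. The block below is a rough Python analogue, not Stata code: re.search, Match.group, and re.sub stand in for the roles described above, and the address string and pattern are hypothetical.

```python
import re

address = "P.O. Box 30197 Nairobi"  # hypothetical sample value
pattern = re.compile(r"([0-9]+) ([A-Za-z]+)")

# Role of regexm: test whether the pattern occurs anywhere in the string.
match = pattern.search(address)
matched = match is not None

# Role of regexs: pull out a subexpression from the match just made.
# Index 0 is the whole match; 1 and 2 are the parenthesised groups.
whole = match.group(0)
box_number = match.group(1)
town = match.group(2)

# Role of regexr: replace the first matching part of the string.
replaced = pattern.sub("REDACTED", address, count=1)
```

The order matters in the same way as in Stata: the subexpressions only exist after a successful match, so the match test comes first.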
Examples
Example datasets to be used to experiment with the three commands
THANKS FOR YOUR ATTENTION

Editor's Notes

  • #5 Characters with special meaning. Represents the number of times a character, a literal, or a pattern as a whole needs to appear in a regular expression. Defining the arrangement of patterns.