Regular Expressions in
R
Houston R Users Group
10.05.2011
Ed Goodwin
twitter: @egoodwintx
What is a Regular
Expression?
Regexes are an extremely flexible tool for
finding and replacing text. They can easily
be applied globally across a document,
dataset, or specifically to individual strings.
Example
LastName, FirstName, Address, Phone
Baker, Tom, 123 Unit St., 555-452-1324
Smith, Matt, 456 Tardis St., 555-326-4567
Tennant, David, 567 Torchwood Ave., 555-563-8974
Data
gsub("St\\.", "Street", data[i])
*Note the double backslash "\\" to escape the '.'
Regular Expression to Convert “St.” to “Street”
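A minimal sketch of the same conversion applied to every record at once. The vector name `addresses` is an assumption made for illustration; the slide itself indexes a `data` object one element at a time.

```r
# Hypothetical character vector holding the example records
addresses <- c("Baker, Tom, 123 Unit St., 555-452-1324",
               "Smith, Matt, 456 Tardis St., 555-326-4567",
               "Tennant, David, 567 Torchwood Ave., 555-563-8974")

# "\\." in the R string reaches the regex engine as "\.", a literal period.
# Left unescaped, "." would match any single character after "St".
expanded <- gsub("St\\.", "Street", addresses)

# fixed = TRUE treats the pattern as plain text, so no escaping is needed
expanded_fixed <- gsub("St.", "Street", addresses, fixed = TRUE)
```

Both calls should give the same result here; the fixed = TRUE form is an option whenever the search text contains no actual pattern.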
Benets of Regex
• Flexible (can be applied globally or
specically across data)
• Terse (very powerful scripting template)
• Portable (sort of) across languages
• Rich history
Disadvantages of regex
• Non-intuitive
• Easy to make errors (unintended
consequences)
• Difficult to robustly debug
• Various flavors may cause portability issues.
Why do this in R?
• Easier to locate all code in one place
• (Relatively) Robust regex tools
• May be the only tool available
• Familiarity
Other alternatives?
• Perl
• Python
• Java
• Ruby
• Others (grep, sed, awk, bash, csh, ksh, etc.)
Components of a
Regular Expression
• Characters
• Metacharacters
• Character classes
The R regex functions
Note: all functions are in the base package
Function     Purpose
strsplit()   breaks apart strings at predefined points
grep()       returns a vector of indices where a pattern is matched
grepl()      returns a logical vector (TRUE/FALSE) for each element of the data
sub()        replaces one pattern with another at the first matching location
gsub()       replaces one pattern with another at every matching location
regexpr()    returns an integer vector giving the starting position of the first match, along with a match.length attribute giving the length of the matched text
gregexpr()   returns a list of integer vectors giving the starting positions of all matches, along with match.length attributes giving the lengths of the matched text
Metacharacter Symbols
Modier Meaning
^ anchors expression to beginning of target
$ anchors expression to end of target
. matches any single character except newline
| separates alternative patterns
[] accepts any of the enclosed characters
[^] accepts any characters but the ones enclosed in brackets
() groups patterns together for assignment or constraint
* matches zero or more occurrences of preceding entity
? matches zero or one occurrences of preceding entity
+ matches one or more occurrences of preceding entity
{n} matches exactly n occurrences of preceding entity
{n,} matches at least n occurrences of preceding entity
{n,m} matches n to m occurrences of preceding entity
\ interprets succeeding character as literal
Source:“Data Manipulation with R”. Spector, Phil. Springer, 2008. page 92.
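A short sketch of a few of these metacharacters in action, using grepl() from base R. The test vector `x` and the result names are made up for illustration.

```r
# Hypothetical test strings
x <- c("cat", "cot", "coat", "ct", "cart")

# "." stands for exactly one character between "c" and "t"
one_char <- grepl("^c.t$", x)

# "[ao]+" accepts one or more of the enclosed characters
a_or_o <- grepl("^c[ao]+t$", x)

# "[^o]*" accepts zero or more characters other than "o"
no_o <- grepl("^c[^o]*t$", x)

# ".{2}" requires exactly two characters
two_chars <- grepl("^c.{2}t$", x)

# "\\." (regex "\.") is a literal period rather than "any character"
literal_dot <- grepl("St\\.", c("St.", "Sta"))
```

The anchors ^ and $ are what force the whole string to match; without them each pattern would only need to match somewhere inside the string.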
Examples
[A-Za-z]+ matches one or more alphabetic characters
.* matches zero or more of any character up to the newline
.*\.\* matches zero or more characters followed by a literal .*
(July? ) Accept ‘Jul’ or ‘July’ but not ‘Julyy’. Note the space.
(abc|123) Match “abc” or “123”
[abc|123] Match a, b, c, 1, 2, 3 or a literal '|'. Inside brackets the '|' is not alternation.
^(From|Subject|Date):
Matches lines starting with “From:” or “Subject:” or
“Date:”
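Some of the examples above can be checked directly with grepl(). This is only a sketch; the input vectors and result names are invented for the purpose.

```r
# Grouping with alternation versus a character class
x <- c("abc", "123", "a", "|", "xyz")
group_match <- grepl("^(abc|123)$", x)   # whole string must be "abc" or "123"
class_match <- grepl("^[abc|123]$", x)   # any ONE of a, b, c, |, 1, 2, 3

# Optional character: the "y" may appear zero or one time before the space
july_match <- grepl("July? ", c("Jul 4", "July 4", "Julyy 4"))

# Anchored alternation: only headers at the start of a line count
header_match <- grepl("^(From|Subject|Date):",
                      c("From: a@example.com", "To: b@example.com", "Re: Date: x"))
```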
Let’s work through some examples...
Data
LastName, FirstName, Address, Phone
Baker, Tom, 123 Unit St., 555-452-1324
Smith, Matt, 456 Tardis St., 555-326-4567
Tennant, David, 567 Torchwood Ave., 555-563-8974
1. Locate all phone numbers.
2. Locate all addresses.
3. Locate all addresses ending in ‘Street’ (including
abbreviations).
4. Read in full names, reverse the order and remove
the comma.
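One possible set of answers to the four exercises, offered as a sketch rather than the presenter's own solution. The variable names and patterns are assumptions, and the patterns are tuned only to the three sample records.

```r
# Hypothetical vector holding the data rows (header row omitted)
lines <- c("Baker, Tom, 123 Unit St., 555-452-1324",
           "Smith, Matt, 456 Tardis St., 555-326-4567",
           "Tennant, David, 567 Torchwood Ave., 555-563-8974")

# 1. Phone numbers: 3 digits, 3 digits, 4 digits, separated by hyphens
phone_re <- "[0-9]{3}-[0-9]{3}-[0-9]{4}"
phones <- regmatches(lines, regexpr(phone_re, lines))

# 2. Addresses: a house number, then words up to the period ending the street type
addr_re <- "[0-9]+ [A-Za-z ]+\\."
addresses <- regmatches(lines, regexpr(addr_re, lines))

# 3. Records whose address ends in "St." or "Street"
is_street <- grepl("St(\\.|reet),", lines)

# 4. "Last, First, ..." becomes "First Last" using backreferences
full_names <- sub("^([A-Za-z]+), ([A-Za-z]+),.*$", "\\2 \\1", lines)
```

regmatches() drops elements with no match, so on messier data the results of steps 1 and 2 may be shorter than the input.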
So how would you write the regular
expression to match a calendar date in
format “mm/dd/yyyy” or “mm.dd.yyyy”?
Regex to identify date
format?
What’s wrong with
"[0-9]{2}(\\.|/)[0-9]{2}(\\.|/)[0-9]{4}" ?
Or with
"[1-12](\\.|/)[1-31](\\.|/)[0001-9999]" ?
Dates are not an easy problem
because they are not a simple text
pattern
Best bet is to validate the textual pattern
(mm.dd.yyyy) and then pass to a separate
function to validate the date (leap years, odd
days in month, etc.)
"^(1[0-2]|0[1-9])(\\.|/)(3[0-1]|[1-2][0-9]|0[1-9])(\\.|/)([0-9]{4})$"
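The two-step approach could be sketched as follows: the regex checks the textual shape, and base R's as.Date() then rejects impossible calendar dates. The helper name `is_valid_date` is hypothetical.

```r
# Textual pattern: mm, separator, dd, separator, yyyy
date_re <- "^(1[0-2]|0[1-9])(\\.|/)(3[0-1]|[1-2][0-9]|0[1-9])(\\.|/)([0-9]{4})$"

# Hypothetical helper: TRUE only if the text matches AND the date exists
is_valid_date <- function(x) {
  looks_ok <- grepl(date_re, x)
  # Normalise "." to "/" so a single format string covers both styles
  normalised <- gsub(".", "/", x, fixed = TRUE)
  # as.Date() returns NA for dates such as February 30
  parsed <- as.Date(normalised, format = "%m/%d/%Y")
  looks_ok & !is.na(parsed)
}

checks <- is_valid_date(c("02/28/2011", "02.29.2012", "02/30/2011",
                          "13/01/2011", "2/3/2011"))
```

Note that the pattern still accepts mixed separators such as "02/28.2011"; a backreference to the first separator would be one way to tighten that.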
Supported flavors of
regex in R
• POSIX 1003.2
• Perl
Perl is the more robust of the two. POSIX
has a few idiosyncrasies handling '\' that may
trip you up.
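As an illustration of what the Perl flavor adds, the sketch below uses two features that need perl = TRUE: case conversion in the replacement text and lookahead in the pattern. The input strings and result names are invented.

```r
# \\U and \\L in the replacement switch to upper and lower case (perl = TRUE only)
capitalised <- sub("(\\w)(\\w*)", "\\U\\1\\L\\2", "tennant", perl = TRUE)

# (?= ...) is a lookahead: "Tom" matches only when " Baker" follows it,
# but " Baker" itself is not part of the match
has_tom_baker <- grepl("Tom(?= Baker)", c("Tom Baker", "Tom Jones"), perl = TRUE)
```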
Usage Patterns
• Data validation
• String replace on dataset
• String identify in dataset (subset of data)
• Pattern arithmetic (how prevalent is string
in data?)
• Error prevention/detection
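A brief sketch of three of these usage patterns on the example records. The vector `records` and the result names are assumptions for illustration.

```r
# Hypothetical vector of records
records <- c("Baker, Tom, 123 Unit St., 555-452-1324",
             "Smith, Matt, 456 Tardis St., 555-326-4567",
             "Tennant, David, 567 Torchwood Ave., 555-563-8974")

# Data validation: does every record end in a well-formed phone number?
valid_phone <- grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}$", records)

# String identification: keep only the records that mention "St."
street_records <- grep("St\\.", records, value = TRUE)

# Pattern arithmetic: what share of records mention "St."?
street_share <- mean(grepl("St\\.", records))
```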
strsplit( )
Definition:
strsplit(x, split, fixed=FALSE, perl=FALSE, useBytes=FALSE)
Example:
str <- "This is some dummy data to parse x785y8099"
strsplit(str, "[ xy]", perl=TRUE)
Result:
[[1]]
 [1] "This"  "is"    "some"  "dumm"  ""      "data"  "to"    "parse" ""
[10] "785"   "8099"
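As a further sketch, strsplit() can also break one of the example records into fields. The names `record` and `fields` are illustrative only, and fixed = TRUE is used because the separator is plain text rather than a pattern.

```r
# One record from the example data
record <- "Baker, Tom, 123 Unit St., 555-452-1324"

# strsplit() always returns a list with one element per input string,
# so [[1]] pulls out the character vector for this single record
fields <- strsplit(record, ", ", fixed = TRUE)[[1]]

last_name <- fields[1]
phone <- fields[4]
```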
grep( )
Definition:
grep(pattern, x, ignore.case=FALSE, perl=FALSE, value=FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)
Example:
str <- "This is some dummy data to parse x785y8099"
grep("[a-z][0-9]{3}[a-z][0-9]{4}", str, perl=TRUE, value=TRUE)
Result:
[1] "This is some dummy data to parse x785y8099"
grepl( )
Definition:
grepl(pattern, x, ignore.case=FALSE, perl=FALSE,
fixed=FALSE, useBytes=FALSE)
Example:
str <- "This is some dummy data to parse x785y8099"
grepl("[a-z][0-9]{3}[a-z][0-9]{4}", str, perl=TRUE)
Result:
[1] TRUE
sub( )
Definition:
sub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE,
fixed=FALSE, useBytes=FALSE)
Example:
str <- "This is some dummy data to parse x785y8099"
sub("dummy(.* )([a-z][0-9]{3}).([0-9]{4})",
"awesome\\1\\2H\\3", str, perl=TRUE)
Result:
[1] "This is some awesome data to parse x785H8099"
gsub( )
Definition:
gsub(pattern, replacement, x, ignore.case=FALSE,
perl=FALSE,fixed=FALSE, useBytes=FALSE)
Example:
str <- "This is some dummy data to parse x785y8099 you dummy"
gsub("dummy", "awesome", str, perl=TRUE)
Result:
[1] "This is some awesome data to parse x785y8099 you awesome"
regexpr( )
Definition:
regexpr(pattern, text, ignore.case=FALSE, perl=FALSE,
fixed = FALSE, useBytes = FALSE)
Example:
duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"
regexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE)
Result:
[1] 1
attr(,"match.length")
[1] 4
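The position and match.length values can be turned back into the matched text. The sketch below does this by hand with substring() and then with base R's regmatches(); the variable names are illustrative.

```r
duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"

m <- regexpr("goose", duckgoose)
start <- as.integer(m)             # starting position of the first match
len <- attr(m, "match.length")     # number of characters matched

# Extract by position, then with the convenience function
by_hand <- substring(duckgoose, start, start + len - 1)
by_regmatches <- regmatches(duckgoose, m)

# When nothing matches, regexpr() reports -1 rather than an error
no_match <- as.integer(regexpr("swan", duckgoose))
```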
gregexpr( )
Definition:
gregexpr(pattern, text, ignore.case=FALSE, perl=FALSE,
fixed=FALSE, useBytes=FALSE)
Example:
duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"
gregexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE)
Result:
[[1]]
[1] 1 7 13 26 32 45 51
attr(,"match.length")
[1] 4 4 4 4 4 4 4
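A small sketch of how the gregexpr() result might be used to pull out and count every match. It relies on base R's regmatches(); the variable names are made up.

```r
duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"

# gregexpr() returns a list with one element per input string
m <- gregexpr("duck", duckgoose, ignore.case = TRUE)

# regmatches() mirrors that list; [[1]] holds the matches for our one string
all_ducks <- regmatches(duckgoose, m)[[1]]
n_ducks <- length(all_ducks)

# With no match, regmatches() gives an empty vector, so the count is 0
n_swans <- length(regmatches(duckgoose, gregexpr("swan", duckgoose))[[1]])
```

Counting via regmatches() avoids a common slip: the raw gregexpr() result for "no match" is a vector of length one holding -1, not an empty vector.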
Problem Solving &
Debugging
• Remember that regexes are greedy by
default. They will try to grab the largest
matching string possible unless constrained.
• Dummy data - small datasets
• Unit testing - testthat, etc.
• Build up regex complexity incrementally
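The greedy behaviour mentioned above can be seen on a small piece of dummy data. This is a sketch with an invented input string; it compares a greedy quantifier, a non-greedy one, and a negated character class.

```r
# Dummy data: two tagged words
x <- "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" runs from the first "<" to the LAST ">", swallowing everything
greedy <- sub("<.*>", "", x)

# Non-greedy: ".*?" stops at the first ">" it can
lazy <- sub("<.*?>", "", x, perl = TRUE)

# Constrained: "[^>]*" cannot cross a ">" at all, so greed does no harm
constrained <- sub("<[^>]*>", "", x)
```

Here the greedy version deletes the entire string, while the other two remove only the opening tag.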
Best Practices for
Regex in R
• Store regex string as variable to pass to function
• Try to make each regex as exact as possible
(avoid lazy matching)
• Pick one type of regex syntax and stick with it
(POSIX or Perl)
• Document all regexes in code with liberal comments
• Use cat() to verify the regex string
• Test, test, and test some more
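Two of these practices, storing the regex in a variable and checking it with cat(), might look like the sketch below. The variable names are hypothetical.

```r
# Store the pattern once, with a comment saying what it is meant to match:
# the abbreviation "St" followed by a literal period
street_abbrev_re <- "St\\."

# print() shows the R string with its escapes; cat() shows what the
# regex engine actually receives, which is the text to proofread
print(street_abbrev_re)
cat(street_abbrev_re, "\n")

# The stored string is 4 characters long: S, t, backslash, period
pattern_length <- nchar(street_abbrev_re)

# Reuse the same variable everywhere the pattern is needed
has_abbrev <- grepl(street_abbrev_re, c("123 Unit St.", "567 Torchwood Ave."))
```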
Regex Workflow
• Define initial data pattern
• Define desired data pattern
• Define transformation steps
• Incremental iteration to desired regex
• Testing & QA
Regex Resources
• http://regexpal.com/ - online regex tester
• Data Manipulation with R. Spector, Phil. Springer, 2008.
• Regular Expression Cheat Sheet. http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/
• Regular Expressions Cookbook. Goyvaerts, Jan and
Levithan, Steven. O’Reilly, 2009.
• Mastering Regular Expressions. Friedl, Jeffrey E.F. O’Reilly,
2006.
• Twitter: @RegexTip - regex tips and tricks

 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Eag 201110-hrugregexpresentation-111006104128-phpapp02

  • 1. Regular Expressions in R Houston R Users Group 10.05.2011 Ed Goodwin twitter: @egoodwintx
  • 2. What is a Regular Expression? Regexes are an extremely flexible tool for finding and replacing text. They can easily be applied globally across a document, dataset, or specifically to individual strings.
  • 3. Example
      Data:
        LastName, FirstName, Address, Phone
        Baker, Tom, 123 Unit St., 555-452-1324
        Smith, Matt, 456 Tardis St., 555-326-4567
        Tennant, David, 567 Torchwood Ave., 555-563-8974
      Regular expression to convert "St." to "Street":
        gsub("St\\.", "Street", data[i])
      *Note the double backslash "\\" to escape the '.'
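A minimal runnable sketch of the slide's example. The vector name `data` and its contents are assumptions that mirror the sample rows; `gsub()` is vectorized, so no loop over `data[i]` is needed. The last call shows what can go wrong when the period is left unescaped.

```r
# Hypothetical character vector holding the slide's sample rows
data <- c(
  "Baker, Tom, 123 Unit St., 555-452-1324",
  "Smith, Matt, 456 Tardis St., 555-326-4567",
  "Tennant, David, 567 Torchwood Ave., 555-563-8974"
)

# "\\." in R source becomes the regex \. (a literal period)
fixed <- gsub("St\\.", "Street", data)
print(fixed)

# Without the escape, "." matches any character, so "Sta" is replaced too
loose <- gsub("St.", "Street", "Stan St.")
print(loose)
```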
  • 4. Benets of Regex • Flexible (can be applied globally or specically across data) • Terse (very powerful scripting template) • Portable (sort of) across languages • Rich history
  • 5. Disadvantages of regex • Non-intuitive • Easy to make errors (unintended consequences) • Difcult to robustly debug • Various flavors may cause portability issues.
  • 6. Why do this in R? • Easier to locate all code in one place • (Relatively) Robust regex tools • May be the only tool available • Familiarity
  • 7. Other alternatives? • Perl • Python • Java • Ruby • Others (grep, sed, awk, bash, csh, ksh, etc.)
  • 8. Components of a Regular Expression • Characters • Metacharacters • Character classes
  • 9. The R regex functions
      Note: all functions are in the base package
      Function     Purpose
      strsplit()   breaks apart strings at predefined points
      grep()       returns a vector of indices where a pattern is matched
      grepl()      returns a logical vector (TRUE/FALSE) for each element of the data
      sub()        replaces one pattern with another at the first matching location
      gsub()       replaces one pattern with another at every matching location
      regexpr()    returns an integer vector giving the starting position of the first match, along with a match.length attribute giving the length of the matched text
      gregexpr()   returns an integer vector giving the starting positions of all matches, along with a match.length attribute giving the length of the matched text
  • 10. Metacharacter Symbols
      Modifier   Meaning
      ^          anchors expression to beginning of target
      $          anchors expression to end of target
      .          matches any single character except newline
      |          separates alternative patterns
      []         accepts any of the enclosed characters
      [^]        accepts any characters but the ones enclosed in brackets
      ()         groups patterns together for assignment or constraint
      *          matches zero or more occurrences of preceding entity
      ?          matches zero or one occurrences of preceding entity
      +          matches one or more occurrences of preceding entity
      {n}        matches exactly n occurrences of preceding entity
      {n,}       matches at least n occurrences of preceding entity
      {n,m}      matches n to m occurrences of preceding entity
      \          interprets succeeding character as literal
      Source: "Data Manipulation with R". Spector, Phil. Springer, 2008. page 92.
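To make the quantifier and anchor rows concrete, here is a small sketch using `grepl()`. The test vectors are made up for illustration only.

```r
# Made-up inputs: runs of the letter "a" of increasing length
x <- c("a", "aa", "aaa", "aaaa")

exactly_two  <- grepl("^a{2}$", x)    # {n}: exactly two
at_least_two <- grepl("^a{2,}$", x)   # {n,}: two or more
two_to_three <- grepl("^a{2,3}$", x)  # {n,m}: between two and three

# [^...] negates the class: the first character must not be "a"
not_a_first <- grepl("^[^a]", c("apple", "banana"))

print(rbind(exactly_two, at_least_two, two_to_three))
print(not_a_first)
```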
  • 11. Examples
      [A-Za-z]+              matches one or more alphabetic characters
      .*                     matches zero or more of any character up to the newline
      .*\.\*                 matches zero or more characters followed by a literal ".*"
      (July? )               accepts 'Jul' or 'July' but not 'Julyy'. Note the space.
      (abc|123)              matches "abc" or "123"
      [abc|123]              matches a, b, c, 1, 2 or 3. The '|' is extraneous: inside brackets it is not an alternation operator and is matched as a literal character.
      ^(From|Subject|Date):  matches lines starting with "From:", "Subject:" or "Date:"
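The example patterns on this slide can be checked directly with `grepl()`. This is only a sketch; the input strings are invented, and the backslashes are doubled because the patterns are written as R string literals.

```r
# Optional "y" followed by a required space
july <- grepl("July? ", c("Jul 4", "July 4", "Julyy 4"))

# Grouped alternation versus a character class
probe <- c("abc", "a", "|")
group_alt  <- grepl("(abc|123)", probe)  # needs the whole word "abc" or "123"
char_class <- grepl("[abc|123]", probe)  # any single listed character, "|" included

# Anchored alternation for mail-style header lines
headers <- grepl("^(From|Subject|Date):", c("From: ed", "Re: From: ed"))

# Any characters followed by a literal ".*"
literal <- grepl(".*\\.\\*", c("foo.*", "foo"))

print(list(july, group_alt, char_class, headers, literal))
```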
  • 12. Let's work through some examples...
      Data:
        LastName, FirstName, Address, Phone
        Baker, Tom, 123 Unit St., 555-452-1324
        Smith, Matt, 456 Tardis St., 555-326-4567
        Tennant, David, 567 Torchwood Ave., 555-563-8974
      1. Locate all phone numbers.
      2. Locate all addresses.
      3. Locate all addresses ending in 'Street' (including abbreviations).
      4. Read in full names, reverse the order and remove the comma.
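One possible way to approach the four exercises, offered as a sketch rather than the presenter's solution. The variable names (`rows`, `phone_re`, `street_re`) are assumptions, and the header row is left out for brevity.

```r
# Hypothetical vector holding the three data rows from the slide
rows <- c(
  "Baker, Tom, 123 Unit St., 555-452-1324",
  "Smith, Matt, 456 Tardis St., 555-326-4567",
  "Tennant, David, 567 Torchwood Ave., 555-563-8974"
)

# 1. Phone numbers: three digits, three digits, four digits, dash-separated
phone_re <- "[0-9]{3}-[0-9]{3}-[0-9]{4}"
phones <- regmatches(rows, regexpr(phone_re, rows))

# 2. Addresses: the third comma-separated field of each row
fields <- strsplit(rows, ", ")
addresses <- sapply(fields, `[`, 3)

# 3. Addresses ending in "Street" or the abbreviation "St."
street_re <- "(Street|St\\.)$"
on_street <- addresses[grepl(street_re, addresses)]

# 4. "Last, First, ..." becomes "First Last" via two capture groups
full_names <- sub("^([^,]+), ([^,]+),.*$", "\\2 \\1", rows)

print(phones); print(addresses); print(on_street); print(full_names)
```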
  • 13. So how would you write the regular expression to match a calendar date in format “mm/dd/yyyy” or “mm.dd.yyyy”?
  • 14. Regex to identify date format?
      What's wrong with "[0-9]{2}(\\.|/)[0-9]{2}(\\.|/)[0-9]{4}"?
      Or with "[1-12](\\.|/)[1-31](\\.|/)[0001-9999]"?
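A quick experiment suggests what is wrong with each candidate. The patterns are written here with the escaped separator `(\\.|/)`, and the test strings are invented. The first pattern is too permissive; the second misreads bracket expressions, which list single characters rather than numeric ranges.

```r
# Candidate 1: two digits, separator, two digits, separator, four digits
loose_re <- "[0-9]{2}(\\.|/)[0-9]{2}(\\.|/)[0-9]{4}"
# Accepts an impossible date and mixed separators as well as a real date
loose_hits <- grepl(loose_re, c("10/05/2011", "99/99/9999", "10/05.2011"))

# Candidate 2: [1-12] is the set {1, 2}, not the numbers 1 through 12
range_re <- "[1-12](\\.|/)[1-31](\\.|/)[0001-9999]"
# Matches a nonsense single-digit string and misses a real date
range_hits <- grepl(range_re, c("1/3/7", "10/05/2011"))

print(loose_hits)
print(range_hits)
```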
  • 15. Dates are not an easy problem because they are not a simple text pattern.
      Best bet is to validate the textual pattern (mm.dd.yyyy) and then pass to a separate function to validate the date (leap years, odd days in month, etc.)
      "^(1[0-2]|0[1-9])(\\.|/)(3[0-1]|[1-2][0-9]|0[1-9])(\\.|/)([0-9]{4})$"
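The two-step approach described on this slide might look like the sketch below. The helper name `is_valid_date` is hypothetical, and using `as.Date()` for the calendar check is an assumption; the slide only says to pass the value to a separate function.

```r
# Textual shape check: mm and dd in range, "." or "/" separators, 4-digit year
date_re <- "^(1[0-2]|0[1-9])(\\.|/)(3[0-1]|[1-2][0-9]|0[1-9])(\\.|/)([0-9]{4})$"

# Hypothetical helper: regex for the shape, as.Date() for calendar validity
is_valid_date <- function(x) {
  shape_ok <- grepl(date_re, x)
  # Normalise "." to "/" so a single format string covers both styles
  normalised <- gsub("\\.", "/", x)
  parsed <- as.Date(normalised, format = "%m/%d/%Y")
  # as.Date() yields NA for impossible dates such as 29 February 2011
  shape_ok & !is.na(parsed)
}

checks <- is_valid_date(c("02/29/2012", "02/29/2011", "13/01/2011", "10.05.2011"))
print(checks)
```

The regex alone would accept "02/29/2011"; the calendar check is what rejects it.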
  • 16. Supported flavors of regex in R • POSIX 1003.2 • Perl. Perl is the more robust of the two. POSIX has a few idiosyncrasies handling '\' that may trip you up.
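As a small illustration of what the Perl flavor adds, the sketch below uses two features that require `perl = TRUE` in base R: case-conversion escapes in the replacement string and lookahead. The input strings are invented; the default (POSIX-style) engine is not expected to accept the lookahead syntax.

```r
# \\U and \\L in the replacement change case; only honoured with perl = TRUE
title_case <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "houston r users", perl = TRUE)

# Lookahead: match "St" only when a literal period follows, without consuming it
has_abbrev <- grepl("St(?=\\.)", c("Unit St.", "Unit Street"), perl = TRUE)

print(title_case)
print(has_abbrev)
```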
  • 17. Usage Patterns • Data validation • String replace on dataset • String identify in dataset (subset of data) • Pattern arithmetic (how prevalent is string in data?) • Error prevention/detection
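A brief sketch of three of these usage patterns (validation, subsetting, and "pattern arithmetic") with base functions. The sample rows and variable names are assumptions for illustration.

```r
# Hypothetical sample rows
rows <- c(
  "Baker, Tom, 123 Unit St., 555-452-1324",
  "Smith, Matt, 456 Tardis St., 555-326-4567",
  "Tennant, David, 567 Torchwood Ave., 555-563-8974"
)

# Data validation: does every row end in a well-formed phone number?
all_have_phone <- all(grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}$", rows))

# String identification: keep only the rows that mention "St."
st_rows <- rows[grepl("St\\.", rows)]

# Pattern arithmetic: share of rows containing "St."
st_share <- mean(grepl("St\\.", rows))

# Pattern arithmetic: number of occurrences inside one string
hits <- gregexpr("duck", "Duck, duck, duck, goose", ignore.case = TRUE)[[1]]
n_ducks <- if (hits[1] == -1) 0L else length(hits)  # -1 signals no match

print(all_have_phone); print(st_rows); print(st_share); print(n_ducks)
```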
  • 18. The R regex functions
      Note: all functions are in the base package
      Function     Purpose
      strsplit()   breaks apart strings at predefined points
      grep()       returns a vector of indices where a pattern is matched
      grepl()      returns a logical vector (TRUE/FALSE) for each element of the data
      sub()        replaces one pattern with another at the first matching location
      gsub()       replaces one pattern with another at every matching location
      regexpr()    returns an integer vector giving the starting position of the first match, along with a match.length attribute giving the length of the matched text
      gregexpr()   returns an integer vector giving the starting positions of all matches, along with a match.length attribute giving the length of the matched text
  • 19. strsplit( )
      Definition:
        strsplit(x, split, fixed=FALSE, perl=FALSE, useBytes=FALSE)
      Example:
        str <- "This is some dummy data to parse x785y8099"
        strsplit(str, "[ xy]", perl=TRUE)
      Result:
        [[1]]
        [1] "This" "is" "some" "dumm" "" "data" "to" "parse" ""
        [10] "785" "8099"
  • 20. grep( )
      Definition:
        grep(pattern, x, ignore.case=FALSE, perl=FALSE, value=FALSE, fixed=FALSE, useBytes=FALSE, invert=FALSE)
      Example:
        str <- "This is some dummy data to parse x785y8099"
        grep("[a-z][0-9]{3}[a-z][0-9]{4}", str, perl=TRUE, value=TRUE)
      Result:
        [1] "This is some dummy data to parse x785y8099"
  • 21. grepl( )
      Definition:
        grepl(pattern, x, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)
      Example:
        str <- "This is some dummy data to parse x785y8099"
        grepl("[a-z][0-9]{3}[a-z][0-9]{4}", str, perl=TRUE)
      Result:
        [1] TRUE
  • 22. sub( )
      Definition:
        sub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)
      Example:
        str <- "This is some dummy data to parse x785y8099"
        sub("dummy(.* )([a-z][0-9]{3}).([0-9]{4})", "awesome\\1\\2H\\3", str, perl=TRUE)
      Result:
        [1] "This is some awesome data to parse x785H8099"
  • 23. gsub( )
      Definition:
        gsub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)
      Example:
        str <- "This is some dummy data to parse x785y8099 you dummy"
        gsub("dummy", "awesome", str, perl=TRUE)
      Result:
        [1] "This is some awesome data to parse x785y8099 you awesome"
  • 24. regexpr( )
      Definition:
        regexpr(pattern, text, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)
      Example:
        duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"
        regexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE)
      Result:
        [1] 1
        attr(,"match.length")
        [1] 4
  • 25. gregexpr( )
      Definition:
        gregexpr(pattern, text, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE)
      Example:
        duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"
        gregexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE)
      Result:
        [[1]]
        [1] 1 7 13 26 32 45 51
        attr(,"match.length")
        [1] 4 4 4 4 4 4 4
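The positions and lengths returned by `regexpr()` and `gregexpr()` are usually a means to an end. A sketch of turning them back into text, using base R's `regmatches()` (not covered in the slides) and plain `substring()`:

```r
duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck"

# All matches: regmatches() pairs the text with the gregexpr() result
m_all <- gregexpr("duck", duckgoose, ignore.case = TRUE, perl = TRUE)
ducks <- regmatches(duckgoose, m_all)[[1]]

# First match only: rebuild it from the start position and match.length
m_first <- regexpr("goose", duckgoose)
first_goose <- substring(duckgoose, m_first,
                         m_first + attr(m_first, "match.length") - 1)

print(ducks)
print(first_goose)
```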
  • 26. Problem Solving & Debugging • Remember that regexes are greedy by default. They will try to grab the largest matching string possible unless constrained. • Dummy data - small datasets • Unit testing - testthat, etc. • Build up regex complexity incrementally
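Greedy matching is easiest to see on a tiny dummy string. The sketch below (the markup string is invented) contrasts the greedy default with two common ways of constraining it: a lazy quantifier, shown here with `perl = TRUE`, and a negated character class.

```r
# Small dummy input with two tag pairs
html <- "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" runs to the LAST ">" so the whole string is consumed
greedy <- sub("<.*>", "", html)

# Lazy: ".*?" stops at the first ">" it can
lazy <- sub("<.*?>", "", html, perl = TRUE)

# Negated class: "anything but >" cannot run past the first closing bracket
negated <- sub("<[^>]*>", "", html)

print(greedy); print(lazy); print(negated)
```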
  • 27. Best Practices for Regex in R • Store regex string as variable to pass to function • Try to make the regex as exact as possible (avoid lazy matching) • Pick one type of regex syntax and stick with it (POSIX or Perl) • Document all regexes in code with liberal comments • Use cat() to verify regex string • Test, test, and test some more
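A short sketch of the first, fourth and fifth practices together: the pattern lives in one commented variable, and `cat()` shows the string as the regex engine will see it, with R's string escaping removed. The pattern and variable name are illustrative assumptions.

```r
# US-style phone number: 3 digits, dash, 3 digits, dash, 4 digits; whole string
phone_re <- "^\\d{3}-\\d{3}-\\d{4}$"

# cat() prints the pattern with single backslashes, i.e. what the engine receives
cat(phone_re, "\n")

# Reuse the same variable everywhere the pattern is needed
phone_ok <- grepl(phone_re, c("555-452-1324", "5554521324"), perl = TRUE)
print(phone_ok)
```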
  • 28. Regex Workflow • Define initial data pattern • Define desired data pattern • Define transformation steps • Incremental iteration to desired regex • Testing & QA
  • 29. Regex Resources • http://regexpal.com/ - online regex tester • Data Manipulation with R. Spector, Phil. Springer, 2008. • Regular Expression Cheat Sheet. http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/ • Regular Expressions Cookbook. Goyvaerts, Jan and Levithan, Steven. O'Reilly, 2009. • Mastering Regular Expressions. Friedl, Jeffrey E.F. O'Reilly, 2006. • Twitter: @RegexTip - regex tips and tricks