SlideShare a Scribd company logo
Regular Expressions in
          R
      Houston R Users Group
             10.05.2011
            Ed Goodwin
       twitter: @egoodwintx
What is a Regular
   Expression?

Regexes are an extremely flexible tool for
finding and replacing text. They can easily
  be applied globally across a document,
dataset, or specifically to individual strings.
Example
Data
  LastName, FirstName, Address, Phone
  Baker, Tom, 123 Unit St., 555-452-1324
  Smith, Matt, 456 Tardis St., 555-326-4567
  Tennant, David, 567 Torchwood Ave., 555-563-8974


Regular Expression to Convert “St.” to “Street”
  gsub(“St.”, “Street”, data[i])


  *Note the double-slash “” to escape the ‘.’
Benefits of Regex
• Flexible (can be applied globally or
  specifically across data)
• Terse (very powerful scripting template)
• Portable (sort of) across languages
• Rich history
Disadvantages of regex
• Non-intuitive
• Easy to make errors (unintended
  consequences)
• Difficult to robustly debug
• Various flavors may cause portability issues.
Why do this in R?
• Easier to locate all code in one place
• (Relatively) Robust regex tools
• May be the only tool available
• Familiarity
Other alternatives?
• Perl
• Python
• Java
• Ruby
• Others (grep, sed, awk, bash, csh, ksh, etc.)
Components of a
   Regular Expression
• Characters
• Metacharacters
• Character classes
The R regex functions
          Function                                             Purpose
                                               breaks apart strings at predefined points
strsplit()
                                               returns a vector of indices where a
grep()                                         pattern is matched
                                               returns a logical vector (TRUE/FALSE)
grepl()                                        for each element of the data
                                               replaces one pattern with another at
sub()                                          first matching location
                                               replaces one pattern with another at
gsub()                                         every matching location
                                               returns an integer vector giving the starting position of
regexpr()                                      the first match, along with a match.length attribute
                                               giving the length of the matched text.

                                               returns an integer vector giving the starting position of
gregexpr()                                     the all matches, along with a match.length attribute
                                               giving the length of the matched text.

 Note: all functions are in the base package
Metacharacter Symbols
     Modifier                                                    Meaning
           ^                               anchors expression to beginning of target
           $                                   anchors expression to end of target
           .                             matches any single character except newline
           |                                       separates alternative patterns
           []                                accepts any of the enclosed characters
          [^]                   accepts any characters but the ones enclosed in brackets
           ()                     groups patterns together for assignment or constraint
           *                     matches zero or more occurrences of preceding entity
           ?                       matches zero or one occurrences of preceding entity
           +                      matches one or more occurrences of preceding entity
          {n}                        matches exactly n occurrences of preceding entity
         {n,}                        matches at least n occurrences of preceding entity
        {n,m}                         matches n to m occurrences of preceding entity
                                        interpret succeeding character as literal
Source: “Data Manipulation with R”. Spector, Phil. Springer, 2008. page 92.
Examples
     [A-Za-z]+                matches one or more alphabetic characters


         .*             matches zero or more of any character up to the newline


      .*.*          matches zero or more characters followed by a literal .*


      (July? )             Accept ‘Jul’ or ‘July’ but not ‘Julyy’. Note the space.


     (abc|123)                             Match “abc” or “123”


     [abc|123]                  Match a, b, c, 1, 2 or 3.The ‘|’ is extraneous.

                          Matches lines starting with “From:” or “Subject:” or
^(From|Subject|Date):                            “Date:”
Let’s work through some examples...
Data

LastName, FirstName, Address, Phone
Baker, Tom, 123 Unit St., 555-452-1324
Smith, Matt, 456 Tardis St., 555-326-4567
Tennant, David, 567 Torchwood Ave., 555-563-8974


1. Locate all phone numbers.
2. Locate all addresses.
3. Locate all addresses ending in ‘Street’ (including
abbreviations).
4. Read in full names, reverse the order and remove
the comma.
So how would you write the regular
 expression to match a calendar date in
format “mm/dd/yyyy” or “mm.dd.yyyy”?
Regex to identify date
      format?
What’s wrong with
“[0-9]{2}(.|/)[0-9]{2}(.|/)[0-9]{4}”    ?
Or with
“[1-12](.|/)[1-31](.|/)[0001-9999]” ?
Dates are not an easy problem
because they are not a simple text
             pattern

Best bet is to validate the textual pattern
(mm.dd.yyyy) and then pass to a separate
function to validate the date (leap years, odd
days in month, etc.)
“^(1[0-2]|0[1-9])(.|/)(3[0-1]|[1-2][0-9]|0[1-9])(.|/)
([0-9]{4})$”
Supported flavors of
     regex in R
• POSIX 1003.2
• Perl
Perl is the more robust of the two. POSIX
has a few idiosyncracies handling ‘’ that may
trip you up.
Usage Patterns
• Data validation
• String replace on dataset
• String identify in dataset (subset of data)
• Pattern arithmetic (how prevalent is string
  in data?)
• Error prevention/detection
The R regex functions
          Function                                             Purpose
                                               breaks apart strings at predefined points
strsplit()
                                               returns a vector of indices where a
grep()                                         pattern is matched
                                               returns a logical vector (TRUE/FALSE)
grepl()                                        for each element of the data
                                               replaces one pattern with another at
sub()                                          first matching location
                                               replaces one pattern with another at
gsub()                                         every matching location
                                               returns an integer vector giving the starting position of
regexpr()                                      the first match, along with a match.length attribute
                                               giving the length of the matched text.

                                               returns an integer vector giving the starting position of
gregexpr()                                     the all matches, along with a match.length attribute
                                               giving the length of the matched text.

 Note: all functions are in the base package
strsplit( )
Definition:
 strsplit(x, split, fixed=FALSE, perl=FALSE, useBytes=FALSE)

Example:

str <- “This is some dummy data to parse x785y8099”
strsplit(str, “[ xy]”, perl=TRUE)

Result:
[[1]]
 [1] "This"   "is"      "some"   "dumm"   ""   "data"   "to"
"parse" ""
[10] "785"    "8099"
grep( )
Definition:
grep(pattern, x, ignore.case=FALSE, perl=FALSE, value=FALSE,
            fixed = FALSE, useBytes = FALSE, invert = FALSE)

Example:

str <- “This is some dummy data to parse x785y8099”
grep(“[a-z][0-9]{3}[a-z][0-9]{4}”, str, perl=TRUE,
value=TRUE)

Result:
[1] "This is some dummy data to parse x785y8099"
grepl( )
Definition:
grepl(pattern, x, ignore.case=FALSE, perl=FALSE,
value=FALSE,fixed = FALSE, useBytes = FALSE, invert = FALSE)

Example:

str <- “This is some dummy data to parse x785y8099”
grepl(“[a-z][0-9]{3}[a-z][0-9]{4}”, str, perl=TRUE)

Result:

[1] TRUE
sub( )
Definition:
sub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE,
fixed=FALSE, useBytes=FALSE)

Example:

str <- “This is some dummy data to parse x785y8099”
sub("dummy(.* )([a-z][0-9]{3}).([0-9]{4})",
"awesome12H3", str, perl=TRUE)

Result:
[1] "This is some awesome data to parse x785H8099"
gsub( )
Definition:
gsub(pattern, replacement, x, ignore.case=FALSE,
perl=FALSE,fixed=FALSE, useBytes=FALSE)

Example:

str <- “This is some dummy data to parse x785y8099 you
dummy”
gsub(“dummy”, “awesome”, perl=TRUE)

Result:
[1] "This is some awesome data to parse x785y8099 you
awesome"
regexpr( )
Definition:
regexpr(pattern, text, ignore.case=FALSE, perl=FALSE,
fixed = FALSE, useBytes = FALSE)

Example:

duckgoose <- "Duck, duck, duck, goose, duck, duck, goose,
duck, duck"

regexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE)

Result:
[1] 1
attr(,"match.length")
[1] 4
gregexpr( )
Definition:
gregexpr(pattern, text, ignore.case=FALSE, perl=FALSE,
fixed=FALSE, useBytes=FALSE)

Example:

duckgoose <- "Duck, duck, duck, goose, duck, duck, goose,
duck, duck"

regexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE)

Result:
[[1]]
[1] 1 7 13 26 32 45 51
attr(,"match.length")
[1] 4 4 4 4 4 4 4
Problem Solving &
       Debugging
• Remember that regexes are greedy by
  default. They will try to grab the largest
  matching string possible unless constrained.
• Dummy data - small datasets
• Unit testing - testthis, etc.
• Build up regex complexity incrementally
Best Practices for
         Regex in R
•   Store regex string as variable to pass to function
•   Try to make regex expression as exact as possible
    (avoid lazy matching)
•   Pick one type of regex syntax and stick with it
    (POSIX or Perl)
•   Document all regexes in code with liberal comments
•   use cat() to verify regex string
•   Test, test, and test some more
Regex Workflow

• Define initial data pattern
• Define desired data pattern
• Define transformation steps
• Incremental iteration to desired regex
• Testing & QA
Regex Resources
•   http://regexpal.com/ - online regex tester
•   Data Manipulation with R. Spector, Phil. Springer, 2008.
•   Regular Expression Cheat Sheet. http://
    www.addedbytes.com/cheat-sheets/regular-expressions-
    cheat-sheet/
•   Regular Expressions Cookbook. Goyvaerts, Jan and
    Levithan, Steven. O’Reilly, 2009.
•   Mastering Regular Expressions. Friedl, Jeffrey E.F. O’Reilly,
    2006.
•   Twitter: @RegexTip - regex tips and tricks

More Related Content

What's hot

Textpad and Regular Expressions
Textpad and Regular ExpressionsTextpad and Regular Expressions
Textpad and Regular Expressions
OCSI
 
Array and functions
Array and functionsArray and functions
Array and functions
Sun Technlogies
 
Common fixed point theorems of integral type in menger pm spaces
Common fixed point theorems of integral type in menger pm spacesCommon fixed point theorems of integral type in menger pm spaces
Common fixed point theorems of integral type in menger pm spaces
Alexander Decker
 
Andrei's Regex Clinic
Andrei's Regex ClinicAndrei's Regex Clinic
Andrei's Regex Clinic
Andrei Zmievski
 
2.regular expressions
2.regular expressions2.regular expressions
2.regular expressions
Praveen Gorantla
 
Java căn bản - Chapter9
Java căn bản - Chapter9Java căn bản - Chapter9
Java căn bản - Chapter9
Vince Vo
 
Strong convergence of an algorithm about strongly quasi nonexpansive mappings
Strong convergence of an algorithm about strongly quasi nonexpansive mappingsStrong convergence of an algorithm about strongly quasi nonexpansive mappings
Strong convergence of an algorithm about strongly quasi nonexpansive mappings
Alexander Decker
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
Thomas Langston
 
Deriving the Y Combinator
Deriving the Y CombinatorDeriving the Y Combinator
Deriving the Y Combinator
Yuta Okazaki
 
Regex Basics
Regex BasicsRegex Basics
Regex Basics
Jeremy Coates
 
Python advanced 2. regular expression in python
Python advanced 2. regular expression in pythonPython advanced 2. regular expression in python
Python advanced 2. regular expression in python
John(Qiang) Zhang
 
Bd32360363
Bd32360363Bd32360363
Bd32360363
IJERA Editor
 
Python - Regular Expressions
Python - Regular ExpressionsPython - Regular Expressions
Python - Regular Expressions
Mukesh Tekwani
 
Java DSLs with Xtext
Java DSLs with XtextJava DSLs with Xtext
Java DSLs with Xtext
Dr. Jan Köhnlein
 
Continuation calculus at Term Rewriting Seminar
Continuation calculus at Term Rewriting SeminarContinuation calculus at Term Rewriting Seminar
Continuation calculus at Term Rewriting Seminar
bgeron
 
Cwkaa 2010
Cwkaa 2010Cwkaa 2010
Cwkaa 2010
Sam Neaves
 
Master of Computer Application (MCA) – Semester 4 MC0080
Master of Computer Application (MCA) – Semester 4  MC0080Master of Computer Application (MCA) – Semester 4  MC0080
Master of Computer Application (MCA) – Semester 4 MC0080
Aravind NC
 
Processing Regex Python
Processing Regex PythonProcessing Regex Python
Processing Regex Python
primeteacher32
 
Quiz 1 solution
Quiz 1 solutionQuiz 1 solution
Quiz 1 solution
Gopi Saiteja
 
H-MLQ
H-MLQH-MLQ

What's hot (20)

Textpad and Regular Expressions
Textpad and Regular ExpressionsTextpad and Regular Expressions
Textpad and Regular Expressions
 
Array and functions
Array and functionsArray and functions
Array and functions
 
Common fixed point theorems of integral type in menger pm spaces
Common fixed point theorems of integral type in menger pm spacesCommon fixed point theorems of integral type in menger pm spaces
Common fixed point theorems of integral type in menger pm spaces
 
Andrei's Regex Clinic
Andrei's Regex ClinicAndrei's Regex Clinic
Andrei's Regex Clinic
 
2.regular expressions
2.regular expressions2.regular expressions
2.regular expressions
 
Java căn bản - Chapter9
Java căn bản - Chapter9Java căn bản - Chapter9
Java căn bản - Chapter9
 
Strong convergence of an algorithm about strongly quasi nonexpansive mappings
Strong convergence of an algorithm about strongly quasi nonexpansive mappingsStrong convergence of an algorithm about strongly quasi nonexpansive mappings
Strong convergence of an algorithm about strongly quasi nonexpansive mappings
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Deriving the Y Combinator
Deriving the Y CombinatorDeriving the Y Combinator
Deriving the Y Combinator
 
Regex Basics
Regex BasicsRegex Basics
Regex Basics
 
Python advanced 2. regular expression in python
Python advanced 2. regular expression in pythonPython advanced 2. regular expression in python
Python advanced 2. regular expression in python
 
Bd32360363
Bd32360363Bd32360363
Bd32360363
 
Python - Regular Expressions
Python - Regular ExpressionsPython - Regular Expressions
Python - Regular Expressions
 
Java DSLs with Xtext
Java DSLs with XtextJava DSLs with Xtext
Java DSLs with Xtext
 
Continuation calculus at Term Rewriting Seminar
Continuation calculus at Term Rewriting SeminarContinuation calculus at Term Rewriting Seminar
Continuation calculus at Term Rewriting Seminar
 
Cwkaa 2010
Cwkaa 2010Cwkaa 2010
Cwkaa 2010
 
Master of Computer Application (MCA) – Semester 4 MC0080
Master of Computer Application (MCA) – Semester 4  MC0080Master of Computer Application (MCA) – Semester 4  MC0080
Master of Computer Application (MCA) – Semester 4 MC0080
 
Processing Regex Python
Processing Regex PythonProcessing Regex Python
Processing Regex Python
 
Quiz 1 solution
Quiz 1 solutionQuiz 1 solution
Quiz 1 solution
 
H-MLQ
H-MLQH-MLQ
H-MLQ
 

Similar to regex-presentation_ed_goodwin

Eag 201110-hrugregexpresentation-111006104128-phpapp02
Eag 201110-hrugregexpresentation-111006104128-phpapp02Eag 201110-hrugregexpresentation-111006104128-phpapp02
Eag 201110-hrugregexpresentation-111006104128-phpapp02
egoodwintx
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
Raj Gupta
 
Regular_Expressions.pptx
Regular_Expressions.pptxRegular_Expressions.pptx
Regular_Expressions.pptx
DurgaNayak4
 
Regular expressions in Python
Regular expressions in PythonRegular expressions in Python
Regular expressions in Python
Sujith Kumar
 
Maxbox starter20
Maxbox starter20Maxbox starter20
Maxbox starter20
Max Kleiner
 
Regular expressions in oracle
Regular expressions in oracleRegular expressions in oracle
Regular expressions in oracle
Logan Palanisamy
 
Text Mining using Regular Expressions
Text Mining using Regular ExpressionsText Mining using Regular Expressions
Text Mining using Regular Expressions
Rupak Roy
 
Adv. python regular expression by Rj
Adv. python regular expression by RjAdv. python regular expression by Rj
Adv. python regular expression by Rj
Shree M.L.Kakadiya MCA mahila college, Amreli
 
arrays.pptx
arrays.pptxarrays.pptx
arrays.pptx
SachinBhosale73
 
Regular Expressions Cheat Sheet
Regular Expressions Cheat SheetRegular Expressions Cheat Sheet
Regular Expressions Cheat Sheet
Akash Bisariya
 
11. using regular expressions with oracle database
11. using regular expressions with oracle database11. using regular expressions with oracle database
11. using regular expressions with oracle database
Amrit Kaur
 
Intoduction to php strings
Intoduction to php  stringsIntoduction to php  strings
Regular Expressions in PHP, MySQL by programmerblog.net
Regular Expressions in PHP, MySQL by programmerblog.netRegular Expressions in PHP, MySQL by programmerblog.net
Regular Expressions in PHP, MySQL by programmerblog.net
Programmer Blog
 
Module 3 - Regular Expressions, Dictionaries.pdf
Module 3 - Regular  Expressions,  Dictionaries.pdfModule 3 - Regular  Expressions,  Dictionaries.pdf
Module 3 - Regular Expressions, Dictionaries.pdf
GaneshRaghu4
 
PHP Web Programming
PHP Web ProgrammingPHP Web Programming
PHP Web Programming
Muthuselvam RS
 
Underscore.js
Underscore.jsUnderscore.js
Underscore.js
timourian
 
Unit 1-array,lists and hashes
Unit 1-array,lists and hashesUnit 1-array,lists and hashes
Unit 1-array,lists and hashes
sana mateen
 
Regular expressionfunction
Regular expressionfunctionRegular expressionfunction
Regular expressionfunction
ADARSH BHATT
 
SQL for pattern matching (Oracle 12c)
SQL for pattern matching (Oracle 12c)SQL for pattern matching (Oracle 12c)
SQL for pattern matching (Oracle 12c)
Logan Palanisamy
 
3 Data Structure in R
3 Data Structure in R3 Data Structure in R
3 Data Structure in R
Dr Nisha Arora
 

Similar to regex-presentation_ed_goodwin (20)

Eag 201110-hrugregexpresentation-111006104128-phpapp02
Eag 201110-hrugregexpresentation-111006104128-phpapp02Eag 201110-hrugregexpresentation-111006104128-phpapp02
Eag 201110-hrugregexpresentation-111006104128-phpapp02
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Regular_Expressions.pptx
Regular_Expressions.pptxRegular_Expressions.pptx
Regular_Expressions.pptx
 
Regular expressions in Python
Regular expressions in PythonRegular expressions in Python
Regular expressions in Python
 
Maxbox starter20
Maxbox starter20Maxbox starter20
Maxbox starter20
 
Regular expressions in oracle
Regular expressions in oracleRegular expressions in oracle
Regular expressions in oracle
 
Text Mining using Regular Expressions
Text Mining using Regular ExpressionsText Mining using Regular Expressions
Text Mining using Regular Expressions
 
Adv. python regular expression by Rj
Adv. python regular expression by RjAdv. python regular expression by Rj
Adv. python regular expression by Rj
 
arrays.pptx
arrays.pptxarrays.pptx
arrays.pptx
 
Regular Expressions Cheat Sheet
Regular Expressions Cheat SheetRegular Expressions Cheat Sheet
Regular Expressions Cheat Sheet
 
11. using regular expressions with oracle database
11. using regular expressions with oracle database11. using regular expressions with oracle database
11. using regular expressions with oracle database
 
Intoduction to php strings
Intoduction to php  stringsIntoduction to php  strings
Intoduction to php strings
 
Regular Expressions in PHP, MySQL by programmerblog.net
Regular Expressions in PHP, MySQL by programmerblog.netRegular Expressions in PHP, MySQL by programmerblog.net
Regular Expressions in PHP, MySQL by programmerblog.net
 
Module 3 - Regular Expressions, Dictionaries.pdf
Module 3 - Regular  Expressions,  Dictionaries.pdfModule 3 - Regular  Expressions,  Dictionaries.pdf
Module 3 - Regular Expressions, Dictionaries.pdf
 
PHP Web Programming
PHP Web ProgrammingPHP Web Programming
PHP Web Programming
 
Underscore.js
Underscore.jsUnderscore.js
Underscore.js
 
Unit 1-array,lists and hashes
Unit 1-array,lists and hashesUnit 1-array,lists and hashes
Unit 1-array,lists and hashes
 
Regular expressionfunction
Regular expressionfunctionRegular expressionfunction
Regular expressionfunction
 
SQL for pattern matching (Oracle 12c)
SQL for pattern matching (Oracle 12c)SQL for pattern matching (Oracle 12c)
SQL for pattern matching (Oracle 12c)
 
3 Data Structure in R
3 Data Structure in R3 Data Structure in R
3 Data Structure in R
 

More from schamber

Poster
PosterPoster
Poster
schamber
 
Poster
PosterPoster
Poster
schamber
 
Chamberlain PhD Thesis
Chamberlain PhD ThesisChamberlain PhD Thesis
Chamberlain PhD Thesis
schamber
 
Phylogenetics in R
Phylogenetics in RPhylogenetics in R
Phylogenetics in R
schamber
 
Web data from R
Web data from RWeb data from R
Web data from R
schamber
 
R Introduction
R IntroductionR Introduction
R Introduction
schamber
 

More from schamber (6)

Poster
PosterPoster
Poster
 
Poster
PosterPoster
Poster
 
Chamberlain PhD Thesis
Chamberlain PhD ThesisChamberlain PhD Thesis
Chamberlain PhD Thesis
 
Phylogenetics in R
Phylogenetics in RPhylogenetics in R
Phylogenetics in R
 
Web data from R
Web data from RWeb data from R
Web data from R
 
R Introduction
R IntroductionR Introduction
R Introduction
 

Recently uploaded

Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
shanihomely
 
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision MakingConnector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
DianaGray10
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
Jimmy Lai
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Nicolás Lopéz
 
Types of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technologyTypes of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technology
ldtexsolbl
 
Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10
ankush9927
 
Opencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of MünsterOpencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of Münster
Matthias Neugebauer
 
Acumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptxAcumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptx
BrainSell Technologies
 
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptxIntroduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
313mohammedarshad
 
Step-By-Step Process to Develop a Mobile App From Scratch
Step-By-Step Process to Develop a Mobile App From ScratchStep-By-Step Process to Develop a Mobile App From Scratch
Step-By-Step Process to Develop a Mobile App From Scratch
softsuave
 
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
Torry Harris
 
Mastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for SuccessMastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for Success
David Wilson
 
Data Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining DataData Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining Data
Safe Software
 
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptxDublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Kunal Gupta
 
Integrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecaseIntegrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecase
shyamraj55
 
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
alexjohnson7307
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
maigasapphire
 
The Impact of the Internet of Things (IoT) on Smart Homes and Cities
The Impact of the Internet of Things (IoT) on Smart Homes and CitiesThe Impact of the Internet of Things (IoT) on Smart Homes and Cities
The Impact of the Internet of Things (IoT) on Smart Homes and Cities
Arpan Buwa
 
Semantic-Aware Code Model: Elevating the Future of Software Development
Semantic-Aware Code Model: Elevating the Future of Software DevelopmentSemantic-Aware Code Model: Elevating the Future of Software Development
Semantic-Aware Code Model: Elevating the Future of Software Development
Baishakhi Ray
 
(CISOPlatform Summit & SACON 2024) Gen AI & Deepfake In Overall Security.pdf
(CISOPlatform Summit & SACON 2024) Gen AI & Deepfake In Overall Security.pdf(CISOPlatform Summit & SACON 2024) Gen AI & Deepfake In Overall Security.pdf
(CISOPlatform Summit & SACON 2024) Gen AI & Deepfake In Overall Security.pdf
Priyanka Aash
 

Recently uploaded (20)

Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
Premium Girls Call Mumbai 9920725232 Unlimited Short Providing Girls Service ...
 
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision MakingConnector Corner: Leveraging Snowflake Integration for Smarter Decision Making
Connector Corner: Leveraging Snowflake Integration for Smarter Decision Making
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
 
Types of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technologyTypes of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technology
 
Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10
 
Opencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of MünsterOpencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of Münster
 
Acumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptxAcumatica vs. Sage Intacct _Construction_July (1).pptx
Acumatica vs. Sage Intacct _Construction_July (1).pptx
 
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptxIntroduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
 
Step-By-Step Process to Develop a Mobile App From Scratch
Step-By-Step Process to Develop a Mobile App From ScratchStep-By-Step Process to Develop a Mobile App From Scratch
Step-By-Step Process to Develop a Mobile App From Scratch
 
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...Evolution of iPaaS - simplify IT workloads to provide a unified view of  data...
Evolution of iPaaS - simplify IT workloads to provide a unified view of data...
 
Mastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for SuccessMastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for Success
 
Data Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining DataData Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining Data
 
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptxDublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
 
Integrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecaseIntegrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecase
 
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
 
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
Girls Call Churchgate 9910780858 Provide Best And Top Girl Service And No1 in...
 
The Impact of the Internet of Things (IoT) on Smart Homes and Cities
The Impact of the Internet of Things (IoT) on Smart Homes and CitiesThe Impact of the Internet of Things (IoT) on Smart Homes and Cities
The Impact of the Internet of Things (IoT) on Smart Homes and Cities
 
Semantic-Aware Code Model: Elevating the Future of Software Development
Semantic-Aware Code Model: Elevating the Future of Software DevelopmentSemantic-Aware Code Model: Elevating the Future of Software Development
Semantic-Aware Code Model: Elevating the Future of Software Development
 
(CISOPlatform Summit & SACON 2024) Gen AI & Deepfake In Overall Security.pdf
(CISOPlatform Summit & SACON 2024) Gen AI & Deepfake In Overall Security.pdf(CISOPlatform Summit & SACON 2024) Gen AI & Deepfake In Overall Security.pdf
(CISOPlatform Summit & SACON 2024) Gen AI & Deepfake In Overall Security.pdf
 

regex-presentation_ed_goodwin

  • 1. Regular Expressions in R Houston R Users Group 10.05.2011 Ed Goodwin twitter: @egoodwintx
  • 2. What is a Regular Expression? Regexes are an extremely flexible tool for finding and replacing text. They can easily be applied globally across a document, dataset, or specifically to individual strings.
  • 3. Example Data LastName, FirstName, Address, Phone Baker, Tom, 123 Unit St., 555-452-1324 Smith, Matt, 456 Tardis St., 555-326-4567 Tennant, David, 567 Torchwood Ave., 555-563-8974 Regular Expression to Convert “St.” to “Street” gsub(“St.”, “Street”, data[i]) *Note the double-slash “” to escape the ‘.’
  • 4. Benefits of Regex • Flexible (can be applied globally or specifically across data) • Terse (very powerful scripting template) • Portable (sort of) across languages • Rich history
  • 5. Disadvantages of regex • Non-intuitive • Easy to make errors (unintended consequences) • Difficult to robustly debug • Various flavors may cause portability issues.
  • 6. Why do this in R? • Easier to locate all code in one place • (Relatively) Robust regex tools • May be the only tool available • Familiarity
  • 7. Other alternatives? • Perl • Python • Java • Ruby • Others (grep, sed, awk, bash, csh, ksh, etc.)
  • 8. Components of a Regular Expression • Characters • Metacharacters • Character classes
  • 9. The R regex functions Function Purpose breaks apart strings at predefined points strsplit() returns a vector of indices where a grep() pattern is matched returns a logical vector (TRUE/FALSE) grepl() for each element of the data replaces one pattern with another at sub() first matching location replaces one pattern with another at gsub() every matching location returns an integer vector giving the starting position of regexpr() the first match, along with a match.length attribute giving the length of the matched text. returns an integer vector giving the starting position of gregexpr() the all matches, along with a match.length attribute giving the length of the matched text. Note: all functions are in the base package
  • 10. Metacharacter Symbols Modifier Meaning ^ anchors expression to beginning of target $ anchors expression to end of target . matches any single character except newline | separates alternative patterns [] accepts any of the enclosed characters [^] accepts any characters but the ones enclosed in brackets () groups patterns together for assignment or constraint * matches zero or more occurrences of preceding entity ? matches zero or one occurrences of preceding entity + matches one or more occurrences of preceding entity {n} matches exactly n occurrences of preceding entity {n,} matches at least n occurrences of preceding entity {n,m} matches n to m occurrences of preceding entity interpret succeeding character as literal Source: “Data Manipulation with R”. Spector, Phil. Springer, 2008. page 92.
  • 11. Examples [A-Za-z]+ matches one or more alphabetic characters .* matches zero or more of any character up to the newline .*.* matches zero or more characters followed by a literal .* (July? ) Accept ‘Jul’ or ‘July’ but not ‘Julyy’. Note the space. (abc|123) Match “abc” or “123” [abc|123] Match a, b, c, 1, 2 or 3.The ‘|’ is extraneous. Matches lines starting with “From:” or “Subject:” or ^(From|Subject|Date): “Date:”
  • 12. Let’s work through some examples... Data LastName, FirstName, Address, Phone Baker, Tom, 123 Unit St., 555-452-1324 Smith, Matt, 456 Tardis St., 555-326-4567 Tennant, David, 567 Torchwood Ave., 555-563-8974 1. Locate all phone numbers. 2. Locate all addresses. 3. Locate all addresses ending in ‘Street’ (including abbreviations). 4. Read in full names, reverse the order and remove the comma.
  • 13. So how would you write the regular expression to match a calendar date in format “mm/dd/yyyy” or “mm.dd.yyyy”?
  • 14. Regex to identify date format? What’s wrong with “[0-9]{2}(.|/)[0-9]{2}(.|/)[0-9]{4}” ? Or with “[1-12](.|/)[1-31](.|/)[0001-9999]” ?
  • 15. Dates are not an easy problem because they are not a simple text pattern Best bet is to validate the textual pattern (mm.dd.yyyy) and then pass to a separate function to validate the date (leap years, odd days in month, etc.) “^(1[0-2]|0[1-9])(.|/)(3[0-1]|[1-2][0-9]|0[1-9])(.|/) ([0-9]{4})$”
  • 16. Supported flavors of regex in R • POSIX 1003.2 • Perl Perl is the more robust of the two. POSIX has a few idiosyncracies handling ‘’ that may trip you up.
  • 17. Usage Patterns • Data validation • String replace on dataset • String identify in dataset (subset of data) • Pattern arithmetic (how prevalent is string in data?) • Error prevention/detection
  • 18. The R regex functions Function Purpose breaks apart strings at predefined points strsplit() returns a vector of indices where a grep() pattern is matched returns a logical vector (TRUE/FALSE) grepl() for each element of the data replaces one pattern with another at sub() first matching location replaces one pattern with another at gsub() every matching location returns an integer vector giving the starting position of regexpr() the first match, along with a match.length attribute giving the length of the matched text. returns an integer vector giving the starting position of gregexpr() the all matches, along with a match.length attribute giving the length of the matched text. Note: all functions are in the base package
  • 19. strsplit( ) Definition: strsplit(x, split, fixed=FALSE, perl=FALSE, useBytes=FALSE) Example: str <- “This is some dummy data to parse x785y8099” strsplit(str, “[ xy]”, perl=TRUE) Result: [[1]] [1] "This" "is" "some" "dumm" "" "data" "to" "parse" "" [10] "785" "8099"
  • 20. grep( ) Definition: grep(pattern, x, ignore.case=FALSE, perl=FALSE, value=FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE) Example: str <- “This is some dummy data to parse x785y8099” grep(“[a-z][0-9]{3}[a-z][0-9]{4}”, str, perl=TRUE, value=TRUE) Result: [1] "This is some dummy data to parse x785y8099"
  • 21. grepl( ) Definition: grepl(pattern, x, ignore.case=FALSE, perl=FALSE, value=FALSE,fixed = FALSE, useBytes = FALSE, invert = FALSE) Example: str <- “This is some dummy data to parse x785y8099” grepl(“[a-z][0-9]{3}[a-z][0-9]{4}”, str, perl=TRUE) Result: [1] TRUE
  • 22. sub( ) Definition: sub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE) Example: str <- “This is some dummy data to parse x785y8099” sub("dummy(.* )([a-z][0-9]{3}).([0-9]{4})", "awesome12H3", str, perl=TRUE) Result: [1] "This is some awesome data to parse x785H8099"
  • 23. gsub( ) Definition: gsub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE,fixed=FALSE, useBytes=FALSE) Example: str <- “This is some dummy data to parse x785y8099 you dummy” gsub(“dummy”, “awesome”, perl=TRUE) Result: [1] "This is some awesome data to parse x785y8099 you awesome"
  • 24. regexpr( ) Definition: regexpr(pattern, text, ignore.case=FALSE, perl=FALSE, fixed = FALSE, useBytes = FALSE) Example: duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck" regexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE) Result: [1] 1 attr(,"match.length") [1] 4
  • 25. gregexpr( ) Definition: gregexpr(pattern, text, ignore.case=FALSE, perl=FALSE, fixed=FALSE, useBytes=FALSE) Example: duckgoose <- "Duck, duck, duck, goose, duck, duck, goose, duck, duck" regexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE) Result: [[1]] [1] 1 7 13 26 32 45 51 attr(,"match.length") [1] 4 4 4 4 4 4 4
  • 26. Problem Solving & Debugging • Remember that regexes are greedy by default. They will try to grab the largest matching string possible unless constrained. • Dummy data - small datasets • Unit testing - testthis, etc. • Build up regex complexity incrementally
  • 27. Best Practices for Regex in R • Store regex string as variable to pass to function • Try to make regex expression as exact as possible (avoid lazy matching) • Pick one type of regex syntax and stick with it (POSIX or Perl) • Document all regexes in code with liberal comments • use cat() to verify regex string • Test, test, and test some more
  • 28. Regex Workflow • Define initial data pattern • Define desired data pattern • Define transformation steps • Incremental iteration to desired regex • Testing & QA
  • 29. Regex Resources • http://regexpal.com/ - online regex tester • Data Manipulation with R. Spector, Phil. Springer, 2008. • Regular Expression Cheat Sheet. http:// www.addedbytes.com/cheat-sheets/regular-expressions- cheat-sheet/ • Regular Expressions Cookbook. Goyvaerts, Jan and Levithan, Steven. O’Reilly, 2009. • Mastering Regular Expressions. Friedl, Jeffrey E.F. O’Reilly, 2006. • Twitter: @RegexTip - regex tips and tricks