regex-presentation_ed_goodwin
Upcoming SlideShare
Loading in...5
×
 

regex-presentation_ed_goodwin

on

  • 4,002 views

 

Statistics

Views

Total Views
4,002
Views on SlideShare
2,004
Embed Views
1,998

Actions

Likes
2
Downloads
69
Comments
0

44 Embeds 1,998

http://r-ecology.blogspot.com 793
http://www.r-bloggers.com 684
http://rmecab.jp 181
http://schamberlain.github.com 111
http://feeds.feedburner.com 44
http://venturesindatascience.blogspot.com 43
http://quantboxx.com 21
http://schamberlain.github.io 12
http://venturesindatascience.blogspot.co.uk 10
http://paper.li 10
http://recology.info 9
http://www.hanrss.com 7
http://r-ecology.blogspot.ca 7
http://r-ecology.blogspot.de 7
http://venturesindatascience.blogspot.be 6
http://r-ecology.blogspot.in 5
http://www.netvibes.com 4
http://venturesindatascience.blogspot.in 3
http://venturesindatascience.blogspot.de 3
http://r-ecology.blogspot.co.uk 3
http://web-18.citromail.hu 3
http://recology77.rssing.com 3
http://r-ecology.blogspot.nl 2
http://r-ecology.blogspot.mx 2
http://r-ecology.blogspot.com.br 2
http://venturesindatascience.blogspot.tw 2
http://mail.processtrends.com 2
http://r-ecology.blogspot.it 2
http://webcache.googleusercontent.com 2
http://r-ecology.blogspot.hu 1
http://venturesindatascience.blogspot.it 1
http://r-ecology.blogspot.jp 1
http://r-ecology.blogspot.com.ar 1
http://r-ecology.blogspot.ro 1
http://a0.twimg.com 1
https://sv.deutschebank.digitalsafe.net 1
http://us-w1.rockmelt.com 1
https://ilias-app2.let.ethz.ch 1
http://translate.yandex.net 1
http://r-ecology.blogspot.fr 1
http://r-ecology.blogspot.co.nz 1
http://venturesindatascience.blogspot.ca 1
http://r-ecology.blogspot.com.au 1
http://feedproxy.google.com 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    regex-presentation_ed_goodwin regex-presentation_ed_goodwin Presentation Transcript

    • Regular Expressions in R Houston R Users Group 10.05.2011 Ed Goodwin twitter: @egoodwintx
    • What is a Regular Expression?Regexes are an extremely flexible tool forfinding and replacing text. They can easily be applied globally across a document,dataset, or specifically to individual strings.
    • ExampleData LastName, FirstName, Address, Phone Baker, Tom, 123 Unit St., 555-452-1324 Smith, Matt, 456 Tardis St., 555-326-4567 Tennant, David, 567 Torchwood Ave., 555-563-8974Regular Expression to Convert “St.” to “Street” gsub(“St.”, “Street”, data[i]) *Note the double-slash “” to escape the ‘.’
    • Benefits of Regex• Flexible (can be applied globally or specifically across data)• Terse (very powerful scripting template)• Portable (sort of) across languages• Rich history
    • Disadvantages of regex• Non-intuitive• Easy to make errors (unintended consequences)• Difficult to robustly debug• Various flavors may cause portability issues.
    • Why do this in R?• Easier to locate all code in one place• (Relatively) Robust regex tools• May be the only tool available• Familiarity
    • Other alternatives?• Perl• Python• Java• Ruby• Others (grep, sed, awk, bash, csh, ksh, etc.)
    • Components of a Regular Expression• Characters• Metacharacters• Character classes
    • The R regex functions Function Purpose breaks apart strings at predefined pointsstrsplit() returns a vector of indices where agrep() pattern is matched returns a logical vector (TRUE/FALSE)grepl() for each element of the data replaces one pattern with another atsub() first matching location replaces one pattern with another atgsub() every matching location returns an integer vector giving the starting position ofregexpr() the first match, along with a match.length attribute giving the length of the matched text. returns an integer vector giving the starting position ofgregexpr() the all matches, along with a match.length attribute giving the length of the matched text. Note: all functions are in the base package
    • Metacharacter Symbols Modifier Meaning ^ anchors expression to beginning of target $ anchors expression to end of target . matches any single character except newline | separates alternative patterns [] accepts any of the enclosed characters [^] accepts any characters but the ones enclosed in brackets () groups patterns together for assignment or constraint * matches zero or more occurrences of preceding entity ? matches zero or one occurrences of preceding entity + matches one or more occurrences of preceding entity {n} matches exactly n occurrences of preceding entity {n,} matches at least n occurrences of preceding entity {n,m} matches n to m occurrences of preceding entity interpret succeeding character as literalSource: “Data Manipulation with R”. Spector, Phil. Springer, 2008. page 92.
    • Examples [A-Za-z]+ matches one or more alphabetic characters .* matches zero or more of any character up to the newline .*.* matches zero or more characters followed by a literal .* (July? ) Accept ‘Jul’ or ‘July’ but not ‘Julyy’. Note the space. (abc|123) Match “abc” or “123” [abc|123] Match a, b, c, 1, 2 or 3.The ‘|’ is extraneous. Matches lines starting with “From:” or “Subject:” or^(From|Subject|Date): “Date:”
    • Let’s work through some examples...DataLastName, FirstName, Address, PhoneBaker, Tom, 123 Unit St., 555-452-1324Smith, Matt, 456 Tardis St., 555-326-4567Tennant, David, 567 Torchwood Ave., 555-563-89741. Locate all phone numbers.2. Locate all addresses.3. Locate all addresses ending in ‘Street’ (includingabbreviations).4. Read in full names, reverse the order and removethe comma.
    • So how would you write the regular expression to match a calendar date informat “mm/dd/yyyy” or “mm.dd.yyyy”?
    • Regex to identify date format?What’s wrong with“[0-9]{2}(.|/)[0-9]{2}(.|/)[0-9]{4}” ?Or with“[1-12](.|/)[1-31](.|/)[0001-9999]” ?
    • Dates are not an easy problembecause they are not a simple text patternBest bet is to validate the textual pattern(mm.dd.yyyy) and then pass to a separatefunction to validate the date (leap years, odddays in month, etc.)“^(1[0-2]|0[1-9])(.|/)(3[0-1]|[1-2][0-9]|0[1-9])(.|/)([0-9]{4})$”
    • Supported flavors of regex in R• POSIX 1003.2• PerlPerl is the more robust of the two. POSIXhas a few idiosyncracies handling ‘’ that maytrip you up.
    • Usage Patterns• Data validation• String replace on dataset• String identify in dataset (subset of data)• Pattern arithmetic (how prevalent is string in data?)• Error prevention/detection
    • The R regex functions Function Purpose breaks apart strings at predefined pointsstrsplit() returns a vector of indices where agrep() pattern is matched returns a logical vector (TRUE/FALSE)grepl() for each element of the data replaces one pattern with another atsub() first matching location replaces one pattern with another atgsub() every matching location returns an integer vector giving the starting position ofregexpr() the first match, along with a match.length attribute giving the length of the matched text. returns an integer vector giving the starting position ofgregexpr() the all matches, along with a match.length attribute giving the length of the matched text. Note: all functions are in the base package
    • strsplit( )Definition: strsplit(x, split, fixed=FALSE, perl=FALSE, useBytes=FALSE)Example:str <- “This is some dummy data to parse x785y8099”strsplit(str, “[ xy]”, perl=TRUE)Result:[[1]] [1] "This" "is" "some" "dumm" "" "data" "to""parse" ""[10] "785" "8099"
    • grep( )Definition:grep(pattern, x, ignore.case=FALSE, perl=FALSE, value=FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)Example:str <- “This is some dummy data to parse x785y8099”grep(“[a-z][0-9]{3}[a-z][0-9]{4}”, str, perl=TRUE,value=TRUE)Result:[1] "This is some dummy data to parse x785y8099"
    • grepl( )Definition:grepl(pattern, x, ignore.case=FALSE, perl=FALSE,value=FALSE,fixed = FALSE, useBytes = FALSE, invert = FALSE)Example:str <- “This is some dummy data to parse x785y8099”grepl(“[a-z][0-9]{3}[a-z][0-9]{4}”, str, perl=TRUE)Result:[1] TRUE
    • sub( )Definition:sub(pattern, replacement, x, ignore.case=FALSE, perl=FALSE,fixed=FALSE, useBytes=FALSE)Example:str <- “This is some dummy data to parse x785y8099”sub("dummy(.* )([a-z][0-9]{3}).([0-9]{4})","awesome12H3", str, perl=TRUE)Result:[1] "This is some awesome data to parse x785H8099"
    • gsub( )Definition:gsub(pattern, replacement, x, ignore.case=FALSE,perl=FALSE,fixed=FALSE, useBytes=FALSE)Example:str <- “This is some dummy data to parse x785y8099 youdummy”gsub(“dummy”, “awesome”, perl=TRUE)Result:[1] "This is some awesome data to parse x785y8099 youawesome"
    • regexpr( )Definition:regexpr(pattern, text, ignore.case=FALSE, perl=FALSE,fixed = FALSE, useBytes = FALSE)Example:duckgoose <- "Duck, duck, duck, goose, duck, duck, goose,duck, duck"regexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE)Result:[1] 1attr(,"match.length")[1] 4
    • gregexpr( )Definition:gregexpr(pattern, text, ignore.case=FALSE, perl=FALSE,fixed=FALSE, useBytes=FALSE)Example:duckgoose <- "Duck, duck, duck, goose, duck, duck, goose,duck, duck"regexpr("duck", duckgoose, ignore.case=TRUE, perl=TRUE)Result:[[1]][1] 1 7 13 26 32 45 51attr(,"match.length")[1] 4 4 4 4 4 4 4
    • Problem Solving & Debugging• Remember that regexes are greedy by default. They will try to grab the largest matching string possible unless constrained.• Dummy data - small datasets• Unit testing - testthis, etc.• Build up regex complexity incrementally
    • Best Practices for Regex in R• Store regex string as variable to pass to function• Try to make regex expression as exact as possible (avoid lazy matching)• Pick one type of regex syntax and stick with it (POSIX or Perl)• Document all regexes in code with liberal comments• use cat() to verify regex string• Test, test, and test some more
    • Regex Workflow• Define initial data pattern• Define desired data pattern• Define transformation steps• Incremental iteration to desired regex• Testing & QA
    • Regex Resources• http://regexpal.com/ - online regex tester• Data Manipulation with R. Spector, Phil. Springer, 2008.• Regular Expression Cheat Sheet. http:// www.addedbytes.com/cheat-sheets/regular-expressions- cheat-sheet/• Regular Expressions Cookbook. Goyvaerts, Jan and Levithan, Steven. O’Reilly, 2009.• Mastering Regular Expressions. Friedl, Jeffrey E.F. O’Reilly, 2006.• Twitter: @RegexTip - regex tips and tricks