5. Special characters
• Use to “escape” special characters
• " = "
• n = new line
• =
• t = tab
• ?Quotes for more
Wednesday, 18 November 2009
6. Recap
load("email.rdata")
str(contents) # second 100 are spam
# reinstall because I've made some
# improvements
install.packages("stringr")
Wednesday, 18 November 2009
8. str_split(h, "n")[[1]]
lines <- str_split(h, "n")
continued <- str_sub(lines, 1, 1) %in% c(" ", "t")
# This is a neat trick!
groups <- cumsum(!continued)
fields <- rep(NA, max(groups))
for (i in seq_along(fields)) {
fields[i] <- str_join(lines[groups == i],
collapse = "n")
}
str_split_fixed(fields, ": ", 2)
Wednesday, 18 November 2009
9. Recap
We split the content into header and
body. And split up the header into fields.
Both of these tasks used fixed strings
What if the pattern we need to match is
more complicated?
Wednesday, 18 November 2009
10. Matching challenges
• How could we match a phone number?
• How could we match a date?
• How could we match a time?
• How could we match an amount of
money?
• How could we match an email address?
Wednesday, 18 November 2009
11. Pattern matching
Each of these types of data have a fairly
regular pattern that we can easily pick out
by eye
Today we are going to learn about regular
expressions, which are a pattern
matching language
Almost every programming language has
regular expressions, often with a slightly
different dialect. Very very useful
Wednesday, 18 November 2009
12. First challenge
• Matching phone numbers
• How are phone numbers normally
written?
• How can we describe this format?
• How can we extract the phone
numbers?
Wednesday, 18 November 2009
13. Nothing. It is a generic folder that has Enron Global Markets on
the cover. It is the one that I sent you to match your insert to
when you were designing. With the dots. I am on my way over for a
meeting, I'll bring one.
Juli Salvagio
Manager, Marketing Communications
Public Relations
Enron-3641
1400 Smith Street
Houston, TX 77002
713-345-2908-Phone
713-542-0103-Mobile
713-646-5800-Fax
Wednesday, 18 November 2009
14. Mark,
Good speaking with you. I'll follow up when I get your email.
Thanks,
Rosanna
Rosanna Migliaccio
Vice President
Robert Walters Associates
(212) 704-9900
Fax: (212) 704-4312
mailto:rosanna@robertwalters.com
http://www.robertwalters.com
Wednesday, 18 November 2009
15. Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by
NAHOU-MSAPP01S.corp.enron.com with Microsoft SMTPSVC(5.0.2195.2966);
Mon, 1 Oct 2001 14:39:38 -0500
x-mimeole: Produced By Microsoft Exchange V6.0.4712.0
content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
Content-Transfer-Encoding: binary
Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
Date: Mon, 1 Oct 2001 14:39:38 -0500
Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com>
X-MS-Has-Attach:
X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-
MSMBX07V.corp.enron.com>
Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q==
X-Priority: 1
Priority: Urgent
Importance: high
From: "Landau, Georgi" <Georgi.Landau@ENRON.com>
To: "Bailey, Susan" <Susan.Bailey@ENRON.com>,
"Boyd, Samantha" <Samantha.Boyd@ENRON.com>,
"Heard, Marie" <Marie.Heard@ENRON.com>,
"Jones, Tana" <Tana.Jones@ENRON.com>,
"Panus, Stephanie" <Stephanie.Panus@ENRON.com>
Return-Path: Georgi.Landau@ENRON.com
What’s the pattern?
Wednesday, 18 November 2009
16. Your turn
Try and write down a description (in
words) of the usual format of phone
numbers.
Wednesday, 18 November 2009
17. Phone numbers
• 10 digits, normally grouped 3-3-4
• Separated by space, - or ()
• Format:
• [0-9]: matches any number between 0
and 9
• [- ()]: matches -, space, ( or )
• How can we match a phone number?
Wednesday, 18 November 2009
19. Qualifier ≥ ≤
? 0 1
+ 1 Inf
* 0 Inf
{m,n} m n
{,n} 0 n
{m,} m Inf
Wednesday, 18 November 2009
20. # What do these regular expression match?
mystery1 <- "[0-9]{5}(-[0-9]{4})?"
mystery3 <- "[0-9]{3}-[0-9]{2}-[0-9]{4}"
mystery4 <- "[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}"
mystery5 <- "https?://[a-z]+([a-z0-9-]*[a-z0-9]+)?(.([a-z]+
([a-z0-9-]*[a-z0-9]+)?)+)*"
# Think about them first, then input to
# http://xenon.stanford.edu/~xusch/regexp/analyzer.html
# (select java for language - it's closest to R for regexps)
Wednesday, 18 November 2009
21. New features
• () group parts of a regular expression
• . matches any character (. matches a .)
• d matches a digit, s matches a space
• Other character that need to be
escaped: $, ^
Wednesday, 18 November 2009
22. Regular expression
escapes
• Tricky, because we are writing strings,
but the regular expression is the
contents of the string.
• "a.b" is regular expression a.b
• "a.b" is an error
• "" is the regular expression which
matches a
Wednesday, 18 November 2009
23. Your turn
• Improve our initial regular expression
for matching a phone number
• Have a go at writing a regular
expression to match sums of money
• Hint: use the tools linked from the
website to help
Wednesday, 18 November 2009
25. Regexp String Matches
. "." Any single
character
. "." .
d "d" Any digit
s "s" Any white space
" """ "
""
b "b" Word border
Wednesday, 18 November 2009
26. Your turn
• Create a regular expression to match a
date. Test it against the following cases:
c("10/14/1979", "20/20/1945",
"1/1/1905" "5/5/5")
• Create a regular expression to match a
time. Test it against the following cases:
c("12:30 pm", "2:15 AM", "312:23
pm", "1:00 american", "08:20 am")
Wednesday, 18 November 2009
27. Your turn
• Create a regular expression that
matches dates like "12 April 2008"
• Make sure your expression works when
the date is in a sentence: "On 12 April
2008, I went to town."
• Extract just the month (hint: use strsplit)
Wednesday, 18 November 2009
28. test <- c("15 April 2005", "28 February 1982",
"113 July 1984")
dates <- extract("b[0-9]{2} [a-zA-Z]+ [0-9]{4}", test)
date <- dates[[1]]
strsplit(date, " ")[[1]][2]
Wednesday, 18 November 2009
29. Regexp String Matches
[abc] "[abc]" a, b, or c
[a-c] "[a-c]" a, b, or c
[ac-] "[ac-"] a, c, or -
[.] "[.]" .
[.] "[.]" . or
[ae-g.] "[ae-g.] a, e, f, g, or .
Wednesday, 18 November 2009
30. Regexp String Matches
[^abc] "[^abc]" Not a, b, or c
[^a-c] "[^a-c]" Not a, b, or c
[ac^] "[ac^]" a, c, or ^
Wednesday, 18 November 2009
31. Regexp String Matches
^a "^a" a at start of string
a$ "a$" a at end of string
^a$ "^a$" complete string =
a
$a "$a" $a
Wednesday, 18 November 2009
32. Your turn
Write a regular expression to match
strings of the form $5,000.
Extract them with str_extract_all() and
then use str_replace() to remove the $.
Convert to a number.
Wednesday, 18 November 2009
33. test <- c("$10,000,000", "$5.00 is owed", "I owe $4,356.45")
money <- extract("$[0-9]{1,3}(,[0-9]{3})*", test)
m <- money[[1]]
as.numeric(gsub("[$,]", "", m))
Wednesday, 18 November 2009
34. Function Parameters Result
str_detect string, pattern logical vector
str_locate string, pattern list of matrices
list of character
str_extract string, pattern
vectors
string, pattern,
str_replace replacement
character vector
list of character
str_split string, pattern
vectors
Wednesday, 18 November 2009
35. Atomic vector List
str_detect
str_locate str_locate_all
str_extract str_extract_all
str_replace
str_split_fixed str_split
Wednesday, 18 November 2009