• Like
24 Spam
Upcoming SlideShare
Loading in...5
×

24 Spam

  • 361 views
Uploaded on

 

More in: Technology , Business
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
361
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
7
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Stat405 Regular expressions Hadley Wickham Wednesday, 18 November 2009
  • 2. 1. Review of last time 2. Regular expressions 3. Other string processing functions Wednesday, 18 November 2009
  • 3. Wednesday, 18 November 2009
  • 4. http://www.youtube.com/ watch?v=wfmvkO5x6Ng Wednesday, 18 November 2009
  • 5. Special characters • Use to “escape” special characters • " = " • n = new line • = • t = tab • ?Quotes for more Wednesday, 18 November 2009
  • 6. Recap load("email.rdata") str(contents) # second 100 are spam # reinstall because I've made some # improvements install.packages("stringr") Wednesday, 18 November 2009
  • 7. breaks <- str_locate(contents, "nn") # Remove invalid emails valid <- !is.na(breaks[, "start"]) contents <- contents[valid] breaks <- breaks[valid, ] # Extract headers and bodies header <- str_sub(contents, end = breaks[, 1]) body <- str_sub(contents, start = breaks[, 2]) Wednesday, 18 November 2009
  • 8. str_split(h, "n")[[1]] lines <- str_split(h, "n") continued <- str_sub(lines, 1, 1) %in% c(" ", "t") # This is a neat trick! groups <- cumsum(!continued) fields <- rep(NA, max(groups)) for (i in seq_along(fields)) { fields[i] <- str_join(lines[groups == i], collapse = "n") } str_split_fixed(fields, ": ", 2) Wednesday, 18 November 2009
  • 9. Recap We split the content into header and body. And split up the header into fields. Both of these tasks used fixed strings What if the pattern we need to match is more complicated? Wednesday, 18 November 2009
  • 10. Matching challenges • How could we match a phone number? • How could we match a date? • How could we match a time? • How could we match an amount of money? • How could we match an email address? Wednesday, 18 November 2009
  • 11. Pattern matching Each of these types of data have a fairly regular pattern that we can easily pick out by eye Today we are going to learn about regular expressions, which are a pattern matching language Almost every programming language has regular expressions, often with a slightly different dialect. Very very useful Wednesday, 18 November 2009
  • 12. First challenge • Matching phone numbers • How are phone numbers normally written? • How can we describe this format? • How can we extract the phone numbers? Wednesday, 18 November 2009
  • 13. Nothing. It is a generic folder that has Enron Global Markets on the cover. It is the one that I sent you to match your insert to when you were designing. With the dots. I am on my way over for a meeting, I'll bring one. Juli Salvagio Manager, Marketing Communications Public Relations Enron-3641 1400 Smith Street Houston, TX 77002 713-345-2908-Phone 713-542-0103-Mobile 713-646-5800-Fax Wednesday, 18 November 2009
  • 14. Mark, Good speaking with you. I'll follow up when I get your email. Thanks, Rosanna Rosanna Migliaccio Vice President Robert Walters Associates (212) 704-9900 Fax: (212) 704-4312 mailto:rosanna@robertwalters.com http://www.robertwalters.com Wednesday, 18 November 2009
  • 15. Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by NAHOU-MSAPP01S.corp.enron.com with Microsoft SMTPSVC(5.0.2195.2966); Mon, 1 Oct 2001 14:39:38 -0500 x-mimeole: Produced By Microsoft Exchange V6.0.4712.0 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; Content-Transfer-Encoding: binary Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED Date: Mon, 1 Oct 2001 14:39:38 -0500 Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU- MSMBX07V.corp.enron.com> Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q== X-Priority: 1 Priority: Urgent Importance: high From: "Landau, Georgi" <Georgi.Landau@ENRON.com> To: "Bailey, Susan" <Susan.Bailey@ENRON.com>, "Boyd, Samantha" <Samantha.Boyd@ENRON.com>, "Heard, Marie" <Marie.Heard@ENRON.com>, "Jones, Tana" <Tana.Jones@ENRON.com>, "Panus, Stephanie" <Stephanie.Panus@ENRON.com> Return-Path: Georgi.Landau@ENRON.com What’s the pattern? Wednesday, 18 November 2009
  • 16. Your turn Try and write down a description (in words) of the usual format of phone numbers. Wednesday, 18 November 2009
  • 17. Phone numbers • 10 digits, normally grouped 3-3-4 • Separated by space, - or () • Format: • [0-9]: matches any number between 0 and 9 • [- ()]: matches -, space, ( or ) • How can we match a phone number? Wednesday, 18 November 2009
  • 18. phone <- "[0-9][0-9][0-9][- ()][0-9][0-9][0-9][- ()] [0-9][0-9][0-9][0-9]" phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ()][0-9]{3}" test <- body[10] cat(test) str_detect(test, phone2) str_locate(test, phone2) str_locate_all(test, phone2) str_extract(test, phone2) str_extract_all(test, phone2) Wednesday, 18 November 2009
  • 19. Qualifier ≥ ≤ ? 0 1 + 1 Inf * 0 Inf {m,n} m n {,n} 0 n {m,} m Inf Wednesday, 18 November 2009
  • 20. # What do these regular expression match? mystery1 <- "[0-9]{5}(-[0-9]{4})?" mystery3 <- "[0-9]{3}-[0-9]{2}-[0-9]{4}" mystery4 <- "[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}" mystery5 <- "https?://[a-z]+([a-z0-9-]*[a-z0-9]+)?(.([a-z]+ ([a-z0-9-]*[a-z0-9]+)?)+)*" # Think about them first, then input to # http://xenon.stanford.edu/~xusch/regexp/analyzer.html # (select java for language - it's closest to R for regexps) Wednesday, 18 November 2009
  • 21. New features • () group parts of a regular expression • . matches any character (. matches a .) • d matches a digit, s matches a space • Other character that need to be escaped: $, ^ Wednesday, 18 November 2009
  • 22. Regular expression escapes • Tricky, because we are writing strings, but the regular expression is the contents of the string. • "a.b" is regular expression a.b • "a.b" is an error • "" is the regular expression which matches a Wednesday, 18 November 2009
  • 23. Your turn • Improve our initial regular expression for matching a phone number • Have a go at writing a regular expression to match sums of money • Hint: use the tools linked from the website to help Wednesday, 18 November 2009
  • 24. phone3 <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{3}" extract(phone3, bodies) phone4 <- "[0-9][0-9+() -]{6,}[0-9]" extract(phone4, bodies) money <- "$[0-9, ]+[0-9]" extract(money, bodies) Wednesday, 18 November 2009
  • 25. Regexp String Matches . "." Any single character . "." . d "d" Any digit s "s" Any white space " """ " "" b "b" Word border Wednesday, 18 November 2009
  • 26. Your turn • Create a regular expression to match a date. Test it against the following cases: c("10/14/1979", "20/20/1945", "1/1/1905" "5/5/5") • Create a regular expression to match a time. Test it against the following cases: c("12:30 pm", "2:15 AM", "312:23 pm", "1:00 american", "08:20 am") Wednesday, 18 November 2009
  • 27. Your turn • Create a regular expression that matches dates like "12 April 2008" • Make sure your expression works when the date is in a sentence: "On 12 April 2008, I went to town." • Extract just the month (hint: use strsplit) Wednesday, 18 November 2009
  • 28. test <- c("15 April 2005", "28 February 1982", "113 July 1984") dates <- extract("b[0-9]{2} [a-zA-Z]+ [0-9]{4}", test) date <- dates[[1]] strsplit(date, " ")[[1]][2] Wednesday, 18 November 2009
  • 29. Regexp String Matches [abc] "[abc]" a, b, or c [a-c] "[a-c]" a, b, or c [ac-] "[ac-"] a, c, or - [.] "[.]" . [.] "[.]" . or [ae-g.] "[ae-g.] a, e, f, g, or . Wednesday, 18 November 2009
  • 30. Regexp String Matches [^abc] "[^abc]" Not a, b, or c [^a-c] "[^a-c]" Not a, b, or c [ac^] "[ac^]" a, c, or ^ Wednesday, 18 November 2009
  • 31. Regexp String Matches ^a "^a" a at start of string a$ "a$" a at end of string ^a$ "^a$" complete string = a $a "$a" $a Wednesday, 18 November 2009
  • 32. Your turn Write a regular expression to match strings of the form $5,000. Extract them with str_extract_all() and then use str_replace() to remove the $. Convert to a number. Wednesday, 18 November 2009
  • 33. test <- c("$10,000,000", "$5.00 is owed", "I owe $4,356.45") money <- extract("$[0-9]{1,3}(,[0-9]{3})*", test) m <- money[[1]] as.numeric(gsub("[$,]", "", m)) Wednesday, 18 November 2009
  • 34. Function Parameters Result str_detect string, pattern logical vector str_locate string, pattern list of matrices list of character str_extract string, pattern vectors string, pattern, str_replace replacement character vector list of character str_split string, pattern vectors Wednesday, 18 November 2009
  • 35. Atomic vector List str_detect str_locate str_locate_all str_extract str_extract_all str_replace str_split_fixed str_split Wednesday, 18 November 2009