Your SlideShare is downloading. ×
0
Stat405          Regular expressions


                              Hadley Wickham
Wednesday, 18 November 2009
1. Review of last time
              2. Regular expressions
              3. Other string processing functions




Wednesd...
Wednesday, 18 November 2009
http://www.youtube.com/
       watch?v=wfmvkO5x6Ng


Wednesday, 18 November 2009
Special characters
                  • Use  to “escape” special characters
                        • " = "
               ...
Recap

              load("email.rdata")
              str(contents) # second 100 are spam

              # reinstall beca...
breaks <- str_locate(contents, "nn")

     # Remove invalid emails
     valid <- !is.na(breaks[, "start"])
     contents <...
str_split(h, "n")[[1]]
     lines <- str_split(h, "n")

     continued <- str_sub(lines, 1, 1) %in% c(" ", "t")

     # Th...
Recap

                  We split the content into header and
                  body. And split up the header into fields.
...
Matching challenges
                  • How could we match a phone number?
                  • How could we match a date?
...
Pattern matching
                  Each of these types of data have a fairly
                  regular pattern that we can...
First challenge
                  • Matching phone numbers
                  • How are phone numbers normally
            ...
Nothing. It is a generic folder that has Enron Global Markets on
        the cover. It is the one that I sent you to match...
Mark,
        Good speaking with you. I'll follow up when I get your email.
        Thanks,
        Rosanna

        Rosan...
Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by
           NAHOU-MSAPP01S.corp.enron.com with Microsoft...
Your turn


                  Try and write down a description (in
                  words) of the usual format of phone
 ...
Phone numbers
                  • 10 digits, normally grouped 3-3-4
                  • Separated by space, - or ()
      ...
phone <- "[0-9][0-9][0-9][- ()][0-9][0-9][0-9][- ()]
     [0-9][0-9][0-9][0-9]"

     phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ...
Qualifier   ≥   ≤
                                 ?       0    1

                                 +       1   Inf

      ...
# What do these regular expression match?


        mystery1 <- "[0-9]{5}(-[0-9]{4})?"
        mystery3 <- "[0-9]{3}-[0-9]...
New features

                  • () group parts of a regular expression
                  • . matches any character (. ma...
Regular expression
                              escapes
                  • Tricky, because we are writing strings,
     ...
Your turn

                  • Improve our initial regular expression
                    for matching a phone number
    ...
phone3 <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{3}"
        extract(phone3, bodies)
        phone4 <- "[0-9][0-9+() -]{6,}[...
Regexp           String        Matches
                        .      "."     Any single
                                 ...
Your turn
                  • Create a regular expression to match a
                    date. Test it against the followi...
Your turn
                  • Create a regular expression that
                    matches dates like "12 April 2008"
    ...
test <- c("15 April 2005", "28 February 1982",
                              "113 July 1984")


        dates <- extract("...
Regexp            String             Matches
                 [abc]        "[abc]"    a, b, or c
                 [a-c]   ...
Regexp             String         Matches

             [^abc]           "[^abc]"   Not a, b, or c

             [^a-c]   ...
Regexp            String        Matches
                    ^a         "^a"    a at start of string
                    a$...
Your turn

                  Write a regular expression to match
                  strings of the form $5,000.
           ...
test <- c("$10,000,000", "$5.00 is owed", "I owe $4,356.45")
        money <- extract("$[0-9]{1,3}(,[0-9]{3})*", test)


 ...
Function        Parameters            Result

           str_detect         string, pattern     logical vector

          ...
Atomic vector           List

                       str_detect

                       str_locate   str_locate_all

     ...
Upcoming SlideShare
Loading in...5
×

24 Spam

395

Published on

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
395
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
9
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "24 Spam"

  1. 1. Stat405 Regular expressions Hadley Wickham Wednesday, 18 November 2009
  2. 2. 1. Review of last time 2. Regular expressions 3. Other string processing functions Wednesday, 18 November 2009
  3. 3. Wednesday, 18 November 2009
  4. 4. http://www.youtube.com/ watch?v=wfmvkO5x6Ng Wednesday, 18 November 2009
  5. 5. Special characters • Use to “escape” special characters • " = " • n = new line • = • t = tab • ?Quotes for more Wednesday, 18 November 2009
  6. 6. Recap load("email.rdata") str(contents) # second 100 are spam # reinstall because I've made some # improvements install.packages("stringr") Wednesday, 18 November 2009
  7. 7. breaks <- str_locate(contents, "nn") # Remove invalid emails valid <- !is.na(breaks[, "start"]) contents <- contents[valid] breaks <- breaks[valid, ] # Extract headers and bodies header <- str_sub(contents, end = breaks[, 1]) body <- str_sub(contents, start = breaks[, 2]) Wednesday, 18 November 2009
  8. 8. str_split(h, "n")[[1]] lines <- str_split(h, "n") continued <- str_sub(lines, 1, 1) %in% c(" ", "t") # This is a neat trick! groups <- cumsum(!continued) fields <- rep(NA, max(groups)) for (i in seq_along(fields)) { fields[i] <- str_join(lines[groups == i], collapse = "n") } str_split_fixed(fields, ": ", 2) Wednesday, 18 November 2009
  9. 9. Recap We split the content into header and body. And split up the header into fields. Both of these tasks used fixed strings What if the pattern we need to match is more complicated? Wednesday, 18 November 2009
  10. 10. Matching challenges • How could we match a phone number? • How could we match a date? • How could we match a time? • How could we match an amount of money? • How could we match an email address? Wednesday, 18 November 2009
  11. 11. Pattern matching Each of these types of data have a fairly regular pattern that we can easily pick out by eye Today we are going to learn about regular expressions, which are a pattern matching language Almost every programming language has regular expressions, often with a slightly different dialect. Very very useful Wednesday, 18 November 2009
  12. 12. First challenge • Matching phone numbers • How are phone numbers normally written? • How can we describe this format? • How can we extract the phone numbers? Wednesday, 18 November 2009
  13. 13. Nothing. It is a generic folder that has Enron Global Markets on the cover. It is the one that I sent you to match your insert to when you were designing. With the dots. I am on my way over for a meeting, I'll bring one. Juli Salvagio Manager, Marketing Communications Public Relations Enron-3641 1400 Smith Street Houston, TX 77002 713-345-2908-Phone 713-542-0103-Mobile 713-646-5800-Fax Wednesday, 18 November 2009
  14. 14. Mark, Good speaking with you. I'll follow up when I get your email. Thanks, Rosanna Rosanna Migliaccio Vice President Robert Walters Associates (212) 704-9900 Fax: (212) 704-4312 mailto:rosanna@robertwalters.com http://www.robertwalters.com Wednesday, 18 November 2009
  15. 15. Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by NAHOU-MSAPP01S.corp.enron.com with Microsoft SMTPSVC(5.0.2195.2966); Mon, 1 Oct 2001 14:39:38 -0500 x-mimeole: Produced By Microsoft Exchange V6.0.4712.0 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; Content-Transfer-Encoding: binary Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED Date: Mon, 1 Oct 2001 14:39:38 -0500 Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU- MSMBX07V.corp.enron.com> Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q== X-Priority: 1 Priority: Urgent Importance: high From: "Landau, Georgi" <Georgi.Landau@ENRON.com> To: "Bailey, Susan" <Susan.Bailey@ENRON.com>, "Boyd, Samantha" <Samantha.Boyd@ENRON.com>, "Heard, Marie" <Marie.Heard@ENRON.com>, "Jones, Tana" <Tana.Jones@ENRON.com>, "Panus, Stephanie" <Stephanie.Panus@ENRON.com> Return-Path: Georgi.Landau@ENRON.com What’s the pattern? Wednesday, 18 November 2009
  16. 16. Your turn Try and write down a description (in words) of the usual format of phone numbers. Wednesday, 18 November 2009
  17. 17. Phone numbers • 10 digits, normally grouped 3-3-4 • Separated by space, - or () • Format: • [0-9]: matches any number between 0 and 9 • [- ()]: matches -, space, ( or ) • How can we match a phone number? Wednesday, 18 November 2009
  18. 18. phone <- "[0-9][0-9][0-9][- ()][0-9][0-9][0-9][- ()] [0-9][0-9][0-9][0-9]" phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ()][0-9]{3}" test <- body[10] cat(test) str_detect(test, phone2) str_locate(test, phone2) str_locate_all(test, phone2) str_extract(test, phone2) str_extract_all(test, phone2) Wednesday, 18 November 2009
  19. 19. Qualifier ≥ ≤ ? 0 1 + 1 Inf * 0 Inf {m,n} m n {,n} 0 n {m,} m Inf Wednesday, 18 November 2009
  20. 20. # What do these regular expression match? mystery1 <- "[0-9]{5}(-[0-9]{4})?" mystery3 <- "[0-9]{3}-[0-9]{2}-[0-9]{4}" mystery4 <- "[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}" mystery5 <- "https?://[a-z]+([a-z0-9-]*[a-z0-9]+)?(.([a-z]+ ([a-z0-9-]*[a-z0-9]+)?)+)*" # Think about them first, then input to # http://xenon.stanford.edu/~xusch/regexp/analyzer.html # (select java for language - it's closest to R for regexps) Wednesday, 18 November 2009
  21. 21. New features • () group parts of a regular expression • . matches any character (. matches a .) • d matches a digit, s matches a space • Other character that need to be escaped: $, ^ Wednesday, 18 November 2009
  22. 22. Regular expression escapes • Tricky, because we are writing strings, but the regular expression is the contents of the string. • "a.b" is regular expression a.b • "a.b" is an error • "" is the regular expression which matches a Wednesday, 18 November 2009
  23. 23. Your turn • Improve our initial regular expression for matching a phone number • Have a go at writing a regular expression to match sums of money • Hint: use the tools linked from the website to help Wednesday, 18 November 2009
  24. 24. phone3 <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{3}" extract(phone3, bodies) phone4 <- "[0-9][0-9+() -]{6,}[0-9]" extract(phone4, bodies) money <- "$[0-9, ]+[0-9]" extract(money, bodies) Wednesday, 18 November 2009
  25. 25. Regexp String Matches . "." Any single character . "." . d "d" Any digit s "s" Any white space " """ " "" b "b" Word border Wednesday, 18 November 2009
  26. 26. Your turn • Create a regular expression to match a date. Test it against the following cases: c("10/14/1979", "20/20/1945", "1/1/1905" "5/5/5") • Create a regular expression to match a time. Test it against the following cases: c("12:30 pm", "2:15 AM", "312:23 pm", "1:00 american", "08:20 am") Wednesday, 18 November 2009
  27. 27. Your turn • Create a regular expression that matches dates like "12 April 2008" • Make sure your expression works when the date is in a sentence: "On 12 April 2008, I went to town." • Extract just the month (hint: use strsplit) Wednesday, 18 November 2009
  28. 28. test <- c("15 April 2005", "28 February 1982", "113 July 1984") dates <- extract("b[0-9]{2} [a-zA-Z]+ [0-9]{4}", test) date <- dates[[1]] strsplit(date, " ")[[1]][2] Wednesday, 18 November 2009
  29. 29. Regexp String Matches [abc] "[abc]" a, b, or c [a-c] "[a-c]" a, b, or c [ac-] "[ac-"] a, c, or - [.] "[.]" . [.] "[.]" . or [ae-g.] "[ae-g.] a, e, f, g, or . Wednesday, 18 November 2009
  30. 30. Regexp String Matches [^abc] "[^abc]" Not a, b, or c [^a-c] "[^a-c]" Not a, b, or c [ac^] "[ac^]" a, c, or ^ Wednesday, 18 November 2009
  31. 31. Regexp String Matches ^a "^a" a at start of string a$ "a$" a at end of string ^a$ "^a$" complete string = a $a "$a" $a Wednesday, 18 November 2009
  32. 32. Your turn Write a regular expression to match strings of the form $5,000. Extract them with str_extract_all() and then use str_replace() to remove the $. Convert to a number. Wednesday, 18 November 2009
  33. 33. test <- c("$10,000,000", "$5.00 is owed", "I owe $4,356.45") money <- extract("$[0-9]{1,3}(,[0-9]{3})*", test) m <- money[[1]] as.numeric(gsub("[$,]", "", m)) Wednesday, 18 November 2009
  34. 34. Function Parameters Result str_detect string, pattern logical vector str_locate string, pattern list of matrices list of character str_extract string, pattern vectors string, pattern, str_replace replacement character vector list of character str_split string, pattern vectors Wednesday, 18 November 2009
  35. 35. Atomic vector List str_detect str_locate str_locate_all str_extract str_extract_all str_replace str_split_fixed str_split Wednesday, 18 November 2009
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×