SlideShare a Scribd company logo
1 of 35
Download to read offline
Stat405          Regular expressions


                              Hadley Wickham
Wednesday, 18 November 2009
1. Review of last time
              2. Regular expressions
              3. Other string processing functions




Wednesday, 18 November 2009
Wednesday, 18 November 2009
http://www.youtube.com/
       watch?v=wfmvkO5x6Ng


Wednesday, 18 November 2009
Special characters
                  • Use  to “escape” special characters
                        • " = "
                        • n = new line
                        •  = 
                        • t = tab
                  • ?Quotes for more


Wednesday, 18 November 2009
Recap

              load("email.rdata")
              str(contents) # second 100 are spam

              # reinstall because I've made some
              # improvements
              install.packages("stringr")




Wednesday, 18 November 2009
breaks <- str_locate(contents, "nn")

     # Remove invalid emails
     valid <- !is.na(breaks[, "start"])
     contents <- contents[valid]
     breaks <- breaks[valid, ]

     # Extract headers and bodies
     header <- str_sub(contents, end = breaks[, 1])
     body <- str_sub(contents, start = breaks[, 2])




Wednesday, 18 November 2009
str_split(h, "n")[[1]]
     lines <- str_split(h, "n")

     continued <- str_sub(lines, 1, 1) %in% c(" ", "t")

     # This is a neat trick!
     groups <- cumsum(!continued)

     fields <- rep(NA, max(groups))
     for (i in seq_along(fields)) {
       fields[i] <- str_join(lines[groups == i],
         collapse = "n")
     }

     str_split_fixed(fields, ": ", 2)
Wednesday, 18 November 2009
Recap

                  We split the content into header and
                  body. And split up the header into fields.
                  Both of these tasks used fixed strings
                  What if the pattern we need to match is
                  more complicated?




Wednesday, 18 November 2009
Matching challenges
                  • How could we match a phone number?
                  • How could we match a date?
                  • How could we match a time?
                  • How could we match an amount of
                    money?
                  • How could we match an email address?


Wednesday, 18 November 2009
Pattern matching
                  Each of these types of data have a fairly
                  regular pattern that we can easily pick out
                  by eye
                  Today we are going to learn about regular
                  expressions, which are a pattern
                  matching language
                  Almost every programming language has
                  regular expressions, often with a slightly
                  different dialect. Very very useful

Wednesday, 18 November 2009
First challenge
                  • Matching phone numbers
                  • How are phone numbers normally
                    written?
                  • How can we describe this format?
                  • How can we extract the phone
                    numbers?


Wednesday, 18 November 2009
Nothing. It is a generic folder that has Enron Global Markets on
        the cover. It is the one that I sent you to match your insert to
        when you were designing. With the dots. I am on my way over for a
        meeting, I'll bring one.

        Juli Salvagio
        Manager, Marketing Communications
        Public Relations
        Enron-3641
        1400 Smith Street
        Houston, TX 77002
        713-345-2908-Phone
        713-542-0103-Mobile
        713-646-5800-Fax




Wednesday, 18 November 2009
Mark,
        Good speaking with you. I'll follow up when I get your email.
        Thanks,
        Rosanna

        Rosanna Migliaccio
        Vice President
        Robert Walters Associates
        (212) 704-9900
        Fax: (212) 704-4312
        mailto:rosanna@robertwalters.com
        http://www.robertwalters.com




Wednesday, 18 November 2009
Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by
           NAHOU-MSAPP01S.corp.enron.com with Microsoft SMTPSVC(5.0.2195.2966);
           Mon, 1 Oct 2001 14:39:38 -0500
        x-mimeole: Produced By Microsoft Exchange V6.0.4712.0
        content-class: urn:content-classes:message
        MIME-Version: 1.0
        Content-Type: text/plain;
        Content-Transfer-Encoding: binary
        Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
        Date: Mon, 1 Oct 2001 14:39:38 -0500
        Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com>
        X-MS-Has-Attach:
        X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-
           MSMBX07V.corp.enron.com>
        Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
        Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q==
        X-Priority: 1
        Priority: Urgent
        Importance: high
        From: "Landau, Georgi" <Georgi.Landau@ENRON.com>
        To: "Bailey, Susan" <Susan.Bailey@ENRON.com>,
             "Boyd, Samantha" <Samantha.Boyd@ENRON.com>,
             "Heard, Marie" <Marie.Heard@ENRON.com>,
             "Jones, Tana" <Tana.Jones@ENRON.com>,
             "Panus, Stephanie" <Stephanie.Panus@ENRON.com>
        Return-Path: Georgi.Landau@ENRON.com
                                                                  What’s the pattern?


Wednesday, 18 November 2009
Your turn


                  Try and write down a description (in
                  words) of the usual format of phone
                  numbers.




Wednesday, 18 November 2009
Phone numbers
                  • 10 digits, normally grouped 3-3-4
                  • Separated by space, - or ()
                  • Format:
                    • [0-9]: matches any number between 0
                      and 9
                    • [- ()]: matches -, space, ( or )
                  • How can we match a phone number?


Wednesday, 18 November 2009
phone <- "[0-9][0-9][0-9][- ()][0-9][0-9][0-9][- ()]
     [0-9][0-9][0-9][0-9]"

     phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ()][0-9]{3}"

     test <- body[10]
     cat(test)

     str_detect(test, phone2)

     str_locate(test, phone2)
     str_locate_all(test, phone2)

     str_extract(test, phone2)
     str_extract_all(test, phone2)

Wednesday, 18 November 2009
Qualifier   ≥   ≤
                                 ?       0    1

                                 +       1   Inf

                                 *       0   Inf

                               {m,n}     m    n

                               {,n}      0    n

                               {m,}      m   Inf

Wednesday, 18 November 2009
# What do these regular expression match?


        mystery1 <- "[0-9]{5}(-[0-9]{4})?"
        mystery3 <- "[0-9]{3}-[0-9]{2}-[0-9]{4}"
        mystery4 <- "[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}"
        mystery5 <- "https?://[a-z]+([a-z0-9-]*[a-z0-9]+)?(.([a-z]+
        ([a-z0-9-]*[a-z0-9]+)?)+)*"


        # Think about them first, then input to
        # http://xenon.stanford.edu/~xusch/regexp/analyzer.html
        # (select java for language - it's closest to R for regexps)




Wednesday, 18 November 2009
New features

                  • () group parts of a regular expression
                  • . matches any character (. matches a .)
                  • d matches a digit, s matches a space
                  • Other character that need to be
                    escaped: $, ^



Wednesday, 18 November 2009
Regular expression
                              escapes
                  • Tricky, because we are writing strings,
                    but the regular expression is the
                    contents of the string.
                  • "a.b" is regular expression a.b
                  • "a.b" is an error
                  • "" is the regular expression  which
                    matches a 


Wednesday, 18 November 2009
Your turn

                  • Improve our initial regular expression
                    for matching a phone number
                  • Have a go at writing a regular
                    expression to match sums of money
                  • Hint: use the tools linked from the
                    website to help


Wednesday, 18 November 2009
phone3 <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{3}"
        extract(phone3, bodies)
        phone4 <- "[0-9][0-9+() -]{6,}[0-9]"
        extract(phone4, bodies)


        money <- "$[0-9, ]+[0-9]"
        extract(money, bodies)




Wednesday, 18 November 2009
Regexp           String        Matches
                        .      "."     Any single
                                       character
                      .      "."    .
                      d      "d"    Any digit
                      s      "s"    Any white space
                        "      """    "
                            ""   
                      b      "b"    Word border
Wednesday, 18 November 2009
Your turn
                  • Create a regular expression to match a
                    date. Test it against the following cases:
                    c("10/14/1979", "20/20/1945",
                    "1/1/1905" "5/5/5")
                  • Create a regular expression to match a
                    time. Test it against the following cases:
                    c("12:30 pm", "2:15 AM", "312:23
                    pm", "1:00 american", "08:20 am")


Wednesday, 18 November 2009
Your turn
                  • Create a regular expression that
                    matches dates like "12 April 2008"
                  • Make sure your expression works when
                    the date is in a sentence: "On 12 April
                    2008, I went to town."
                  • Extract just the month (hint: use strsplit)



Wednesday, 18 November 2009
test <- c("15 April 2005", "28 February 1982",
                              "113 July 1984")


        dates <- extract("b[0-9]{2} [a-zA-Z]+ [0-9]{4}", test)


        date <- dates[[1]]
        strsplit(date, " ")[[1]][2]




Wednesday, 18 November 2009
Regexp            String             Matches
                 [abc]        "[abc]"    a, b, or c
                 [a-c]        "[a-c]"    a, b, or c
                 [ac-]        "[ac-"]    a, c, or -
                    [.]        "[.]"     .
                   [.]       "[.]"    . or 
             [ae-g.]          "[ae-g.]   a, e, f, g, or .
Wednesday, 18 November 2009
Regexp             String         Matches

             [^abc]           "[^abc]"   Not a, b, or c

             [^a-c]           "[^a-c]"   Not a, b, or c

               [ac^]          "[ac^]"    a, c, or ^



Wednesday, 18 November 2009
Regexp            String        Matches
                    ^a         "^a"    a at start of string
                    a$         "a$"    a at end of string
                   ^a$        "^a$"    complete string =
                                       a
                   $a        "$a"   $a



Wednesday, 18 November 2009
Your turn

                  Write a regular expression to match
                  strings of the form $5,000.
                  Extract them with str_extract_all() and
                  then use str_replace() to remove the $.
                  Convert to a number.



Wednesday, 18 November 2009
test <- c("$10,000,000", "$5.00 is owed", "I owe $4,356.45")
        money <- extract("$[0-9]{1,3}(,[0-9]{3})*", test)


        m <- money[[1]]
        as.numeric(gsub("[$,]", "", m))




Wednesday, 18 November 2009
Function        Parameters            Result

           str_detect         string, pattern     logical vector

           str_locate         string, pattern    list of matrices

                                                 list of character
         str_extract          string, pattern
                                                      vectors
                              string, pattern,
         str_replace           replacement
                                                 character vector

                                                 list of character
             str_split        string, pattern
                                                      vectors

Wednesday, 18 November 2009
Atomic vector           List

                       str_detect

                       str_locate   str_locate_all

                     str_extract    str_extract_all

                     str_replace

              str_split_fixed          str_split

Wednesday, 18 November 2009

More Related Content

Similar to 24 Spam

Similar to 24 Spam (15)

22 spam
22 spam22 spam
22 spam
 
21 Polishing
21 Polishing21 Polishing
21 Polishing
 
12 Ddply
12 Ddply12 Ddply
12 Ddply
 
07 Problem Solving
07 Problem Solving07 Problem Solving
07 Problem Solving
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
08 Functions
08 Functions08 Functions
08 Functions
 
Regular Expressions Boot Camp
Regular Expressions Boot CampRegular Expressions Boot Camp
Regular Expressions Boot Camp
 
Who Wants To Be a Munger
Who Wants To Be a MungerWho Wants To Be a Munger
Who Wants To Be a Munger
 
Prototype & Scriptaculous
Prototype  & ScriptaculousPrototype  & Scriptaculous
Prototype & Scriptaculous
 
Rack Middleware
Rack MiddlewareRack Middleware
Rack Middleware
 
Apache Hadoop Talk at QCon
Apache Hadoop Talk at QConApache Hadoop Talk at QCon
Apache Hadoop Talk at QCon
 
03 Cleaning
03 Cleaning03 Cleaning
03 Cleaning
 
05 subsetting
05 subsetting05 subsetting
05 subsetting
 
Regular Expressions in JavaScript and Command Line
Regular Expressions in JavaScript and Command LineRegular Expressions in JavaScript and Command Line
Regular Expressions in JavaScript and Command Line
 
P3 2018 python_regexes
P3 2018 python_regexesP3 2018 python_regexes
P3 2018 python_regexes
 

More from Hadley Wickham (20)

27 development
27 development27 development
27 development
 
27 development
27 development27 development
27 development
 
24 modelling
24 modelling24 modelling
24 modelling
 
23 data-structures
23 data-structures23 data-structures
23 data-structures
 
Graphical inference
Graphical inferenceGraphical inference
Graphical inference
 
R packages
R packagesR packages
R packages
 
21 spam
21 spam21 spam
21 spam
 
20 date-times
20 date-times20 date-times
20 date-times
 
19 tables
19 tables19 tables
19 tables
 
18 cleaning
18 cleaning18 cleaning
18 cleaning
 
17 polishing
17 polishing17 polishing
17 polishing
 
16 critique
16 critique16 critique
16 critique
 
15 time-space
15 time-space15 time-space
15 time-space
 
14 case-study
14 case-study14 case-study
14 case-study
 
13 case-study
13 case-study13 case-study
13 case-study
 
12 adv-manip
12 adv-manip12 adv-manip
12 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
11 adv-manip
11 adv-manip11 adv-manip
11 adv-manip
 
10 simulation
10 simulation10 simulation
10 simulation
 
10 simulation
10 simulation10 simulation
10 simulation
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

24 Spam

  • 1. Stat405 Regular expressions Hadley Wickham Wednesday, 18 November 2009
  • 2. 1. Review of last time 2. Regular expressions 3. Other string processing functions Wednesday, 18 November 2009
  • 4. http://www.youtube.com/ watch?v=wfmvkO5x6Ng Wednesday, 18 November 2009
  • 5. Special characters • Use to “escape” special characters • " = " • n = new line • = • t = tab • ?Quotes for more Wednesday, 18 November 2009
  • 6. Recap load("email.rdata") str(contents) # second 100 are spam # reinstall because I've made some # improvements install.packages("stringr") Wednesday, 18 November 2009
  • 7. breaks <- str_locate(contents, "nn") # Remove invalid emails valid <- !is.na(breaks[, "start"]) contents <- contents[valid] breaks <- breaks[valid, ] # Extract headers and bodies header <- str_sub(contents, end = breaks[, 1]) body <- str_sub(contents, start = breaks[, 2]) Wednesday, 18 November 2009
  • 8. str_split(h, "n")[[1]] lines <- str_split(h, "n") continued <- str_sub(lines, 1, 1) %in% c(" ", "t") # This is a neat trick! groups <- cumsum(!continued) fields <- rep(NA, max(groups)) for (i in seq_along(fields)) { fields[i] <- str_join(lines[groups == i], collapse = "n") } str_split_fixed(fields, ": ", 2) Wednesday, 18 November 2009
  • 9. Recap We split the content into header and body. And split up the header into fields. Both of these tasks used fixed strings What if the pattern we need to match is more complicated? Wednesday, 18 November 2009
  • 10. Matching challenges • How could we match a phone number? • How could we match a date? • How could we match a time? • How could we match an amount of money? • How could we match an email address? Wednesday, 18 November 2009
  • 11. Pattern matching Each of these types of data have a fairly regular pattern that we can easily pick out by eye Today we are going to learn about regular expressions, which are a pattern matching language Almost every programming language has regular expressions, often with a slightly different dialect. Very very useful Wednesday, 18 November 2009
  • 12. First challenge • Matching phone numbers • How are phone numbers normally written? • How can we describe this format? • How can we extract the phone numbers? Wednesday, 18 November 2009
  • 13. Nothing. It is a generic folder that has Enron Global Markets on the cover. It is the one that I sent you to match your insert to when you were designing. With the dots. I am on my way over for a meeting, I'll bring one. Juli Salvagio Manager, Marketing Communications Public Relations Enron-3641 1400 Smith Street Houston, TX 77002 713-345-2908-Phone 713-542-0103-Mobile 713-646-5800-Fax Wednesday, 18 November 2009
  • 14. Mark, Good speaking with you. I'll follow up when I get your email. Thanks, Rosanna Rosanna Migliaccio Vice President Robert Walters Associates (212) 704-9900 Fax: (212) 704-4312 mailto:rosanna@robertwalters.com http://www.robertwalters.com Wednesday, 18 November 2009
  • 15. Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by NAHOU-MSAPP01S.corp.enron.com with Microsoft SMTPSVC(5.0.2195.2966); Mon, 1 Oct 2001 14:39:38 -0500 x-mimeole: Produced By Microsoft Exchange V6.0.4712.0 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; Content-Transfer-Encoding: binary Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED Date: Mon, 1 Oct 2001 14:39:38 -0500 Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU- MSMBX07V.corp.enron.com> Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q== X-Priority: 1 Priority: Urgent Importance: high From: "Landau, Georgi" <Georgi.Landau@ENRON.com> To: "Bailey, Susan" <Susan.Bailey@ENRON.com>, "Boyd, Samantha" <Samantha.Boyd@ENRON.com>, "Heard, Marie" <Marie.Heard@ENRON.com>, "Jones, Tana" <Tana.Jones@ENRON.com>, "Panus, Stephanie" <Stephanie.Panus@ENRON.com> Return-Path: Georgi.Landau@ENRON.com What’s the pattern? Wednesday, 18 November 2009
  • 16. Your turn Try and write down a description (in words) of the usual format of phone numbers. Wednesday, 18 November 2009
  • 17. Phone numbers • 10 digits, normally grouped 3-3-4 • Separated by space, - or () • Format: • [0-9]: matches any number between 0 and 9 • [- ()]: matches -, space, ( or ) • How can we match a phone number? Wednesday, 18 November 2009
  • 18. phone <- "[0-9][0-9][0-9][- ()][0-9][0-9][0-9][- ()] [0-9][0-9][0-9][0-9]" phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ()][0-9]{3}" test <- body[10] cat(test) str_detect(test, phone2) str_locate(test, phone2) str_locate_all(test, phone2) str_extract(test, phone2) str_extract_all(test, phone2) Wednesday, 18 November 2009
  • 19. Qualifier ≥ ≤ ? 0 1 + 1 Inf * 0 Inf {m,n} m n {,n} 0 n {m,} m Inf Wednesday, 18 November 2009
  • 20. # What do these regular expression match? mystery1 <- "[0-9]{5}(-[0-9]{4})?" mystery3 <- "[0-9]{3}-[0-9]{2}-[0-9]{4}" mystery4 <- "[A-Z0-9._%+-]+@[A-Z0-9.-]+.[A-Z]{2,4}" mystery5 <- "https?://[a-z]+([a-z0-9-]*[a-z0-9]+)?(.([a-z]+ ([a-z0-9-]*[a-z0-9]+)?)+)*" # Think about them first, then input to # http://xenon.stanford.edu/~xusch/regexp/analyzer.html # (select java for language - it's closest to R for regexps) Wednesday, 18 November 2009
  • 21. New features • () group parts of a regular expression • . matches any character (. matches a .) • d matches a digit, s matches a space • Other character that need to be escaped: $, ^ Wednesday, 18 November 2009
  • 22. Regular expression escapes • Tricky, because we are writing strings, but the regular expression is the contents of the string. • "a.b" is regular expression a.b • "a.b" is an error • "" is the regular expression which matches a Wednesday, 18 November 2009
  • 23. Your turn • Improve our initial regular expression for matching a phone number • Have a go at writing a regular expression to match sums of money • Hint: use the tools linked from the website to help Wednesday, 18 November 2009
  • 24. phone3 <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{3}" extract(phone3, bodies) phone4 <- "[0-9][0-9+() -]{6,}[0-9]" extract(phone4, bodies) money <- "$[0-9, ]+[0-9]" extract(money, bodies) Wednesday, 18 November 2009
  • 25. Regexp String Matches . "." Any single character . "." . d "d" Any digit s "s" Any white space " """ " "" b "b" Word border Wednesday, 18 November 2009
  • 26. Your turn • Create a regular expression to match a date. Test it against the following cases: c("10/14/1979", "20/20/1945", "1/1/1905" "5/5/5") • Create a regular expression to match a time. Test it against the following cases: c("12:30 pm", "2:15 AM", "312:23 pm", "1:00 american", "08:20 am") Wednesday, 18 November 2009
  • 27. Your turn • Create a regular expression that matches dates like "12 April 2008" • Make sure your expression works when the date is in a sentence: "On 12 April 2008, I went to town." • Extract just the month (hint: use strsplit) Wednesday, 18 November 2009
  • 28. test <- c("15 April 2005", "28 February 1982", "113 July 1984") dates <- extract("b[0-9]{2} [a-zA-Z]+ [0-9]{4}", test) date <- dates[[1]] strsplit(date, " ")[[1]][2] Wednesday, 18 November 2009
  • 29. Regexp String Matches [abc] "[abc]" a, b, or c [a-c] "[a-c]" a, b, or c [ac-] "[ac-"] a, c, or - [.] "[.]" . [.] "[.]" . or [ae-g.] "[ae-g.] a, e, f, g, or . Wednesday, 18 November 2009
  • 30. Regexp String Matches [^abc] "[^abc]" Not a, b, or c [^a-c] "[^a-c]" Not a, b, or c [ac^] "[ac^]" a, c, or ^ Wednesday, 18 November 2009
  • 31. Regexp String Matches ^a "^a" a at start of string a$ "a$" a at end of string ^a$ "^a$" complete string = a $a "$a" $a Wednesday, 18 November 2009
  • 32. Your turn Write a regular expression to match strings of the form $5,000. Extract them with str_extract_all() and then use str_replace() to remove the $. Convert to a number. Wednesday, 18 November 2009
  • 33. test <- c("$10,000,000", "$5.00 is owed", "I owe $4,356.45") money <- extract("$[0-9]{1,3}(,[0-9]{3})*", test) m <- money[[1]] as.numeric(gsub("[$,]", "", m)) Wednesday, 18 November 2009
  • 34. Function Parameters Result str_detect string, pattern logical vector str_locate string, pattern list of matrices list of character str_extract string, pattern vectors string, pattern, str_replace replacement character vector list of character str_split string, pattern vectors Wednesday, 18 November 2009
  • 35. Atomic vector List str_detect str_locate str_locate_all str_extract str_extract_all str_replace str_split_fixed str_split Wednesday, 18 November 2009