Stat405: Regular expressions

Hadley Wickham
Wednesday, 18 November 2009

1. Review of last time
2. Regular expressions
3. Other string processing functions




http://www.youtube.com/watch?v=wfmvkO5x6Ng


Special characters
                  • Use \ to “escape” special characters
                        • \" = "
                        • \n = new line
                        • \\ = \
                        • \t = tab
                  • ?Quotes for more
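
As a small illustration of these escapes (the strings here are made up for the sketch), each escape is typed as two characters but stored as one, and cat() writes the real characters where print() shows the escaped form:

```r
# Each escape sequence is typed as two characters but stored as one.
x <- c(quote = "\"", newline = "\n", backslash = "\\", tab = "\t")
lengths <- nchar(x)
print(lengths)

# print() displays the escaped form; cat() writes the actual characters.
s <- "a\tb\\c"
print(s)
cat(s, "\n")
```

Comparing the print() and cat() output side by side is usually the quickest way to check what a string really contains.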


Recap

              load("email.rdata")
              str(contents) # second 100 are spam

              # reinstall because I've made some
              # improvements
              install.packages("stringr")
              library(stringr)




     # A blank line ("\n\n") separates the header from the body
     breaks <- str_locate(contents, "\n\n")

     # Remove invalid emails
     valid <- !is.na(breaks[, "start"])
     contents <- contents[valid]
     breaks <- breaks[valid, ]

     # Extract headers and bodies
     header <- str_sub(contents, end = breaks[, 1])
     body <- str_sub(contents, start = breaks[, 2])




     str_split(h, "\n")[[1]]
     lines <- str_split(h, "\n")[[1]]

     # Lines that start with a space or tab continue the previous field
     continued <- str_sub(lines, 1, 1) %in% c(" ", "\t")

     # This is a neat trick!
     groups <- cumsum(!continued)

     fields <- rep(NA, max(groups))
     for (i in seq_along(fields)) {
       # str_c() is the current stringr name for str_join()
       fields[i] <- str_c(lines[groups == i],
         collapse = "\n")
     }

     str_split_fixed(fields, ": ", 2)

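To see why the cumsum() trick works, here is a minimal sketch on a toy header. The header text is invented for illustration, and base R functions (strsplit, substr, paste) stand in for the stringr calls so that it runs without extra packages:

```r
# A made-up header: the "To" field wraps onto a second, tab-indented line.
h <- "From: a@example.com\nTo: b@example.com,\n\tc@example.com\nSubject: Hi"

lines <- strsplit(h, "\n")[[1]]
continued <- substr(lines, 1, 1) %in% c(" ", "\t")

# Every line that is NOT a continuation starts a new group, so the
# running total of !continued labels each line with its field number.
groups <- cumsum(!continued)
print(groups)

# Glue the lines of each group back together into one field.
fields <- vapply(split(lines, groups), paste, character(1), collapse = "\n")
print(unname(fields))
```

The continuation line gets the same group number as the line before it, which is what lets the wrapped "To" field come back as a single string.
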
Recap

                  We split the content into header and
                  body, and split the header into fields.
                  Both of these tasks used fixed strings.
                  What if the pattern we need to match is
                  more complicated?




Matching challenges
                  • How could we match a phone number?
                  • How could we match a date?
                  • How could we match a time?
                  • How could we match an amount of
                    money?
                  • How could we match an email address?


Pattern matching
                  Each of these types of data has a fairly
                  regular pattern that we can easily pick out
                  by eye.
                  Today we are going to learn about regular
                  expressions, which are a pattern
                  matching language.
                  Almost every programming language has
                  regular expressions, often with a slightly
                  different dialect. They are very useful.

First challenge
                  • Matching phone numbers
                  • How are phone numbers normally
                    written?
                  • How can we describe this format?
                  • How can we extract the phone
                    numbers?


Nothing. It is a generic folder that has Enron Global Markets on
        the cover. It is the one that I sent you to match your insert to
        when you were designing. With the dots. I am on my way over for a
        meeting, I'll bring one.

        Juli Salvagio
        Manager, Marketing Communications
        Public Relations
        Enron-3641
        1400 Smith Street
        Houston, TX 77002
        713-345-2908-Phone
        713-542-0103-Mobile
        713-646-5800-Fax




Mark,
        Good speaking with you. I'll follow up when I get your email.
        Thanks,
        Rosanna

        Rosanna Migliaccio
        Vice President
        Robert Walters Associates
        (212) 704-9900
        Fax: (212) 704-4312
        mailto:rosanna@robertwalters.com
        http://www.robertwalters.com




Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by
           NAHOU-MSAPP01S.corp.enron.com with Microsoft SMTPSVC(5.0.2195.2966);
           Mon, 1 Oct 2001 14:39:38 -0500
        x-mimeole: Produced By Microsoft Exchange V6.0.4712.0
        content-class: urn:content-classes:message
        MIME-Version: 1.0
        Content-Type: text/plain;
        Content-Transfer-Encoding: binary
        Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
        Date: Mon, 1 Oct 2001 14:39:38 -0500
        Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com>
        X-MS-Has-Attach:
        X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-
           MSMBX07V.corp.enron.com>
        Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
        Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q==
        X-Priority: 1
        Priority: Urgent
        Importance: high
        From: "Landau, Georgi" <Georgi.Landau@ENRON.com>
        To: "Bailey, Susan" <Susan.Bailey@ENRON.com>,
             "Boyd, Samantha" <Samantha.Boyd@ENRON.com>,
             "Heard, Marie" <Marie.Heard@ENRON.com>,
             "Jones, Tana" <Tana.Jones@ENRON.com>,
             "Panus, Stephanie" <Stephanie.Panus@ENRON.com>
        Return-Path: Georgi.Landau@ENRON.com
                                                                  What’s the pattern?


Your turn


                  Try and write down a description (in
                  words) of the usual format of phone
                  numbers.




Phone numbers
                  • 10 digits, normally grouped 3-3-4
                  • Separated by space, - or ()
                  • Format:
                    • [0-9]: matches any digit from 0
                      to 9
                    • [- ()]: matches -, space, ( or )
                  • How can we match a phone number?


     # Written out in full: 3 digits, separator, 3 digits, separator, 4 digits
     phone <- "[0-9][0-9][0-9][- ()][0-9][0-9][0-9][- ()][0-9][0-9][0-9][0-9]"

     # The same pattern, using {n} to repeat the preceding item n times
     phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ()][0-9]{4}"

     test <- body[10]
     cat(test)

     str_detect(test, phone2)

     str_locate(test, phone2)
     str_locate_all(test, phone2)

     str_extract(test, phone2)
     str_extract_all(test, phone2)
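
A minimal sketch of what a 3-3-4 pattern with single separators does and does not pick up, using a few signature lines taken from the sample emails. It assumes the final group is four digits, and it uses base R's grepl() and regmatches() in place of str_detect() and str_extract() so that it runs without stringr:

```r
# 3 digits, one separator, 3 digits, one separator, 4 digits
phone2 <- "[0-9]{3}[- ()][0-9]{3}[- ()][0-9]{4}"

sig <- c("Houston, TX 77002",
         "713-345-2908-Phone",
         "713-542-0103-Mobile",
         "(212) 704-9900")

# Which lines contain a match?
detected <- grepl(phone2, sig)
print(detected)

# Pull out the first match from each line that has one.
hits <- regmatches(sig, regexpr(phone2, sig))
print(hits)
```

The last line is missed because ") " is two separator characters in a row, which a single [- ()] cannot absorb.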

Quantifier   At least (≥)   At most (≤)
?            0              1
+            1              Inf
*            0              Inf
{m,n}        m              n
{,n}         0              n
{m,}         m              Inf
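
The rows of this table can be checked with a small sketch. The test strings are made up, and base R's grepl() is used so that nothing beyond R itself is needed:

```r
# Strings with zero, one, two and three b's between a and c.
x <- c("ac", "abc", "abbc", "abbbc")

optional    <- grepl("ab?c", x)      # ?     : 0 or 1
one_or_more <- grepl("ab+c", x)      # +     : 1 or more
any_number  <- grepl("ab*c", x)      # *     : 0 or more
one_to_two  <- grepl("ab{1,2}c", x)  # {m,n} : between m and n
two_plus    <- grepl("ab{2,}c", x)   # {m,}  : m or more

print(rbind(optional, one_or_more, any_number, one_to_two, two_plus))
```

Each row of the printed matrix shows which of the four strings a quantifier accepts.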

        # What do these regular expressions match?

        mystery1 <- "[0-9]{5}(-[0-9]{4})?"
        mystery3 <- "[0-9]{3}-[0-9]{2}-[0-9]{4}"
        mystery4 <- "[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}"
        mystery5 <- "https?://[a-z]+([a-z0-9-]*[a-z0-9]+)?(\\.([a-z]+([a-z0-9-]*[a-z0-9]+)?)+)*"

        # Think about them first, then input to
        # http://xenon.stanford.edu/~xusch/regexp/analyzer.html
        # (select java for language - it's closest to R for regexps)




New features

                  • () group parts of a regular expression
                  • . matches any character (\. matches a .)
                  • \d matches a digit, \s matches a space
                  • Other characters that need to be
                    escaped: $, ^



Regular expression escapes
                  • Tricky, because we are writing strings,
                    but the regular expression is the
                    contents of the string.
                  • "a\\.b" is regular expression a\.b
                  • "a\.b" is an error
                  • "\\\\" is the regular expression \\ which
                    matches a \
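
The two levels of escaping are easier to see by running them. This is a sketch with made-up test strings, using base R's grepl():

```r
# The R string "a\\.b" holds four characters: a \ . b
pattern <- "a\\.b"
pattern_length <- nchar(pattern)
cat(pattern, "\n")   # shows the regular expression itself: a\.b

# Escaped dot matches only a literal dot ...
escaped <- grepl(pattern, c("a.b", "axb"))
# ... while an unescaped dot matches any character.
unescaped <- grepl("a.b", c("a.b", "axb"))

# Four backslashes in the source give the two-character regexp \\,
# which matches a single literal backslash.
has_backslash <- grepl("\\\\", c("C:\\temp", "C:/temp"))

print(list(pattern_length, escaped, unescaped, has_backslash))
```

Using cat() on a pattern before passing it to a matching function is a cheap way to confirm that the regular expression is the one you meant to write.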


Your turn

                  • Improve our initial regular expression
                    for matching a phone number
                  • Have a go at writing a regular
                    expression to match sums of money
                  • Hint: use the tools linked from the
                    website to help


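        # One or more separators between groups, so "(212) 704-9900" also matches
        phone3 <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{4}"
        str_extract_all(body, phone3)

        # Looser: a digit, then 6 or more digits or separators, then a digit
        phone4 <- "[0-9][0-9+() -]{6,}[0-9]"
        str_extract_all(body, phone4)

        # $ must be escaped to match a literal dollar sign
        money <- "\\$[0-9, ]+[0-9]"
        str_extract_all(body, money)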
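
A short sketch of the trade-off between the stricter and the looser phone pattern. The first test string comes from the sample email, the second is invented, the final group of the strict pattern is assumed to be four digits, and base R functions are used so it runs on its own:

```r
strict <- "[0-9]{3}[- ()]+[0-9]{3}[- ()]+[0-9]{4}"
loose  <- "[0-9][0-9+() -]{6,}[0-9]"

x <- c("Fax: (212) 704-4312",        # a real phone number
       "Order 20091118 shipped")     # hypothetical: a long number, not a phone

strict_hit <- grepl(strict, x)
loose_hit  <- grepl(loose, x)
loose_text <- regmatches(x, regexpr(loose, x))

print(strict_hit)
print(loose_hit)
print(loose_text)
```

The looser pattern tolerates more formats but also accepts any run of eight or more digits, so it is likely to need a second filtering step.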




Regexp   String     Matches
.        "."        Any single character
\.       "\\."      .
\d       "\\d"      Any digit
\s       "\\s"      Any white space
"        "\""       "
\\       "\\\\"     \
\b       "\\b"      Word border

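A minimal sketch of \d, \s and \b in use, with made-up strings and base R functions (gsub, strsplit, grepl):

```r
# \d: replace every digit with #
digits_masked <- gsub("\\d", "#", "Call 555-1234")

# \s: split on any single white space character (space or tab here)
parts <- strsplit("a b\tc", "\\s")[[1]]

# \b: match "cat" only as a whole word
whole_word <- grepl("\\bcat\\b", c("cat", "concatenate", "the cat sat"))

print(digits_masked)
print(parts)
print(whole_word)
```
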
Your turn
                  • Create a regular expression to match a
                    date. Test it against the following cases:
                    c("10/14/1979", "20/20/1945",
                    "1/1/1905", "5/5/5")
                  • Create a regular expression to match a
                    time. Test it against the following cases:
                    c("12:30 pm", "2:15 AM", "312:23 pm",
                    "1:00 american", "08:20 am")


Your turn
                  • Create a regular expression that
                    matches dates like "12 April 2008"
                  • Make sure your expression works when
                    the date is in a sentence: "On 12 April
                    2008, I went to town."
                  • Extract just the month (hint: use strsplit)



        test <- c("15 April 2005", "28 February 1982",
                  "113 July 1984")

        # \\b (word border) stops "113 July 1984" matching as "13 July 1984"
        dates <- str_extract_all(test, "\\b[0-9]{2} [a-zA-Z]+ [0-9]{4}")

        date <- dates[[1]]
        strsplit(date, " ")[[1]][2]




Regexp    String      Matches
[abc]     "[abc]"     a, b, or c
[a-c]     "[a-c]"     a, b, or c
[ac-]     "[ac-]"     a, c, or -
[.]       "[.]"       .
[\.]      "[\\.]"     . or \
[ae-g.]   "[ae-g.]"   a, e, f, g, or .

Regexp    String      Matches
[^abc]    "[^abc]"    Not a, b, or c
[^a-c]    "[^a-c]"    Not a, b, or c
[ac^]     "[ac^]"     a, c, or ^
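
Character classes and their negations can be tried on single characters. This is a sketch with a made-up test vector, using base R's grepl():

```r
x <- c("a", "b", "d", "-", "^", ".")

in_abc       <- grepl("[abc]", x)    # a, b, or c
in_range     <- grepl("[a-c]", x)    # same set, written as a range
literal_dash <- grepl("[ac-]", x)    # - at the end is a literal -
not_abc      <- grepl("[^abc]", x)   # ^ first negates the class
literal_hat  <- grepl("[ac^]", x)    # ^ elsewhere is a literal ^
literal_dot  <- grepl("[.]", x)      # . inside [] is a literal .

print(rbind(in_abc, in_range, literal_dash, not_abc, literal_hat, literal_dot))
```

Position inside the brackets is what decides whether - and ^ are special or literal.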



Regexp   String    Matches
^a       "^a"      a at start of string
a$       "a$"      a at end of string
^a$      "^a$"     complete string = a
\$a      "\\$a"    $a
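
A quick sketch of the anchors on a made-up test vector, again with base R's grepl():

```r
x <- c("a", "ab", "ba", "$a")

starts_with_a <- grepl("^a", x)    # a at start of string
ends_with_a   <- grepl("a$", x)    # a at end of string
exactly_a     <- grepl("^a$", x)   # the complete string is a
dollar_a      <- grepl("\\$a", x)  # escaped $: a literal dollar sign, then a

print(rbind(starts_with_a, ends_with_a, exactly_a, dollar_a))
```
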



Your turn

                  Write a regular expression to match
                  strings of the form $5,000.
                  Extract them with str_extract_all() and
                  then use str_replace() to remove the $.
                  Convert to a number.



        test <- c("$10,000,000", "$5.00 is owed", "I owe $4,356.45")
        # \\$ matches a literal $; unescaped, $ anchors the end of the string
        money <- str_extract_all(test, "\\$[0-9]{1,3}(,[0-9]{3})*")

        m <- money[[1]]
        as.numeric(gsub("[$,]", "", m))
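
One way the same idea could be applied to all three test strings at once. This sketch uses base R's regexpr() and regmatches(), and it adds an optional cents part to the pattern, which is an assumption that goes beyond the "$5,000" form the exercise asks for:

```r
test <- c("$10,000,000", "$5.00 is owed", "I owe $4,356.45")

# Assumption: "(\\.[0-9]{2})?" is added here to keep an optional ".dd" cents part.
money_pattern <- "\\$[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]{2})?"

# First match in each string
found <- regmatches(test, regexpr(money_pattern, test))

# Strip $ and thousands separators, then convert to numbers
amounts <- as.numeric(gsub("[$,]", "", found))

print(found)
print(amounts)
```

Inside [ ] the $ needs no escape, which is why the cleanup pattern "[$,]" works as written.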




Function      Parameters                     Result
str_detect    string, pattern                logical vector
str_locate    string, pattern                integer matrix (start, end)
str_extract   string, pattern                character vector
str_replace   string, pattern, replacement   character vector
str_split     string, pattern                list of character vectors

Atomic vector     List
str_detect
str_locate        str_locate_all
str_extract       str_extract_all
str_replace
str_split_fixed   str_split
