Transcript of "21 spam"

  1. Hadley Wickham
     Stat405
     String processing
     Thursday, 4 November 2010
  2. Next project
     Poster presentation: last week of class or finals week?
     Nov 23: Tracy Volz will be presenting on how to do a good poster.
     You need to pick your own dataset. Ideas here: http://www.delicious.com/hadley/data. Start thinking now!
  3. Motivation
     Want to try and classify spam vs. ham (non-spam) email.
     Need to process character vectors containing the complete contents of each email.
     Eventually, will create variables that will be useful for classification.
     The same techniques are very useful for data cleaning.
  4. Getting started

         load("email.rdata")
         str(contents)  # second 100 are spam

         install.packages("stringr")
         library(stringr)
         # all functions starting with str_
         # come from this package
         help(package = "stringr")
         apropos("str_")
  5. Received: from NAHOU-MSMBX07V.corp.enron.com ([192.168.110.98]) by NAHOU-MSAPP01S.corp.enron.com
      with Microsoft SMTPSVC(5.0.2195.2966); Mon, 1 Oct 2001 14:39:38 -0500
     x-mimeole: Produced By Microsoft Exchange V6.0.4712.0
     content-class: urn:content-classes:message
     MIME-Version: 1.0
     Content-Type: text/plain;
     Content-Transfer-Encoding: binary
     Subject: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
     Date: Mon, 1 Oct 2001 14:39:38 -0500
     Message-ID: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com>
     X-MS-Has-Attach:
     X-MS-TNEF-Correlator: <053C29CC8315964CB98E1BD5BD48E3080B522E@NAHOU-MSMBX07V.corp.enron.com>
     Thread-Topic: MERCATOR ENERGY INCORPORATED and ENERGY WEST INCORPORATED
     Thread-Index: AcFKsM5wASRhZ102QuKXl3U5Ww741Q==
     X-Priority: 1
     Priority: Urgent
     Importance: high
     From: "Landau, Georgi" <Georgi.Landau@ENRON.com>
     To: "Bailey, Susan" <Susan.Bailey@ENRON.com>, "Boyd, Samantha" <Samantha.Boyd@ENRON.com>, "Heard, Marie" <Marie.Heard@ENRON.com>, "Jones, Tana" <Tana.Jones@ENRON.com>, "Panus, Stephanie" <Stephanie.Panus@ENRON.com>
     Return-Path: Georgi.Landau@ENRON.com

     Please check your records and let me know if you have any kind of documentation evidencing a merger indicating that Mercator Energy Incorporated merged with and into Energy West Incorporated? I am unable to find anything to substantiate this information.
     Thanks for your help.
     Georgi Landau
     Enron Net Works
  6. ...
     Date: Sun, 11 Nov 2001 11:55:05 -0500
     Message-Id: <200503101247.j2ACloAq014654@host.high-host.com>
     To: benrobinson13@shaw.ca
     Subject: LETS DO THIS TOGTHER
     From: ben1 <ben_1_wills@yahoo.com.au>
     X-Priority: 3 (Normal)
     CC:
     Mime-Version: 1.0
     Content-Type: text/plain; charset=us-ascii
     Content-Transfer-Encoding: 7bit
     X-Mailer: RLSP Mailer

     Dear Friend,
     Firstly, not to caurse you embarrassment, I am Barrister Benson Wills, a Solicitor at law and the personal attorney to late Mr. Mark Michelle a National of France, who used to be a private contractor with the Shell Petroleum Development Company in Saudi Arabia, herein after shall be referred to as my client. On the 21st of April 2001, him and his wife with their three children were involved in an auto crash, all occupants of the vehicle unfortunately lost their lives. Since then, I have made several enquiries with his country's embassies to locate any of my clients extended relatives, this has also proved unsuccessful. After these several unsuccessful attempts, I decided to contact you with this business partnership proposal. I have contacted you to assist in repatriating a huge amount of money left behind by my client before they get confiscated or declared unserviceable by the Finance/Security Company where these huge deposit was lodged. The deceased had a deposit valued presently at £30milion Pounds Sterling and the Company has issued me a notice to provide his next of kin or Beneficiary by Will otherwise have the account confiscated within the next thirty official working days. Since I have been unsuccessful in locating the relatives for over two years now, I seek your consent to present you as the next of kin/Will Beneficiary to the deceased so that the proceeds of this
  7. Structure of an email
     Headers give metadata
     Empty line
     Body of email
     The other major complications are attachments and alternative content, but we’ll ignore those for class.
  8. Tasks
     • Split into header and contents
     • Split header into fields
     • Generate useful variables for distinguishing between spam and non-spam
  9. String basics
  10. (no text on this slide)
  11. http://www.youtube.com/watch?v=ejweI0EQpX8
  12. Special characters

          # Special characters
          a <- "\\"
          b <- "\""
          c <- "a\nb\nc"

          a
          cat(a, "\n")
          b
          cat(b, "\n")
          c
          cat(c, "\n")
  13. Special characters
      • Use \ to “escape” special characters
      • \" = "
      • \n = new line
      • \\ = \
      • \t = tab
      • ?Quotes for more
  14. Displaying strings
      print() will display the quoted form.
      cat() will display the actual contents (but you need to add a newline to the end).
      (But it is generally better to use message() if you want to send a message to the user of your function.)
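To make the difference between the typed form and the stored contents concrete, here is a minimal sketch. The strings and variable names are illustrative assumptions, not taken from the course data.

```r
# Illustrative strings only; none of these come from the course data.
backslash    <- "\\"             # stores a single backslash character
double_quote <- "\""             # stores a single double-quote character
two_lines    <- "first\nsecond"  # stores a newline between the two words

print(two_lines)      # quoted form: the escape is shown as typed
cat(two_lines, "\n")  # actual contents: printed over two lines

# nchar() counts characters in the stored contents, not in the typed form
n_chars <- c(nchar(backslash), nchar(double_quote), nchar(two_lines))
n_chars
```

Each escape sequence takes two keystrokes to type but is stored as one character, which is why the counts are smaller than the typed forms suggest.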
  15. Your turn
      Create a string for each of the following strings:
          :-\
          (^_^")
          @_'-'
          \m/
      Create a multiline string.
      Compare the output from print and cat.
  16. Solution

          a <- ":-\\"
          b <- "(^_^\")"
          c <- "@_'-'"
          d <- "\\m/"
          e <- "This string\ngoes over\nmultiple lines"

          a; b; c; d; e
          cat(str_c(a, b, c, d, e, "\n", sep = "\n"))
  17. Back to the problem
  18. Header vs. content
      Need to split the string into two pieces, based on the location of the double line break:
          str_locate(string, pattern)
      Splitting = creating two substrings, one to the right, one to the left:
          str_sub(string, start, end)
  19. Examples

          str_locate("great", "a")
          str_locate("fantastic", "a")
          str_locate("super", "a")

          superlatives <- c("great", "fantastic", "super")
          res <- str_locate(superlatives, "a")
          str(res)
          str(str_locate_all(superlatives, "a"))

          str_sub("testing", 1, 3)
          str_sub("testing", start = 4)
          str_sub("testing", end = 4)

          input <- c("abc", "defg")
          str_sub(input, c(2, 3))
  20. Your turn
      Use str_locate() to identify the location where the double break is (make sure to check a few!).
      Split the emails into header and content with str_sub().
  21. Solution

          breaks <- str_locate(contents, "\n\n")

          # Remove invalid emails
          valid <- !is.na(breaks[, "start"])
          contents <- contents[valid]
          breaks <- breaks[valid, ]

          # Extract headers and bodies
          header <- str_sub(contents, end = breaks[, 1])
          body <- str_sub(contents, start = breaks[, 2])
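The header/body split can be tried on a tiny input without `email.rdata`. The sketch below assumes a made-up two-element vector `toy` standing in for `contents`. It also makes two choices of its own: `drop = FALSE` keeps `breaks` a matrix when only one message survives, and the `- 1` / `+ 1` offsets leave the blank-line break out of both pieces.

```r
library(stringr)

# Hypothetical stand-in for `contents`: one well-formed email, one invalid
toy <- c(
  "From: a@example.com\nSubject: Hello\n\nBody line one\nBody line two",
  "no blank line here"
)

# One row per string; columns "start" and "end"; NA where there is no match
breaks <- str_locate(toy, "\n\n")

# Keep only messages that contain a double line break
valid  <- !is.na(breaks[, "start"])
toy    <- toy[valid]
breaks <- breaks[valid, , drop = FALSE]

# Everything before the break is the header, everything after is the body
header <- str_sub(toy, end = breaks[, "start"] - 1)
body   <- str_sub(toy, start = breaks[, "end"] + 1)

cat(header, "\n")
cat(body, "\n")
```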
  22. Headers
      • Each header starts at the beginning of a new line
      • Each header is composed of a name and contents, separated by a colon
  23. A first attempt

          h <- header[2]

          # Does this work?
          str_split(h, "\n")[[1]]

          # Why / why not?
          # How could you fix the problem?
  24. Split & patch up

          # Split & patch up
          lines <- str_split(h, "\n")[[1]]
          continued <- str_sub(lines, 1, 1) %in% c(" ", "\t")

          # This is a neat trick!
          groups <- cumsum(!continued)

          fields <- rep(NA, max(groups))
          for (i in seq_along(fields)) {
            fields[i] <- str_c(lines[groups == i], collapse = "\n")
          }
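The `cumsum(!continued)` trick is easier to follow on a small input. The sketch below uses a made-up header `h` in which the first field continues over two further lines, one starting with a tab and one with a space. It replaces the `for` loop with `split()` and `vapply()`, which is an alternative chosen for this sketch rather than the method on the slide.

```r
library(stringr)

# Hypothetical header: the Received field spans three physical lines
h <- paste(
  "Received: from host.example.com",
  "\tby other.example.com;",
  " Mon, 1 Oct 2001",
  "Subject: Hello",
  "To: a@example.com",
  sep = "\n"
)

lines <- str_split(h, "\n")[[1]]

# A line that starts with a space or tab continues the previous field
continued <- str_sub(lines, 1, 1) %in% c(" ", "\t")

# Each non-continued line starts a new group, so the running count of
# non-continued lines labels every line with the field it belongs to
groups <- cumsum(!continued)

# Paste the lines of each group back together into one field
fields <- unname(
  vapply(split(lines, groups), str_c, character(1), collapse = "\n")
)

groups
fields
```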
  25. Your turn
      Write a small function that, given a single header field, splits it into name and contents.
      Do you want to use str_split(), or str_locate() & str_sub()?
      Remember to get the algorithm working before you write the function.
  26. Solutions

          test1 <- "Sender: <Lighthouse@independent.org>"
          test2 <- "Subject: Alice: Where is my coffee?"

          f1 <- function(input) {
            str_split(input, ": ")[[1]]
          }

          f2 <- function(input) {
            colon <- str_locate(input, ": ")
            c(
              str_sub(input, end = colon[, 1] - 1),
              str_sub(input, start = colon[, 2] + 1)
            )
          }

          f3 <- function(input) {
            str_split_fixed(input, ": ", 2)[1, ]
          }
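The second test string shows why the choice of function matters: it contains two `": "` separators. The sketch below reuses the two test strings and, as an assumption beyond the slides, applies `str_split_fixed()` to a whole vector of fields at once and stores the result as a named character vector `header_values`.

```r
library(stringr)

fields <- c(
  "Sender: <Lighthouse@independent.org>",
  "Subject: Alice: Where is my coffee?"
)

# str_split() breaks at every ": ", so the subject is cut into too many pieces
pieces_all <- str_split(fields[2], ": ")[[1]]

# str_split_fixed() with n = 2 splits at the first ": " only and returns
# a matrix with one row per field: column 1 = name, column 2 = contents
parts <- str_split_fixed(fields, ": ", 2)

# Named vector: look up contents by header name (hypothetical convenience)
header_values <- setNames(parts[, 2], parts[, 1])

length(pieces_all)
header_values[["Subject"]]
```

A named vector assumes each header name appears once. Real emails can repeat names such as Received, in which case keeping the two-column matrix is the safer choice.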
  27. Next time
      We split the content into header and body, and split the header into fields.
      Both of these tasks used fixed strings.
      What if the pattern we need to match is more complicated?