Rbootcamp Day 5

Strings
Regular
Expressions
For loop
Preprocessing
tm Package
RBootcamp
Day 5
Olga Scrivner and Jefferson Davis
Assistant Nilima Sahoo

Outline
1. Strings
2. Regular expressions
3. For loops and if statements
4. Text preprocessing

Useful Libraries - String Manipulation
stringi, tau - text encoding, string searching
stringr - character manipulation, pattern matching
koRpus - language detection, hyphenation
tesseract - optical character recognition (OCR)
tokenizers - split into tokens, n-grams
qdap - analysis of transcript data

Useful Libraries
quanteda - text analysis
tm - a comprehensive text mining framework
tm.plugin.webmining - import XML, JSON, HTML
openNLP - a collection of NLP tools:
  POS tagger
  tokenizer
  syntactic parser
  named-entity detector

Materials
1. Download Alice from DataCamp Day 4 under Files
2. or from the link
   http://cl.indiana.edu/~obscrivn/docs/AliceChapter1.txt
3. Set the working directory to the folder with the Alice file:
   Session → Set Working Directory → Choose Directory → folder’s name

Importing Text - readLines()
file.txt <- "AliceChapter1.txt"
text <- readLines(file.txt)
head(text)   # first 6 lines
tail(text)   # last 6 lines
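The same steps can be tried without the course file. This is a minimal sketch on a throwaway file; the file contents and the names sample.file and sample.text are assumptions made for illustration only.

```r
# Create a small temporary file (illustrative content, not the course file)
sample.file <- tempfile(fileext = ".txt")
writeLines(c("Alice was beginning to get very tired",
             "of sitting by her sister on the bank,",
             "and of having nothing to do."), sample.file)

# readLines() returns a character vector with one element per line
sample.text <- readLines(sample.file)
length(sample.text)        # number of lines read
head(sample.text, n = 2)   # first two lines
tail(sample.text, n = 1)   # last line
```

The optional n argument of head() and tail() changes how many lines are shown; the default is 6.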

readLines() vs scan()
R can read any text using readLines() or scan()
1. readLines (type ?readLines in the console)
2. scan - more options
3. scan requires a data type specification; by default it assumes that you have numbers

Scan
Split by words
text.scan <- scan(file.txt, character())
head(text.scan)
Split by a new line
text.scan <- scan(file.txt, character(), sep = "\n")
head(text.scan)
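The difference between the two scan() calls is easiest to see on a tiny file. A minimal sketch, assuming a made-up two-line file; the names sample.file, by.word and by.line are illustrative and not part of the course materials.

```r
# Two short lines written to a temporary file
sample.file <- tempfile(fileext = ".txt")
writeLines(c("Down the rabbit hole", "Alice was tired"), sample.file)

# Default separator is whitespace: each word becomes one element
by.word <- scan(sample.file, what = character(), quiet = TRUE)

# sep = "\n": each line becomes one element
by.line <- scan(sample.file, what = character(), sep = "\n", quiet = TRUE)

length(by.word)   # 7 words
length(by.line)   # 2 lines
```

quiet = TRUE only suppresses the "Read N items" message; it does not change what is read.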

Writing into Files - cat() and writeLines()
Let’s extract the first 10 lines from text.scan
text.extract <- text.scan[1:10]
1. cat()
   concatenates the elements of a vector, separated by a space by default
   option sep - the separator written between elements
   encoding depends on your computer
   cat(text.extract, file = "file1.txt")
   cat(text.extract, file = "file2.txt", sep = "\n")
   Elements are separated by a new line ("\n")
2. writeLines()
   writeLines(text.extract, con = "file3.txt", sep = "\n")
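To check what each call actually writes, the files can be read back. A minimal sketch using temporary files; the vector lines.out and the file names are assumptions for illustration.

```r
lines.out <- c("first line", "second line", "third line")
f.cat   <- tempfile(fileext = ".txt")
f.nl    <- tempfile(fileext = ".txt")
f.write <- tempfile(fileext = ".txt")

cat(lines.out, file = f.cat)              # elements joined by spaces on one line
cat(lines.out, file = f.nl, sep = "\n")   # one element per line, no final newline
writeLines(lines.out, con = f.write)      # one element per line, newline after each

# Read the files back; warn = FALSE silences the "incomplete final line" warning
length(readLines(f.cat, warn = FALSE))    # 1
length(readLines(f.nl, warn = FALSE))     # 3
length(readLines(f.write))                # 3
```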

Working with Text
Let’s find the lines with Alice
grep("Alice", text.scan, value = TRUE)    # returns the matching lines
grep("Alice", text.scan, value = FALSE)   # returns the positions of the matching lines
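The two forms of grep() can be compared on a short vector. A minimal sketch; sample.lines is made-up text, and grepl() is shown as a related base R function that the slides do not cover.

```r
sample.lines <- c("Alice was beginning to get very tired",
                  "of sitting by her sister on the bank",
                  "thought Alice, without pictures or conversations")

hits.text  <- grep("Alice", sample.lines, value = TRUE)   # the matching lines themselves
hits.index <- grep("Alice", sample.lines)                 # value = FALSE is the default: positions
hits.flag  <- grepl("Alice", sample.lines)                # one TRUE/FALSE per line

hits.index   # 1 3
hits.flag    # TRUE FALSE TRUE
```

The positions returned with value = FALSE are what the KWIC slides later use to index into the text.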

Replacement - gsub()
The gsub() function replaces all matches of a pattern in a string

How to replace all the numbers?
The answer is regular expressions!
http://www.endmemo.com/program/R/gsub.php
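One possible answer, sketched on a made-up sentence (the variable names are illustrative): a character range such as [0-9] matches any digit, and adding + matches a whole run of digits at once.

```r
sentence <- "Chapter 1 has 12 pages and 345 words"

# Literal replacement: every occurrence of the word "pages"
gsub("pages", "sheets", sentence)

# Remove every digit, one at a time
no.digits <- gsub("[0-9]", "", sentence)

# Replace each run of digits with a single placeholder
masked <- gsub("[0-9]+", "N", sentence)

no.digits   # "Chapter  has  pages and  words"
masked      # "Chapter N has N pages and N words"
```

Removing digits leaves double spaces behind; those can be collapsed with another gsub() call such as gsub(" +", " ", no.digits).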

Regular Expression - Operators
strings <- c("a", "ab", "acb", "accb", "cccb", 12)
Which strings does each pattern match?
1. ac*b    (c zero or more times)
2. ac+b    (c one or more times)
3. ac?b    (c zero or one time)
4. ac{2}b  (c exactly two times)

Regular Expression - Answers
grep("ac*b", strings, value = TRUE)
grep("ac+b", strings, value = TRUE)
grep("ac?b", strings, value = TRUE)
grep("ac{2}b", strings, value = TRUE)
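Running the four calls on the strings vector from the previous slide gives the results noted in the comments. This sketch only stores each result under an illustrative name so the outputs can be compared side by side.

```r
strings <- c("a", "ab", "acb", "accb", "cccb", 12)   # 12 is coerced to the string "12"

star     <- grep("ac*b",   strings, value = TRUE)   # "ab" "acb" "accb"
plus     <- grep("ac+b",   strings, value = TRUE)   # "acb" "accb"
optional <- grep("ac?b",   strings, value = TRUE)   # "ab" "acb"
exactly2 <- grep("ac{2}b", strings, value = TRUE)   # "accb"
```

Note that "cccb" never matches because every pattern requires an "a", and "a" alone never matches because every pattern requires a "b".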

Regular Expression - Character Lists
1. a.b    (a, then any single character, then b)
2. [a-z]  (any lowercase letter)
3. [0-9]  (any digit)

grep("a.b", strings, value = TRUE)
grep("[a-z]", strings, value = TRUE)
grep("[0-9]", strings, value = TRUE)
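The dot and the bracketed lists behave differently, which a second small vector makes visible. A minimal sketch; the vector words and the result names are assumptions chosen for illustration.

```r
words <- c("cat", "cot", "cut", "coat", "ct", "cat9")

any.char  <- grep("c.t",    words, value = TRUE)   # exactly one character between c and t
a.or.o    <- grep("c[ao]t", words, value = TRUE)   # only a or o between c and t
has.digit <- grep("[0-9]",  words, value = TRUE)   # at least one digit anywhere

any.char    # "cat" "cot" "cut" "cat9"
a.or.o      # "cat" "cot" "cat9"
has.digit   # "cat9"
```

"coat" and "ct" fail the first pattern because the dot stands for exactly one character, no more and no fewer.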

Regular Expression - Character Classes
grep("[[:alnum:]]", strings, value = TRUE)
1. Find all alphabetic characters
2. Find all numeric characters

grep("[[:alpha:]]", strings, value = TRUE)
grep("[[:digit:]]", strings, value = TRUE)
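The POSIX character classes can be compared on a mixed vector. A minimal sketch; tokens is made-up data, and [[:punct:]] and [[:space:]] are standard classes added here for comparison.

```r
tokens <- c("Alice", "42", "rabbit-hole", "!?", "Room 101")

with.alpha <- grep("[[:alpha:]]", tokens, value = TRUE)   # contains a letter
with.digit <- grep("[[:digit:]]", tokens, value = TRUE)   # contains a digit
with.punct <- grep("[[:punct:]]", tokens, value = TRUE)   # contains punctuation
with.space <- grep("[[:space:]]", tokens, value = TRUE)   # contains whitespace

with.alpha   # "Alice" "rabbit-hole" "Room 101"
with.digit   # "42" "Room 101"
with.punct   # "rabbit-hole" "!?"
with.space   # "Room 101"
```

The class name must be spelled exactly: [[:alnum:]] is valid, while a misspelling such as [[:alphanum:]] raises an "invalid character class" error.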

Split - strsplit()
string <- "My short story"
strings <- unlist(strsplit(string, " "))
How to print each string in a sequence? For loop!
for (i in seq_along(strings)) {
  mystring <- strings[i]
  print(mystring)
}
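The unlist() call is needed because strsplit() returns a list with one element per input string. A minimal sketch with two made-up titles; the names titles, pieces and all.words are illustrative.

```r
titles <- c("My short story", "Down the rabbit hole")

pieces <- strsplit(titles, " ")   # a list: one character vector per title
class(pieces)                     # "list"
lengths(pieces)                   # 3 4

all.words <- unlist(pieces)       # flattened into a single character vector
for (w in all.words) {
  print(nchar(w))                 # loop directly over the elements
}
```

Looping over the elements themselves (for (w in all.words)) is an alternative to looping over positions when the index is not needed.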

For Loop
How to store the results of a for loop?
Create an empty vector (before the for loop)
myvector <- vector()
At the end of each iteration, store the result inside the vector
myvector[i]
for (i in seq_along(strings)) {
  mystring <- strings[i]
  print(mystring)
  myvector[i] <- mystring
}
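The same storing pattern combines naturally with an if statement. A minimal sketch, assuming an arbitrary rule (more than 2 characters counts as "long") chosen only for illustration; the names words and labels are not from the slides.

```r
words <- c("My", "short", "story")

# Preallocate the result vector with the final length
labels <- character(length(words))

for (i in seq_along(words)) {
  if (nchar(words[i]) > 2) {
    labels[i] <- "long"     # stored at the same position as the input word
  } else {
    labels[i] <- "short"
  }
}
labels   # "short" "long" "long"

# The vectorized equivalent, without an explicit loop
ifelse(nchar(words) > 2, "long", "short")
```

Starting from vector() and letting the result grow also works; preallocating is simply faster for long inputs.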

Working with many documents - for loop
input.url <- c("http://cl.indiana.edu/~obscrivn/antonyCleopatra.txt",
               "http://cl.indiana.edu/~obscrivn/comedyErrors.txt")
1. Create an empty vector:
texts <- vector()
2. for loop:
for (i in seq_along(input.url)) {
  text.scan <- scan(input.url[i], what = "character", sep = "\n")
  data <- enc2utf8(text.scan)
  data.collapse <- paste(data, collapse = " ")
  texts[i] <- data.collapse
}
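The loop works the same way on local files, which is handy when the URLs are unreachable. A minimal sketch, assuming two made-up files in a temporary folder; doc.dir, input.files and sample.texts are illustrative names.

```r
# Build a temporary folder with two small text files
doc.dir <- tempfile("docs")
dir.create(doc.dir)
writeLines(c("Nay, but this dotage", "of our general"), file.path(doc.dir, "play1.txt"))
writeLines(c("Proceed, Solinus,", "to procure my fall"), file.path(doc.dir, "play2.txt"))

# Collect the file paths (sorted alphabetically by list.files)
input.files <- list.files(doc.dir, pattern = "\\.txt$", full.names = TRUE)

sample.texts <- character(length(input.files))
for (i in seq_along(input.files)) {
  lines.i <- scan(input.files[i], what = "character", sep = "\n", quiet = TRUE)
  sample.texts[i] <- paste(lines.i, collapse = " ")   # one long string per document
}
sample.texts
```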

For Loop in Search
Let’s split texts[1] into words:
text.split <- unlist(strsplit(texts[1], " "))
Let’s find the words that contain nay:
search <- grep("nay", text.split)
How many occurrences?
length(search)

Extract KWIC (keyword in context)
Remember indices?
What is text.split[1:2]?
We need to find the word nay plus 5 words on the left and 5 words on the right
Let’s take the first nay:
search[1]
Type: text.split[search[1]]

KWIC
Let’s create a variable for the first nay position:
position <- search[1]
Let’s set up two variables for the left/right context:
left <- 5
right <- 5
Extract the KWIC:
text.split[(position - left):(position + right)]

KWIC
We need to paste our KWIC together:
Create a variable for it:
nay <- text.split[(position - left):(position + right)]
Paste:
first.kwic <- paste(nay, collapse = " ")
print(first.kwic)

Extract all Occurrences
We need a for loop:
Create an empty vector:
collect <- vector()
for loop:
for (i in seq_along(search)) {
  mysearch <- text.split[(search[i] - left):(search[i] + right)]
  mysearch.paste <- paste(mysearch, collapse = " ")
  collect[i] <- mysearch.paste
}
print(collect)
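The loop assumes every match has at least 5 words on each side; a match near the start or end of the text would produce an error or NA values. One way to guard against that, sketched here as a hypothetical helper named kwic() (not part of the slides or of any package), is to clamp the window to the text boundaries.

```r
# Hypothetical helper: keyword in context with the window clamped to the text
kwic <- function(words, pattern, left = 5, right = 5) {
  hits <- grep(pattern, words)
  out <- character(length(hits))
  for (i in seq_along(hits)) {
    start <- max(1, hits[i] - left)                # never before the first word
    end   <- min(length(words), hits[i] + right)   # never past the last word
    out[i] <- paste(words[start:end], collapse = " ")
  }
  out
}

# Made-up sample sentence with a match at the very first word
sample.words <- unlist(strsplit(
  "nay but this dotage of our general overflows the measure nay hear them", " "))
result <- kwic(sample.words, "nay", left = 2, right = 2)
result   # "nay but this" "the measure nay hear them"
```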

What is Bag of Words?
Simplest way to quantify text
Word order ignored
Term counts per document
N-grams (uni-grams, bi-grams)
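A bag of words can be built by hand in base R, which shows what the term counts look like before any package is involved. A minimal sketch on two made-up documents; all names are illustrative.

```r
docs <- c(doc1 = "the cat sat on the mat", doc2 = "the dog sat")

tokens <- strsplit(docs, " ")            # one word vector per document
vocab  <- sort(unique(unlist(tokens)))   # all distinct terms

# Count every vocabulary term in every document (word order is ignored)
counts <- sapply(tokens, function(tok) as.vector(table(factor(tok, levels = vocab))))
rownames(counts) <- vocab
colnames(counts) <- names(docs)
counts

# Bi-grams for the first document: each word pasted to its right neighbour
tok1 <- tokens[[1]]
bigrams <- paste(head(tok1, -1), tail(tok1, -1))
bigrams   # "the cat" "cat sat" "sat on" "on the" "the mat"
```

The counts matrix has terms in rows and documents in columns, the same layout as a term-document matrix.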

Preprocessing
Tokenization (splitting words)
Cleaning (lower case, punctuation)
Stemming
  works, worked → work
Filter (stopwords)
  and, the, a
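The four steps can be walked through on one sentence in base R. This is only a toy sketch: the sentence is made up, the stopword list is the three words from the slide, and the suffix-stripping rule is a naive stand-in for a real stemmer, not an actual stemming algorithm.

```r
raw <- "The rabbit worked and the hatter works"

# Cleaning and tokenization: lower case, then split on spaces
tokens <- unlist(strsplit(tolower(raw), " "))

# Filter: drop the stopwords listed on the slide
stop.list <- c("and", "the", "a")
filtered <- tokens[!tokens %in% stop.list]

# Naive "stemming": strip a final "ed" or "s" (illustration only)
stems <- sub("(ed|s)$", "", filtered)

filtered   # "rabbit" "worked" "hatter" "works"
stems      # "rabbit" "work" "hatter" "work"
```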

Working with Text: Lower Case vs Upper Case
text <- texts[1]
1. To convert this text to lowercase, type:
text.lower <- tolower(text)
2. To convert text.lower to uppercase, type:
text.upper <- toupper(text.lower)

Cleaning texts
1. Delete punctuation
data.punct <- gsub("[[:punct:]]", "", text)
2. Convert to lower case
data.lower <- tolower(data.punct)
3. Split into words
text.split <- strsplit(data.lower, " ")
4. Convert the list into a vector
text.split <- unlist(text.split)
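One side effect of this sequence is worth checking: deleting punctuation can leave double spaces, and splitting on a single space then produces empty strings. A minimal sketch on a made-up sentence, with an extra filtering step that is an addition and not part of the slide.

```r
raw <- "Alice!  Alice, wake up -- said her sister."

no.punct  <- gsub("[[:punct:]]", "", raw)           # punctuation removed, spaces kept
lower     <- tolower(no.punct)
words.raw <- unlist(strsplit(lower, " "))           # double spaces yield "" elements

sum(words.raw == "")                                # 2 empty strings

# Extra step (assumption, not on the slide): keep only non-empty tokens
words.clean <- words.raw[nzchar(words.raw)]
words.clean   # "alice" "alice" "wake" "up" "said" "her" "sister"
```

Splitting on the pattern " +" instead of " " is another way to avoid the empty strings.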

Text Mining Package - tm
Main structure - corpus
A corpus is constructed via DirSource, VectorSource, or DataframeSource
library(tm)
doc.vec <- VectorSource(text)
mycorpus <- Corpus(doc.vec)
Let’s inspect the corpus
inspect(mycorpus)

Document Collection
docs.vec <- VectorSource(texts)
mycorpora <- Corpus(docs.vec)
Inspect the corpus
inspect(mycorpora)

Preprocessing with tm
lower case
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
remove punctuation
mycorpus <- tm_map(mycorpus, removePunctuation)
remove numbers
mycorpus <- tm_map(mycorpus, removeNumbers)
strip extra whitespace
mycorpus <- tm_map(mycorpus, stripWhitespace)
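The effect of these transformations can be checked on a tiny corpus. A minimal sketch, assuming the tm package is installed; the two sentences are made up, and VCorpus() is used here instead of Corpus() as an illustrative choice so that each document can be read back with content().

```r
library(tm)

sample.docs <- c("Alice saw 2 white rabbits.", "The Rabbit saw Alice!")
corpus <- VCorpus(VectorSource(sample.docs))

corpus <- tm_map(corpus, content_transformer(tolower))   # lower case
corpus <- tm_map(corpus, removePunctuation)              # drop . and !
corpus <- tm_map(corpus, removeNumbers)                  # drop the 2
corpus <- tm_map(corpus, stripWhitespace)                # collapse the gap left by the 2

cleaned <- sapply(corpus, content)   # text of each document after cleaning
unname(cleaned)   # "alice saw white rabbits" "the rabbit saw alice"
```

stripWhitespace is applied last so that gaps left by the earlier removals are collapsed in one pass.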

Frequencies
Let’s find the words that occur at least 100 times:
TDM <- TermDocumentMatrix(mycorpus)   # build the term-document matrix first
findFreqTerms(TDM, 100)

Stop Words
We need to filter out stopwords; tm provides ready-made stopword lists (type ?stopwords).
Then check the words that occur at least 100 times:
findFreqTerms(TDM, 100)
and lower the frequency cutoff to 50:
findFreqTerms(TDM, 50)
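A minimal sketch of stopword filtering followed by a frequency cutoff, assuming the tm package is installed. The two documents are made up, and the removeWords call with stopwords("english") is one common way to filter; the slides do not show the exact line, so treat it as an assumption.

```r
library(tm)

sample.docs <- c("the rabbit saw alice",
                 "alice saw the queen and the rabbit ran")
corpus <- VCorpus(VectorSource(sample.docs))

# Remove English stopwords (assumed filtering step)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Terms in rows, documents in columns
tdm <- TermDocumentMatrix(corpus)

# Second argument is the minimum total frequency, not the number of terms returned
frequent <- findFreqTerms(tdm, lowfreq = 2)
frequent   # alice, rabbit, saw
```

Lowering lowfreq to 1 would return every remaining term, including "queen" and "ran".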

Frequent Terms
m <- as.matrix(TDM)                        # terms in rows, documents in columns
v <- sort(rowSums(m), decreasing = TRUE)   # total frequency of each term, largest first
myNames <- names(v)
d <- data.frame(word = myNames, freq = v)
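The same four lines can be traced on a small hand-made matrix, which needs no packages. A minimal sketch; the counts in toy.m are invented for illustration.

```r
# Toy term-document matrix: 3 terms (rows) by 2 documents (columns)
toy.m <- matrix(c(3, 1,
                  0, 2,
                  5, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("alice", "rabbit", "queen"), c("doc1", "doc2")))

toy.v <- sort(rowSums(toy.m), decreasing = TRUE)   # queen 6, alice 4, rabbit 2
toy.d <- data.frame(word = names(toy.v), freq = toy.v)
toy.d
```

rowSums() adds the counts across documents, so each term ends up with a single corpus-wide frequency.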

Frequent Terms
library(wordcloud)
wordcloud(d$word, d$freq, min.freq = 3)   # plot only words that occur at least 3 times

Resources
http://www.rdatamining.com/examples/text-mining
https://en.wikibooks.org/wiki/R_Programming/Text_Processing
http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
http://www.katrinerk.com/courses/words-in-a-haystack-an-introductory-statistics-course/schedule-words-in-a-haystack/r-code-the-text-mining-package
tm package

Practice - DataCamp
1. RBootcamp day 5
2. Final DataCamp Practice!
3. Assignment Text Mining with Bag of Words
