3. Strings
Regular
Expressions
For loop
Preprocessing
tm Package
Useful Libraries - String Manipulation
stringi, tau - text encoding, string searching
stringr - character manipulation, pattern matching
koRpus - language detection, hyphenation
tesseract - optical character recognition (OCR)
tokenizers - split text into tokens, n-grams
qdap - analysis of transcript data
Useful Libraries
quanteda - text analysis
tm - a comprehensive text mining framework
tm.plugin.webmining - import XML, JSON, HTML
openNLP - a collection of NLP tools:
  POS tagger
  tokenizer
  syntactic parser
  named-entity detector
Materials
1 Download Alice from DataCamp Day 4 under Files
2 or from the link
http://cl.indiana.edu/~obscrivn/docs/AliceChapter1.txt
3 Set the working directory to the folder with the Alice file:
Session → Set Working Directory → Choose Directory → folder’s name
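The menu steps for the working directory can also be done from the console. The following is a minimal sketch, assuming a placeholder folder: tempdir() stands in for the real folder that holds the Alice file, and alice.dir, old.dir and has.alice are names made up for this illustration.

```r
# Hypothetical folder: replace tempdir() with the folder holding AliceChapter1.txt
alice.dir <- tempdir()

# setwd() changes the working directory and returns the previous one
old.dir <- setwd(alice.dir)

# Check where we are and whether the file is visible from here
print(getwd())
has.alice <- file.exists("AliceChapter1.txt")
print(has.alice)

# Go back to the previous working directory
setwd(old.dir)
```

If has.alice is FALSE, the file is not in the chosen folder and readLines() or scan() would not find it by its bare file name.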
ReadLines vs Scan
R can read any text file with readLines() or scan()
1 readLines() reads a file line by line (type ?readLines in the console)
2 scan() offers more options
3 scan() requires a data type specification (the what argument); by default it assumes that the data are numbers
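To make the difference concrete, here is a small sketch that reads the same file both ways. The two-line sample file is made up for this illustration and stands in for the Alice text; text.lines and text.words are illustrative names.

```r
# Hypothetical sample file standing in for the Alice text
tmp <- tempfile(fileext = ".txt")
writeLines(c("Alice was beginning to get very tired",
             "of sitting by her sister"), tmp)

# readLines() returns one character string per line
text.lines <- readLines(tmp)

# scan() needs what = "character" for text; sep = "\n" keeps whole lines
text.scan <- scan(tmp, what = "character", sep = "\n", quiet = TRUE)

# Without sep, scan() splits on whitespace and returns single words
text.words <- scan(tmp, what = "character", quiet = TRUE)

print(length(text.lines))   # number of lines
print(length(text.words))   # number of words
```

With sep = "\n" the result of scan() matches readLines(); leaving sep out is a quick way to get a vector of words instead.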
Writing into Files - cat() and writeLines()
Let’s extract the first 10 lines from text.scan
text.extract <- text.scan[1:10]
1 cat()
concatenates the elements of a vector by default
option sep - the separator between elements (a single space by default)
encoding depends on your computer
cat(text.extract, file = "file1.txt")
cat(text.extract, file = "file2.txt", sep = "\n")
Elements are separated by a new line ("\n")
2 writeLines()
writeLines(text.extract, con = "file3.txt", sep = "\n")
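The effect of sep is easiest to see by writing a short vector and reading it back. This is a sketch under stated assumptions: the three-element text.extract is a made-up stand-in for the real extract, and temporary files are used instead of file1.txt, file2.txt and file3.txt.

```r
# Hypothetical stand-in for the 10-line extract
text.extract <- c("first line", "second line", "third line")

file.space <- tempfile(fileext = ".txt")
file.cat   <- tempfile(fileext = ".txt")
file.wl    <- tempfile(fileext = ".txt")

# Default sep = " ": everything ends up on one line
cat(text.extract, file = file.space)

# sep = "\n": one element per line (no newline after the last element)
cat(text.extract, file = file.cat, sep = "\n")

# writeLines() uses sep = "\n" by default and ends every element with it
writeLines(text.extract, con = file.wl)

# warn = FALSE silences the "incomplete final line" warning for cat() output
lines.space <- readLines(file.space, warn = FALSE)
lines.cat   <- readLines(file.cat, warn = FALSE)
lines.wl    <- readLines(file.wl)

print(lines.space)
print(lines.cat)
print(lines.wl)
```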
Split - strsplit()
string <- "My short story"
strings <- unlist(strsplit(string, " "))
How do we print each string in sequence? With a for loop!
for (i in seq_along(strings)) {
  mystring <- strings[i]
  print(mystring)
}
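Index loops can be written with 1:length(x) or with seq_along(x). The two behave the same except for empty vectors, which the sketch below illustrates; empty and count are names made up for this example.

```r
# An empty character vector, e.g. the result of a search with no matches
empty <- character(0)

# 1:length(empty) is 1:0, which counts down: 1, 0
idx.colon <- 1:length(empty)

# seq_along(empty) is an empty integer vector
idx.seq <- seq_along(empty)

# A loop over seq_along() of an empty vector never runs
count <- 0
for (i in seq_along(empty)) {
  count <- count + 1
}

print(idx.colon)
print(idx.seq)
print(count)
```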
For Loop
How do we store the results of a for loop?
Create an empty vector (before the for loop)
myvector <- vector()
At the end of each iteration, store the result inside the vector
myvector[i] <- mystring
myvector <- vector()
for (i in seq_along(strings)) {
  mystring <- strings[i]
  print(mystring)
  myvector[i] <- mystring
}
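As a variation on the pattern above, the result vector can be created at its final length up front instead of growing inside the loop. This is only a sketch: toupper() is an arbitrary operation chosen for illustration, and myvector2 is a made-up name.

```r
strings <- unlist(strsplit("My short story", " "))

# Preallocate a character vector with one slot per input string
myvector <- character(length(strings))

for (i in seq_along(strings)) {
  # toupper() is just an illustrative operation on each element
  myvector[i] <- toupper(strings[i])
}
print(myvector)

# Many string functions are vectorised, so the loop can often be skipped
myvector2 <- toupper(strings)
print(identical(myvector, myvector2))
```

For short vectors the difference does not matter; for long texts, preallocating or using a vectorised call avoids repeatedly resizing the vector.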
Working with many documents - FOR loop
input.url <- c("http://cl.indiana.edu/~obscrivn/antonyCleopatra.txt",
               "http://cl.indiana.edu/~obscrivn/comedyErrors.txt")
1 texts <- vector()
2 for loop:
for (i in seq_along(input.url)) {
  # read one document as a character vector, one element per line
  text.scan <- scan(input.url[i], what = "character", sep = "\n")
  data <- enc2utf8(text.scan)
  # join all lines of the document into a single string
  data.collapse <- paste(data, collapse = " ")
  texts[i] <- data.collapse
}
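The same loop can be tried offline. The sketch below assumes two small local files in place of the URLs; the file names play1.txt and play2.txt and their contents are invented for this illustration.

```r
# Hypothetical local files standing in for the two play URLs
doc.dir <- tempfile("docs")
dir.create(doc.dir)
writeLines(c("Nay, but this dotage", "of our general"),
           file.path(doc.dir, "play1.txt"))
writeLines(c("Proceed, Solinus,", "to procure my fall"),
           file.path(doc.dir, "play2.txt"))

input.files <- file.path(doc.dir, c("play1.txt", "play2.txt"))

# One slot per document
texts <- character(length(input.files))

for (i in seq_along(input.files)) {
  # Read the document line by line
  text.scan <- scan(input.files[i], what = "character", sep = "\n", quiet = TRUE)
  # Normalise the encoding and collapse the lines into one string
  texts[i] <- paste(enc2utf8(text.scan), collapse = " ")
}

print(texts)
```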
For Loop in Search
Let’s split texts[1] into words:
text.split <- unlist(strsplit(texts[1], " "))
1 Let’s find the words that contain nay:
search <- grep("nay", text.split)
2 How many occurrences?
length(search)
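grep() returns the positions of matching elements, and it matches substrings in a case-sensitive way by default. A short sketch on an invented word vector shows what that means for a pattern like nay; the hits.* names are made up for this example.

```r
# Hypothetical word vector for illustration
text.split <- c("Nay,", "but", "this", "nay", "naysayer", "day")

# Default: case-sensitive substring match, returns positions
hits.default <- grep("nay", text.split)

# ignore.case = TRUE also finds "Nay,"
hits.ignore <- grep("nay", text.split, ignore.case = TRUE)

# Anchors ^ and $ restrict the match to the whole element
hits.exact <- grep("^nay$", text.split, ignore.case = TRUE)

# value = TRUE returns the matching words instead of their positions
words <- grep("nay", text.split, value = TRUE)

print(hits.default)
print(hits.ignore)
print(hits.exact)
print(words)
```

Note that "Nay," is not an exact match because of the attached comma, which is one reason punctuation is usually removed during preprocessing.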
Extract all Occurrences
We need a for loop:
create an empty vector
collect <- vector()
for loop (left and right are the numbers of context words to keep on each side):
for (i in seq_along(search)) {
  mysearch <- text.split[(search[i] - left):(search[i] + right)]
  mysearch.paste <- paste(mysearch, collapse = " ")
  collect[i] <- mysearch.paste
}
print(collect)
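When a match sits near the beginning or end of the text, the context window can run past the edges of the vector. The sketch below is one possible way to guard against that by clamping the window; the word vector, the window sizes and the names start and end are assumptions made for this illustration.

```r
# Hypothetical word vector with a match at the very start and near the end
text.split <- c("nay", "but", "this", "dotage", "of", "our",
                "general", "nay", "more")
search <- grep("nay", text.split)

# Assumed window sizes: two words on each side
left  <- 2
right <- 2

collect <- character(length(search))
for (i in seq_along(search)) {
  # Clamp the window so it never starts before 1 or ends past the last word
  start <- max(1, search[i] - left)
  end   <- min(length(text.split), search[i] + right)
  collect[i] <- paste(text.split[start:end], collapse = " ")
}
print(collect)
```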
Working with Text: Lower Case vs Upper Case
text <- texts[1]
1 To convert this text to lowercase, type:
text.lower <- tolower(text)
2 To convert text.lower to uppercase, type:
text.upper <- toupper(text.lower)
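One reason to normalise case is that exact comparisons treat Nay and nay as different words. A minimal sketch on an invented sentence; count.raw and count.lower are illustrative names.

```r
# Hypothetical sentence for illustration
text <- "Nay but this nay"

text.lower <- tolower(text)
text.upper <- toupper(text.lower)

# Count the word "nay" before and after lowercasing
count.raw   <- sum(unlist(strsplit(text, " ")) == "nay")
count.lower <- sum(unlist(strsplit(text.lower, " ")) == "nay")

print(text.lower)
print(text.upper)
print(c(raw = count.raw, lower = count.lower))
```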
Text Mining Package - tm
Main structure - corpus
A corpus is constructed from a source: DirSource, VectorSource, or DataframeSource
library(tm)
doc.vec <- VectorSource(text)
mycorpus <- Corpus(doc.vec)
Let’s inspect the corpus
inspect(mycorpus)
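The corpus steps can be tried on a tiny example. This sketch assumes the tm package is installed (the guard skips the code otherwise), uses two invented one-line documents, and builds the corpus with VCorpus() rather than Corpus() so that each document is stored as a plain text document; n.docs and first.doc are illustrative names.

```r
# Run only if the tm package is available
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  # Hypothetical documents: each element of the vector becomes one document
  text <- c("Nay, but this dotage", "The comedy of errors")

  doc.vec  <- VectorSource(text)
  mycorpus <- VCorpus(doc.vec)

  # Number of documents and the text of the first one
  n.docs    <- length(mycorpus)
  first.doc <- as.character(mycorpus[[1]])

  inspect(mycorpus)
  print(first.doc)
}
```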