Text Mining Infrastructure in R
Presented By
Ashraf Uddin
(http://ashrafsau.blogspot.in/)
South Asian University, New Delhi, India.
29 January 2014
What is R?
 A free software environment for statistical computing and graphics.
 Open source and package based; a dialect of the S language developed at Bell Labs
 Many statistical functions are already built in
 Contributed packages expand the functionality to cutting edge research
 Implemented in C, Fortran, and R itself
What is R?
 R is the result of a collaborative effort with contributions from all over the
world
 R was initially written by Robert Gentleman and Ross Ihaka—also known as
"R & R" of the Statistics Department of the University of Auckland
 R was inspired by the S environment
 R can be extended (easily) via packages.
More about R
What R does and does not do
 is not a database, but connects to DBMSs
 the language interpreter can be very slow, but allows calling your own C/C++ code
 no professional / commercial support
Data Types in R
 numeric (integer, double, complex)
 character
 logical
 Data frame
 factor
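A quick way to see these types in action is the class() function; a minimal sketch in base R:

```r
# Inspect R's basic data types with class()
class(1L)                   # "integer"
class(2.5)                  # "numeric"
class(1 + 2i)               # "complex"
class("text")               # "character"
class(TRUE)                 # "logical"
class(data.frame(x = 1:3))  # "data.frame"
class(factor("a"))          # "factor"
```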
Contributed Packages
 Currently, the CRAN package repository features 5034 available packages
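Contributed packages are installed from CRAN once and then loaded in each session; a minimal sketch (tm is just one example package from this talk):

```r
# One-time install from the CRAN repository (commented out here)
# install.packages("tm")

# Load an installed package into the current session
library(tm)
```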
Growing users of R
Text Mining: Basics
Text: unstructured collections of words
Documents: basic units consisting of a sequence of tokens or terms
Terms: words or word roots, semantic units or phrases that are the atoms of indexing
Repositories (databases) and corpora: collections of documents
Corpus: a conceptual entity, similar to a database, for holding and managing text documents
Text mining involves computations over such collections to extract interesting information
Text Mining: Practical Applications
 Spam filtering
 Business intelligence, marketing applications: predictive analytics
 Sentiment analysis
 Text IR, indexing
 Creating suggestions and recommendations (as on Amazon)
 Monitoring public opinions (for example in blogs or review sites)
 Customer service, email support
 Automatic labeling of documents in business libraries
 Fraud detection by investigating notification of claims
 Fighting cyberbullying or cybercrime in IM and IRC chat
And many more
A List of Text Mining Tools
Text Mining Packages in R
corpora gsubfn kernlab KoNLP
koRpus lda lsa maxent
movMF openNLP qdap RcmdrPlugin.temis
RKEA RTextTools RWeka skmeans
Snowball SnowballC tau textcat
textir tm tm.plugin.dc tm.plugin.factiva
tm.plugin.mail topicmodels wordcloud
wordnet zipfR
Text Mining Packages in R
plyr: Tools for splitting, applying and combining data
class: Various functions for classification
tm: A framework for text mining applications
corpora: Statistics and data sets for corpus frequency data
Snowball: word stemmers based on the Snowball stemming algorithms
RWeka: interface to Weka, a collection of ML algorithms for data mining tasks
wordnet: interface to WordNet using the Jawbone Java API to WordNet
wordcloud: generates word clouds
textir: A suite of tools for text and sentiment mining
tau: Text Analysis Utilities
topicmodels: an interface to the C code for Latent Dirichlet Allocation (LDA)
models and Correlated Topics Models (CTM)
zipfR: Statistical models for word frequency distributions
Conceptual process in Text Mining
 organize and structure the texts (into repository)
 convenient representation (preprocessing)
 Transform texts into structured formats (e.g. TDM)
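The three steps above map directly onto the tm workflow; a minimal sketch with made-up example documents, using the tm API as it appears in these slides (newer tm versions wrap base functions like tolower in content_transformer()):

```r
library(tm)

# 1. Organize the texts into a repository: a corpus
docs <- c("Text mining with R is fun.", "R offers many text mining packages.")
corpus <- Corpus(VectorSource(docs))

# 2. Preprocess into a convenient representation
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# 3. Transform into a structured format: a term-document matrix
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```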
The framework
Documents come in different file formats and from different locations
 standardized interfaces to access the documents (sources)
Metadata gives valuable insights into the document structure
 must be able to facilitate metadata usage
To work efficiently with the documents
 must provide tools and algorithms to perform common tasks (transformations)
 and to extract patterns of interest (filtering)
Text document collections: Corpus
Constructor:
Corpus(object = ..., readerControl = list(reader = object@DefaultReader,
language = "en_US", load = FALSE))
Example:
>txt <- system.file("texts", "txt", package = "tm")
>(ovid <- Corpus(DirSource(txt), readerControl =
list(reader = readPlain, language = "la", load =
TRUE)))
A corpus with 5 text documents
Corpus: Meta Data
>meta(ovid[[1]])
Available meta data pairs are:
Author :
DateTimeStamp: 2013-11-19 18:54:04
Description :
Heading :
ID : ovid_1.txt
Language : la
Origin :
>ID(ovid[[1]])
[1] "ovid_1.txt"
Corpus: Document’s text
>ovid[[1]]
Si quis in hoc artem populo non novit amandi, hoc
legat et lecto carmine doctus amet. arte citae
veloque rates remoque moventur, arte leves currus:
arte regendus amor. curribus Automedon lentisque erat
aptus habenis, Tiphys in Haemonia puppe magister
erat: me Venus artificem tenero praefecit Amori;
Tiphys et Automedon dicar Amoris ego. ille quidem
ferus est et qui mihi saepe repugnet: sed puer est,
aetas mollis et apta regi. Phillyrides puerum cithara
perfecit Achillem, atque animos placida contudit arte
feros. qui totiens socios, totiens exterruit hostes,
creditur annosum pertimuisse senem.
Corpus: Meta Data
>c(ovid[1:2], ovid[3:4])
A corpus with 4 text documents
>length(ovid)
5
>summary(ovid)
A corpus with 5 text documents
The metadata consists of 2 tag-value pairs and a data
frame
Available tags are: create_date creator
Available variables in the data frame are: MetaID
Corpus: Meta Data
>CMetaData(ovid)
$create_date [1] "2013-11-19 18:54:04 GMT"
$creator [1] ""
>DMetaData(ovid)
MetaID
1 0
2 0
3 0
4 0
5 0
Corpus: Transformations and Filters
>getTransformations()
[1] "as.PlainTextDocument" "removeNumbers"
"removePunctuation" "removeWords"
[5] "stemDocument" "stripWhitespace"
>tm_map(ovid, FUN = tolower)
A corpus with 5 text documents
>getFilters()
[1] "searchFullText" "sFilter" "tm_intersect"
>tm_filter(ovid, FUN = searchFullText, "Venus",
doclevel = TRUE)
A corpus with 1 text document
Text Preprocessing: import
>txt <- system.file("texts", "acq", package = "tm")
>(acq <- Corpus(DirSource(txt), readerControl =
list(reader = readPlain, language = "la", load =
TRUE)))
A corpus with 50 text documents
>txt <- system.file("texts", "crude", package = "tm")
>(crude <- Corpus(DirSource(txt), readerControl =
list(reader = readPlain, language = "la", load =
TRUE)))
A corpus with 20 text documents
resulting in 50 articles of topic acq and 20 articles of topic crude
Preprocessing: stemming
 Morphological variants of a word (morphemes). Similar terms derived from
a common stem:
engineer, engineered, engineering
use, user, users, used, using
 Stemming in Information Retrieval. Grouping words with a common stem
together.
 For example, a search on reads, also finds read, reading, and readable
 Stemming consists of removing suffixes and conflating the resulting
morphemes. Occasionally, prefixes are also removed.
Preprocessing: stemming
 Reduce terms to their “roots”
 automate(s), automatic, automation all reduced to automat.
Before stemming: for example compressed and compression are both accepted as equivalent to compress.
After stemming: for exampl compress and compress ar both accept as equival to compress
Preprocessing: stemming
Typical rules in stemming:
sses → ss
ies → i
ational → ate
tional → tion
Rules sensitive to the measure (m) of the word:
(m > 1) EMENT → (removed)
replacement → replac
cement → cement
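These are rules from the Porter stemmer, which the SnowballC package (listed earlier) exposes via wordStem(). A small sketch, assuming SnowballC is installed:

```r
library(SnowballC)

# Suffix rules in action: sses -> ss, ies -> i, ational -> ate, tional -> tion
wordStem(c("caresses", "ponies", "relational", "conditional"), language = "porter")

# Measure-sensitive rule (m > 1) EMENT -> removed:
# "replacement" loses the suffix, but "cement" keeps it (its stem is too short)
wordStem(c("replacement", "cement"), language = "porter")
```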
Preprocessing: stemming
 Stemming may help recall for some queries but harm precision on others
 Fine distinctions may be lost through stemming
Preprocessing: stemming
>acq[[10]]
Gulf Applied Technologies Inc said it sold its
subsidiaries engaged in pipeline and terminal
operations for 12.2 mln dlrs. The company said the
sale is subject to certain post closing adjustments,
which it did not explain. Reuter
>stemDocument(acq[[10]])
Gulf Appli Technolog Inc said it sold it subsidiari
engag in pipelin and terminal oper for 12.2 mln dlrs.
The compani said the sale is subject to certain post
clos adjustments, which it did not explain. Reuter
>tm_map(acq, stemDocument)
A corpus with 50 text documents
Preprocessing: Whitespace elimination & lowercase conversion
>stripWhitespace(acq[[10]])
Gulf Applied Technologies Inc said it sold its
subsidiaries engaged in pipeline and terminal
operations for 12.2 mln dlrs. The company said the
sale is subject to certain post closing adjustments,
which it did not explain. Reuter
>tolower(acq[[10]])
gulf applied technologies inc said it sold its
subsidiaries engaged in pipeline and terminal
operations for 12.2 mln dlrs. the company said the
sale is subject to certain post closing adjustments,
which it did not explain. reuter
Preprocessing: Stopword removal
Very common words, such as of, and, the, are rarely of use in information
retrieval.
A long stop list saves space in indexes, speeds processing, and eliminates many
false hits.
However, common words are sometimes significant in information retrieval,
which is an argument for a short stop list.
(Consider the query, "To be or not to be?")
Preprocessing: Stopword removal
Include the most common words in the English language (perhaps 50 to 250
words).
Do not include words that might be important for retrieval (Among the 200
most frequently occurring words in general literature in English are time, war,
home, life, water, and world).
In addition, include words that are very common in context (e.g., computer,
information, system in a set of computing documents).
Preprocessing: Stopword removal
about above according across actually adj after
afterwards again against all almost alone
along already also although always among
amongst an another any anyhow anyone
anything anywhere are aren't around
at be became because become becomes becoming
been before beforehand begin beginning behind
being below beside besides between beyond
billion both but by can can't
cannot caption co could couldn't
did didn't do does doesn't don't down
during each eg eight eighty
either else elsewhere end ending enough
etc even ever every everyone everything
Preprocessing: Stopword removal
How many words should be in the stop list?
• Long list lowers recall
Which words should be in list?
• Some common words may have retrieval importance:
-- war, home, life, water, world
• In certain domains, some words are very common:
-- computer, program, source, machine, language
Preprocessing: Stopword removal
>mystopwords <- c("and", "for", "in", "is", "it",
"not", "the", "to")
>removeWords(acq[[10]], mystopwords)
Gulf Applied Technologies Inc said sold its
subsidiaries engaged pipeline terminal operations
12.2 mln dlrs. The company said sale subject certain
post closing adjustments, which did explain. Reuter
>tm_map(acq, removeWords, mystopwords)
A corpus with 50 text documents
Preprocessing: Synonyms
> library("wordnet")
>synonyms("company")
[1] "caller" "companionship" "company" "fellowship"
[5] "party" "ship's company" "society" "troupe"
>replaceWords(acq[[10]], synonyms(dict, "company"), by = "company")
>tm_map(acq, replaceWords, synonyms(dict, "company"), by = "company")
Preprocessing: Part of speech tagging
>library("NLP")
>library("openNLP")
s <- as.String(acq[[10]])
## Need sentence and word token annotations.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, list(sent_token_annotator,
word_token_annotator))
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
#pos_tag_annotator
a3 <- annotate(s, pos_tag_annotator, a2)
a3w <- subset(a3, type == "word")
tags <- sapply(a3w$features, "[[", "POS")
sprintf("%s/%s", s[a3w], tags)
Preprocessing: Part of speech tagging
"Gulf/NNP" "Applied/NNP" "Technologies/NNP" "Inc/NNP"
"said/VBD"
"it/PRP" "sold/VBD" "its/PRP$" "subsidiaries/NNS"
"engaged/VBN" "in/IN" "pipeline/NN" "and/CC"
"terminal/NN" "operations/NNS"
"for/IN" "12.2/CD" "mln/NN" "dlrs/NNS" "./." "The/DT"
"company/NN" "said/VBD" "the/DT" "sale/NN"
"is/VBZ" "subject/JJ" "to/TO" "certain/JJ" "post/NN"
"closing/NN" "adjustments/NNS" ",/," "which/WDT"
"it/PRP"
"did/VBD" "not/RB" "explain/VB" "./." "Reuter/NNP“
Preprocessing
R Demo
Classification using KNN
K-Nearest Neighbor algorithm:
 Most basic instance-based method
 Data are represented in a vector space
 Supervised learning
Given a target function f : X → V, where V is the finite set {v1, ..., vn},
k-NN returns the most common value among the k training examples nearest to xq.
KNN Feature space
KNN Training algorithm
For each training example <x, f(x)>, add the example to the list of training examples
Classification algorithm
Given a query instance xq to be classified:
Let x1, ..., xk be the k training examples nearest to xq
Return f̂(xq) = argmax over v ∈ V of Σ(i = 1..k) δ(v, f(xi))
where δ(a, b) = 1 if a = b, else δ(a, b) = 0 (Kronecker delta)
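This classification rule can be sketched in a few lines of base R (hypothetical toy data, Euclidean distance):

```r
# k-NN majority vote in base R
knn_classify <- function(train, labels, query, k = 3) {
  # Euclidean distance from the query to every training instance
  d <- sqrt(rowSums(sweep(train, 2, query)^2))
  nearest <- order(d)[1:k]
  # Summing delta(v, f(xi)) over the k neighbours = a vote count per class
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

train  <- rbind(c(1, 1), c(1, 2), c(5, 5), c(6, 5))
labels <- c("red", "red", "blue", "blue")
knn_classify(train, labels, query = c(1.5, 1.5), k = 3)  # "red"
```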
Classification using KNN : Example
Two classes: Red and Blue
Green is Unknown
With k = 3, the classification is Red
With k = 4, the classification is Blue
How to determine the good value for k?
 Determined experimentally
 Start with k=1 and use a test set to validate the error rate of the classifier
 Repeat with k=k+2
 Choose the value of k for which the error rate is minimum
 Note: k should be an odd number to avoid ties
KNN for speech classification
Datasets:
Size: 40 instances
Barack Obama: 20 speeches
Mitt Romney: 20 speeches
Training datasets: 70% (28)
Test datasets: 30% (12)
Accuracy: on average more than 90%
Speech Classification Implementation in R
#initialize the R environment
libs<-c("tm","plyr","class")
lapply(libs,require,character.only=TRUE)
#Set parameters / source directory
dir.names<-c("obama","romney")
path<-"E:/Ashraf/speeches"
#clean text / preprocessing
cleanCorpus<-function(corpus){
corpus.tmp<-tm_map(corpus,removePunctuation)
corpus.tmp<-tm_map(corpus.tmp,stripWhitespace)
corpus.tmp<-tm_map(corpus.tmp,tolower)
corpus.tmp<-tm_map(corpus.tmp,removeWords,stopwords("english"))
return (corpus.tmp)
}
Speech Classification Implementation in R
#build term document matrix
generateTDM<-function(dir.name,dir.path){
s.dir<-sprintf("%s/%s",dir.path,dir.name)
s.cor<-Corpus(DirSource(directory=s.dir,encoding="ANSI"))
s.cor.cl<-cleanCorpus(s.cor)
s.tdm<-TermDocumentMatrix(s.cor.cl)
s.tdm<-removeSparseTerms(s.tdm,0.7)
result<-list(name=dir.name,tdm=s.tdm)
}
tdm<-lapply(dir.names,generateTDM,dir.path=path)
Speech Classification Implementation in R
#attach candidate name to each row of TDM
bindCandidateToTDM<-function(tdm){
s.mat<-t(data.matrix(tdm[["tdm"]]))
s.df<-as.data.frame(s.mat,StringAsFactors=FALSE)
s.df<-cbind(s.df,rep(tdm[["name"]],nrow(s.df)))
colnames(s.df)[ncol(s.df)]<-"targetcandidate"
return (s.df)
}
candTDM<-lapply(tdm,bindCandidateToTDM)
Speech Classification Implementation in R
#stack the TDMs together (for both Obama and Romney)
tdm.stack<-do.call(rbind.fill,candTDM)
tdm.stack[is.na(tdm.stack)]<-0
#hold-out / splitting training and test data sets
train.idx<-sample(nrow(tdm.stack),ceiling(nrow(tdm.stack)*0.7))
test.idx<-(1:nrow(tdm.stack))[-train.idx]
Speech Classification Implementation in R
#model KNN
tdm.cand<-tdm.stack[,"targetcandidate"]
tdm.stack.nl<-tdm.stack[,!colnames(tdm.stack)%in%"targetcandidate"]
knn.pred<-knn(tdm.stack.nl[train.idx,],tdm.stack.nl[test.idx,],tdm.cand[train.idx])
#accuracy of the prediction
conf.mat<-table('Predictions'=knn.pred,Actual=tdm.cand[test.idx])
(accuracy<-(sum(diag(conf.mat))/length(test.idx))*100)
#show result
show(conf.mat)
show(accuracy)
Speech Classification Implementation in R
Show R Demo
References
1. Ingo Feinerer, Kurt Hornik, David Meyer, "Text Mining Infrastructure in R", Journal of Statistical Software, Vol. 25, Issue 5, March 2008.
2. http://mittromneycentral.com/speeches/
3. http://obamaspeeches.com/
4. http://cran.r-project.org/
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 

Recently uploaded (20)

"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 

Text Mining Infrastructure in R

  • 1. Text Mining Infrastructure in R Presented By Ashraf Uddin (http://ashrafsau.blogspot.in/) South Asian University, New Delhi, India. 29 January 2014
  • 2. What is R? • A free software environment for statistical computing and graphics • Open source and package-based; modeled on the S language developed at Bell Labs • Many statistical functions are already built in • Contributed packages extend the functionality to cutting-edge research • Implementation languages: C, Fortran (and R itself)
  • 3. More about R • R is the result of a collaborative effort with contributions from all over the world • R was initially written by Robert Gentleman and Ross Ihaka, also known as "R & R", of the Statistics Department of the University of Auckland • R was inspired by the S environment • R can be extended (easily) via packages
  • 4. What R does and does not • is not a database, but connects to DBMSs • the language interpreter can be very slow, but allows calling your own C/C++ code • no professional / commercial support
  • 5. Data Types in R • numeric (integer, double, complex) • character • logical • data frame • factor
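The types on this slide can be illustrated with a short base-R sketch (the values below are illustrative, not from the slides):

```r
# numeric: integer, double, complex
i <- 42L          # integer literal (note the L suffix)
d <- 3.14         # double
z <- 1 + 2i       # complex

# character and logical
s <- "text mining"
b <- TRUE

# factor: categorical data with a fixed set of levels
f <- factor(c("spam", "ham", "spam"))

# data frame: table-like structure whose columns may have different types
df <- data.frame(doc = c("d1", "d2"), words = c(120L, 87L))

class(i); class(z); class(s); class(f); class(df)
```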
  • 6. Contributed Packages • Currently (as of January 2014), the CRAN package repository features 5034 available packages
  • 8. Text Mining: Basics Text: unstructured collections of words. Documents: the basic units, consisting of a sequence of tokens or terms. Terms: words or roots of words, semantic units or phrases, which are the atoms of indexing. Repositories (databases) and corpora: collections of documents. Corpus: a conceptual entity similar to a database for holding and managing text documents. Text mining involves computations over these to gain interesting information.
  • 9. Text Mining: Practical Applications • Spam filtering • Business intelligence, marketing applications: predictive analytics • Sentiment analysis • Text IR, indexing • Creating suggestions and recommendations (like Amazon) • Monitoring public opinion (for example in blogs or review sites) • Customer service, email support • Automatic labeling of documents in business libraries • Fraud detection by investigating notifications of claims • Fighting cyberbullying or cybercrime in IM and IRC chat • And many more
  • 10. A List of Text Mining Tools
  • 11. Text Mining Packages in R corpora gsubfn kernlab KoNLP koRpus lda lsa maxent movMF openNLP qdap RcmdrPlugin.temis RKEA RTextTools RWeka skmeans Snowball SnowballC tau textcat textir tm tm.plugin.dc tm.plugin.factiva tm.plugin.mail topicmodels wordcloud wordnet zipfR
  • 12. Text Mining Packages in R plyr: tools for splitting, applying and combining data class: various functions for classification tm: a framework for text mining applications corpora: statistics and data sets for corpus frequency data Snowball: stemmers RWeka: interface to Weka, a collection of ML algorithms for data mining tasks wordnet: interface to WordNet using the Jawbone Java API to WordNet wordcloud: to make word clouds textir: a suite of tools for text and sentiment mining tau: text analysis utilities topicmodels: an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topic Models (CTM) zipfR: statistical models for word frequency distributions
  • 13. Conceptual process in Text Mining • organize and structure the texts (into a repository) • convenient representation (preprocessing) • transform texts into structured formats (e.g. a term-document matrix, TDM)
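The three steps above can be sketched in base R without any packages — a toy term-document matrix built from two tiny hypothetical documents (in practice tm's TermDocumentMatrix() does this, together with richer preprocessing):

```r
# Step 1: organize texts into a repository (here, just a named character vector)
docs <- c(d1 = "arte citae rates arte", d2 = "arte leves currus")

# Step 2: convenient representation -- lowercase and tokenize on whitespace
tokens <- lapply(docs, function(x) strsplit(tolower(x), "\\s+")[[1]])

# Step 3: transform into a structured format, a term-document matrix
terms <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))
tdm   # rows are terms, columns are documents, cells are counts
```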
  • 14. The framework • Documents come in different file formats and from different locations → standardized interfaces to access the documents (sources) • Metadata gives valuable insights into the document structure → the framework must make metadata easy to use so we can work efficiently with the documents • must provide tools and algorithms to perform common tasks (transformations) • must support extracting patterns of interest (filtering)
  • 15. Text document collections: Corpus Constructor: Corpus(object = ..., readerControl = list(reader = object@DefaultReader, language = "en_US", load = FALSE)) Example: >txt <- system.file("texts", "txt", package = "tm") >(ovid <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "la", load = TRUE))) A corpus with 5 text documents
  • 16. Corpus: Meta Data >meta(ovid[[1]]) Available meta data pairs are: Author : DateTimeStamp: 2013-11-19 18:54:04 Description : Heading : ID : ovid_1.txt Language : la Origin : >ID(ovid[[1]]) [1] "ovid_1.txt"
  • 17. Corpus: Document’s text >ovid[[1]] Si quis in hoc artem populo non novit amandi, hoc legat et lecto carmine doctus amet. arte citae veloque rates remoque moventur, arte leves currus: arte regendus amor. curribus Automedon lentisque erat aptus habenis, Tiphys in Haemonia puppe magister erat: me Venus artificem tenero praefecit Amori; Tiphys et Automedon dicar Amoris ego. ille quidem ferus est et qui mihi saepe repugnet: sed puer est, aetas mollis et apta regi. Phillyrides puerum cithara perfecit Achillem, atque animos placida contudit arte feros. qui totiens socios, totiens exterruit hostes, creditur annosum pertimuisse senem.
  • 18. Corpus: Meta Data >c(ovid[1:2], ovid[3:4]) A corpus with 4 text documents >length(ovid) 5 >summary(ovid) A corpus with 5 text documents The metadata consists of 2 tag-value pairs and a data frame Available tags are: create_date creator Available variables in the data frame are: MetaID
  • 19. Corpus: Meta Data >CMetaData(ovid) $create_date [1] "2013-11-19 18:54:04 GMT" $creator [1] "“ >DMetaData(ovid) MetaID 1 0 2 0 3 0 4 0 5 0
  • 20. Corpus: Transformations and Filters >getTransformations() [1] "as.PlainTextDocument" "removeNumbers" "removePunctuation" "removeWords" [5] "stemDocument" "stripWhitespace" >tm_map(ovid, FUN = tolower) A corpus with 5 text documents >getFilters() [1] "searchFullText" "sFilter" "tm_intersect" >tm_filter(ovid, FUN = searchFullText, "Venus", doclevel = TRUE) A corpus with 1 text document
  • 21. Text Preprocessing: import >txt <- system.file("texts", "acq", package = "tm") >(acq <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "en", load = TRUE))) A corpus with 50 text documents >txt <- system.file("texts", "crude", package = "tm") >(crude <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "en", load = TRUE))) A corpus with 20 text documents • resulting in 50 articles of topic acq and 20 articles of topic crude
  • 22. Preprocessing: stemming • Morphological variants of a word (morphemes) are similar terms derived from a common stem: engineer, engineered, engineering; use, user, users, used, using • Stemming in information retrieval: grouping words with a common stem together. For example, a search on reads also finds read, reading, and readable • Stemming consists of removing suffixes and conflating the resulting morphemes. Occasionally, prefixes are also removed.
  • 23. Preprocessing: stemming • Reduce terms to their "roots" • automate(s), automatic, automation all reduce to automat; for example, compressed and compression are both accepted as equivalent to compress — the stemmed sentence reads: "for exampl compress and compress ar both accept as equival to compress"
  • 24. Preprocessing: stemming Typical rules in stemming: sses → ss, ies → i, ational → ate, tional → tion. Rules sensitive to the measure of the word, e.g. (m>1) EMENT → "": replacement → replac, but cement → cement (unchanged, since its measure is too small)
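The four suffix rules on this slide can be sketched as a toy base-R stemmer (only these rules, ignoring the measure condition; in real work use tm's stemDocument or SnowballC::wordStem):

```r
# apply a few Porter-style suffix rules in order; the first matching rule wins
toy_stem <- function(word) {
  rules <- list(c("sses$",    "ss"),
                c("ies$",     "i"),
                c("ational$", "ate"),    # checked before tional$, which it overlaps
                c("tional$",  "tion"))
  for (r in rules) {
    if (grepl(r[1], word)) return(sub(r[1], r[2], word))
  }
  word   # no rule matched: leave the word unchanged
}

sapply(c("caresses", "ponies", "relational", "conditional"), toy_stem)
# caresses -> caress, ponies -> poni, relational -> relate, conditional -> condition
```

Note the rule order: ational$ must be tried before tional$, otherwise "relational" would wrongly stem to "relation" instead of "relate".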
  • 25. Preprocessing: stemming • helps recall for some queries but harms precision on others • fine distinctions may be lost through stemming
  • 26. Preprocessing: stemming >acq[[10]] Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter >stemDocument(acq[[10]]) Gulf Appli Technolog Inc said it sold it subsidiari engag in pipelin and terminal oper for 12.2 mln dlrs. The compani said the sale is subject to certain post clos adjustments, which it did not explain. Reuter >tm_map(acq, stemDocument) A corpus with 50 text documents
  • 27. Preprocessing: Whitespace elimination & lower case conversion >stripWhitespace(acq[[10]]) Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter >tolower(acq[[10]]) gulf applied technologies inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. the company said the sale is subject to certain post closing adjustments, which it did not explain. reuter
  • 28. Preprocessing: Stopword removal Very common words, such as of, and, the, are rarely of use in information retrieval. A long stop list saves space in indexes, speeds processing, and eliminates many false hits. However, common words are sometimes significant in information retrieval, which is an argument for a short stop list. (Consider the query, "To be or not to be?")
  • 29. Preprocessing: Stopword removal Include the most common words in the English language (perhaps 50 to 250 words). Do not include words that might be important for retrieval (Among the 200 most frequently occurring words in general literature in English are time, war, home, life, water, and world). In addition, include words that are very common in context (e.g., computer, information, system in a set of computing documents).
  • 30. Preprocessing: Stopword removal about above according across actually adj after afterwards again against all almost alone along already also although always among amongst an another any anyhow anyone anything anywhere are aren't around at be became because become becomes becoming been before beforehand begin beginning behind being below beside besides between beyond billion both but by can can't cannot caption co could couldn't did didn't do does doesn't don't down during each eg eight eighty either else elsewhere end ending enough etc even ever every everyone everything
  • 31. Preprocessing: Stopword removal How many words should be in the stop list? • Long list lowers recall Which words should be in list? • Some common words may have retrieval importance: -- war, home, life, water, world • In certain domains, some words are very common: -- computer, program, source, machine, language
  • 32. Preprocessing: Stopword removal >mystopwords <- c("and", "for", "in", "is", "it", "not", "the", "to") >removeWords(acq[[10]], mystopwords) Gulf Applied Technologies Inc said sold its subsidiaries engaged pipeline terminal operations 12.2 mln dlrs. The company said sale subject certain post closing adjustments, which did explain. Reuter >tm_map(acq, removeWords, mystopwords) A corpus with 50 text documents
  • 33. Preprocessing: Synonyms >library("wordnet") >synonyms("company") [1] "caller" "companionship" "company" "fellowship" [5] "party" "ship's company" "society" "troupe" >replaceWords(acq[[10]], synonyms(dict, "company"), by = "company") >tm_map(acq, replaceWords, synonyms(dict, "company"), by = "company")
  • 34. Preprocessing: Part of speech tagging >library("NLP"); library("openNLP") s <- as.String(acq[[10]]) ## Need sentence and word token annotations. sent_token_annotator <- Maxent_Sent_Token_Annotator() word_token_annotator <- Maxent_Word_Token_Annotator() a2 <- annotate(s, list(sent_token_annotator, word_token_annotator)) pos_tag_annotator <- Maxent_POS_Tag_Annotator() a3 <- annotate(s, pos_tag_annotator, a2) a3w <- subset(a3, type == "word") tags <- sapply(a3w$features, "[[", "POS") sprintf("%s/%s", s[a3w], tags)
  • 35. Preprocessing: Part of speech tagging "Gulf/NNP" "Applied/NNP" "Technologies/NNP" "Inc/NNP" "said/VBD" "it/PRP" "sold/VBD" "its/PRP$" "subsidiaries/NNS" "engaged/VBN" "in/IN" "pipeline/NN" "and/CC" "terminal/NN" "operations/NNS" "for/IN" "12.2/CD" "mln/NN" "dlrs/NNS" "./." "The/DT" "company/NN" "said/VBD" "the/DT" "sale/NN" "is/VBZ" "subject/JJ" "to/TO" "certain/JJ" "post/NN" "closing/NN" "adjustments/NNS" ",/," "which/WDT" "it/PRP" "did/VBD" "not/RB" "explain/VB" "./." "Reuter/NNP"
  • 37. Classification using KNN K-Nearest Neighbor algorithm: • Most basic instance-based method • Data are represented in a vector space • Supervised learning: given a finite set of classes V = {v1, ..., vn}, k-NN returns the most common value among the k training examples nearest to xq.
  • 39. KNN Training algorithm: for each training example <x, f(x)>, add the example to the list. Classification algorithm: given a query instance xq to be classified, let x1, ..., xk be the k instances nearest to xq and predict f(xq) = argmax over v in V of the sum over i = 1..k of 𝛿(v, f(xi)), where 𝛿(a,b) = 1 if a = b, else 𝛿(a,b) = 0 (Kronecker delta)
  • 40. Classification using KNN: Example Two classes: red and blue; green is unknown. With k=3, the classification is red; with k=4, the classification is blue
  • 41. How to determine a good value for k? • Determined experimentally • Start with k=1 and use a test set to validate the error rate of the classifier • Repeat with k = k+2 • Choose the value of k for which the error rate is minimum • Note: k should be an odd number to avoid ties
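The procedure above can be sketched with a hand-rolled nearest-neighbour vote on a small synthetic data set (base R only, for illustration; the speech experiments later in the deck use class::knn instead):

```r
set.seed(1)
# two 2-D Gaussian classes, 20 points each (synthetic data, not from the slides)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 2), ncol = 2))
y <- rep(c("red", "blue"), each = 20)
train <- sample(40, 28)              # 70% hold-out, as on slide 42
test  <- setdiff(1:40, train)

# classify one query point by majority vote among its k nearest training points
knn_one <- function(q, k) {
  d <- sqrt(rowSums((x[train, ] - matrix(q, length(train), 2, byrow = TRUE))^2))
  votes <- y[train][order(d)[1:k]]
  names(which.max(table(votes)))
}

# try odd k = 1, 3, 5, ... and report the test error rate for each
for (k in seq(1, 9, by = 2)) {
  pred <- apply(x[test, ], 1, knn_one, k = k)
  cat("k =", k, " error rate =", mean(pred != y[test]), "\n")
}
```

One would then pick the k with the lowest observed error rate; odd k avoids ties in the two-class vote.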
  • 42. KNN for speech classification Dataset: 40 instances (Barack Obama: 20 speeches, Mitt Romney: 20 speeches). Training set: 70% (28). Test set: 30% (12). Accuracy: on average more than 90%
  • 43. Speech Classification Implementation in R #initialize the R environment libs<-c("tm","plyr","class") lapply(libs,require,character.only=TRUE) #Set parameters / source directory dir.names<-c("obama","romney") path<-"E:/Ashraf/speeches" #clean text / preprocessing cleanCorpus<-function(corpus){ corpus.tmp<-tm_map(corpus,removePunctuation) corpus.tmp<-tm_map(corpus.tmp,stripWhitespace) corpus.tmp<-tm_map(corpus.tmp,tolower) corpus.tmp<-tm_map(corpus.tmp,removeWords,stopwords("english")) return (corpus.tmp) }
  • 44. Speech Classification Implementation in R #build term document matrix generateTDM<-function(dir.name,dir.path){ s.dir<-sprintf("%s/%s",dir.path,dir.name) s.cor<-Corpus(DirSource(directory=s.dir,encoding="ANSI")) s.cor.cl<-cleanCorpus(s.cor) s.tdm<-TermDocumentMatrix(s.cor.cl) s.tdm<-removeSparseTerms(s.tdm,0.7) result<-list(name=dir.name,tdm=s.tdm) } tdm<-lapply(dir.names,generateTDM,dir.path=path)
  • 45. Speech Classification Implementation in R #attach candidate name to each row of TDM bindCandidateToTDM<-function(tdm){ s.mat<-t(data.matrix(tdm[["tdm"]])) s.df<-as.data.frame(s.mat,StringAsFactors=FALSE) s.df<-cbind(s.df,rep(tdm[["name"]],nrow(s.df))) colnames(s.df)[ncol(s.df)]<-"targetcandidate" return (s.df) } candTDM<-lapply(tdm,bindCandidateToTDM)
  • 46. Speech Classification Implementation in R #stack the TDMs together (for both Obama and Romney) tdm.stack<-do.call(rbind.fill,candTDM) tdm.stack[is.na(tdm.stack)]<-0 #hold-out / splitting training and test data sets train.idx<-sample(nrow(tdm.stack),ceiling(nrow(tdm.stack)*0.7)) test.idx<-(1:nrow(tdm.stack))[-train.idx]
  • 47. Speech Classification Implementation in R #model KNN tdm.cand<-tdm.stack[,"targetcandidate"] tdm.stack.nl<-tdm.stack[,!colnames(tdm.stack)%in%"targetcandidate"] knn.pred<- knn(tdm.stack.nl[train.idx,],tdm.stack.nl[test.idx,],tdm.cand[train.idx]) #accuracy of the prediction conf.mat<-table('Predictions'=knn.pred,Actual=tdm.cand[test.idx]) (accuracy<-(sum(diag(conf.mat))/length(test.idx))*100) #show result show(conf.mat) show(accuracy)
  • 49. References 1. Ingo Feinerer, Kurt Hornik, David Meyer, "Text Mining Infrastructure in R", Journal of Statistical Software, Vol. 25, Issue 5, March 2008. 2. http://mittromneycentral.com/speeches/ 3. http://obamaspeeches.com/ 4. http://cran.r-project.org/