Text Mining Infrastructure in R
Presented By
Ashraf Uddin
(http://ashrafsau.blogspot.in/)
South Asian University, New Delhi, India.
29 January 2014
What is R?
 A free software environment for statistical computing and graphics.
 Open source and package based; a dialect of the S language developed at Bell Labs
 Many statistical functions are already built in
 Contributed packages expand the functionality to cutting edge research
 Implemented in C, Fortran, and R itself
What is R?
 R is the result of a collaborative effort with contributions from all over the
world
 R was initially written by Robert Gentleman and Ross Ihaka—also known as
"R & R" of the Statistics Department of the University of Auckland
 R was inspired by the S environment
 R can be extended (easily) via packages.
More about R
What R does and does not do
 is not a database, but connects to DBMSs
 the language interpreter can be very slow, but allows calling your own C/C++ code
 no professional / commercial support
Data Types in R
 numeric (integer, double, complex)
 character
 logical
 Data frame
 factor
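A quick way to see these types in action is the class() function; a minimal sketch in base R:

```r
# Inspect R's basic data types with class()
class(1L)                   # "integer"
class(2.5)                  # "numeric"
class(1 + 2i)               # "complex"
class("text")               # "character"
class(TRUE)                 # "logical"
class(data.frame(x = 1:3))  # "data.frame"
class(factor("a"))          # "factor"
```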
Contributed Packages
 Currently, the CRAN package repository features 5034 available packages
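Contributed packages are installed from CRAN once and then loaded in each session; a minimal sketch (tm is just one example package from this talk):

```r
# One-time install from the CRAN repository (commented out here)
# install.packages("tm")

# Load an installed package into the current session
library(tm)
```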
Growing users of R
Text Mining: Basics
Text: unstructured collections of words
Documents: basic units consisting of a sequence of tokens or terms
Terms: words or word roots, semantic units or phrases that are the atoms of indexing
Repositories (databases) and corpora: collections of documents
Corpus: a conceptual entity, similar to a database, for holding and managing text documents
Text mining involves computations over such collections to extract interesting information
Text Mining: Practical Applications
 Spam filtering
 Business intelligence, marketing applications: predictive analytics
 Sentiment analysis
 Text IR, indexing
 Creating suggestions and recommendations (as on Amazon)
 Monitoring public opinions (for example in blogs or review sites)
 Customer service, email support
 Automatic labeling of documents in business libraries
 Fraud detection by investigating notification of claims
 Fighting cyberbullying or cybercrime in IM and IRC chat
And many more
A List of Text Mining Tools
Text Mining Packages in R
corpora gsubfn kernlab KoNLP
koRpus lda lsa maxent
movMF openNLP qdap RcmdrPlugin.temis
RKEA RTextTools RWeka skmeans
Snowball SnowballC tau textcat
textir tm tm.plugin.dc tm.plugin.factiva
tm.plugin.mail topicmodels wordcloud
wordnet zipfR
Text Mining Packages in R
plyr: Tools for splitting, applying and combining data
class: Various functions for classification
tm: A framework for text mining applications
corpora: Statistics and data sets for corpus frequency data
Snowball: word stemmers based on the Snowball stemming algorithms
RWeka: interface to Weka, a collection of ML algorithms for data mining tasks
wordnet: interface to WordNet using the Jawbone Java API to WordNet
wordcloud: generates word clouds
textir: A suite of tools for text and sentiment mining
tau: Text Analysis Utilities
topicmodels: an interface to the C code for Latent Dirichlet Allocation (LDA)
models and Correlated Topics Models (CTM)
zipfR: Statistical models for word frequency distributions
Conceptual process in Text Mining
 organize and structure the texts (into repository)
 convenient representation (preprocessing)
 Transform texts into structured formats (e.g. TDM)
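The three steps above map directly onto the tm workflow; a minimal sketch with made-up example documents, using the tm API as it appears in these slides (newer tm versions wrap base functions like tolower in content_transformer()):

```r
library(tm)

# 1. Organize the texts into a repository: a corpus
docs <- c("Text mining with R is fun.", "R offers many text mining packages.")
corpus <- Corpus(VectorSource(docs))

# 2. Preprocess into a convenient representation
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# 3. Transform into a structured format: a term-document matrix
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```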
The framework
Documents come in different file formats and from different locations
 standardized interfaces to access the documents (sources)
Metadata gives valuable insights into the document structure
 must be able to facilitate metadata usage
To work efficiently with the documents
 must provide tools and algorithms to perform common tasks (transformations)
 and to extract patterns of interest (filtering)
Text document collections: Corpus
Constructor:
Corpus(object = ..., readerControl = list(reader = object@DefaultReader,
language = "en_US", load = FALSE))
Example:
>txt <- system.file("texts", "txt", package = "tm")
>(ovid <- Corpus(DirSource(txt), readerControl =
list(reader = readPlain, language = "la", load =
TRUE)))
A corpus with 5 text documents
Corpus: Meta Data
>meta(ovid[[1]])
Available meta data pairs are:
Author :
DateTimeStamp: 2013-11-19 18:54:04
Description :
Heading :
ID : ovid_1.txt
Language : la
Origin :
>ID(ovid[[1]])
[1] "ovid_1.txt"
Corpus: Document’s text
>ovid[[1]]
Si quis in hoc artem populo non novit amandi, hoc
legat et lecto carmine doctus amet. arte citae
veloque rates remoque moventur, arte leves currus:
arte regendus amor. curribus Automedon lentisque erat
aptus habenis, Tiphys in Haemonia puppe magister
erat: me Venus artificem tenero praefecit Amori;
Tiphys et Automedon dicar Amoris ego. ille quidem
ferus est et qui mihi saepe repugnet: sed puer est,
aetas mollis et apta regi. Phillyrides puerum cithara
perfecit Achillem, atque animos placida contudit arte
feros. qui totiens socios, totiens exterruit hostes,
creditur annosum pertimuisse senem.
Corpus: Meta Data
>c(ovid[1:2], ovid[3:4])
A corpus with 4 text documents
>length(ovid)
5
>summary(ovid)
A corpus with 5 text documents
The metadata consists of 2 tag-value pairs and a data
frame
Available tags are: create_date creator
Available variables in the data frame are: MetaID
Corpus: Meta Data
>CMetaData(ovid)
$create_date [1] "2013-11-19 18:54:04 GMT"
$creator [1] ""
>DMetaData(ovid)
MetaID
1 0
2 0
3 0
4 0
5 0
Corpus: Transformations and Filters
>getTransformations()
[1] "as.PlainTextDocument" "removeNumbers"
"removePunctuation" "removeWords"
[5] "stemDocument" "stripWhitespace"
>tm_map(ovid, FUN = tolower)
A corpus with 5 text documents
>getFilters()
[1] "searchFullText" "sFilter" "tm_intersect"
>tm_filter(ovid, FUN = searchFullText, "Venus",
doclevel = TRUE)
A corpus with 1 text document
Text Preprocessing: import
>txt <- system.file("texts", "acq", package = "tm")
>(acq <- Corpus(DirSource(txt), readerControl =
list(reader = readPlain, language = "la", load =
TRUE)))
A corpus with 50 text documents
>txt <- system.file("texts", "crude", package = "tm")
>(crude <- Corpus(DirSource(txt), readerControl =
list(reader = readPlain, language = "la", load =
TRUE)))
A corpus with 20 text documents
resulting in 50 articles of topic acq and 20 articles of topic crude
Preprocessing: stemming
 Morphological variants of a word (morphemes). Similar terms derived from
a common stem:
engineer, engineered, engineering
use, user, users, used, using
 Stemming in Information Retrieval. Grouping words with a common stem
together.
 For example, a search on reads, also finds read, reading, and readable
 Stemming consists of removing suffixes and conflating the resulting
morphemes. Occasionally, prefixes are also removed.
Preprocessing: stemming
 Reduce terms to their “roots”
 automate(s), automatic, automation all reduced to automat.
Before stemming: for example compressed and compression are both accepted as equivalent to compress.
After stemming: for exampl compress and compress ar both accept as equival to compress
Preprocessing: stemming
Typical rules in stemming:
sses → ss
ies → i
ational → ate
tional → tion
Rules sensitive to the measure (m) of the word:
(m > 1) EMENT → (removed)
replacement → replac
cement → cement
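These are rules from the Porter stemmer, which the SnowballC package (listed earlier) exposes via wordStem(). A small sketch, assuming SnowballC is installed:

```r
library(SnowballC)

# Suffix rules in action: sses -> ss, ies -> i, ational -> ate, tional -> tion
wordStem(c("caresses", "ponies", "relational", "conditional"), language = "porter")

# Measure-sensitive rule (m > 1) EMENT -> removed:
# "replacement" loses the suffix, but "cement" keeps it (its stem is too short)
wordStem(c("replacement", "cement"), language = "porter")
```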
Preprocessing: stemming
 Stemming may help recall for some queries but harm precision on others
 Fine distinctions may be lost through stemming
Preprocessing: stemming
>acq[[10]]
Gulf Applied Technologies Inc said it sold its
subsidiaries engaged in pipeline and terminal
operations for 12.2 mln dlrs. The company said the
sale is subject to certain post closing adjustments,
which it did not explain. Reuter
>stemDocument(acq[[10]])
Gulf Appli Technolog Inc said it sold it subsidiari
engag in pipelin and terminal oper for 12.2 mln dlrs.
The compani said the sale is subject to certain post
clos adjustments, which it did not explain. Reuter
>tm_map(acq, stemDocument)
A corpus with 50 text documents
Preprocessing: Whitespace elimination & lowercase conversion
>stripWhitespace(acq[[10]])
Gulf Applied Technologies Inc said it sold its
subsidiaries engaged in pipeline and terminal
operations for 12.2 mln dlrs. The company said the
sale is subject to certain post closing adjustments,
which it did not explain. Reuter
>tolower(acq[[10]])
gulf applied technologies inc said it sold its
subsidiaries engaged in pipeline and terminal
operations for 12.2 mln dlrs. the company said the
sale is subject to certain post closing adjustments,
which it did not explain. reuter
Preprocessing: Stopword removal
Very common words, such as of, and, the, are rarely of use in information
retrieval.
A long stop list saves space in indexes, speeds processing, and eliminates many
false hits.
However, common words are sometimes significant in information retrieval,
which is an argument for a short stop list.
(Consider the query, "To be or not to be?")
Preprocessing: Stopword removal
Include the most common words in the English language (perhaps 50 to 250
words).
Do not include words that might be important for retrieval (Among the 200
most frequently occurring words in general literature in English are time, war,
home, life, water, and world).
In addition, include words that are very common in context (e.g., computer,
information, system in a set of computing documents).
Preprocessing: Stopword removal
about above according across actually adj after
afterwards again against all almost alone
along already also although always among
amongst an another any anyhow anyone
anything anywhere are aren't around
at be became because become becomes becoming
been before beforehand begin beginning behind
being below beside besides between beyond
billion both but by can can't
cannot caption co could couldn't
did didn't do does doesn't don't down
during each eg eight eighty
either else elsewhere end ending enough
etc even ever every everyone everything
Preprocessing: Stopword removal
How many words should be in the stop list?
• Long list lowers recall
Which words should be in list?
• Some common words may have retrieval importance:
-- war, home, life, water, world
• In certain domains, some words are very common:
-- computer, program, source, machine, language
Preprocessing: Stopword removal
>mystopwords <- c("and", "for", "in", "is", "it",
"not", "the", "to")
>removeWords(acq[[10]], mystopwords)
Gulf Applied Technologies Inc said sold its
subsidiaries engaged pipeline terminal operations
12.2 mln dlrs. The company said sale subject certain
post closing adjustments, which did explain. Reuter
>tm_map(acq, removeWords, mystopwords)
A corpus with 50 text documents
Preprocessing: Synonyms
> library("wordnet")
>synonyms("company")
[1] "caller" "companionship" "company" "fellowship"
[5] "party" "ship's company" "society" "troupe"
>replaceWords(acq[[10]], synonyms(dict, "company"), by = "company")
>tm_map(acq, replaceWords, synonyms(dict, "company"), by = "company")
Preprocessing: Part of speech tagging
>library("NLP")
>library("openNLP")
s <- as.String(acq[[10]])
## Need sentence and word token annotations.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, list(sent_token_annotator,
word_token_annotator))
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
#pos_tag_annotator
a3 <- annotate(s, pos_tag_annotator, a2)
a3w <- subset(a3, type == "word")
tags <- sapply(a3w$features, "[[", "POS")
sprintf("%s/%s", s[a3w], tags)
Preprocessing: Part of speech tagging
"Gulf/NNP" "Applied/NNP" "Technologies/NNP" "Inc/NNP"
"said/VBD"
"it/PRP" "sold/VBD" "its/PRP$" "subsidiaries/NNS"
"engaged/VBN" "in/IN" "pipeline/NN" "and/CC"
"terminal/NN" "operations/NNS"
"for/IN" "12.2/CD" "mln/NN" "dlrs/NNS" "./." "The/DT"
"company/NN" "said/VBD" "the/DT" "sale/NN"
"is/VBZ" "subject/JJ" "to/TO" "certain/JJ" "post/NN"
"closing/NN" "adjustments/NNS" ",/," "which/WDT"
"it/PRP"
"did/VBD" "not/RB" "explain/VB" "./." "Reuter/NNP“
Preprocessing
R Demo
Classification using KNN
K-Nearest Neighbor algorithm:
 Most basic instance-based method
 Data are represented in a vector space
 Supervised learning
Given a target function f : X → V, where V is the finite set {v1, ..., vn},
k-NN returns the most common value among the k training examples nearest to xq.
KNN Feature space
KNN Training algorithm
For each training example <x, f(x)>, add the example to the list of training examples
Classification algorithm
Given a query instance xq to be classified:
Let x1, ..., xk be the k training examples nearest to xq
Return f̂(xq) = argmax over v ∈ V of Σ(i = 1..k) δ(v, f(xi))
where δ(a, b) = 1 if a = b, else δ(a, b) = 0 (Kronecker delta)
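This classification rule can be sketched in a few lines of base R (hypothetical toy data, Euclidean distance):

```r
# k-NN majority vote in base R
knn_classify <- function(train, labels, query, k = 3) {
  # Euclidean distance from the query to every training instance
  d <- sqrt(rowSums(sweep(train, 2, query)^2))
  nearest <- order(d)[1:k]
  # Summing delta(v, f(xi)) over the k neighbours = a vote count per class
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

train  <- rbind(c(1, 1), c(1, 2), c(5, 5), c(6, 5))
labels <- c("red", "red", "blue", "blue")
knn_classify(train, labels, query = c(1.5, 1.5), k = 3)  # "red"
```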
Classification using KNN : Example
Two classes: Red and Blue
Green is Unknown
With k = 3, the classification is Red
With k = 4, the classification is Blue
How to determine the good value for k?
 Determined experimentally
 Start with k=1 and use a test set to validate the error rate of the classifier
 Repeat with k=k+2
 Choose the value of k for which the error rate is minimum
 Note: k should be an odd number to avoid ties
KNN for speech classification
Datasets:
Size: 40 instances
Barack Obama: 20 speeches
Mitt Romney: 20 speeches
Training datasets: 70% (28)
Test datasets: 30% (12)
Accuracy: on average more than 90%
Speech Classification Implementation in R
#initialize the R environment
libs<-c("tm","plyr","class")
lapply(libs,require,character.only=TRUE)
#Set parameters / source directory
dir.names<-c("obama","romney")
path<-"E:/Ashraf/speeches"
#clean text / preprocessing
cleanCorpus<-function(corpus){
corpus.tmp<-tm_map(corpus,removePunctuation)
corpus.tmp<-tm_map(corpus.tmp,stripWhitespace)
corpus.tmp<-tm_map(corpus.tmp,tolower)
corpus.tmp<-tm_map(corpus.tmp,removeWords,stopwords("english"))
return (corpus.tmp)
}
Speech Classification Implementation in R
#build term document matrix
generateTDM<-function(dir.name,dir.path){
s.dir<-sprintf("%s/%s",dir.path,dir.name)
s.cor<-Corpus(DirSource(directory=s.dir,encoding="ANSI"))
s.cor.cl<-cleanCorpus(s.cor)
s.tdm<-TermDocumentMatrix(s.cor.cl)
s.tdm<-removeSparseTerms(s.tdm,0.7)
result<-list(name=dir.name,tdm=s.tdm)
}
tdm<-lapply(dir.names,generateTDM,dir.path=path)
Speech Classification Implementation in R
#attach candidate name to each row of TDM
bindCandidateToTDM<-function(tdm){
s.mat<-t(data.matrix(tdm[["tdm"]]))
s.df<-as.data.frame(s.mat,StringAsFactors=FALSE)
s.df<-cbind(s.df,rep(tdm[["name"]],nrow(s.df)))
colnames(s.df)[ncol(s.df)]<-"targetcandidate"
return (s.df)
}
candTDM<-lapply(tdm,bindCandidateToTDM)
Speech Classification Implementation in R
#stack the TDMs together (for both Obama and Romney)
tdm.stack<-do.call(rbind.fill,candTDM)
tdm.stack[is.na(tdm.stack)]<-0
#hold-out / splitting training and test data sets
train.idx<-sample(nrow(tdm.stack),ceiling(nrow(tdm.stack)*0.7))
test.idx<-(1:nrow(tdm.stack))[-train.idx]
Speech Classification Implementation in R
#model KNN
tdm.cand<-tdm.stack[,"targetcandidate"]
tdm.stack.nl<-tdm.stack[,!colnames(tdm.stack)%in%"targetcandidate"]
knn.pred<-knn(tdm.stack.nl[train.idx,],tdm.stack.nl[test.idx,],tdm.cand[train.idx])
#accuracy of the prediction
conf.mat<-table('Predictions'=knn.pred,Actual=tdm.cand[test.idx])
(accuracy<-(sum(diag(conf.mat))/length(test.idx))*100)
#show result
show(conf.mat)
show(accuracy)
Speech Classification Implementation in R
Show R Demo
References
1. Ingo Feinerer, Kurt Hornik, David Meyer, "Text Mining Infrastructure in R", Journal of Statistical Software, Vol. 25, Issue 5, March 2008.
2. http://mittromneycentral.com/speeches/
3. http://obamaspeeches.com/
4. http://cran.r-project.org/
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 

Recently uploaded (20)

"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 

Text Mining Infrastructure in R

  • 1. Text Mining Infrastructure in R Presented By Ashraf Uddin (http://ashrafsau.blogspot.in/) South Asian University, New Delhi, India. 29 January 2014
  • 2. What is R? • A free software environment for statistical computing and graphics • Open source and package-based; modeled on the S language developed at Bell Labs • Many statistical functions are already built in • Contributed packages extend the functionality to cutting-edge research • Implementation languages: C, Fortran (and R itself)
  • 3. More about R • R is the result of a collaborative effort with contributions from all over the world • R was initially written by Robert Gentleman and Ross Ihaka, also known as "R & R", of the Statistics Department of the University of Auckland • R was inspired by the S environment • R can be extended (easily) via packages
  • 4. What R does and does not • is not a database, but connects to DBMSs • the language interpreter can be very slow, but allows calling your own C/C++ code • no professional / commercial support
  • 5. Data Types in R • numeric (integer, double, complex) • character • logical • data frame • factor
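The types on this slide can be illustrated with a short base-R sketch (the values below are illustrative, not from the slides):

```r
# numeric: integer, double, complex
i <- 42L          # integer literal (note the L suffix)
d <- 3.14         # double
z <- 1 + 2i       # complex

# character and logical
s <- "text mining"
b <- TRUE

# factor: categorical data with a fixed set of levels
f <- factor(c("spam", "ham", "spam"))

# data frame: table-like structure whose columns may have different types
df <- data.frame(doc = c("d1", "d2"), words = c(120L, 87L))

class(i); class(z); class(s); class(f); class(df)
```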
  • 6. Contributed Packages • Currently (as of January 2014), the CRAN package repository features 5034 available packages
  • 8. Text Mining: Basics Text: unstructured collections of words. Documents: the basic units, consisting of a sequence of tokens or terms. Terms: words or roots of words, semantic units or phrases, which are the atoms of indexing. Repositories (databases) and corpora: collections of documents. Corpus: a conceptual entity similar to a database for holding and managing text documents. Text mining involves computations over these to gain interesting information.
  • 9. Text Mining: Practical Applications • Spam filtering • Business intelligence, marketing applications: predictive analytics • Sentiment analysis • Text IR, indexing • Creating suggestions and recommendations (like Amazon) • Monitoring public opinion (for example in blogs or review sites) • Customer service, email support • Automatic labeling of documents in business libraries • Fraud detection by investigating notifications of claims • Fighting cyberbullying or cybercrime in IM and IRC chat • And many more
  • 10. A List of Text Mining Tools
  • 11. Text Mining Packages in R corpora gsubfn kernlab KoNLP koRpus lda lsa maxent movMF openNLP qdap RcmdrPlugin.temis RKEA RTextTools RWeka skmeans Snowball SnowballC tau textcat textir tm tm.plugin.dc tm.plugin.factiva tm.plugin.mail topicmodels wordcloud wordnet zipfR
  • 12. Text Mining Packages in R plyr: tools for splitting, applying and combining data class: various functions for classification tm: a framework for text mining applications corpora: statistics and data sets for corpus frequency data Snowball: stemmers RWeka: interface to Weka, a collection of ML algorithms for data mining tasks wordnet: interface to WordNet using the Jawbone Java API to WordNet wordcloud: to make word clouds textir: a suite of tools for text and sentiment mining tau: text analysis utilities topicmodels: an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topic Models (CTM) zipfR: statistical models for word frequency distributions
  • 13. Conceptual process in Text Mining • organize and structure the texts (into a repository) • convenient representation (preprocessing) • transform texts into structured formats (e.g. a term-document matrix, TDM)
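The three steps above can be sketched in base R without any packages — a toy term-document matrix built from two tiny hypothetical documents (in practice tm's TermDocumentMatrix() does this, together with richer preprocessing):

```r
# Step 1: organize texts into a repository (here, just a named character vector)
docs <- c(d1 = "arte citae rates arte", d2 = "arte leves currus")

# Step 2: convenient representation -- lowercase and tokenize on whitespace
tokens <- lapply(docs, function(x) strsplit(tolower(x), "\\s+")[[1]])

# Step 3: transform into a structured format, a term-document matrix
terms <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))
tdm   # rows are terms, columns are documents, cells are counts
```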
  • 14. The framework • Documents come in different file formats and from different locations → standardized interfaces to access the documents (sources) • Metadata gives valuable insights into the document structure → the framework must make metadata easy to use so we can work efficiently with the documents • must provide tools and algorithms to perform common tasks (transformations) • must support extracting patterns of interest (filtering)
  • 15. Text document collections: Corpus Constructor: Corpus(object = ..., readerControl = list(reader = object@DefaultReader, language = "en_US", load = FALSE)) Example: >txt <- system.file("texts", "txt", package = "tm") >(ovid <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "la", load = TRUE))) A corpus with 5 text documents
  • 16. Corpus: Meta Data >meta(ovid[[1]]) Available meta data pairs are: Author : DateTimeStamp: 2013-11-19 18:54:04 Description : Heading : ID : ovid_1.txt Language : la Origin : >ID(ovid[[1]]) [1] "ovid_1.txt"
  • 17. Corpus: Document’s text >ovid[[1]] Si quis in hoc artem populo non novit amandi, hoc legat et lecto carmine doctus amet. arte citae veloque rates remoque moventur, arte leves currus: arte regendus amor. curribus Automedon lentisque erat aptus habenis, Tiphys in Haemonia puppe magister erat: me Venus artificem tenero praefecit Amori; Tiphys et Automedon dicar Amoris ego. ille quidem ferus est et qui mihi saepe repugnet: sed puer est, aetas mollis et apta regi. Phillyrides puerum cithara perfecit Achillem, atque animos placida contudit arte feros. qui totiens socios, totiens exterruit hostes, creditur annosum pertimuisse senem.
  • 18. Corpus: Meta Data >c(ovid[1:2], ovid[3:4]) A corpus with 4 text documents >length(ovid) 5 >summary(ovid) A corpus with 5 text documents The metadata consists of 2 tag-value pairs and a data frame Available tags are: create_date creator Available variables in the data frame are: MetaID
  • 19. Corpus: Meta Data >CMetaData(ovid) $create_date [1] "2013-11-19 18:54:04 GMT" $creator [1] "“ >DMetaData(ovid) MetaID 1 0 2 0 3 0 4 0 5 0
  • 20. Corpus: Transformations and Filters >getTransformations() [1] "as.PlainTextDocument" "removeNumbers" "removePunctuation" "removeWords" [5] "stemDocument" "stripWhitespace" >tm_map(ovid, FUN = tolower) A corpus with 5 text documents >getFilters() [1] "searchFullText" "sFilter" "tm_intersect" >tm_filter(ovid, FUN = searchFullText, "Venus", doclevel = TRUE) A corpus with 1 text document
  • 21. Text Preprocessing: import >txt <- system.file("texts", "acq", package = "tm") >(acq <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "en", load = TRUE))) A corpus with 50 text documents >txt <- system.file("texts", "crude", package = "tm") >(crude <- Corpus(DirSource(txt), readerControl = list(reader = readPlain, language = "en", load = TRUE))) A corpus with 20 text documents • resulting in 50 articles of topic acq and 20 articles of topic crude
  • 22. Preprocessing: stemming • Morphological variants of a word (morphemes) are similar terms derived from a common stem: engineer, engineered, engineering; use, user, users, used, using • Stemming in information retrieval: grouping words with a common stem together. For example, a search on reads also finds read, reading, and readable • Stemming consists of removing suffixes and conflating the resulting morphemes. Occasionally, prefixes are also removed.
  • 23. Preprocessing: stemming • Reduce terms to their "roots" • automate(s), automatic, automation all reduce to automat; for example, compressed and compression are both accepted as equivalent to compress — the stemmed sentence reads: "for exampl compress and compress ar both accept as equival to compress"
  • 24. Preprocessing: stemming Typical rules in stemming: sses → ss, ies → i, ational → ate, tional → tion. Rules sensitive to the measure of the word, e.g. (m>1) EMENT → "": replacement → replac, but cement → cement (unchanged, since its measure is too small)
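The four suffix rules on this slide can be sketched as a toy base-R stemmer (only these rules, ignoring the measure condition; in real work use tm's stemDocument or SnowballC::wordStem):

```r
# apply a few Porter-style suffix rules in order; the first matching rule wins
toy_stem <- function(word) {
  rules <- list(c("sses$",    "ss"),
                c("ies$",     "i"),
                c("ational$", "ate"),    # checked before tional$, which it overlaps
                c("tional$",  "tion"))
  for (r in rules) {
    if (grepl(r[1], word)) return(sub(r[1], r[2], word))
  }
  word   # no rule matched: leave the word unchanged
}

sapply(c("caresses", "ponies", "relational", "conditional"), toy_stem)
# caresses -> caress, ponies -> poni, relational -> relate, conditional -> condition
```

Note the rule order: ational$ must be tried before tional$, otherwise "relational" would wrongly stem to "relation" instead of "relate".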
  • 25. Preprocessing: stemming • helps recall for some queries but harms precision on others • fine distinctions may be lost through stemming
  • 26. Preprocessing: stemming >acq[[10]] Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter >stemDocument(acq[[10]]) Gulf Appli Technolog Inc said it sold it subsidiari engag in pipelin and terminal oper for 12.2 mln dlrs. The compani said the sale is subject to certain post clos adjustments, which it did not explain. Reuter >tm_map(acq, stemDocument) A corpus with 50 text documents
  • 27. Preprocessing: Whitespace elimination & lower case conversion >stripWhitespace(acq[[10]]) Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter >tolower(acq[[10]]) gulf applied technologies inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. the company said the sale is subject to certain post closing adjustments, which it did not explain. reuter
  • 28. Preprocessing: Stopword removal Very common words, such as of, and, the, are rarely of use in information retrieval. A long stop list saves space in indexes, speeds processing, and eliminates many false hits. However, common words are sometimes significant in information retrieval, which is an argument for a short stop list. (Consider the query, "To be or not to be?")
  • 29. Preprocessing: Stopword removal Include the most common words in the English language (perhaps 50 to 250 words). Do not include words that might be important for retrieval (Among the 200 most frequently occurring words in general literature in English are time, war, home, life, water, and world). In addition, include words that are very common in context (e.g., computer, information, system in a set of computing documents).
  • 30. Preprocessing: Stopword removal about above according across actually adj after afterwards again against all almost alone along already also although always among amongst an another any anyhow anyone anything anywhere are aren't around at be became because become becomes becoming been before beforehand begin beginning behind being below beside besides between beyond billion both but by can can't cannot caption co could couldn't did didn't do does doesn't don't down during each eg eight eighty either else elsewhere end ending enough etc even ever every everyone everything
  • 31. Preprocessing: Stopword removal How many words should be in the stop list? • Long list lowers recall Which words should be in list? • Some common words may have retrieval importance: -- war, home, life, water, world • In certain domains, some words are very common: -- computer, program, source, machine, language
  • 32. Preprocessing: Stopword removal >mystopwords <- c("and", "for", "in", "is", "it", "not", "the", "to") >removeWords(acq[[10]], mystopwords) Gulf Applied Technologies Inc said sold its subsidiaries engaged pipeline terminal operations 12.2 mln dlrs. The company said sale subject certain post closing adjustments, which did explain. Reuter >tm_map(acq, removeWords, mystopwords) A corpus with 50 text documents
  • 33. Preprocessing: Synonyms >library("wordnet") >synonyms("company") [1] "caller" "companionship" "company" "fellowship" [5] "party" "ship's company" "society" "troupe" >replaceWords(acq[[10]], synonyms(dict, "company"), by = "company") >tm_map(acq, replaceWords, synonyms(dict, "company"), by = "company")
  • 34. Preprocessing: Part of speech tagging >library("NLP"); library("openNLP") s <- as.String(acq[[10]]) ## Need sentence and word token annotations. sent_token_annotator <- Maxent_Sent_Token_Annotator() word_token_annotator <- Maxent_Word_Token_Annotator() a2 <- annotate(s, list(sent_token_annotator, word_token_annotator)) pos_tag_annotator <- Maxent_POS_Tag_Annotator() a3 <- annotate(s, pos_tag_annotator, a2) a3w <- subset(a3, type == "word") tags <- sapply(a3w$features, "[[", "POS") sprintf("%s/%s", s[a3w], tags)
  • 35. Preprocessing: Part of speech tagging "Gulf/NNP" "Applied/NNP" "Technologies/NNP" "Inc/NNP" "said/VBD" "it/PRP" "sold/VBD" "its/PRP$" "subsidiaries/NNS" "engaged/VBN" "in/IN" "pipeline/NN" "and/CC" "terminal/NN" "operations/NNS" "for/IN" "12.2/CD" "mln/NN" "dlrs/NNS" "./." "The/DT" "company/NN" "said/VBD" "the/DT" "sale/NN" "is/VBZ" "subject/JJ" "to/TO" "certain/JJ" "post/NN" "closing/NN" "adjustments/NNS" ",/," "which/WDT" "it/PRP" "did/VBD" "not/RB" "explain/VB" "./." "Reuter/NNP"
  • 37. Classification using KNN K-Nearest Neighbor algorithm: • Most basic instance-based method • Data are represented in a vector space • Supervised learning: given a finite set of classes V = {v1, ..., vn}, k-NN returns the most common value among the k training examples nearest to xq.
  • 39. KNN Training algorithm: for each training example <x, f(x)>, add the example to the list. Classification algorithm: given a query instance xq to be classified, let x1, ..., xk be the k instances nearest to xq and predict f(xq) = argmax over v in V of the sum over i = 1..k of 𝛿(v, f(xi)), where 𝛿(a,b) = 1 if a = b, else 𝛿(a,b) = 0 (Kronecker delta)
  • 40. Classification using KNN: Example Two classes: red and blue; green is unknown. With k=3, the classification is red; with k=4, the classification is blue
  • 41. How to determine a good value for k? • Determined experimentally • Start with k=1 and use a test set to validate the error rate of the classifier • Repeat with k = k+2 • Choose the value of k for which the error rate is minimum • Note: k should be an odd number to avoid ties
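The procedure above can be sketched with a hand-rolled nearest-neighbour vote on a small synthetic data set (base R only, for illustration; the speech experiments later in the deck use class::knn instead):

```r
set.seed(1)
# two 2-D Gaussian classes, 20 points each (synthetic data, not from the slides)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 2), ncol = 2))
y <- rep(c("red", "blue"), each = 20)
train <- sample(40, 28)              # 70% hold-out, as on slide 42
test  <- setdiff(1:40, train)

# classify one query point by majority vote among its k nearest training points
knn_one <- function(q, k) {
  d <- sqrt(rowSums((x[train, ] - matrix(q, length(train), 2, byrow = TRUE))^2))
  votes <- y[train][order(d)[1:k]]
  names(which.max(table(votes)))
}

# try odd k = 1, 3, 5, ... and report the test error rate for each
for (k in seq(1, 9, by = 2)) {
  pred <- apply(x[test, ], 1, knn_one, k = k)
  cat("k =", k, " error rate =", mean(pred != y[test]), "\n")
}
```

One would then pick the k with the lowest observed error rate; odd k avoids ties in the two-class vote.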
  • 42. KNN for speech classification Dataset: 40 instances (Barack Obama: 20 speeches, Mitt Romney: 20 speeches). Training set: 70% (28). Test set: 30% (12). Accuracy: on average more than 90%
  • 43. Speech Classification Implementation in R #initialize the R environment libs<-c("tm","plyr","class") lapply(libs,require,character.only=TRUE) #Set parameters / source directory dir.names<-c("obama","romney") path<-"E:/Ashraf/speeches" #clean text / preprocessing cleanCorpus<-function(corpus){ corpus.tmp<-tm_map(corpus,removePunctuation) corpus.tmp<-tm_map(corpus.tmp,stripWhitespace) corpus.tmp<-tm_map(corpus.tmp,tolower) corpus.tmp<-tm_map(corpus.tmp,removeWords,stopwords("english")) return (corpus.tmp) }
  • 44. Speech Classification Implementation in R #build term document matrix generateTDM<-function(dir.name,dir.path){ s.dir<-sprintf("%s/%s",dir.path,dir.name) s.cor<-Corpus(DirSource(directory=s.dir,encoding="ANSI")) s.cor.cl<-cleanCorpus(s.cor) s.tdm<-TermDocumentMatrix(s.cor.cl) s.tdm<-removeSparseTerms(s.tdm,0.7) result<-list(name=dir.name,tdm=s.tdm) } tdm<-lapply(dir.names,generateTDM,dir.path=path)
  • 45. Speech Classification Implementation in R #attach candidate name to each row of TDM bindCandidateToTDM<-function(tdm){ s.mat<-t(data.matrix(tdm[["tdm"]])) s.df<-as.data.frame(s.mat,StringAsFactors=FALSE) s.df<-cbind(s.df,rep(tdm[["name"]],nrow(s.df))) colnames(s.df)[ncol(s.df)]<-"targetcandidate" return (s.df) } candTDM<-lapply(tdm,bindCandidateToTDM)
  • 46. Speech Classification Implementation in R #stack the TDMs together (for both Obama and Romney) tdm.stack<-do.call(rbind.fill,candTDM) tdm.stack[is.na(tdm.stack)]<-0 #hold-out / splitting training and test data sets train.idx<-sample(nrow(tdm.stack),ceiling(nrow(tdm.stack)*0.7)) test.idx<-(1:nrow(tdm.stack))[-train.idx]
  • 47. Speech Classification Implementation in R #model KNN tdm.cand<-tdm.stack[,"targetcandidate"] tdm.stack.nl<-tdm.stack[,!colnames(tdm.stack)%in%"targetcandidate"] knn.pred<- knn(tdm.stack.nl[train.idx,],tdm.stack.nl[test.idx,],tdm.cand[train.idx]) #accuracy of the prediction conf.mat<-table('Predictions'=knn.pred,Actual=tdm.cand[test.idx]) (accuracy<-(sum(diag(conf.mat))/length(test.idx))*100) #show result show(conf.mat) show(accuracy)
  • 49. References 1. Ingo Feinerer, Kurt Hornik, David Meyer, "Text Mining Infrastructure in R", Journal of Statistical Software, Vol. 25, Issue 5, March 2008. 2. http://mittromneycentral.com/speeches/ 3. http://obamaspeeches.com/ 4. http://cran.r-project.org/