Applied Management Research Project
TEXT CLASSIFICATION
using
SUPPORT VECTOR MACHINES in R
Report submitted to the
Indian Institute of Technology, Kharagpur
In partial fulfillment
For the award of the degree
of
Master of Business Administration
by
Kotni Sai Srinivas [14BM60083]
Under the guidance of
Prof. Susmita Mukhopadhyay
VINOD GUPTA SCHOOL OF MANAGEMENT
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
April, 2016
CERTIFICATE
This is to certify that the Applied Management Research Project report titled ‘Text Classification
using Support Vector Machines in R’, submitted by Kotni Sai Srinivas bearing Roll No. 14BM60083
to the Indian Institute of Technology, Kharagpur, is a record of bona fide research work under my
supervision, and I consider it worthy of consideration for the award of the degree of Master of Business
Administration in accordance with the regulations of the Institute.
Date: _____________________
Supervisor
CERTIFICATE OF EXAMINATION
DD/MM/YYYY
Certified that the Applied Management Research Project report titled ‘Text Classification using
Support Vector Machines in R’, submitted by Kotni Sai Srinivas bearing Roll No. 14BM60083 to the
Indian Institute of Technology, Kharagpur, towards the partial fulfillment of the requirements for the
award of the degree of Master of Business Administration has been accepted by the panel of examiners,
and that the student has successfully defended the work in the viva-voce examination held today.
Panel Member 1 Panel Member 2
Panel Member 3 Panel Member 4
ACKNOWLEDGEMENT
I would like to thank my guide Prof. Susmita Mukhopadhyay for her support, guidance and the keen
interest with which she helped me solve various problems concerning the project, and for taking
out her precious time amidst her busy schedule.
I wish to thank Prof. Sujoy Bhattacharya and Prof. Parama Barai for teaching me classification
techniques and the R programming language, both fundamental requisites for this project. I
also take this opportunity to thank all my Professors at Vinod Gupta School of Management, IIT
Kharagpur, who have been my constant source of inspiration and guidance. I have learnt a lot during
the interaction with them. This learning has helped me in successfully completing the tasks I was
assigned as part of my research.
I would like to gratefully acknowledge the Vinod Gupta School of Management for offering this
wonderful opportunity and platform to gain exposure and knowledge about the various
aspects of management. I can say with conviction that I have benefited immensely from
my association with this prestigious school as a student.
EXECUTIVE SUMMARY
Automated text classification is considered a vital method for managing and processing the vast
and continuously growing volume of documents in digital form. In general, text classification plays
an important role in information extraction and summarization, text retrieval, and question
answering, making it possible to easily extract actionable data from text.
Conceptual process and framework
A text mining analysis involves several challenging process steps mainly influenced by the fact that
texts, from a computer perspective, are rather unstructured collections of words. A text mining
analyst typically starts with a set of highly heterogeneous input texts. So the first step is to import
these texts into one's favourite computing environment, in our case R. Simultaneously it is important
to organize and structure the texts to be able to access them in a uniform manner. Once the texts are
organized in a repository, the second step is tidying up the texts, including pre-processing the texts to
obtain a convenient representation for later analysis. This step might involve text reformatting (e.g.,
whitespace removal), stopword removal, or stemming procedures. Third, the analyst must be able to
transform the pre-processed texts into structured formats to be actually computed with. For
classical" text mining tasks, this normally implies the creation of a so-called term-document matrix,
probably the most common format to represent texts for computation. Now the analyst can work and
compute on texts with standard techniques from statistics and data mining, like clustering or
classification methods.
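To make this process model concrete, here is a minimal sketch of the four steps using the tm
package (the folder name "texts" is a placeholder):
#step 1: import the texts into R and organize them as a corpus
library(tm)
corpus <- Corpus(DirSource("texts"))
#step 2: tidy up – whitespace removal and stopword removal as examples
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#step 3: transform into a structured format, the term-document matrix
tdm <- TermDocumentMatrix(corpus)
#step 4: standard statistics and data mining techniques can now be applied to tdm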
This rather typical process model highlights important steps that call for support by a text mining
infrastructure: a text mining framework must offer functionality for managing text documents,
should abstract the process of document manipulation, and should ease the usage of heterogeneous
text formats. Thus there is a need for a conceptual entity similar to a database holding and managing
text documents in a generic way: we call this entity a text document collection or corpus.
Since text documents are present in different file formats and in different locations, like a compressed
file on the Internet or a locally stored text file with additional annotations, there has to be an
encapsulating mechanism providing standardized interfaces to access the document data. We
subsume this functionality in so-called sources.
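As a brief illustration, two of the sources provided by the tm package are sketched below (the
directory path and texts are placeholders):
#a source wrapping a directory of plain-text files on disk
src <- DirSource("texts")
#a source wrapping an in-memory character vector, one document per element
src2 <- VectorSource(c("first document", "second document"))
corpus <- Corpus(src2)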
Besides the actual textual data, many modern file formats provide features to annotate text documents
(e.g., XML with special tags), i.e., there is metadata available which further describes and enriches
the textual content and might offer valuable insights into the document structure or additional
concepts. Also, additional metadata is likely to be created during an analysis. Therefore the
framework must support metadata usage in a convenient way, both on a document level
(e.g., short summaries or descriptions of selected documents) and on a collection level (e.g.,
collection-wide classification tags).
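In tm, for instance, metadata is accessed through the meta() function; a minimal sketch, with an
illustrative tag name:
#show all metadata attached to the first document in the corpus
meta(corpus[[1]])
#set a document-level metadata tag
meta(corpus[[1]], "summary") <- "a short description of this document"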
Alongside the data infrastructure for text documents the framework must provide tools and
algorithms to efficiently work with the documents. That means the framework has to have
functionality to perform common tasks, like whitespace removal, stemming or stopword deletion.
We denote such functions operating on text document collections as transformations. Another
important concept is filtering, which basically involves applying predicate functions on collections to
extract patterns of interest. A surprisingly challenging operation is that of joining text document
collections. Merging sets of documents is straightforward, but merging metadata intelligently needs
more sophisticated handling, since storing metadata from different sources in successive steps
necessarily results in a hierarchical, tree-like structure. The challenge is to keep these joins and
subsequent look-up operations efficient for large document collections.
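A minimal sketch of both operations with tm, assuming two corpora corpusA and corpusB already
exist (the length threshold is illustrative):
#filtering: keep only the documents satisfying a predicate function
longDocs <- tm_filter(corpusA, FUN = function(doc) nchar(as.character(doc)) > 1000)
#joining: concatenation merges the document sets of the two collections
merged <- c(corpusA, corpusB)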
Realistic scenarios in text mining involve anywhere from several hundred up to several
hundred thousand text documents. This means that compact storage of the documents in a document
collection is relevant for keeping RAM usage reasonable: a simple approach that holds all documents
fully in memory once read in would quickly bring down even generously RAM-equipped systems for
collections of only a few thousand text documents. However, simple database-oriented mechanisms
can already circumvent this situation, e.g., by holding only pointers or hash tables in memory instead
of full documents.
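tm offers such a mechanism: alongside the default in-memory corpus, a PCorpus keeps the documents
in an on-disk database (via the filehash package) and holds only references in memory. A hedged
sketch, with an illustrative database name:
#permanent corpus: documents are stored on disk rather than fully in memory
pcorp <- PCorpus(DirSource("texts"),
dbControl = list(dbName = "corpus.db", dbType = "DB1"))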
Text mining typically involves doing computations on texts to gain interesting information. The
most common approach is to create a so-called term-document matrix holding frequencies of distinct
terms for each document. Another approach is to compute directly on character sequences as is done
by string kernel methods. Thus the framework must provide mechanisms for building term-document
matrices as well as interfaces to access the document corpora as plain character sequences.
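Both orientations are available in tm; a brief sketch:
#terms as rows, documents as columns
tdm <- TermDocumentMatrix(corpus)
#documents as rows, terms as columns (the transpose)
dtm <- DocumentTermMatrix(corpus)
#inspect a small corner of the (sparse) matrix
inspect(tdm[1:5, 1:3])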
Basically, the framework and infrastructure supplied by tm aims at implementing the conceptual
framework presented above. The next section will introduce the data structures and algorithms
provided.
Create a new folder called TextMining and store the documents in that folder. The > prompt in the
RStudio console indicates that R is ready to process commands. To see the current working directory,
type getwd() and hit return. You’ll see something like:
getwd()
[1] "C:/Users/Documents"
The exact output will of course depend on your working directory. Note the forward slashes in the
path. This is because of R’s Unix heritage (backslash is an escape character in R). So, here’s how you
would change the working directory to C:\Users:
setwd("C:/Users")
You can now use getwd() to check that setwd() has done what it should.
getwd()
[1] "C:/Users"
To load the data into R, start RStudio and open the TextMining project you created earlier. The next
step is to load the tm package, as this is not loaded by default. This is done using the library() function
like so:
library(tm)
Loading required package: NLP
Dependent packages are loaded automatically – in this case the dependency is on the NLP (natural
language processing) package.
Next, we need to create a collection of documents (technically referred to as a Corpus) in the R
environment. This basically involves loading the files created in the TextMining folder into a Corpus
object. The tm package provides the Corpus() function to do this. There are several ways to create a
Corpus. In a nutshell, the Corpus() function can read from various sources including a directory.
That’s the option we’ll use:
#Create Corpus
docs <- Corpus(DirSource("C:/Users/Kailash/Documents/TextMining"))
A couple of things to note in the above. Any line that starts with a # is a comment, and the "<-" tells
R to assign the result of the command on the right hand side to the variable on the left hand side. In
this case the Corpus object created is stored in a variable called docs. One can also use the equals
sign (=) for assignment if one wants to.
Type in docs to see some information about the newly created corpus:
docs
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 30
The summary() function gives more details, including a complete listing of files, but it isn’t
particularly enlightening. Instead, we’ll examine a particular document in the corpus.
#inspect a particular document
writeLines(as.character(docs[[30]]))
…output not shown…
This prints the entire content of the 30th document in the corpus to the console.
Pre-processing
Data cleansing, though tedious, is perhaps the most important step in text analysis. As we will see,
dirty data can play havoc with the results. Furthermore, as we will also see, data cleaning is
invariably an iterative process as there are always problems that are overlooked the first time around.
The tm package offers a number of transformations that ease the tedium of cleaning data. To see the
available transformations, type getTransformations() at the R prompt:
> getTransformations()
[1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"
There are a few preliminary clean-up steps we need to do before we use these powerful
transformations. If you inspect some documents in the corpus (and you know how to do that now),
you will notice some quirks in the writing. For example, colons and hyphens are used without spaces
between the words they separate. Using the removePunctuation transform without fixing this
will cause the two words on either side of the symbol to be combined. Clearly, we need to fix this
prior to using the transformations.
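A quick illustration using base R’s gsub() (which we will meet properly in a moment), with a
hypothetical hyphenated word:
gsub("[[:punct:]]", "", "high-quality")   #gives "highquality" – the words are merged
gsub("-", " ", "high-quality")            #gives "high quality" – both words survive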
To fix the above, one has to create a custom transformation. The tm package provides the ability to
do this via the content_transformer function. This function takes a function as input; the input
function should specify the transformation to be performed. In this case, the input function would
be one that replaces all instances of a given character with spaces. As it turns out, the gsub() function
does just that.
Here is the R code to build the content transformer, which we will call toSpace:
#create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
Now we can use this content transformer to eliminate colons and hyphens like so:
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
#Remove punctuation - replace punctuation marks with " "
docs <- tm_map(docs, removePunctuation)
Inspecting the corpus reveals that several "non-standard" punctuation marks have not been removed.
These include the single curly quote marks and a space-hyphen combination. These can be removed
using our custom content transformer, toSpace. Note that you might want to copy-n-paste these
symbols directly from the relevant text file to ensure that they are accurately represented in toSpace.
docs <- tm_map(docs, toSpace, "’")
docs <- tm_map(docs, toSpace, "‘")
docs <- tm_map(docs, toSpace, " -")
Inspect the corpus again to ensure that the offenders have been eliminated. This is also a good time
to check for any other special symbols that may need to be removed manually.
If all is well, you can move to the next step which is to:
1. Convert the corpus to lower case
2. Remove all numbers
Since R is case sensitive, “Text” is not equal to “text” – hence the rationale for converting to a
standard case. However, although there is a tolower transformation, it is not part of the standard tm
transformations (see the output of getTransformations() in the previous section). For this reason, we
have to wrap tolower in a transformation that can handle a corpus object properly. This is done
with the help of our new friend, content_transformer.
Here’s the relevant code:
#Transform to lower case (need to wrap in content_transformer)
docs <- tm_map(docs, content_transformer(tolower))
Text analysts are typically not interested in numbers, since these do not usually contribute to the
meaning of the text. However, this may not always be so. For example, it is definitely not the case if
one is interested in counting the number of times a particular year appears in a corpus. The
removeNumbers transformation strips digits from the text; it does not need to be wrapped in
content_transformer as it is a standard transformation in tm.
#Strip digits (std transformation, so no need for content_transformer)
docs <- tm_map(docs, removeNumbers)
Once again, be sure to inspect the corpus before proceeding.
The next step is to remove common words from the text. These include articles (a, an,
the), conjunctions (and, or, but, etc.), common verbs (is), and qualifiers (yet, however, etc.). The tm
package includes a standard list of such words, which are referred to as stop words. We remove stop
words using the standard removeWords transformation like so:
#remove stopwords using the standard list in tm
docs <- tm_map(docs, removeWords, stopwords("english"))
Finally, we remove all extraneous whitespace using the stripWhitespace transformation:
#Strip whitespace (cosmetic?)
docs <- tm_map(docs, stripWhitespace)
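One further preprocessing step appears in the consolidated code listing below: stemming, which
reduces inflected words to a common root (for example, “offer”, “offered” and “offering” all become
“offer”) so that they are counted as a single term. This is done with the standard stemDocument
transformation, which relies on the SnowballC package:
#Stem the corpus (requires the SnowballC package)
docs <- tm_map(docs, stemDocument)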
Code:
#load the required packages (wordcloud is needed for the plot at the end)
library(tm)
library(wordcloud)
#set the working directory and load the corpus
getwd()
setwd("C:/Users/user.user-PC.000/Documents")
docs <- Corpus(DirSource("C:/Users/user.user-PC.000/Documents/TextMining"))
docs
writeLines(as.character(docs[[30]]))
#pre-processing
getTransformations()
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, toSpace, "’")
docs <- tm_map(docs, toSpace, "‘")
docs <- tm_map(docs, toSpace, " -")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
writeLines(as.character(docs[[30]]))
#build the document-term matrix and rank terms by frequency
dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq, decreasing=TRUE)
freq[head(ord)]
#restrict to terms of 4-20 characters that appear in 3 to 27 documents
dtmr <- DocumentTermMatrix(docs, control=list(wordLengths=c(4, 20),
                                              bounds=list(global=c(3, 27))))
freqr <- colSums(as.matrix(dtmr))
ordr <- order(freqr, decreasing=TRUE)
freqr[head(ordr)]
#plot a word cloud of terms that appear at least 70 times
wordcloud(names(freqr), freqr, min.freq=70)
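The classification step itself builds on the document-term matrix computed above. As a hedged
illustration of the technique in the report’s title, the sketch below trains a support vector machine
using the e1071 package; the labels vector is hypothetical and stands in for whatever category tags
the documents actually carry:
#illustrative sketch only: 'labels' is a placeholder for real document categories
library(e1071)
m <- as.matrix(dtmr)
labels <- factor(rep(c("classA", "classB"), length.out = nrow(m)))  #hypothetical labels
train <- sample(seq_len(nrow(m)), size = floor(0.7 * nrow(m)))
#train a linear-kernel SVM on 70% of the documents
fit <- svm(x = m[train, ], y = labels[train], kernel = "linear")
#predict the held-out 30% and tabulate predicted vs actual labels
pred <- predict(fit, m[-train, ])
table(predicted = pred, actual = labels[-train])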