Applied Management Research Project
TEXT CLASSIFICATION
using
SUPPORT VECTOR MACHINES in R
Report submitted to the
Indian Institute of Technology, Kharagpur
In partial fulfillment
For the award of the degree
of
Master of Business Administration
by
Kotni Sai Srinivas [14BM60083]
Under the guidance of
Prof. Susmita Mukhopadhyay
VINOD GUPTA SCHOOL OF MANAGEMENT
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
April, 2016
CERTIFICATE
This is to certify that the Applied Management Research Project report titled ‘Text Classification
using Support Vector Machines in R’, submitted by Kotni Sai Srinivas bearing Roll No. 14BM60083
to the Indian Institute of Technology, Kharagpur, is a record of bona fide research work under my
supervision, and I consider it worthy of consideration for the award of the degree of Master of Business
Administration in accordance with the regulations of the Institute.
Date: _____________________
Supervisor
CERTIFICATE OF EXAMINATION
DD/MM/YYYY
Certified that the Applied Management Research Project report titled ‘Text Classification using
Support Vector Machines in R’, submitted by Kotni Sai Srinivas bearing Roll No. 14BM60083 to the
Indian Institute of Technology, Kharagpur, towards the partial fulfillment of the requirements for the
award of the degree of Master of Business Administration has been accepted by the panel of examiners,
and that the student has successfully defended the work in the viva-voce examination held today.
Panel Member 1 Panel Member 2
Panel Member 3 Panel Member 4
ACKNOWLEDGEMENT
I would like to thank my guide Prof. Susmita Mukhopadhyay for her support, guidance and the keen
interest with which she helped me solve various problems concerning the project, and for taking
out her precious time amidst her busy schedule.
I wish to thank Prof. Sujoy Bhattacharya and Prof. Parama Barai for teaching me classification
techniques and the R programming language, both fundamental requisites for this project. I
also take this opportunity to thank all my Professors at Vinod Gupta School of Management, IIT
Kharagpur, who have been my constant source of inspiration and guidance. I have learnt a lot during
the interaction with them. This learning has helped me in successfully completing the tasks I was
assigned as part of my research.
I would like to gratefully acknowledge the Vinod Gupta School of Management for offering this
wonderful opportunity and platform to gain exposure and knowledge about the various
aspects of management. I can say with conviction that I have benefited immensely from
my association with this prestigious school as a student.
EXECUTIVE SUMMARY
Automated text classification is considered a vital method for managing and processing the vast
and continuously growing volume of documents in digital form. In general, text classification plays
an important role in information extraction and summarization, text retrieval, and question
answering, making it possible to easily extract actionable data from text.
Conceptual process and framework
A text mining analysis involves several challenging process steps mainly influenced by the fact that
texts, from a computer perspective, are rather unstructured collections of words. A text mining
analyst typically starts with a set of highly heterogeneous input texts. So the first step is to import
these texts into one's favourite computing environment, in our case R. Simultaneously it is important
to organize and structure the texts to be able to access them in a uniform manner. Once the texts are
organized in a repository, the second step is tidying up the texts, including pre-processing the texts to
obtain a convenient representation for later analysis. This step might involve text reformatting (e.g.,
whitespace removal), stopword removal, or stemming procedures. Third, the analyst must be able to
transform the pre-processed texts into structured formats to be actually computed with. For
classical" text mining tasks, this normally implies the creation of a so-called term-document matrix,
probably the most common format to represent texts for computation. Now the analyst can work and
compute on texts with standard techniques from statistics and data mining, like clustering or
classification methods.
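To make this process model concrete, here is a minimal sketch of the four steps using the tm
package (the folder name "texts" is a placeholder):
#step 1: import the texts into R and organize them as a corpus
library(tm)
corpus <- Corpus(DirSource("texts"))
#step 2: tidy up – whitespace removal and stopword removal as examples
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#step 3: transform into a structured format, the term-document matrix
tdm <- TermDocumentMatrix(corpus)
#step 4: standard statistics and data mining techniques can now be applied to tdm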
This rather typical process model highlights important steps that call for support by a text mining
infrastructure: a text mining framework must offer functionality for managing text documents,
should abstract the process of document manipulation, and should ease the usage of heterogeneous
text formats. Thus there is a need for a conceptual entity similar to a database holding and managing
text documents in a generic way: we call this entity a text document collection or corpus.
Since text documents are present in different file formats and in different locations, like a compressed
file on the Internet or a locally stored text file with additional annotations, there has to be an
encapsulating mechanism providing standardized interfaces to access the document data. We
subsume this functionality in so-called sources.
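As a brief illustration, two of the sources provided by the tm package are sketched below (the
directory path and texts are placeholders):
#a source wrapping a directory of plain-text files on disk
src <- DirSource("texts")
#a source wrapping an in-memory character vector, one document per element
src2 <- VectorSource(c("first document", "second document"))
corpus <- Corpus(src2)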
Besides the actual textual data, many modern file formats provide features to annotate text documents
(e.g., XML with special tags), i.e., there is metadata available which further describes and enriches
the textual content and might offer valuable insights into the document structure or additional
concepts. Also, additional metadata is likely to be created during an analysis. Therefore the
framework must support metadata usage in a convenient way, both on a document level
(e.g., short summaries or descriptions of selected documents) and on a collection level (e.g.,
collection-wide classification tags).
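In tm, for instance, metadata is accessed through the meta() function; a minimal sketch, with an
illustrative tag name:
#show all metadata attached to the first document in the corpus
meta(corpus[[1]])
#set a document-level metadata tag
meta(corpus[[1]], "summary") <- "a short description of this document"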
Alongside the data infrastructure for text documents the framework must provide tools and
algorithms to efficiently work with the documents. That means the framework has to have
functionality to perform common tasks, like whitespace removal, stemming or stopword deletion.
We denote such functions operating on text document collections as transformations. Another
important concept is filtering, which basically involves applying predicate functions on collections to
extract patterns of interest. A surprisingly challenging operation is that of joining text document
collections. Merging sets of documents is straightforward, but merging metadata intelligently needs
more sophisticated handling, since storing metadata from different sources in successive steps
necessarily results in a hierarchical, tree-like structure. The challenge is to keep these joins and
subsequent look-up operations efficient for large document collections.
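A minimal sketch of both operations with tm, assuming two corpora corpusA and corpusB already
exist (the length threshold is illustrative):
#filtering: keep only the documents satisfying a predicate function
longDocs <- tm_filter(corpusA, FUN = function(doc) nchar(as.character(doc)) > 1000)
#joining: concatenation merges the document sets of the two collections
merged <- c(corpusA, corpusB)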
Realistic scenarios in text mining involve anywhere from several hundred up to several
hundred thousand text documents. This means that compact storage of the documents in a document
collection is relevant for keeping RAM usage reasonable: a simple approach that holds all documents
fully in memory once read in would quickly bring down even generously RAM-equipped systems for
collections of only a few thousand text documents. However, simple database-oriented mechanisms
can already circumvent this situation, e.g., by holding only pointers or hash tables in memory instead
of full documents.
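tm offers such a mechanism: alongside the default in-memory corpus, a PCorpus keeps the documents
in an on-disk database (via the filehash package) and holds only references in memory. A hedged
sketch, with an illustrative database name:
#permanent corpus: documents are stored on disk rather than fully in memory
pcorp <- PCorpus(DirSource("texts"),
dbControl = list(dbName = "corpus.db", dbType = "DB1"))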
Text mining typically involves doing computations on texts to gain interesting information. The
most common approach is to create a so-called term-document matrix holding frequencies of distinct
terms for each document. Another approach is to compute directly on character sequences as is done
by string kernel methods. Thus the framework must provide mechanisms for building term-document
matrices as well as interfaces to access the document corpora as plain character sequences.
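Both orientations are available in tm; a brief sketch:
#terms as rows, documents as columns
tdm <- TermDocumentMatrix(corpus)
#documents as rows, terms as columns (the transpose)
dtm <- DocumentTermMatrix(corpus)
#inspect a small corner of the (sparse) matrix
inspect(tdm[1:5, 1:3])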
Basically, the framework and infrastructure supplied by tm aims at implementing the conceptual
framework presented above. The next section will introduce the data structures and algorithms
provided.
Create a new folder called TextMining and store the documents in that folder. The > prompt in the
RStudio console indicates that R is ready to process commands. To see the current working directory,
type getwd() and hit return. You’ll see something like:
getwd()
[1] "C:/Users/Documents"
The exact output will of course depend on your working directory. Note the forward slashes in the
path. This is because of R’s Unix heritage (backslash is an escape character in R). So, here’s how you
would change the working directory to C:\Users:
setwd("C:/Users")
You can now use getwd() to check that setwd() has done what it should.
getwd()
[1] "C:/Users"
To load the data into R, start RStudio and open the TextMining project you created earlier. The next
step is to load the tm package, as this is not loaded by default. This is done using the library() function
like so:
library(tm)
Loading required package: NLP
Dependent packages are loaded automatically – in this case the dependency is on the NLP (natural
language processing) package.
Next, we need to create a collection of documents (technically referred to as a Corpus) in the R
environment. This basically involves loading the files created in the TextMining folder into a Corpus
object. The tm package provides the Corpus() function to do this. There are several ways to create a
Corpus. In a nutshell, the Corpus() function can read from various sources including a directory.
That’s the option we’ll use:
#Create Corpus
docs <- Corpus(DirSource("C:/Users/Kailash/Documents/TextMining"))
A couple of things to note in the above. Any line that starts with a # is a comment, and the "<-" tells
R to assign the result of the command on the right hand side to the variable on the left hand side. In
this case the Corpus object created is stored in a variable called docs. One can also use the equals
sign (=) for assignment if one wants to.
Type in docs to see some information about the newly created corpus:
docs
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 30
The summary() function gives more details, including a complete listing of files, but it isn’t
particularly enlightening. Instead, we’ll examine a particular document in the corpus.
#inspect a particular document
writeLines(as.character(docs[[30]]))
…output not shown…
This prints the entire content of the 30th document in the corpus to the console.
Pre-processing
Data cleansing, though tedious, is perhaps the most important step in text analysis. As we will see,
dirty data can play havoc with the results. Furthermore, as we will also see, data cleaning is
invariably an iterative process as there are always problems that are overlooked the first time around.
The tm package offers a number of transformations that ease the tedium of cleaning data. To see the
available transformations, type getTransformations() at the R prompt:
> getTransformations()
[1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"
There are a few preliminary clean-up steps we need to do before we use these powerful
transformations. If you inspect some documents in the corpus (and you know how to do that now),
you will notice some quirks in the writing. For example, colons and hyphens are used without spaces
between the words they separate. Using the removePunctuation transform without fixing this
will cause the two words on either side of the symbol to be combined. Clearly, we need to fix this
prior to using the transformations.
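A quick illustration using base R’s gsub() (which we will meet properly in a moment), with a
hypothetical hyphenated word:
gsub("[[:punct:]]", "", "high-quality")   #gives "highquality" – the words are merged
gsub("-", " ", "high-quality")            #gives "high quality" – both words survive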
To fix the above, one has to create a custom transformation. The tm package provides the ability to
do this via the content_transformer function. This function takes a function as input; the input
function should specify the transformation to be performed. In this case, the input function would
be one that replaces all instances of a given character with spaces. As it turns out, the gsub() function
does just that.
Here is the R code to build the content transformer, which we will call toSpace:
#create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
Now we can use this content transformer to eliminate colons and hyphens like so:
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
#Remove punctuation - replace punctuation marks with " "
docs <- tm_map(docs, removePunctuation)
Inspecting the corpus reveals that several "non-standard" punctuation marks have not been removed.
These include the single curly quote marks and a space-hyphen combination. These can be removed
using our custom content transformer, toSpace. Note that you might want to copy-n-paste these
symbols directly from the relevant text file to ensure that they are accurately represented in toSpace.
docs <- tm_map(docs, toSpace, "’")
docs <- tm_map(docs, toSpace, "‘")
docs <- tm_map(docs, toSpace, " -")
Inspect the corpus again to ensure that the offenders have been eliminated. This is also a good time
to check for any other special symbols that may need to be removed manually.
If all is well, you can move to the next step which is to:
1. Convert the corpus to lower case
2. Remove all numbers
Since R is case sensitive, “Text” is not equal to “text” – hence the rationale for converting to a
standard case. However, although there is a tolower transformation, it is not part of the standard tm
transformations (see the output of getTransformations() in the previous section). For this reason, we
have to wrap tolower in a transformation that can handle a corpus object properly. This is done
with the help of our new friend, content_transformer.
Here’s the relevant code:
#Transform to lower case (need to wrap in content_transformer)
docs <- tm_map(docs, content_transformer(tolower))
Text analysts are typically not interested in numbers, since these do not usually contribute to the
meaning of the text. However, this may not always be so. For example, it is definitely not the case if
one is interested in counting the number of times a particular year appears in a corpus. The
removeNumbers transformation strips digits from the text; it does not need to be wrapped in
content_transformer as it is a standard transformation in tm.
#Strip digits (std transformation, so no need for content_transformer)
docs <- tm_map(docs, removeNumbers)
Once again, be sure to inspect the corpus before proceeding.
The next step is to remove common words from the text. These include articles (a, an,
the), conjunctions (and, or, but, etc.), common verbs (is), and qualifiers (yet, however, etc.). The tm
package includes a standard list of such words, which are referred to as stop words. We remove stop
words using the standard removeWords transformation like so:
#remove stopwords using the standard list in tm
docs <- tm_map(docs, removeWords, stopwords("english"))
Finally, we remove all extraneous whitespace using the stripWhitespace transformation:
#Strip whitespace (cosmetic?)
docs <- tm_map(docs, stripWhitespace)
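One further preprocessing step appears in the consolidated code listing below: stemming, which
reduces inflected words to a common root (for example, “offer”, “offered” and “offering” all become
“offer”) so that they are counted as a single term. This is done with the standard stemDocument
transformation, which relies on the SnowballC package:
#Stem the corpus (requires the SnowballC package)
docs <- tm_map(docs, stemDocument)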
Code:
#load the required packages (wordcloud is needed for the plot at the end)
library(tm)
library(wordcloud)
#set the working directory and load the corpus
getwd()
setwd("C:/Users/user.user-PC.000/Documents")
docs <- Corpus(DirSource("C:/Users/user.user-PC.000/Documents/TextMining"))
docs
writeLines(as.character(docs[[30]]))
#pre-processing
getTransformations()
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, toSpace, "’")
docs <- tm_map(docs, toSpace, "‘")
docs <- tm_map(docs, toSpace, " -")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
writeLines(as.character(docs[[30]]))
#build the document-term matrix and rank terms by frequency
dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq, decreasing=TRUE)
freq[head(ord)]
#restrict to terms of 4-20 characters that appear in 3 to 27 documents
dtmr <- DocumentTermMatrix(docs, control=list(wordLengths=c(4, 20),
                                              bounds=list(global=c(3, 27))))
freqr <- colSums(as.matrix(dtmr))
ordr <- order(freqr, decreasing=TRUE)
freqr[head(ordr)]
#plot a word cloud of terms that appear at least 70 times
wordcloud(names(freqr), freqr, min.freq=70)
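The classification step itself builds on the document-term matrix computed above. As a hedged
illustration of the technique in the report’s title, the sketch below trains a support vector machine
using the e1071 package; the labels vector is hypothetical and stands in for whatever category tags
the documents actually carry:
#illustrative sketch only: 'labels' is a placeholder for real document categories
library(e1071)
m <- as.matrix(dtmr)
labels <- factor(rep(c("classA", "classB"), length.out = nrow(m)))  #hypothetical labels
train <- sample(seq_len(nrow(m)), size = floor(0.7 * nrow(m)))
#train a linear-kernel SVM on 70% of the documents
fit <- svm(x = m[train, ], y = labels[train], kernel = "linear")
#predict the held-out 30% and tabulate predicted vs actual labels
pred <- predict(fit, m[-train, ])
table(predicted = pred, actual = labels[-train])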