1. IDL - International Digital Library Of
Technology & Research
Volume 1, Issue 2, Mar 2017 Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
IDL - International Digital Library 1 | P a g e Copyright@IDL-2017
Automatic Text Classification using Supervised
Learning
Ms. NAYANA N MURTHY 1, Mrs. SHASHIREKHA H 2
Dept. of Computer Science
1 MTech Student, VTU PG Center, Mysuru, India
2 Guide, Assistant Professor, VTU PG Center, Mysuru, India
SURVEY PAPER
1. ABSTRACT - The digitization of text has been
increasing remarkably over time, and
the need to organize, categorize and classify text has
become indispensable. Poorly organized and sparsely
categorized text may gradually
slow down text and information
retrieval. It is therefore important
to organize, categorize and classify texts and digitized
documents according to the schemes proposed by text
mining experts and computer scientists. Automated
text classification is regarded as an essential
method for managing and processing the large,
widespread and continuously growing volume of
documents in digital form. In general, text classification
plays a substantial role in information extraction,
text retrieval, and question answering. This paper
focuses on the text classification process using
machine learning techniques.
2. INTRODUCTION
Automatic text classification has always been an
important application and research topic since the
inception of digital documents. Today, text
classification is a necessity due to the very large
amount of text documents that we have to deal with
daily. In general, text classification includes
topic-based classification and genre-based
classification. Topic-based text categorization
classifies documents according to their topics. Texts
can also be written in many genres, for instance:
scientific articles, news reports, movie reviews, and
advertisements. Genre is defined by the way a text was
created, the way it was edited, the register of language
it uses, and the kind of audience to whom it is
addressed. Previous work on genre classification
recognized that this task differs from topic-based
categorization. Typically, most data for genre
classification are collected from the web, through
newsgroups, bulletin boards, and broadcast or printed
news. They are multi-source, and consequently have
different formats, different preferred vocabularies and
often significantly different writing styles even for
documents within one genre. In other words, the data are
heterogeneous. Intuitively, text classification is the
task of classifying a document under a predefined
category. More formally, if $d_i$ is a document of the
entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the
set of all the categories, then text classification assigns
one category $c_j$ to a document $d_i$. As in every
supervised machine learning task, an initial dataset is
needed. A document may be assigned to more than
one category (Ranking Classification), but this
paper considers only research on Hard Categorization
(assigning a single category to each document).
Moreover, approaches that
take into consideration other information besides the
pure text, such as hierarchical structure of the texts or
date of publication, are not presented. This is because
the main aim of this paper is to present techniques
that make the most of the text of each document and
perform best under this condition.
3. PLAN OF WORK FLOW
B2B market places are an intermediate layer for
business communications that offers one major
advantage to their clients: they can communicate with
a large number of customers through a single
communication channel to the market place. A
successful market place has to deal with various
aspects. It has to integrate with various hardware and
software platforms and has to provide a common
protocol for information exchange. However, the real
problem is the heterogeneity and openness of the
exchanged content. Therefore, content management is
one of the real challenges in successful B2B electronic
commerce. One serious problem is that document
descriptions must be classified. Each document comes
with its own taxonomy, which organizes the document
into its respective categories. Each supplier uses
different structures and vocabularies to describe its
documents. This may not cause a problem for a 1-1
relationship, where the buyer may get used to the
private terminology of his supplier. B2B market places
that enable n-m commerce cannot rely on such an
assumption. They must classify all documents
according to a standard classification schema that helps
buyers and suppliers communicate their document
information. A widely used classification schema
is UNSPSC. Again, it is a difficult and mainly
manual task to classify the documents according to a
classification schema like UNSPSC. It requires
domain expertise and knowledge about the document
domain. Finding the right place for a document
description in a standard classification system such as
UNSPSC is not at all a trivial task. Each document
must be mapped to the corresponding document
category in UNSPSC to create the document catalog.
Document classification schemes contain a huge number
of categories with far from sufficient definitions (e.g.
over 12,000 classes for UNSPSC), and millions of
documents must be classified according to them.
Document classification is expensive, complicated,
time-consuming and error-prone. Content management
needs support in automating the document
classification process. Text mining and machine
learning work together for automatic classification of
documents. The figure below shows the flow of the text
classification process.
One motivation for text mining is
Information Extraction (IE), which extracts specific
information from a document description. Natural
Language Processing (NLP) aims to achieve a better
understanding of natural language by computers
and to represent the description semantically to improve
the classification process. Text representation, an
important aspect of the classification process, denotes the
mapping of a document description into a compact
form of its contents. A description is typically
represented as a vector of term weights (word features)
from a set of terms (dictionary), where each term
occurs at least once in some document description. A major
characteristic of the classification problem is the
extremely high dimensionality of text data. The
number of potential features often exceeds the number
of training examples.
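As a minimal sketch of the term-weight representation described above, the
following Python snippet maps each description to a vector over a shared
dictionary. The TF-IDF weighting, the whitespace tokenizer, the sample
descriptions and the function names are illustrative assumptions; the text
above does not prescribe a particular weighting scheme.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for illustration.
    return text.lower().split()

def build_dictionary(docs):
    # The dictionary holds every term that occurs in at least one description.
    return sorted({term for doc in docs for term in tokenize(doc)})

def tfidf_vector(doc, docs, dictionary):
    # Weight = term frequency in this description * log(N / document frequency).
    counts = Counter(tokenize(doc))
    n_docs = len(docs)
    vector = []
    for term in dictionary:
        df = sum(1 for d in docs if term in tokenize(d))
        vector.append(counts[term] * math.log(n_docs / df))
    return vector

# Hypothetical document descriptions.
docs = ["black ink cartridge", "blue ink pen", "office desk lamp"]
dictionary = build_dictionary(docs)
vectors = [tfidf_vector(d, docs, dictionary) for d in docs]
```

Even in this toy case the dictionary has eight terms for only three
descriptions, which illustrates how the number of features can exceed the
number of training examples.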
4. SURVEY
4.1 Text Classification Using Machine Learning
Techniques “This survey presents machine
learning techniques. Automatic text classification
has always been an important application and research
topic since the inception of digital documents. Today,
text classification is a necessity due to the very large
amount of text documents that we have to deal with
daily. In general, text classification includes
topic-based classification and genre-based
classification. Topic-based text categorization
classifies documents according to their topics. Texts
can also be written in many genres, for instance:
scientific articles, news reports, movie reviews, and
advertisements. Genre is defined by the way a text was
created, the way it was edited, the register of language
it uses, and the kind of audience to whom it is
addressed. Previous work on genre classification
recognized that this task differs from topic-based
categorization. Typically, most data for genre
classification are collected from the web, through
newsgroups, bulletin boards, and broadcast or printed
news. They are multi-source, and consequently have
different formats, different preferred vocabularies and
often significantly different writing styles even for
documents within one genre. In other words, the data are
heterogeneous. Intuitively, text classification is the
task of classifying a document under a predefined
category. More formally, if $d_i$ is a document of the
entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the
set of all the categories, then text classification assigns
one category $c_j$ to a document $d_i$. As in every
supervised machine learning task, an initial dataset is
needed. A document may be assigned to more than
one category (Ranking Classification), but this
paper considers only research on Hard Categorization
(assigning a single category to each document).
Moreover, approaches that
take into consideration other information besides the
pure text, such as hierarchical structure of the texts or
date of publication, are not presented. This is because
the main aim of this paper is to present techniques
that make the most of the text of each document and
perform best under this condition.”
4.2 A Review of Machine Learning Algorithms for
Text-Documents Classification “Text mining
studies have recently been gaining importance because
of the increasing number of electronic
documents available from a variety of sources. The
sources of unstructured and semi-structured
information include the World Wide Web,
governmental electronic repositories, news articles,
biological databases, chat rooms, digital libraries,
online forums, electronic mail and blog repositories.
Therefore, proper classification and knowledge
discovery from these resources is an important area for
research. Natural Language Processing (NLP), Data
Mining, and Machine Learning techniques work
together to automatically classify and discover patterns
from the electronic documents. The main goal of text
mining is to enable users to extract information from
textual resources; it deals with operations such as
retrieval, classification (supervised, unsupervised and
semi-supervised) and summarization. However, how
these documents can be properly annotated,
presented and classified remains an open question. Doing so
involves several challenges, such as proper annotation
of the documents,
appropriate document representation, dimensionality
reduction to handle algorithmic issues, and an
appropriate classifier function to obtain good
generalization and avoid over-fitting. Extraction,
Integration and classification of electronic documents
from different sources and knowledge discovery from
these documents are important for the research
communities. Today the web is the main source of
text documents, the amount of textual data available to
us is consistently increasing, and approximately 80%
of the information of an organization is stored in
unstructured textual format, in the form of reports,
email, views, news, etc. It is reported that
approximately 90% of the world’s data is held in
unstructured formats, so information-intensive
business processes demand that we move beyond
simple document retrieval to knowledge discovery.
The need to automatically retrieve useful
knowledge from the huge amount of textual data in
order to assist human analysis is fully apparent.
Market trend analysis based on the content of online news
articles, sentiments, and events is an emerging topic
for research in the data mining and text mining
community. For this purpose, state-of-the-art
approaches to text classification have been presented in
which three problems were discussed: document
representation, classifier construction and classifier
evaluation. So constructing a data structure that can
represent the documents, and constructing a classifier
that can be used to predict the class label of a
document with high accuracy, are the key points in
text classification.”
4.3 A Concept of Text Classification Using Machine
Learning “The modern information age produces vast
amounts of textual data, which can also be termed
unstructured data. The Internet and corporations
spread across the globe produce textual data at an
exponentially growing rate, and individuals need to
share it on a need basis. If the generated data is properly
organized and classified, then the needed data
can be retrieved easily with the least effort. Hence
automatic methods to organize and classify
documents become inevitable given such exponential
growth in documents, especially after the increased
usage of the Internet by individuals. Automatic
classification refers to assigning the documents to a set
of pre-defined classes based on the textual content of
the document. The classification can be flat or
hierarchical. When the class categories grow significantly
large in number, say into the thousands, searching with
such a large number of categories becomes very
difficult. This difficulty leads to hierarchical
classification, in which the thematic relationship
between the classes is also used in searching for
documents. Text Categorization (TC), also known as
Text Classification, is the task of automatically
classifying a set of text documents into different
categories from a predefined set. Consider the case of
sorting and organizing emails and files in folder
hierarchies so that topic identification
supports topic-specific operations. One such
attempt is the Yahoo web directory. If such
classification is done manually, it has several
disadvantages.
i. It needs domain experts in the areas of the predefined
categories.
ii. It is time-consuming and leads to frustration.
iii. It is error-prone and could be biased by the employee
(subject bias).
iv. Two human experts may disagree in their decisions.
v. The process needs to be repeated for new documents
(possibly of another domain).
Hence machine learning needs to be employed to
automate the classification. In machine
learning, two types of learning algorithms are generally
found in the literature: supervised learning algorithms
and unsupervised learning algorithms. This
paper is restricted to supervised learning.”
4.4 A Study on Document Classification using
Machine Learning Techniques “Due to the fast
growth of digital information available electronically,
text mining plays a key role in managing information
and knowledge, and therefore has become an active
research area. Text mining, also known as intelligent
text analysis, is the process of extracting interesting
and non-trivial information and knowledge from
unstructured text. Text mining is a young
interdisciplinary field, which draws on information
retrieval, data mining, machine learning, statistics and
computational linguistics. Typical text mining tasks
include information extraction, topic tracking,
document summarization, classification, clustering, and
question answering. Automated text classification is
the act of dividing a set of input documents into two or
more classes where each document can be said to
belong to one or multiple classes. Text classification
aims at assigning pre-defined classes to text
documents. An example would be to automatically
label each incoming news story with a topic like
“sports”, “politics”, or “art”. The classification task
starts with a training set $D = (d_1, \ldots, d_n)$ of documents
that are already labeled with a class $c \in C$ (e.g. sport,
politics). The task is then to determine a classification
model $f : D \to C$, $f(d) = c$, which is able to assign the
correct class to a new document $d$ of the domain. Text
classification is a challenging task, as it is difficult to
capture the meaning and abstract concepts of natural
language just from a few keywords. Also, the high
dimensionality of the feature space makes the
classification problem very difficult. Text
classification is commonly used to handle spam
emails, to classify large text collections into topical
categories, to manage knowledge, and to help
Internet search engines.”
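To make the notion of a classification model learned from a labeled
training set more concrete, here is a small Python sketch. It uses a
multinomial Naive Bayes classifier with add-one smoothing as one possible
choice of model; the tiny training set, the whitespace tokenization and the
function names are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    # labeled_docs: list of (text, class) pairs, i.e. the labeled training set.
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for term in text.lower().split():
            term_counts[label][term] += 1
            vocabulary.add(term)
    return class_counts, term_counts, vocabulary

def classify(text, model):
    # f(d) = c: return the class with the highest log-posterior score.
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for term in text.lower().split():
            if term not in vocabulary:
                continue  # ignore terms never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = term_counts[label][term] + 1
            score += math.log(count / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set with two classes.
training_set = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the budget vote", "politics"),
    ("the minister won the election", "politics"),
]
model = train_naive_bayes(training_set)
```

Calling `classify("the striker scored", model)` then assigns a class to a
document that was not part of the training set.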
4.5 Various Machine Learning Techniques for Text
Classification “In this survey, we examine and
compare the effectiveness of applying machine
learning techniques to the sentiment classification
problem. A challenging aspect of this problem that
seems to distinguish it from traditional topic-based
classification is that while topics are often identifiable
by keywords alone, sentiment can be expressed in a
more subtle manner.
Sentiment Analysis
Definition: Sentiment Analysis is a Natural Language
Processing and Information Extraction task that aims
to obtain the writer’s feelings expressed in positive or
negative comments, questions and requests, by
analyzing a large number of documents. Generally
speaking, sentiment analysis aims to determine the
attitude of a speaker or a writer with respect to some
topic or the overall tonality of a document.
What are the challenges?
Sentiment Analysis approaches aim to extract positive
and negative sentiment-bearing words from a text and
classify the text as positive, negative, or else objective
if no sentiment-bearing words can be found. In this
respect, it can be thought of as a text categorization
task. In text classification there are many classes
corresponding to different topics, whereas in Sentiment
Analysis we have only 3 broad classes, i.e. positive,
negative and neutral. Thus Sentiment
Analysis may seem easier than text classification, which is not
quite the case.
The general challenges can be summarized as:
1. Implicit Sentiment and Sarcasm
2. Domain Dependency
3. Thwarted Expectations
4. Pragmatics
5. World Knowledge
6. Subjectivity Detection
7. Entity Identification
8. Negation
Hence, it is not easy to do text categorization and
understand what the user intends to say (sentiments)
because of the above-mentioned problems.
The complexity of the problems varies from high to
low: some problems, like World Knowledge, are easily
solvable, and some, like Negation, are difficult. For
this purpose various algorithms like Naive Bayes,
SVM and Decision Tree are available at our disposal.
Steps for analyzing the sentiments in a sentence:
1. First, decide on the classifier algorithm
and obtain appropriate data for training.
2. Preprocess and label the data.
3. Prepare the data for training.
4. Train the classifier with the help of libraries such as
NLTK, libsvm, etc.
5. Make predictions by giving new test data to the
trained classifier.
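The steps above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: a simple perceptron stands in for
Naive Bayes, SVM or a decision tree, no external library such as NLTK or
libsvm is used, and the negation-marking heuristic, the four labeled
examples and all function names are hypothetical.

```python
from collections import defaultdict

# Step 1 (choice of classifier): a perceptron over bag-of-words features.
NEGATIONS = {"not", "no", "never"}

def preprocess(text):
    # Step 2: lowercase, tokenize, and mark the token after a negation word
    # (a common heuristic; an assumption, not part of the steps above).
    tokens, negate = [], False
    for token in text.lower().split():
        if token in NEGATIONS:
            negate = True
            continue
        tokens.append("NOT_" + token if negate else token)
        negate = False
    return tokens

def train_perceptron(examples, epochs=10):
    # Steps 3-4: turn each text into tokens and train the weights.
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:  # label is +1 (positive) or -1 (negative)
            tokens = preprocess(text)
            score = sum(weights[t] for t in tokens)
            if label * score <= 0:  # misclassified: nudge weights toward label
                for t in tokens:
                    weights[t] += label
    return weights

def predict(text, weights):
    # Step 5: classify new text with the trained weights.
    score = sum(weights.get(t, 0.0) for t in preprocess(text))
    return "positive" if score > 0 else "negative"

# Hypothetical labeled data.
labeled = [
    ("good movie", 1),
    ("not good movie", -1),
    ("bad plot", -1),
    ("not bad plot", 1),
]
weights = train_perceptron(labeled)
```

Marking negated tokens is one simple way to keep "good" and "not good" apart
in a bag-of-words model; it does not address the harder challenges listed
earlier, such as sarcasm.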
Text categorization is the task of assigning a Boolean
value to each pair $(d_j, c_i) \in D \times C$, where $D$ is a
domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set
of predefined categories. A value of $T$ assigned to $(d_j, c_i)$
indicates a decision to file $d_j$ under $c_i$, while a
value of $F$ indicates a decision not to file $d_j$ under $c_i$.”
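The Boolean formulation of text categorization can be illustrated with a
short Python sketch that fills in a decision for every (document, category)
pair. The keyword lists and the rule `keyword_rule` are hypothetical
stand-ins for a trained classifier, chosen only to keep the example
self-contained.

```python
def categorize(documents, categories, belongs):
    # Assign True/False to every (document, category) pair in D x C.
    return {
        (d, c): belongs(d, c)
        for d in documents
        for c in categories
    }

# Hypothetical keyword lists standing in for a trained classifier.
KEYWORDS = {
    "sports": {"match", "goal", "team"},
    "politics": {"election", "vote", "minister"},
}

def keyword_rule(document, category):
    # Decide T (True) when the document shares a keyword with the category.
    return bool(set(document.lower().split()) & KEYWORDS[category])

documents = ["the minister watched the match", "a late goal"]
decisions = categorize(documents, list(KEYWORDS), keyword_rule)
```

In this sketch the first document receives T for both categories, which
shows that the pairwise formulation permits multiple labels per document;
hard categorization would additionally require exactly one T per document.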
4.6 Types of Machine Learning Algorithms
“Machine learning algorithms are organized into a
taxonomy, based on the desired outcome of the
algorithm. Common algorithm types include:
• Supervised learning: where the algorithm generates
a function that maps inputs to desired outputs. One
standard formulation of the supervised learning task is
the classification problem: the learner is required to
learn a function which maps a vector into one of
several classes by looking at several input-output
examples of the function.
• Unsupervised learning: which models a set of
inputs; labeled examples are not available.
• Semi-supervised learning: which combines both
labeled and unlabeled examples to generate an
appropriate function or classifier.
• Reinforcement learning: Where the algorithm
learns a policy of how to act given an observation of
the world. Every action has some impact in the
environment, and the environment provides feedback
that guides the learning algorithm.
• Transduction: Similar to supervised learning, but
does not explicitly construct a function: instead, tries
to predict new outputs based on training inputs,
training outputs, and new inputs.
• Learning to learn: Where the algorithm learns its
own inductive bias based on previous experience.
The performance and computational analysis of
machine learning algorithms is a branch of theoretical
computer science known as computational learning
theory. Machine
learning is about designing algorithms that allow a
computer to learn. Learning does not necessarily involve
consciousness; rather, it is a matter of finding
statistical regularities or other patterns in the data.
Thus, many machine learning algorithms will barely
resemble how a human might approach a learning task.
However, learning algorithms can give insight into the
relative difficulty of learning in different
environments.”
5. CONCLUSION
This survey concludes that the text
classification problem is an active Artificial Intelligence
research topic, especially given the vast number of
documents available in the form of web pages and
other electronic texts such as emails, discussion forum
postings and other electronic documents. It has been
observed that, even for a specified
classification method, the classification performance of
classifiers based on different training text corpora
differs, and in some cases such differences are
quite substantial. This observation implies that a)
classifier performance depends to some degree on its
training corpus, and b) good, high-quality training
corpora may yield classifiers with good performance.
Unfortunately, little research has so far appeared in the
literature on how to exploit training text
corpora to improve classifier performance. Some
important conclusions have not yet been reached,
including:
• Which feature selection methods are both
computationally scalable and high performing across
classifiers and collections? Given the high variability
of text collections, do such methods even exist?
• Would combining uncorrelated but well-performing
methods yield a performance increase?
• Change the thinking from a word-frequency-based
vector space to a concept-based vector space. Study the
methodology of feature selection under concepts, to
see if this will help in text categorization.
• Make dimensionality reduction more efficient
over large corpora.
Moreover, there are two other open problems in text
mining: polysemy and synonymy. Polysemy refers to the
fact that a word can have multiple meanings.
Distinguishing between different meanings of a word
(called word sense disambiguation) is not easy, often
requiring the context in which the word appears.
Synonymy means that different words can have the
same or similar meaning.
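A short Python sketch can show why both phenomena are problematic for a
purely word-based representation. The cosine similarity over raw word counts
and the example phrases are assumptions chosen for illustration; they are
not taken from the surveyed work.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Cosine of the angle between two word-count vectors.
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Synonymy: same meaning, no shared words, so the similarity is zero.
synonymy = cosine_similarity("cheap automobile", "inexpensive car")
# Polysemy: "bank" is shared, so the texts look similar despite
# referring to different senses of the word.
polysemy = cosine_similarity("river bank", "bank loan")
```

The first pair is judged unrelated although the phrases mean the same thing,
while the second pair is judged related only because of an ambiguous word.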