A Rough Set-Based Hybrid Method to Text Categorization
Yongguang Bao¹, Satoshi Aoyama¹, Xiaoyong Du², Kazutaka Yamada¹, Naohiro Ishii¹
¹ Department of Intelligence and Computer Science, Nagoya Institute of Technology,
Gokiso-cho, Showa-ku, Nagoya, 466-8555, Japan
{baoyg, satoshi, kazzy, ishii}@egg.ics.nitech.ac.jp
² School of Information, Renmin University of China, 100872, Beijing, China
Duyong@mail.ruc.edu.cn
Abstract
In this paper we present a hybrid text categorization
method based on Rough Sets theory. A central problem in
good text classification for information filtering and
retrieval (IF/IR) is the high dimensionality of the data,
which may contain many unnecessary and irrelevant features.
To cope with this problem, we propose a hybrid
technique using Latent Semantic Indexing (LSI) and
Rough Sets theory (RS) to alleviate this situation. Given
corpora of documents and a training set of examples of
classified documents, the technique locates a minimal set
of co-ordinate keywords to distinguish between classes of
documents, reducing the dimensionality of the keyword
vectors. This simplifies the creation of knowledge-based
IF/IR systems, speeds up their operation, and allows easy
editing of the rule bases employed. Besides, we generate
several knowledge bases instead of one knowledge base
for the classification of new objects, hoping that the
combination of answers of the multiple knowledge bases
results in better performance. Multiple knowledge bases
can be formulated precisely and in a unified way within
the framework of RS. This paper describes the proposed
technique, discusses the integration of a keyword
acquisition algorithm, Latent Semantic Indexing (LSI),
with a Rough Set-based rule generation algorithm, and
provides experimental results. The test results show that the
hybrid method is better than the previous rough set-based
approach.
1. Introduction
As the volume of information available on the Internet
and corporate intranets continues to increase, there is a
growing need for tools helping people better find, filter,
and manage these resources. Text categorization is the task of
classifying text documents into categories or classes
automatically based on their content, and is therefore an
important component in many information management
tasks: real-time sorting of email or files into folder
hierarchies, topic identification to support topic-specific
processing operations, structured search and/or browsing,
or finding documents that match long-term standing interests
or more dynamic task-based interests.
Trained professionals are employed to categorize new items
in many contexts, which is very time-consuming and
costly, and thus limits applicability. Consequently there
is increasing interest in developing technologies for
automatic text categorization. A number of statistical text
learning algorithms and machine learning techniques have
been applied to text categorization. These text
classification algorithms have been used to automatically
catalog news articles [1,2] and web pages [3,4], learn the
reading interests of users [5,6], and sort electronic mail
[7,8].
However, a non-trivial obstacle to good text classification
is the high dimensionality of the data. In most IF/IR
techniques, each document is described by a vector of
extremely high dimensionality, typically one value per
word or pair of words in the document. The vector
ordinates are used as preconditions to a rule that decides
which class the document belongs to. Document vectors
commonly comprise tens of thousands of dimensions,
which renders the problem all but intractable for even the
most powerful computers.
Rough Sets Theory, introduced by Pawlak [10], is a non-
statistical methodology for data analysis. It can be used to
alleviate this situation [16]. A. Chouchoulas and Q. Shen
proposed a Rough Set-based approach (RSAR) to text
classification and tested it using email messages. But we
can see from the experimental results of RSAR that, as the
number of categories increases, the accuracy
drops to an unacceptable level. We think it is not
suitable to apply RSAR directly after keywords are acquired
by weight. Moreover, a single knowledge base which
utilizes a single minimal set of decision rules to classify
future examples may lead to mistakes, because a minimal
set of decision rules is more sensitive to noise, and a
small number of rules means that few alternatives exist
when classifying new objects. Recently, in order to
enhance the classification accuracy, the concept of
multiple knowledge bases emerged. The idea is to generate
several knowledge bases instead of one knowledge base for
the classification of unseen objects, hoping that the
combination of answers of the multiple knowledge bases
results in better performance. Multiple knowledge bases
can be formulated precisely and in a unified way within
the framework of RS.
This paper proposes a hybrid technique using Latent
Semantic Indexing (LSI) and Rough Set (RS) theory to
cope with this situation. Given corpora of documents and
a set of examples of classified documents, the technique
can quickly locate a minimal set of co-ordinate keywords
to classify new documents. As a result, it dramatically
reduces the dimensionality of the keyword space. The
resulting set of rule keywords is typically small enough
to be understood by a human. This simplifies the creation
of knowledge-based IF/IR systems, speeds up their
operation, and allows easy editing of the rule bases
employed. Moreover, we generate several knowledge
bases instead of one knowledge base to classify new
objects, for better performance.
The remainder of this paper is organized as follows:
Section 2 introduces text categorization. In Sections 3
and 4 we discuss Latent Semantic Indexing and Rough
Sets theory, respectively. Section 5 provides a description
of the proposed system. Section 6 describes experimental
results. A short conclusion is given in the final section.
2. Text Categorization
Text categorization aims to classify text documents into
categories or classes automatically based on their content.
While more and more textual information is available
online, effective retrieval becomes difficult without good
indexing and summarization of document content.
Document classification is one solution to this problem. A
growing number of statistical classification algorithms and
machine learning methods have been applied to text
classification in recent years. Like all classification tasks, it
may be tackled either by comparing new documents with
previously classified ones (distance-based techniques), or
by using rule-based approaches.
Perhaps the most commonly used document
representation is the so-called vector space model. In the
vector space model, a document is represented by a vector
of words. Usually, one has a collection of documents
which is represented by an M×N word-by-document matrix
A, where M is the number of words and N the number of
documents, and each entry represents the occurrences of a
word in a document, i.e.,

A = [w_ik]   (2.1)

where w_ik is the weight of word i in document k. Since
not every word appears in each document,
the matrix is usually sparse and M can be very large.
Hence, a major characteristic, or difficulty, of the text
categorization problem is the high dimensionality of the
feature space.
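To make the vector space representation concrete, the sketch below builds a small word-by-document count matrix A in plain Python with NumPy. It is a minimal illustration only: the whitespace tokenization and raw counts are stand-ins for the weighting scheme introduced later in Section 5.1.

```python
import numpy as np
from collections import Counter

def term_document_matrix(documents):
    """Build an MxN matrix A where A[i, k] counts occurrences of word i in document k."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(documents)))
    for k, toks in enumerate(tokenized):
        for word, count in Counter(toks).items():
            A[index[word], k] = count
    return A, vocab

docs = ["stocks fall as markets react",
        "the election campaign continues",
        "markets rally after election results"]
A, vocab = term_document_matrix(docs)
print(A.shape)   # (number of distinct words, number of documents)
```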
The two most commonly used text categorization
approaches are outlined below.
2.1. Distance-Based Text Categorization
Distance-based text categorization involves the
comparison of high-dimensionality keyword vectors. In
cases where a vector describes a group of documents, it
identifies the center of a cluster of documents. Documents
are classified by comparing their document vectors. To
classify an unknown document vector d, the k-nearest
neighbor (kNN) algorithm ranks the document's neighbors
among the training document vectors, and uses the class
labels of the k most similar neighbors to predict the class
of the input document. The classes of these neighbors are
weighted using the similarity of each neighbor to d, where
similarity may be measured by, for example, the cosine or
the Euclidean distance between the two document vectors.
kNN is a lazy, instance-based learning method that does not
have an off-line training phase. The main computation is
the on-line scoring of training documents given a test
document in order to find the k nearest neighbors. Using
an inverted-file index of the training documents, the time
complexity is O(L*N/M), where L is the number of elements
of the document vector that are greater than zero, M is the
length of the document vector, and N is the number of
training samples. Unfortunately, the dimensionality of the
document vector is typically extremely high (usually in the
tens of thousands), a detail that greatly slows down
classification tasks and makes storage of document
vectors expensive [2].
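The following is a minimal sketch of the distance-based approach just described: a kNN classifier over document vectors using cosine similarity, with similarity-weighted voting. It is illustrative code, not the implementation used in the paper; `train_vectors` and `train_labels` are assumed to come from a keyword acquisition step.

```python
import numpy as np
from collections import Counter

def knn_classify(d, train_vectors, train_labels, k=5):
    """Classify document vector d by the labels of its k nearest training vectors.

    train_vectors: (N, M) array of training document vectors.
    train_labels:  list of N class labels.
    Similarity is the cosine between d and each training vector.
    """
    norms = np.linalg.norm(train_vectors, axis=1) * np.linalg.norm(d)
    sims = train_vectors @ d / np.where(norms == 0, 1e-12, norms)
    nearest = np.argsort(sims)[::-1][:k]
    # Weight each neighbor's vote by its similarity to d.
    votes = Counter()
    for i in nearest:
        votes[train_labels[i]] += sims[i]
    return votes.most_common(1)[0][0]
```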
2.2. Rule-Based Text Categorization
Rule-Based categorization has been in use for a long time
and is an established method of classifying documents.
Common applications include the kill-file article filters used
by Usenet client software and van den Berg’s autonomous
E-mail filter, Procmail.
In this context, keyword vectors are considered as rule
preconditions; the class a document belongs to is used as
the rule decision:
k_1, k_2, ..., k_n ∈ U
r_i(d) = p(d, k_1) ∧ p(d, k_2) ∧ ... ∧ p(d, k_n) ⇒ d ∈ D

where the k_j are document keywords, U is the universal
keyword set, d is a document, D is a document class,
r_i(d) is rule i applied to d, and p(d, k_j) is a function
evaluating to true if d contains keyword k_j such that it
satisfies some metric (e.g. a minimum frequency or
weight). Not all keywords in the universal set need to be
checked for. This allows rule-based text classifiers to
exhibit a notation much terser than that of vector-based
classifiers, where a vector must always have the same
dimensionality as the keyword space.
In most cases, the human user writes the rules. Most typical
rule bases simply test for the presence of specific
keywords in the document. For example, a Usenet client
may use a kill-file to filter out newsgroup articles by a
specific person by looking for the person's name in the
article's 'From' field. Such rule-based approaches are
inherently simple to understand, which accounts for their
popularity among end-users. Unfortunately, complex
needs often result in very complex rule bases, ones that
users have difficulty maintaining by hand.
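As an illustration of the rule form above, the sketch below evaluates a conjunctive keyword rule against a document's weighted keywords. The threshold-based predicate p(d, k) and the dictionary representation of a document are assumptions made for the example, not the paper's exact data structures.

```python
def p(doc_weights, keyword, min_weight=0.1):
    """True if the document contains `keyword` with weight at least `min_weight`."""
    return doc_weights.get(keyword, 0.0) >= min_weight

def matches_rule(doc_weights, rule_keywords, min_weight=0.1):
    """A rule fires when every keyword precondition holds (conjunction)."""
    return all(p(doc_weights, k, min_weight) for k in rule_keywords)

# Example: a simple rule for a 'sports' class.
rule = ["nba", "playoffs"]
doc = {"nba": 0.8, "playoffs": 0.4, "finals": 0.2}
print(matches_rule(doc, rule))  # True -> assign the rule's class to the document
```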
3. Latent Semantic Indexing
A central problem in statistical text classification is the
high dimensionality of the feature space. There exists one
dimension for each unique word found in the collection of
documents, typically hundreds of thousands. Standard
classification approaches cannot deal with such a large
feature set, since the processing is extremely costly in
computational terms, and the results become unreliable due
to the lack of sufficient training data. Hence, there is a
need for a reduction of the original feature set, which is
commonly known as dimensionality reduction in the
pattern recognition literature.
Latent Semantic Indexing (LSI) is a statistical, corpus-
based text comparison mechanism that was originally
developed for the task of information retrieval, but in
recent years has produced remarkably human-like abilities
in a variety of language tasks. LSI has taken the Test of
English as a Foreign Language and performed as well as
non-native English speakers who were successful college
applicants. It has shown an ability to learn words at a rate
similar to humans. It has even graded papers as reliably as
human graders.
LSI is based on the assumption that there is some
underlying or latent structure in the pattern of word usage
across documents, and that statistical techniques can be
used to estimate this structure. It creates a high-
dimensional, spatial representation of a corpus and
allows texts to be compared geometrically. LSI uses
singular value decomposition (SVD), a technique closely
related to eigenvector decomposition and factor analysis,
to compress a large amount of term and document co-
occurrence information into a smaller space. This
compression is said to capture the semantic information
that is latent in the corpus itself. In what follows we
describe the mathematics underlying this particular model
of the latent structure: the singular value decomposition.
Assume that we have an M×N word-by-document
matrix A. The singular value decomposition of A is given
by:

A = U S V^T

where U (M×R) and V (N×R) have orthonormal columns
and S (R×R) is the diagonal matrix of singular values;
R ≤ min(M, N) is the rank of A. If the singular values of A are
ordered by size, the K largest may be kept and the
remaining smaller ones set to zero. The product of the
resulting matrices is a matrix A_k that is an approximation to A
with rank K:

A_k = U_k S_k V_k^T

where S_k (K×K) is obtained by deleting the zero rows and
columns of S, and U_k (M×K) and V_k (N×K) are obtained by
deleting the corresponding rows and columns of U and V.
A_k in one sense captures most of the underlying
structure in A, yet at the same time removes the noise or
variability in word usage. Since the number of dimensions
K is much smaller than the number of unique words M,
minor differences in terminology are ignored. Words
that occur in similar documents may be near each other in
the K-dimensional space even if they never co-occur in the
same document. Moreover, documents that do not share
any words with each other may turn out to be similar.
The cosine between two rows in A_k, which equals the cosine
between the corresponding rows in U_k S_k, reflects the extent to which two
words have a similar pattern of occurrence across the set
of documents. If the cosine is 1, the two words have
exactly the same pattern of occurrence, while a cosine of 0
means that the pattern of occurrence is very different for
the two words. Using this similarity we can construct new
keywords as combinations or transformations of the original
keywords.
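The rank-K approximation and the row-cosine comparison can be sketched with NumPy's dense SVD as follows. This is a toy illustration (a real corpus would call for a sparse, truncated SVD); the random matrix and the choice K = 5 are arbitrary.

```python
import numpy as np

def lsi_word_cosines(A, K):
    """Return the rank-K approximation A_k and the cosine similarities between its word rows."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vtk = U[:, :K], np.diag(s[:K]), Vt[:K, :]
    Ak = Uk @ Sk @ Vtk              # rank-K approximation of A
    rows = Uk @ Sk                  # word coordinates; row cosines equal those of Ak
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    norms[norms == 0] = 1e-12
    unit = rows / norms
    return Ak, unit @ unit.T        # (M, M) matrix of word-word cosines

A = np.random.rand(50, 20)          # toy 50-word by 20-document matrix
Ak, cos = lsi_word_cosines(A, K=5)
```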
This compression step is somewhat similar to the common
feature of neural network systems where a large number of
inputs is connected to a fairly small number of hidden-
layer nodes. If there are too many nodes, a network will
"memorize" the training set, miss the generalities in the
data, and consequently perform poorly on a test set. The
input for LSI is a large amount of text (on the order of
magnitude of a book). The corpus is turned into a co-
occurrence matrix of terms by "documents", where for our
purposes a document is a paragraph. SVD computes an
approximation of this data structure of an arbitrary rank K.
Common values of K are between 200 and 500, and are
thus considerably smaller than the usual number of terms
or documents in a corpus, which are on the order of 10000.
It has been claimed that this compression step captures
regularities in the patterns of co-occurrence across terms
and across documents, and furthermore, that these
regularities are related to the semantic structure of the
terms and documents.
4. Information Systems and Rough Sets
4.1. Information Systems
An information system is a 4-tuple S = <U, Q, V, f>,
where U is the closed universe, a finite nonempty set of N
objects {x_1, x_2, ..., x_N}; Q is a finite nonempty set of n
attributes {q_1, q_2, ..., q_n}; V = ∪_{q∈Q} V_q, where V_q is the
domain (value set) of the attribute q; and f: U×Q → V is the total
decision function, called the information function, such that
f(x, q) ∈ V_q for every q ∈ Q, x ∈ U.
Any subset P of Q determines a binary relation on U,
called an indiscernibility relation and denoted
IND(P), defined as follows: x IND(P) y if and only if f(x, a)
= f(y, a) for every a ∈ P. Obviously IND(P) is an equivalence
relation. The family of all equivalence classes of IND(P) will
be denoted by U/IND(P) or simply U/P; the equivalence
class of IND(P) containing x will be denoted by P(x) or [x]_P.
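The partition U/P induced by IND(P) can be computed directly from the definition. The sketch below assumes a simple layout in which each object is a tuple of attribute values and an attribute is identified by its position.

```python
from collections import defaultdict

def partition(objects, P):
    """Group objects into equivalence classes of IND(P).

    objects: list of attribute-value tuples (one per object in U).
    P: indices of the attributes to consider.
    """
    classes = defaultdict(list)
    for i, x in enumerate(objects):
        key = tuple(x[q] for q in P)   # objects with equal values on P are indiscernible
        classes[key].append(i)
    return list(classes.values())

U = [(1, 0, 'a'), (1, 0, 'b'), (0, 1, 'a')]
print(partition(U, P=[0, 1]))          # [[0, 1], [2]]: objects 0 and 1 are equivalent under IND({q0, q1})
```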
4.2. Reduct
The reduct is a fundamental concept of rough sets. A reduct
is the essential part of an information system that can
discern all objects discernible by the original information
system.
Let q ∈ Q. A feature q is dispensable in S if IND(Q − {q}) =
IND(Q); otherwise feature q is indispensable in S.
If q is an indispensable feature, deleting it from S will
cause S to be inconsistent. Otherwise, q can be deleted
from S.
A set of features R ⊆ Q is called a reduct of Q if
IND(R) = IND(Q) and all features of R are indispensable in
S. We denote it by RED(Q) or RED(S).
An attribute reduct is a minimal subset of the condition
attributes Q with respect to the decision attributes D; none of
the attributes of such a minimal subset can be eliminated
without affecting the essential information. These minimal
subsets can discern decision classes with the same
discriminating power as the entire set of condition attributes.
The set of all indispensable attributes of Q is called the CORE
of Q and denoted by CORE(Q):

CORE(Q) = ∩ RED(Q)

Skowron [13] proposed a good method to compute the CORE
using the discernibility matrix. The CORE attributes are those
entries in the discernibility matrix that have only one
attribute.
4.3. The Discernibility Matrix
In this section we introduce a basic notion, the
discernibility matrix, which helps us understand several
properties and construct an efficient algorithm to compute
the reducts.
By M(S) we denote an n×n matrix (c_ij), called the
discernibility matrix of S, such that

c_ij = {q ∈ Q : f(x_i, q) ≠ f(x_j, q)}   for i, j = 1, ..., n.

Since M(S) is symmetric and c_ii = ∅ for i = 1, ..., n, we
represent M(S) only by the elements in the lower triangle of
M(S), i.e. the c_ij with 1 ≤ j < i ≤ n.
From the definition of the discernibility matrix M(S) we
have the following:
Proposition 4.1 [13]. CORE(S) = {q ∈ Q : c_ij = {q} for some i, j}.
Proposition 4.2 [13]. Let B ⊆ Q. If for some i, j we have
B ∩ c_ij ≠ ∅, then x_i DIS(B) x_j. In particular, if ∅ ≠ B ⊆ c_ij for some i, j,
then x_i DIS(B) x_j, where x_i DIS(B) x_j denotes that x_i and x_j can
be discerned by the attribute subset B.
Proposition 4.3 [13]. Let ∅ ≠ B ⊆ Q. The following conditions
are equivalent:
(1) For all i, j such that c_ij ≠ ∅ and 1 ≤ j < i ≤ n, we have B ∩ c_ij ≠ ∅;
(2) IND(B) = IND(Q), i.e. B is a superset of a reduct of S.
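The discernibility matrix and the CORE extraction of Proposition 4.1 follow directly from the definitions; the sketch below reuses the tuple-of-values object layout assumed earlier.

```python
def discernibility_matrix(objects, attributes):
    """c_ij = set of attributes on which objects i and j differ (lower triangle only)."""
    n = len(objects)
    M = {}
    for i in range(n):
        for j in range(i):
            M[(i, j)] = {q for q in attributes if objects[i][q] != objects[j][q]}
    return M

def core(M):
    """CORE = attributes appearing as singleton entries of the discernibility matrix."""
    return {next(iter(c)) for c in M.values() if len(c) == 1}

U = [(1, 0, 0), (1, 1, 0), (0, 1, 0)]
M = discernibility_matrix(U, attributes=range(3))
print(core(M))   # attributes that no reduct can omit
```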
Fig. 1. Data flow through the system (ending in a minimal rule base)
5. The Proposed System
This paper proposes a hybrid method for text
classification. The approach comprises three main stages,
as shown in Fig. 1. The keyword acquisition stage reads
corpora of documents, locates candidate keywords,
estimates their importance, and builds an intermediate
dataset of high dimensionality. The keyword grouping stage
constructs new keywords as combinations or
transformations of the high-dimensionality keywords. The
attribute reduction generation stage examines the dataset,
removes redundancy, generates single or multiple feature
reducts, and leaves a dataset or rule base containing a
drastically reduced number of preconditions per rule.
5.1. Keyword Acquisition
This sub-system takes a set of documents as input. First,
words are isolated and pre-filtered to avoid very short or
long keywords, or keywords that are not words (e.g. long
numbers or random sequences of characters). Every word
or pair of consecutive words in the text is considered a
candidate keyword. Then, the following weighting
function is used for word indexing to generate a set of
keywords for each document:
ω_ik = −log(N_k / N) · f_ik · w_f

where ω_ik is the weight of keyword k in document i; N
is the total number of documents and N_k is the number of
documents containing keyword k; f_ik is the frequency of
keyword k in document i; and w_f denotes the current
field's importance to the classification, which depends on
the application and user preferences.
Finally, before the weighted keyword is added to the set
of keywords, it passes through two filters: one is a low-
pass filter removing words so uncommon that they are
definitely not good keywords; the other is a high-pass
filter that removes far too common words such as auxiliary
verbs, articles, et cetera. This gives the added advantage
of language independence to the keyword acquisition
algorithm: most similar methods rely on English thesauri
and lists of common English words to perform the same
function. Finally, all weights are normalized before the
keyword sets are output.
It must be emphasized that any keyword acquisition
approach may be substituted for the one described above,
as long as it outputs weighted keywords.
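As a concrete rendering of this weighting stage, the sketch below computes ω = −log(N_k/N) · f · w_f for the candidate keywords of one document, applies simple frequency-based low/high-pass filters, and normalizes the result. The filter thresholds and the field-importance factor w_f are placeholders that would depend on the application; this is an illustration, not the paper's implementation.

```python
import math
from collections import Counter

def keyword_weights(doc_tokens, all_docs_tokens, w_f=1.0,
                    min_doc_freq=2, max_doc_ratio=0.5):
    """Weight the keywords of one document: w = -log(Nk/N) * f_ik * w_f, then normalize.

    doc_tokens:      tokens of the document being weighted.
    all_docs_tokens: list of token lists, one per document in the corpus.
    """
    N = len(all_docs_tokens)
    doc_freq = Counter()
    for toks in all_docs_tokens:
        doc_freq.update(set(toks))
    weights = {}
    for word, f_ik in Counter(doc_tokens).items():
        Nk = doc_freq[word]
        # Filters: drop very rare words and very common words.
        if Nk < min_doc_freq or Nk / N > max_doc_ratio:
            continue
        weights[word] = -math.log(Nk / N) * f_ik * w_f
    total = sum(weights.values()) or 1.0
    return {w: v / total for w, v in weights.items()}
```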
5.2. Group Keyword
Keyword grouping is the process of constructing new
keywords as combinations or transformations of the
original keywords using Latent Semantic Indexing (LSI).
As described in Section 3, we use singular value
decomposition (SVD), a technique closely related to
eigenvector decomposition and factor analysis, to
compress a large amount of term and document co-
occurrence information into a smaller space. Assume
that we have an M×N word-by-document matrix A. After
decomposition, we use the cosine between two rows in A_k
as the similarity measure to group keywords.
We set a similarity threshold s to group the original keywords k_i
(i = 1, 2, ..., N) into new keywords K_j (j = 1, 2, ..., N' < N) in the
following way:

K_j = ∪ { k_i : r(k_i, k_j) ≥ s }

where r(k_i, k_j) is the cosine between rows i and j in A_k,
and s is the similarity threshold, which depends on the
application and user preferences.
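One simple way to realize this grouping is a greedy pass over the word-word cosine matrix (such as the one produced by the LSI sketch in Section 3): each ungrouped keyword seeds a new keyword K_j that absorbs every original keyword whose cosine with the seed is at least s. This is one possible interpretation of the grouping step, not necessarily the authors' exact procedure.

```python
def group_keywords(cosine, s=0.9):
    """Greedily group keyword indices whose pairwise cosine in A_k is >= s.

    cosine: (M, M) array of word-word cosines from the rank-K approximation.
    Returns a list of index groups; each group becomes one new keyword K_j.
    """
    M = cosine.shape[0]
    assigned = [False] * M
    groups = []
    for i in range(M):
        if assigned[i]:
            continue
        group = [j for j in range(M) if not assigned[j] and cosine[i, j] >= s]
        for j in group:
            assigned[j] = True
        groups.append(group)
    return groups

# Using the `cos` matrix from the earlier LSI example:
# groups = group_keywords(cos, s=0.9)
```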
5.3. Attribute Reductions Generation
A reduct is a minimal subset of attributes which has the
same discernibility power as the entire set of condition attributes.
Finding all reducts of an information system is a
combinatorial NP-hard computational problem [13]. A
reduct uses a minimum number of features and represents
a minimal and complete rule set to classify new objects.
To classify unseen objects, it is optimal that different
reducts use different features as much as possible, that the
union of the attributes in the reducts includes all
indispensable attributes in the database, and that the number of
reducts used for classification is minimal. Here, we
propose a greedy algorithm to compute a set of reducts
which satisfies this requirement partially, because
our algorithm cannot guarantee that the number of reducts is
minimal. Our algorithm starts with the CORE features;
then, through backtracking, multiple reducts are
constructed using the discernibility matrix. A reduct is
computed by using forward stepwise selection and
backward stepwise elimination based on the significance
values of the features. The algorithm terminates when the
features in the union of the reducts include all the
indispensable attributes in the database or the number of
reducts is equal to the number determined by the user. Since
Rough Sets is suited to nominal datasets, we quantise the
normalized weight space into 11 values calculated by
floor(10w).
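A one-line rendering of this quantization, assuming weights normalized to [0, 1] and reading the original's garbled expression as floor(10w):

```python
def quantize(w):
    """Map a normalized weight in [0, 1] to one of 11 nominal values 0..10."""
    return min(int(w * 10), 10)   # floor(10*w), clamped so w == 1.0 maps to 10
```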
Algorithm: Generate Reducts
Let COMP(B, ADL) denote the comparison procedure. The
result of COMP(B, ADL) is 1 if for each element c of ADL
we have B ∩ c ≠ ∅; otherwise it is 0.

Step 1. Create the discernibility matrix DM = [c_ij];
create an absorbed discernibility list by deleting empty and
non-minimal elements of DM:
ADL = { c_ij ∈ DM : c_ij ≠ ∅ and there is no c_lm ∈ DM with c_lm ⊂ c_ij };
Set CDL = ADL, C = ∪{ ad : ad ∈ ADL }, i = 1;

Step 2.
While card(∪ REDU_i) < card(C) do begin
  REDU = ∪{ c ∈ CDL : card(c) = 1 };
  ADL = CDL − REDU;
  Sort the attributes of C − ∪ REDU_i based on their frequency in ADL;
  /* forward selection */
  While ADL ≠ ∅ do begin
    Compute the frequency value of each attribute q ∈ C − REDU;
    Select the attribute q with the maximum frequency value
      and add it to REDU;
    Delete the elements adl of ADL with q ∈ adl from ADL;
  End
  /* backward elimination */
  N = card(REDU);
  For j = 0 to N − 1 do begin
    Remove a_j from REDU;
    If COMP(REDU, CDL) = 0 Then add a_j back to REDU;
  End
  REDU_i = REDU; i = i + 1;
End
The dataset is simply post-processed to remove duplicate
rules and output in the form of a rule base.
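The sketch below is a compact Python interpretation of the reduct-generation algorithm above, not the authors' code. In particular, the backward-elimination re-check via COMP and the preference for attributes unused by earlier reducts follow the reconstruction of the pseudocode given here.

```python
from collections import Counter

def comp(B, lists):
    """COMP(B, ADL): 1 if B intersects every element of the list, else 0."""
    return int(all(B & c for c in lists))

def generate_reducts(dm_entries, max_reducts=5):
    """Greedy multi-reduct generation from the non-empty discernibility-matrix entries.

    dm_entries: iterable of attribute sets c_ij (the discernibility matrix).
    Returns a list of reducts (sets of attributes).
    """
    # Absorbed discernibility list: drop empty and non-minimal entries.
    entries = [frozenset(c) for c in dm_entries if c]
    adl0 = [c for c in entries if not any(o < c for o in entries)]
    C = set().union(*adl0) if adl0 else set()
    reducts, covered = [], set()

    while covered < C and len(reducts) < max_reducts:
        # Start from the singleton entries (the CORE attributes).
        redu = set().union(*(c for c in adl0 if len(c) == 1)) if adl0 else set()
        adl = [c for c in adl0 if not (c & redu)]
        # Forward selection: greedily add the most frequent remaining attribute,
        # preferring attributes not used by earlier reducts.
        while adl:
            freq = Counter(q for c in adl for q in c if q not in redu)
            q = max(freq, key=lambda a: (a not in covered, freq[a]))
            redu.add(q)
            adl = [c for c in adl if q not in c]
        # Backward elimination: drop attributes whose removal keeps full coverage.
        for q in list(redu):
            if comp(redu - {q}, adl0):
                redu.discard(q)
        reducts.append(redu)
        covered |= redu
    return reducts
```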
6. Experimental Results
To evaluate the efficiency of our hybrid algorithm, we
compared it with the RSAR algorithm [16] (which is a special case of our
algorithm using one reduct without grouping keywords
by LSI). To test the effect of keyword
grouping and multiple knowledge bases, we ran our
algorithm using a knowledge base of one
reduct without keyword grouping, a knowledge base
of one reduct with keyword grouping, and multiple
knowledge bases of 5 reducts with keyword grouping,
respectively.
Six different corpora of on-line news from YAHOO!
(http://www.yahoo.com) were used. They are:
Business: Stock Market
Politics: the Presidential election
Science: Astronomy and Space News
World: Middle East Peace Process
Health: HIV
Sports: NBA Playoffs
Fig. 2 shows the average classification accuracy of our
hybrid system. The "RSAR" curve shows the accuracy in
the case of using a knowledge base of one reduct without
keyword grouping, i.e. the RSAR algorithm of [16]. The
"SINGLE REDUCT" curve shows the accuracy in the case
of using a knowledge base of one reduct with keyword
grouping. From the experimental results, we can see that as
the number of categories increases, the accuracy of RSAR
drops to an unacceptable level. Using LSI for
keyword grouping improves the classification
accuracy. The "5 REDUCTS" curve shows the accuracy in
the case of using multiple knowledge bases of 5 reducts
with keyword grouping. As can be seen, using multiple
reducts instead of a single reduct makes a further
improvement in text classification. What may be
concluded from the figure is that the hybrid
method developed in this paper is an efficient and robust
text classifier.
[Figure: classification accuracy (%) versus number of categories (2 to 6) for the RSAR, SINGLE REDUCT, and 5 REDUCTS configurations.]
Fig. 2. Comparison of accuracy of the hybrid system with RSAR
7. Conclusions
With the dramatic rise in the use of the Internet, there has
been an explosion in the volume of online documents and
electronic mail. Text classification, the assignment of free-
text documents to one or more predefined categories
based on their content, is an important component in
many information management tasks; some examples are
real-time sorting of email into folder hierarchies and topic
identification to support topic-specific processing
operations.
In this paper, a hybrid method for text classification has
been presented, using Latent Semantic Indexing (LSI)
and Rough Set (RS) theory to classify new documents.
Given corpora of documents and a set of examples of
classified documents, the technique can quickly locate a
minimal set of co-ordinate keywords to classify new
documents. The resulting set of rule keywords is
typically small enough to be understood by a human, and
we obtain high classification accuracy. The
experimental results show that grouping keywords by
Latent Semantic Indexing (LSI) and using several
knowledge bases instead of one knowledge base yield a
substantial improvement over the RSAR algorithm, especially as
the number of categories increases.
The system is still in its early stages of research. To
improve the accuracy and decrease the dimensionality of
the rules, further investigation into rule induction after
attribute reduction and into computing all reducts is in
progress. Comparison with other text classification
methods on the benchmark dataset Reuters-21578 is also
future work.
Acknowledgement
This work was in part supported by the Hori Information
Science Promotion Foundation.
References
[1] D.D. Lewis and W.A. Gale, "A Sequential Algorithm for Training Text Classifiers", SIGIR94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 3-12.
[2] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", ECML98: 10th European Conference on Machine Learning, 1998, pp. 170-178.
[3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery, "Learning to Extract Symbolic Knowledge from the World Wide Web", Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998, pp. 509-516.
[4] J. Shavlik and T. Eliassi-Rad, "Intelligent Agents for Web-Based Tasks: An Advice-Taking Approach", AAAI-98 Workshop on Learning for Text Categorization, Tech. Rep. WS-98-05, AAAI Press. http://www.cs.wisc.edu/~shavlik/mlrg/publications.html
[5] M.J. Pazzani, J. Muramatsu and D. Billsus, "Syskill & Webert: Identifying Interesting Web Sites", Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 54-59.
[6] K. Lang, "Newsweeder: Learning to Filter Netnews", Machine Learning: Proceedings of the Twelfth International Conference (ICML95), 1995, pp. 331-339.
[7] D.D. Lewis and K.A. Knowles, "Threading Electronic Mail: A Preliminary Study", Information Processing and Management, 33(2), 1997, pp. 209-217.
[8] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz, "A Bayesian Approach to Filtering Junk E-mail", AAAI-98 Workshop on Learning for Text Categorization, Tech. Rep. WS-98-05, AAAI Press. http://robotics.stanford.edu/users/sahami/papers.html
[9] Y. Yang, "An Evaluation of Statistical Approaches to Text Categorization", Journal of Information Retrieval, 1, 1999, pp. 69-90.
[10] Z. Pawlak, "Rough Sets", International Journal of Computer and Information Science, 11, 1982, pp. 341-356.
[11] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht, 1991.
[12] X. Hu, N. Cercone and W. Ziarko, "Generation of Multiple Knowledge from Databases Based on Rough Sets Theory", in T.Y. Lin (ed.), Rough Sets and Data Mining: Analysis of Imprecise Data, Kluwer Academic Publishers, Dordrecht, 1997, pp. 109-121.
[13] A. Skowron and C. Rauszer, "The Discernibility Matrices and Functions in Information Systems", in R. Slowinski (ed.), Intelligent Decision Support - Handbook of Applications and Advances of Rough Sets Theory, Kluwer Academic Publishers, Dordrecht, 1992, pp. 331-362.
[14] S.D. Bay, "Combining Nearest Neighbor Classifiers Through Multiple Feature Subsets", Intelligent Data Analysis, 3(3), 1999, pp. 191-209.
[15] C.J. van Rijsbergen, Information Retrieval, Butterworths, United Kingdom, 1990.
[16] A. Chouchoulas and Q. Shen, "A Rough Set-Based Approach to Text Classification", in 7th International Workshop, RSFDGrC'99, Yamaguchi, Japan, 1999, pp. 118-129.
[17] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, No. 41, 1990, pp. 391-407.
[18] T.K. Landauer, P.W. Foltz and D. Laham, "Introduction to Latent Semantic Analysis", Discourse Processes, No. 25, 1998, pp. 259-284.
[19] P.W. Foltz, "Using Latent Semantic Indexing for Information Filtering", in R.B. Allen (ed.), Proceedings of the Conference on Office Information Systems, Cambridge, MA, 1990, pp. 40-47.
[20] Y. Bao, X. Du, M. Deng and N. Ishii, "An Efficient Incremental Algorithm for Computing All Reducts", in N. Ishii (ed.), Proceedings of the ACIS 2nd International Conference on Software Engineering, Artificial Intelligence, Networking & Parallel/Distributed Computing (SNPD2001), Japan, 2001, pp. 956-961.
[21] K. Aas and L. Eikvil, "Text Categorisation: A Survey", Rapport Nr. 941, June 1999, ISBN 82-539-0425-8.

More Related Content

What's hot

Experimental Result Analysis of Text Categorization using Clustering and Clas...
Experimental Result Analysis of Text Categorization using Clustering and Clas...Experimental Result Analysis of Text Categorization using Clustering and Clas...
Experimental Result Analysis of Text Categorization using Clustering and Clas...
ijtsrd
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
Editor IJARCET
 
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
IJECEIAES
 

What's hot (19)

Experimental Result Analysis of Text Categorization using Clustering and Clas...
Experimental Result Analysis of Text Categorization using Clustering and Clas...Experimental Result Analysis of Text Categorization using Clustering and Clas...
Experimental Result Analysis of Text Categorization using Clustering and Clas...
 
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
 
Prediction of Answer Keywords using Char-RNN
Prediction of Answer Keywords using Char-RNNPrediction of Answer Keywords using Char-RNN
Prediction of Answer Keywords using Char-RNN
 
A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING
A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MININGA CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING
A CLUSTERING TECHNIQUE FOR EMAIL CONTENT MINING
 
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVALONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
ONTOLOGICAL TREE GENERATION FOR ENHANCED INFORMATION RETRIEVAL
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
 
An Improved Similarity Matching based Clustering Framework for Short and Sent...
An Improved Similarity Matching based Clustering Framework for Short and Sent...An Improved Similarity Matching based Clustering Framework for Short and Sent...
An Improved Similarity Matching based Clustering Framework for Short and Sent...
 
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
 
Semantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detectionSemantics-based clustering approach for similar research area detection
Semantics-based clustering approach for similar research area detection
 
8 efficient multi-document summary generation using neural network
8 efficient multi-document summary generation using neural network8 efficient multi-document summary generation using neural network
8 efficient multi-document summary generation using neural network
 
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
 
A systematic study of text mining techniques
A systematic study of text mining techniquesA systematic study of text mining techniques
A systematic study of text mining techniques
 
Information extraction using discourse
Information extraction using discourseInformation extraction using discourse
Information extraction using discourse
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
 
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ...
 
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
 
SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONS
SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONSSEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONS
SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONS
 
Text Classification using Support Vector Machine
Text Classification using Support Vector MachineText Classification using Support Vector Machine
Text Classification using Support Vector Machine
 

Viewers also liked

Optimization of Biodiesel
Optimization of BiodieselOptimization of Biodiesel
Optimization of Biodiesel
danieljvv
 
Pediatric truma 60
Pediatric truma 60Pediatric truma 60
Pediatric truma 60
gallevy16
 
Pediatric truma 60 (2)
Pediatric truma 60 (2)Pediatric truma 60 (2)
Pediatric truma 60 (2)
gallevy16
 
Pediatric truma 60 (1)
Pediatric truma 60 (1)Pediatric truma 60 (1)
Pediatric truma 60 (1)
gallevy16
 
Nervous systemanatomy 60
Nervous systemanatomy 60Nervous systemanatomy 60
Nervous systemanatomy 60
gallevy16
 

Viewers also liked (14)

Aparatos mediáticos
Aparatos mediáticosAparatos mediáticos
Aparatos mediáticos
 
聯合勸募愛心匯集之努力
聯合勸募愛心匯集之努力聯合勸募愛心匯集之努力
聯合勸募愛心匯集之努力
 
Gost r 8.858 2013
Gost r 8.858 2013Gost r 8.858 2013
Gost r 8.858 2013
 
F1 video for red bull ring budapest
F1 video for red bull ring budapestF1 video for red bull ring budapest
F1 video for red bull ring budapest
 
Optimization of Biodiesel
Optimization of BiodieselOptimization of Biodiesel
Optimization of Biodiesel
 
мадияров арыстан+автосервис+решение
мадияров арыстан+автосервис+решениемадияров арыстан+автосервис+решение
мадияров арыстан+автосервис+решение
 
Phtls 60
Phtls 60Phtls 60
Phtls 60
 
Final
FinalFinal
Final
 
Diseno de proyectos
Diseno de proyectosDiseno de proyectos
Diseno de proyectos
 
Pediatric truma 60
Pediatric truma 60Pediatric truma 60
Pediatric truma 60
 
Pediatric truma 60 (2)
Pediatric truma 60 (2)Pediatric truma 60 (2)
Pediatric truma 60 (2)
 
Pediatric truma 60 (1)
Pediatric truma 60 (1)Pediatric truma 60 (1)
Pediatric truma 60 (1)
 
Nervous systemanatomy 60
Nervous systemanatomy 60Nervous systemanatomy 60
Nervous systemanatomy 60
 
Analisis del video: Sociedad del Conocimiento
Analisis del video: Sociedad del ConocimientoAnalisis del video: Sociedad del Conocimiento
Analisis del video: Sociedad del Conocimiento
 

Similar to A rough set based hybrid method to text categorization

Paper id 25201435
Paper id 25201435Paper id 25201435
Paper id 25201435
IJRAT
 
An efficient-classification-model-for-unstructured-text-document
An efficient-classification-model-for-unstructured-text-documentAn efficient-classification-model-for-unstructured-text-document
An efficient-classification-model-for-unstructured-text-document
SaleihGero
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
Editor IJARCET
 

Similar to A rough set based hybrid method to text categorization (20)

Text Segmentation for Online Subjective Examination using Machine Learning
Text Segmentation for Online Subjective Examination using Machine   LearningText Segmentation for Online Subjective Examination using Machine   Learning
Text Segmentation for Online Subjective Examination using Machine Learning
 
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION
 
Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using Knn
 
Paper id 25201435
Paper id 25201435Paper id 25201435
Paper id 25201435
 
Bl24409420
Bl24409420Bl24409420
Bl24409420
 
Ay3313861388
Ay3313861388Ay3313861388
Ay3313861388
 
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Review of Various Text Categorization Methods
Review of Various Text Categorization MethodsReview of Various Text Categorization Methods
Review of Various Text Categorization Methods
 
C017321319
C017321319C017321319
C017321319
 
An in-depth review on News Classification through NLP
An in-depth review on News Classification through NLPAn in-depth review on News Classification through NLP
An in-depth review on News Classification through NLP
 
A Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed ClusteringA Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed Clustering
 
An efficient-classification-model-for-unstructured-text-document
An efficient-classification-model-for-unstructured-text-documentAn efficient-classification-model-for-unstructured-text-document
An efficient-classification-model-for-unstructured-text-document
 
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
Scaling Down Dimensions and Feature Extraction in Document Repository Classif...
 
Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
 
An Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text ClassificationAn Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text Classification
 
Feature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documentsFeature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documents
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach
 

A rough set based hybrid method to text categorization

  • 1. A Rough Set-Based Hybrid Method to Text Categorization Yongguang Bao', SatoshiAoyama', Xiaoyong Du', Kazutaka Yamada', Naoho Ishii' 1 Department of Intelligence and Computer Science, Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, 466-8555,Japan fbaoyg,satoshi,kazzy,ishii)@egg.ics.nitech.ac.j p School of Information, Renmin Universityof China, 100872,Beijing, China Duyong@mail.ruc.edu.cn 2 Abstract In this paper we present a hybrid text categorization method based on Rough Sets theory. A centralproblem in good text classification for information filtering and retrieval (IF/IR) is the high dimensionality of the data. It may contain many unnecessary and irrelevant features. To cope with this problem, we propose a hybrid technique using Latent Semantic Indexing (LSI) and Rough Sets theory (RS) to alleviate this situation. Given corpora of documents and a training set of examples of classijied documents, the technique locates a minimal set of co-ordinate keywords to distinguish between classes of documents, reducing the dimensionality of the keyword vectors. This simplifies the creation of knowledge-based IF/IR systems, speeds up their operation, and allows easy editing of the rule bases employed. Besides, we generate several knowledge base instead of one knowledge base for the classification of new object, hoping that the combination of answers of the multiple knowledge bases result in better performance. Multiple knowledge bases can be formulated precisely and in a unified way within theframework of RS. This paper describes the proposed technique, discusses the integration of a keyword acquisition algorithm, Latent Semantic hdexing (LSr) with Rough Set-based rule generate algorithm. and provides experimental results. The test results &ow the hybrid method is better than the previous rough set- based approach. 1. Introduction As the volume of information available on the Internet and corporate intranets continues to increase, there is a growing need for tools helping people better find, filter, and manage these resources. Text Categorization is to classify text documents into categories or classes automatically based on their content, and therefore is an important component in many information management tasks: real-time sorting of email or files into folder hierarchies, topic identification to support topic-specific processing operations, structured search andfor browsing, or finding documents that match log-term standing interest or more dynamic task-based interests. Trained professional are employed to category new items in many contexts, which is very time-consuming and costly, and thus limits its applicability. Consequently there is an increasing interest in developing technologies for automatic text categorization. A number of statistical text learning algorithms and machine learning technique have been applied to text categorization. These text classification algorithms have been used to automatically catalog news articles [1,2] and web pages [3,4], learn the reading interests of users [5,6],and sort electronic mail V',VI. However, a non-trivial obstacle in good text classification is the high dimensionality of the data. In most IF/IR techniques, each document is described by a vector of extremely high dimensionality-typically one value per word or pair of words in the document. The vector ordinates are used as preconditions to a rule that decides which class the document belongs to. 
Document vector commonly comprise tens of thousands of dimensions, which renders the all problem but intractable for even the most powerful computers. Rough Sets Theory introduced by Pawlak [lo] is a non- statistical methodology for data analysis. It can be used to alleviate this situation [16]. A. Chouchoulas and Q. Shen proposed a Rough Set-based approach (RSAR) to text classification and test it using Email messages. But we can see from the experimental results of RSAR, with the increasing the number of categories, the accuracy becomes to be an unacceptable level. We think it is not suited to apply RSAR directly after keywords are acquired by weight. Moreover, a single knowledge base which utilizes a single minimal set of decision rules to classify future examples may lead to mistake, because the minimal set of decision rules are more sensitive to noise and a small number of rules means that a few alternatives exit when classifying new objects. Recently, in order to 2540-7695-1393-WO2$17.000 2002IEEE
  • 2. enhance the classification accuracy, the concept of multiple knowledge base emerged. The idea is to generate several knowledge base instead of one knowledge base for the classification of unseen objects, hoping that the combination of answers of multiple knowledge base result in better performance. Multiple knowledge bases can be formulated precisely and in a unified way within the framework of RS. This paper proposes a hybrid technique using Latent Semantic Indexing (LSI) and Rough Set (RS) theory to cope with this situation. Given corpora of documents and a set of examples of classified documents, the technique can quickly locate a minimal set of co-ordinate keywords to classify new documents. As a result, it dramatically reduces the dimensionality of the keyword space. The resulting set of keywords of rule is typically small enough to be understood by a human. This simplifies the creation of knowledge-based IF/IR systems, speeds up their operation, and allows easy editing of the rule bases employed. Moreover, we generate several knowledge bases instead of one knowledge base to classify new objects for the better performance. The remainder of this paper is organized as follows: Section 2 introduces the text categorization. In section 3 and 4 we discuss Latent Semantic Indexing and Rough sets Theory respectively. Section 5 provides a description of the proposed system. Section 6 describes experimental results. A short conclusion is given in the final section. 2. Text Categorization Text categorization aims to dassify text documents into categories or classes automatically based on their content. While more and more textual information is available online, effective retrieval becomes difficult without good indexing and summarization of document content. Document classification is one solution to this problem. A growing number of statistical classification algorithms and machine learning methods have been applied to text classification in recent years. Like all classification tasks, it may be tackled either by comparing new documents with previous classified ones (distance-based techniques), or by using rule-based approaches. Perhaps, the most commonly used document representation is so called vector space model. In the vector space model, a document is represented by vector of words. Usually, one has a collection of documents which is represented by a MxN word-by-document matrix A, where M is the number of words, and N the number of documents, each entry represents the occurrences of a word in a document, i.e., A 3 W 3 (2.1) Where wk is the weight of word i in document k. Since every word does not normally appear in each document, the matrix is usually sparse and M can be very large. Hence, a major characteristic, or difficulty of text categorization problem is the high dimensionality of the feature space. The two most commonly used text categorization approaches are outlined below. 2.1. Distance-BasedText Categorization Distance-Based text categorization involves the comparison of high-dimensionality keyword vectors. In case where the vector describes groups of documents, it identifies the center of a cluster of documents. Documents are classified by comparing their document vector. To classify an unknown document vector d, the knearest neighbor (kNN) algorithm ranks the document’s neighbors among the training document vectors, and use the class labels of the k most similar neighbors to predict the class of the input document. 
The classes of these neighbors are weighted using the similarity of each neighbor to d, where similarity may be measured by, for example, the cosine or the Euclidean distance between the two document vectors. kNN is lazy learning instance-based method that does not have an off-line training phase. The main computation is the on-line scoring of training documents given a test document in order to find the k nearest neighbors. Using an inverted-file indexing of training documents, the time complexity is O(L*N/M) where L is the number of elements of the document vector that greater than zero, M is the length of the document vector, and N is the number of training sample. Unfortunately, the dimensionality of the document vector is typically extremely high (usually in the tens of thousands); a detail that greatly slows down classification tasks and makes storage of document vectors expensive [2]. 2.2. Rule-Based Text Categorization Rule-Based categorization has been in use for a long time and is an established method of classifying documents. Common applications include the kill-file article filters used by Usenet client software and van den Berg’s autonomous E-mail filter,Procmail. In this context, keyword vectors are considered as rule preconditions; the class a document belongs to is used as the rule decision: ki,kz,..A E U r,(d)=p(d,k,) A p(d,k2)A...A p(d,k,) 3 d E D Where k, are document keywords, U is the universal keyword set, d is one document, D is a document class, r,(d) is rule i applied to d and p(d,k,) is a function evaluating to true if d contains keyword I( such that it 255
  • 3. satisfies some metric ( e.g. a minimum frequency or weight). Not all keywords in the universal set need to be checked for. This allows rule-based text classifiers to exhibit a notation much terser than that of vector-based classifiers, where a vector must always have the same dimensionality as keyword space. In most cases, the human user writes rules. Most typical rule bases simply test for the presence of specific keywords in the document. Fox example, a Usenet client may use a kill-file to filter out newsgroup article by some specific person by looking for the person’s name in the article’s ‘From’ field. Such rule-based approaches are inherently simple to understand, which accounts for their popularity among end-users. Unfortunately, complex needs often result in very complex rule bases, ones that user have difficulty maintaining by hand. 3. Latent Semantic Indexing A central problem in statistical text classification is the high dimensionality of the feature space. There exits one dimension for each unique word found in the collection of documents, typically hundreds of thousands. Standard classification approaches cannot deal with such a large feature set, since the processing is extremely costly in computation terms, and the results become unreliable due to the lack of sufficient training data. Hence, there is a need for a reduction of the original kature set, which is commonly known as dimensionality reduction in the pattern recognition literature. Latent Semantic Indexing &SI) is a statistical, corpus- based text comparison mechanism that was originally developed for the task of information retrieval, but in recent years has produced remarkably human-like abilities in a variety of language tasks. LSI has taken the test of English as a foreign language and performed as well as non-native English speakers who were successful college applicants. It has shown an ability to learn words at a rate similar to humans. It has even graded papers as reliably as human graders. LSI is based on the assumption that there is some underlying or latent structure in the patter of word usage across documents, and that statistical techniques can be used to estimate this structure. It created a high- dimensional, spatial representation of a corpus and allowed texts to be compared geometrically. LSI uses singular value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis, to compress a large amount of term and document co- occurrence information into a smaller space. This compression is said to capture the semantic information that is latent in the corpus itself. In what follows we describe the mathematics underlying the particular model of the latent structure: the singular value decomposition. Assuming that we have an MxN word-by-document matrix A, the singular value decomposition of A is given by : A = U S VT Where U(MxR) and V(RxN) have orthonormal columns and S(RxR) is the diagonal matrix of singular values. M i n ( M , N ) is rank of A. If the singular values of A are ordered by size, the K largest may be kept and the remaining smaller ones set to zero. The product of the resulting matrixes is matrix Akthat is an approximation to A with rank K. Where Sk(KXK)is obtained by deleting the zero rows and columns of, and U,,(MxK) and V O X K ) are obtained by deleting the corresponding rows and columns of U and V. Ak in one sense captures most of the underlying structure in A, yet at the same time removes the noise or variability in word usage. 
Since the number of dimensions K is much smaller than the number of unique words M, minor differences in terminology will be ignored. Words that occur in similar documents may be near each other in the IC-dimensional space even if they never co-occur in the same document. Moreover, documents that do not share any words with each other may turn out to be similar. The cosine between two rows in Ab equally the cosine between two rows in QSk,reflects the extent to which two words have a similar pattern of occurrence across the set of documents. If the cosine is 1, the two words have exactly the same pattern of occurrence, while a cosine 0 means that the pattern of occurrence is very different for the two words. By this similarity we can construct new keywords as combinations or formations of the original keywords. This compression step is somewhat similar to the common feature of neural network systems where a large number of inputs are connected to a fairly small number of hidden layer nodes. If there are too many nodes, anetwork will “memorize” the training set, miss the generalities in the data, and consequently perform poorly on a test set. The input for LSI is a large amount of text (on the order of magnitude of a book). The corpus is turned into a co- occurrence matrix of terms by “documents“, where for our purposes, a document is a paragraph. SVD computes an approximation ofthis data structure of an arbitrary rank K. Common values of K are between 200 and 500, and are thus considerably smaller than the usual number of terms or documents in a corpus, which are on the order of 10000. It has been claimed that this compression step captures regularities in the patterns of co-occurrence across terms and across documents, and furthermore, that these regularities are related to the semantic structure of the terms and documents. Ak= Gskvz 256
4. Information Systems and Rough Sets

4.1. Information systems

An information system is composed of a 4-tuple as follows:

S = <U, Q, V, f>

where U is the closed universe, a finite nonempty set of N objects {x_1, x_2, ..., x_N}; Q is a finite nonempty set of n attributes {q_1, q_2, ..., q_n}; V = ∪_{q∈Q} V_q, where V_q is the domain (set of values) of the attribute q; and f: U x Q -> V is the total decision function, called the information function, such that f(x, q) ∈ V_q for every q ∈ Q and x ∈ U.

Any subset P of Q determines a binary relation on U, called an indiscernibility relation and denoted by IND(P), defined as follows: x IND(P) y if and only if f(x, a) = f(y, a) for every a ∈ P. Obviously IND(P) is an equivalence relation. The family of all equivalence classes of IND(P) will be denoted by U/IND(P) or simply U/P; the equivalence class of IND(P) containing x will be denoted by P(x) or [x]_P.

4.2. Reduct

Reduct is a fundamental concept of rough sets. A reduct is the essential part of an information system that can discern all objects discernible by the original information system.

Let q ∈ Q. A feature q is dispensable in S if IND(Q - {q}) = IND(Q); otherwise feature q is indispensable in S. If q is an indispensable feature, deleting it from S will cause S to be inconsistent; otherwise, q can be deleted from S. A set of features R ⊆ Q will be called a reduct of Q if IND(R) = IND(Q) and all features of R are indispensable in S. We denote it as RED(Q) or RED(S). An attribute reduct is a minimal subset of the condition attributes Q with respect to the decision attributes D; none of the attributes of such a minimal subset can be eliminated without affecting the essential information. These minimal subsets can discern decision classes with the same discriminating power as the entire set of condition attributes.

The set of all indispensable attributes of Q is called the CORE of Q and is denoted by CORE(Q):

CORE(Q) = ∩ RED(Q)

Skowron [13] proposed a good method to compute the CORE using the discernibility matrix: the CORE attributes are those entries in the discernibility matrix that contain only one attribute.

4.3. The Discernibility Matrix

In this section we introduce a basic notion, the discernibility matrix, that will help us understand several properties and construct an efficient algorithm to compute the reducts. By M(S) we denote an N x N matrix (c_ij), called the discernibility matrix of S, such that

c_ij = {q ∈ Q : f(x_i, q) ≠ f(x_j, q)} for i, j = 1, ..., N.

Since M(S) is symmetric and c_ii = ∅ for i = 1, ..., N, we represent M(S) only by the elements in its lower triangle, i.e. the c_ij with 1 ≤ j < i ≤ N.

From the definition of the discernibility matrix M(S) we have the following:

Proposition 4.1 [13]. CORE(S) = {q ∈ Q : c_ij = {q} for some i, j}.

Proposition 4.2 [13]. Let B ⊆ Q. If for some i, j we have B ∩ c_ij ≠ ∅, then x_i DIS(B) x_j. In particular, if ∅ ≠ B ⊆ c_ij for some i, j, then x_i DIS(B) x_j, where x_i DIS(B) x_j denotes that x_i and x_j can be discerned by the attribute subset B.

Proposition 4.3 [13]. Let ∅ ≠ B ⊆ Q. The following conditions are equivalent:
(1) for all i, j such that c_ij ≠ ∅ and 1 ≤ j < i ≤ N, we have B ∩ c_ij ≠ ∅;
(2) IND(B) = IND(Q), i.e. B is a superset of a reduct of S.
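Propositions 4.1-4.3 translate directly into code. The following small Python sketch (an illustration with made-up attribute names and helper functions, not the authors' implementation) builds the discernibility entries of a toy information system and reads off the CORE as the singleton entries, as in Proposition 4.1.

    from itertools import combinations

    def discernibility_entries(objects, attributes):
        # objects: list of dicts mapping attribute name -> value.
        # The entry c_ij is the set of attributes on which x_i and x_j differ.
        return {(i, j): {a for a in attributes if objects[i][a] != objects[j][a]}
                for i, j in combinations(range(len(objects)), 2)}

    def core(entries):
        # Proposition 4.1: CORE = attributes that appear as singleton entries.
        return {next(iter(c)) for c in entries.values() if len(c) == 1}

    # Toy information system with three objects and three attributes.
    table = [
        {"q1": 0, "q2": 1, "q3": 0},
        {"q1": 0, "q2": 0, "q3": 1},
        {"q1": 1, "q2": 1, "q3": 0},
    ]
    entries = discernibility_entries(table, ["q1", "q2", "q3"])
    print(core(entries))  # {'q1'}: only q1 discerns x_1 from x_3 on its own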
Fig. 1. Data flow through the system

5. The Proposed System

This paper proposes a hybrid method for text classification. The approach comprises three main stages, as shown in Fig. 1. The keyword acquisition stage reads corpora of documents, locates candidate keywords, estimates their importance, and builds an intermediate dataset of high dimensionality. The keyword grouping stage constructs new keywords as combinations or transformations of the high-dimensionality keywords. The attribute reduction generation stage examines the dataset, removes redundancy, generates single or multiple feature reductions, and leaves a dataset or rule base containing a drastically reduced number of preconditions per rule.

5.1. Keyword Acquisition

This sub-system uses a set of documents as input. Firstly, words are isolated and pre-filtered to avoid very short or long keywords, or keywords that are not words (e.g. long numbers or random sequences of characters). Every word or pair of consecutive words in the text is considered a candidate keyword. Then, the following weighting function is used for word indexing to generate a set of keywords for each document:

w_ik = -log(N_k / N) f_ik ω_f

where w_ik is the weight of keyword k in document i; N is the total number of documents and N_k is the number of documents containing keyword k; f_ik is the frequency of keyword k in document i; and ω_f denotes the current field's importance to the classification, which depends on the application and user preferences.

Before a weighted keyword is added to the set of keywords, it passes through two filters: one is a low-pass filter removing words so uncommon that they are definitely not good keywords; the other is a high-pass filter that removes far too common words such as auxiliary verbs, articles, et cetera. This gives the added advantage of language-independence to the keyword acquisition algorithm: most similar methods rely on English thesauri and lists of common English words to perform the same function. Finally, all weights are normalized before the keyword sets are output. It must be emphasized that any keyword acquisition approach may be substituted for the one described above, as long as it outputs weighted keywords.

5.2. Keyword Grouping

Keyword grouping is the process of constructing new keywords as combinations or transformations of the original keywords using Latent Semantic Indexing (LSI). As described in Section 3, we use singular value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis, to compress a large amount of term and document co-occurrence information into a smaller space. Assuming that we have an M x N word-by-document matrix A, after the decomposition we use the cosine between two rows of A_k as the similarity measure for grouping keywords. We set a similarity threshold s and group the original keywords k_i (i = 1, 2, ..., M) into new keywords K_j (j = 1, 2, ..., M' < M) in the following way:

K_j = Σ_{i : r(k_i, k_j) ≥ s} k_i

where r(k_i, k_l) is the cosine between rows i and l of A_k, and s is the similarity threshold, which depends on the application and user preferences.
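The two steps above can be sketched in Python as follows. This is only an illustration: keyword_weight implements the weighting formula of Section 5.1, while group_keywords applies the cosine threshold of Section 5.2 with a simple greedy single-pass merge; the greedy strategy, the helper names and the default threshold s = 0.9 are our assumptions rather than details given in the paper.

    import math

    def keyword_weight(f_ik, N, N_k, field_importance=1.0):
        # w_ik = -log(N_k / N) * f_ik * omega_f  (Section 5.1).
        return -math.log(N_k / N) * f_ik * field_importance

    def group_keywords(similarity, n_keywords, s=0.9):
        # similarity(i, j): cosine between rows i and j of A_k (Section 5.2).
        # Each keyword not yet assigned starts a new group and absorbs every
        # later keyword whose similarity to it is at least the threshold s.
        groups, assigned = [], set()
        for i in range(n_keywords):
            if i in assigned:
                continue
            group = [i]
            assigned.add(i)
            for j in range(i + 1, n_keywords):
                if j not in assigned and similarity(i, j) >= s:
                    group.append(j)
                    assigned.add(j)
            groups.append(group)
        return groups

For instance, group_keywords(lambda i, j: word_cosine(Uk, sk, i, j), n_keywords=Uk.shape[0]) would group the rows of the toy matrix from Section 3 under this assumed scheme.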
5.3. Attribute Reduction Generation

A reduct is a minimal subset of attributes which has the same discernibility power as the entire set of condition attributes. Finding all reducts of an information system is a combinatorial NP-hard problem [13]. A reduct uses a minimum number of features and represents a minimal and complete rule set for classifying new objects. To classify unseen objects, it is desirable that different reducts use different features as much as possible, that the union of the attributes of the reducts includes all indispensable attributes in the database, and that the number of reducts used for classification is minimal. Here we propose a greedy algorithm to compute a set of reducts that satisfies this requirement only partially, because the algorithm cannot guarantee that the number of reducts is minimal. The algorithm starts with the CORE features; then, through backtracking, multiple reducts are constructed using the discernibility matrix. Each reduct is computed by forward stepwise selection and backward stepwise elimination based on the significance values of the features. The algorithm terminates when the features in the union of the reducts include all the indispensable attributes in the database, or when the number of reducts reaches the number specified by the user.

Since Rough Sets is suited to nominal datasets, we quantise the normalized weight space into 11 values calculated as floor(10W).
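As a minimal illustration of this quantisation (our sketch, not the authors' code), floor(10W) maps a normalized weight W in [0, 1] to one of the 11 nominal values 0, 1, ..., 10 expected by the rough-set stage.

    import math

    def quantise(w):
        # w is assumed to be a normalized keyword weight in [0, 1].
        return int(math.floor(10 * w))

    print([quantise(w) for w in (0.0, 0.07, 0.35, 0.99, 1.0)])  # [0, 0, 3, 9, 10]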
Algorithm: Generate Reducts

Let COMP(B, ADL) denote the comparison procedure: the result of COMP(B, ADL) is 1 if B ∩ c ≠ ∅ for every element c of ADL, and 0 otherwise.

Step 1. Create the discernibility matrix DM = [c_ij]. Create an absorbed discernibility list by deleting the empty and non-minimal elements of DM:
ADL = {c_ij ∈ DM : c_ij ≠ ∅ and there is no c_lm ∈ DM with c_lm ⊂ c_ij}.
Set CDL = ADL, C = ∪{adl ∈ ADL}, i = 1.

Step 2. While card(∪ REDU_i) < card(C) do
begin
  REDU = ∪{c ∈ CDL : card(c) = 1};
  ADL = CDL − REDU;
  Sort the set C − ∪ REDU_i by frequency value in ADL;
  /* forward selection */
  While ADL ≠ ∅ do
  begin
    Compute the frequency value of each attribute q ∈ C − ∪ REDU_i;
    Select the attribute q with the maximum frequency value and add it to REDU;
    Delete from ADL the elements adl with q ∈ adl;
  end
  /* backward elimination */
  N = card(REDU);
  For j = 0 to N − 1 do
  begin
    Remove a_j from REDU;
    If COMP(REDU, CDL) = 0 then add a_j back to REDU;
  end
  REDU_i = REDU; i = i + 1;
end

The dataset is simply post-processed to remove duplicate rules and output in the form of a rule base.
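The pseudocode above can be realised roughly as follows. This Python sketch is our own hedged reading of the algorithm: the frequency-based forward selection and the backward elimination follow the steps above, but the tie-breaking rule that prefers attributes not yet used by earlier reducts, and the default cap of five reducts, are assumptions for illustration.

    def covers(B, lists):
        # COMP(B, ADL): true iff B intersects every element of the discernibility list.
        return all(B & c for c in lists)

    def generate_reducts(adl, max_reducts=5):
        # adl: absorbed discernibility list (non-empty attribute sets, none a superset of another).
        all_attrs = set().union(*adl)
        core = {next(iter(c)) for c in adl if len(c) == 1}
        reducts, used = [], set()
        while used != all_attrs and len(reducts) < max_reducts:
            redu = set(core)
            remaining = [c for c in adl if not (redu & c)]
            # Forward selection: repeatedly add the attribute with the highest frequency
            # among the uncovered entries, preferring attributes unused by earlier reducts.
            while remaining:
                best = min(all_attrs - redu,
                           key=lambda a: (-sum(a in c for c in remaining), a in used))
                redu.add(best)
                remaining = [c for c in remaining if best not in c]
            # Backward elimination: drop any non-CORE attribute that is not needed for coverage.
            for a in sorted(redu - core):
                if covers(redu - {a}, adl):
                    redu.discard(a)
            reducts.append(redu)
            used |= redu
        return reducts

    # Example: two complementary reducts are produced from this small list.
    print(generate_reducts([{"q1"}, {"q2", "q3"}]))

With the example list above, the sketch first builds a reduct containing q1 and one of q2 or q3, then a second reduct using the other, so that together the reducts cover all the attributes appearing in the discernibility list.

6. Experimental Results

To evaluate the efficiency of our hybrid algorithm, we compared it with the RSAR algorithm [16], which is the special case of our algorithm that uses one reduct without grouping keywords by LSI. To test the effect of keyword grouping and of multiple knowledge bases, we ran our algorithm in three configurations: a knowledge base of one reduct without keyword grouping, a knowledge base of one reduct with keyword grouping, and multiple knowledge bases of 5 reducts with keyword grouping.

Six different corpora of on-line news from YAHOO! (http://www.yahoo.com) were used:
Business: Stock Market
Politics: the Presidential Election
Science: Astronomy and Space News
World: Middle East Peace Process
Health: HIV
Sports: NBA Playoffs

Fig. 2 shows the average classification accuracy of our hybrid system. The "RSAR" curve shows the accuracy when using a knowledge base of one reduct without keyword grouping, i.e. the RSAR algorithm of [16]. The "SINGLE REDUCT" curve shows the accuracy when using a knowledge base of one reduct with keyword grouping. From the experimental results we can see that, as the number of categories increases, the accuracy of RSAR falls to an unacceptable level, and that grouping keywords with LSI improves the classification accuracy. The "5 REDUCTS" curve shows the accuracy when using multiple knowledge bases of 5 reducts with keyword grouping. As can be seen, using multiple reducts instead of a single reduct yields a further improvement in text classification. What may be concluded from the figure is that the hybrid method developed in this paper is an efficient and robust text classifier.

Fig. 2. Comparison of accuracy of the hybrid system with RSAR (curves: RSAR, SINGLE REDUCT, 5 REDUCTS; x-axis: number of categories, 2 to 6; y-axis: classification accuracy, %)

7. Conclusions

With the dramatic rise in the use of the Internet, there has been an explosion in the volume of online documents and electronic mail. Text classification, the assignment of free-text documents to one or more predefined categories based on their content, is an important component in many information management tasks; examples are real-time sorting of email into folder hierarchies and topic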
identification to support topic-specific processing operations.

In this paper, a hybrid method for text classification has been presented, using Latent Semantic Indexing (LSI) and Rough Set (RS) theory to classify new documents. Given corpora of documents and a set of examples of classified documents, the technique can quickly locate a minimal set of co-ordinate keywords to classify new documents. The resulting set of keywords per rule is typically small enough to be understood by a human, and the classification accuracy is high. The experimental results show that grouping keywords by Latent Semantic Indexing (LSI) and using several knowledge bases instead of one give a marked improvement over the RSAR algorithm, especially as the number of categories increases.

The system is still in its early stages of research. To improve the accuracy and decrease the dimensionality of the rules, further investigation into rule induction after attribute reduction and into computing all reducts is in progress. Comparison with other text classification methods on the benchmark dataset Reuters-21578 is also part of our future work.

Acknowledgement

This work was in part supported by the Hori Information Science Promotion Foundation.

References

[1] D.D. Lewis and W.A. Gale, "A Sequential Algorithm for Training Text Classifiers", SIGIR 94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 3-12.
[2] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", ECML 98: 10th European Conference on Machine Learning, 1998, pp. 170-178.
[3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery, "Learning to Extract Symbolic Knowledge from the World Wide Web", Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998, pp. 509-516.
[4] J. Shavlik and T. Eliassi-Rad, "Intelligent Agents for Web-Based Tasks: An Advice-Taking Approach", AAAI-98 Workshop on Learning for Text Categorization, Tech. Rep. WS-98-05, AAAI Press. http://www.cs.wisc.edu/~shavlik/mlrg/publications.html
[5] M.J. Pazzani, J. Muramatsu and D. Billsus, "Syskill & Webert: Identifying Interesting Web Sites", Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 54-59.
[6] K. Lang, "Newsweeder: Learning to Filter Netnews", Machine Learning: Proceedings of the Twelfth International Conference (ICML 95), 1995, pp. 331-339.
[7] D.D. Lewis and K.A. Knowles, "Threading Electronic Mail: A Preliminary Study", Information Processing and Management, 33(2), 1997, pp. 209-217.
[8] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz, "A Bayesian Approach to Filtering Junk E-mail", AAAI-98 Workshop on Learning for Text Categorization, Tech. Rep. WS-98-05, AAAI Press. http://robotics.stanford.edu/users/sahami/papers.html
[9] Y. Yang, "An Evaluation of Statistical Approaches to Text Categorization", Journal of Information Retrieval, 1, 1999, pp. 69-90.
[10] Z. Pawlak, "Rough Sets", International Journal of Computer and Information Science, 11, 1982, pp. 341-356.
[11] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht, 1991.
[12] X. Hu, N. Cercone and W. Ziarko, "Generation of Multiple Knowledge from Databases Based on Rough Sets Theory", in T.Y. Lin (ed.), Rough Sets and Data Mining: Analysis of Imprecise Data, Kluwer Academic Publishers, Dordrecht, 1997, pp. 109-121.
[13] A. Skowron and C. Rauszer, "The Discernibility Matrices and Functions in Information Systems", in R. Slowinski (ed.), Intelligent Decision Support - Handbook of Applications and Advances of the Rough Sets Theory, Kluwer Academic Publishers, Dordrecht, 1992, pp. 331-362.
[14] S.D. Bay, "Combining Nearest Neighbor Classifiers Through Multiple Feature Subsets", Intelligent Data Analysis, 3(3), 1999, pp. 191-209.
[15] C.J. van Rijsbergen, Information Retrieval, Butterworths, United Kingdom, 1990.
[16] A. Chouchoulas and Q. Shen, "A Rough Set-Based Approach to Text Classification", in 7th International Workshop (RSFDGrC'99), Yamaguchi, Japan, 1999, pp. 118-129.
[17] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer and R. Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, 41, 1990, pp. 391-407.
[18] T.K. Landauer, P.W. Foltz and D. Laham, "Introduction to Latent Semantic Analysis", Discourse Processes, 25, 1998, pp. 259-284.
[19] P.W. Foltz, "Using Latent Semantic Indexing for Information Filtering", in R.B. Allen (ed.), Proceedings of the Conference on Office Information Systems, Cambridge, MA, 1990, pp. 40-47.
[20] Y. Bao, X. Du, M. Deng and N. Ishii, "An Efficient Incremental Algorithm for Computing All Reducts", in N. Ishii (ed.), Proceedings of the ACIS 2nd International
Conference on Software Engineering, Artificial Intelligence, Networking & Parallel/Distributed Computing (SNPD 2001), Japan, 2001, pp. 956-961.
[21] K. Aas and L. Eikvil, "Text Categorisation: A Survey", Rapport Nr. 941, June 1999, ISBN 82-539-0425-8.