Integrated Intelligent Research (IIR) International Journal of Data Mining Techniques and Applications
Volume: 03 Issue: 01 June 2014, Page No. 1- 4
ISSN: 2278-2419
Scaling Down Dimensions and Feature Extraction
in Document Repository Classification
Asha Kurian1, M.S. Josephine2, V. Jeyabalaraja3
1 Research Scholar, Department of Computer Applications, Dr. M.G.R. Educational and Research Institute University, Chennai
2 Professor, Department of Computer Applications, Dr. M.G.R. Educational and Research Institute University, Chennai
3 Professor, Department of Computer Science Engineering, Velammal Engineering College, Chennai
E-mail: ashk47@yahoo.com, josejbr@yahoo.com, jeyabalaraja@gmail.com
Abstract-In this study a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). These are gauged against an unsupervised technique, fuzzy feature clustering using fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against the unsupervised fuzzy technique while reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than algorithms such as LSI and PCA. The results show that clustering of features improves the accuracy of document classification.
Keywords: Feature extraction, text classification,
categorization, dimensionality reduction
I. INTRODUCTION
Text Categorization or text classification attempts to sort documents in a repository into different class labels. A classifier learns from a training set of documents that are already classified and labeled. A general model is devised that correctly labels further incoming documents. A repository typically consists of thousands of documents, and retrieving a selection becomes a laborious task unless the documents are indexed or categorized in some particular order. Document categorization is modeled along the lines of Information Retrieval [1] and Natural Language Processing [5], where a user query elicits the documents of maximal significance in relation to the query. The sorting is done by grouping the document terms and phrases and identifying some association or correlation between them. Establishing relations among words is complicated by polysemy and the presence of synonyms. Every document contains thousands of unique terms, resulting in a high-dimensional feature space.
To reduce information retrieval time, the dimensionality of the
document collection can be reduced by selecting only those
terms which best describe the document. Dimensionality reduction techniques try to find the contextual meaning of words, disregarding those that are inconsequential. Feature
selection algorithms reduce the feature space by selecting
appropriate vectors, whereas feature extraction algorithms
transform the vectors into a sub-space of scaled down
dimension. Feature selection can be followed by supervised or
unsupervised learning. Classification becomes supervised
when a collection of labeled documents helps in the learning
using a train-test set.
II. REVIEW OF LITERATURE
1. Text Mining and Organization in Large Corpus, December 2005 – Two dimensionality reduction methods, Singular Value Decomposition (SVD) and Random Projection (RP), are compared, along with three selected clustering algorithms: K-means, Non-negative Matrix Factorization (NMF) and Frequent Itemset. These methods and algorithms are compared based on their performance and time consumption.
2. Improving Methods for Single-label Text Categorization, July 2007 – An evaluation of feature reduction algorithms is done. In this paper a comprehensive comparison of the performance of a number of text categorization methods on two different data sets is presented. In particular, the Vector and Latent Semantic Analysis (LSA) methods, a classifier based on Support Vector Machines (SVM), and the k-Nearest Neighbor variations of the Vector and LSA models are evaluated.
3. A Fuzzy Based Approach to Text Mining and Document Clustering, October 2013 – This paper shows how fuzzy logic can be applied in text mining to perform document clustering. The fuzzy c-means (FCM) algorithm was used to group the documents into clusters.
4. Feature Clustering Algorithms for Text Classification – Novel Techniques and Reviews, August 2010 – In this paper, some of the important techniques for text classification are reviewed and novel parameters using a fuzzy set approach are discussed in detail.
5. Classification of Text Using Fuzzy Based Incremental Feature Clustering Algorithm, International Journal of Advanced Research in Computer Engineering and Technology, Volume 1, Issue 5, July 2012 – A fuzzy based incremental feature clustering algorithm is proposed. Based on a similarity test, the feature vectors of a document set are grouped into clusters, and each cluster is characterized by a membership function with a statistical mean and deviation.
III. FEATURE SELECTION METHODS FOR
DIMENSIONALITY REDUCTION
Feature selection can either be supervised or unsupervised. A
brief summary of the different feature extraction methods used
in this study follows. As a first step, document pre-processing removes stopwords, short words, numbers and alphanumeric characters. With the noise removed, the text is transformed into a term-weighted matrix whose rows correspond to terms and whose columns correspond to the documents in which they appear. Each cell entry holds the frequency of occurrence of a word in a document (also called its weight). The term-weighting factors are local term frequency (tf), the frequency of a term within a document, and inverse document frequency (idf), which accounts for how widely the term is spread across the whole collection. The normalized tf-idf weight is the general weighting method used. The matrix thus obtained consists mostly of sparse elements.
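As an illustration of this weighting step, the sketch below builds a normalized tf-idf term-document matrix in Python with scikit-learn. This is not the authors' MATLAB implementation; the toy corpus and the vectorizer settings are assumptions chosen only to mirror the description above.

```python
# Illustrative sketch (not the study's code): normalized tf-idf term-document matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stocks rallied as the market closed higher",
    "the team won the final match of the season",
    "market volatility worried stock investors",
]

# English stopwords are removed and tf-idf weights are L2-normalized,
# mirroring the normalized tf-idf weighting described above.
vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
doc_term = vectorizer.fit_transform(docs)      # documents x terms (sparse)

term_doc = doc_term.T                          # terms x documents, as in the paper
print(term_doc.shape)                          # (n_terms, n_docs)
print(vectorizer.get_feature_names_out()[:5])  # a few of the extracted terms
```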
A. Latent Semantic Indexing (LSI)
Every document has an underlying semantic structure relating to a particular abstraction. Latent Semantic Indexing taps this theory to identify the relation between words and the context in which they are used. Mathematical and statistical techniques are used for this inference [2]. Dimensions that are of no consequence to the text are to be eliminated; at the same time, removal of these dimensions should not result in a loss of interpretation. LSI starts by pre-processing the document to remove stop words, apply stemming, etc. This is followed by converting the text document into a term-weighted matrix consisting of mostly sparse vectors, where cell entries are incremented corresponding to the frequency of words. LSI, accompanied by the powerful Singular Value Decomposition (SVD), is used for feature selection. SVD works with the sparse matrix of term weights and transforms it into a product of three matrices: the left singular vectors derived from the original row elements, the right singular vectors derived from the original column elements (both orthogonal transformations), and the singular values, a diagonal matrix containing the scaling values [4]. The diagonal matrix is the component used for dimensionality reduction. The smallest values in this matrix indicate terms which are inconsequential and can be removed. LSI along with SVD identifies features that do not contribute to the semantic structure of the document.
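The truncation of small singular values can be sketched as follows. This is a minimal illustration assuming a NumPy environment and a synthetic term-document matrix in place of a real corpus; it is not the implementation used in the study.

```python
# Illustrative sketch of LSI via truncated SVD on a hypothetical matrix.
import numpy as np

rng = np.random.default_rng(0)
term_doc = rng.random((500, 40))          # hypothetical 500 terms x 40 documents

U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)

k = 10                                     # number of latent dimensions to keep
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents projected into the k-dimensional latent semantic space.
# The smaller singular values (and their directions) are discarded,
# which is the dimensionality-reduction step described above.
docs_lsi = np.diag(s_k) @ Vt_k             # k x n_docs representation
print(docs_lsi.shape)
```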
B. Principal Component Analysis (PCA)
Principal Component Analysis uses the notion of eigenvalues and eigenvectors for its feature reduction procedure. Given data with high dimensionality, PCA starts by subtracting the mean from each data point so that the mean of each dimension is zero. The mean-adjusted data is then used alongside the original data. A covariance matrix C is a square n x n matrix whose entry C(x, y) = cov(dimension_x, dimension_y) gives the covariance between two different dimensions x and y; the covariance matrix therefore represents the dependence of dimensions on each other. Positive values indicate that as one dimension increases, the dependent dimension also scales up. The eigenvalue and eigenvector representation of the covariance matrix is used to plot the principal component axis, the best line along which the data can be laid out. Arranging the eigenvalues and eigenvectors in decreasing order shows the components with higher relevance, corresponding to the higher eigenvalues [8]. The vectors with lower values can be disregarded as non-significant. The second principal component plotted is perpendicular to the first principal component; all eigenvectors are orthogonal to each other irrespective of how many dimensions are present. Each principal component is analyzed for the extent to which it contributes to the variance of the data points [9]. As a final step the transpose of the chosen principal components is multiplied with the mean-adjusted data to derive the final data set representing those dimensions to be retained.
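A minimal sketch of these PCA steps (mean-centering, covariance matrix, eigendecomposition in decreasing eigenvalue order, and projection) is given below, assuming NumPy and synthetic data rather than the study's document matrices.

```python
# Illustrative sketch of the PCA steps described above; data are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((200, 50))                  # 200 documents x 50 features

X_centered = X - X.mean(axis=0)            # subtract the mean of each dimension
C = np.cov(X_centered, rowvar=False)       # 50 x 50 covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)       # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]          # sort components by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 5                                      # keep the top-k principal components
components = eigvecs[:, :k]
X_reduced = X_centered @ components        # final data set in k dimensions
print(X_reduced.shape)                     # (200, k)
```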
IV. FUZZY CLUSTERING
The fuzzy C-means algorithm is an iterative algorithm that groups data into different clusters [10]. Every cluster has a cluster centre. Data points are assigned to clusters based on the degree of membership with which they belong to each group. Points closer to the centre are more closely integrated into the group than those further away from the cluster centre. The decision as to which cluster a data point falls into is taken based on a membership function characterized by a mean and standard deviation; the membership function determines how well the data fit a particular cluster. On every iteration, the FCM algorithm updates the cluster centres and the membership values, while at the same time minimizing the objective function. The objective function of FCM is given by
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2}

where m > 1 is the fuzziness exponent, u_{ij} denotes the membership of the data point x_i in the j-th cluster, x_i are the data points over the different dimensions of the data, and c_j is the centre of the j-th cluster.
The iteration stops when the algorithm correctly pinpoints the cluster centres and no further change is seen in the minimization. The algorithm consists of four steps.
Step 1. Identify the initial cluster centres and a membership matrix U such that each element u_ij of U takes a value in [0,1] denoting the extent to which x_i belongs to cluster j.
Step 2. During each iteration, update the membership values in [0,1] from the given cluster centres.
Step 3. Calculate the objective function for that iteration.
Step 4. If the value has decreased from the previous iteration, continue with the iteration; otherwise the procedure halts [11], [12]. Similar features are now clustered, with features closer to the centre having a stronger resemblance.
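The four steps above can be sketched as follows. This is an illustrative NumPy implementation of standard FCM updates, not the code used in the study; the number of clusters, the fuzzifier m and the synthetic data are assumptions.

```python
# Illustrative sketch of the four FCM steps (NumPy only; all settings hypothetical).
import numpy as np

def fcm(X, n_clusters=4, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: initial membership matrix U with rows summing to 1.
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    prev_obj = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Cluster centres as membership-weighted means of the data points.
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 2: update memberships from distances to the current centres.
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        U = 1.0 / (dist ** (2.0 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
        # Step 3: objective function J_m for this iteration.
        obj = np.sum((U ** m) * dist ** 2)
        # Step 4: stop when the objective no longer decreases appreciably.
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    return centres, U

X = np.random.default_rng(2).random((300, 2))
centres, U = fcm(X)
print(centres.shape, U.shape)   # (4, 2) (300, 4)
```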
V. METHODOLOGY
The feature extraction methods try to interpret the underlying semantics of the text and identify the unique words that can be eliminated. The accuracy of classification depends on this optimal set. To assess the efficiency and performance of feature reduction, we first find the accuracy of classification, given the training and test datasets [7]. Algorithms like LSI and PCA were chosen mainly because they exhibit good accuracy when reducing the dimensions of a document. This accuracy is used as a measure of how much they score against FCM clustering. Supervised feature reduction works on
the principle that the number of clusters is determined beforehand. Fuzzy C-means clustering requires no training or test samples to work on. The primary step in feature reduction is to preprocess a document. Pre-processing is an essential procedure that reduces the complexity of the subsequent dimensionality reduction and classification or clustering process. The whole mass of text has to be transformed algebraically: the text data in the document is mapped to its vector representation for the purpose of document indexing. The most common form of representation is a matrix format where words that appear frequently are weighted [2]. The resultant matrix is a sparse matrix with most of its elements being zero. From the nature of text data, it can be seen that certain words do not contribute to the meaning of the text. The parser removes these words by stopword elimination, followed by lemmatization or stemming. Preprocessing also eliminates words shorter than three letters, alphanumeric strings and numbers. With the weighted information, documents are ranked based on their similarity to a query, using the cosine of the angle formed by their vectors as the similarity measure. Two common datasets have been used in this study: 20 Newsgroups (20NG) and R8, the eight most frequent classes from the Reuters-21578 collection. Each dataset is split into training and test documents. Of the 18,821 documents in 20NG, 60% are taken as the training set and the rest for testing. From the R8 group, 70% of the documents are taken as the training set. Using the training set for learning, the supervised techniques attempt to assign new, unknown incoming documents to their correct class labels [8].
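A sketch of the pre-processing and train/test split described above is shown below. It assumes Python with NLTK's Porter stemmer and scikit-learn's splitter, a toy stopword list and a toy corpus; the authors' actual pipeline (in MATLAB) may differ.

```python
# Illustrative pre-processing and train/test split (toy data, assumed libraries).
import re
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split

STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}   # toy stopword list
stemmer = PorterStemmer()

def preprocess(text):
    # Keep alphabetic tokens only (drops numbers and alphanumeric strings),
    # remove stopwords and words shorter than three letters, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if len(t) >= 3 and t not in STOPWORDS]

docs = [
    "The market closed 3% higher in 2014",
    "Teams won the final match of the cup",
    "Oil prices fell sharply on Monday",
    "The striker scored twice in the game",
    "Investors sold shares after the report",
]
labels = ["business", "sport", "business", "sport", "business"]
print([preprocess(d) for d in docs])

# 60:40 split as used for 20NG (a 70:30 split would be used for R8).
train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, train_size=0.6, random_state=42)
```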
VI. SUPERVISED VS UNSUPERVISED FEATURE
SELECTION
All tests have been performed in the MATLAB computing environment. Before any analysis is undertaken, it is worthwhile to study the characteristics of the data collections used. R8, the eight most frequent classes taken from the Reuters collection, is smaller in size than the 20Newsgroups collection. The general procedure followed is to train the classifier using the training data. The accuracy is judged by the number of test documents that are labeled correctly after the learning phase. In FCM, since learning evolves over iterations, bifurcation of the data is unnecessary. In this study the techniques are assessed based on accuracy, macro-averaged precision and recall, and training and testing times [6]. The datasets are divided into training and test documents in 60:40 and 70:30 ratios for 20NG and R8 respectively. With the class labeling information from the training documents, the test documents can be classified. Initially the datasets are labeled by k-means after feature selection. This is used as a point of reference when clustering using the fuzzy C-means algorithm is applied. A detailed comparison of the results shows that when features are clustered using FCM, it executes faster and is more accurate in classifying the datasets. The savings are mainly due to forgoing the training and testing times.
Dataset   Collection      Classes   Train Docs   Test Docs   Total Docs
20NG      20Newsgroups    20        11293        7528        18821
R8        Reuters-21578   8         5485         2189        7674

Figure.1 shows the datasets used in the study: the number of classes in each collection along with the division into training and test documents.
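The evaluation measures mentioned above (accuracy, and macro-averaged precision and recall) can be computed as in the following sketch; the label vectors are hypothetical and scikit-learn is assumed.

```python
# Illustrative computation of accuracy and macro-averaged precision/recall.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["earn", "acq", "earn", "crude", "acq", "earn"]   # hypothetical labels
y_pred = ["earn", "earn", "earn", "crude", "acq", "acq"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging gives each class equal weight, regardless of its size.
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```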
VII. IMPLEMENTATION
All the statistics shown have been derived after data pre-processing followed by reduction of dimensions. The resultant dataset is further classified using the k-means classification method. The performance of the dimensionality reduction algorithms is graded by comparing the evaluation measures of recall, accuracy of classification and precision. Figures 2 and 3 depict the results of classification using both the supervised and unsupervised techniques on the datasets under study.
Method   Recall   Accuracy   Precision
PCA      76.27    81.76      88.35
LSI      83.5     89.70      87.64
FCM      85.12    92.37      91.75

Figure.2 Collation of the performance measures for the R8 dataset
FCM performs the best in terms of accuracy, precision and recall while classifying the R8 dataset. There is an approximate increase in accuracy of 4% when compared to LSI and an 8% increment over PCA. FCM clusters 20Newsgroups with an accuracy gain of about 3% over LSI and close to 10% over PCA.
Method   Recall   Accuracy   Precision
PCA      75.37    73.29      77.64
LSI      81.44    88.87      84.79
FCM      84.25    92.63      87.60

Figure.3 Collation of the performance measures for the 20Newsgroups data collection
Figure.4 Bar chart showing the comparison of performance
measures of the algorithms on R8 dataset.
Figure.5 Bar chart showing the comparison of performance
measures for 20Newsgroups dataset.
Figure.6 Clustering of R8 dataset using FCM algorithm
Figure.7 Clustering of 20NG dataset using FCM algorithm
Figures 6 and 7 show the clustering of data points using hard c-means clustering: every point can be a member of only one cluster, and dual membership is not allowed. FCM has clustered the datasets into four clusters, with the cluster centres marked by slightly larger, darker circles. In the R8 collection the distribution of terms is not very strongly bound to the cluster centres; features belonging to one cluster could share membership with another cluster as well. In the 20NG dataset, the members are closely tied to their cluster centres.
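Plots in the style of Figures 6 and 7 can be produced as sketched below (matplotlib is assumed; the 2-D data and the fcm function from the earlier sketch are illustrative, not the study's actual feature vectors).

```python
# Illustrative sketch of plots like Figures 6 and 7: each point coloured by its
# hardened cluster label, with cluster centres drawn as larger dark circles.
# `fcm` is the sketch given earlier; the data here are hypothetical.
import numpy as np
import matplotlib.pyplot as plt

X = np.random.default_rng(3).random((300, 2))       # hypothetical 2-D features
centres, U = fcm(X, n_clusters=4)
labels = U.argmax(axis=1)                            # hard assignment per point

plt.scatter(X[:, 0], X[:, 1], c=labels, s=15, cmap="viridis")
plt.scatter(centres[:, 0], centres[:, 1], c="black", s=120)  # cluster centres
plt.title("FCM clustering (illustration)")
plt.show()
```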
VIII. CONCLUSION
This study evaluated the effectiveness of feature selection techniques for dimensionality reduction in text categorization. Both supervised and unsupervised techniques were experimented with. Of the two techniques, Latent Semantic Indexing exhibits superior performance over reduction using Principal Component Analysis in terms of precision, accuracy and recall. When unsupervised feature clustering with FCM is used, an improvement in accuracy over both LSI and PCA is observed. Clustering with FCM shows higher accuracy, while the training and testing times are minimized. With FCM it can be seen that at least 80% of terms can be removed without degrading the resulting classification. The datasets under study show some fundamental differences. Classification results vary depending on the features that are eliminated, the relations within the data, and the factors used to assess similarity in combination with the classification method employed. Given a choice of optimally proven feature extraction methods, it is possible that accuracy will improve when two feature selection algorithms are used in conjunction with each other. Since the characteristics of every document collection are different, devising a classification algorithm that works best for the type of document content could also be introduced. The performance of the feature extraction algorithms can also be estimated using other efficacy measures.
REFERENCES
[1] Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval.
Addison-Wesley, Reading, Massachusetts, USA, 1999.
[2] Wei Xu, Xin Liu, Yihong Gong. Document Clustering Based On
Nonnegative Matrix Factorization. In ACM. SIGIR, Toronto, Canada,
2003.
[3] Ian Soboroff. IR Models: The Vector Space Model. Information Retrieval Lecture 7.
[4] http://www.csee.umbc.edu/_ian/irF02/lectures/07Models-VSM.pdf
[5] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R.
Harshman. Indexing by latent semantic analysis. Journal of the Society
for Information Science, 41:391-407, 1990.
[6] Marko Grobelnik and Dunja Mladenic. Text-Mining Tutorial. J. Stefan Institute, Slovenia.
[7] Clara Yu, John Cuadrado, Maciej Ceglowski, J. Scott Payne. Patterns in Unstructured Data: Discovery, Aggregation, and Visualization. A Presentation to the Andrew W. Mellon Foundation.
[8] Yang, Y., and Jan Pedersen. “A Comparative Study on Feature Selection
in Text Categorization.” ICML 1997: 412-420.
[9] Lindsay I. Smith. A Tutorial on Principal Components Analysis.
[10] Ana Cardoso-Cachopo, Improving Methods for Single-label Text
Categorization, PhD Thesis, October, 2007.
[11] K. Sathiyakumari, V. Preamsudha, G. Manimekalai. "Unsupervised Approach for Document Clustering Using Modified Fuzzy C mean Algorithm". International Journal of Computer & Organization Trends, Volume 11, Issue 3, 2011.
[12] R. Rajendra Prasath, Sudeshna Sarkar: Unsupervised Feature Generation
using Knowledge Repositories for Effective Text Categorization. ECAI
2010: 1101-1102
[13] T. M. Nogueira. "On The Use of Fuzzy Rules to Text Document Classification". 2010 10th International Conference on Hybrid Intelligent Systems (HIS), 23-25 Aug 2010, Atlanta, US.
Author Profile
Asha Kurian is a research scholar in the Department of Computer Applications at Dr. MGR University, Chennai. She did her post-graduation in Computer Applications at Coimbatore Institute of Technology, Bharathiar University, in 2003. Her areas of interest include Data Mining and Artificial Intelligence.
M.S. Josephine works in the Department of Computer Applications, Dr. MGR University, Chennai. She obtained her Master's degree (MCA) from St. Joseph's College, Bharathidasan University, an M.Phil (Computer Science) from Periyar University and a Doctorate in Computer Applications from Mother Teresa University. Her research interests include Software Engineering, Expert Systems, Networks and Data Mining.
More Related Content

What's hot

Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3alaa223
 
Research on ontology based information retrieval techniques
Research on ontology based information retrieval techniquesResearch on ontology based information retrieval techniques
Research on ontology based information retrieval techniques
Kausar Mukadam
 
IRJET- A Novel Technique for Inferring User Search using Feedback Sessions
IRJET- A Novel Technique for Inferring User Search using Feedback SessionsIRJET- A Novel Technique for Inferring User Search using Feedback Sessions
IRJET- A Novel Technique for Inferring User Search using Feedback Sessions
IRJET Journal
 
An effective search on web log from most popular downloaded content
An effective search on web log from most popular downloaded contentAn effective search on web log from most popular downloaded content
An effective search on web log from most popular downloaded content
ijdpsjournal
 
QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS...
QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS...QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS...
QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS...
cseij
 
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGYINTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
cscpconf
 
Improving Annotations in Digital Documents using Document Features and Fuzzy ...
Improving Annotations in Digital Documents using Document Features and Fuzzy ...Improving Annotations in Digital Documents using Document Features and Fuzzy ...
Improving Annotations in Digital Documents using Document Features and Fuzzy ...
IRJET Journal
 
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : ...
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL  FOR HEALTHCARE INFORMATION SYSTEM :   ...ONTOLOGY-DRIVEN INFORMATION RETRIEVAL  FOR HEALTHCARE INFORMATION SYSTEM :   ...
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : ...
IJNSA Journal
 
Vertical intent prediction approach based on Doc2vec and convolutional neural...
Vertical intent prediction approach based on Doc2vec and convolutional neural...Vertical intent prediction approach based on Doc2vec and convolutional neural...
Vertical intent prediction approach based on Doc2vec and convolutional neural...
IJECEIAES
 
Using content features to enhance the
Using content features to enhance theUsing content features to enhance the
Using content features to enhance the
ijaia
 
An effective pre processing algorithm for information retrieval systems
An effective pre processing algorithm for information retrieval systemsAn effective pre processing algorithm for information retrieval systems
An effective pre processing algorithm for information retrieval systems
ijdms
 
UML MODELING AND SYSTEM ARCHITECTURE FOR AGENT BASED INFORMATION RETRIEVAL
UML MODELING AND SYSTEM ARCHITECTURE FOR AGENT BASED INFORMATION RETRIEVALUML MODELING AND SYSTEM ARCHITECTURE FOR AGENT BASED INFORMATION RETRIEVAL
UML MODELING AND SYSTEM ARCHITECTURE FOR AGENT BASED INFORMATION RETRIEVAL
ijcsit
 
M045067275
M045067275M045067275
M045067275
IJERA Editor
 
Personalized web search using browsing history and domain knowledge
Personalized web search using browsing history and domain knowledgePersonalized web search using browsing history and domain knowledge
Personalized web search using browsing history and domain knowledge
Rishikesh Pathak
 
TWO WAY CHAINED PACKETS MARKING TECHNIQUE FOR SECURE COMMUNICATION IN WIRELES...
TWO WAY CHAINED PACKETS MARKING TECHNIQUE FOR SECURE COMMUNICATION IN WIRELES...TWO WAY CHAINED PACKETS MARKING TECHNIQUE FOR SECURE COMMUNICATION IN WIRELES...
TWO WAY CHAINED PACKETS MARKING TECHNIQUE FOR SECURE COMMUNICATION IN WIRELES...
pharmaindexing
 
Information retrival system and PageRank algorithm
Information retrival system and PageRank algorithmInformation retrival system and PageRank algorithm
Information retrival system and PageRank algorithm
Rupali Bhatnagar
 

What's hot (20)

Lec1,2
Lec1,2Lec1,2
Lec1,2
 
Lectures 1,2,3
Lectures 1,2,3Lectures 1,2,3
Lectures 1,2,3
 
Research on ontology based information retrieval techniques
Research on ontology based information retrieval techniquesResearch on ontology based information retrieval techniques
Research on ontology based information retrieval techniques
 
Ijetcas14 446
Ijetcas14 446Ijetcas14 446
Ijetcas14 446
 
IRJET- A Novel Technique for Inferring User Search using Feedback Sessions
IRJET- A Novel Technique for Inferring User Search using Feedback SessionsIRJET- A Novel Technique for Inferring User Search using Feedback Sessions
IRJET- A Novel Technique for Inferring User Search using Feedback Sessions
 
An effective search on web log from most popular downloaded content
An effective search on web log from most popular downloaded contentAn effective search on web log from most popular downloaded content
An effective search on web log from most popular downloaded content
 
Naresh sharma
Naresh sharmaNaresh sharma
Naresh sharma
 
QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS...
QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS...QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS...
QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS...
 
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGYINTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
 
Improving Annotations in Digital Documents using Document Features and Fuzzy ...
Improving Annotations in Digital Documents using Document Features and Fuzzy ...Improving Annotations in Digital Documents using Document Features and Fuzzy ...
Improving Annotations in Digital Documents using Document Features and Fuzzy ...
 
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : ...
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL  FOR HEALTHCARE INFORMATION SYSTEM :   ...ONTOLOGY-DRIVEN INFORMATION RETRIEVAL  FOR HEALTHCARE INFORMATION SYSTEM :   ...
ONTOLOGY-DRIVEN INFORMATION RETRIEVAL FOR HEALTHCARE INFORMATION SYSTEM : ...
 
Vertical intent prediction approach based on Doc2vec and convolutional neural...
Vertical intent prediction approach based on Doc2vec and convolutional neural...Vertical intent prediction approach based on Doc2vec and convolutional neural...
Vertical intent prediction approach based on Doc2vec and convolutional neural...
 
Using content features to enhance the
Using content features to enhance theUsing content features to enhance the
Using content features to enhance the
 
An effective pre processing algorithm for information retrieval systems
An effective pre processing algorithm for information retrieval systemsAn effective pre processing algorithm for information retrieval systems
An effective pre processing algorithm for information retrieval systems
 
UML MODELING AND SYSTEM ARCHITECTURE FOR AGENT BASED INFORMATION RETRIEVAL
UML MODELING AND SYSTEM ARCHITECTURE FOR AGENT BASED INFORMATION RETRIEVALUML MODELING AND SYSTEM ARCHITECTURE FOR AGENT BASED INFORMATION RETRIEVAL
UML MODELING AND SYSTEM ARCHITECTURE FOR AGENT BASED INFORMATION RETRIEVAL
 
M045067275
M045067275M045067275
M045067275
 
Personalized web search using browsing history and domain knowledge
Personalized web search using browsing history and domain knowledgePersonalized web search using browsing history and domain knowledge
Personalized web search using browsing history and domain knowledge
 
TWO WAY CHAINED PACKETS MARKING TECHNIQUE FOR SECURE COMMUNICATION IN WIRELES...
TWO WAY CHAINED PACKETS MARKING TECHNIQUE FOR SECURE COMMUNICATION IN WIRELES...TWO WAY CHAINED PACKETS MARKING TECHNIQUE FOR SECURE COMMUNICATION IN WIRELES...
TWO WAY CHAINED PACKETS MARKING TECHNIQUE FOR SECURE COMMUNICATION IN WIRELES...
 
T0 numtq0njc=
T0 numtq0njc=T0 numtq0njc=
T0 numtq0njc=
 
Information retrival system and PageRank algorithm
Information retrival system and PageRank algorithmInformation retrival system and PageRank algorithm
Information retrival system and PageRank algorithm
 

Similar to Scaling Down Dimensions and Feature Extraction in Document Repository Classification

FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
cscpconf
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET Journal
 
Classification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithmClassification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithm
eSAT Publishing House
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusabilityAlexander Decker
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
IJMER
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity Measure
IOSR Journals
 
Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using Knn
IOSR Journals
 
Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...
IAESIJAI
 
Feature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documentsFeature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documents
IJECEIAES
 
A Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed ClusteringA Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed Clustering
IRJET Journal
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Editor IJARCET
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Editor IJARCET
 
E1062530
E1062530E1062530
E1062530
IJERD Editor
 
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
IJCSEA Journal
 
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
IJCSEA Journal
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationNinad Samel
 
F04463437
F04463437F04463437
F04463437
IOSR-JEN
 
Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...
IJERA Editor
 
Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...
IJERA Editor
 

Similar to Scaling Down Dimensions and Feature Extraction in Document Repository Classification (20)

FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
 
Classification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithmClassification of text data using feature clustering algorithm
Classification of text data using feature clustering algorithm
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusability
 
Bl24409420
Bl24409420Bl24409420
Bl24409420
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity Measure
 
Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using Knn
 
Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...
 
Feature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documentsFeature selection, optimization and clustering strategies of text documents
Feature selection, optimization and clustering strategies of text documents
 
A Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed ClusteringA Competent and Empirical Model of Distributed Clustering
A Competent and Empirical Model of Distributed Clustering
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
 
Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973Volume 2-issue-6-1969-1973
Volume 2-issue-6-1969-1973
 
E1062530
E1062530E1062530
E1062530
 
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
 
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
Filter Based Approach for Genomic Feature Set Selection (FBA-GFS)
 
A rough set based hybrid method to text categorization
A rough set based hybrid method to text categorizationA rough set based hybrid method to text categorization
A rough set based hybrid method to text categorization
 
F04463437
F04463437F04463437
F04463437
 
Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...
 
Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...Construction of Keyword Extraction using Statistical Approaches and Document ...
Construction of Keyword Extraction using Statistical Approaches and Document ...
 

More from ijdmtaiir

A review on data mining techniques for Digital Mammographic Analysis
A review on data mining techniques for Digital Mammographic AnalysisA review on data mining techniques for Digital Mammographic Analysis
A review on data mining techniques for Digital Mammographic Analysis
ijdmtaiir
 
Comparison on PCA ICA and LDA in Face Recognition
Comparison on PCA ICA and LDA in Face RecognitionComparison on PCA ICA and LDA in Face Recognition
Comparison on PCA ICA and LDA in Face Recognition
ijdmtaiir
 
A Novel Approach to Mathematical Concepts in Data Mining
A Novel Approach to Mathematical Concepts in Data MiningA Novel Approach to Mathematical Concepts in Data Mining
A Novel Approach to Mathematical Concepts in Data Mining
ijdmtaiir
 
Analysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data MiningAnalysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data Mining
ijdmtaiir
 
Performance Analysis of Selected Classifiers in User Profiling
Performance Analysis of Selected Classifiers in User ProfilingPerformance Analysis of Selected Classifiers in User Profiling
Performance Analysis of Selected Classifiers in User Profiling
ijdmtaiir
 
Analysis of Sales and Distribution of an IT Industry Using Data Mining Techni...
Analysis of Sales and Distribution of an IT Industry Using Data Mining Techni...Analysis of Sales and Distribution of an IT Industry Using Data Mining Techni...
Analysis of Sales and Distribution of an IT Industry Using Data Mining Techni...
ijdmtaiir
 
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
ijdmtaiir
 
An Analysis of Data Mining Applications for Fraud Detection in Securities Market
An Analysis of Data Mining Applications for Fraud Detection in Securities MarketAn Analysis of Data Mining Applications for Fraud Detection in Securities Market
An Analysis of Data Mining Applications for Fraud Detection in Securities Market
ijdmtaiir
 
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
ijdmtaiir
 
Music Promotes Gross National Happiness Using Neutrosophic fuzzyCognitive Map...
Music Promotes Gross National Happiness Using Neutrosophic fuzzyCognitive Map...Music Promotes Gross National Happiness Using Neutrosophic fuzzyCognitive Map...
Music Promotes Gross National Happiness Using Neutrosophic fuzzyCognitive Map...
ijdmtaiir
 
A Study on Youth Violence and Aggression using DEMATEL with FCM Methods
A Study on Youth Violence and Aggression using DEMATEL with FCM MethodsA Study on Youth Violence and Aggression using DEMATEL with FCM Methods
A Study on Youth Violence and Aggression using DEMATEL with FCM Methods
ijdmtaiir
 
Certain Investigation on Dynamic Clustering in Dynamic Datamining
Certain Investigation on Dynamic Clustering in Dynamic DataminingCertain Investigation on Dynamic Clustering in Dynamic Datamining
Certain Investigation on Dynamic Clustering in Dynamic Datamining
ijdmtaiir
 
Analyzing the Role of a Family in Constructing Gender Roles Using Combined Ov...
Analyzing the Role of a Family in Constructing Gender Roles Using Combined Ov...Analyzing the Role of a Family in Constructing Gender Roles Using Combined Ov...
Analyzing the Role of a Family in Constructing Gender Roles Using Combined Ov...
ijdmtaiir
 
An Interval Based Fuzzy Multiple Expert System to Analyze the Impacts of Clim...
An Interval Based Fuzzy Multiple Expert System to Analyze the Impacts of Clim...An Interval Based Fuzzy Multiple Expert System to Analyze the Impacts of Clim...
An Interval Based Fuzzy Multiple Expert System to Analyze the Impacts of Clim...
ijdmtaiir
 
An Approach for the Detection of Vascular Abnormalities in Diabetic Retinopathy
An Approach for the Detection of Vascular Abnormalities in Diabetic RetinopathyAn Approach for the Detection of Vascular Abnormalities in Diabetic Retinopathy
An Approach for the Detection of Vascular Abnormalities in Diabetic Retinopathy
ijdmtaiir
 
Improve the Performance of Clustering Using Combination of Multiple Clusterin...
Improve the Performance of Clustering Using Combination of Multiple Clusterin...Improve the Performance of Clustering Using Combination of Multiple Clusterin...
Improve the Performance of Clustering Using Combination of Multiple Clusterin...
ijdmtaiir
 
The Study of Symptoms of Tuberculosis Using Induced Fuzzy Coginitive Maps (IF...
The Study of Symptoms of Tuberculosis Using Induced Fuzzy Coginitive Maps (IF...The Study of Symptoms of Tuberculosis Using Induced Fuzzy Coginitive Maps (IF...
The Study of Symptoms of Tuberculosis Using Induced Fuzzy Coginitive Maps (IF...
ijdmtaiir
 
A Study on Finding the Key Motive of Happiness Using Fuzzy Cognitive Maps (FCMs)
A Study on Finding the Key Motive of Happiness Using Fuzzy Cognitive Maps (FCMs)A Study on Finding the Key Motive of Happiness Using Fuzzy Cognitive Maps (FCMs)
A Study on Finding the Key Motive of Happiness Using Fuzzy Cognitive Maps (FCMs)
ijdmtaiir
 
Study of sustainable development using Fuzzy Cognitive Relational Maps (FCM)
Study of sustainable development using Fuzzy Cognitive Relational Maps (FCM)Study of sustainable development using Fuzzy Cognitive Relational Maps (FCM)
Study of sustainable development using Fuzzy Cognitive Relational Maps (FCM)
ijdmtaiir
 
A Study of Personality Influence in Building Work Life Balance Using Fuzzy Re...
A Study of Personality Influence in Building Work Life Balance Using Fuzzy Re...A Study of Personality Influence in Building Work Life Balance Using Fuzzy Re...
A Study of Personality Influence in Building Work Life Balance Using Fuzzy Re...
ijdmtaiir
 

More from ijdmtaiir (20)

A review on data mining techniques for Digital Mammographic Analysis
A review on data mining techniques for Digital Mammographic AnalysisA review on data mining techniques for Digital Mammographic Analysis
A review on data mining techniques for Digital Mammographic Analysis
 
Comparison on PCA ICA and LDA in Face Recognition
Comparison on PCA ICA and LDA in Face RecognitionComparison on PCA ICA and LDA in Face Recognition
Comparison on PCA ICA and LDA in Face Recognition
 
A Novel Approach to Mathematical Concepts in Data Mining
A Novel Approach to Mathematical Concepts in Data MiningA Novel Approach to Mathematical Concepts in Data Mining
A Novel Approach to Mathematical Concepts in Data Mining
 
Analysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data MiningAnalysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data Mining
 
Performance Analysis of Selected Classifiers in User Profiling
Performance Analysis of Selected Classifiers in User ProfilingPerformance Analysis of Selected Classifiers in User Profiling
Performance Analysis of Selected Classifiers in User Profiling
 
Analysis of Sales and Distribution of an IT Industry Using Data Mining Techni...
Analysis of Sales and Distribution of an IT Industry Using Data Mining Techni...Analysis of Sales and Distribution of an IT Industry Using Data Mining Techni...
Analysis of Sales and Distribution of an IT Industry Using Data Mining Techni...
 
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
Analysis of Influences of memory on Cognitive load Using Neural Network Back ...
 
An Analysis of Data Mining Applications for Fraud Detection in Securities Market
An Analysis of Data Mining Applications for Fraud Detection in Securities MarketAn Analysis of Data Mining Applications for Fraud Detection in Securities Market
An Analysis of Data Mining Applications for Fraud Detection in Securities Market
 
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
An Ill-identified Classification to Predict Cardiac Disease Using Data Cluste...
 
Music Promotes Gross National Happiness Using Neutrosophic fuzzyCognitive Map...
Music Promotes Gross National Happiness Using Neutrosophic fuzzyCognitive Map...Music Promotes Gross National Happiness Using Neutrosophic fuzzyCognitive Map...
Music Promotes Gross National Happiness Using Neutrosophic fuzzyCognitive Map...
 
A Study on Youth Violence and Aggression using DEMATEL with FCM Methods
A Study on Youth Violence and Aggression using DEMATEL with FCM MethodsA Study on Youth Violence and Aggression using DEMATEL with FCM Methods
A Study on Youth Violence and Aggression using DEMATEL with FCM Methods
 
Certain Investigation on Dynamic Clustering in Dynamic Datamining
Certain Investigation on Dynamic Clustering in Dynamic DataminingCertain Investigation on Dynamic Clustering in Dynamic Datamining
Certain Investigation on Dynamic Clustering in Dynamic Datamining
 
Analyzing the Role of a Family in Constructing Gender Roles Using Combined Ov...
Analyzing the Role of a Family in Constructing Gender Roles Using Combined Ov...Analyzing the Role of a Family in Constructing Gender Roles Using Combined Ov...
Analyzing the Role of a Family in Constructing Gender Roles Using Combined Ov...
 
An Interval Based Fuzzy Multiple Expert System to Analyze the Impacts of Clim...
An Interval Based Fuzzy Multiple Expert System to Analyze the Impacts of Clim...An Interval Based Fuzzy Multiple Expert System to Analyze the Impacts of Clim...
An Interval Based Fuzzy Multiple Expert System to Analyze the Impacts of Clim...
 
An Approach for the Detection of Vascular Abnormalities in Diabetic Retinopathy
An Approach for the Detection of Vascular Abnormalities in Diabetic RetinopathyAn Approach for the Detection of Vascular Abnormalities in Diabetic Retinopathy
An Approach for the Detection of Vascular Abnormalities in Diabetic Retinopathy
 
Improve the Performance of Clustering Using Combination of Multiple Clusterin...
Improve the Performance of Clustering Using Combination of Multiple Clusterin...Improve the Performance of Clustering Using Combination of Multiple Clusterin...
Improve the Performance of Clustering Using Combination of Multiple Clusterin...
 
The Study of Symptoms of Tuberculosis Using Induced Fuzzy Coginitive Maps (IF...
The Study of Symptoms of Tuberculosis Using Induced Fuzzy Coginitive Maps (IF...The Study of Symptoms of Tuberculosis Using Induced Fuzzy Coginitive Maps (IF...
The Study of Symptoms of Tuberculosis Using Induced Fuzzy Coginitive Maps (IF...
 
A Study on Finding the Key Motive of Happiness Using Fuzzy Cognitive Maps (FCMs)
A Study on Finding the Key Motive of Happiness Using Fuzzy Cognitive Maps (FCMs)A Study on Finding the Key Motive of Happiness Using Fuzzy Cognitive Maps (FCMs)
A Study on Finding the Key Motive of Happiness Using Fuzzy Cognitive Maps (FCMs)
 
Study of sustainable development using Fuzzy Cognitive Relational Maps (FCM)
Study of sustainable development using Fuzzy Cognitive Relational Maps (FCM)Study of sustainable development using Fuzzy Cognitive Relational Maps (FCM)
Study of sustainable development using Fuzzy Cognitive Relational Maps (FCM)
 
A Study of Personality Influence in Building Work Life Balance Using Fuzzy Re...
A Study of Personality Influence in Building Work Life Balance Using Fuzzy Re...A Study of Personality Influence in Building Work Life Balance Using Fuzzy Re...
A Study of Personality Influence in Building Work Life Balance Using Fuzzy Re...
 

Recently uploaded

Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Jayaprasanna4
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
BrazilAccount1
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
compared based on their performance and time consumption.

2. Improving Methods for Single-label Text Categorization, July 2007 – An evaluation of evolutionary feature reduction algorithms. The paper presents a comprehensive comparison of the performance of several text categorization methods on two data sets. In particular, the Vector and Latent Semantic Analysis (LSA) methods, a classifier based on Support Vector Machines (SVM), and the k-Nearest Neighbour variations of the Vector and LSA models are evaluated.

3. A Fuzzy based approach to text mining and document clustering, October 2013 – Shows how fuzzy logic can be applied in text mining to perform document clustering. The fuzzy c-means (FCM) algorithm is used to group the documents into clusters.

4. Feature Clustering algorithms for text classification - Novel Techniques and Reviews, August 2010 – Reviews some of the important techniques for text classification and discusses in detail novel parameters based on a fuzzy set approach.

5. Classification of text using fuzzy based incremental feature clustering algorithm, International Journal of Advanced Research in Computer Engineering and Technology, Volume 1, Issue 5, July 2012 – Proposes a fuzzy based incremental feature clustering algorithm. Based on a similarity test, the feature vectors of a document set are grouped into clusters, and each cluster is characterized by a membership function with a statistical mean and deviation.

III. FEATURE SELECTION METHODS FOR DIMENSIONALITY REDUCTION

Feature selection can be either supervised or unsupervised. A brief summary of the different feature extraction methods used in this study is given below. As a first step, document pre-processing removes stopwords, short words, numbers and alphanumeric characters. With the noise removed, the text is transformed into a term-weighted matrix whose rows represent the terms and whose columns represent the documents in which they appear. Each cell holds the weight of a term in a document, built from two factors: the local term frequency (tf), the frequency of the term within the document, and the inverse document frequency (idf), which accounts for how widely the term occurs across the whole collection. The normalized tf-idf product is the general weighting method used. The matrix thus obtained consists mostly of sparse elements.
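To make the term-weighting step concrete, the short Python sketch below builds such a normalized tf-idf matrix with scikit-learn. The toy corpus and the vectorizer settings are illustrative assumptions of this edit, not the configuration actually used in the study (whose experiments were run in MATLAB).

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical toy corpus standing in for the document repository
    docs = [
        "dimensionality reduction for text classification",
        "fuzzy clustering of document features",
        "latent semantic indexing of a document collection",
    ]

    # Stopword removal and a token pattern that keeps only alphabetic words of
    # three or more letters approximate the pre-processing described above.
    vectorizer = TfidfVectorizer(stop_words="english",
                                 token_pattern=r"(?u)\b[a-zA-Z]{3,}\b",
                                 norm="l2")
    X = vectorizer.fit_transform(docs)   # sparse document-by-term tf-idf matrix
    term_doc = X.T                       # transpose to the term-by-document layout used here
    print(term_doc.shape)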
A. Latent Semantic Indexing (LSI)

Every document has an underlying semantic structure relating to a particular abstraction. Latent Semantic Indexing exploits this to identify the relation between words and the context in which they are used, drawing the inference with mathematical and statistical techniques [2]. Dimensions of no consequence to the text are eliminated, but their removal should not result in a loss of interpretation. LSI starts by pre-processing the document (stopword removal, stemming, etc.); the text is then converted into the term-weighted matrix of mostly sparse vectors described above, with cell entries incremented according to word frequency. LSI relies on the Singular Value Decomposition (SVD) for feature selection. SVD takes the sparse matrix of term weights and factorizes it into a product of three matrices: the left singular vectors, derived from the original rows, the right singular vectors, derived from the original columns, both orthogonal transformations, and a diagonal matrix of singular values holding the scaling factors [4]. The diagonal matrix is the component used for dimensionality reduction: the smallest values in this matrix indicate terms that are inconsequential and can be removed. In this way LSI, together with SVD, identifies features that do not contribute to the semantic structure of the document.
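Continuing the tf-idf sketch above, a rank-k truncated SVD can be computed with scikit-learn's TruncatedSVD. This is only an illustration of the generic technique, not the authors' code; the value of k is an arbitrary choice for the toy corpus.

    from sklearn.decomposition import TruncatedSVD

    # X is the document-by-term tf-idf matrix built in the previous sketch
    k = 2                                     # number of latent dimensions to keep (assumption)
    svd = TruncatedSVD(n_components=k, random_state=0)
    X_lsi = svd.fit_transform(X)              # documents projected onto k latent dimensions
    print(svd.singular_values_)               # diagonal of the singular value matrix; the
                                              # smallest values mark dimensions that can be dropped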
B. Principal Component Analysis (PCA)

Principal Component Analysis uses eigenvalues and eigenvectors to reduce the number of feature variables. Given high-dimensional data, PCA starts by subtracting the mean from each data point so that every dimension has zero mean. A covariance matrix is then computed over the mean-adjusted data,

C[n x n] = ( cov(dimension_x, dimension_y) ),

an n x n square matrix in which each cell (x, y) holds the covariance between dimensions x and y. The covariance matrix represents the dependence of the dimensions on each other; a positive value indicates that as one dimension increases, the dependent dimension also scales up. The eigenvalues and eigenvectors of the covariance matrix define the principal component axes, the lines along which the data are best laid out. Arranging the eigenvalue-eigenvector pairs in decreasing order ranks the components by relevance, with higher eigenvalues corresponding to more significant components [8]; vectors with low eigenvalues can be disregarded as insignificant. The second principal component is perpendicular to the first, and all eigenvectors are orthogonal to each other irrespective of how many dimensions are present. Each principal component is analyzed for the extent to which it contributes to the variance of the data points [9]. As a final step, the transpose of the chosen principal components is multiplied with the mean-adjusted data to derive the final data set representing only the dimensions to be retained.
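The steps above can be written out in a few lines of NumPy. This is a sketch of the generic PCA procedure under assumptions of this edit (the function name, the choice of k and the dense-matrix input are illustrative, not taken from the paper).

    import numpy as np

    def pca_reduce(X, k):
        """Project an n_samples x n_features matrix onto its k leading principal components."""
        X_centered = X - X.mean(axis=0)          # subtract the mean of every dimension
        C = np.cov(X_centered, rowvar=False)     # covariance matrix of the dimensions
        eigvals, eigvecs = np.linalg.eigh(C)     # eigen-decomposition of the symmetric matrix
        order = np.argsort(eigvals)[::-1]        # sort components by decreasing eigenvalue
        components = eigvecs[:, order[:k]]       # keep the k most significant eigenvectors
        return X_centered @ components           # multiply with the mean-adjusted data

    # Example with random data standing in for a dense term-weight matrix
    reduced = pca_reduce(np.random.rand(50, 200), k=10)
    print(reduced.shape)                         # (50, 10)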
IV. FUZZY CLUSTERING

The Fuzzy C-Means (FCM) algorithm iteratively groups the data into different clusters [10]. Every cluster has a cluster centre, and data points are assigned to clusters according to the degree with which they belong to each group: points close to a centre are more strongly integrated into that group than points further away. The assignment is governed by a membership function, based on the mean and standard deviation, which determines how well a point fits a particular cluster. On every iteration, FCM updates the cluster centres and the membership values while minimizing its objective function,

U_m = ∑_{i=1}^{N} ∑_{j=1}^{C} x_ij^m ‖x_i − c_j‖²,

where x_ij denotes the membership of data point x_i in the jth cluster, c_j is the centre of the jth cluster, and m > 1 is the fuzziness exponent. The iteration stops when the algorithm has pinpointed the cluster centres and no further change is seen in the minimization. The algorithm consists of four steps:

Step 1. Identify the initial cluster centres and a membership matrix X such that each element x_ij takes a value in [0,1] denoting the extent to which point i belongs to cluster j.
Step 2. During each iteration, update the membership values in [0,1] from the current cluster centres.
Step 3. Calculate the objective function for that iteration.
Step 4. If the value has decreased from the previous iteration, continue iterating; otherwise the procedure halts [11], [12].

Similar features are now clustered, with features closer to a centre bearing a stronger resemblance to each other.
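The four steps map onto the short NumPy implementation below. This is a generic FCM sketch for illustration only; the function name, random initialization, fuzziness exponent and stopping tolerance are assumptions of this edit rather than the routine used in the experiments.

    import numpy as np

    def fuzzy_c_means(X, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
        """Cluster the rows of X into n_clusters fuzzy clusters (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        # Step 1: initial membership matrix with rows summing to one
        U = rng.random((X.shape[0], n_clusters))
        U /= U.sum(axis=1, keepdims=True)
        prev_obj = np.inf
        for _ in range(max_iter):
            Um = U ** m
            # Step 2: update the cluster centres from the current memberships
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            dist = np.fmax(dist, 1e-10)          # guard against division by zero
            U = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
            # Step 3: objective function U_m = sum_i sum_j x_ij^m ||x_i - c_j||^2
            obj = np.sum((U ** m) * dist ** 2)
            # Step 4: stop once the objective no longer decreases appreciably
            if abs(prev_obj - obj) < tol:
                break
            prev_obj = obj
        return centers, U

    # Example run on random data standing in for the reduced feature vectors
    centers, memberships = fuzzy_c_means(np.random.rand(100, 8), n_clusters=4)
    print(memberships.sum(axis=1)[:5])   # each row of memberships sums to (approximately) 1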
V. METHODOLOGY

The feature extraction methods try to interpret the underlying semantics of the text and identify the words that can be eliminated; the accuracy of classification depends on this optimal feature set. To assess the efficiency and performance of feature reduction, we first measure classification accuracy on the given training and test datasets [7]. The evolutionary algorithms LSI and PCA were chosen mainly because they exhibit good accuracy when reducing the dimensions of a document collection, and this is used as the yardstick against which FCM clustering is scored. Supervised feature reduction works on the principle that the number of clusters is determined beforehand, whereas fuzzy C-Means clustering requires no training or test samples to work on.

The primary step in feature reduction is to preprocess a document. Pre-processing is essential because it reduces the complexity of dimensionality reduction and of the subsequent classification or clustering. The whole mass of text has to be transformed algebraically: the text in each document is mapped to a vector representation for the purpose of document indexing. The most common representation is a matrix in which frequently appearing words are weighted [2]; the resulting matrix is sparse, with most elements being zeros. Certain words do not contribute to the meaning of the text, and the parser removes them by stopword elimination, followed by lemmatization or stemming. Pre-processing also eliminates words shorter than three letters, alphanumeric tokens and numbers. With the weighted information, documents are ranked by their similarity to a query, using the cosine of the angle between the vectors as the similarity measure.

Two common datasets have been used in this study: 20Newsgroups (20NG) and R8, the eight most frequent classes of the Reuters-21578 collection. Each dataset is split into training and test documents. Of the 18821 documents in 20NG, 60% are taken as the training set and the rest for testing; from R8, 70% of the documents are taken as the training set. Using the training set for learning, the supervised techniques attempt to assign new, unknown documents to their correct class labels [8].

VI. SUPERVISED VS UNSUPERVISED FEATURE SELECTION

All tests have been performed in the MATLAB computing environment. Before any analysis is undertaken, it is worthwhile to study the characteristics of the data collections used. R8, the eight most frequent classes of the Reuters collection, is smaller than the 20Newsgroups collection. The general procedure is to train the classifier on the training data; accuracy is judged by the number of test documents labelled correctly after the learning phase. In FCM, since learning evolves over the iterations, no bifurcation of the data is necessary. In this study the techniques are assessed on accuracy, macro-averaged precision and recall, and training and testing times [6]. The datasets are divided into training and test documents in 70:30 and 60:40 ratios for R8 and 20NG respectively. With the class-label information from the training documents, the test documents can be classified. Initially the datasets are labelled by k-means after feature selection; this is used as a point of reference when clustering with the Fuzzy C-Means algorithm is applied. A detailed comparison of the results shows that when features are clustered using FCM, execution is faster and the datasets are classified more accurately, the savings coming mainly from foregoing the training and testing times.

Dataset   Collection       Classes   Train Docs   Test Docs   Total Docs
20NG      20Newsgroups     20        11293        7528        18821
R8        Reuters-21578    8         5485         2189        7674

Figure 1. Datasets used in the study, showing the number of classes in each collection and the division into training and test documents.

VII. IMPLEMENTATION

All the statistics shown have been derived after data pre-processing followed by dimensionality reduction. The resulting dataset is then classified using the k-means method. The performance of the dimensionality reduction algorithms is graded by comparing the evaluation measures of recall, classification accuracy and precision. Figures 2 and 3 show the result of classification using both the supervised and unsupervised techniques on the datasets under study.

          Recall   Accuracy   Precision
PCA       76.27    81.76      88.35
LSI       83.50    89.70      87.64
FCM       85.12    92.37      91.75

Figure 2. Collation of the performance measures for the R8 dataset.

FCM performs best in terms of accuracy, precision and recall when classifying the R8 dataset, with an approximate increase in accuracy of 4% over LSI and 8% over PCA. On 20Newsgroups, FCM improves accuracy by about 3% over LSI and close to 10% over PCA.

          Recall   Accuracy   Precision
PCA       75.37    73.29      77.64
LSI       81.44    88.87      84.79
FCM       84.25    92.63      87.60

Figure 3. Collation of the performance measures for the 20Newsgroups collection.

Figure 4. Bar chart comparing the performance measures of the algorithms on the R8 dataset (values from 0 to 100).

Figure 5. Bar chart comparing the performance measures of the algorithms on the 20Newsgroups dataset (values from 0 to 100).
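For reference, the reported measures can be computed as follows in Python. The label arrays are dummy placeholders, and the use of scikit-learn's macro averaging is an assumption about how the macro-averaged precision and recall were obtained; the study itself ran in MATLAB.

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # Dummy true and predicted class labels standing in for a test split
    y_true = [0, 0, 1, 1, 2, 2, 2, 3]
    y_pred = [0, 1, 1, 1, 2, 2, 3, 3]

    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, average="macro"))
    print("Recall   :", recall_score(y_true, y_pred, average="macro"))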
Figure 6. Clustering of the R8 dataset using the FCM algorithm.

Figure 7. Clustering of the 20NG dataset using the FCM algorithm.

Figures 6 and 7 show the clustering of the data points using hard c-means clustering, in which every point can be a member of only one cluster and dual membership is not allowed. FCM has grouped the datasets into four clusters, with the cluster centres marked by slightly larger, darker circles. In the R8 collection the terms are not very strongly bound to their cluster centres, and features belonging to one cluster could share membership with another cluster as well. In the 20NG dataset the members are closely tied to their cluster centres.

VIII. CONCLUSION

This study evaluated the effectiveness of feature selection techniques for dimensionality reduction in text categorization. Both supervised and unsupervised techniques were experimented on. Of the evolutionary techniques, Latent Semantic Indexing exhibits superior performance over reduction using Principal Component Analysis in terms of precision, accuracy and recall. With unsupervised feature clustering using FCM, an improvement in accuracy over both LSI and PCA is observed, and clustering with FCM is also faster since the training and testing times are minimized. With FCM, at least 80% of the terms can be removed without degrading the resulting classification. The datasets under study show some fundamental differences: classification results vary depending on which features are eliminated, the relations within the data, and the factors used to assess similarity in combination with the classification method employed. Given a choice of proven feature extraction methods, accuracy may improve further when two feature selection algorithms are used in conjunction with each other. Since the characteristics of every document collection differ, a classification algorithm tailored to the type of document content could also be devised. The performance of the feature extraction algorithms can further be estimated using other efficacy measures.

REFERENCES
[1] Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Addison-Wesley, Reading, Massachusetts, USA, 1999.
[2] Wei Xu, Xin Liu, Yihong Gong. Document Clustering Based On Nonnegative Matrix Factorization. In ACM SIGIR, Toronto, Canada, 2003.
[3] Ian Soboroff. IR Models: The Vector Space Model. Information Retrieval Lecture 7.
[4] http://www.csee.umbc.edu/_ian/irF02/lectures/07 Models-VSM.pdf
[5] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the Society for Information Science, 41:391-407, 1990.
[6] Marko Grobelnik, Dunja Mladenic, J. Stefan Institute, Slovenia. Text-Mining Tutorial.
[7] Clara Yu, John Cuadrado, Maciej Ceglowski, J. Scott Payne. Patterns in Unstructured Data: Discovery, Aggregation, and Visualization. A Presentation to the Andrew W. Mellon Foundation.
[8] Yang, Y. and Pedersen, J. A Comparative Study on Feature Selection in Text Categorization. ICML 1997: 412-420.
[9] Lindsay I. Smith. A Tutorial on Principal Components Analysis.
[10] Ana Cardoso-Cachopo. Improving Methods for Single-label Text Categorization. PhD Thesis, October 2007.
[11] K. Sathiyakumari, V. Preamsudha, G. Manimekalai. Unsupervised Approach for Document Clustering Using Modified Fuzzy C-Means Algorithm. International Journal of Computer & Organization Trends, Volume 11, Issue 3, 2011.
[12] R. Rajendra Prasath, Sudeshna Sarkar. Unsupervised Feature Generation using Knowledge Repositories for Effective Text Categorization. ECAI 2010: 1101-1102.
[13] Nogueira, T.M. On the Use of Fuzzy Rules to Text Document Classification. 2010 10th International Conference on Hybrid Intelligent Systems (HIS), 23-25 Aug 2010, Atlanta, US.

Author Profile

Asha Kurian is a research scholar in the Department of Computer Applications, Dr. MGR University, Chennai. She completed her post-graduation in Computer Applications at Coimbatore Institute of Technology, Bharathiar University, in 2003. Her areas of interest include Data Mining and Artificial Intelligence.

M.S. Josephine works in the Department of Computer Applications, Dr. MGR University, Chennai. She received her Master's degree (MCA) from St. Joseph's College, Bharathidasan University, her M.Phil (Computer Science) from Periyar University, and her Doctorate in Computer Applications from Mother Teresa University. Her research interests include Software Engineering, Expert Systems, Networks and Data Mining.