A two stage feature selection method for text categorization
A two-stage feature selection method for text
categorization by using information gain, principal
component analysis and genetic algorithm
The application implements a two-stage feature selection
method for text categorization using information
gain, principal component analysis and a genetic
algorithm. Due to the increasing number of documents in
digital form, automated text categorization has become
more promising in recent decades.
Two-stage feature selection and feature extraction
are used to reduce the high dimensionality of a
feature space composed of a large number of
terms, remove redundant and irrelevant features
from the feature space, and thereby improve the
performance of text categorization.
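The information gain criterion named above can be sketched as follows. This is a minimal illustration, not the application's actual code: it assumes each term is treated as a binary feature (present or absent in a document) and that per-class document counts are already available. The class and method names (InformationGain, informationGain, entropy) are hypothetical.

```java
import java.util.Arrays;

public class InformationGain {

    // Shannon entropy (base 2) of a histogram of per-class document counts.
    public static double entropy(int[] counts) {
        int total = Arrays.stream(counts).sum();
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // withTerm[c]    = number of class-c documents that contain the term
    // withoutTerm[c] = number of class-c documents that do not contain it
    // Returns H(C) - [P(t) * H(C | t) + P(not t) * H(C | not t)].
    public static double informationGain(int[] withTerm, int[] withoutTerm) {
        int numClasses = withTerm.length;
        int[] all = new int[numClasses];
        for (int c = 0; c < numClasses; c++) {
            all[c] = withTerm[c] + withoutTerm[c];
        }
        int nWith = Arrays.stream(withTerm).sum();
        int nWithout = Arrays.stream(withoutTerm).sum();
        int n = nWith + nWithout;
        if (n == 0) {
            return 0.0;
        }
        double conditional = ((double) nWith / n) * entropy(withTerm)
                + ((double) nWithout / n) * entropy(withoutTerm);
        return entropy(all) - conditional;
    }

    public static void main(String[] args) {
        // Toy counts for two classes (assumed data, for illustration only).
        // A term that appears only in class 0 separates the classes well.
        System.out.println(informationGain(new int[]{5, 0}, new int[]{0, 5}));
        // A term spread evenly over both classes carries no class information.
        System.out.println(informationGain(new int[]{2, 2}, new int[]{3, 3}));
    }
}
```

In a first selection stage, terms could be ranked by this score and only the highest-scoring ones passed on to the second stage; the cut-off is a design choice the source does not specify.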
User classes & characteristics
There are two user modules, viz. decision tree and KNN.
Decision tree: The first phase is tree growing, where a
tree is built by greedily splitting each tree node. Because
the tree can overfit the training data, in the second
phase the overfitted branches of the tree are removed.
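The greedy splitting step of the growing phase can be sketched as below. This is a hedged illustration under stated assumptions: documents are represented by binary term-presence features, there are two classes, and Gini impurity is used as the split criterion (the source does not name a criterion). Pruning is not shown. The names GreedySplit, gini and bestSplit are hypothetical.

```java
public class GreedySplit {

    // Gini impurity of a two-class node with the given class counts.
    public static double gini(int pos, int neg) {
        int n = pos + neg;
        if (n == 0) {
            return 0.0;
        }
        double p = (double) pos / n;
        return 2.0 * p * (1.0 - p);
    }

    // docs[i][j] is true when document i contains term j; labels[i] is its class.
    // Assumes docs is non-empty. Returns the index of the term whose
    // present/absent split yields the lowest weighted impurity.
    public static int bestSplit(boolean[][] docs, boolean[] labels) {
        int numTerms = docs[0].length;
        int best = -1;
        double bestImpurity = Double.POSITIVE_INFINITY;
        for (int j = 0; j < numTerms; j++) {
            int posWith = 0, negWith = 0, posWithout = 0, negWithout = 0;
            for (int i = 0; i < docs.length; i++) {
                if (docs[i][j]) {
                    if (labels[i]) posWith++; else negWith++;
                } else {
                    if (labels[i]) posWithout++; else negWithout++;
                }
            }
            int nWith = posWith + negWith;
            int nWithout = posWithout + negWithout;
            // Impurity of the two child nodes, weighted by their sizes.
            double weighted = (nWith * gini(posWith, negWith)
                    + nWithout * gini(posWithout, negWithout)) / docs.length;
            if (weighted < bestImpurity) {
                bestImpurity = weighted;
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data (assumed): term 1 matches the class label exactly, term 0 does not.
        boolean[][] docs = {
            {true, true}, {false, true}, {true, false}, {false, false}
        };
        boolean[] labels = {true, true, false, false};
        System.out.println("Best term to split on: " + bestSplit(docs, labels));
    }
}
```

A full tree grower would apply bestSplit recursively to each child node until a stopping rule is met, which is where the overfitting mentioned above can arise.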
KNN classifier: The KNN classifier ranks the document’s
neighbors among the training documents and uses the
class labels of the k most similar neighbors. Similarity
between two documents may be measured by the
Euclidean distance, cosine measure, etc.
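The ranking-and-voting procedure described above can be sketched as follows. This is a minimal sketch, assuming documents are already converted to equal-length numeric term vectors, cosine similarity is the chosen measure, and the predicted class is decided by a simple majority vote among the k nearest neighbors (the source does not state how the k labels are combined). The names KnnTextClassifier, cosine and classify are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class KnnTextClassifier {

    // Cosine similarity between two term vectors of equal length.
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Ranks the training documents by similarity to the query and returns the
    // majority label among the k most similar ones. When counts tie, the label
    // that reached the winning count first (the closer neighbors) is kept.
    public static String classify(double[][] train, String[] labels, double[] query, int k) {
        double[] sims = new double[train.length];
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < train.length; i++) {
            sims[i] = cosine(train[i], query);
            order[i] = i;
        }
        // Sort document indices by descending similarity.
        Arrays.sort(order, (x, y) -> Double.compare(sims[y], sims[x]));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (int r = 0; r < Math.min(k, order.length); r++) {
            String label = labels[order[r]];
            int count = votes.merge(label, 1, Integer::sum);
            if (best == null || count > votes.get(best)) {
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy two-term vectors and labels (assumed data, for illustration only).
        double[][] train = {{1.0, 0.0}, {0.9, 0.1}, {0.0, 1.0}, {0.1, 0.9}};
        String[] labels = {"sports", "sports", "politics", "politics"};
        System.out.println(classify(train, labels, new double[]{1.0, 0.2}, 3));
    }
}
```

Swapping cosine for a Euclidean distance would only require replacing the similarity function and sorting in ascending order of distance.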
This application is developed on the Java platform and will be hosted on a
system running the Java JDK and the Tomcat server. The system will
primarily be developed and tested on Windows operating systems,
but our goal is to make it a platform-independent solution. The target
Microsoft Windows &
Design and Implementation
All design and coding will be done on the Java
platform. However, the application could also be
implemented in C#.NET.
Assumptions and Dependencies
Since the application is based on the Java platform, we assume that the user's
system has a JVM installed to run this application.
Hard disk: 80 GB
Processor: Intel Pentium IV
Tools: NetBeans
Operating System: Windows
EXTERNAL INTERFACE REQUIREMENTS
User Interfaces: The application is accessible through a web browser and interacts
with its users through a web component interface. There are two types of users for this
system, the retail manager or analyst and the customer; each interacts with the system through
the following UIs.
Main screen: This interface shows options according to the user type.
For analysts, the options relate to the type of analysis they want to perform:
Method wise analysis
Decision tree analysis
KNN classifier analysis
For each of the above analyses, a separate screen shows advanced
options for that analysis, such as the following:
There are buttons for choosing the format in which the output should be displayed: graphical formats
such as pie charts and bar graphs, or tabular format.
On this screen the output is produced in graphical format with a proper description,
along with options to save the result for further use, compare it with old results, or
discard it if it is of no use.
Version Number: Version 6.0
Version Number: Version 7.0.1
The system must use MySQL server as its database.
Version Number: Version 6 onward
The system will use the Apache Tomcat server as the main
means of communication through the internet/network.
• The system can produce results faster with 4 GB of RAM.
• It may take more time under peak load at the main node.
• The system will be available 100% of the time. If
a fatal error occurs, the system will provide
understandable feedback to the user.
Safety and Security Requirements
• All data will be backed up automatically every day, and the
system administrator can also back up the data as a function for
• The system is designed in modules where errors can be
detected and fixed easily. This makes it easier to install and
update new functionality if required.
Software Quality Attributes
Usability : The application seems to be user friendly since the GUI is
Maintainability : This application can be maintained over a long period of
time since it will be implemented on the Java platform.
Reusability : The application can be reused by extending it with
new modules.
Performance : The application seems to perform
faster with 4 GB of RAM. However, the basic
requirement to run the application is 1 GB.
Security : Since the application is developed in Java, it is expected to be
more secure than applications developed in other environments.
The application is platform independent since it is
developed in Java.
The behavior of the application is user friendly since the
GUI is compatible with all operating environments.
Since the application performs several tasks at the same
time, it may take a long time to generate output.
Spam filtering, a process which tries to distinguish spam
email messages from legitimate emails
Email routing, forwarding an email sent to a general address to a
specific address or mailbox depending on topic.
Language identification, automatically determining the language of a
Genre classification, automatically determining the genre of a text
Readability assessment, automatically determining the degree of
readability of a text, either to find suitable materials for different age
groups or reader types or as part of a larger text