Abstract— Text categorization is the process of identifying and assigning a predefined class to which a document belongs. A wide variety of algorithms is currently available to perform text categorization; among them, the K-Nearest Neighbor text classifier is one of the most commonly used. It tests the degree of similarity between a test document and k training documents, thereby determining the category of the test document. In this paper, an improved K-Nearest Neighbor algorithm for text categorization is proposed. In this method, text is categorized into different classes based on the K-Nearest Neighbor algorithm combined with constrained one-pass clustering, which provides an effective strategy for categorizing text. This improves the efficiency of the K-Nearest Neighbor algorithm by generating a classification model. Text classification using the K-Nearest Neighbor algorithm has a wide variety of text mining applications.
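The kNN step the abstract describes can be sketched in a few lines: represent each document as a term-frequency vector, rank training documents by cosine similarity to the test document, and take a majority vote among the k nearest. This is a minimal illustration, not the paper's implementation; the toy vocabulary and labels are invented for the example.

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency dicts.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_doc, training, k=3):
    # training: list of (term-frequency dict, label) pairs.
    # Rank training documents by similarity to the test document,
    # then take a majority vote among the k nearest.
    neighbors = sorted(training, key=lambda dl: cosine(test_doc, dl[0]),
                       reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def tf(text):
    # Bag-of-words term frequencies for a short text.
    return Counter(text.lower().split())

train = [(tf("stock market shares trading"), "finance"),
         (tf("market prices rise on earnings"), "finance"),
         (tf("team wins the championship game"), "sports"),
         (tf("player scores in final game"), "sports")]
print(knn_classify(tf("shares rise as market rallies"), train, k=3))
```

The constrained one-pass clustering of the proposed method would replace the raw training set here with a smaller set of cluster representatives, reducing the number of similarity computations per query.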
Using Class Frequency for Improving Centroid-based Text Classification (IDES Editor)
Most previous works on text classification represented the importance of terms by term occurrence frequency (tf) and inverse document frequency (idf). This paper presents ways to apply class frequency in centroid-based text categorization. Three approaches are taken into account. The first is to explore the effectiveness of inverse class frequency on the popular term weighting scheme TFIDF, both as a replacement of idf and as an addition to TFIDF. The second approach is to evaluate some functions used to adjust the power of inverse class frequency. The third approach is to apply terms found in only one class or a few classes to improve classification performance, using two-step classification. The results show that class frequency is useful for text classification, especially in the two-step classification.
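The inverse class frequency the abstract refers to can be sketched as follows: by analogy with idf, a term seen in few classes gets a high weight and a term spread across many classes gets a low one. The formula icf(t) = log(C / cf(t)) and the toy data are illustrative assumptions, not taken from the paper.

```python
import math
from collections import defaultdict

def inverse_class_frequency(docs):
    # docs: list of (set-of-terms, class-label) pairs.
    # icf(t) = log(C / cf(t)), where C is the number of classes and
    # cf(t) is the number of classes in which term t occurs at least once.
    classes_with_term = defaultdict(set)
    all_classes = set()
    for terms, label in docs:
        all_classes.add(label)
        for t in terms:
            classes_with_term[t].add(label)
    C = len(all_classes)
    return {t: math.log(C / len(cs)) for t, cs in classes_with_term.items()}

docs = [({"goal", "match", "team"}, "sports"),
        ({"team", "league"}, "sports"),
        ({"election", "vote", "team"}, "politics"),
        ({"market", "stocks"}, "finance")]
icf = inverse_class_frequency(docs)
# "team" occurs in 2 of 3 classes -> low weight;
# "goal" occurs in only 1 class -> high weight.
```

Replacing idf with icf in TFIDF would mean weighting each term by tf * icf instead of tf * idf; adding it would mean tf * idf * icf.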
Mapping Subsets of Scholarly Information (Paul Houle)
We illustrate the use of machine learning techniques to analyze, structure, maintain,
and evolve a large online corpus of academic literature. An emerging field of research
can be identified as part of an existing corpus, permitting the implementation of a
more coherent community structure for its practitioners.
Most text classification problems are associated with multiple class labels, and hence automatic text classification is one of the most challenging and prominent research areas. Text classification is the problem of categorizing text documents into different classes. In the multi-label classification scenario, each document may be associated with more than one label. The real challenge in multi-label classification is the labelling of a large number of text documents with a subset of class categories. The feature extraction and classification of such text documents require an efficient machine learning algorithm that performs automatic text classification. This paper describes the multi-label classification of product review documents using a Structured Support Vector Machine.
Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from machine learning, natural language processing (NLP), data mining, information retrieval (IR), and knowledge management. Text mining involves the pre-processing of document collections, such as information extraction, term extraction, text categorization, and the storage of intermediate representations. Techniques such as clustering, distribution analysis, association rules, and visualisation of the results are then used to analyse these intermediate representations.
International Journal of Engineering Research and Development (IJERD), IJERD Editor
The Text Classification slides contain research results about possible natural language processing algorithms. Specifically, they contain a brief overview of the natural language processing steps, the common algorithms used to transform words into meaningful vectors/data, and the algorithms used to learn from and classify the data.
To learn more about RAX Automation Suite, visit: www.raxsuite.com
Text Categorization using N-grams and Hidden-Markov-Models (Thomas Mathew)
In this paper I discuss an approach for building a soft text classifier based on a Hidden-Markov-Model. The approach treats a multi-category text classification task as predicting the best possible hidden sequence of categories based on the observed sequence of text tokens. This method considers the possibility that different sections of a large block of text may hint towards different yet related text categories, and the HMM predicts such a sequence of categories. The most probable such sequence can be estimated using the Viterbi algorithm.
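The Viterbi decoding mentioned above can be sketched directly: given per-category start, transition, and token-emission probabilities, it recovers the most probable hidden category for each observed token. The two categories and all probabilities below are invented for illustration and are not from the paper.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    # Standard Viterbi: for each observed token, keep the best
    # log-probability path ending in each hidden state (category).
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-9))
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t-1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t-1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s].get(obs[t], 1e-9)))
            back[t][s] = best_prev
    # Trace back from the best final state to recover the category sequence.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

states = ["sports", "finance"]
start = {"sports": 0.5, "finance": 0.5}
trans = {"sports": {"sports": 0.8, "finance": 0.2},
         "finance": {"sports": 0.2, "finance": 0.8}}
emit = {"sports": {"game": 0.5, "team": 0.4, "market": 0.1},
        "finance": {"market": 0.5, "stocks": 0.4, "game": 0.1}}
print(viterbi(["team", "game", "market"], states, start, trans, emit))
```

Note how the category is allowed to change mid-text: the sticky transition probabilities (0.8 self-transition) model the idea that adjacent sections usually, but not always, share a category.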
Review of Various Text Categorization Methods (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publication of high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
Icete content-based filtering with applications on TV viewing data (Elaine Cecília Gatto)
Recommendation systems provide recommendations based on information about users' preferences. Recommendation systems use Information Filtering so that information can be processed and suggested to users, and Content-Based Filtering is an Information Filtering approach widely used in recommendation systems. Content-Based Filtering analyses the correlation of item content with the user's profile, suggesting relevant items and discarding irrelevant ones. Recommendation systems, which are widely used on the Internet, have been studied for use in the Digital TV context, and there are already several works in this direction. As on the Internet, recommendation systems can be used in Digital TV to recommend TV programs, publicity and advertisement, and electronic commerce. Thus, within the Digital TV context, the items can be programs, advertisements, and products to be sold; using Content-Based Filtering to recommend programs, for instance, the programs' contents can be correlated with the user's preferences, which in this scenario are the types of programs one wants to watch. This paper presents studies of Content-Based Filtering applied to Digital TV data. The survey aims at observing and evaluating how some content-based filtering techniques can be used in recommendation systems in the Digital TV context.
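The content-based filtering loop described here can be sketched simply: build a user profile by accumulating term vectors of watched programs, then rank unseen programs by cosine similarity to that profile. The program names and terms below are invented for illustration; real systems would use richer metadata (genre, cast, EPG descriptions).

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse term vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_profile(watched):
    # The user profile is the sum of the term vectors of
    # programs the user has watched.
    profile = Counter()
    for v in watched:
        profile.update(v)
    return profile

def recommend(profile, catalog, n=1):
    # Rank catalog programs by similarity to the user profile.
    ranked = sorted(catalog, key=lambda nv: cosine(profile, nv[1]), reverse=True)
    return [name for name, _ in ranked[:n]]

vec = lambda *terms: Counter(terms)
watched = [vec("football", "sport", "live"), vec("tennis", "sport")]
catalog = [("Champions League Final", vec("football", "sport", "live")),
           ("Cooking Tonight", vec("food", "cooking")),
           ("Grand Slam Highlights", vec("tennis", "sport"))]
profile = build_profile(watched)
print(recommend(profile, catalog, n=1))
```

Irrelevant items (here, the cooking program, with zero similarity) fall to the bottom of the ranking, which is exactly the "putting away irrelevant items" behaviour the abstract describes.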
Jake Mannix, Lead Data Engineer, Lucidworks at MLconf SEA - 5/20/16 (MLconf)
Smarter Search With Spark-Solr: Search gets smarter when you know more about your documents and their relationship to each other (think: PageRank) and the users (i.e. popularity), in addition to what you already know about their content (text search). It also gets smarter when you know more about your users (personalization) and both their affinity for certain kinds of content and their similarities to each other (collaborative filtering recommenders).
Building all of these pieces typically requires a big mix of batch workloads to do log processing, as well as training machine-learned models to use during realtime querying; the pieces are highly domain specific, but many techniques are fairly universal. We will discuss how Spark can interface with a Solr Cloud cluster to efficiently perform many of the pieces of this puzzle in one relatively self-contained package (no HDFS/S3, all data stored in Solr!), and introduce "spark-solr", an open-source JVM library to facilitate this.
Kristian Kersting, Associate Professor for Computer Science, TU Dortmund Univ... (MLconf)
Declarative Programming for Statistical ML: The democratization of complex data does not mean dropping the data on everyone’s desk and saying, “good luck”! It means to make machine learning methods usable in such a way that people can easily instruct machines to have a “look” at complex data and help them to understand and act on it. Existing statistical relational learning and probabilistic programming languages only provide partial solutions; most of them do not support convex optimization commonly used in machine learning.
In this talk I will present RELOOP, a declarative mathematical programming language embedded into Python. It allows the user to specify mathematical programs before she knows what individuals are in the domain and, therefore, before she knows what variables and constraints exist. It facilitates the formulation of abstract, general knowledge. And it reveals the rich logical structure underlying many machine learning problems to the solver and, in turn, may make it go faster.
With this, people can start to rapidly develop statistical machine learning approaches for complex data. For instance, adding just three lines of RELOOP code makes a linear support vector machine aware of any underlying network that connects the objects to be classified.
Joint work with Martin Mladenov and many others.
Jason Baldridge, Associate Professor of Computational Linguistics, University... (MLconf)
Disambiguating Explicit and Implicit Geographic References in Natural Language: When people speak, both they and their utterances are situated in place and time. Our utterances reflect where we are from, where we are right now, and where we are talking about—among many other things, including personality, social status, and the topics under discussion. As hearers, we naturally incorporate locational awareness into our understanding of what we are told. That is to say, we geographically ground the meaning of natural language utterances. In this talk, I will discuss both toponym resolution (e.g., identifying which Springfield is intended in a passage) and general text geolocation—deriving geographical gists from free-running text that perhaps has no explicit mentions of places. I’ll cover supervised and semi-supervised methods for solving these tasks. I’ll also briefly discuss how this work might generalize the now commonplace treatment of words-as-vectors into computational models of word meaning that use multi-dimensional representations over words, geography, time, and images.
Igor Markov, Software Engineer, Google at MLconf SEA - 5/20/16 (MLconf)
Can AI Become a Dystopian Threat to Humanity? – A Hardware Perspective: Viewing future AI as a possible threat to humanity has long become common in the movie industry, while some serious thinkers (Hawking, Musk) have also promoted this perspective, even though prominent ML experts don't see this happening any time soon. Why is this topic attracting so much attention? What can we learn from the past? This talk draws attention to physical limitations of possible threats, such as energy sources and the ability to reproduce. These limitations can be made more reliable and harder to circumvent, while the hardware of future AI systems can be designed with particular attention to physical limits.
Amanda Casari, Senior Data Scientist, Concur at MLconf SEA - 5/20/16 (MLconf)
Scaling Data Science Products, Not Data Science Teams: Congratulations! Your data science feature works! Your metrics are outstanding. Your data scientists and engineers have created useful products for your customers, with proven results. Now your product and marketing teams are ready to move into new markets. Your underlying population changes. The skewed statistics of your data shift depending on the data center you analyze. Your product is now localized, and your NLP methods must adjust for a greater range of languages. Your requirements have grown. Your team has not.
How do you scale success for global products in multiple data centers with small teams?
The Data Science team at Concur has grown our products into international markets without a global scaling of resources. This talk will share our lessons learned in creating modular, reusable data science products deployable to international, segregated data centers.
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16 (MLconf)
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as they require training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called "H2O Ensemble", built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forests, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time: the Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
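The stacking idea described here can be sketched without any ML library: generate out-of-fold (cross-validated) predictions from each base learner, then let a metalearner combine them. In this deliberately tiny sketch the "base learners" are one-feature threshold rules and the "metalearner" is an accuracy-weighted vote; these stand-ins, and all data, are illustrative assumptions, not the H2O Ensemble implementation.

```python
def cross_val_preds(fit, predict, X, y, folds=2):
    # Out-of-fold predictions for one base learner (the "level-one" data).
    preds = [None] * len(X)
    for f in range(folds):
        train_idx = [i for i in range(len(X)) if i % folds != f]
        test_idx = [i for i in range(len(X)) if i % folds == f]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            preds[i] = predict(model, X[i])
    return preds

def fit_feature(k):
    # Base learner: threshold feature k at its training mean
    # (a toy stand-in for, e.g., a GBM or a Random Forest).
    def fit(X, y):
        thr = sum(x[k] for x in X) / len(X)
        return (k, thr)
    return fit

def predict_feature(model, x):
    k, thr = model
    return 1 if x[k] > thr else 0

def stack(X, y, base_fits):
    # Metalearner: weight each base learner by its out-of-fold accuracy,
    # then combine base predictions with a weighted vote.
    level_one = [cross_val_preds(f, predict_feature, X, y) for f in base_fits]
    weights = [sum(p == t for p, t in zip(preds, y)) / len(y)
               for preds in level_one]
    models = [f(X, y) for f in base_fits]
    def predict(x):
        score = sum(w * predict_feature(m, x) for w, m in zip(weights, models))
        return 1 if score > sum(weights) / 2 else 0
    return predict

X = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.2)]
y = [0, 0, 1, 1]
predict = stack(X, y, [fit_feature(0), fit_feature(1)])
```

The cross-validation step is what makes stacking cost what the abstract says it costs: every base learner is trained once per fold plus once on the full data before the metalearner can be fit.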
Evan Estola, Lead Machine Learning Engineer, Meetup at MLconf SEA - 5/20/16 (MLconf)
When Recommendation Systems Go Bad: Machine learning and recommendation systems have changed the way we interact with not just the internet, but some of the basic products and services that we use to run our lives.
While the reach and impact of big data and algorithms will continue to grow, how do we ensure that people are treated justly? Certainly there are already algorithms in use that determine if someone will receive a job interview or be accepted into a school. Misuse of data in many of these cases could have serious public relations, legal, and ethical consequences.
As the people that build these systems, we have a social responsibility to consider their effect on humanity, and we should do whatever we can to prevent these models from perpetuating some of the prejudice and bias that exist in our society today.
In this talk I intend to cover some examples of recommendation systems that have gone wrong across various industries, as well as why they went wrong and what can be done about it. The first step towards solving this larger issue is raising awareness, but there are concrete technical approaches that can be employed as well. Three that will be covered are:
- Accepting simplicity with interpretable models.
- Data segregation via ensemble modelling.
- Designing test data sets for capturing unintended bias.
Avi Pfeffer, Principal Scientist, Charles River Analytics at MLconf SEA - 5/2... (MLconf)
Practical Probabilistic Programming with Figaro: Probabilistic reasoning enables you to predict the future, infer the past, and learn from experience. Probabilistic programming enables users to build and reason with a wide variety of probabilistic models without machine learning expertise. In this talk, I will present Figaro, a mature probabilistic programming system with many applications. I will describe the main design principles of the language and show example applications. I will also discuss our current efforts to fully automate and optimize the inference process.
Novel text categorization by amalgamation of augmented k nearest neighbourhoo... (ijcsity)
Machine learning for text classification is the underpinning of document cataloging, news filtering, document steering and exemplification. In the text mining realm, effective feature selection is significant to make the learning task more accurate and competent. One of the traditional lazy text classifiers, k-Nearest Neighborhood (kNN), has a major pitfall in calculating the similarity between all the objects in the training and testing sets, thereby leading to exaggeration of both the computational complexity of the algorithm and massive consumption of main memory. To diminish these shortcomings from the viewpoint of a data-mining practitioner, an amalgamative technique is proposed in this paper using a novel restructured version of kNN called Augmented kNN (AkNN) and k-Medoids (kMdd) clustering. The proposed work comprises preprocesses on the initial training set by imposing attribute feature selection for reduction of high dimensionality; it also detects and excludes the high-flier samples in the initial training set and restructures a constricted training set. The kMdd clustering algorithm generates the cluster centers (as interior objects) for each category and restructures the constricted training set with centroids. This technique is amalgamated with the AkNN classifier that was prearranged with text mining similarity measures. Eventually, significant weights and ranks were assigned to each object in the new training set based upon their accessory towards the objects in the testing set. Experiments conducted on Reuters-21578, a UCI benchmark text mining data set, and comparisons with the traditional kNN classifier designate that the referred method yields preeminent recital in both clustering and classification.
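The core idea in this abstract, replacing the full training set with per-category k-medoids centers and then classifying against that condensed set with kNN, can be sketched as follows. This is a rough illustration of the general hybrid scheme under assumed details (Euclidean distance, toy 2-D "documents"), not the paper's AkNN algorithm.

```python
import math
from collections import Counter

def dist(a, b):
    return math.dist(a, b)

def k_medoids(points, k, iters=10):
    # Plain k-medoids: assign points to the nearest medoid, then move
    # each medoid to the member minimizing total in-cluster distance.
    medoids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, medoids[i]))].append(p)
        medoids = [min(c, key=lambda m: sum(dist(m, p) for p in c)) if c else medoids[i]
                   for i, c in enumerate(clusters)]
    return medoids

def build_condensed_set(training, k=1):
    # Replace each category's documents with its k medoids,
    # producing a much smaller (constricted) training set.
    by_class = {}
    for vec, label in training:
        by_class.setdefault(label, []).append(vec)
    return [(m, label) for label, pts in by_class.items()
            for m in k_medoids(pts, min(k, len(pts)))]

def knn(test, training, k=1):
    nearest = sorted(training, key=lambda vl: dist(test, vl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0.0, 0.1), "a"), ((0.1, 0.0), "a"), ((0.9, 1.0), "b"), ((1.0, 0.9), "b")]
condensed = build_condensed_set(train, k=1)   # two medoids instead of four docs
print(knn((0.2, 0.2), condensed))
```

The payoff is exactly the complexity reduction the abstract targets: kNN now compares a test document against one medoid per category per cluster instead of against every training document.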
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION (IJDKP)
This article introduces some approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, some categories very similar in content and related to the telecommunications, Internet, and computer areas were selected for model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
INCREASING ACCURACY THROUGH CLASS DETECTION: ENSEMBLE CREATION USING OPTIMIZE... (IJCSEA Journal)
ABSTRACT
Classifier ensembles have been used successfully to improve the accuracy rates of the underlying classification mechanisms. Through the use of aggregated classifications, it becomes possible to achieve lower error rates in classification than by using a single classifier instance. Ensembles are most often used with collections of decision trees or neural networks owing to their higher rates of error when used individually. In this paper, we consider a unique implementation of a classifier ensemble which utilizes kNN classifiers. Each classifier is tailored to detecting membership in a specific class using a best-subset selection process for variables. This provides the diversity needed to successfully implement an ensemble. An aggregating mechanism for determining the final classification from the ensemble is presented and tested against several well-known datasets.
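The per-class detector idea in this abstract can be sketched as follows: one one-vs-rest kNN detector per class, each restricted to its own feature subset (here the subsets are simply given; the best-subset search is assumed to have happened already), with the final label taken from the most confident detector. The data, feature subsets, and aggregation rule are illustrative assumptions, not the paper's exact mechanism.

```python
import math
from collections import Counter

def knn_score(x, training, k=3):
    # Fraction of the k nearest training points that are positive:
    # a simple confidence score for one class detector.
    nearest = sorted(training, key=lambda pl: math.dist(x, pl[0]))[:k]
    return sum(lab for _, lab in nearest) / k

def make_detector(X, y, target, features):
    # One-vs-rest kNN detector for a single class, restricted to the
    # feature subset chosen for that class.
    proj = lambda x: tuple(x[f] for f in features)
    training = [(proj(x), 1 if lab == target else 0) for x, lab in zip(X, y)]
    return lambda x: knn_score(proj(x), training)

def ensemble_classify(x, detectors):
    # Aggregate: predict the class whose detector is most confident.
    return max(detectors, key=lambda c: detectors[c](x))

X = [(0.0, 0.0, 0.5), (0.1, 0.1, 0.4), (1.0, 1.0, 0.5), (0.9, 1.1, 0.6)]
y = ["a", "a", "b", "b"]
detectors = {"a": make_detector(X, y, "a", features=(0,)),
             "b": make_detector(X, y, "b", features=(0, 1))}
print(ensemble_classify((0.05, 0.0, 0.5), detectors))
```

Giving each detector its own feature subset is what supplies the diversity the abstract mentions: the detectors disagree in informative ways, which the aggregation step can exploit.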
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars, and Students of related fields of Engineering and Technology.
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION (ijaia)
This paper explores the use of machine learning approaches, or more specifically, four supervised learning methods, namely Decision Tree (C4.5), K-Nearest Neighbour (KNN), Naïve Bayes (NB), and Support Vector Machine (SVM), for the categorization of Bangla web documents. This is a task of automatically sorting a set of documents into categories from a predefined set. Whereas a wide range of methods has been applied to English text categorization, relatively few studies have been conducted on Bangla language text categorization. Hence, we attempt to analyze the efficiency of those four methods for the categorization of Bangla documents. In order to validate the methods, a Bangla corpus from various websites has been developed and used as examples for the experiment. For Bangla, empirical results support that all four methods produce satisfactory performance, with SVM attaining good results in terms of high-dimensional and relatively noisy document feature vectors.
Text document clustering and similarity detection is a major part of document management, where every document should be identified by its key terms and domain knowledge. Based on similarity, the documents are grouped into clusters. Several approaches to document similarity calculation have been proposed in existing systems, but those systems are either term based or pattern based, and they suffer from several problems. To make a revolution in this challenging environment, the proposed system presents an innovative model for document similarity by applying a back-propagation time stamp algorithm. It discovers patterns in text documents as higher-level features and creates a network for fast grouping. It also detects the most appropriate patterns based on their weights, and BPTT performs the document similarity measures. Using this approach, documents can be categorized easily. In order to perform the above, a new approach is used, which helps to reduce training-process problems. The above framework is named BPTT; it has been implemented and evaluated on the .NET platform with different sets of datasets.
The International Journal of Engineering and Science (IJES) (theijes)
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Textual Data Partitioning with Relationship and Discriminative Analysis (Editor IJMTER)
Data partitioning methods are used to partition data values by similarity. Similarity measures are used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results, while partitional clustering produces results in grid format. Text documents are unstructured data values with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically with an unsuitable cluster count.
Textual data elements are divided into two types: discriminative words and nondiscriminative words. Only discriminative words are useful for grouping documents; the involvement of nondiscriminative words confuses the clustering process and leads to a poor clustering solution. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents. The DPM clustering model uses both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model. DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to estimate the discriminative word identification process. Concept relationships are analyzed with ontology support. A semantic weight model is used for the document similarity analysis. The system improves scalability with the support of labels and concept relations for the dimensionality reduction process.
This paper presents a review & performs a comparative evaluation of few known machine learning
algorithms in terms of their suitability & code performance on any given data set of any size. In this paper,
we describe our Machine Learning ToolBox that we have built using python programming language. The
algorithms used in the toolbox consists of supervised classification algorithms such as Naïve Bayes,
Decision Trees, SVM, K-nearest Neighbors and Neural Network (Backpropagation). The algorithms are
tested on iris and diabetes dataset and are compared on the basis of their accuracy under different
conditions. However using our tool one can apply any of the implemented ML algorithms on any dataset of
any size. The main goal of building a toolbox is to provide users with a platform to test their datasets on
different Machine Learning algorithms and use the accuracy results to determine which algorithms fits the
data best. The toolbox allows the user to choose a dataset of his/her choice either in structured or
unstructured form and then can choose the features he/she wants to use for training the machine We have
given our concluding remarks on the performance of implemented algorithms based on experimental
analysis
INTERNATIONAL JOURNAL FOR TRENDS IN ENGINEERING & TECHNOLOGY
VOLUME 4 ISSUE 2 – APRIL 2015 - ISSN: 2349-9303
Text Categorization Using Improved K Nearest Neighbor Algorithm
Femi Joseph
PG scholar
MEA Engineering College
Perinthalmanna, India
femy.j46@gmail.com
Nithin Ramakrishnan
Assistant Professor
MEA Engineering College
Perinthalmanna, India
nithinramakrishnan@gmail.com
Abstract— Text categorization is the process of identifying and assigning the predefined class to which a document belongs. A wide variety of algorithms are currently available to perform text categorization. Among them, the K-Nearest Neighbor text classifier is the most commonly used. It measures the degree of similarity between a test document and the k nearest training documents, thereby determining the category of the test document. In this paper, an improved K-Nearest Neighbor algorithm for text categorization is proposed. In this method, text is categorized into different classes based on the K-Nearest Neighbor algorithm and constrained one-pass clustering, which together provide an effective strategy for categorizing text. Generating a classification model in this way improves the efficiency of the K-Nearest Neighbor algorithm. Text classification using the K-Nearest Neighbor algorithm has a wide variety of text mining applications.
Index Terms— Text Categorization, K Nearest Neighbor classifier, One-pass clustering, document class, training data, test data,
classification model
1 INTRODUCTION
With the sudden growth of information available on the internet, it is easier to extract information than before. However, collecting useful and relevant information from huge repositories is a complex and time-consuming task. Here, Information Retrieval (IR) plays an important role. Information Retrieval is a process that collects relevant information from a large collection of resources with the help of keywords. But the amount of information obtained through Information Retrieval is often larger than a user can handle and manage. In such a case, the user must analyze the obtained results one by one until the desired information is found, which is time-consuming and inefficient. Therefore, tools are required to identify the desired documents automatically. Text categorization is one such possible implementation.
Text categorization can be done automatically or manually. In most cases, text categorization is done automatically. Automatic text categorization is the process of automatically assigning natural language texts to predefined categories based on their content. Without labeling training documents by hand, it can generate training sentence sets automatically from keyword lists of each category, which in turn reduces the time needed to extract the desired information. A large number of potential applications make it popular. For text categorization, a large number of machine learning, knowledge engineering, and probabilistic methods have been proposed. The most popular include Bayesian probabilistic methods, regression models, example-based classification, decision trees, decision rules, K Nearest Neighbor, Neural Networks (NNet), Support Vector Machines (SVM), centroid-based approaches, and association rule mining. Among all these algorithms, K-Nearest Neighbor is widely used as a text classifier because of its simplicity and efficiency.
K Nearest Neighbor classification finds a group of k objects from the training set that are closest to the test object, and bases the assignment of a label on the predominance of a particular class in this neighborhood. There are three key elements in this approach: a set of labeled objects, a distance or similarity metric to compute the distance between objects, and the value of k, the number of nearest neighbors. To classify an unlabeled object, the distance from the unlabeled object to each labeled object is calculated and its nearest neighbors are identified. The class labels of these nearest neighbors are then used to determine the class label of the object.
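The three elements above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bag-of-words tokenization, cosine similarity metric, and toy training set are our assumptions.

```python
from collections import Counter
import math

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words term-frequency dicts."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(test_doc, training_set, k=3):
    """Label a test document by majority vote among its k most similar
    labeled training documents (the three key elements: labeled objects,
    a similarity metric, and the value of k)."""
    test_vec = Counter(test_doc.split())
    scored = sorted(
        ((cosine_similarity(test_vec, Counter(doc.split())), label)
         for doc, label in training_set),
        reverse=True,
    )
    top_k_labels = [label for _, label in scored[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Toy training set: two documents per class
train = [
    ("the striker scored a goal in the match", "sports"),
    ("the team won the football match", "sports"),
    ("the government passed a new tax law", "politics"),
    ("parliament debated the new law", "politics"),
]
print(knn_classify("the team scored in the football match", train, k=3))  # sports
```

Note that every call scans the entire training set; this linear pass over all training documents is exactly the inefficiency the proposed classification model is meant to remove.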
The main contributions of this paper are as follows: 1) It proposes a new text categorization technique based on an existing KNN text classifier algorithm and constrained one-pass clustering. The resulting text categorization system performs as well as or better than one of the best performers in this task (i.e., KNN). 2) It is well suited to applications where data is updated dynamically and real-time performance is rigorously required, because the method incrementally updates the classification model. 3) The method significantly outperforms KNN and is more effective.
3 PROBLEM STATEMENT
K Nearest Neighbor (KNN) is an efficient algorithm for text categorization. Its main problem is that it does not generate a classification model; instead, it uses all the training data to categorize each input test document. By incorporating constrained one-pass clustering, the KNN algorithm can overcome this limitation. In the improved method, constrained one-pass clustering is used to generate the classification model.
The classification model allows the KNN algorithm to work efficiently by reducing the time it takes to categorize the input text. In addition, the improved method updates the classification model incrementally as new data arrives.
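One plausible reading of constrained one-pass clustering is sketched below. This is a hypothetical illustration under our own assumptions (bag-of-words vectors, cosine similarity, a fixed similarity threshold; the function names `build_model` and `classify` are ours), not the authors' algorithm: each document is scanned once and may only join a cluster of its own class, and the cluster centroids then act as the reduced classification model that KNN searches instead of the full training set.

```python
import math
from collections import Counter

def cos_sim(a, b):
    """Cosine similarity between two term-frequency dicts."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_model(labeled_docs, threshold=0.3):
    """Constrained one-pass clustering: scan the training documents once;
    a document may only join the most similar cluster of ITS OWN class
    (the constraint), and starts a new cluster when no centroid of that
    class exceeds the similarity threshold. Centroids form the model."""
    clusters = []  # each: {"label": str, "sum": Counter, "n": int}
    for doc, label in labeled_docs:
        vec = Counter(doc.split())
        best, best_sim = None, threshold
        for c in clusters:
            if c["label"] != label:          # the class constraint
                continue
            centroid = {t: s / c["n"] for t, s in c["sum"].items()}
            sim = cos_sim(vec, centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"label": label, "sum": Counter(vec), "n": 1})
        else:
            best["sum"].update(vec)          # running sum -> mean centroid
            best["n"] += 1
    return [({t: s / c["n"] for t, s in c["sum"].items()}, c["label"])
            for c in clusters]

def classify(doc, model, k=1):
    """KNN against the (much smaller) set of centroids instead of
    every training document."""
    vec = Counter(doc.split())
    scored = sorted(((cos_sim(vec, cen), lab) for cen, lab in model),
                    reverse=True)
    return Counter(lab for _, lab in scored[:k]).most_common(1)[0][0]

docs = [
    ("the striker scored a goal in the match", "sports"),
    ("the team won the football match", "sports"),
    ("the government passed a new tax law", "politics"),
    ("parliament debated the new law", "politics"),
]
model = build_model(docs, threshold=0.3)  # four documents -> two centroids
```

Because the model keeps running sums per cluster, a new labeled document can be folded in with the same one-pass rule, which is consistent with the incremental-update property claimed above.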
2 RELATED WORK
Pascal Soucy and Guy W. Mineau [1] propose a simple KNN
algorithm for text categorization that performs aggressive feature
selection. Feature selection chooses a subset of all available
features; here, it allows the removal of features that add no new
information and of features with weak prediction capability. When
such features interact with other features, they may introduce
redundancy. Redundancy and irrelevancy can mislead the KNN learning
algorithm through unwanted bias and increase its complexity. By
considering both the redundancy and the relevancy of the features,
the simple KNN algorithm provides an efficient method and a solid
ground for text categorization over large document sets with diverse
vocabulary. In this method, each text document is called an instance.
To categorize texts using KNN, each example document X is represented
as a vector of length |F|, the size of the vocabulary. The algorithm
uses a simple feature weighting approach in which distance is the
basis for weighting the contribution of each of the k neighbors in
the class assignment process. With text documents, categorization may
involve thousands of features, most of them irrelevant, and
identifying the relevant ones is difficult because the features
interact closely. The feature selection method is therefore used to
aggressively reduce the vocabulary size using feature interaction.
This simple KNN algorithm can reach impressive results using few
features.
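The distance-based neighbor weighting mentioned above can be sketched as follows (a hedged illustration: the 1/(1+distance) weight and the toy term-frequency vectors are assumptions, not Soucy and Mineau's exact weighting):

```python
import math
from collections import defaultdict

def distance_weighted_knn(test_vec, labeled_data, k):
    """Each of the k neighbors votes with a weight that shrinks with distance."""
    neighbors = sorted((math.dist(test_vec, vec), label)
                       for vec, label in labeled_data)[:k]
    scores = defaultdict(float)
    for dist, label in neighbors:
        # Closer neighbors contribute more to the class assignment.
        scores[label] += 1.0 / (1.0 + dist)
    return max(scores, key=scores.get)

# Documents as tiny term-frequency vectors over a 3-word vocabulary (|F| = 3).
docs = [((1, 0, 2), "sports"), ((2, 0, 1), "sports"), ((0, 3, 0), "finance")]
print(distance_weighted_knn((1, 0, 1), docs, k=3))  # prints sports
```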
Songbo Tan [2] proposes an effective refinement strategy for the
KNN text classifier, called the DragPushing Strategy, to enhance the
performance of KNN. Since KNN uses all training documents to predict
the classes of test documents, all training documents within one
class can be taken as class representatives of that category. The
strategy makes use of training errors to refine the KNN classifier
through 'drag' and 'push' operations: if a training example is
misclassified, DragPushing 'drags' the class representative of the
correct class toward the example and 'pushes' the class
representative of the misplaced class away from the example. After
DragPushing, the correct class tends to place more examples in the
k-nearest-neighbor set, and these nearest neighbors share larger
similarities with the test document, and vice versa. DragPushing
using K Nearest Neighbor (DPSKNN) can also counteract imbalance in
the text corpus: if class A is the minority category and class B the
majority category, then under the traditional KNN decision rule the
examples in category A tend to be classified into class B. As a
result, the weight vector of class A undergoes more 'drag' operations
than that of class B, and after a few rounds of DragPushing the
minority category A tends to have a much larger weight vector than
the majority category B. Consequently, the different weight vectors
associated with the categories counteract, to a high degree, the
impact of an unbalanced training corpus. During each iteration, all
training documents are classified; if a document labeled as class A
is classified into class B, DragPushing adjusts the model. After the
'drag' and 'push' operations, all or most elements in the weight
vector of class A are increased while all or most elements in the
weight vector of class B are decreased. As a result, the similarities
of all or most documents in class A to the test document increase,
and those in class B decrease. Two weights, the drag weight and the
push weight, control the step size of 'drag' and 'push'. The major
advantage of this technique is that it does not need sophisticated
models; it requires only simple statistics and the traditional KNN
classifier. DPSKNN makes a significant difference in the performance
of the KNN classifier and delivers better performance than other
commonly used methods.
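A simplified sketch of the drag/push idea, using one centroid-style weight vector per class (the cosine prediction rule, the learning rates, and the toy data are assumptions made for illustration, not Tan's exact formulation):

```python
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def drag_push(prototypes, docs, labels, drag_w=0.1, push_w=0.1, rounds=5):
    """Refine per-class weight vectors: when a training document is
    misclassified, drag the correct class's vector toward it and push
    the wrongly predicted class's vector away from it."""
    for _ in range(rounds):
        for doc, true_cls in zip(docs, labels):
            pred = max(prototypes, key=lambda c: cosine(prototypes[c], doc))
            if pred != true_cls:
                prototypes[true_cls] += drag_w * doc  # 'drag' operation
                prototypes[pred] -= push_w * doc      # 'push' operation
    return prototypes

docs = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 2.0]])
labels = ["A", "A", "B", "B"]
protos = drag_push({"A": np.array([1.0, 1.0]), "B": np.array([1.0, 1.0])},
                   docs, labels)
print([max(protos, key=lambda c: cosine(protos[c], d)) for d in docs])
# prints ['A', 'A', 'B', 'B'] -- all training documents now classified correctly
```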
Songbo Tan [3] proposes a neighbor-weighted K-nearest neighbor
(NWKNN) algorithm for unbalanced text corpora. In an unbalanced
corpus, the majority class tends to contribute more examples to the
k-neighbor set of each test document, so the traditional KNN
technique tends to assign the majority class's label to the document.
Hence, the big category tends to have higher classification accuracy
while the minority class tends to have low classification accuracy,
and the overall performance of KNN is degraded. In this method,
instead of balancing the training data, the algorithm assigns a large
weight to neighbors from small classes and a small weight to
neighbors from large categories. For each test document d, it first
selects the k neighbors among the training documents, which fall into
K* categories; d is then assigned the class with the highest weighted
sum, as in traditional KNN. NWKNN yields much better performance than
KNN and is an effective algorithm for unbalanced text classification
problems.
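The neighbor-weighting idea can be sketched as below (the weight formula, an inverse power of relative class size, follows the spirit of [3], but the constants and the toy data are assumptions):

```python
import math
from collections import Counter, defaultdict

def nwknn(test_vec, labeled_data, k, exponent=1.0):
    """KNN where neighbors from small classes vote with larger weights."""
    sizes = Counter(label for _, label in labeled_data)
    smallest = min(sizes.values())
    # The smallest class gets weight 1; bigger classes get weights below 1.
    class_weight = {c: (n / smallest) ** (-1.0 / exponent) for c, n in sizes.items()}
    neighbors = sorted((math.dist(test_vec, vec), label)
                       for vec, label in labeled_data)[:k]
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += class_weight[label] / (1.0 + dist)
    return max(scores, key=scores.get)

# Unbalanced corpus: 2 'small' examples vs 6 'big' examples on a line.
data = [((0.0, 0.0), "small"), ((1.0, 0.0), "small")] + \
       [((x, 0.0), "big") for x in (2.0, 2.2, 2.4, 2.6, 2.8, 3.0)]
# Plain majority voting over the 4 nearest neighbors would answer 'big' here;
# the class weights tip the decision toward the minority class.
print(nwknn((1.5, 0.0), data, k=4))  # prints small
```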
Zhou Yong, Li Youwen, and Xia Shixiong [4] propose an improved
KNN text classification algorithm based on clustering, which reduces
the high computational complexity. The method assigns a weight value
to each new training sample, indicating its degree of importance when
classifying documents, and improves both the efficiency and the
accuracy of text categorization. To classify a document, it relies on
preprocessing, which involves several key techniques: sentence
segmentation, stop-word removal, feature extraction, and weight
calculation. Sentence segmentation divides the given text document
into meaningful units such as words and sentences. Words that carry
no importance in the content of a document and have nearly no effect
on classification are called stop words, so their removal is
necessary. Feature extraction is the process of reducing the amount
of resources required to describe a large set of data. Unlike the
traditional KNN algorithm, this method does not use all training
samples, so it can overcome the problem of an uneven distribution of
training samples. The improved algorithm uses the k-means clustering
algorithm to obtain cluster centers, which are used as the new
training samples; a weight value is then introduced for each cluster
center to indicate its importance, and the revised, weighted samples
are used in the algorithm.
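The cluster-center idea in [4] might look like the following sketch (a simplified illustration: the plain k-means below uses the first k points as initial centers, and the toy vectors, size-based weights, and 1/(1+distance) scoring are assumptions):

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain k-means; returns cluster centers and per-cluster sizes."""
    centers = points[:k].copy()  # simple deterministic initialization
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(axis=0)
    return centers, np.bincount(assign, minlength=k)

def reduce_training_set(docs_by_class, k):
    """Replace each class's documents with at most k weighted cluster centers."""
    reduced = []
    for cls, docs in docs_by_class.items():
        pts = np.asarray(docs, dtype=float)
        centers, sizes = kmeans(pts, min(k, len(pts)))
        reduced += [(c, cls, int(n)) for c, n in zip(centers, sizes)]
    return reduced

def classify(test_vec, reduced, k):
    """Weighted KNN over cluster centers; weight = cluster size."""
    t = np.asarray(test_vec, dtype=float)
    near = sorted((float(np.linalg.norm(t - c)), cls, n) for c, cls, n in reduced)[:k]
    scores = {}
    for dist, cls, n in near:
        scores[cls] = scores.get(cls, 0.0) + n / (1.0 + dist)
    return max(scores, key=scores.get)

# Class A is spread over two regions; class B sits between them.
train = {"A": [(0, 0), (10, 0), (0.2, 0.1), (9.8, 0.2), (0.1, -0.1), (10.2, -0.1)],
         "B": [(5, 5), (5.2, 4.9), (4.8, 5.1)]}
model = reduce_training_set(train, k=2)
print(classify((9.9, 0.1), model, k=3))  # prints A
```

Classification now compares the test vector against a handful of centers instead of every training document, which is where the complexity reduction comes from.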
2.1 Observations and analysis
The K Nearest Neighbor algorithm is the most efficient and simple
algorithm used to categorize text, and several methods based on it
are used to classify text data. The results of the analysis of these
text categorization methods are shown in TABLE 1 below.
TABLE 1: COMPARISON OF VARIOUS METHODS USING KNN ALGORITHM

Technique Used                        | Method                   | Advantages                                        | Disadvantages
Simple KNN Algorithm                  | Feature selection method | Reduces the vocabulary size                       | Less efficient
Effective refinement strategy for KNN | DragPushing strategy     | No need to generate sophisticated models          | Cannot predict the number of iterations
Neighbor Weighted KNN                 | Neighbor Weighted KNN    | Better performance for unbalanced text            | Difficult to calculate the weight sum
KNN based on clustering               | k-means clustering       | Overcomes uneven distribution of training samples | Difficult to select the threshold parameter
4 PROPOSED WORK
This section is organized as follows: Section 4.1 illustrates the
system architecture; Section 4.2 gives a brief description of each
module; Section 4.3 describes the data set and the evaluation of the
proposed work against the base work.
4.1 System Architecture
Figure 1 provides an overview of the system architecture.
A set of collected documents is used as the training set. From
the collected documents, the training sentence set is created, and
this training set is used to generate the classification model. The
generation of the classification model is performed using the
clustering process, during which a set of document clusters is
generated. After the document clusters are generated, the input text
is compared with them and assigned to the corresponding class. The
input text is first converted into its text representation format;
this representation is then given to the text classifier, which makes
use of the document clusters to categorize the text into its
corresponding category.
Figure 1: System Architecture
4.2 Module Description
In the proposed work, the text categorization technique is divided
into three modules, which work together to obtain a better result.
The first is the classification model generation module, which makes
use of constrained one-pass clustering. The second is the text
categorization module, implemented using the Weighted KNN algorithm.
The third is the classification model updating module, which updates
the classification model dynamically.
Classification Model Generation Module: In this module, the
classification model is built using clustering, which generates
different clusters of the training documents. The one-pass clustering
algorithm is a kind of incremental clustering algorithm with
approximately linear time complexity. To build the classification
model from the training text documents, the one-pass clustering
algorithm clusters the text collections under a constraint. The
number of clusters obtained with constrained clustering is much
smaller than the number of training samples. As a result, when the
KNN method is used to classify test documents, the amount of text
similarity computation is substantially reduced, and the impact of
any single training sample on performance is also reduced.
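The constrained one-pass step might look like the following sketch (the cosine similarity, running-mean center update, and threshold value are assumptions; the single-label-per-cluster constraint is the one described above):

```python
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def constrained_one_pass(docs, labels, threshold):
    """One pass over the training documents; every cluster holds one label."""
    clusters = []  # each cluster: {"center": vec, "label": str, "n": int}
    for doc, label in zip(docs, labels):
        doc = np.asarray(doc, dtype=float)
        # Only clusters with the same label are candidates (the constraint).
        same = [c for c in clusters if c["label"] == label]
        best = max(same, key=lambda c: cosine(c["center"], doc), default=None)
        if best is not None and cosine(best["center"], doc) >= threshold:
            # Fold the document into the cluster via a running mean.
            best["center"] = (best["center"] * best["n"] + doc) / (best["n"] + 1)
            best["n"] += 1
        else:
            clusters.append({"center": doc, "label": label, "n": 1})
    return clusters

docs = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9), (1, 0.1)]
labels = ["A", "A", "B", "B", "A"]
model = constrained_one_pass(docs, labels, threshold=0.9)
print(len(model), [c["label"] for c in model])  # prints: 2 ['A', 'B']
```

Five documents collapse to two single-label clusters, illustrating why the model is much smaller than the raw training set.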
Text Categorization Module: Text categorization is done using the
Weight Adjusted KNN text classifier, which classifies the test input
into one of the predefined categories built in the previous module.
The categorization is done by calculating the weight of the test
document, learned using an iterative algorithm, which is then used to
compute the similarity of the test document to each training class.
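A sketch of how the categorization step might consult the model's clusters (the cluster dictionary layout, cosine similarity, and size-scaled scoring are illustrative assumptions rather than the exact Weight Adjusted KNN procedure):

```python
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def categorize(test_doc, clusters, k):
    """Score each class by the test document's similarity to the k most
    similar cluster centers, scaled by cluster size; the best class wins."""
    t = np.asarray(test_doc, dtype=float)
    top = sorted(clusters, key=lambda c: cosine(c["center"], t), reverse=True)[:k]
    scores = {}
    for c in top:
        scores[c["label"]] = scores.get(c["label"], 0.0) + c["n"] * cosine(c["center"], t)
    return max(scores, key=scores.get)

model = [{"center": np.array([1.0, 0.0]), "label": "sports", "n": 3},
         {"center": np.array([0.0, 1.0]), "label": "finance", "n": 2}]
print(categorize((0.9, 0.2), model, k=2))  # prints sports
```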
Updating Classification Model Module: In this module, the
classification model is updated incrementally; that is, whenever new
training data arrives, the classifier updates the existing
classification model. Many previous text categorization algorithms
find it hard or impossible to update their classification model
dynamically. However, in real-world applications the text data is
huge and grows constantly, and rebuilding the model is very
time-consuming. INNTC is an incremental modeling algorithm that can
update its classification model quickly from new training samples,
which is valuable in practical applications.
4.3 Data Set and Evaluation
Reuters-21578 is a standard benchmark for text categorization.
The Reuters-21578 collection contains 21578 documents in 135
categories that appeared on the Reuters newswire in 1987. The dataset
used in this project is the Reuters Transcribed Subset, created by
selecting 20 files from each of the 10 largest classes in the
Reuters-21578 collection. The data consists of 10 directories labeled
by topic name, each containing 20 transcription files belonging to
that business area.
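Loading such a directory-per-topic layout might look like this sketch (the directory name and sample text below are invented for the demonstration; the real subset has 10 topic directories with 20 files each):

```python
import tempfile
from pathlib import Path

def load_corpus(root):
    """Collect (text, label) pairs from a layout with one directory per
    topic, each holding that topic's transcription files."""
    samples = []
    for topic_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for f in sorted(topic_dir.iterdir()):
            samples.append((f.read_text(), topic_dir.name))
    return samples

# Build a miniature layout in a temporary directory to demonstrate.
root = Path(tempfile.mkdtemp())
(root / "trade").mkdir()
(root / "trade" / "doc1.txt").write_text("tariff talks resumed today")
print(load_corpus(root))  # prints [('tariff talks resumed today', 'trade')]
```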
5 RESULTS
The system has been implemented in MATLAB, and three programs have
been built around it. The first program generates the document
classes from the existing dataset; for this document class
generation, a threshold value is first calculated by selecting a
random number of documents. The second program implements
the Weighted KNN algorithm. The third program updates the document
classes dynamically whenever new training data arrives.
The text categorization has been evaluated with different k values
and compared with the existing KNN algorithm.
Figure 2: The classification results with different k values in
Reuters-21578
6 CONCLUSION AND FUTURE WORK
This paper describes a text categorization technique using the
K-Nearest Neighbor algorithm. The technique based on the improved K
Nearest Neighbor algorithm is well suited to the task because of its
minimal time and computational requirements. In this technique,
clustering is a powerful tool for discovering the complex
distribution of the training texts: the constrained one-pass
clustering algorithm captures the category relationships through the
constrained condition that each cluster contains only one label,
which reflects the complex distributions of the training texts better
than the original text samples do. By integrating the advantages of
constrained one-pass clustering and the Weight Adjusted KNN approach,
the method achieves performance comparable with other text
classifiers.
ACKNOWLEDGMENT
An exertion over a long period may be successful only with the
advice and guidance of many well-wishers. We take this opportunity
to express our gratitude to all who encouraged us to complete this
work. We would like to express our deep sense of gratitude to our
respected Principal for his inspiration and for creating an
atmosphere in the college conducive to this work. We would also like
to thank the Head of the Department of Computer Science and
Engineering.
REFERENCES
[1] Pascal Soucy and Guy W. Mineau. "A Simple KNN Algorithm for
    Text Categorization." Proceedings of the International Conference
    on Data Mining, 2001: 647-648.
[2] Tan, Songbo. "An effective refinement strategy for KNN text
    classifier." Expert Systems with Applications 30.2 (2006):
    290-298.
[3] Tan, Songbo. "Neighbor-weighted k-nearest neighbor for
    unbalanced text corpus." Expert Systems with Applications 28.4
    (2005): 667-671.
[4] Zhou, Yong, Youwen Li, and Shixiong Xia. "An improved KNN text
    classification algorithm based on clustering." Journal of
    Computers 4.3 (2009): 230-237.
[5] Jiang, Shengyi, et al. "An improved K-nearest-neighbor
    algorithm for text categorization." Expert Systems with
    Applications 39.1 (2012): 1503-1509.
[6] L. R. Bahl, S. Balakrishnan-Aiyer, J. Bellegarda, M. Franz, P.
    Gopalakrishnan, D. Nahamoo, M. Novak, M. Padmanabhan, M.
    Picheny, and S. Roukos. "Performance of the IBM large vocabulary
    continuous speech recognition system on the ARPA Wall Street
    Journal task." In Proc. of ICASSP '95, pages 41-44, Detroit, MI,
    1995.
[7] S. Agarwal, S. Godbole, D. Punjani, and S. Roy. "How Much Noise
    is Too Much: A Study in Automatic Text Classification." In Proc.
    of ICDM 2007.
[8] D. D. Lewis. Reuters-21578 text categorization test collection
    distribution 1.0. http://www.research.att.com/ lewis, 1999.