Automatic text categorization refers to the process of automatically assigning one or more categories to a document from a predefined set. Text categorization is challenging in Indian languages, which are rich in morphology and have a large number of word forms, resulting in large feature spaces. This paper investigates the performance of different classification approaches using different term weighting approaches in order to determine the one most applicable to the Telugu text classification problem. We have investigated different term weighting methods for a Telugu corpus in combination with Naive Bayes (NB), Support Vector Machine (SVM) and k-Nearest Neighbor (kNN) classifiers.
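The weighting schemes compared in studies like this one typically start from tf-idf. As a rough illustration, here is a minimal sketch of that computation on a toy corpus (the function name and corpus are illustrative, not from the paper):

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Weight each term in each document by tf * log(N / df).

    docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

docs = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
weights = tf_idf_weights(docs)
```

Terms confined to few documents get boosted; alternative schemes such as raw tf or binary weights simply drop the idf factor. The weighted vectors can then be fed to NB, SVM or kNN classifiers.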
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS (kevig)
This article focuses on evaluating and comparing the available feature selection methods for general versatility on authorship attribution problems and tries to identify which method is the most effective. The general versatility of feature selection methods and its connection to selecting appropriate features for varying data are discussed. In addition, different languages, different types of features, different systems for calculating the accuracy of an SVM (support vector machine), and different criteria for ranking feature selection methods were used together to measure the general versatility of these methods. The analysis results indicate that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes. Chi-square proved to be the better method overall.
Arabic text categorization algorithm using vector evaluation method (ijcsit)
Text categorization is the process of grouping documents into categories based on their contents. This process is important to make information retrieval easier, and it has become more important due to the huge amount of textual information available online. The main problem in text categorization is how to improve the classification accuracy. Although Arabic text categorization is a new and promising field, there is little research in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a categorized corpus of Arabic documents; the weights of the tested document's words are then calculated to determine the document keywords, which are compared with the keywords of the corpus categories to determine the tested document's best category.
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS (ijnlc)
Text analysis has been attracting increasing attention in this data era. Selecting effective features from datasets is a particularly important part of text classification studies. Feature selection excludes irrelevant features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy and performance of identification. Many feature selection methods have been proposed so far; however, it remains unclear which method is the most effective in practice. This article focuses on evaluating and comparing the available feature selection methods for general versatility on authorship attribution problems and tries to identify which method is the most effective. The general versatility of feature selection methods and its connection to selecting appropriate features for varying data are discussed. In addition, different languages, different types of features, different systems for calculating the accuracy of an SVM (support vector machine), and different criteria for ranking feature selection methods were used together to measure the general versatility of these methods. The analysis results indicate that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes. Chi-square proved to be the better method overall.
TEXT MINING AND CLASSIFICATION OF PRODUCT REVIEWS USING STRUCTURED SU... (csandit)
Text mining and text classification are the two prominent and challenging tasks in the field of machine learning. Text mining refers to the process of deriving high-quality and relevant information from text, while text classification deals with the categorization of text documents into different classes. The real challenge in these areas is to address problems like handling large text corpora, similarity of words in text documents, and association of text documents with a subset of class categories. The feature extraction and classification of such text documents require an efficient machine learning algorithm which performs automatic text classification. This paper describes the classification of product review documents as a multi-label classification scenario and addresses the problem using Structured Support Vector Machine. The work also explains the flexibility and performance of the proposed approach for efficient text classification.
Novelty detection via topic modeling in research articles (csandit)
In today’s world, redundancy is one of the most vital problems faced in almost all domains. Novelty detection is the identification of new or unknown data or signals that a machine learning system was not aware of during training. The problem becomes more intense when it comes to research articles. A method of identifying novelty in each section of an article is highly desirable for determining the novel idea proposed in a research paper. Since research articles are semi-structured, detecting novel information in them requires more accurate systems. Topic models provide a useful means to process them and a simple way to analyze them. This work compares the most predominantly used topic model, Latent Dirichlet Allocation, with the hierarchical Pachinko Allocation Model. The results obtained favor the hierarchical Pachinko Allocation Model when used for document retrieval.
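Once a topic model (LDA or PAM) has mapped each document to a topic distribution, novelty can be scored by the distance to the closest already-seen document. A sketch using Jensen-Shannon divergence as the distance (this scoring choice is an assumption for illustration, not the paper's exact method):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return (kl(p, m) + kl(q, m)) / 2

def novelty_score(candidate, seen):
    """Distance from a new document's topic mix to the nearest seen one;
    a high score suggests the document covers a novel topic combination."""
    return min(js_divergence(candidate, doc) for doc in seen)

seen = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]   # topic mixes of known articles
```

A near-duplicate of a known article scores close to zero, while a document concentrated on an unseen topic scores high and can be flagged as novel.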
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING (kevig)
This paper describes the use of Naive Bayes to address the task of assigning function tags, and of context-free grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order and a complex morphological system. Function tagging is a pre-processing step for parsing. In the function tagging task, we use a functionally annotated corpus and tag Myanmar sentences with correct segmentation, POS (part-of-speech) tagging and chunking information. We propose Myanmar grammar rules and apply context-free grammar (CFG) to derive the parse tree of function-tagged Myanmar sentences. Experiments show that our analysis achieves good results on parsing simple sentences and three types of complex sentences.
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining is the discovery of valuable, yet hidden, information in text documents. Text classification (also called text categorization) is one of the important research issues in the field of text mining, as it is necessary to classify/categorize large collections of texts (documents) into specific classes. Text classification assigns a text document to one of a set of predefined classes. This paper covers different text classification techniques and also includes classifier architecture and text classification applications.
Most text classification problems are associated with multiple class labels, and hence automatic text classification is one of the most challenging and prominent research areas. Text classification is the problem of categorizing text documents into different classes. In the multi-label classification scenario, each document may be associated with more than one label. The real challenge in multi-label classification is the labelling of a large number of text documents with a subset of class categories. The feature extraction and classification of such text documents require an efficient machine learning algorithm which performs automatic text classification. This paper describes the multi-label classification of product review documents using Structured Support Vector Machine.
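Structured SVMs are not part of mainstream toolkits, but the multi-label setup described above can be sketched with the common one-vs-rest baseline: one binary SVM per label over tf-idf features. The reviews, labels and model choice below are invented for illustration and stand in for the paper's structured approach:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

reviews = ["battery life is great", "screen cracked on day one",
           "great screen and battery", "battery drains but price is right"]
labels = [{"battery"}, {"screen"}, {"battery", "screen"}, {"battery", "price"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                    # one indicator column per label
X = TfidfVectorizer().fit_transform(reviews)
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
pred = clf.predict(X)                            # one binary vector per review
```

Each review can thus receive any subset of the label set, which is the multi-label scenario the paper addresses; a structured SVM additionally models dependencies between labels rather than treating them independently.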
Suitability of naïve bayesian methods for paragraph level text classification... (ijaia)
The amount of data present online is growing very rapidly, hence organizing and categorizing data has become an obvious need. Information Retrieval (IR) techniques act as an aid in assisting users in obtaining relevant information. IR in the Indian context is very relevant, as there are several blogs and news publications in Indian languages present online. This work looks at the suitability of Naïve Bayesian methods for paragraph-level text classification in the Kannada language. The Naïve Bayesian methods are among the most primitive algorithms for text categorization tasks. We apply dimensionality reduction using minimum term frequency and stop word identification and elimination methods to achieve the task. It is evident that the Naïve Bayesian multinomial model outperforms the simple Naïve Bayesian approach in paragraph classification tasks.
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ... (IJECEIAES)
Agglomerative hierarchical clustering is a bottom-up clustering method, where the distances between documents can be obtained by extracting feature values using a topic-based latent Dirichlet allocation method. To reduce the number of features, term selection can be done using Luhn’s idea. These methods can be used to build better clusters for documents, but little research discusses them. Therefore, in this research, the term weighting calculation uses Luhn’s idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs sampling latent Dirichlet allocation combined with term frequency and the fuzzy Sugeno method. The feature values are used as the distances between documents, which are then clustered with single, complete and average link algorithms. The evaluations show that feature extraction with and without the lower cut-off differs little, but topic determination for each term based on term frequency and the fuzzy Sugeno method is better than the Tsukamoto method at finding more relevant documents. The use of the lower cut-off and fuzzy Sugeno Gibbs latent Dirichlet allocation for complete agglomerative hierarchical clustering yields consistent metric values. This clustering method is suggested as a better method for clustering documents that is more relevant to its gold standard.
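The single, complete and average link variants mentioned above differ only in how inter-cluster distance is computed. A compact sketch over a precomputed pairwise distance table (the toy distances below stand in for the LDA-based feature distances):

```python
def cluster_distance(c1, c2, dist, linkage):
    """Distance between two clusters under single, complete or average link."""
    pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
    if linkage == "single":
        return min(pairs)
    if linkage == "complete":
        return max(pairs)
    return sum(pairs) / len(pairs)          # average link

def agglomerate(items, dist, linkage="single"):
    """Repeatedly merge the closest pair of clusters; return the merge order."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        a, b = min(((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
                   key=lambda p: cluster_distance(p[0], p[1], dist, linkage))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(a | b)
    return merges

# toy pairwise distances between three documents
dist = {frozenset(("a", "b")): 1.0,
        frozenset(("a", "c")): 5.0,
        frozenset(("b", "c")): 4.0}
```

Swapping the `linkage` argument reproduces the three algorithms compared in the paper without changing the merge loop itself.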
Text Categorization Using Improved K Nearest Neighbor Algorithm (IJTET Journal)
Abstract— Text categorization is the process of identifying and assigning a predefined class to which a document belongs. A wide variety of algorithms are currently available to perform text categorization. Among them, the K-Nearest Neighbor text classifier is the most commonly used one. It tests the degree of similarity between documents and k training examples, thereby determining the category of test documents. In this paper, an improved K-Nearest Neighbor algorithm for text categorization is proposed. In this method, the text is categorized into different classes based on the K-Nearest Neighbor algorithm and constrained one-pass clustering, which provides an effective strategy for categorizing the text. This improves the efficiency of the K-Nearest Neighbor algorithm by generating a classification model. Text classification using the K-Nearest Neighbor algorithm has a wide variety of text mining applications.
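The core of any kNN text classifier is a similarity vote over the k nearest training documents. A bare-bones sketch of that core, without the one-pass clustering refinement the paper adds (the toy training set is invented):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def knn_classify(tokens, train, k=3):
    """Majority vote over the k training documents most similar to `tokens`.

    train: list of (term-count dict, label) pairs."""
    tf = Counter(tokens)
    nearest = sorted(train, key=lambda d: cosine(tf, d[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [(Counter(["ball", "goal", "team"]), "sport"),
         (Counter(["goal", "match"]), "sport"),
         (Counter(["cpu", "ram"]), "tech"),
         (Counter(["ram", "disk", "cpu"]), "tech")]
```

The improved variant in the paper replaces the raw training documents with cluster centroids from one-pass clustering, so the vote runs over far fewer comparisons.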
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information is lying dormant in historical documents and manuscripts. This information would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching word images of the Devanagari script. Shape information is utilised for generating integer codes for words in the document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES (acijjournal)
Named Entity Recognition, an important subject of Natural Language Processing, is a key technology for information extraction, information retrieval, question answering and other text processing applications. In this study, we evaluate previously well-established association measures as an initial attempt to extract two-worded named entities in a Turkish corpus. Furthermore, we propose a new association measure and compare it with the other methods. These methods are evaluated using precision and recall measures.
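A classic association measure that serves as a baseline in this setting is pointwise mutual information. A sketch for scoring a two-word sequence (the counts below are invented for illustration):

```python
import math
from collections import Counter

def pmi(pair, bigrams, unigrams, total):
    """Pointwise mutual information of a word pair: log p(w1,w2) / (p(w1)p(w2)).
    High values suggest a fixed expression such as a two-word named entity."""
    p_pair = bigrams[pair] / total
    p1 = unigrams[pair[0]] / total
    p2 = unigrams[pair[1]] / total
    return math.log(p_pair / (p1 * p2))

unigrams = Counter({"new": 50, "york": 10, "the": 300})
bigrams = Counter({("new", "york"): 10, ("the", "new"): 15})
```

A pair that co-occurs no more often than chance scores near zero, while a strongly associated pair scores high; candidate two-word entities are then ranked by this score.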
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITION (kevig)
The aim of Named Entity Recognition (NER) is to identify references to named entities in unstructured documents and to classify them into predefined semantic categories. NER often benefits from added background knowledge in the form of gazetteers. However, using such a collection does not deal with name variants and cannot resolve the ambiguities involved in identifying entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts by identifying named entities with a small set of training data. Using the identified named entities, word and context features are used to define a pattern. This pattern for each named entity category is used as a seed pattern to identify the named entities in the test set. Pattern scoring and tuple value scoring enable the generation of new patterns to identify the named entity categories. We have evaluated the proposed system for English with tagged (IEER) and untagged (CoNLL 2003) named entity corpora, and for Tamil with documents from the FIRE corpus, and it yields an average f-measure of 75% for both languages.
A survey of named entity recognition in assamese and other indian languages (ijnlc)
Named Entity Recognition is always important when dealing with major Natural Language Processing tasks such as information extraction, question answering, machine translation and document summarization, so in this paper we put forward a survey of named entities in Indian languages with particular reference to Assamese. There are various rule-based and machine learning approaches available for Named Entity Recognition. At the beginning of the paper we give an idea of the available approaches for Named Entity Recognition, and then we discuss the related research in this field. Assamese, like other Indian languages, is agglutinative and suffers from a lack of appropriate resources, as Named Entity Recognition requires large data sets, gazetteer lists, dictionaries etc., and some useful features such as capitalization, found in English, are absent in Assamese. Apart from this, we also describe some of the issues faced in Assamese while doing Named Entity Recognition.
Named Entity Recognition for Telugu Using Conditional Random Field (Waqas Tariq)
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and classified into predefined categories such as person names, organization names, location names and miscellaneous (dates and others). It is a key technology for information extraction, question answering systems, machine translation, information retrieval etc. This paper reports the development of an NER system for Telugu using Conditional Random Fields (CRF). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for the Telugu language is very new. The system makes use of the different contextual information of the words along with a variety of features that are helpful in predicting the four different named entity (NE) classes: person name, location name, organization name and miscellaneous (dates and others). Keywords: named entity, Conditional Random Field, NE, CRF, NER, named entity recognition
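CRF taggers of the kind described here work from per-token feature dictionaries built from the word and its context. A sketch of such a feature extractor; the specific features and the example sentence are illustrative, not the paper's exact set:

```python
def word_features(sent, i):
    """Contextual features for token i, of the kind fed to a CRF tagger."""
    w = sent[i]
    return {
        "word": w.lower(),
        "is_title": w.istitle(),            # capitalized words can hint at NEs
        "has_digit": any(c.isdigit() for c in w),
        "suffix2": w[-2:].lower(),          # crude morphological cue
        "prev": sent[i - 1].lower() if i > 0 else "<s>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "</s>",
    }

feats = word_features(["Ravi", "lives", "in", "Hyderabad"], 3)
```

The CRF then learns weights over these features jointly across the whole sentence, which lets the label of one token depend on its neighbors rather than being predicted in isolation.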
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES (cscpconf)
This paper describes context-free grammar (CFG) based grammatical relations for Myanmar sentences, combined with a corpus-based function tagging system. Part of the challenge of statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order and a complex morphological system. Function tagging is a pre-processing step to show the grammatical relations of Myanmar sentences. In the function tagging task, which tags the functions of Myanmar sentences with correct segmentation, POS (part-of-speech) tagging and chunking information, we use Naive Bayesian theory to disambiguate the possible function tags of a word. We apply context-free grammar (CFG) to determine the grammatical relations of the function tags. We also create a functionally annotated tagged corpus for Myanmar and propose the grammar rules for Myanmar sentences. Experiments show that our analysis achieves good results on simple sentences and complex sentences.
Using Class Frequency for Improving Centroid-based Text Classification (IDES Editor)
Most previous works on text classification represented the importance of terms by term occurrence frequency (tf) and inverse document frequency (idf). This paper presents ways to apply class frequency in centroid-based text categorization. Three approaches are taken into account. The first is to explore the effectiveness of inverse class frequency on the popular term weighting TFIDF, both as a replacement of idf and as an addition to TFIDF. The second approach is to evaluate some functions used to adjust the power of inverse class frequency. The third approach is to apply terms which are found in only one class or a few classes to improve classification performance, using two-step classification. From the results, class frequency demonstrates its usefulness for text classification, especially in the two-step classification.
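The inverse class frequency idea above can be stated in a few lines. This sketch shows icf used as an addition to tf-idf, one of the approaches mentioned; the exact functional forms tried in the paper may differ:

```python
import math

def tf_idf_icf(tf, df, cf, n_docs, n_classes):
    """Term weight: tf * idf * icf.

    tf: term frequency in the document
    df: document frequency of the term in the corpus
    cf: number of classes the term occurs in"""
    idf = math.log(n_docs / df)
    icf = math.log(n_classes / cf)
    return tf * idf * icf
```

A term found in every class gets icf = 0 and is discarded as uninformative for discrimination, while a term confined to one class keeps its full tf-idf weight scaled up by log of the class count.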
A Novel Text Classification Method Using Comprehensive Feature Weight (TELKOMNIKA JOURNAL)
Currently, since the categorical distribution of short text corpora is not balanced, it is difficult to obtain accurate classification results with methods designed for long text classification. To solve this problem, this paper proposes a novel method of short text classification using comprehensive feature weights. This method takes into account the distribution of samples in the positive and negative categories, as well as the category correlation of words, so as to improve the existing feature weight calculation method and obtain a new way of calculating a comprehensive feature weight. The experimental results show that the proposed method scores significantly higher than other feature weighting methods in micro- and macro-averaged values, which shows that this method can greatly improve the accuracy and recall of short text classification.
Query Answering Approach Based on Document Summarization (IJMER)
The growth of online information has obliged thorough research in the domain of automatic text summarization within the Natural Language Processing (NLP) community. The aim of this paper is to propose a novel language-independent automatic summarization approach that combines three main approaches: Rhetorical Structure Theory (RST), the query processing approach, and the Network Representation Approach (NRA). RST, as a theory of the major aspects of the structure of natural text, is used to extract the semantic relations behind the text. The query processing approach classifies the question type and finds the answer in a way that suits the user's needs. The NRA is used to create a graph representing the extracted semantic relations. The output is an answer which not only responds to the question but also gives the user an opportunity to find additional information related to the question. We implemented the proposed approach. As a case study, the implemented approach is applied to Arabic text in the agriculture field. The implemented approach succeeded in summarizing extension documents according to the user's query. The results have been evaluated using recall, precision and F-score measures.
Proposed Method for String Transformation using Probabilistic Approach (Editor IJMTER)
In this system, a string is given as input, and the system generates the k most likely output strings corresponding to the input string. This system proposes both an accurate and an efficient solution by using a novel probabilistic approach to string transformation. The approach includes the use of a log-linear model, a method for training the model, and an algorithm for generating the top k candidates, whether or not there is a predefined dictionary. The log-linear model is defined as a conditional probability distribution of an output string and a rule set for the transformation conditioned on an input string. The learning method employs maximum likelihood estimation for parameter estimation. The string generation algorithm, based on pruning, is guaranteed to generate the optimal top k candidates. The proposed method will be applied to the correction of spelling errors in queries as well as the reformulation of queries in web search.
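The top-k generation step can be sketched with a toy rule set, with fixed scores standing in for learned log-linear weights. The rules below are invented for illustration; the paper's model learns the rules and weights and allows multiple rule applications per string:

```python
import heapq

# hypothetical transformation rules: (pattern, replacement, log-weight)
RULES = [("ie", "ei", -0.1), ("teh", "the", -0.05), ("a", "e", -2.0)]

def top_k_candidates(s, k=3):
    """Apply at most one rule occurrence to s and rank candidates by score.
    The unchanged string keeps score 0; a full system would chain rules
    and prune the search space while generating candidates."""
    candidates = {s: 0.0}
    for pat, rep, w in RULES:
        start = 0
        while (i := s.find(pat, start)) != -1:
            cand = s[:i] + rep + s[i + len(pat):]
            candidates[cand] = max(candidates.get(cand, float("-inf")), w)
            start = i + 1
    return heapq.nlargest(k, candidates.items(), key=lambda kv: kv[1])
```

Running this on a misspelled query such as "recieve" yields a ranked list in which the rule-corrected form appears among the top candidates, which is the shape of output a spelling corrector needs.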
Clustering Algorithm for Gujarati Language (ijsrd.com)
The natural language processing area is still under research, but nowadays it is a platform for worldwide researchers. Natural language processing includes analyzing a language based on its structure and then tagging each word appropriately with its grammatical base. Here we have a set of 50,000 tagged words, and we try to cluster those Gujarati words based on a proposed algorithm; we have defined our own algorithm for processing. Many clustering techniques are available, e.g. single linkage, complete linkage and average linkage. Here the number of clusters to be formed is not known, so it all depends on the type of data set provided. Clustering is a preprocessing step for stemming. Stemming is the process where the root is extracted from a word, e.g. cats = cat + s, meaning cat: noun, plural form.
The enormous amount of information stored in unstructured texts cannot simply be used for further
processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific
(pre-) processing methods and algorithms are required in order to extract useful patterns. Text Mining is the
discovery of valuable, yet hidden, information from the text document. Text classification (Also called Text
Categorization) is one of the important research issues in the field of text mining. It is necessary to
classify/categorize large texts (documents) into specific classes. Text Classification assigns a text document to one of a
set of predefined classes. This paper covers different text classification techniques and also includes Classifier
Architecture and Text Classification Applications.
Most of the text classification problems are associated with multiple class labels and hence automatic text
classification is one of the most challenging and prominent research area. Text classification is the
problem of categorizing text documents into different classes. In the multi-label classification scenario,
each document is associated may have more than one label. The real challenge in the multi-label
classification is the labelling of large number of text documents with a subset of class categories. The
feature extraction and classification of such text documents require an efficient machine learning algorithm
which performs automatic text classification. This paper describes the multi-label classification of product
review documents using Structured Support Vector Machine.
Suitability of naïve bayesian methods for paragraph level text classification...ijaia
The amount of data present online is growing very rapidly, so organizing and categorizing this data has become an obvious need. Information Retrieval (IR) techniques assist users in obtaining relevant information. IR in the Indian context is especially relevant, as several blogs and news publications in Indian languages are available online. This work examines the suitability of Naïve Bayesian methods for paragraph-level text classification in the Kannada language. Naïve Bayesian methods are among the simplest algorithms for text categorization tasks. We apply dimensionality reduction using a minimum term frequency threshold together with stop-word identification and elimination. The results show that the Naïve Bayesian multinomial model outperforms the simple Naïve Bayesian approach in paragraph classification tasks.
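As a rough illustration of the approach described, here is a minimal multinomial Naive Bayes classifier with the two preprocessing steps mentioned, a stop-word list and a minimum term frequency cut-off (the stop list, toy corpus, and `min_tf` value are assumptions for the sketch):

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "is", "of", "and"}   # illustrative stop list

def preprocess(text, vocab=None):
    toks = [w for w in text.lower().split() if w not in STOP_WORDS]
    return [w for w in toks if vocab is None or w in vocab]

def train_multinomial_nb(docs, labels, min_tf=2):
    # dimensionality reduction: drop terms rarer than min_tf overall
    total = Counter(w for d in docs for w in preprocess(d))
    vocab = {w for w, c in total.items() if c >= min_tf}
    class_tf = defaultdict(Counter)
    class_docs = Counter(labels)
    for d, y in zip(docs, labels):
        class_tf[y].update(preprocess(d, vocab))
    return vocab, class_tf, class_docs, len(docs)

def predict(model, text):
    vocab, class_tf, class_docs, n = model
    toks = preprocess(text, vocab)
    def log_post(y):
        denom = sum(class_tf[y].values()) + len(vocab)   # Laplace smoothing
        return (math.log(class_docs[y] / n)
                + sum(math.log((class_tf[y][w] + 1) / denom) for w in toks))
    return max(class_docs, key=log_post)

docs = ["election vote election", "vote minister election",
        "goal match goal", "match player goal"]
labels = ["pol", "pol", "sport", "sport"]
model = train_multinomial_nb(docs, labels)
print(predict(model, "vote in the election"))
```

The multinomial model counts every occurrence of a term per class; the "simple" Bernoulli variant the paper compares against would count only term presence per document.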
Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s ... (IJECEIAES)
Agglomerative hierarchical clustering is a bottom-up clustering method in which the distances between documents are derived from feature values extracted with a topic-based latent Dirichlet allocation method. To reduce the number of features, terms can be selected using Luhn's idea. Together, these methods can build better document clusters, yet little research discusses the combination. Therefore, in this research the term weighting step uses Luhn's idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs-sampling latent Dirichlet allocation combined with term frequency and the fuzzy Sugeno method. The feature values serve as distances between documents, which are clustered with single-, complete- and average-link algorithms. The evaluations show little difference between feature extraction with and without the lower cut-off. However, topic determination for each term based on term frequency and the fuzzy Sugeno method finds more relevant documents than the Tsukamoto method. Using the lower cut-off and fuzzy Sugeno Gibbs latent Dirichlet allocation with complete-link agglomerative hierarchical clustering yields consistent metric values, so this clustering method is suggested as a better way to cluster documents in agreement with the gold standard.
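The linkage step can be sketched independently of the LDA/fuzzy feature extraction: given any document-distance matrix, complete-link agglomerative clustering merges the two closest clusters bottom-up, measuring cluster distance as the maximum pairwise document distance. A minimal sketch (the distance matrix is a toy stand-in for the paper's LDA-derived distances):

```python
def complete_link_clustering(dist, k):
    """Bottom-up agglomerative clustering on a symmetric distance
    matrix `dist`, merging until `k` clusters remain (complete link:
    cluster distance = max pairwise document distance)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)   # merge the two closest clusters
    return clusters

# four "documents": 0,1 close together, 2,3 close together
dist = [[0, 1, 9, 8],
        [1, 0, 9, 9],
        [9, 9, 0, 2],
        [8, 9, 2, 0]]
print(complete_link_clustering(dist, 2))
```

Single- and average-link variants differ only in replacing `max` with `min` or the mean over the pairwise distances.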
Text Categorization Using Improved K Nearest Neighbor Algorithm (IJTET Journal)
Abstract— Text categorization is the process of identifying and assigning a predefined class to which a document belongs. A wide variety of algorithms are currently available to perform text categorization. Among them, the K-Nearest Neighbor text classifier is the most commonly used: it measures the degree of similarity between a test document and the k nearest training documents, thereby determining the category of the test document. In this paper, an improved K-Nearest Neighbor algorithm for text categorization is proposed. The text is categorized into different classes based on the K-Nearest Neighbor algorithm and constrained one-pass clustering, which provides an effective strategy for categorizing text and improves the efficiency of the K-Nearest Neighbor algorithm by generating a classification model. Text classification using the K-Nearest Neighbor algorithm has a wide variety of text mining applications.
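The baseline kNN text classifier the paper improves on can be sketched in a few lines: represent documents as term-frequency vectors, rank training documents by cosine similarity, and let the k nearest vote (toy training set; the paper's one-pass clustering refinement is not shown):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    num = sum(a[w] * b[w] for w in a if w in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def knn_classify(train, text, k=3):
    """Vote among the k training documents most similar to `text`."""
    vec = Counter(text.lower().split())
    ranked = sorted(train, reverse=True,
                    key=lambda dy: cosine(vec, Counter(dy[0].lower().split())))
    votes = Counter(y for _, y in ranked[:k])
    return votes.most_common(1)[0][0]

train = [("cheap pills online", "spam"), ("win money now", "spam"),
         ("meeting agenda attached", "ham"), ("project meeting notes", "ham"),
         ("free money pills", "spam")]
print(knn_classify(train, "money pills for free", k=3))
```

The improvement the paper proposes replaces the raw training set with cluster centroids, so each test document is compared against far fewer vectors.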
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
Large amounts of information lie dormant in historical documents and manuscripts. This information would go to waste if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text by optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching the word images of Devanagari script. The shape information is utilised to generate integer codes for words in the document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES (acijjournal)
Named Entity Recognition, an important subject of Natural Language Processing, is a key technology for information extraction, information retrieval, question answering and other text processing applications. In this study, we evaluate previously well-established association measures as an initial attempt to extract two-word named entities from a Turkish corpus. Furthermore, we propose a new association measure and compare it with the other methods. These methods are evaluated using precision and recall.
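A classic association measure of this kind is pointwise mutual information (PMI), which rewards word pairs that co-occur far more often than chance; two-word named entities such as person names tend to score highly. A sketch under the assumption of simple whitespace tokens (the tiny corpus is illustrative, and the paper's own measures may differ):

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs with pointwise mutual information
    (PMI); pairs that co-occur far more often than chance rank high.
    Pairs below min_count are dropped to avoid unstable scores."""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    return {p: math.log((c / (n - 1)) / ((uni[p[0]] / n) * (uni[p[1]] / n)))
            for p, c in bi.items() if c >= min_count}

corpus = ("mustafa kemal founded the republic ; "
          "mustafa kemal led the army ; the army and the republic").split()
scores = pmi_bigrams(corpus)
best = max(scores, key=scores.get)
print(best)
```

"mustafa kemal" outranks frequent but loosely bound pairs like "the army", because its component words almost never occur apart.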
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITION (kevig)
The aim of Named Entity Recognition (NER) is to identify references to named entities in unstructured documents and to classify them into predefined semantic categories. NER often benefits from added background knowledge in the form of gazetteers. However, such a collection does not cover name variants and cannot resolve the ambiguities involved in identifying entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts by identifying named entities with a small set of training data. From the identified named entities, word and context features are used to define a pattern for each named entity category, which serves as a seed pattern for identifying named entities in the test set. Pattern scoring and tuple value scores enable the generation of new patterns for identifying the named entity categories. We have evaluated the proposed system on English with tagged (IEER) and untagged (CoNLL 2003) named entity corpora, and on Tamil with documents from the FIRE corpus, and it yields an average f-measure of 75% for both languages.
A survey of named entity recognition in assamese and other indian languages (ijnlc)
Named Entity Recognition is important for major Natural Language Processing tasks such as information extraction, question answering, machine translation and document summarization, so in this paper we put forward a survey of Named Entity Recognition in Indian languages, with particular reference to Assamese. Various rule-based and machine learning approaches are available for Named Entity Recognition. We first give an overview of the available approaches and then discuss related research in this field. Assamese, like other Indian languages, is agglutinative and suffers from a lack of appropriate resources, as Named Entity Recognition requires large data sets, gazetteer lists, dictionaries and so on, and useful features such as the capitalization found in English are absent in Assamese. Apart from this, we also describe some of the issues faced while doing Named Entity Recognition in Assamese.
Named Entity Recognition for Telugu Using Conditional Random Field (Waqas Tariq)
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and classified into predefined categories such as person names, organization names, location names and miscellaneous (dates and others). It is a key technology for Information Extraction, Question Answering systems, Machine Translation, Information Retrieval, etc. This paper reports the development of an NER system for Telugu using Conditional Random Fields (CRF). Although this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for Telugu is very new. The system makes use of the contextual information of words along with a variety of features that are helpful in predicting the four named entity (NE) classes: person name, location name, organization name and miscellaneous (dates and others). Keywords: Named entity, Conditional Random field, NE, CRF, NER, named entity recognition
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES (cscpconf)
This paper describes context free grammar (CFG) based grammatical relations for Myanmar sentences, combined with a corpus-based function tagging system. Part of the challenge of statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order and a complex morphological system. Function tagging is a pre-processing step that shows the grammatical relations of Myanmar sentences. In the function tagging task, which tags the functions of Myanmar sentences given correct segmentation, POS (part-of-speech) tagging and chunking information, we use Naive Bayesian theory to disambiguate the possible function tags of a word. We apply the context free grammar (CFG) to find the grammatical relations of the function tags. We also create a functionally annotated tagged corpus for Myanmar and propose grammar rules for Myanmar sentences. Experiments show that our analysis achieves good results on both simple and complex sentences.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Using Class Frequency for Improving Centroid-based Text Classification (IDES Editor)
Most previous work on text classification represented the importance of terms by term occurrence frequency (tf) and inverse document frequency (idf). This paper presents ways to apply class frequency in centroid-based text categorization. Three approaches are taken into account. The first explores the effectiveness of inverse class frequency on the popular term weighting TFIDF, both as a replacement for idf and as an addition to TFIDF. The second evaluates some functions used to adjust the power of inverse class frequency. The third applies terms found in only one class or a few classes to improve classification performance, using two-step classification. The results show that class frequency is useful for text classification, especially in the two-step classification.
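One way to fold class frequency into the conventional weighting, as the first approach above describes, is to multiply tf-idf by an inverse class frequency (icf) factor. The exact functional form varies between formulations; the one below is an illustrative choice, not necessarily the paper's:

```python
import math

def tf_idf_icf(tf, df, n_docs, cf, n_classes):
    """Illustrative tf * idf * icf weight: the icf factor boosts terms
    that occur in few classes.  tf = term frequency in the document,
    df = number of documents containing the term, cf = number of
    classes containing the term."""
    idf = math.log(n_docs / df)
    icf = math.log(1 + n_classes / cf)   # assumed form; papers differ
    return tf * idf * icf

# a term seen in 1 of 4 classes outweighs one spread over all 4,
# everything else being equal
rare = tf_idf_icf(tf=3, df=10, n_docs=100, cf=1, n_classes=4)
common = tf_idf_icf(tf=3, df=10, n_docs=100, cf=4, n_classes=4)
print(rare > common)
```

This captures the paper's intuition that a term confined to one or two classes is a stronger class indicator than idf alone reveals.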
A Novel Text Classification Method Using Comprehensive Feature Weight (TELKOMNIKA JOURNAL)
Because the categorical distribution of short text corpora is not balanced, it is difficult to obtain accurate classification results with methods designed for long texts. To solve this problem, this paper proposes a novel method for short text classification using comprehensive feature weights. The method takes into account the distribution of samples over the positive and negative categories, as well as the category correlation of words, to improve the existing feature weight calculation and obtain a new comprehensive feature weight. The experimental results show that the proposed method scores significantly higher than other feature-weighting methods in micro- and macro-averaged values, which shows that it can greatly improve the accuracy and recall of short text classification.
Query Answering Approach Based on Document Summarization (IJMER)
The growth of online information has driven thorough research in automatic text summarization within the Natural Language Processing (NLP) community. The aim of this paper is to propose a novel, language-independent automatic summarization approach that combines three main approaches: Rhetorical Structure Theory (RST), query processing, and the Network Representation Approach (NRA). RST, as a theory of the major aspects of natural text structure, is used to extract the semantic relations behind the text. The query processing approach classifies the question type and finds the answer in a way that suits the user's needs. The NRA is used to create a graph representing the extracted semantic relations. The output is an answer that not only responds to the question, but also gives the user an opportunity to find additional information related to it. We implemented the proposed approach and, as a case study, applied it to Arabic text in the agriculture field, where it succeeded in summarizing extension documents according to the user's query. The results have been evaluated using Recall, Precision and F-score measures.
Proposed Method for String Transformation using Probabilistic Approach (Editor IJMTER)
This system takes a string as input and generates the k most likely output strings corresponding to it. It achieves both accuracy and efficiency by using a novel probabilistic approach to string transformation. The approach includes a log-linear model, a method for training the model, and an algorithm for generating the top k candidates, whether or not a predefined dictionary is available. The log-linear model is defined as a conditional probability distribution of an output string and a transformation rule set conditioned on the input string. The learning method employs maximum likelihood estimation for parameter estimation. The string generation algorithm, based on pruning, is guaranteed to generate the optimal top k candidates. The proposed method will be applied to the correction of spelling errors in queries as well as the reformulation of queries in web search.
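The generate-and-rank idea can be sketched with a hand-written rule set standing in for the learned model: each applicable rule produces a candidate, the rule's log-linear weight scores it, and the top k survive. The rules and weights below are hypothetical, and the real method searches over rule combinations with pruning rather than applying one rule at a time:

```python
import heapq

# hypothetical learned rules: (pattern, replacement, log-linear weight)
RULES = [("recieve", "receive", 2.3),
         ("teh", "the", 1.7),
         ("acheive", "achieve", 2.1)]

def top_k_candidates(s, rules, k=3):
    """Generate candidates by applying each matching rule once and
    rank them by score; the identity transformation scores 0."""
    cands = {s: 0.0}
    for pat, rep, w in rules:
        if pat in s:
            t = s.replace(pat, rep)
            if w > cands.get(t, float("-inf")):
                cands[t] = w
    return heapq.nlargest(k, cands.items(), key=lambda kv: kv[1])

print(top_k_candidates("i recieve teh mail", RULES, k=2))
```

In the paper's setting the weights come from maximum likelihood training, and a dictionary (when available) restricts which candidates are admissible.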
Clustering Algorithm for Gujarati Language (ijsrd.com)
Natural language processing is still an open research area, but it is now a platform for researchers worldwide. It includes analyzing a language based on its structure and then tagging each word appropriately with its grammatical base. Here we have a set of 50,000 tagged words and try to cluster these Gujarati words with an algorithm of our own design. Many clustering techniques are available, e.g. single linkage, complete linkage and average linkage. The number of clusters to be formed is not known in advance, so it depends on the type of data set provided. Clustering is a preprocessing step for stemming, the process of extracting the root from a word, e.g. cats = cat + s (cat: noun, plural form).
An apriori based algorithm to mine association rules with inter itemset distance (IJDKP)
Association rules discovered from transaction databases can be large in number, and reducing them is a current issue. Conventionally, the number of rules can be increased or decreased by varying support and confidence. Combining an additional constraint with support reduces the number of frequent itemsets, which leads to the generation of fewer rules. Average inter itemset distance (IID), or spread, the intervening separation of itemsets in the transactions, has been used as an interestingness measure for association rules with a view to reducing their number. In this paper, a complete apriori-based algorithm using average inter itemset distance is designed and implemented, with a view to reducing the number of frequent itemsets and association rules and to finding the distribution pattern of the rules in terms of the number of transactions of non-occurrence of the frequent itemsets. The apriori algorithm is also implemented and the results are compared. The theoretical concepts related to inter itemset distance are also put forward.
With the rapid development of Geographic Information Systems (GISs) and their applications, more and more geographical databases have been developed by different vendors. However, data integration and access are still big problems for the development of GIS applications, as no interoperability exists among different spatial databases. In this paper we propose a unified approach for spatial data query. The paper describes a framework for integrating information from repositories containing different vector data set formats and repositories containing raster datasets. The presented approach converts different vector data formats into a single unified format (File Geo-Database "GDB"). In addition, we employ metadata to support a wide range of user queries that retrieve relevant geographic information from heterogeneous and distributed repositories. This employment enhances both query processing and performance.
Review of Various Text Categorization Methods (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Feature selection, optimization and clustering strategies of text documents (IJECEIAES)
Clustering is one of the most researched areas of data mining applications in the contemporary literature. The need for efficient clustering is observed across wide sectors including consumer segmentation, categorization, shared filtering, document management, and indexing. Research on the clustering task must be carried out before it can be adapted to the text environment. Conventional approaches typically emphasized quantitative information, where the selected features are numbers; efforts have also been put forward to achieve efficient clustering of categorical information, where the selected features can assume nominal values. This manuscript presents an in-depth analysis of the challenges of clustering in the text environment. Further, this paper details prominent models proposed for clustering along with the pros and cons of each model. In addition, it focuses on the latest developments in the clustering task in social networks and associated environments.
A Novel Method for Keyword Retrieval using Weighted Standard Deviation: “D4 A... (idescitation)
The Genetic Algorithm (GA) has been a successful method for extracting keywords. This paper presents a full method by which keywords can be derived from various corpuses. We have built equations that exploit the structure of the documents from which the keywords need to be extracted. The procedure is broken into two distinct profiles: one weighs the words in the whole document content, and the other explores the possible occurrences of key terms using a genetic algorithm. The basic equations of the heuristic mechanism are varied to allow complete exploitation of the document. The genetic algorithm and the enhanced standard deviation method are used to full potential to generate the key terms that describe the given text document. The new technique has enhanced performance and better time complexity.
Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from machine learning, natural language processing (NLP), data mining, information retrieval (IR), and knowledge management. Text mining involves the pre-processing of document collections, such as information extraction, term extraction, text categorization, and the storage of intermediate representations. Techniques such as clustering, distribution analysis, association rules and visualisation are then used to analyse these intermediate representations.
Text preprocessing is a vital stage in text classification (TC) particularly and in text mining generally. Text preprocessing tools reduce the multiple forms of a word to a single form. Moreover, text preprocessing techniques have been given a lot of significance and are widely studied in machine learning. The basic phase in text classification involves preprocessing features and extracting relevant features against the features in a database, and this has a great impact on reducing the time and processing resources needed. The effect of preprocessing tools on English text classification is an active area of research. This paper provides an evaluation study of several preprocessing tools for English text classification, covering the raw text, tokenization, stop-word removal, and stemming. Two methods for feature extraction, chi-square and TF-IDF with a cosine similarity score, are used on the BBC English dataset. The experimental results show that text preprocessing affects the feature extraction methods and enhances the performance of English text classification, especially for small threshold values.
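The chi-square feature extraction step mentioned above scores each term/class pair from a 2x2 contingency table of document counts; a high score means the term's presence depends strongly on the class. A self-contained sketch of the standard statistic (the counts in the usage line are toy numbers):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score for a term/class pair from a 2x2 table:
    n11 = class docs containing the term, n10 = other docs containing
    it, n01 = class docs without it, n00 = other docs without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

# a term concentrated in one class scores higher than an even spread
print(chi_square(40, 5, 10, 45), chi_square(25, 25, 25, 25))
```

Feature selection then keeps, for each class, the terms with the highest scores and discards the rest before classification.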
Farthest Neighbor Approach for Finding Initial Centroids in K-Means (Waqas Tariq)
Text document clustering is gaining popularity in the knowledge discovery field for effectively navigating, browsing and organizing large amounts of textual information into a small number of meaningful clusters. Text mining is a semi-automated process of extracting knowledge from voluminous unstructured data, and clustering is a widely studied data mining problem in the text domain. Clustering is an unsupervised learning method that aims to find groups of similar objects in the data with respect to some predefined criterion. In this work we propose a variant method for finding initial centroids: instead of the traditional random choice used by partitioning-based clustering algorithms, the initial centroids are chosen using farthest neighbors. The accuracy of the clusters and the efficiency of partition-based clustering algorithms depend on the initial centroids chosen. In the experiments, the k-means algorithm is applied with initial centroids chosen by farthest neighbors. Our experimental results show that the accuracy of the clusters and the efficiency of the k-means algorithm are improved compared to the traditional way of choosing initial centroids.
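The farthest-neighbor initialization can be sketched as a farthest-first traversal: start from some point, then repeatedly add the point whose distance to its nearest already-chosen centroid is largest. The deterministic first pick below is an assumption; the paper may seed differently:

```python
def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def farthest_neighbor_centroids(points, k, dist):
    """Farthest-first traversal: each new centroid is the point
    farthest from its nearest already-chosen centroid."""
    centroids = [points[0]]          # deterministic first pick (assumption)
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids

# two tight groups plus one outlying point
pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20)]
print(farthest_neighbor_centroids(pts, 3, euclid))
```

Because the chosen seeds are mutually far apart, k-means started from them is less likely to place two centroids inside the same natural cluster than with random initialization.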
Clustering the results of a search helps the user to get an overview of the information returned. In this paper, we look upon the clustering task as cataloguing the search results. By catalogue we mean a structured label list that helps the user understand the labels and search results. Cluster labelling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for the query and losing extra time. Additionally, labels should reflect the contents of the documents within the cluster accurately. To label clusters effectively, a new cluster labelling method is introduced, with emphasis on producing comprehensible and accurate cluster labels in addition to discovering the document clusters. We also present a new metric for assessing the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods, Suffix Tree Clustering and Lingo, and perform the experiments using the publicly available datasets Ambient and ODP-239.
Machine learning for text document classification-efficient classification ap... (IAESIJAI)
Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach, and it improves text classification performance when combined with estimated values provided by conventional classifiers such as Multinomial Naive Bayes (MNB). Combining the similarity between a test document and a category with the estimated value for the category enhances the performance of the classifier, providing a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its document category are also obtained.
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Scaling Down Dimensions and Feature Extraction in Document Repository Classif... (ijdmtaiir)
In this study a comprehensive evaluation of two supervised feature selection methods for dimensionality reduction is performed: Latent Semantic Indexing (LSI) and Principal Component Analysis (PCA). These are gauged against unsupervised techniques like fuzzy feature clustering using fuzzy C-means (FCM). The main objective of the study is to estimate the relative efficiency of the two supervised techniques against unsupervised fuzzy techniques while reducing the feature space. It is found that clustering using FCM leads to better accuracy in classifying documents than algorithms like LSI and PCA. The results show that clustering of features improves the accuracy of document classification.
A template based algorithm for automatic summarization and dialogue managemen... (eSAT Journals)
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is to come up with a methodology that helps in extracting important events such as dates, places, and subjects of interest. It is also convenient if the methodology presents users with a shorter version of the text containing all non-trivial information. We also discuss our implementation of algorithms that perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
Indian Language Text Representation and Categorization Using Supervised Learn... (ijbuiiir1)
India is the home of different languages, due to its cultural and geographical diversity. The official and regional languages of India play an important role in communication among the people living in the country. In the Constitution of India, a provision is made for each Indian state to choose its own official language for communicating at the state level for official purposes; as of May 2008, the eighth schedule lists 22 official languages. The amount of textual data in various Indian regional languages available in electronic form is constantly increasing, so classifying text documents by language is essential. The objective of this work is the representation and categorization of Indian language text documents using text mining techniques. A South Indian language corpus covering Kannada, Tamil and Telugu has been created, and several text mining techniques such as the naive Bayes classifier, the k-Nearest-Neighbor classifier and decision trees have been used for text categorization. Not much work has been done on text categorization in Indian languages, which is challenging as Indian languages are very rich in morphology. In this paper an attempt is made to categorize Indian language text using text mining algorithms.
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen... (IJERA Editor)
Assigning documents to related categories is a critical task used for effective document retrieval. Automatic text classification is the process of assigning a new text document to one of the predefined categories based on its content. In this paper, we implemented and compared Naïve Bayes and centroid-based algorithms for effective document categorization of English text. In the centroid-based algorithm, we used the Arithmetical Average Centroid (AAC) and Cumuli Geometric Centroid (CGC) methods to calculate the centroid of each class. Experiments are performed on the R-52 dataset of the Reuters-21578 corpus, with the micro-averaged F1 measure used to evaluate classifier performance. Experimental results show that the micro-averaged F1 value for NB is the greatest of all, followed by that of CGC, which is greater than that of AAC. All these results are valuable for future research.
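The AAC side of the comparison is straightforward to sketch: the class centroid is the per-dimension mean of the class's document vectors, and a test vector goes to the centroid with the highest cosine similarity. The two-dimensional vectors are toy data, and CGC is omitted because its exact formula follows the paper:

```python
import math

def aac(vectors):
    """Arithmetical Average Centroid: per-dimension mean of the
    class's document vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.hypot(*a) * math.hypot(*b)
    return num / den if den else 0.0

def classify(doc, centroids):
    """Nearest-centroid decision by cosine similarity."""
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))

# toy 2-term vocabulary: counts of ("match", "minister")
centroids = {"sport": aac([[3, 0], [5, 1]]),
             "politics": aac([[0, 4], [1, 6]])}
print(classify([4, 1], centroids))
```

Centroid methods trade a little accuracy for speed: classification touches one vector per class instead of one per training document, which is why comparing them against Naïve Bayes on R-52 is interesting.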
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION (cscpconf)
The internet has caused a humongous growth in the amount of data available to the common
man. Summaries of documents can help find the right information and are particularly effective
when the document base is very large. Keywords are closely associated to a document as they
reflect the document's content and act as indexes for the given document. In this work, we
present a method to produce extractive summaries of documents in the Kannada language. The
algorithm extracts key words from pre-categorized Kannada documents collected from online
resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse
Document Frequency) methods along with TF (Term Frequency) for extracting key words and
later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
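The GSS coefficient used above for keyword extraction can be estimated from document counts: it is large when a term and a category occur together, and are absent together, much more often than they cross. A minimal sketch (the counts in the usage line are toy numbers):

```python
def gss_coefficient(n11, n10, n01, n00):
    """GSS (Galavotti-Sebastiani-Simi) coefficient for term t and
    category c, estimated from document counts as
    P(t,c)P(~t,~c) - P(t,~c)P(~t,c).  n11 = docs of c containing t,
    n10 = docs of other categories containing t, and so on."""
    n = n11 + n10 + n01 + n00
    return (n11 / n) * (n00 / n) - (n10 / n) * (n01 / n)

# a category-specific keyword beats a uniformly common word
print(gss_coefficient(30, 2, 5, 63), gss_coefficient(20, 20, 30, 30))
```

In the summarizer, terms scoring high on GSS (combined with TF and IDF) mark the sentences most worth extracting for the category's summary.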
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC... (cscpconf)
On-line text documents rapidly increase in size with the growth of the World Wide Web. To manage such a huge amount of text, several text mining applications have come into existence. Applications such as search engines, text categorization, summarization, and topic detection are based on feature extraction. Extracting keywords or features manually is extremely time consuming and difficult, so an automated process that extracts keywords or features needs to be established. This paper proposes a new domain keyword extraction technique that includes a new weighting method based on the conventional TF-IDF. Term frequency-inverse document frequency is widely used to express document feature weights, but it cannot reflect the distribution of terms within the documents, and hence cannot reflect their degree of significance and the differences between categories. This paper proposes a new weighting method in which a new weight is added to the original TF-IDF to express the differences between domains. The extracted features represent the content of the text better and discriminate better between categories.
Similar to A comparative study on term weighting methods for automated telugu text categorization with effective classifiers (20)
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
A comparative study on term weighting methods for automated telugu text categorization with effective classifiers
International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.3, No.6, November 2013
A COMPARATIVE STUDY ON TERM WEIGHTING METHODS FOR AUTOMATED TELUGU TEXT CATEGORIZATION WITH EFFECTIVE CLASSIFIERS
Vishnu Murthy G.1, Dr. B. Vishnu Vardhan2, K. Sarangam3 and P. Vijay Pal Reddy4

1 Department of Computer Science and Engineering, AGI, Hyderabad
2 Department of Computer Science and Engineering, JNTU, Jagityal
3 Department of Computer Science and Engineering, TEC, Hyderabad
4 Department of Computer Science and Engineering, MRCE, Hyderabad
ABSTRACT
Automatic text categorization refers to the process of automatically assigning one or more categories from a predefined set. Text categorization is challenging in Indian languages, which are rich in morphology and therefore have a large number of word forms and large feature spaces. This paper investigates the performance of different classification approaches using different term weighting approaches in order to decide the most applicable one for the Telugu text classification problem. We investigated different term weighting methods for a Telugu corpus in combination with Naive Bayes (NB), Support Vector Machine (SVM) and k-Nearest Neighbor (kNN) classifiers.
KEYWORDS
Term Weighting Methods, Text Categorization, Support Vector Machine, Naive Bayes, k-Nearest Neighbor, CHI-square.
1. INTRODUCTION
Nowadays, retrieving exactly the required information quickly from the large amount of available document data is a challenge. Information Retrieval (IR) is the activity of searching for information. Information can be retrieved from relational databases, documents, text, multimedia files, and the World Wide Web [9]. The applications of IR include extraction of information from large documents, searching in digital libraries, information filtering, spam filtering, object extraction from images, automatic summarization, document classification and clustering, and web searching.
DOI : 10.5121/ijdkp.2013.3606
Text Categorization (TC), also known as text classification, is the task of automatically sorting a set of documents into categories (topics, classes) from a predefined set. Automated text classification tools are available and are used to organize documents into categories.
There are two types of categorization: rule-based and machine learning [4]. In the rule-based approach, classification rules are framed manually and documents are classified based on those rules. In machine learning approaches, the classification functions are learned automatically from sample labeled documents.
TC has many applications, such as automatic indexing for Boolean information retrieval systems, text filtering, word sense disambiguation, hierarchical categorization of Web pages, identification of document genre, and authorship attribution [14].
Extensive research has not been conducted on Telugu corpora, since the Telugu language is morphologically rich and requires special treatment, such as handling the ordering of verbs, morphological analysis, etc. In Telugu morphology, words have affluent meanings and carry a great deal of grammatical and lexical information. Telugu text documents therefore require significant processing in order to build an accurate classification model. In this work, single-label binary categorization on labeled training data is carried out on Telugu text. So far, no comparisons have been made on Telugu language data collections between different term weighting methods combined with various classification algorithms.
The rest of the paper is organized as follows: the text categorization model, preprocessing with reference to the Telugu text corpus, the different term weighting approaches and the classification approaches are explained in Section 2. Section 3 describes the characteristics of the Telugu language. Section 4 deals with data collection as well as the experimentation. Section 5 presents the results analysis, and finally the conclusion with further research is given in Section 6.
2. TEXT CATEGORIZATION PROBLEM
Text categorization is the task of assigning test documents to predefined categories. Let D be a document domain and C = {c1, c2, ..., c|C|} a set of predefined categories. The task is, for each document dj ∈ D, to decide whether or not to assign dj to ci (ci ∈ C) by virtue of a function Φ, which is also called the classifier [13].
The TC problem can be modeled as shown in Figure 1. The proposed system has three main modules: text document preprocessing, classifier construction and performance evaluation. The document collection is divided into two sets: a training set and a test set. The training set is a pre-classified set of documents used for training the classifier, while the test set is used to determine the accuracy of the classifier by counting the correct and incorrect classifications for each input. The different phases in the model are explained below.
A. TOKENIZATION
A document is divided into small units called tokens; this process is called tokenization. It results in a set of atomic words with semantic meaning [11]. This phase outputs the article as a set of words by removing unnecessary symbols such as semicolons, colons, exclamation marks, hyphens, bullets, parentheses, numbers, etc.
B. STOP WORD REMOVAL
A stop list is a list of commonly repeated features, such as pronouns, conjunctions and prepositions, which appear in almost every text document. These features are removed because they have no effect on the categorization process. In addition, if a feature contains a special character or a number, it is removed. The stop word list is identified using the Natural Language Toolkit (NLTK) Telugu tagger. The Telugu tagger is trained on the telugu.pos tagged data from the Indian corpus that comes with NLTK; its accuracy is almost 98%.
C. STEMMING
Removing affixes (prefixes and/or suffixes) from features is known as stemming. It is used to reduce the number of features in the feature space and to improve the performance of the classifier, since the different forms of a feature are stemmed into a single feature. The stem forms of inflected words are identified using the Telugu Morphological Analyzer (TMA) developed by IIT, Hyderabad and the Central University of Hyderabad.
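As an illustration of the preprocessing phases above, the following Python sketch chains tokenization, stop word removal and stemming. The stop list and stem map here are invented placeholders for illustration only; the actual work relies on the NLTK Telugu tagger and the TMA morphological analyzer, which are not reproduced here.

```python
import re

# Hypothetical stop list and stem map standing in for the NLTK Telugu tagger
# and the TMA morphological analyzer described above (placeholders only).
STOP_WORDS = {"mariyu", "kani", "oka"}
STEM_MAP = {"aatalu": "aata", "aatala": "aata"}  # inflected form -> stem

def preprocess(text):
    # Tokenization: split on non-word characters, dropping empty pieces
    # and purely numeric tokens (symbols and numbers are removed).
    tokens = [t for t in re.split(r"\W+", text) if t and not t.isdigit()]
    # Stop word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming: map each inflected form to its stem where known.
    return [STEM_MAP.get(t, t) for t in tokens]
```

For example, `preprocess("aatalu mariyu aatala 2013")` reduces two inflected forms to a single stem while dropping the stop word and the number.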
D. VECTOR SPACE MODEL
The basic idea of the vector space model [2] is to represent a document in a computer-understandable form. The bag-of-words model is the representation followed in this paper.
In the vector space model, a text document is represented as a vector. Each dimension of the space corresponds to a single feature, and its weight is calculated by a weighting scheme. Hence, each document can be represented as d = (t1, w1; t2, w2; ...; tn, wn), where ti is a term and wi is the weight of ti in document d. Term weighting is used to reflect the importance of a term in a document. Different term weighting methods have been proposed in the TC literature. In this paper, we considered four term weighting approaches that have proved prominent in TC, applied to Telugu text categorization. They are defined as follows:
1) Term Frequency ( TF )
Using this method [10][11], each term is assumed to have a weight proportional to the number of times it occurs in a document:

W(d, t) = TF(d, t)
2) Term Frequency-Inverse Document Frequency (TF-IDF )
This approach follows Salton's definition [4][5], which combines TF and IDF to weight terms; the author showed that this combination gives better accuracy than IDF or TF alone. The combined weight is:

W(d, t) = TF(d, t) · IDF(t)

and, for a collection of N documents of which n contain the term t, IDF is given by:

IDF(t) = log(N / n)
3) Term Frequency-Chi square (TF.CHI)
The TF.CHI scheme [12] is included because it is a typical representation that combines the TF factor with a feature selection metric, namely CHI-square.
4) Term Frequency-Relevance Frequency (TF.RF)
This term weighting method is proposed in [17]. According to that work, it is the best term weighting approach for TC on English documents; hence it is considered here for Telugu TC. It is defined as follows:

TF.RF(t) = TF · log(2 + a / max(1, c))

where a is the number of documents of the positive category which contain the term, and c is the number of documents of the negative category which contain the term.
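The weighting schemes above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation; in particular, it assumes the base-2 logarithm of the original TF.RF proposal [17], and TF.CHI simply multiplies TF by the CHI-square score defined later in this section.

```python
import math

def tf_weight(tf):
    # Term Frequency: W(d, t) = TF(d, t).
    return tf

def tf_idf(tf, n_docs, df):
    # TF-IDF: W(d, t) = TF(d, t) * log(N / n), where n of the N documents
    # contain the term.
    return tf * math.log(n_docs / df)

def tf_rf(tf, a, c):
    # TF.RF = TF * log2(2 + a / max(1, c)); a (resp. c) is the number of
    # positive-category (resp. negative-category) documents containing the term.
    return tf * math.log2(2 + a / max(1, c))
```

Note that a term appearing in every document gets a TF-IDF weight of zero, while TF.RF still assigns it weight when it skews toward the positive category.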
E. DIMENSIONALITY REDUCTION
The feature space is too large even after removing non-informative features and stemming. Features which do not positively influence the TC process can be removed without affecting classifier performance; this is known as dimensionality reduction (DR). DR of the feature space is carried out by feature selection and feature extraction.
Feature selection covers several methods, such as document frequency, the DIA association factor, chi-square, information gain, mutual information, odds ratio, relevancy score and the GSS coefficient. These methods are applied to reduce the size of the full feature set. DR by feature extraction creates a small set of artificial features from the original set; such features can be computed using term clustering and latent semantic indexing.
In Indian languages, the number of features is even higher than in English text because of the richness of the morphology. We use the χ2 metric [1] for feature selection in this paper, as χ2 and information gain have been found to be the most effective feature selection metrics in the literature. CHI-square measures the correlation between a feature and a class.
For example, let A be the number of times feature t and class c co-occur, B the number of times feature t occurs without class c, C the number of times class c occurs without feature t, D the number of times neither occurs, and N the total number of training samples. The CHI-square (χ2) statistic can then be written as:

χ2(t, c) = N · (AD − CB)² / ((A + C)(B + D)(A + B)(C + D))
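The CHI-square statistic is a direct computation over the four contingency counts; a minimal sketch:

```python
def chi_square(A, B, C, D):
    # A: feature t and class c co-occur; B: t occurs without c;
    # C: c occurs without t; D: neither occurs. N is the sample total.
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    # Guard against empty rows/columns, where the statistic is undefined.
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0
```

An independent feature/class pair (AD = CB) scores zero, while a feature occurring only in the class scores highest.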
F. CLASSIFIERS
There are many classification approaches in the literature for text categorization, such as the Bayesian model, decision trees, support vector machines, neural networks and k-nearest neighbor. Support Vector Machines (SVM) perform better than other methods due to their ability to efficiently handle relatively high-dimensional and large-scale data sets without decreasing classification accuracy. k-Nearest Neighbor (kNN) makes predictions based on the k training documents which are closest to the test document. It is very simple and effective, but not efficient in the case of high-dimensional and large-scale data sets. The Naive Bayes (NB) method assumes that the terms in a document are independent, even though this is not the case in the real world. In this paper, we considered the SVM, kNN and NB classification approaches for Telugu text categorization. A brief description of these methods is given below:
1) Naïve Bayes Algorithm
Naive Bayes [8] is a supervised, probabilistic learning method. The probability of a document d belonging to class c is computed as:

P(c | d) ∝ P(c) · Π(i = 1..n) P(wi | c)

where P(wi | c) is the conditional probability of term wi occurring in a document of class c. We interpret P(wi | c) as a measure of how much evidence wi contributes that c is the correct class. (w1, w2, ..., wn) are the tokens in the document d that are part of the vocabulary used for classification, and n is the number of such tokens in d. In text classification, the best class under Naive Bayes classification is the most likely, or maximum a posteriori (MAP), class:

cmap = argmax(c ∈ C) P(c) · Π(i = 1..n) P(wi | c)
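The MAP decision rule above can be sketched as a minimal multinomial Naive Bayes in Python. Laplace smoothing is an assumption here (the paper does not state its smoothing method), and log-probabilities are used for numerical stability.

```python
import math
from collections import Counter

def train_nb(docs):
    # docs: list of (tokens, class_label) training pairs.
    classes = {c for _, c in docs}
    vocab = {t for toks, _ in docs for t in toks}
    prior, cond = {}, {}
    for c in classes:
        class_docs = [toks for toks, lab in docs if lab == c]
        prior[c] = len(class_docs) / len(docs)          # P(c)
        counts = Counter(t for toks in class_docs for t in toks)
        total = sum(counts.values())
        # Laplace-smoothed P(w | c) for every vocabulary term.
        cond[c] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return prior, cond

def classify_nb(prior, cond, tokens):
    # MAP class: argmax_c  log P(c) + sum_i log P(w_i | c);
    # out-of-vocabulary tokens are ignored.
    def score(c):
        return math.log(prior[c]) + sum(
            math.log(cond[c][t]) for t in tokens if t in cond[c])
    return max(prior, key=score)
```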
2) KNN Algorithm
k-Nearest Neighbor (kNN), also known as Text-to-Text Comparison (TTC), is a statistical approach which has been successfully applied to the TC problem [13] and has shown promising results. Given a test document to be classified, the algorithm searches for the k nearest neighbors among the pre-classified training documents based on some similarity measure, and ranks those k neighbors by their similarity scores. The categories of the k nearest neighbors are used to predict the category of the test document, using the ranked score of each neighbor as the weight of its candidate category. If more than one neighbor belongs to the same category, the sum of their scores is used as the weight of that category. The category with the highest score is assigned to the test document, provided that it exceeds a predefined threshold; more than one category can be assigned to the test document.
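The neighbor-ranking and score-summing procedure above can be sketched as follows, using the cosine similarity measure adopted later in the experiments (a sketch for single-label assignment without the threshold step):

```python
import math
from collections import defaultdict

def cosine(u, v):
    # Cosine similarity between two sparse vectors (dicts of term -> weight).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(train, test_vec, k):
    # train: list of (vector, category). Take the k most similar training
    # documents and sum similarity scores per category, as described above.
    neighbours = sorted(train, key=lambda p: cosine(p[0], test_vec),
                        reverse=True)[:k]
    scores = defaultdict(float)
    for vec, cat in neighbours:
        scores[cat] += cosine(vec, test_vec)
    return max(scores, key=scores.get)
```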
3) Support Vector Machines (SVM)
In general, SVM [7] is a linear learning system that builds two-class classifiers. Let the set of training examples D be {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi = (xi1, xi2, ..., xir) is an r-dimensional input vector in a real-valued space and yi is its class label (output value), with yi ∈ {1, −1}; 1 denotes the positive class and −1 the negative class. To build a classifier, SVM finds a linear function of the form:

f(x) = ⟨w · x⟩ + b

so that an input vector xi is assigned to the positive class if f(xi) ≥ 0, and to the negative class otherwise. Here f(x) is a real-valued function, w = (w1, w2, ..., wr) is the weight vector, b is the bias, and ⟨w · x⟩ is the dot product of w and x. Without vector notation, the equation can be written as:

f(x1, x2, ..., xr) = w1x1 + w2x2 + ... + wrxr + b
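The linear decision function is straightforward to sketch; learning w and b is delegated to a trainer such as SVMlight in the experiments below, so only the prediction step is shown here:

```python
def svm_predict(w, b, x):
    # f(x) = <w . x> + b; assign the positive class (+1) when f(x) >= 0,
    # otherwise the negative class (-1).
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f >= 0 else -1
```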
3. TELUGU LANGUAGE CHARACTERISTICS
More than 150 different languages are spoken in India today, and many of them have not yet been studied in any great detail in terms of text categorization. 22 major languages have been given constitutional recognition by the government of India.
Indian languages are characterized by rich morphology and a productive system of derivation. This means that the number of surface words is very large, and so is the raw feature space, leading to data sparsity. Dravidian morphology is particularly complex: Dravidian languages such as Telugu and Kannada are morphologically among the most complex languages in the world, comparable only to languages like Finnish and Turkish. The main reason
for the richness in morphology of Telugu (and other Dravidian languages) is that a significant part of the grammar that is handled by syntax in English (and similar languages) is handled within morphology. Small phrases in English map onto a single term in Telugu. Hence there is a need to study the influence of term weighting methods on different classification approaches in the Indian context.
4. EMPIRICAL EVALUATION
1) Test Collections
The dataset was gathered from Telugu newspapers such as Eenadu, Andhra Prabha and Sakshi on the web during the years 2009–2010. The corpus was collected from the website http://uni.medhas.org/ in Unicode format. We obtained around 800 news articles from the domains of economics, politics, science, sports, culture and health. Before proceeding, we perform preprocessing (tokenization, stop word removal and stemming), and choose 60% of the documents as training samples and the remaining 40% as testing samples for all six categories. We then use the CHI-square feature selection method to select 100 features, and conduct the experiments using the TF, TF-IDF, TF-CHI and TF-RF weighting methods separately with the Naive Bayes, kNN and SVM classifiers. After the experiments, we compare the results of the different weighting methods with the three classifiers.
2) Evaluation Methods
In order to compare the results of all possible combinations of term weighting methods and classifiers, we computed precision, recall, the F1 measure and the macro-averaged F1 measure. Precision is the proportion of examples labeled positive by the system that are truly positive, and recall is the proportion of truly positive examples that are labeled positive by the system. F1 is computed as:

F1 = (2 · Precision · Recall) / (Precision + Recall)

where

Precision = X / (X + Y)
Recall = X / (X + Z)

and X is the number of relevant documents retrieved, Y the number of irrelevant documents retrieved, and Z the number of relevant documents not retrieved. The F-measure is first calculated locally over each category, and then the average over all categories is taken. The macro-averaged F-measure is obtained by averaging the F-measure values over the categories:

F(macro-averaged) = (Σ(i = 1..M) Fi) / M

where M is the total number of categories. The macro-averaged F-measure assigns equal weight to each category, irrespective of its frequency.
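The evaluation measures above can be sketched directly from the counts X, Y and Z per category:

```python
def f1(x, y, z):
    # x: relevant documents retrieved, y: irrelevant documents retrieved,
    # z: relevant documents not retrieved.
    precision = x / (x + y)
    recall = x / (x + z)
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_category):
    # per_category: list of (x, y, z) tuples, one per category. Each category
    # contributes equally, irrespective of its frequency.
    scores = [f1(x, y, z) for x, y, z in per_category]
    return sum(scores) / len(scores)
```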
We used SVMlight, the soft-margin linear SVM tool developed by T. Joachims, for SVM classification; for the kNN classifier, k values of 5, 10, 15 and 20 were used. In the kNN algorithm, we used the cosine similarity measure to find the distance between a training document and a test document. The corpus details are shown in Table 1, and the experimental results are shown in Tables 2, 3 and 4, which give the F1 and macro-averaged F1 results of the NB, kNN and SVM classifiers, respectively, for the six categories.
Table 1: Corpus statistics

Category     No. of training documents   No. of testing documents   Total no. of documents
Economics    60                          40                         100
Politics     120                         80                         200
Science      90                          60                         150
Sports       75                          48                         123
Culture      54                          36                         90
Health       85                          50                         135
Table 2: F1 and macro-averaged F1 results of the NB classifier for six categories

Category            TF      TF-IDF   TF-CHI   TF-RF
Economics           0.712   0.729    0.708    0.740
Politics            0.783   0.780    0.759    0.798
Science             0.719   0.740    0.698    0.731
Sports              0.859   0.867    0.853    0.875
Culture             0.814   0.829    0.795    0.824
Health              0.875   0.860    0.867    0.895
F(macro-averaged)   0.794   0.801    0.780    0.810
Table 3: F1 and macro-averaged F1 results of the kNN classifier for six categories

Category            TF      TF-IDF   TF-CHI   TF-RF
Economics           0.719   0.723    0.694    0.731
Politics            0.796   0.809    0.799    0.816
Science             0.749   0.761    0.730    0.753
Sports              0.902   0.874    0.869    0.896
Culture             0.851   0.854    0.844    0.861
Health              0.885   0.891    0.883    0.907
F(macro-averaged)   0.817   0.819    0.803    0.828
Table 4: F1 and macro-averaged F1 results of the SVM classifier for six categories

Category            TF      TF-IDF   TF-CHI   TF-RF
Economics           0.726   0.733    0.682    0.764
Politics            0.820   0.828    0.812    0.851
Science             0.709   0.751    0.711    0.747
Sports              0.874   0.890    0.889    0.915
Culture             0.848   0.843    0.829    0.857
Health              0.917   0.909    0.890    0.932
F(macro-averaged)   0.816   0.826    0.802    0.844
5. RESULTS AND DISCUSSION
After analyzing the results, we find that the SVM categorizer outperformed NB and kNN on the six data sets with regard to F1 and macro-averaged F1 results. TF-RF performs significantly better for all category distributions; the best macro-averaged F1 is achieved using the TF-RF scheme. From the results it is observed that the relevance frequency scheme does improve the term's discriminating power for text categorization, and that IDF adds discriminating power to TF when the two are combined. The TF-CHI method performed worse than TF and TF-IDF in most of the categories for all classifiers. Moreover, for the Telugu data sets, the SVM classifier has a 1.0% to 2.8% higher macro-averaged F1 than NB and kNN. Another notable
result is that all classifiers vary in performance among categories. For example, the "Health" category has a high F1 of 93.2%, while the "Science" category has a noticeably poorer F1 of 74.7% for SVM. These poor results indicate that the "Science" category heavily overlaps with other categories.
6. CONCLUSIONS AND FUTURE SCOPE
The macro average F1 of four term weighting measures obtained against six Telugu category sets
indicated that the SVM algorithm dominant NB and KNN algorithms. Finally, SVM and KNN
classifiers perform excellent in most of the categories.
TF-RF scheme shown good performance compared with other three variants of term frequency.
The CHI-square as a factor do not improve the term’s discriminating power for text categorization.
With this emperical analysis we are planning to use TF-RF as the term weighing scheme for
further research on Telugu Text categorization. Also, planning to propose a hybrid approach, a
combination two or more classifiers to increase the accuracy of the text classification process on
Telugu documents.
ACKNOWLEDGEMENTS
The authors would like to thank everyone, just everyone!
REFERENCES
[1] Yang, Y. and Pedersen, J. O., "A comparative study on feature selection in text categorization", Proceedings of the 14th International Conference on Machine Learning (ICML), 1997, pp. 2-3.
[2] Salton, G., Wong, A. and Yang, C. S., "A Vector Space Model for Automatic Indexing", CACM 18(11), 1975.
[3] Sebastiani, F., "Text Categorization", in Zanasi, A. (ed.), Text Mining and its Applications, WIT Press, Southampton, UK, 2005, pp. 110-120.
[4] Salton, G., Wong, A. and Yang, C. S., "A vector space model for automated indexing", Communications of the ACM, 1975, pp. 1-8.
[5] Salton, G. and McGill, M., An Introduction to Modern Information Retrieval, McGraw Hill, 1983.
[6] Yang, Y. and Pedersen, J. O., "A Comparative Study on Feature Selection in Text Categorization", ICML-97, 1997, pp. 412-420.
[7] Tam, V., Santoso, A. and Setiono, R., "A comparative study of centroid-based, neighborhood-based and statistical approaches for effective document categorization", Proceedings of the 16th International Conference on Pattern Recognition (ICPR'02), vol. 4, 2002, pp. 235-238.
[8] Joachims, T., "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", ECML-98, 1998, pp. 137-142.
[9] Rish, I., "An Empirical Study of the Naive Bayes Classifier", Proc. of the IJCAI-01 Workshop on Empirical Methods in Artificial Intelligence, Oct 2001.
[10] Doyle, L. and Becker, J., Information Retrieval and Processing, Melville, 1975.
[11] Luhn, H. P., "A Statistical Approach to Mechanized Encoding and Searching of Literary Information", IBM J. Res. Develop., 1957.
[12] Manning, C., Raghavan, P. and Schütze, H., Introduction to Information Retrieval, Cambridge University Press, 2008.
[13] Feldman, R. and Sanger, J., The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2007.
Robertson, S., "Understanding inverse document frequency: on theoretical arguments for IDF", Journal of Documentation, vol. 60, pp. 503-520, 2004.
[14] Sebastiani, F., "Machine learning in automated text categorization", ACM Computing Surveys, 34(1), 2002.
[15] Sebastiani, F., "A Tutorial on Automated Text Categorisation", in Amandi, A. and Zunino, A. (eds.), Proceedings of the 1st Argentinian Symposium on Artificial Intelligence (ASAI'99), 1999, pp. 7-35.
[16] Sebastiani, F., Sperduti, A. and Valdambrini, N., "An improved boosting algorithm and its application to automated text categorization", Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management (McLean, US, 2000), pp. 78-85.
[17] Lan, M., Tan, C. L., Low, H. B. and Sung, S. Y., "A comprehensive comparative study on term weighting schemes for text categorization with support vector machines", Proceedings of the 14th International World Wide Web Conference (WWW2005), pp. 1032-1033, May 2005, Chiba, Japan. ISBN: 1-59593-051-5.